Corpora
Poliqarp is designed to be a universal suite of utilities for processing large corpora.
You can use this accessible tool to create corpora of texts written in almost any language in its native script - be it English, Polish, Japanese or Thai - as long as they are encoded in the UTF-8...
Platforms: Linux
License: Freeware | Size: 1.6 MB | Download (45): Poliqarp for Linux Download |
JBootCat is a Java implementation of the BootCat scripts written by Marco Baroni et al. for generating corpora from the Internet. JBootCat's main goal is to encapsulate the BootCat functionality within a user-friendly desktop application. The advantage of using the Java platform is that JBootCat...
Platforms: *nix
License: Freeware | Size: 1013.76 KB | Download (93): JBootCat Download |
Poliqarp is a utility for searching large corpora.
Platforms: *nix
License: Freeware | Size: 798.72 KB | Download (88): Poliqarp Download |
Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora. Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, tokenizer, and external part-of-speech tagger and shallow parsers. The...
Platforms: *nix
License: Freeware | Size: 21.9 MB | Download (108): Uplug Download |
CorpusFiltergraph is a framework installed on every edition of DoMY that empowers users with "Graphs" of "Plug-ins".
CorpusFiltergraph allows you to extract, filter, align and transform text data from multilingual documents into parallel training corpora. The application has already transformed...
Platforms: Windows
License: Freeware | Download (44): CorpusFiltergraph Download |
The NITE XML Toolkit supports the creation, analysis, and browsing of annotated multimodal, text, or spoken language corpora, and represents both timing and rich linguistic structure. It contains libraries for developers and some end user tools.
Platforms: Windows, Mac, Linux
License: Freeware | Size: 38.79 MB | Download (47): The NITE XML Toolkit Download |
ABNER is a software tool for molecular biology text analysis. It began as a user-friendly interface for a system developed as part of the NLPBA/BioNLP 2004 Shared Task challenge. The details of that system are described in the paper below (Settles, 2004). At ABNER's core is a statistical machine...
Platforms: Mac
License: Shareware | Cost: $0.00 USD | Size: 9.5 MB | Download (37): ABNER Download |
TTC Term Suite is an open-source, UIMA-based application that grew out of the European project TTC (Terminology Extraction, Translation Tools and Comparable Corpora). The project aims at leveraging machine translation, computer-assisted translation and multilingual content management tools by...
Platforms: Mac
License: Freeware | Size: 4.68 MB | Download (38): TTC Term Suite Download |
Stanford NER (also known as CRFClassifier) is a Java implementation of a Named Entity Recognizer. Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names. The software provides a general...
Platforms: Mac
License: Shareware | Cost: $0.00 USD | Size: 59.59 MB | Download (35): Stanford Named Entity Recognizer Download |
DeSR is a multilingual statistical dependency parser. It produces dependency parse trees for natural language sentences using a parsing model learned from annotated corpora.
Platforms: *nix
License: Freeware | Size: 3.26 MB | Download (36): DeSR Download |
Iracema is a Named Entity Recognition and Classification (NERC) library that aims to provide algorithms and commonly used functionality for both implementing and evaluating NERC systems. It is implemented in Java. Iracema features: * A Flexible architecture for implementing and evaluating NERC...
Platforms: Mac
License: Freeware | Size: 41.91 MB | Download (41): iracema Download |
Knowtator is a general-purpose text annotation tool that is integrated with the Protégé knowledge representation system. Knowtator facilitates the manual creation of training and evaluation corpora for a variety of biomedical language processing tasks. Building on the strengths of the...
Platforms: Mac
License: Freeware | Size: 1.45 MB | Download (37): Knowtator Download |
Emdros is an Open-Source text database engine for storage and retrieval of analyzed or annotated text. Emdros has a powerful query-language for asking relevant questions of the data. Emdros has wide applicability in fields that deal with analyzed or annotated text. Application domains include...
Platforms: *nix
License: Freeware | Size: 8.33 MB | Download (47): Emdros for linux Download |
libleipzig-python provides a wrapper to the web services provided by the Deutscher Wortschatz project of the University of Leipzig. Deutscher Wortschatz is a German database of text corpora and can be utilized to analyze and contextualize words in the thesaurus. libleipzig currently supports all...
Platforms: *nix
License: Freeware | Size: 10.24 KB | Download (38): libleipzig Download |
CorpusSearch is a tool that finds syntactic structures in a corpus of annotated sentence trees. It can be used as a research tool on a corpus, or as a development tool for building the corpus. CorpusSearch 2 is a Java program that supports research in corpus linguistics. It is useful both for...
Platforms: *nix
License: Freeware | Size: 2.92 MB | Download (36): CorpusSearch for Linux Download |