Corpus
The Arabic Corpus is composed of Arabic texts for text categorization. The Khaleej-2004 corpus contains 5690 documents divided into 4 topics (categories). The Watan-2004 corpus contains 20291 documents organized in 6 topics (categories).
Platforms: *nix
License: Freeware | Size: 13.76 MB | Download (32): Arabic Corpus Download |
Bitextor is an application created to generate translation memories using multilingual websites as a corpus source. It downloads an entire website and applies a set of heuristics (based mainly on HTML tag structure and text block length) to find bitexts.
Platforms: *nix
License: Freeware | Size: 204.8 KB | Download (35): Bitextor Download |
CorpusSearch is a tool that finds syntactic structures in a corpus of annotated sentence trees. It can be used as a research tool on a corpus, or as a development tool for building the corpus. CorpusSearch 2 is a Java program that supports research in corpus linguistics. It is useful both for...
Platforms: *nix
License: Freeware | Size: 2.92 MB | Download (36): CorpusSearch for Linux Download |
Uplug is a collection of tools for linguistic corpus processing, word alignment, and term extraction from parallel corpora. Several tools have been integrated in Uplug. Pre-processing tools include a sentence splitter, tokenizer, and external part-of-speech tagger and shallow parsers. The...
Platforms: *nix
License: Freeware | Size: 21.9 MB | Download (108): Uplug Download |
An open-source corpus analysis class library written in C#. The GUI of Tenka Text 0.1.3 comes with Wordlister, an advanced, extremely fast graphical wordlist tool, and a simple regex concordance tool. Tenka Text: the open-source answer to WordSmith Tools.
Platforms: Windows, Mac, BSD, Solaris, Linux
License: Freeware | Size: 707.74 KB | Download (51): Corsis (formerly Tenka Text) Download |
Emdros is a corpus query system for storage and retrieval of linguistic analyses of text. It is especially applicable in corpus linguistics dealing with syntax, morphology, phonology, and/or discourse. It is also a generally useful text database engine.
Platforms: Windows, Mac, Solaris, Linux
License: Freeware | Size: 8.33 MB | Download (48): Emdros Download |
PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files. Supported file formats are Kura XML, Elan XML and Toolbox files. A Corpus Reader API is provided to support statistical analysis within the NLTK.
Platforms: Windows, Mac, Linux
License: Freeware | Size: 45.38 KB | Download (46): PyAnnotation Download |
TXM is a free and open-source cross-platform Unicode & XML based text/corpus analysis environment and graphical client, supporting Windows, Linux and Mac OS X. It can also be used online as a J2EE standard compliant web portal (GWT based) with access control built in. It offers a comprehensive...
Platforms: Windows, Mac, Linux
License: Freeware | Size: 2.46 MB | Download (46): TXM Download |
This module extends nodereference fields with a filter-based search engine that fills them automatically, using the filter features of the Solace API as a backend. This means you can attach a SolR filter instance to node reference fields. Then, any node owner can enable and configure a...
Platforms: PHP
License: Freeware | Size: 30.72 KB | Download (46): Solace Node Reference Download |
Cunei is a data-driven machine translation system that builds dynamic, statistical models based on instances of known translations found in a corpus.
Platforms: *nix
License: Freeware | Size: 174.08 KB | Download (38): Cunei Machine Translation Platform Download |
A meta package manager for deploying projects on UNIX systems, sponsored by Makina Corpus. FEATURES: * Auto-update system: when minimerge is upgraded (easy_install -U), the infrastructure is now in place to run update callbacks. * Minibuilds now have revisions, which can facilitate their reinstallation as...
Platforms: *nix
License: Freeware | Size: 133.12 KB | Download (32): minitage.core Download |
PasteScripts to facilitate the use of minitage and the creation of minitage-based projects, sponsored by Makina Corpus. Project templates: * minitage.zope3: a sample layout for a Zope 3 application * minitage.plone25: a sample layout for a Plone 2.5 application * minitage.plone3: a sample layout for a...
Platforms: *nix
License: Freeware | Size: 634.88 KB | Download (40): minitage.paste Download |
MaxEntropy is a Perl5 module for Maximum Entropy Modeling and Feature Induction. SYNOPSIS use Statistics::MaxEntropy; # debugging messages; default 0 $Statistics::MaxEntropy::debug = 0; # maximum number of iterations for IIS; default 100 $Statistics::MaxEntropy::NEWTON_max_it = 100; #...
Platforms: *nix
License: Freeware | Size: 41.98 KB | Download (100): Statistics::MaxEntropy Download |
DadaDodo is a program that generates random sentences based on input files. Sometimes these sentences are nonsense; but sometimes they cut right through to the heart of the matter, and reveal hidden meanings. DadaDodo works rather differently than Dissociated Press; whereas...
Platforms: *nix
License: Freeware | Size: 22.53 KB | Download (106): DadaDodo Download |
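The description above is truncated, so how DadaDodo actually builds its sentences is not shown here. One common technique for this kind of generator is a word-level Markov chain, and the sketch below illustrates that technique only; it is not DadaDodo's code, and the names `build_chain` and `generate` are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=12, rng=None):
    """Walk the chain from `start`, picking a random successor at each step."""
    rng = rng or random.Random()
    word = start
    output = [word]
    for _ in range(max_words - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: no word ever followed this one in the input
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the cat"
    print(generate(build_chain(sample), "the"))
```

Because successors are stored with repetition, a word that follows another more often in the input is proportionally more likely to be chosen.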
Knorpora is a modified version of the Knoppix 3.3 Live CD for students of corpus-based computational linguistics. Like Knoppix, the Knorpora CD allows you to run a fully operational Debian/Linux operating system from the CD-ROM drive, without installing anything on the computer. The Knorpora...
Platforms: *nix
License: Freeware | Size: 676.4 MB | Download (91): Knorpora Download |
Understanding computer networks without performing practical experiments is very difficult, if not almost impossible. Unfortunately, setting up a networking lab can be very expensive. Netkit has been conceived as an environment for setting up and performing networking experiments at...
Platforms: *nix
License: Freeware | Size: 778.24 KB | Download (137): Netkit 4 Download |
Search::Lemur is a Perl class to query a Lemur server, and parse the results. SYNOPSIS use Search::Lemur; my $lem = Search::Lemur->new("http://url/to/lemur.cgi"); # run some queries, and get back an array of results # a query with a single term: my @results1 = $lem->query("encryption");...
Platforms: *nix
License: Freeware | Size: 8.19 KB | Download (89): Search::Lemur Download |
Search::FreeText is a free text indexing module for medium-to-large text corpora. SYNOPSIS my $test = new Search::FreeText(-db => [DB_File, "stories.db"]); $test->open_index(); $test->clear_index(); $test->index_document(1, "Hello world"); $test->index_document(2, "World in motion");...
Platforms: *nix
License: Freeware | Size: 10.24 KB | Download (95): Search::FreeText Download |
TextSearch is a program that helps you search through a set of text files which are in a hierarchical structure, i.e. a directory structure. Each document is searched using a regular expression and an overview of the results is shown as a tree structure. By clicking on a file, it can be viewed,...
Platforms: *nix
License: Freeware | Size: 15.36 KB | Download (96): TextSearch Download |
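The approach TextSearch describes (walk a directory tree, test each file against a regular expression, and summarize the hits per file) can be sketched in a few lines. This is a minimal illustration under assumptions, not TextSearch's own code: the function names `search_tree` and `format_tree` are hypothetical, files are assumed to be UTF-8 text, and the overview is printed as indented text rather than shown in a clickable GUI tree.

```python
import os
import re


def search_tree(root, pattern):
    """Return {relative_path: [(line_number, line_text), ...]} for matching files."""
    regex = re.compile(pattern)
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on non-UTF-8 bytes
                with open(path, encoding="utf-8", errors="replace") as handle:
                    matches = [
                        (number, line.rstrip("\n"))
                        for number, line in enumerate(handle, start=1)
                        if regex.search(line)
                    ]
            except OSError:
                continue  # unreadable file: skip it
            if matches:
                hits[os.path.relpath(path, root)] = matches
    return hits


def format_tree(hits):
    """Render the hits as an indented overview, one entry per file."""
    lines = []
    for rel_path in sorted(hits):
        lines.append(rel_path)
        for number, text in hits[rel_path]:
            lines.append(f"    {number}: {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Search the current directory for lines mentioning "corpus" or "corpora".
    print(format_tree(search_tree(".", r"corp(us|ora)")))
```

Files without any match are left out of the result, so the overview lists only the parts of the tree worth opening.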
The mime4j project provides a parser, MimeStreamParser, for e-mail message streams in plain RFC 822 and MIME format. The parser uses a callback mechanism to report parsing events such as the start of an entity header, the start of a body, etc. If you are familiar with the SAX XML parser interface you...
Platforms: *nix
License: Freeware | Download (96): mime4j Download |