
text-hr 0.17

Company: Robert Lujo

Date Added: December 03, 2013 | Visits: 695



text-hr is a morphological/inflection engine for the Croatian language, written in Python. It includes stopwords and a part-of-speech (POS) tagging engine that uses an inverse inflection algorithm for detection. Since the API is not yet frozen, this project is still in alpha.

TAGS

Croatian language, python, natural language processing (NLP), Part-of-speech (POS) tagging, stopwords, inverse inflection, morphological lexicon

FEATURES

The most important are:

* inflection system - for producing all forms of one word
* detection of word types (POS tagging) - from existing list of word forms
* list of stopwords

The system is based on unicode strings; the default codepage for converting from and to strings is cp-1250. Check Getting started.

INSTALLATION

If you have the pip package installed (http://pypi.python.org/pypi/pip):

    pip install text-hr

If not, then the old-fashioned way:

* download zip from http://pypi.python.org/pypi/text-hr/
* unzip
* open shell
* go to distribution directory
* python setup.py install

GETTING STARTED

There are three important parts that this project provides:

* Inflection system - for producing all forms of one word
* Detection of word types (POS tagging) - from existing list of word forms
* List of stopwords

Inflection system

Usage example - start python shell:

    > python
    >>> from text_hr.verbs import Verb
    >>> v = Verb("platiti")
    >>> for k in sorted(v.forms.keys()):
    ...     print k, v.forms[k]
    ...
    AOR/P/1 [u'platismo']
    AOR/P/2 [u'platiste']
    AOR/P/3 [u'plati\u0161e']
    AOR/S/1 [u'platih']
    AOR/S/2 [u'plati']
    AOR/S/3 [u'plati']
    IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
    IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
    IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
    ...
    VA_PA//P_O+S+V+N [u'pla\u0107eno']
    X_INF// [u'platiti']
    X_VAD_PAS// [u'plativ\u0161i']
    X_VAD_PRE// [u'plate\u0107i']

Detection of word types (POS tagging)

TODO: to be done - check test_detect.txt for samples, and detect.py for the logic. First example in test_detect.txt:

    >>> from text_hr.detect import WordTypeRecognizerExample

    >>> def test_it(word_list, word_types_filter=None, level=2):
    ...     wdh = WordTypeRecognizerExample(word_list, silent=True)
    ...     if word_types_filter is not None:
    ...         wdh.detect(word_types_filter=word_types_filter, level=level) # e.g. word_types_filter=["N"]
    ...     else:
    ...         wdh.detect(level=level) # all word types
    ...     lines_file = LinesFile()
    ...     wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
    ...     print "\n".join(lines_file.lines)
    ...     return wdh

    >>> class LinesFile(object):
    ...     def __init__(self):
    ...         self.lines = []
    ...     def write(self, s):
    ...         self.lines.append(repr(s.rstrip()))

    >>> word_list = [
    ...   "Broj 84"
    ... , "broji 34"
    ... , "Brojila 28"
    ... , "broje 23"
    ... , "broje\xe6i 22"
    ... , "brojim 7"
    ... , "brojimo 5"
    ... , "broji\x9a 4"
    ... , "brojahu 2"
    ... , "broja\x9ae 1"
    ... , "brojite 1"
    ... , "-brijestovu 1"
    ... , "brijestovi 1" # the only one checked with endswith, but all others will be checked with get_freq
    ... , "-brijestove 1"
    ... , "-brijestova 1"
    ... ]

Lowest quality, but fastest:

    >>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS
    " 10/ 183 -> brojati (u'V-XX_-_JATI-je\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"

List of stopwords

TODO: to be simplified and explained in detail. This is not tested.
Something like:

    from text_hr import word_types
    word_types_list = None
    for wordobj, l_key, cnt, _suff_id, wform_key, wform in word_types.get_all_std_words(word_types_list):
        if not (wordobj==wordobj_old and l_key==l_key_old):
            wordobj_data["value_base"] = wordobj
            l_key_flds = l_key.split("#")
            # wordobj   l_key             wform_key    form
            # ondje     FX#ADV#MJE.GDJE                ''
            # one       CH#PRON.OSO#      #P/3F#|A#1   'njih'
            assert len(l_key_flds)==3, l_key_flds
            is_changeable = (l_key_flds[0]=="CH")
            print "word_type", l_key_flds[1]
            print "subtype", l_key_flds[2]
            assert wordobj_obj
            # TODO:
            # if wform:
            #     raise NotImplementedError("now wordforms don't hold wf/key, but wf/cnt - it is reduced. Here this is not implemented!!!")

Further

Since there is currently no good documentation, the best source of further information is the tests inside the modules and the tests in the tests directory (dev version). More information in Running tests. And you can always read the source.

DOCUMENTATION

Sorry, but currently there is no good documentation. In progress ...

SUPPORT

Since this project is limited by my free time, support will be limited.

REPORT BUG OR REQUEST FEATURE

If you encounter a bug, the best way is to report it on the bitbucket web page http://bitbucket.org/trebor74hr/text-hr.

If there is interest in development for other inflection-rich languages, I'd be glad to decouple the language-specific code and create a new project capable of dealing with multiple languages.

The best way to contact me is by mail (find in LICENCE).

TODO list is in readme.txt (dev version).

CONTRIBUTION

Since this project is not yet in the stable API phase, contribution should wait for a while.

RUNNING TESTS

All tests are doctests (not unittests). There are three types of tests in the package:

1. doctests in each module - e.g. in verbs.py
2. doctests in tests/test_*.txt - only development version
3. tests which are not automatically compared - i.e. in special call mode detect.py can produce an output file which needs to be compared manually with some existing file. Such test(s) are very slow. This needs to be changed to be automatic.

Running each module directly will run 1. and 2. if running from development version.

To get development version

To use development version (http://bitbucket.org/trebor74hr/text-hr):

    hg clone https://trebor74hr@bitbucket.org/trebor74hr/text-hr

Create text_hr.pth in the python site-packages directory with the path to text-hr, e.g.:

    r:\hg-clones\python\text-hr

To run all tests:

* go to tests directory
* run tests.py like (with sample output):

    > python tests.py
    testing module __init__
    testing module adjectives
    ...
    testing module word_types
    testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
    ...
    testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt

To run tests for just one module:

* go to text_hr directory
* run tests by running module, e.g.:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: running doctests

* in the case you're not running from dev version, you'll get output like this:

    > py pronouns.py
    __main__: running doctests
    ..\tests\test_pronouns.txt: Not found, skipping
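For readers unfamiliar with doctests, the following is a minimal sketch of the mechanism the tests above rely on, using only Python's standard doctest module. It is not taken from text-hr: the function count_forms, the helper run_doctests and the sample mapping are hypothetical and only illustrate how examples embedded in a docstring are collected and checked.

```python
import doctest


def count_forms(forms):
    """Count word forms in a mapping of form key -> list of forms.

    The mappings below are hand-written stand-ins, not real text-hr output.

    >>> count_forms({"AOR/S/1": ["platih"], "AOR/S/2": ["plati"]})
    2
    >>> count_forms({})
    0
    """
    return sum(len(values) for values in forms.values())


def run_doctests():
    """Run the docstring examples of count_forms; return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    # module=False: do not try to locate the defining module; the names the
    # examples need are supplied explicitly through globs.
    tests = finder.find(count_forms, name="count_forms", module=False,
                        globs={"count_forms": count_forms})
    for test in tests:
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted
```

Calling run_doctests() from an `if __name__ == "__main__":` guard at the bottom of a module gives roughly the "run the module to run its doctests" behaviour described above; the standard library's doctest.testmod() and doctest.testfile() are the usual shortcuts for modules and for text files such as tests/test_*.txt.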

Requirements: No special requirements

Platforms: *nix, Linux



License: Freeware

Size: 112.64 KB



