HTML Parser 1.6-20060610

  Date Added: October 24, 2010  |  Visits: 727

HTMLParser is a super-fast real-time parser for real-world HTML. What has attracted most developers to HTMLParser is its simplicity of design, its speed and its ability to handle streaming real-world HTML. The two fundamental use cases handled by the parser are extraction and transformation (the synthesis use case, where HTML pages are created from scratch, is better handled by other tools closer to the source of data). While prior versions concentrated on data extraction from web pages, Version 1.4 of HTMLParser has substantial improvements in the area of transforming web pages, with simplified tag creation and editing, and verbatim toHtml() method output.

To use HTMLParser you will need to be able to write code in the Java programming language. Although some example programs are provided that may be useful as they stand, it is more than likely you will need (or want) to create your own programs or modify the ones provided to match your intended application.

To use the library, add either htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low-level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested, differentiated tags containing string, remark and other tag nodes. So where the output from calls to the lexer nextNode() method might be:

    <html>
    <head>
    <title>
    "Welcome"
    </title>
    </head>
    <body>
    etc...

The output from the parser NodeIterator would nest the tags as children of the <html>, <head> and other nodes (here represented by indentation):

    <html>
        <head>
            <title>
                "Welcome"
                </title>
            </head>
        <body>
            etc...

The parser attempts to balance opening tags with ending tags to present the structure of the page, while the lexer simply emits nodes in the order they appear.
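The difference between the flat lexer view and the nested parser view can be illustrated with a toy sketch. The code below does not use the HTMLParser API: the class NestingSketch and its methods lex and nest are hypothetical names invented for this illustration, and the logic assumes a small, well-formed fragment rather than real-world HTML. It first splits the input into tag and text nodes in document order, then balances opening tags against ending tags to compute an indentation depth, with each ending tag shown at the same level as the children of the tag it closes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT the HTMLParser API, and not robust to real-world HTML.
class NestingSketch {

    // Flat view: split the input into tag nodes ("<...>") and text nodes, in document order.
    static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) close = html.length() - 1; // unterminated tag: take the rest
                nodes.add(html.substring(i, close + 1));
                i = close + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next < 0) next = html.length();
                String text = html.substring(i, next).trim();
                if (!text.isEmpty()) nodes.add(text); // skip whitespace-only text
                i = next;
            }
        }
        return nodes;
    }

    // Nested view: indent every node by its depth. An opening tag increases the
    // depth for what follows; an ending tag is printed at the child level and
    // then decreases the depth again.
    static List<String> nest(List<String> nodes) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        for (String node : nodes) {
            lines.add("    ".repeat(depth) + node);
            if (node.startsWith("</")) {
                depth = Math.max(0, depth - 1);
            } else if (node.startsWith("<")) {
                depth++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> flat = lex("<html><head><title>Welcome</title></head><body>");
        System.out.println("Flat (lexer-style):");
        flat.forEach(System.out::println);
        System.out.println("Nested (parser-style):");
        nest(flat).forEach(System.out::println);
    }
}
```

A real parser also has to cope with unclosed tags, void elements and malformed markup, which this sketch deliberately ignores.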
If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.

Extraction

Extraction encompasses all the information retrieval programs that are not meant to preserve the source page. This covers uses like:

- text extraction, for use as input for text search engine databases for example
- link extraction, for crawling through web pages or harvesting email addresses
- screen scraping, for programmatic data input from web pages
- resource extraction, collecting images or sound
- a browser front end, the preliminary stage of page display
- link checking, ensuring links are valid
- site monitoring, checking for page differences beyond simplistic diffs

There are several facilities in the HTMLParser codebase to help with extraction, including filters, visitors and JavaBeans.

Transformation

Transformation includes all processing where the input and the output are HTML pages. Some examples are:

- URL rewriting, modifying some or all links on a page
- site capture, moving content from the web to local disk
- censorship, removing offending words and phrases from pages
- HTML cleanup, correcting erroneous pages
- ad removal, excising URLs referencing advertising
- conversion to XML, moving existing web pages to XML

During or after reading in a page, operations on the nodes can accomplish many transformation tasks "in place", which can then be output with the toHtml() method. Depending on the purpose of your application, you will probably want to look into node decorators, visitors, or custom tags in conjunction with the PrototypicalNodeFactory.
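Link extraction and URL rewriting, one task from each of the two use cases above, can be sketched in a few lines. This is a toy under stated assumptions, not the HTMLParser API: the class LinkSketch and its methods extractLinks and rewriteLinks are hypothetical names, and the code uses a regular expression that only recognises double-quoted href attributes. The library's filters, visitors and tag classes would be the appropriate tools for real pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: NOT the HTMLParser API. Handles double-quoted href values only.
class LinkSketch {

    // Group 1: attribute name, '=' and opening quote; group 2: the URL; group 3: closing quote.
    private static final Pattern HREF =
            Pattern.compile("(href\\s*=\\s*\")([^\"]*)(\")", Pattern.CASE_INSENSITIVE);

    // Extraction: collect the URLs and discard the rest of the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    // Transformation: return the page with every URL passed through `rewrite`,
    // leaving all other text untouched.
    static String rewriteLinks(String html, Function<String, String> rewrite) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + rewrite.apply(m.group(2)) + m.group(3);
            // quoteReplacement keeps '$' and '\' in URLs from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/a\">A</a> <a href=\"/b\">B</a>";
        System.out.println(extractLinks(page));
        // Make site-relative links absolute (example.com is a placeholder host).
        System.out.println(rewriteLinks(page, u -> u.startsWith("/") ? "http://example.com" + u : u));
    }
}
```

The extraction method throws the page away and keeps only the data, while the transformation method returns a complete page again, which mirrors the distinction drawn between the two use cases.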
The HTML Parser is an open source library released under the GNU Lesser General Public License, which basically says you are free to use the library "as is" in other (even proprietary) products, as long as due credit is given to the authors and the source code for the HTMLParser is included or available with the other product. For modified or embedded use, please consult the LGPL.

Requirements: No special requirements
Platforms: Linux
Keyword: Extraction Html Html Parser Htmlparser Library Markup Nodes Page Pages Parser Text Editing Processing Web Pages

License: Freeware
Size: 4.2 MB