
# Text::Bloom 1.07

Date Added: August 27, 2010  |  Visits: 604

Text::Bloom computes the Bloom signature of a set of terms.

SYNOPSIS

    my $b = Text::Bloom->new();
    $b->Compute( qw( foo bar baz ) );
    my $sig = $b->WriteToString();
    $b->WriteToFile( 'afile.sig' );
    my $b2 = Text::Bloom::NewFromFile( 'afile.sig' );
    my $b3 = Text::Bloom->new();
    $b3->Compute( qw( foo bar barbaz ) );
    my $sim = $b->Similarity( $b2 );
    my $b4 = Text::Bloom::NewFromString( $sig );

Text::Bloom applies the Bloom filtering technique to the statistical analysis of documents. The terms in the document are quantized using a base-36 radix representation; each term thus corresponds to an integer in the range 0..p-1, where p is a prime, currently set to the greatest prime less than 2^32.

Each quantized value is mapped to d integers in the range 0..size-1, where size is an integer less than p, currently 2^17, using a family of hash functions computed by the HashV function. Each hashed value is used as an index into a large bit vector. Bits corresponding to terms present in the document are set to 1; all other bits are left at 0. Collisions may cause the same bit to be set by different terms, so if the document contains n distinct terms, at most n * d bits of the resulting vector are set to 1.

The resulting bit string is a very compact representation of the presence or absence of terms in the document, and is therefore characterised as a signature. Moreover, it does not depend on a pre-set dictionary of terms. The signature may be used for:

- testing whether a given set of terms is present in the document;
- computing which fraction of terms are common to two documents.

The bit representation may be written to and read from a file. Text::Bloom prepends a header to the bit stream proper; moreover, whenever the package Compress::Zlib is available, the bit vector is compressed, so that disk space requirements are drastically reduced, especially for small documents.
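The signature construction described above can be sketched in plain Perl without the module itself. This is only an illustration under stated assumptions: the number of hash functions `$D`, the affine hash family in `hash_indices`, and the bit-ratio formula in `similarity` are hypothetical stand-ins, since the description specifies neither the internals of HashV nor how Similarity is computed. Only the prime p, the vector size 2^17, and the base-36 quantization come from the text.

```perl
use strict;
use warnings;

# Constants from the description; $D and the hash family are assumptions.
my $P    = 4294967291;   # greatest prime below 2**32
my $SIZE = 2**17;        # length of the bit vector
my $D    = 4;            # hypothetical number of hash functions

# Quantize a term as a base-36 number reduced modulo $P.
# Assumes a Perl built with 64-bit integers.
sub quantize {
    my ($term) = @_;
    my $q = 0;
    for my $ch (split //, $term) {
        my $digit = index('0123456789abcdefghijklmnopqrstuvwxyz', $ch);
        die "invalid character '$ch' in term '$term'\n" if $digit < 0;
        $q = ($q * 36 + $digit) % $P;
    }
    return $q;
}

# Stand-in hash family (not the module's HashV): $D affine maps modulo $P,
# each reduced to an index in 0..$SIZE-1.
sub hash_indices {
    my ($q) = @_;
    return map { ((2 * $_ + 1) * $q + 7919 * $_) % $P % $SIZE } 1 .. $D;
}

# Build the signature: set one bit per hashed index.
sub signature {
    my @terms = @_;
    my $bits = "\0" x ($SIZE / 8);
    for my $term (@terms) {
        vec($bits, $_, 1) = 1 for hash_indices(quantize($term));
    }
    return $bits;
}

# A term may be present only if all of its bits are set
# (false positives are possible, false negatives are not).
sub may_contain {
    my ($bits, $term) = @_;
    for my $i (hash_indices(quantize($term))) {
        return 0 unless vec($bits, $i, 1);
    }
    return 1;
}

# Assumed similarity measure: bits set in both signatures divided by
# bits set in either one.
sub similarity {
    my ($x, $y) = @_;
    my $union = unpack('%32b*', $x | $y);
    return $union ? unpack('%32b*', $x & $y) / $union : 0;
}

my $s1 = signature(qw(foo bar baz));
my $s2 = signature(qw(foo bar barbaz));
printf "contains foo: %d\n", may_contain($s1, 'foo');
printf "similarity:   %.2f\n", similarity($s1, $s2);
```

With these assumptions each signature occupies 16 KiB before compression, whatever the number of terms, and at most `$D` bits are set per distinct term.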
The hash function is a crucial component of the filter; the reference implementation uses a radix representation of strings, so each term must match the regular expression /[0-9a-z]+/. There are quite a few viable alternatives, which can be pursued by subclassing and redefining the method QuantizeV.
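As a rough illustration of the term restriction and of what an alternative quantization might look like, the sketch below uses two hypothetical standalone functions. `tokenize` reduces raw text to terms matching /[0-9a-z]+/, and `quantize_bytes` is one possible replacement that accepts arbitrary byte strings. The calling convention of QuantizeV is not given in the description, so neither function is written as an actual override; both names are assumptions.

```perl
use strict;
use warnings;

my $P = 4294967291;    # greatest prime below 2**32, as in the description

# Reduce raw text to terms made only of the characters 0-9 and a-z.
sub tokenize {
    my ($text) = @_;
    return grep { length } split /[^0-9a-z]+/, lc $text;
}

# Hypothetical alternative quantizer: treats the term as a base-257 number
# over its bytes (each byte shifted by one), reduced modulo $P.
# Assumes a Perl built with 64-bit integers.
sub quantize_bytes {
    my ($term) = @_;
    my $q = 0;
    $q = ($q * 257 + $_ + 1) % $P for unpack 'C*', $term;
    return $q;
}

my @terms = tokenize('Foo, bar; BAZ-42!');
print join(' ', @terms), "\n";
printf "%s -> %d\n", $_, quantize_bytes($_) for @terms;
```

Any quantizer of this kind only needs to return an integer in the range 0..p-1; the hashing and bit-setting steps are unaffected by the choice.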

Requirements: No special requirements  |  Platforms: Linux  |  Users rating: 0/10

Keywords: Bit, Bloom, Document, Libraries, Programming, Representation, Signature, Terms, Textbloom