Managing Gigabytes (MG) INFORMATION RETRIEVAL SYSTEM

Compressing and Indexing Documents and Images

The MG system is a suite of programs for compressing and indexing text and images. Most of the functionality implemented in the suite is as described in the book ``Managing Gigabytes: Compressing and Indexing Documents and Images'', second edition, I.H. Witten, A. Moffat, and T.C. Bell, Morgan Kaufmann, San Francisco, 1999, ISBN 1-55860-570-3; US $54.95. See also the web page http://www.cs.mu.oz.au/mg/.

These features include:

-- text compression using a Huffman-coded semi-static word-based scheme
-- two-level context-based compression of bi-level images
-- FELICS lossless compression of gray-scale images
-- combined lossy/lossless compression for textual images
-- indexing algorithms for large volumes of text in limited main memory
-- index compression
-- a retrieval system that processes Boolean and ranked queries
-- an X windows interface to the retrieval system
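To give a feel for the first item in the list above, here is a minimal Python sketch of a semi-static, word-based Huffman coder. It is not MG's code: the function names are hypothetical, the output is a string of "0"/"1" characters rather than packed bits, and punctuation and spacing are simply dropped, which a real compressor would also have to encode. "Semi-static" here means two passes: one to count word frequencies, one to encode with the resulting fixed code table.

```python
import heapq
import re
from collections import Counter
from itertools import count


def build_huffman_codes(freqs):
    """Return {symbol: bitstring} for a frequency table using Huffman's algorithm."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A lone symbol still needs a one-bit code.
        return {next(iter(freqs)): "0"}
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(weight, next(tiebreak), {sym: ""}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def compress(text):
    """Two-pass (semi-static) compression: count words, then encode them."""
    words = re.findall(r"\w+", text.lower())
    codes = build_huffman_codes(Counter(words))   # pass 1: build the model
    bits = "".join(codes[w] for w in words)       # pass 2: encode with it
    return codes, bits


def decompress(codes, bits):
    """Decode a bit string back into the word sequence."""
    decode = {code: sym for sym, code in codes.items()}
    words, current = [], ""
    for bit in bits:
        current += bit
        if current in decode:  # prefix-free codes make the match unambiguous
            words.append(decode[current])
            current = ""
    return words
```

Frequent words receive codes no longer than those of rare words, which is where the saving comes from on natural-language text.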

As one example, a collection of 2 GB of text (1,700,000 documents) can be indexed (on a 266 MHz Pentium II) in about one hour and compressed in a further hour to make a database that in total occupies less than 800 MB, or 40% of the original size. This includes a full index to every word and number in the original text. Boolean queries such as ``managing AND gigabytes'' run in a few tenths of a second on the same hardware, and ranked queries of 30--50 terms are evaluated in 1--3 seconds.
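As a rough illustration of how a Boolean query such as ``managing AND gigabytes'' can be answered from a full-text index, the Python sketch below builds a tiny in-memory inverted index and intersects sorted postings lists. It is a toy under stated assumptions, not MG's implementation: the names build_index and boolean_and are hypothetical, and MG's on-disk, compressed index is considerably more involved.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to a sorted list of the document numbers containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def boolean_and(index, *terms):
    """Return the documents containing every term, by merging sorted postings lists."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for other in lists[1:]:
        merged, i, j = [], 0, 0
        while i < len(result) and j < len(other):
            if result[i] == other[j]:
                merged.append(result[i])
                i += 1
                j += 1
            elif result[i] < other[j]:
                i += 1
            else:
                j += 1
        result = merged
    return result


docs = [
    "Managing gigabytes of text",
    "Managing a team",
    "Gigabytes and terabytes",
    "managing many gigabytes",
]
index = build_index(docs)
matches = boolean_and(index, "managing", "gigabytes")
```

Only documents 0 and 3 contain both terms, so matches holds those two document numbers.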

Details of these methods and further performance results appear in the MG book.

The MG system comes with ABSOLUTELY NO WARRANTY; for details see the file COPYING.

Instructions on how to build and install mg are in the files INSTALL.mg and INSTALL.

CHANGES FROM BOOK

For copyright reasons the stemmer used in this distribution of MG is not the same as the one illustrated in Figure 3.9 on page 146 of the MG book. (However, the numbers generated by the command ``mgstat alice'' will match the numbers in Figure A.1 on page 454, unlike in the first edition of the MG book.) Another stemmer was initially written as a simple stopgap for version 1.0. That stemmer has now been replaced by a stemmer based on the Lovins stemming algorithm.

The output format of ``mgstat'' has changed slightly since Figure A.1 (page 454) was prepared. The same information is displayed but formatted differently.

Some of the on-disk data structures have changed slightly, to accommodate larger databases. This is reflected in some increased file sizes -- in most cases, the increase is just 8 bytes. This also means that databases built with older versions of mg are not compatible with this version.

MG VERSIONS

The current version is mg-1.2.1, August 1999. The changes from earlier versions are listed in the file MODIFICATIONS. This can be accessed with mg by building a database using ``mgbuild mods'' and can also be accessed from the mg web page (see below).

The mg-1.2.1 extensions include:

-- Fixes to compile under Linux (various distributions).
-- Fixes to avoid some 32-bit integer overflows when building large (> 2 GB) databases.
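The 2 GB threshold corresponds to the largest value a signed 32-bit integer can hold. The Python sketch below only illustrates the arithmetic; it does not show what MG's fixes actually changed, and the helper as_int32 is a hypothetical name that emulates 32-bit wraparound.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, the largest signed 32-bit value


def as_int32(value):
    """Return value as it would appear after wrapping into a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


# A byte offset 100 bytes past the 2 GiB mark no longer fits in 32 bits.
offset = 2 * 1024**3 + 100
wrapped = as_int32(offset)

print(offset > INT32_MAX)  # True: the offset exceeds the 32-bit range
print(wrapped)             # -2147483548: the offset wraps to a negative number
```

A file position stored this way would point to a nonsensical negative offset, which is why offsets and sizes need a wider type once a database grows past this point.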

The mg-1.2 extensions include:

-- Source modifications for use of GNU's autoconf.

The mg-1.1 extensions include:

-- A new highlighting mode. The output mode ``hilite'' will highlight the query terms in the retrieved text documents. The variable ``hilite_style'' can be set to ``bold'' or ``underline''. It works best with the pager ``less''. A .mgrc to use would include:

    .set pager less
    .set mode hilite
    .set hilite_style bold

-- A web site containing manual pages and documentation is at http://www.mds.rmit.edu.au/mg/. One of these pages, ``about_mg.html'', is included in this distribution.

-- A revised mg_get script which uses a .mg_getrc file to map specific collection names to filter types. (Modifications by Bruce McKenzie). See mg_get.1 for more details.

-- Code to perform merging of existing databases. This code was written by Shane Hudson (Canterbury) and is documented in the mgmerge.README file found in the docs subdirectory.

-- Revised man pages, including some new entries (thanks to Nelson Beebe). See mg.1, mgintro.1, mgintro++.1.

-- A real (rather than toy) stemmer.

PORTABILITY

Please refer to "README.port".

CREDITS

The MG development is largely the result of research collaboration between:

    Tim C. Bell            <[email protected]>
    Ian Witten             <[email protected]>
    Alistair Moffat        <[email protected]>
    Justin Zobel           <[email protected]>

The bulk of the programming work has been carried out by:

    Stuart Inglis          (Waikato)
    Craig Nevill-Manning   (Waikato)
    Neil Sharman           (Melbourne and RMIT)
    Tim Shimmin            (RMIT)

In addition to these, the following people have contributed to the development of the MG software:

    Lachlan Andrew         (RMIT)
    Tim A.H. Bell          (Melbourne)
    Owen de Kretser        (Melbourne)
    Gary Eddy              (Melbourne)
    Hugh Emberson          (Canterbury)
    Kerry Guise            (Waikato)
    Shane Hudson           (Canterbury)
    Linh Huynh             (Melbourne and RMIT)
    Bohdan S. Majewski     (Queensland)
    Bruce McKenzie         (Canterbury)
    William Weber          (RMIT)

In addition to these, the following people have submitted bug reports and suggestions/fixes:

    Rex Barzee
    Nelson Beebe
    Tim A.H. Bell
    Tim C. Bell
    Rodney Brown
    Rok Sosic
    Carl Staelin

Development of the MG system was supported by the Australian Research Council; the Universities of Melbourne, Waikato, Canterbury, and Calgary; RMIT; and the Collaborative Information Technology Research Institute (Melbourne).

BUG REPORTS

Send bug reports to [email protected]. But do please be aware that there is little likelihood of any immediate response apart from a "thank you for letting me know", as we have no funded support for MG, and any software development is voluntary on the part of my students. What I do guarantee is that your mail will be retained against the eventuality that one day someone does give us $50,000 for further software development. And if you have $50,000, and thought MG was wonderful, well, think of us...

FURTHER READING

The bibliography of the MG book lists a wide range of relevant papers. Other recent work relevant to the project is listed at http://www.cs.mu.oz.au/~alistair/abstracts/ and at http://www.cs.rmit.edu.au/~jz/Papers.html. The NZDL project home page is at http://www.nzdl.org.

BOOK

As today’s information explosion generates greater and greater volumes of raw data, the challenge of storing and retrieving this information in the most efficient manner continues to grow, whether the data is stored on a local disk or distributed over the World-Wide Web.

Managing Gigabytes helps you to meet this challenge by showing how to capitalize on new methods of compressing and accessing data, enabling you to store information more efficiently and locate specific items more quickly and cost-effectively than ever before. It uniquely covers fully-tested techniques for both text and image compression and shows how to construct a tailor-made electronic index for accessing text, scanned documents, and images.

This book largely avoids extensive theoretical and mathematical discussions, making it accessible to curious laypersons who seek a clear, uncomplicated understanding of this new technology. Real, large-scale problems are illustrated, and the technical material is sprinkled with anecdotes and background information. The new edition is updated with information about recent standards and discoveries. An instructor's supplement will be available.

Managing Gigabytes provides current and comprehensive tools and techniques that will help professionals and academics to work more confidently and effectively in today's increasingly paperless society.