Index Construction

Question

Build the inverted index for the following documents:

ID1: Selenium is a portable framework for testing web applications

ID2: Beautiful Soup is useful for web scraping

ID3: It is a python package for parsing the pages

Perform index compression on the integer values in the inverted index (with duplicates eliminated) using Elias delta coding and the variable byte scheme.

Steps

We first pre-process the documents and then split them into tokens (words). Pre-processing consists of conversion to lower case, removal of numbers and other special characters, and stop word removal. We then tokenise each document using the NLTK library.
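The pre-processing step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: to stay runnable without downloading NLTK data, it substitutes a regular expression and whitespace splitting for the NLTK tokeniser, and a tiny hand-picked `STOP_WORDS` set for NLTK's English stop word list. The names `preprocess`, `STOP_WORDS`, and `DOCS` are assumptions made for this example. Stop words are filtered after splitting here, since the filter works on individual tokens.

```python
import re

# Assumption: a tiny hand-picked stop word set standing in for NLTK's English list.
STOP_WORDS = {"a", "is", "for", "it", "the"}

# The three documents from the question, keyed by document ID.
DOCS = {
    1: "Selenium is a portable framework for testing web applications",
    2: "Beautiful Soup is useful for web scraping",
    3: "It is a python package for parsing the pages",
}


def preprocess(text):
    """Lower-case, drop digits and special characters, tokenise, remove stop words."""
    text = text.lower()
    # Replace everything except lower-case letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Stand-in for the NLTK tokeniser: split on whitespace.
    tokens = text.split()
    return [token for token in tokens if token not in STOP_WORDS]


tokens_by_doc = {doc_id: preprocess(text) for doc_id, text in DOCS.items()}
```

With the real NLTK stop word list, more words may be filtered out than with this small stand-in set.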

We then take the tokens and form an inverted index. Each entry maps a word to a list of its occurrences across all documents. Each element of that list holds the document number, the number of times the word occurs in that document, and the list of offset positions at which the word occurs in that document.
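One possible shape for this structure is sketched below. It assumes that offsets are 0-based positions in the pre-processed token list (the notebook may instead count positions in the original text), and the function name `build_inverted_index` and the hard-coded `tokens_by_doc` input are illustrative only.

```python
from collections import defaultdict


def build_inverted_index(tokens_by_doc):
    """Map each word to a list of (doc_id, frequency, offsets) postings."""
    index = defaultdict(list)
    for doc_id in sorted(tokens_by_doc):
        # Collect the offsets of every word within this one document.
        positions = defaultdict(list)
        for offset, token in enumerate(tokens_by_doc[doc_id]):
            positions[token].append(offset)
        # One posting per (word, document) pair.
        for token, offsets in positions.items():
            index[token].append((doc_id, len(offsets), offsets))
    return dict(index)


# Assumed input: token lists for the three documents after pre-processing.
tokens_by_doc = {
    1: ["selenium", "portable", "framework", "testing", "web", "applications"],
    2: ["beautiful", "soup", "useful", "web", "scraping"],
    3: ["python", "package", "parsing", "pages"],
}

index = build_inverted_index(tokens_by_doc)
print(index["web"])  # [(1, 1, [4]), (2, 1, [3])]
```

Here "web" is the only word that appears in more than one document, so its entry carries two postings.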

For the second part of the question, we extract all the numbers in this inverted index and compress them using the two methods stated above.
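The two coding schemes can be sketched as below, under some stated assumptions. Elias delta coding is defined only for positive integers, so a 0-based offset would need to be shifted by one before encoding. The variable byte encoder follows the common textbook convention in which each byte carries 7 payload bits and the high bit is set only on the last byte of a number; other conventions flip that flag. The sample set `numbers` is hypothetical and stands in for the de-duplicated integers extracted from the index.

```python
def elias_gamma(n):
    """Elias gamma code of n >= 1 as a bit string: unary length prefix, then binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_delta(n):
    """Elias delta code of n >= 1: gamma-code the bit length, then the bits after the leading 1."""
    if n < 1:
        raise ValueError("Elias delta coding is defined only for positive integers")
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]


def vb_encode(n):
    """Variable byte code of n >= 0: 7 payload bits per byte, high bit marks the last byte."""
    if n < 0:
        raise ValueError("variable byte coding expects a non-negative integer")
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # set the termination flag on the final byte
    return bytes(chunks)


def vb_decode(data):
    """Decode a concatenation of variable byte codes back into a list of integers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


# Hypothetical de-duplicated values (document IDs, frequencies, 1-based offsets).
numbers = sorted({1, 2, 3, 4, 5, 6})
delta_codes = {n: elias_delta(n) for n in numbers}
vb_codes = {n: vb_encode(n) for n in numbers}
```

For values this small, every variable byte code occupies a full byte, while the Elias delta codes take between one and five bits each, so the bit-level scheme is the more compact of the two on this input.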
