Single-Pass In-Memory Indexing

This project was done for the fall 2018 COMP 479 - Information Retrieval course at Concordia University. The goal was to tokenize the documents contained in a set of Reuters files and build an inverted index that maps each term to its postings list.
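As a rough illustration of the idea, here is a minimal sketch of single-pass in-memory indexing. It is not the code from this repository: the function names (`tokenize`, `spimi_invert`, `blocks`, `merge_blocks`), the regex tokenizer, and the sample documents are all assumptions made for the example. A full SPIMI implementation would write each block to disk and merge the block files, whereas this sketch keeps everything in memory.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Split text on runs of non-alphanumeric characters (hypothetical tokenizer)."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def spimi_invert(block):
    """Build a term-sorted index for one block of (doc_id, text) pairs."""
    index = defaultdict(list)
    for doc_id, text in block:
        for term in tokenize(text):
            postings = index[term]
            # Documents arrive in doc_id order, so checking the last entry
            # is enough to avoid duplicate postings.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    # Terms are sorted once per block, after the whole block has been read.
    return dict(sorted(index.items()))


def blocks(documents, docs_per_block):
    """Yield consecutive slices of at most docs_per_block documents."""
    for start in range(0, len(documents), docs_per_block):
        yield documents[start:start + docs_per_block]


def merge_blocks(block_indexes):
    """Merge per-block indexes into one index with sorted postings lists."""
    merged = defaultdict(list)
    for index in block_indexes:
        for term, postings in index.items():
            merged[term].extend(postings)
    return {term: sorted(set(p)) for term, p in sorted(merged.items())}


# Made-up sample documents, not taken from the Reuters files.
docs = [(1, "Oil prices rise"), (2, "Oil exports fall"), (3, "Prices fall again")]
block_indexes = [spimi_invert(b) for b in blocks(docs, docs_per_block=2)]
final_index = merge_blocks(block_indexes)
print(final_index["Oil"])   # [1, 2]
print(final_index["fall"])  # [2, 3]
```

No case folding is applied here, so "Prices" and "prices" end up as separate terms.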

The Reuters files can be downloaded here. If they are not already in the root directory of the project, the program downloads them for you when it runs.

Getting Started

Prerequisites

The Python packages required to run the program, along with the specific versions used for this project, are listed in requirements.txt.

Or just run it with Docker.

Docker

I also included a Dockerfile to make it easier to run on any machine. First, make sure you cd into this repository.

To build the image and start up a container:

docker image build -t spimi .
docker container run -it --name spimi-demo spimi bash

This will take you to an interactive Bash terminal, from which you can run the script. You can include the --rm option in the run command to automatically remove the container when you exit out of it.

Running

The file to run, main.py, is in the src/ directory.

python3 main.py [-d DOCS_PER_BLOCK]
                [-r {1, 2, 3, ..., 22}]
                [-rs] [-s] [-c] [-rn]
                [-a]

optional arguments:
    -d, --docs                      number of documents per block (default 500)
    -r, --reuters                   number of Reuters files to parse (1-22) (default 22)
    -rs, --remove-stopwords         remove stopwords from the index
    -s, --stem                      stem terms in the index
    -c, --case-folding              reduce terms in the index to lowercase
    -rn, --remove-numbers           remove numbers from the index
    -a, --all                       use options -rs, -s, -c, and -rn
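To show how these flags could drive term preprocessing, here is a minimal sketch using `argparse`. It mirrors the usage text above but is not the code in main.py: `build_parser`, `normalize`, the tiny `STOPWORDS` set, the number pattern, and the order of the steps are all assumptions. Stemming is left as an optional callable so the sketch has no third-party dependencies; something like NLTK's `PorterStemmer().stem` could be passed in.

```python
import argparse
import re

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}
# Treats tokens such as "12", "3.5" or "1,000" as numbers (an assumption).
NUMBER = re.compile(r"\d+([.,]\d+)*")


def build_parser():
    """Hypothetical parser mirroring the options listed in the usage text."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("-d", "--docs", type=int, default=500,
                        help="number of documents per block")
    parser.add_argument("-r", "--reuters", type=int, default=22,
                        choices=range(1, 23), metavar="N",
                        help="number of Reuters files to parse (1-22)")
    parser.add_argument("-rs", "--remove-stopwords", action="store_true")
    parser.add_argument("-s", "--stem", action="store_true")
    parser.add_argument("-c", "--case-folding", action="store_true")
    parser.add_argument("-rn", "--remove-numbers", action="store_true")
    parser.add_argument("-a", "--all", action="store_true")
    return parser


def normalize(tokens, args, stemmer=None):
    """Apply the selected preprocessing steps to a list of tokens."""
    result = []
    for tok in tokens:
        if (args.all or args.remove_numbers) and NUMBER.fullmatch(tok):
            continue
        if args.all or args.case_folding:
            tok = tok.lower()
        # Stopwords are compared in lowercase even without case folding.
        if (args.all or args.remove_stopwords) and tok.lower() in STOPWORDS:
            continue
        if (args.all or args.stem) and stemmer is not None:
            tok = stemmer(tok)
        result.append(tok)
    return result


args = build_parser().parse_args(["-rs", "-c", "-rn"])
tokens = ["The", "price", "of", "Oil", "rose", "12", "percent"]
print(normalize(tokens, args))  # ['price', 'oil', 'rose', 'percent']
```

Each step shrinks or merges dictionary entries, so running with and without a given flag is a simple way to compare index sizes.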

Generated files will appear in the root directory of the repository.

Author

Vartan Benohanian - ID: 27492049

Report

A project report with a more detailed description of the SPIMI implementation is available here.

The report covering the Okapi BM25 ranking function can be viewed here.
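For readers unfamiliar with the ranking function, here is a minimal sketch of Okapi BM25 scoring. It is not taken from the report or from this repository: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75` (common defaults), the non-negative IDF variant, and the sample documents are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in docs (doc_id -> list of tokens) for a query."""
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    term_freqs = {doc_id: Counter(toks) for doc_id, toks in docs.items()}
    scores = {}
    for doc_id, toks in docs.items():
        score = 0.0
        for term in query_terms:
            tf = term_freqs[doc_id][term]
            if tf == 0:
                continue
            # Document frequency: number of documents containing the term.
            df = sum(1 for freqs in term_freqs.values() if term in freqs)
            # IDF variant with +1 inside the log, which keeps it non-negative.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization: longer documents are penalized via b.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


# Made-up sample documents, already tokenized.
docs = {
    1: ["oil", "prices", "rise"],
    2: ["oil", "exports", "fall"],
    3: ["gold", "prices", "fall", "again"],
}
scores = bm25_scores(["oil", "prices"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 2, 3]
```

Document 1 matches both query terms and ranks first. Documents 2 and 3 each match one term with the same document frequency, so the shorter document 2 ranks higher.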

The Expectations of Originality form is available here.

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
