Implementation of an information extraction system that uses Iterative Set Expansion to extract relations of a given type (among four supported categories) from the web. Search results returned by Google are parsed with BeautifulSoup and processed with Stanford CoreNLP and the Stanford Relation Extractor.
This is a project that I developed in Fall 2017 for the course COMS 6111 Advanced Database Systems, taught by Professor Luis Gravano at Columbia University. Below are the instructions and the README that I submitted for the project.
- README.md
- config.py
- iterative_set_expansion
    - __init__.py
    - main.py
    - annotate.py
    - extract.py
    - helpers.py
    - preprocess.py
    - query.py
    - relation_set.py
    - scrape.py
- logs
    - iterative_set_expansion.log
    - transcript.txt
- requirements.txt
- resources
    - NLPCore
        - NLPCore.py
        - data.py
        - NLPCore
        - setup.py
- tests
    - mock_query_and_scraping.py
    - query_history.json
From the top-level iterative_set_expansion folder, install all the requirements with:
bash setup.sh
(answer Yes whenever prompted)
The bash script in step 1 downloads the CoreNLP distribution and puts it in the right place. If you already have it on your machine, you may comment out the wget..., unzip... and mv... lines in the script and manually move the stanford-corenlp-full-2017-06-09 folder into resources/.
Run the project with:
python3 -m iterative_set_expansion <relation> <threshold> <initial query> <k>
For example:
python3 -m iterative_set_expansion 1 0.30 "trump washington" 10
You should place your Google API credentials in a file called config.py. This file must define two variables, DEVELOPER_KEY and SEARCH_ENGINE_ID. These instructions have been tested on a Google Cloud VM running Ubuntu 16.04. If one of the steps of the setup script fails, you can see all the steps by opening setup.sh.
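For reference, config.py only needs to define these two variables (placeholder values shown here, not real credentials):

```python
# config.py -- placeholder values; replace with your own Google API credentials
DEVELOPER_KEY = "your-google-api-key"
SEARCH_ENGINE_ID = "your-custom-search-engine-id"
```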
The user enters a seed query along with the desired relation, the number of tuples to extract (k), and the confidence threshold. The program queries the Google API, which returns the top 10 results, and then scrapes the corresponding URLs to get the content of the pages. Once scraped, the pages are preprocessed with NLTK (split into a list of strings, one per sentence; uppercase characters and punctuation are kept, as they help relation extraction) and passed to the annotator. The annotator runs two pipelines: the first detects named entities, and the second is run only on sentences containing the named-entity types required for the desired relation. After the second pipeline, the extractor reads the annotator's output and extracts the identified relations. The results are stored in a set X (of custom class relation_set) which handles duplicates and pruning. If, at the end of this first iteration, X contains at least k relations above the threshold, the main loop stops; otherwise a new query is generated from the highest-confidence relation not yet used, and the process is repeated until k relations are extracted.
The main loop controlling this flow is in the main() method in main.py. Each part of the project lives in a separate class, which allows better abstraction: with little effort we could change the query API, swap the NLP tool used for annotation and relation extraction, add extra preprocessing steps, etc.
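In pseudocode, the loop has roughly this shape (a simplified sketch with hypothetical helper names, not the exact code in main.py):

```python
# Simplified sketch of the iterative-set-expansion loop (hypothetical names).
def iterative_set_expansion(relation, threshold, seed_query, k):
    X = RelationSet(threshold)              # stores, deduplicates and prunes tuples
    query, used_queries = seed_query, set()
    while True:
        used_queries.add(query)
        urls = query_google_api(query)      # top-10 results from the Custom Search API
        pages = scrape_all(urls)            # parallel retrieval + BeautifulSoup parsing
        for page in pages:
            sentences = split_into_sentences(page)     # NLTK preprocessing
            annotated = annotate(sentences, relation)  # the two CoreNLP pipelines
            for rel in extract_relations(annotated, relation):
                X.add(rel)                  # keeps the highest-confidence duplicate
        X.prune()                           # drop tuples below the confidence threshold
        if len(X) >= k:
            return X
        query = X.generate_query(used_queries)  # best unused tuple becomes the next query
```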
The main tasks, and the modules that perform them, are:
- run the main loop: main.py
- retrieve the content of a URL: scrape.py
- query the Google API: query.py
- preprocess the pages (split into sentences): preprocess.py
- annotate the sentences (two pipelines): annotate.py
- extract relations from the annotated text: extract.py
- store the extracted relations and filter them: relation_set.py
- generate the new query: relation_set.py
The project also contains some test files. Each time the project is run, the results of the query and the scraping are stored in a JSON file, which makes it possible to run a series of tests without querying the search engine or scraping. See the functions in mock_query_and_scraping.py.
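For instance, a mock of the query step could simply replay the saved results (a hypothetical sketch; the actual structure of query_history.json may differ):

```python
import json

# Hypothetical sketch: replay previously saved search results instead of
# calling the Google API (the real layout of query_history.json may differ).
def mock_query(query, history_file="tests/query_history.json"):
    with open(history_file) as f:
        history = json.load(f)
    return history[query]  # the stored top-10 results for this query
```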
The Annotator class also contains a mock method, which uses already-computed .xml files. It allows testing the final steps (selecting sentences for the second pipeline, extracting relations after the second pipeline, adding relations to the relation set) without having to launch Java and run CoreNLP.
Additional information about runtime (extracted plain text, etc.) can be read directly in the logs contained in the logs folder.
The webpages are retrieved using urllib and scraped using BeautifulSoup, in the scrape method in scrape.py. A 30-second timeout is set for urllib. Each document is retrieved and scraped in parallel using a ThreadPool. If retrieval or scraping fails, the main loop receives an empty string and the document is skipped. BeautifulSoup extracts all the text inside <p> tags, as the actual text content of a website is generally contained mainly inside these sections.
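A condensed sketch of that logic (simplified relative to the actual scrape.py):

```python
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch a URL with a 30-second timeout and return the text inside <p> tags;
# any failure yields an empty string so the main loop just skips the document.
def scrape(url):
    try:
        html = urlopen(url, timeout=30).read()
        soup = BeautifulSoup(html, "html.parser")
        return " ".join(p.get_text() for p in soup.find_all("p"))
    except Exception:
        return ""

# Retrieve and scrape all documents in parallel with a ThreadPool.
def scrape_all(urls, workers=10):
    with ThreadPool(workers) as pool:
        return pool.map(scrape, urls)
```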
In annotate.py, the annotate function runs both pipelines by calling run_pipeline, which wraps the Python CoreNLP wrapper itself, located in the resources folder. Note that the wrapper has been slightly modified to use 10 separate input.txt and input.txt.xml files, which could later help parallelize annotation. In my tests, I tried launching different processes to annotate documents in parallel, but the results were not good enough (I also tried running CoreNLP with the -nthreads option, which did not improve efficiency either), and this project emphasizes the quality of results over efficiency.
One advantage of this modification of the original Python CoreNLP wrapper is that during tests, after each run, we can inspect the txt and xml files for each document. Also note that the Annotator class contains a mock_run_pipeline method; see the "Test" section above.
The two pipelines are, as described in the project assignment, tokenize,ssplit,pos,lemma,ner and tokenize,ssplit,pos,lemma,ner,parse,relation.
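To make the two-stage design concrete, here is a rough sketch of the filtering step between the pipelines (the attribute names and the entity-type mapping are assumptions, not the exact code):

```python
# The two annotator pipelines, as CoreNLP "annotators" property strings.
PIPELINE_1 = "tokenize,ssplit,pos,lemma,ner"
PIPELINE_2 = "tokenize,ssplit,pos,lemma,ner,parse,relation"

# Illustrative mapping for one relation type; the real project supports four.
REQUIRED_ENTITIES = {"Work_For": {"PERSON", "ORGANIZATION"}}

# Only sentences whose NER tags cover the entity types required by the target
# relation are sent to the expensive second pipeline.
def select_sentences(annotated_sentences, relation):
    required = REQUIRED_ENTITIES[relation]
    selected = []
    for sentence in annotated_sentences:
        ner_tags = {token.ner for token in sentence.tokens}  # assumed wrapper attributes
        if required <= ner_tags:
            selected.append(sentence)
    return selected
```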
The Annotator class sends, for each document, an object of the custom type Document (defined in the Python CoreNLP wrapper) to the Extractor class. The Extractor class knows the desired relation and the entity types needed for its extraction, and transforms all the Relation objects from the Document into dicts. These dicts are then passed to the relation_set object X, which handles duplicates and pruning. Note that the pruning could have been done earlier without affecting the quality of the program, but this implementation follows the reference implementation style.
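The exact field names are internal to the project, but such a dict can be pictured roughly like this (illustrative values; the relation label comes from Stanford's relation annotator):

```python
# Illustrative example of one extracted relation as a dict (field names are
# assumptions, not the actual keys used in extract.py).
relation = {
    "relation": "Work_For",   # one of the four supported relation types
    "entity_1": "Jeff Bezos",
    "entity_2": "Amazon",
    "confidence": 0.87,       # probability reported by the relation extractor
}
```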
The relations are stored in a RelationSet object, which is mainly a wrapper around a pandas DataFrame that actually contains the relations. The add method of a RelationSet handles duplicates: if we try to add a tuple that already exists, only the instance with the highest confidence is kept. The RelationSet also provides a prune method, called at the end of each iteration of the main loop, that keeps only the relations with confidence above the threshold. Finally, RelationSet contains a generate_query method that picks the relation for the next query, as described in the project assignment, i.e. it returns the best (highest-confidence) relation that has not yet been used.
pandas was chosen 1) to pretty-print the relation table with little effort and, mostly, 2) to select the next query, since pandas allows SQL-like operations on DataFrames (similar to a left outer join, etc.).
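A minimal sketch of this design, assuming the relations are rows of a DataFrame with hypothetical column names (not the actual relation_set.py):

```python
import pandas as pd

# Minimal sketch of the RelationSet idea (hypothetical column names).
class RelationSet:
    def __init__(self, threshold):
        self.threshold = threshold
        self.df = pd.DataFrame(columns=["entity_1", "entity_2", "confidence"])

    def add(self, rel):
        # Keep only the highest-confidence instance of each (entity_1, entity_2) pair.
        self.df = pd.concat([self.df, pd.DataFrame([rel])], ignore_index=True)
        self.df = (self.df.sort_values("confidence", ascending=False)
                          .drop_duplicates(subset=["entity_1", "entity_2"], keep="first"))

    def prune(self):
        # Discard tuples below the confidence threshold.
        self.df = self.df[self.df["confidence"] >= self.threshold]

    def generate_query(self, used_queries):
        # Highest-confidence tuple that has not yet been used as a query.
        for _, row in self.df.sort_values("confidence", ascending=False).iterrows():
            query = f'{row["entity_1"]} {row["entity_2"]}'
            if query not in used_queries:
                return query
        return None

    def __len__(self):
        return len(self.df)
```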
If the choice of the next query were to become more complex, it would be a better abstraction to move its logic outside of the RelationSet class.