
News Analysis

A framework for automated web scraping and sentiment analysis for news articles. Provides trained SVM and Naive Bayes classifiers for sentiment prediction.

This project demonstrates an entire analysis pipeline, from data collection through processing to visualization. Since the implementation will vary significantly depending on the use case, it is NOT meant to be a toolkit that can be dropped directly into your own project, just an example of what such a project might look like. Most of the code can be found in analyzer.py and is fully documented.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Prerequisites

This project requires Python 3.6 or later, which can be installed from the official Python website (https://www.python.org).

Installing

To download the project source, open a terminal and run

git clone [email protected]:christopher-wolff/news-analysis.git

or download it directly from GitHub.

(optional) We highly recommend using a Python virtual environment to install the dependencies. Creating and activating one is as simple as

cd /path/to/project/directory/
python3 -m venv virtualenv
source virtualenv/bin/activate

More information about Python virtual environments can be found in the venv documentation (https://docs.python.org/3/library/venv.html).

This project has the following dependencies: nltk, scikit-learn, beautifulsoup4, matplotlib, and TextBlob.

These can be installed using the provided requirements.txt with the command

pip3 install -r requirements.txt

inside the project root directory.

Lastly, the nltk library requires some additional data files. These can be downloaded by launching an interactive Python shell with

python3

and then running

>>> import nltk
>>> nltk.download('stopwords')

On macOS, you may need to run

/Applications/Python\ 3.6/Install\ Certificates.command

prior to that to resolve an SSL certificate issue between the nltk downloader and Python 3.

In order to use the query feature, which sends requests to the New York Times Most Popular API and stores the results automatically, you will need to obtain an API key from the NYT developer portal (https://developer.nytimes.com). Add it to the file settings_example.json under the "API" field, and rename the file to settings.json:

mv settings_example.json settings.json
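For reference, the relevant part of settings_example.json might look like the following (any fields beyond "API" are an assumption):

{
    "API": "your-api-key-here"
}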

Running the project

The project can be executed from a Python shell with

python3
>>> from analyzer import *

All methods in this module operate independently of each other, which means that the shell can safely be closed after executing any one of them. All intermediate results are stored in separate files within the data folder.

Breakdown of available features

query

The query method allows you to automatically send API calls to the New York Times Most Popular API. These results will be stored in a file called raw.json.
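Under the hood, this amounts to a GET request against the Most Popular API. A minimal sketch, assuming the v2 "most viewed" endpoint and the data/raw.json output path (assumptions, not the project's exact code):

import json
import requests

API_KEY = "your-api-key-here"  # the real project reads this from settings.json

# Most-viewed articles over the last 7 days; period and endpoint are assumptions.
URL = "https://api.nytimes.com/svc/mostpopular/v2/viewed/7.json"

response = requests.get(URL, params={"api-key": API_KEY})
response.raise_for_status()
with open("data/raw.json", "w") as f:
    json.dump(response.json(), f, indent=2)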

scrape_stories

The API responses contain only the article title and abstract, not the full story text. The scrape_stories method uses the beautifulsoup module to scrape the story content for all articles from the official New York Times website and stores it in a file called with_stories.json.
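A minimal sketch of fetching a single article's text (the CSS class below is an assumption; the Times' markup changes over time, and the project's actual selectors may differ):

import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/..."  # an article URL taken from raw.json
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

# Join the article body paragraphs into one story string.
paragraphs = soup.find_all("p", class_="story-content")
story = " ".join(p.get_text() for p in paragraphs)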

label_articles

label_articles provides a UI for manually labeling the scraped articles and was originally used to obtain training data for the sentiment classifiers. More information can be found in the method docstring. Results are stored in with_labels.json.
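Conceptually, the labeling step boils down to a console loop like this (a sketch only; the "title" and "label" field names are assumptions about the JSON structure):

import json

with open("data/with_stories.json") as f:
    articles = json.load(f)

# Show each article and record the sentiment label typed by the user.
for article in articles:
    print(article["title"])
    article["label"] = input("Sentiment (positive/negative/neutral)? ")

with open("data/with_labels.json", "w") as f:
    json.dump(articles, f, indent=2)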

train_model

This method trains two different sentiment classifiers using the sklearn library: a support vector machine and a multinomial Naive Bayes classifier. The documents can be vectorized using either a term frequency-inverse document frequency (tf-idf) vectorizer or a bag-of-words model. The resulting models are dumped into the /models/ folder using the Python object serialization module pickle and can easily be loaded again.
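A condensed sketch of this setup with illustrative data (not the project's exact code; the label encoding is an assumption):

import pickle
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents; the real ones come from with_labels.json.
docs = ["Stocks rally on upbeat earnings", "Storm causes widespread damage"]
labels = [1, 0]  # 1 = positive, 0 = negative

vect = CountVectorizer()   # bag of words; use TfidfVectorizer for tf-idf
X = vect.fit_transform(docs)

clf = svm.SVC()            # or sklearn.naive_bayes.MultinomialNB()
clf.fit(X, labels)

# Persist both the classifier and the fitted vocabulary, matching the
# files loaded in the Deployment section below.
pickle.dump(clf, open("models/svm.pkl", "wb"))
pickle.dump(vect.vocabulary_, open("models/count_vect.pkl", "wb"))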

analyze

analyze was originally used to compute sentiment with the TextBlob library. However, we eventually moved on to training our own classifiers. The method is included anyway in case it is useful to someone.
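For reference, TextBlob's built-in sentiment scoring looks like this:

from textblob import TextBlob

# polarity is a float in [-1.0, 1.0]; negative values indicate negative sentiment.
blob = TextBlob("The new policy was welcomed enthusiastically.")
print(blob.sentiment.polarity)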

visualize

The visualize method plots the results using matplotlib.
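What exactly gets plotted depends on the analysis; a minimal matplotlib sketch with made-up data (the actual output of visualize may differ):

import matplotlib.pyplot as plt

# Illustrative data only.
months = ["2018-01", "2018-02", "2018-03"]
mean_sentiment = [0.2, -0.1, 0.4]

plt.plot(months, mean_sentiment, marker="o")
plt.xlabel("Month")
plt.ylabel("Mean sentiment")
plt.title("News sentiment over time")
plt.show()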

Executing the sample script

The methods described above are meant to be used in combination with each other. An example of this is found in the main file, which can be executed using

python3 news-analysis

from the directory containing the project root, or

python3 .

from inside the root directory.

Deployment

The following snippet shows how you might use the trained SVM classifier on your own documents:

python3
>>> import pickle
>>> from sklearn import svm
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> clf = pickle.load(open('models/svm.pkl', 'rb'))
>>> vect = CountVectorizer(vocabulary=pickle.load(open('models/count_vect.pkl', 'rb')))
>>> tf = vect.transform(['I love python!'])  # vocabulary is fixed, so transform suffices
>>> clf.predict(tf)

Further usage instructions for the sklearn classifiers can be found in the scikit-learn documentation on support vector machines (https://scikit-learn.org/stable/modules/svm.html) and Naive Bayes (https://scikit-learn.org/stable/modules/naive_bayes.html).

License

This project is licensed under the MIT License - see the LICENSE file for details.
