This project aims to allow a computer program to predict the sentiment, positive or negative, of tweets using machine learning techniques. Sentiment is defined as an attitude, emotion, or feeling. Since Twitter is filled with “tweets” authored by numerous individuals “tweeting” about a variety of events, the program’s job is to estimate the probability that those individuals feel positively or negatively about certain things. By analyzing these results, we can gauge how Twitter users generally feel about a recent event, a particular person, and more.
Using Twitter’s API, our program can access the most recent tweets. In our first run-through, the input was the last 25 to 100 tweets on our home timeline. These tweets came from the users we follow, so their topics were extremely diverse. By tokenizing these tweets, finding the most frequently used terms, and eliminating other words from the file, we are left with words and phrases that are often associated with opinions. By matching these opinionated words and phrases against an opinion lexicon file, we can gauge Twitter users’ emotions about certain topics and entities. In our second run-through, we filtered the input by a given hashtag, which means we can measure the sentiment around a specific hashtag through the same process. Our output gives semantic-orientation scores, which show which good and bad terms are popping up in our timeline and in what context. This type of sentiment detector would be highly valuable to a business attempting to gauge the reaction to a product it has just released, or to a campaign manager attempting to gather information on how voters feel about a candidate.
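For illustration, the tweet-collection step might look like the following sketch using the tweepy library (the credentials are placeholders and the helper name is ours; the project’s actual TwitterDataMining.py may differ):

    import json
    import tweepy  # a common Python wrapper around Twitter's API

    # Placeholder credentials from Twitter's developer portal.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)

    def collect_tweets(filename, hashtag=None, count=100):
        """Write the most recent tweets to a JSON-lines file.

        With no hashtag, pull from the home timeline (first run-through);
        with a hashtag, filter through the search endpoint (second run-through).
        """
        if hashtag is None:
            tweets = api.home_timeline(count=count)
        else:
            tweets = api.search(q=hashtag, count=count)
        with open(filename, "w") as f:
            for tweet in tweets:
                f.write(json.dumps(tweet._json) + "\n")

    collect_tweets("tweets.json", hashtag="#california")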
Our Twitter Sentiment Detector project is composed of two files: TwitterDataMining.py and TweetPreProcessing.py. In TwitterDataMining.py, we use Twitter’s API to access tweets and write the most recent ones to a JSON file. TweetPreProcessing.py compiles a list of the words and characters to ignore in the file, called stop words. We then build a co-occurrence matrix, which counts the number of times pairs of terms occur together throughout the JSON file. This is important because certain opinions and emotions are expressed through multiple words rather than just one. Finally, the program calculates the Pointwise Mutual Information (PMI) of each word or co-occurring pair to test whether it is “good” or “bad.” PMI measures how closely a word is associated with the good and bad terms, based on how many times it occurs in the file and in what context; in effect, it compares the words in the tweets to the opinion lexicon in our project file. The semantic orientation is then calculated to measure how much more often a word appears with the positive words from the lexicon than with the negative ones. If a word appears more frequently with positive words, the output score is positive; otherwise, it is negative.
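As a rough sketch, the PMI and semantic-orientation calculation might look like the following (the function and variable names are ours, and this simplified version assumes the tweets have already been tokenized and stripped of stop words):

    import math
    from collections import Counter, defaultdict

    def semantic_orientation(tweets, positive_words, negative_words):
        """Score each term by how much more often it co-occurs with
        positive lexicon words than with negative ones."""
        n_docs = len(tweets)
        term_count = Counter()       # number of tweets each term appears in
        cooc = defaultdict(Counter)  # co-occurrence matrix
        for tokens in tweets:
            unique = set(tokens)
            term_count.update(unique)
            for t1 in unique:
                for t2 in unique:
                    if t1 != t2:
                        cooc[t1][t2] += 1

        def pmi(t1, t2):
            # PMI(t1, t2) = log2( P(t1, t2) / (P(t1) * P(t2)) )
            joint = cooc[t1][t2] / n_docs
            if joint == 0:
                return 0.0
            return math.log2(joint / ((term_count[t1] / n_docs)
                                      * (term_count[t2] / n_docs)))

        scores = {}
        for term in term_count:
            pos = sum(pmi(term, w) for w in positive_words if w in term_count)
            neg = sum(pmi(term, w) for w in negative_words if w in term_count)
            scores[term] = pos - neg  # semantic orientation of the term
        return scores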
“Firefighters are fighting to save homes and lives in California. Sending love and support to them and those affected.” – Ellen DeGeneres, 9:45 AM – 5 Dec 2017
From this example tweet, our program would parse each word and find terms such as “fighting,” “save,” “love,” and “support,” assigning each a positive or negative score. In this case, “fighting” would receive a negative score, while “save,” “love,” and “support” would receive positive ones. To make the project more applicable, we filtered our input through a single hashtag. For example, if this tweet carried “#california”, then our sentiment detector would pick up that there are both good and bad associations with that hashtag because of the wildfires. This is helpful for finding Twitter users’ attitude toward one specific topic.
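To make the matching step concrete, here is a hypothetical illustration of how the example tweet’s tokens could be checked against the lexicon (the two small word lists below are stand-ins for the real lexicon files):

    positive_lexicon = {"save", "love", "support"}
    negative_lexicon = {"fighting"}

    tweet = ("Firefighters are fighting to save homes and lives in California. "
             "Sending love and support to them and those affected.")

    tokens = [w.strip(".,").lower() for w in tweet.split()]
    hits = [(w, "positive" if w in positive_lexicon else "negative")
            for w in tokens if w in positive_lexicon | negative_lexicon]
    print(hits)
    # [('fighting', 'negative'), ('save', 'positive'),
    #  ('love', 'positive'), ('support', 'positive')]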
Our project has had several setbacks, including finding the correct formatting and struggling to filter out irrelevant data. Since we are using a JSON file, we had to learn how to format it so it could be read into our code. We also had issues building our list of stop words, as we were unsure which dictionaries of words were available. As a result, we went through the file ourselves and chose the words we did not want to be picked up. There is certainly a more accurate way to solve this problem, and it is something we would work on in the future (see the sketch below). Furthermore, we used unsupervised learning in our approach, and it could be argued that a supervised approach might be more effective; for example, we could have trained on words labeled positive or negative, and perhaps the results would have been more accurate. Also, since we had not officially covered unsupervised learning in class, we spent a considerable amount of time just trying to understand what it meant.
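One readily available alternative to hand-picking stop words is NLTK’s standard English stop-word corpus, extended with Twitter-specific noise; a possible sketch of that replacement (not what the project currently does):

    import string
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")  # fetch the corpus on first use

    # Standard English stop words, plus punctuation and Twitter-specific extras.
    stop_words = set(stopwords.words("english"))
    stop_words.update(string.punctuation)
    stop_words.update(["rt", "via", "i'm"])  # hand-picked additions

    def remove_stop_words(tokens):
        return [t for t in tokens if t.lower() not in stop_words]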
The results of our project show that our calculation of sentiment is generally correct. By filtering Twitter for a specific hashtag, we were able to obtain results that showed how Twitter users generally felt about it. For example, we queried the hashtag “#Trump,” and the program output a -9.70 rating, indicating that tweets about Trump are generally negative. It also outputs evidence for this by listing the positive and negative words associated with the hashtag. For example, some positive words associated with this hashtag are “Lead” with a rating of 6.6 and “decisions” with a rating of 13.3, while some negative words found to associate with “#Trump” include “Dems” (-13.3) and “capital” (-21.6).
While our results come out as generally accurate, we still have trouble with unnecessary words appearing in them. For example, for “#Trump,” the word “I’m” came out as positive with a 6.6 rating. This word should have been eliminated by our stop words, since it is a pronoun contraction, but it slipped through. We also have not yet found a way to eliminate URLs to images and videos from our results; these commonly pop up with happy hashtags, but they do not reflect how a user feels from a sentiment perspective. In the future, we hope to improve upon these issues, perhaps starting with a URL filter like the one sketched below.
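One plausible fix for the URL problem, which we have not yet implemented, is to strip links from each tweet’s text with a regular expression before tokenizing; a minimal sketch:

    import re

    # Matches http(s) links and bare www links; a heuristic, not exhaustive.
    URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

    def strip_urls(text):
        return URL_PATTERN.sub("", text)

    print(strip_urls("So happy! https://t.co/abc123 #blessed"))
    # -> "So happy!  #blessed"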