Joey Livorno | [email protected] | 2.6.2020
This is my first project log of the term. Not much to report yet, but I'm excited to make some progress.
I've finished setting up my project repository:
- created README.md
- created LICENSE.md
- created project_report.md
- created project_plan.md and laid out a detailed plan of attack
- created .gitignore
This is my first actual progress report. In this section of my project, I have accomplished much of what is encompassed by the Basic Data Processing category of Homework 2. First, I downloaded a .csv copy of Donald Trump's Twitter feed from the Trump Twitter Archive, a website created by Brendan Brown that tracks the tweets of Trump and many other prominent politicians. I then created my Jupyter notebook and entitled it Project Code. I used the usual libraries (numpy, pandas, etc.), and I also included a useful library I found called TextBlob. This library can be used for many Natural Language Processing tasks; I will be using it to perform quick sentiment analyses on the text of each tweet. I also imported two dictionaries from a snippet of code I found on GitHub user NeelShah18's page. I emailed Neel asking if it was okay to use his code, and he allowed it!
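A minimal sketch of the notebook setup, just to show how TextBlob's sentiment scores are used; the sample string is purely illustrative and not from the dataset:

```python
# Core libraries used throughout the notebook.
import numpy as np
import pandas as pd
from textblob import TextBlob

# Quick sanity check on TextBlob's sentiment scores:
# polarity is in [-1, 1], subjectivity is in [0, 1].
sample = TextBlob("This is a tremendous, beautiful day for America!")
print(sample.sentiment.polarity, sample.sentiment.subjectivity)
```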
The next step was to begin the actual data manipulation. The first thing I did was read the .csv file into a dataframe, and then I printed a preview to see what I was working with. I saw that each row contained a source, the tweet itself, the date and time it was posted, the number of retweets and favorites, and whether or not it was a retweet. This was a great place to start, but the tweets needed a bit of cleaning to be completely workable. First, I attempted to replace the emojis with text using Neel's code, but I ran into an error that I have not yet solved. For now, this can stay; I will attempt to make this code work for my purposes, but I may need to find a new library. Second, I added two columns to each row: polarity and subjectivity. I got these values from TextBlob for each tweet through the use of a lambda function, and then I joined the columns to the existing dataframe. Finally, I needed a way to group my tweets together chronologically, so I figured that extracting the year from the timestamp and making 'year' its own column was the best way to accomplish this. I achieved this by converting the timestamp to a Python timestamp datatype, isolating the year value, and adding these values to the dataframe.
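Roughly, this processing looks like the sketch below. The file path and the column names ('text', 'created_at') are assumptions on my part here; the actual .csv header from the archive may differ:

```python
import pandas as pd
from textblob import TextBlob

# Read the archive export into a dataframe and preview it.
tweets = pd.read_csv('data/trump_tweets.csv')
print(tweets.head())

# Sentiment columns via a lambda over the tweet text.
tweets['polarity'] = tweets['text'].apply(lambda t: TextBlob(str(t)).sentiment.polarity)
tweets['subjectivity'] = tweets['text'].apply(lambda t: TextBlob(str(t)).sentiment.subjectivity)

# Convert the timestamp and pull out the year for chronological grouping.
tweets['created_at'] = pd.to_datetime(tweets['created_at'])
tweets['year'] = tweets['created_at'].dt.year
```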
The last step in my initial data processing, now that the data was workable as a whole, was to divide the tweets into subcategories based on political topics. Since foreign affairs has been a hot-button issue recently, I decided to begin by investigating these topics, though this is subject to change as my project evolves. I did this by extracting tweets that contain keywords associated with certain issues. For example, the subset 'nkorea' is made up of tweets that contain words such as 'North Korea,' 'Pyongyang,' and 'Jong-Un.' These searches left me with a fair bit of Twitter data, especially for an investigation of a single person. My fear, though, is that I do not have enough data. In the future, my challenge will be to do one of two things: 1) find new ways to include a larger subset of tweets pertaining to certain issues (either through more keywords or in more creative ways), or 2) choose new issues that have a larger volume of data to work with.
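A sketch of how the keyword subsets are built; aside from the 'nkorea' keywords mentioned above, the keyword lists here are illustrative placeholders rather than the exact ones in the notebook:

```python
# Keyword lists per topic (only 'nkorea' matches the report; the rest are examples).
topics = {
    'nkorea': ['north korea', 'pyongyang', 'jong-un'],
    'russia': ['russia', 'putin', 'kremlin'],
    'iran':   ['iran', 'tehran', 'rouhani'],
}

subsets = {}
for name, keywords in topics.items():
    pattern = '|'.join(keywords)
    # case=False so 'North Korea' and 'north korea' both match.
    mask = tweets['text'].str.contains(pattern, case=False, na=False)
    subsets[name] = tweets[mask]
    print(name, len(subsets[name]))
```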
This is my second project report. In this report, I will discuss the final steps I took to prepare my data for analysis, the analyses I have begun, the license I have chosen, and how I plan to share the data.
Much of the processing had already been completed in the first section of my report, though I made some key changes in this one. First, I added a new field that holds the tokenized and lowercased form of the tweet content. This will allow me to find keywords much more accurately. After this, I changed the subsets to only search for these lowercased forms of words, and I ended up finding more matches. Finally, I printed some basic statistics of the data like size and shape, and then pickled the final form of the data.
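A sketch of this step, assuming a simple lowercase-and-split tokenizer (the notebook may well use a proper tokenizer such as TextBlob's `.words`) and a hypothetical output path:

```python
# Tokenized, lowercased form of each tweet for more reliable keyword matching.
tweets['tokens'] = tweets['text'].apply(lambda t: str(t).lower().split())

# Basic size/shape statistics, then persist the processed frame.
print(tweets.shape)
tweets.to_pickle('data/trump_tweets_processed.pkl')
```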
The first section of analysis in my code is for exploratory purposes. I retrieved the statistics about various points in the data such as retweets, favorites, and the volumes of tweets per year, and I used charts to visualize my findings.
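Continuing from the earlier sketches, the exploratory section looks roughly like this; the engagement column names are assumptions:

```python
import matplotlib.pyplot as plt

# Summary statistics for engagement columns (names assumed).
print(tweets[['retweet_count', 'favorite_count']].describe())

# Tweet volume per year as a bar chart.
tweets['year'].value_counts().sort_index().plot(kind='bar')
plt.ylabel('Number of tweets')
plt.show()
```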
The second section was a more linguistic analysis of the data. I grouped the tweets by year and then found the average polarity of the tweets in each group. Using a bar graph, I could make some preliminary judgements regarding the sentiment of Trump's tweets, specifically that they have been trending negative in recent years. Next, I did the same for the subsets of the tweets (Russia, Iran, North Korea). Here, I found that the sentiment of a given year's tweets was consistent with political events from that time. For example, I found that Trump tweeted rather positively about Russia while Obama was in office and negatively about it during his own term, but only when it suited him. Specifically, I found that in 2019, the same year the Mueller Report was released, the sentiment of Trump's tweets toward Russia drastically increased. Very interesting to say the least...
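A sketch of the per-year sentiment grouping, reusing the `tweets` and `subsets` frames from the earlier sketches:

```python
import matplotlib.pyplot as plt

# Average polarity per year for the full set...
tweets.groupby('year')['polarity'].mean().plot(kind='bar')
plt.ylabel('Mean polarity')
plt.show()

# ...and for each topic subset, e.g. the Russia tweets.
subsets['russia'].groupby('year')['polarity'].mean().plot(kind='bar')
plt.ylabel('Mean polarity (Russia tweets)')
plt.show()
```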
For my license, I used a GNU General Public License. I chose this license because NeelShah18's license requires that I use the same one if I include his code in my project.
As far as my actual data goes, the Trump Twitter Archive is free to use (and it doesn't seem to have a license anywhere), and I give credit to its founder Brendan Brown throughout my code documentation. Because of this, I am justified in creating the derivative dataset in which I include polarity, subjectivity, tokenized content, etc.
To share my data, I have included a copy of the original .csv file in my 'data/' directory, along with a pickled form of the derivative dataset I created. My license specifies that anyone will be allowed to use my data so long as they state any changes made, disclose the source, and use the same license.
This is my third project report. In this report, I will discuss the final steps I made toward completing the data collection/computational efforts, as well as the analysis I've done. I've made changes to my code notebook in the existing format, so all changes and prior work can be found there. Additionally, the new data I've gathered can be found in the data directory.
While I had already completed the data gathering/cleaning portion of the project for its original purpose, I felt that further depth was needed, so I decided to include a section where machine learning was implemented. Specifically, I compared Trump's tweets to a collection of random tweets from random users over a 48-hour period from April 14, 2016 to April 16, 2016. The dataset was free to use and can be found at Followthehashtag, so a big thank you to them.
Once I had obtained this data, I imported it into a dataframe, but only included the same number of lines as were present in the set of Trump tweets, as having an uneven number of tweets from each set would skew the data. I then tagged each tweet as either T or Nt, indicating Trump or Not Trump. After this, I stripped the dataframes down to just the tweet content and these tags, then combined them into one dataframe.
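A sketch of this preparation step; the random-tweet file path and its 'Tweet content' column name are assumptions about the Followthehashtag export, and whether the notebook takes the first n rows or a random sample isn't recorded here:

```python
import pandas as pd

# Random-user tweets from the Followthehashtag export (column name assumed).
random_tweets = pd.read_csv('data/random_tweets.csv')

# Keep both sets the same size so neither class dominates.
n = len(tweets)
random_tweets = random_tweets.sample(n=n, random_state=42)

# Tag each tweet, keep only the text and the tag, then combine.
trump_df = pd.DataFrame({'text': tweets['text'], 'label': 'T'})
other_df = pd.DataFrame({'text': random_tweets['Tweet content'], 'label': 'Nt'})
combined = pd.concat([trump_df, other_df], ignore_index=True)
```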
I added a new section to my code entitled Machine Learning, and here is where I began the computational effort of building the model and making predictions. I decided that I would use a strategy similar to the one I used in Homework 4. After I learned more about GridSearchCV, I realized two things: 1) it is actually pretty easy to implement, and 2) it is even more useful than I expected.
First, I created my pipeline, which consisted of a TfidfVectorizer and a Multinomial Naive Bayes classifier (the model I am most comfortable with). I then defined the parameters that I would test with GridSearchCV. I started with max_features values of 1500, 3000, and 4500 and alpha values of .01 and .001, but later added a max_features value of 9000. This results in a total of eight parameter combinations. Once this was done, I fit the GridSearchCV to my data and printed my predictions.
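A sketch of the pipeline and grid search as described; the train/test split and the cv setting are assumptions not recorded in this log:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, train_test_split

# Pipeline: TF-IDF features feeding a Multinomial Naive Bayes classifier.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

# 4 max_features values x 2 alpha values = 8 parameter combinations.
param_grid = {
    'tfidf__max_features': [1500, 3000, 4500, 9000],
    'nb__alpha': [0.01, 0.001],
}

# Hold out part of the combined Trump / Not-Trump dataframe for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    combined['text'], combined['label'], test_size=0.2, random_state=42)

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.score(X_test, y_test))
```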
Because I was already familiar with the first two steps, I spent the bulk of this iteration of my project beefing up analyses I had already made, analyzing my new machine learning results, and adding a conclusion section where I drew some final conclusions for the entirety of the project. I won't go into too much depth on those since they can be read here, but other than that I think this iteration was very productive. In the final report, I would like to finalize my analyses and support them with some more real-world causes for Trump's Twitter behavior.