Research
A statistical study to determine the positivity of online news content.
In the paper below we researched a way to rate a text's positivity based on the words it contains. We created a formula that estimates whether a text is positive. The paper describes how we arrived at this formula and for which kinds of text it is useful.
For our Happy News application we need to filter online content by its positivity. We get our sources from Newsapi.org [1] and extract the relevant content using the Mercury API [2]. The resulting texts then have to be filtered on positivity, since we only want to show positive content to our end users. To accomplish this, we want to devise a way to rate a text based on its individual words and their sentiment value.
To filter a text on its sentiment or positivity, we look at each individual word in the text and check whether it is positive or negative. For this we gathered two word lists [3] [4]. These word lists are used for sentiment analysis of tweets, so that companies can see the overall sentiment towards their product or service. We use them to rate each word in our text as either positive or negative. This tells us whether a text contains more positive than negative words, which gives a first indication of whether the text is positive. One problem with this approach is double negatives such as “lower unemployment”, which count as two negative words even though the phrase should be perceived as positive.
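The word-counting step described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `count_sentiment_words`, the regular-expression tokenizer and the two tiny word sets are all assumptions made for the example. The real word lists are far larger.

```python
import re

# Tiny illustrative word sets (assumed for this example only);
# the word lists used in the study are much larger.
POSITIVE_WORDS = {"good", "happy", "win", "growth"}
NEGATIVE_WORDS = {"bad", "lower", "unemployment", "crisis"}


def count_sentiment_words(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (n_pos, n_neg): how many words of the text occur in each list."""
    # Lowercase the text and split it into simple alphabetic tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    n_pos = sum(1 for token in tokens if token in positive)
    n_neg = sum(1 for token in tokens if token in negative)
    return n_pos, n_neg


# The double-negative problem: both words count as negative,
# although the phrase as a whole is good news.
print(count_sentiment_words("lower unemployment"))  # (0, 2)
```

Because every word is looked up independently, the counts carry no information about word order or context, which is exactly why phrases like “lower unemployment” are misjudged.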
To see whether a text is really positive, we need to give positive and negative words some sort of weight. This lets us capture the general feeling of the text instead of just counting words. We took a sample of 120 news posts gathered by our own web crawler. These posts came from 12 different news sources and covered very different themes (e.g. sports, politics, economics and entertainment). Our application analyzed the posts one by one, which gave us the number of positive and negative words in each text. We put these values in a scatter plot, with x being the number of negative words and y the number of positive words.
We then added a linear regression line to this plot. This gave us an indication of when to expect a text to be positive or negative. We used this scatter plot to create the following formula, which for our purposes shows whether or not a text is positive.
Positivity ➝ 0.7203 * nPos - nNeg > 3
nPos is the number of positive words in the text and nNeg is the number of negative words in the text. The results of our initial analysis can be found in appendix 1.
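As a worked example, the formula can be applied to a pair of word counts as in the sketch below. The coefficient 0.7203 and the threshold 3 are taken from the formula above. The function and constant names are hypothetical and chosen only for illustration.

```python
WEIGHT_POSITIVE = 0.7203  # weight of positive words, from the formula above
THRESHOLD = 3             # a score above this value counts as positive


def positivity_score(n_pos, n_neg):
    """Left-hand side of the formula: 0.7203 * nPos - nNeg."""
    return WEIGHT_POSITIVE * n_pos - n_neg


def is_positive(n_pos, n_neg):
    """True when the score exceeds the threshold of 3."""
    return positivity_score(n_pos, n_neg) > THRESHOLD


# 10 positive and 3 negative words: 7.203 - 3 = 4.203, which is above 3.
print(is_positive(10, 3))  # True
# 5 positive and 2 negative words: 3.6015 - 2 = 1.6015, which is not above 3.
print(is_positive(5, 2))   # False
```

Note that a text needs at least five positive words to pass the threshold even when it contains no negative words at all, since 0.7203 * 4 is still below 3.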
We have created a rudimentary way of telling whether or not a text is positive. Our sentiment-based analysis is usable to some degree for most news sources. Posts such as sports reports and commercials are mostly rated as very positive. This is, however, not the content we want to push to our users, so such sources are better left out. For the first implementation of this algorithm the formula is sufficient and serves our needs.
As discussed, double negatives and the overall tone of a text are not taken into account by this formula. Real natural language processing is outside the scope of this project, but it might be a good follow-up to refine the process of analyzing positivity in texts. In addition, texts that contain more words from our word lists receive more volatile scores: a large number of matched words usually results in a very high or very low score. We might want to add some form of mapping to create a definite scale on which to place the positivity of a text.
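One possible form of such a mapping is sketched below. It is purely an assumption for illustration and not part of the study: the score is divided by the number of matched words, so that long texts no longer produce extreme values, and the result is rescaled to the range 0 to 1. This changes the meaning of the score, so the threshold of 3 no longer applies and a new cut-off would have to be determined from the data.

```python
WEIGHT_POSITIVE = 0.7203  # weight of positive words, from the formula


def scaled_positivity(n_pos, n_neg):
    """Map word counts to a score between 0 (all negative) and 1 (all positive).

    Returns None when the text contains no words from either list,
    because no score can be computed in that case.
    """
    total = n_pos + n_neg
    if total == 0:
        return None
    raw = WEIGHT_POSITIVE * n_pos - n_neg
    # Per matched word, the score lies between -1 and WEIGHT_POSITIVE.
    per_word = raw / total
    # Shift and stretch that range onto 0..1.
    return (per_word + 1) / (WEIGHT_POSITIVE + 1)


print(scaled_positivity(0, 5))  # 0.0
```

With this scaling, a text with 10 matched words and a text with 100 matched words in the same proportion receive the same score.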
2 https://mercury.postlight.com/web-parser/
Determining the positivity of online texts by means of semantic analysis.pdf