Initial hypothesis: people retweet according to the emotion they feel.
- Feature Engineering
  - Statuses count:
    - Data distribution: see status-boxplot.png and status-boxplot-outliers.png
    - Scaling: no, too many outliers
    - Clipping: considered at first (assuming only a small number of extreme outliers), but experiments show it is not a good idea, since there are actually many extreme outliers (see the second local maximum)
    - Log scaling: yes; log scaling is helpful when a handful of values have many points while most other values have few points (see the sketch after this list)
  - Followers count:
    - Log scaling for the same reasons
    - Source on dealing with 0 values: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3444996 ; fix: add 1 before taking the log
    - Drastically increased the correlation with retweet count
  - Verified: very small correlation, but still more than the other features
  - Hashtags: no correlation; maybe analyse the hashtag text instead?
  - URLs: removed, no correlation, not even the number of URLs
  - Mentions: no one mentions anyone
  - Statuses count and links seem fairly correlated with each other, which is a problem; statuses count was supposed to be removed since it correlates less with the target, but performance decreased, so it is finally kept
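A minimal sketch of the log scaling described above, assuming pandas-style column names ("statuses_count", "followers_count", "retweet_count") that stand in for the real dataset:

    import numpy as np
    import pandas as pd

    # Toy data standing in for the real tweet dataset (column names are assumptions)
    df = pd.DataFrame({
        "statuses_count": [12, 340, 5, 120000, 0],
        "followers_count": [3, 150, 0, 2500000, 40],
        "retweet_count":   [0, 2, 0, 5400, 1],
    })

    def add_log_features(frame: pd.DataFrame) -> pd.DataFrame:
        out = frame.copy()
        for col in ("statuses_count", "followers_count"):
            # log(x + 1): the "+ 1" keeps zero values defined (cf. the SSRN reference above)
            out["log_" + col] = np.log1p(out[col])
        return out

    df = add_log_features(df)
    # Compare each feature's correlation with the target before/after the transform
    print(df.corrwith(df["retweet_count"]))

On the real data this is where the correlation increase noted above should show up.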
- Feature Creation:
  - Text polarity:
    - Source: https://www.sciencedirect.com/science/article/pii/S0096300315011145?casa_token=2_G8sxdXbasAAAAA:nceUyP7uU1tfU8Bc6j5uHrndQ6zCzlhIICGGarwaRqVGCzq0tcTPQua6y1Yj_BRqONXodh9WXP2u
      (explains the challenges; reports about 80% classification accuracy; published in 2015)
  - Sentiment value:
    - TextBlob: no support for the French language
    - vaderSentiment-fr instead (see the sketch below)
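A minimal sketch of the sentiment feature. It uses the SentimentIntensityAnalyzer interface of the original (English) vaderSentiment package, which the French port vaderSentiment-fr is assumed to mirror; the exact import path of the fork should be checked against its documentation.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def sentiment_features(text: str) -> dict:
        # polarity_scores returns neg/neu/pos shares plus a normalized 'compound' score in [-1, 1]
        scores = analyzer.polarity_scores(text)
        return {"polarity": scores["compound"], "positivity": scores["pos"]}

    print(sentiment_features("Breaking: great news for everyone!"))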
- Baseline Model Finetuning
  - BGR:
    - Use HistGradientBoostingRegressor instead (much faster on large training sets)
    - 6.5 error score
  - RFR:
    - 7.27 error score
  - Linear Regression:
    - 11.23 error score
  - Gaussian Process Regression (seems to fit our feature distribution best):
    - Memory error: tried to allocate ~450 GB of RAM (the kernel matrix grows quadratically with the number of samples)
  - Logistic Regression: 13.3 error (very long to train)
  - Bayesian Regression: 11.16 error
  - XGBoost: 7.5 error
  - Scientific literature to back up the choices
Conclusions: investigate further the differences between BGR, XGBoost, Logistic Regression and RFR; eliminate the Linear Regression model (a comparison sketch follows below).
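A minimal sketch of the baseline comparison, assuming the error score is a cross-validated mean absolute error on the retweet count; X and y are placeholders for the engineered features and the target, and XGBRegressor would plug into the same loop:

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import LinearRegression, BayesianRidge
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))          # placeholder for log_followers, polarity, ...
    y = np.abs(rng.normal(size=500)) * 10  # placeholder for retweet counts

    models = {
        "HistGB":   HistGradientBoostingRegressor(),
        "RFR":      RandomForestRegressor(n_estimators=100),
        "LinearR":  LinearRegression(),
        "Bayesian": BayesianRidge(),
    }

    for name, model in models.items():
        # 5-fold cross-validated mean absolute error (lower is better)
        mae = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(name, round(mae, 2))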
Why do people retweet?
- https://www.jstage.jst.go.jp/article/psysoc/58/4/58_189/_pdf/-char/ja
- https://ojs.aaai.org/index.php/ICWSM/article/view/14110
More informative tweets get retweeted, which changes our initial assumption; this is confirmed by our experiment (cf. polarity correlated with retweets).