Easiest to read on GitHub: https://github.com/WeersProductions/resolverflow/blob/master/dataFramePreviews.md

Data previews for the following tables:
- Badges
- Comments
- PostHistory
- PostLinks
- Posts
- Tags
- Users
- Votes
- output_stackoverflow
- correlation
- decision tree (regressor)
You can find the database schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede
>>> df = spark.read.parquet("/user/***REMOVED***/StackOverflow/Badges.parquet")
>>> df.show()
+------+--------------------+-------+----------------+---------+-------+
|_Class| _Date| _Id| _Name|_TagBased|_UserId|
+------+--------------------+-------+----------------+---------+-------+
| 2|2011-02-02T17:42:...|1400008|Notable Question| false| 3488|
| 2|2011-02-02T17:42:...|1400009| Guru| false| 353410|
| 3|2011-02-02T17:47:...|1400010| Student| false| 599272|
| 3|2011-02-02T17:47:...|1400011| Student| false| 557399|
| 3|2011-02-02T17:47:...|1400012| Student| false| 600168|
| 3|2011-02-02T17:47:...|1400013| Editor| false| 564626|
| 3|2011-02-02T17:47:...|1400014| Autobiographer| false| 296093|
| 3|2011-02-02T17:47:...|1400016| Scholar| false| 410592|
| 3|2011-02-02T17:47:...|1400017| Scholar| false| 463428|
| 3|2011-02-02T17:47:...|1400018| Nice Question| false| 32484|
| 3|2011-02-02T17:47:...|1400019|Popular Question| false| 1512|
| 3|2011-02-02T17:47:...|1400020|Popular Question| false| 7453|
| 3|2011-02-02T17:47:...|1400021|Popular Question| false| 256239|
| 3|2011-02-02T17:47:...|1400022|Popular Question| false| 326284|
| 3|2011-02-02T17:47:...|1400023|Popular Question| false| 22215|
| 3|2011-02-02T17:47:...|1400024| Tumbleweed| false| 591013|
| 3|2011-02-02T17:47:...|1400025| Tumbleweed| false| 591011|
| 3|2011-02-02T17:52:...|1400026| Student| false| 543649|
| 3|2011-02-02T17:52:...|1400027| Student| false| 553638|
| 3|2011-02-02T17:52:...|1400028| Student| false| 155689|
+------+--------------------+-------+----------------+---------+-------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Comments.parquet").show()
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+
|_ContentLicense| _CreationDate| _Id|_PostId|_Score| _Text|_UserDisplayName|_UserId|
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+
| CC BY-SA 2.5|2010-07-02T21:17:...|3260274|3168284| 0|So, for each pati...| null| 32632|
| CC BY-SA 2.5|2010-07-02T21:17:...|3260275|3167981| 0|Another alternati...| null| 29639|
| CC BY-SA 2.5|2010-07-02T21:17:...|3260276|3169136| 3|And no info in Bo...| null| 155356|
| CC BY-SA 2.5|2010-07-02T21:17:...|3260277|3128496| 0|One thing to reme...| null| 89761|
| CC BY-SA 2.5|2010-07-02T21:17:...|3260279|3169068| 0|Not really relate...| null| 74757|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260281|3168944| 0|javascript:var%20...| null| 281705|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260282|3168667| 0|Yeah. But I think...| null| 243943|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260283| 235067| 0|@Jon Skeet: Maybe...| null| 135172|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260284|3161146| 0|This is a duplica...| null| 340221|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260285|3161593| 0|Is the first_two_...| null| 305555|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260286|3169115| 0|Isn't there an "E...| null| 17028|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260287|3169126| 6|If your code is b...| null| 189416|
| CC BY-SA 2.5|2010-07-02T21:18:...|3260288|3169146| 2|Without code to e...| null| 91768|
| CC BY-SA 2.5|2010-07-02T21:19:...|3260289|3168175| 0|This is very nice...| null| 312026|
| CC BY-SA 2.5|2010-07-02T21:19:...|3260290|3167625| 0|I was able to fix...| null| 87154|
| CC BY-SA 2.5|2010-07-02T21:19:...|3260291|3163175| 0|The first solutio...| null| 381675|
| CC BY-SA 2.5|2010-07-02T21:19:...|3260292|3168299| 0|I think you can i...| null| 69998|
| CC BY-SA 2.5|2010-07-02T21:20:...|3260293| 204506| 0|@David Thornley -...| null| 1737|
| CC BY-SA 2.5|2010-07-02T21:20:...|3260294|3169159| 0|Is there somethin...| null| 356292|
| CC BY-SA 2.5|2010-07-02T21:20:...|3260295|3162018| 0|Okay what can i d...| null| 365706|
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/PostHistory.parquet").show()
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+
| _Comment|_ContentLicense| _CreationDate| _Id|_PostHistoryTypeId| _PostId| _RevisionGUID| _Text|_UserDisplayName|_UserId|
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519365| 5| 2613348|6499a078-4532-4d2...|When I went to Un...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519366| 5| 2615297|26b49208-3be4-4eb...|If you have some ...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519367| 5| 2619007|bc975e70-a69a-4f3...|How to create a s...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519368| 5| 2617924|d529d0f4-15ed-4f1...|Yes - don't forge...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519369| 5| 2617787|fb7ed36e-ee2b-42a...|Instead of nested...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519370| 5| 2617507|bc1093c6-d4ca-435...|You said: > ... ...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519371| 5| 2617212|578031e6-c28a-4ff...|I am referred to ...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519372| 5| 2617119|54556aab-9ea2-411...|i want get data f...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519373| 5| 2616985|a600c321-4ab0-4c7...|Here is the issue...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519374| 5| 2616839|0fb6470a-ca94-4d6...|The Wikipedia [Gr...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519375| 5| 2616320|5db679ca-da0c-4ad...|I'm trying to cre...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519376| 5| 2616314|301771e4-67a3-48c...|When defining you...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519377| 5| 2616195|f81da27b-4153-49b...|This must be doab...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519378| 5| 2618517|9ef29b6a-c1c3-432...|I'm racking my br...| null| -1|
| null| CC BY-SA 4.0|2020-06-20T09:45:...|224519379| 2|62484164|11cc8566-d631-40a...|Your client code ...| null| 256196|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519385| 5| 2622179|1797efdd-e2fb-43e...|Sorry for answeri...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519386| 5| 2621679|bc43b8a5-5d21-418...|With respect ther...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519387| 5| 2621634|fcae8d6f-3af5-432...|From [Statement j...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519388| 5| 2621174|59bb584d-129a-47c...|The general recur...| null| -1|
|Commonmark migration| CC BY-SA 4.0|2020-06-20T09:12:...|224519389| 5| 2621136|6f5e75ba-3413-4dc...|I have compiled t...| null| -1|
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/PostLinks.parquet").show()
+--------------------+---+-----------+-------+--------------+
| _CreationDate|_Id|_LinkTypeId|_PostId|_RelatedPostId|
+--------------------+---+-----------+-------+--------------+
|2010-04-26T02:59:...| 19| 1| 109| 32412|
|2010-04-26T02:59:...| 37| 1| 1970| 617600|
|2010-04-26T02:59:...| 42| 1| 2154| 2451138|
|2010-04-26T02:59:...| 52| 1| 2572| 209329|
|2010-04-26T02:59:...| 58| 1| 3376| 2187|
|2010-04-26T02:59:...| 59| 1| 3376| 18080|
|2010-04-26T02:59:...| 63| 1| 3859| 802573|
|2010-04-26T02:59:...| 69| 1| 4565| 583532|
|2010-04-26T02:59:...| 70| 1| 4582| 1955663|
|2010-04-26T02:59:...| 72| 1| 4850| 1369312|
|2010-04-26T02:59:...| 74| 1| 4952| 1607|
|2010-04-26T02:59:...| 75| 1| 4952| 173|
|2010-04-26T02:59:...| 76| 1| 4952| 6371|
|2010-04-26T02:59:...| 81| 1| 5724| 131955|
|2010-04-26T02:59:...| 84| 1| 6173| 1732348|
|2010-04-26T02:59:...| 90| 1| 7470| 82872|
|2010-04-26T02:59:...| 96| 1| 8170| 245395|
|2010-04-26T02:59:...|119| 1| 9705| 305694|
|2010-04-26T02:59:...|121| 1| 9751| 371|
|2010-04-26T02:59:...|126| 1| 9926| 619677|
+--------------------+---+-----------+-------+--------------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Posts.parquet").show()
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
|_AcceptedAnswerId|_AnswerCount| _Body|_ClosedDate|_CommentCount|_CommunityOwnedDate|_ContentLicense| _CreationDate|_FavoriteCount| _Id| _LastActivityDate| _LastEditDate|_LastEditorDisplayName|_LastEditorUserId|_OwnerDisplayName|_OwnerUserId|_ParentId|_PostTypeId|_Score| _Tags| _Title|_ViewCount|
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
| null| null|<p>Have you tried...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:22:...| null|821861|2009-05-06T13:38:...|2009-05-06T13:38:...| null| 95029| null| 95029| 819238| 2| 2| null| null| null|
| null| null|<p>It seems that ...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:22:...| null|821862|2009-05-04T20:22:...| null| null| null| null| 60315| 820218| 2| 0| null| null| null|
| null| null|<blockquote> <p>F...| null| 4| null| CC BY-SA 3.0|2009-05-04T20:23:...| null|821864|2014-07-29T15:12:...|2020-06-20T09:12:...| null| -1| null| 55159| 311873| 2| 4| null| null| null|
| null| null|<p>I would seriou...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:23:...| null|821865|2009-05-04T21:43:...|2009-05-04T21:43:...| null| 2424| null| 2424| 821598| 2| 1| null| null| null|
| null| null|<pre><code>String...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:23:...| null|821866|2009-05-04T20:23:...| null| null| null| null| 89266| 821806| 2| 0| null| null| null|
| 821986| 4|<p>Suppose I have...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:23:...| 2|821867|2009-05-08T07:18:...|2009-05-04T20:30:...| null| 16012| null| 16012| null| 1| 4| <c#><linq>|how to process "p...| 640|
| null| null|<p>You have vario...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:23:...| null|821868|2009-05-04T20:23:...| null| null| null| null| 79294| 821780| 2| 10| null| null| null|
| null| null|<p>Because that's...| null| 1| null| CC BY-SA 2.5|2009-05-04T20:24:...| null|821869|2009-05-04T20:24:...| null| null| null| null| 21886| 821839| 2| 20| null| null| null|
| 1049025| 4|<p>I want to only...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:24:...| 40|821870|2017-05-09T10:09:...| null| null| null| null| 36680| null| 1| 26|<django><django-a...|How can I detect ...| 16019|
| null| null|<p>I had a deer-i...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:24:...| null|821871|2009-05-04T20:24:...| null| null| null| null| 101116| 821740| 2| 0| null| null| null|
| null| null|<p>Implement ISer...| null| 1| null| CC BY-SA 2.5|2009-05-04T20:24:...| null|821872|2009-05-04T20:24:...| null| null| null| null| 12971| 821780| 2| 2| null| null| null|
| 822032| 7|<p>You wouldn't i...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:24:...| 8|821873|2020-11-27T10:49:...|2009-05-04T20:56:...| null| 65336| null| 65336| null| 1| 57|<c++><windows><un...|How to open an st...| 57692|
| null| null|<p>On <strong>Lin...| null| 7| null| CC BY-SA 2.5|2009-05-04T20:24:...| null|821874|2009-05-06T14:28:...|2009-05-06T14:28:...| null| 44434| null| 44434| 821837| 2| 14| null| null| null|
| null| null|<p>Interface prop...| null| 2| null| CC BY-SA 2.5|2009-05-04T20:24:...| null|821875|2009-05-04T20:24:...| null| null| null| null| 16623| 821780| 2| -1| null| null| null|
| 822334| 4|<p>I've seen on s...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:25:...| 2|821877|2009-05-04T22:11:...| null| null| null| null| 100884| null| 1| 1|<php><web-applica...|Create a link bet...| 137|
| null| null|<p>The reason it ...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:25:...| null|821879|2009-05-04T20:25:...| null| null| null| null| 19131| 821839| 2| 12| null| null| null|
| null| null|<p>Depending on y...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:25:...| null|821881|2009-05-04T20:25:...| null| null| null| null| 91872| 820807| 2| 5| null| null| null|
| null| null|<p>One option is ...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:26:...| null|821882|2009-05-04T20:40:...|2009-05-04T20:40:...| null| 41094| null| 41094| 821844| 2| 8| null| null| null|
| null| null|<p>Write a matrix...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:26:...| null|821883|2009-05-04T20:26:...| null| null| null| null| 32174| 819138| 2| 16| null| null| null|
| null| null|<p>Convert the tw...| null| 0| null| CC BY-SA 2.5|2009-05-04T20:26:...| null|821884|2009-05-04T20:26:...| null| null| null| null| 78259| 821423| 2| 1| null| null| null|
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Tags.parquet").show()
+-------+--------------+---+----------+-----------+
| _Count|_ExcerptPostId|_Id| _TagName|_WikiPostId|
+-------+--------------+---+----------+-----------+
| 303362| 3624959| 1| .net| 3607476|
|1038358| 3673183| 2| html| 3673182|
|2130783| 3624960| 3|javascript| 3607052|
| 694921| 3644670| 4| css| 3644669|
|1381623| 3624936| 5| php| 3607050|
| 345968| 3624961| 8| c| 3607013|
|1450818| 3624962| 9| c#| 3607007|
| 702947| 3624963| 10| c++| 3606997|
| 216766| 3624964| 12| ruby| 3607043|
| 6324| 3656743| 14| lisp| 3656742|
|1597896| 3624965| 16| python| 3607014|
|1735439| 3624966| 17| java| 3607018|
| 235211| 3624967| 18| regex| 3607017|
| 198437| 3624968| 19| xml| 3607588|
| 613118| 3624969| 21| mysql| 3607033|
| 570524| 3625226| 22| sql| 3607304|
| 64654| 4777787| 23| tsql| 4777786|
| 63| 8355939| 26| j#| 8355938|
| 24176| 5388160| 27| browser| 5388159|
| 80168| 4890031| 28| linq| 4890030|
+-------+--------------+---+----------+-----------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Users.parquet").show()
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+
| _AboutMe|_AccountId| _CreationDate| _DisplayName|_DownVotes| _Id| _LastAccessDate| _Location| _ProfileImageUrl|_Reputation|_UpVotes|_Views| _WebsiteUrl|
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+
| null| 17439875|2020-01-02T16:08:...| FLvME| 0|12642164|2020-01-07T05:58:...| Vietnam|https://lh3.googl...| 1| 0| 0| |
| null| 17439877|2020-01-02T16:08:...| BAAGreen| 0|12642165|2020-10-04T17:18:...| null|https://www.grava...| 1| 0| 0| null|
| null| 17439878|2020-01-02T16:09:...| V Payment| 0|12642166|2020-01-02T16:24:...| null|https://lh3.googl...| 1| 0| 0| null|
|<p>In no way allo...| 17439879|2020-01-02T16:09:...| terlysz306| 0|12642167|2020-01-02T16:09:...| null|https://www.grava...| 1| 0| 0|http://www.wikido...|
| null| 17439880|2020-01-02T16:09:...| Sergey Lemish| 0|12642168|2020-12-04T16:58:...| Dublin, Ireland|https://www.grava...| 1| 0| 0| null|
| null| 17439881|2020-01-02T16:09:...|Genesis Dominic I...| 0|12642169|2020-01-02T16:09:...| null|https://graph.fac...| 1| 0| 0| null|
| null| 17439882|2020-01-02T16:09:...| Ritesh Bhartiya| 0|12642170|2020-10-25T02:57:...| null| null| 21| 0| 6| null|
| null| 17439883|2020-01-02T16:09:...| chmiga092020| 0|12642171|2020-01-16T15:56:...| null|https://www.grava...| 1| 0| 0| null|
| | 17439884|2020-01-02T16:09:...| Lee Ford| 0|12642172|2020-12-05T20:31:...| null|https://www.grava...| 13| 0| 1|https://www.lee-f...|
| null| 17439886|2020-01-02T16:10:...| MillerAuto| 0|12642173|2020-01-02T16:10:...| null|https://www.grava...| 1| 0| 0| null|
| null| 17439888|2020-01-02T16:10:...| Gowtham Reddy| 0|12642174|2020-03-23T18:41:...| null|https://lh3.googl...| 1| 0| 0| null|
| null| 17439889|2020-01-02T16:10:...| Hamza Soomro| 0|12642175|2020-07-12T11:41:...| Karachi, Pakistan|https://lh3.googl...| 1| 0| 0| null|
| null| 17439890|2020-01-02T16:10:...| Rajasiva singh| 0|12642176|2020-01-06T13:47:...| null|https://lh3.googl...| 1| 0| 0| null|
| null| 17439762|2020-01-02T16:10:...| Andrii Pereverziev| 0|12642177|2020-11-30T13:21:...| null|https://www.grava...| 1| 0| 0| null|
| null| 17439892|2020-01-02T16:10:...| Juned Shaikh| 0|12642179|2020-01-12T15:01:...| null|https://lh6.googl...| 1| 0| 0| null|
| null| 17439893|2020-01-02T16:10:...| james webb| 0|12642180|2020-01-02T16:10:...| null|https://lh6.googl...| 1| 0| 0| null|
| null| 17439894|2020-01-02T16:11:...| MasonDDD| 0|12642181|2020-12-03T10:00:...|Canberra ACT, Aus...|https://www.grava...| 25| 1| 3| null|
| null| 17439895|2020-01-02T16:11:...| Satyam Mishra| 0|12642182|2020-12-04T08:11:...|Jabalpur, Madhya ...|https://lh3.googl...| 1| 0| 3| null|
| null| 17439896|2020-01-02T16:11:...| orchocsearchterp| 0|12642183|2020-01-02T16:11:...| W|https://www.grava...| 1| 0| 0| null|
| | 17439899|2020-01-02T16:12:...| Prince_Patel486| 0|12642185|2020-05-30T12:05:...| null|https://www.grava...| 1| 0| 0| |
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Votes.parquet").show()
+-------------+--------------------+---------+--------+-------+-----------+
|_BountyAmount| _CreationDate| _Id| _PostId|_UserId|_VoteTypeId|
+-------------+--------------------+---------+--------+-------+-----------+
| null|2020-05-04T00:00:...|208377486|53104880| null| 2|
| null|2020-05-04T00:00:...|208377487| 1722518| null| 2|
| null|2020-05-04T00:00:...|208377489|58112953| null| 2|
| null|2020-05-04T00:00:...|208377490|55824926| null| 2|
| null|2020-05-04T00:00:...|208377491|61599528| null| 2|
| null|2020-05-04T00:00:...|208377492|48986909| null| 2|
| null|2020-05-04T00:00:...|208377493|55812717| null| 2|
| null|2020-05-04T00:00:...|208377494|36952823| null| 3|
| null|2020-05-04T00:00:...|208377495|34975690| null| 2|
| null|2020-05-04T00:00:...|208377496|48981817| null| 2|
| null|2020-05-04T00:00:...|208377497|14428111| null| 2|
| null|2020-05-04T00:00:...|208377499|14428084| null| 2|
| null|2020-05-04T00:00:...|208377500|21689505| null| 2|
| null|2020-05-04T00:00:...|208377501|12468641| null| 2|
| null|2020-05-04T00:00:...|208377503|61597117| null| 16|
| null|2020-05-04T00:00:...|208377504|26414140| null| 2|
| null|2020-05-04T00:00:...|208377506| 1402410| null| 2|
| null|2020-05-04T00:00:...|208377507|17991752| null| 2|
| null|2020-05-04T00:00:...|208377508|45945829| null| 2|
| null|2020-05-04T00:00:...|208377509|61588542| null| 2|
+-------------+--------------------+---------+--------+-------+-----------+
Last version: 15 January 2021
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/output_stackoverflow.parquet").show()
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+
| _Id|title_contains_questionmark|title_number_of_characters|has_answer|number_of_characters|number_of_interpunction_characters|number_of_emoji_characters| interpunction_ratio| emoji_ratio|number_of_lines|average_line_length|number_of_words|average_word_length|creation_seconds|number_of_tags|contains_language_tag|contains_platform_tag|user_age|posts_amount|answered_posts_amount|is_question|
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+
| 19| false| 30| true| 3544| 279| 3386| 0.07872460496613995|0.9554176072234764| 80| 44.3| 1263| 2.8060174188440223| 1217560882| 4| false| true| 3798| 9| 8| true|
| 6578| false| 57| true| 537| 12| 529| 0.0223463687150838|0.9851024208566108| 5| 107.4| 102| 5.264705882352941| 1218244384| 4| false| true| 3540| 73| 47| true|
| 9341| false| 62| true| 1472| 37| 1442|0.025135869565217392|0.9796195652173914| 16| 92.0| 286| 5.146853146853147| 1218579063| 2| true| false| 791491| 54| 44| true|
| 9750| false| 29| false| 258| 4| 242|0.015503875968992248| 0.937984496124031| 9| 28.666666666666668| 68| 3.7941176470588234| 1218625248| 2| true| false| 968362| 22| 17| true|
|11574| true| 49| true| 336| 8| 332|0.023809523809523808|0.9880952380952381| 3| 112.0| 64| 5.25| 1218735123| 4| true| true| 279371| 7| 5| true|
|16007| false| 48| true| 451| 25| 429| 0.05543237250554324|0.9512195121951219| 12| 37.583333333333336| 159| 2.8364779874213837| 1219137895| 3| false| false| 79969| 17| 15| true|
|16828| true| 77| true| 206| 3| 198|0.014563106796116505|0.9611650485436893| 5| 41.2| 49| 4.204081632653061| 1219168833| 2| false| true| 101662| 10| 5| true|
|19790| false| 52| true| 274| 5| 274| 0.01824817518248175| 1.0| 1| 274.0| 54| 5.074074074074074| 1219316456| 2| false| false| 270129| 52| 40| true|
|21288| true| 69| true| 54| 1| 54|0.018518518518518517| 1.0| 1| 54.0| 10| 5.4| 1219348583| 4| true| false| 181970| 18| 14| true|
|26719| false| 46| true| 597| 21| 581|0.035175879396984924|0.9731993299832495| 9| 66.33333333333333| 103| 5.796116504854369| 1219686691| 4| true| false| 840137| 29| 20| true|
|33321| false| 19| true| 234| 7| 234|0.029914529914529916| 1.0| 1| 234.0| 44| 5.318181818181818| 1219948329| 3| false| false| 281269| 74| 68| true|
|37628| false| 10| true| 144| 7| 138| 0.04861111111111111|0.9583333333333334| 4| 36.0| 29| 4.9655172413793105| 1220251161| 2| true| false| 2577458| 28| 28| true|
|37809| true| 39| true| 298| 14| 282| 0.04697986577181208|0.9463087248322147| 9| 33.111111111111114| 61| 4.885245901639344| 1220262957| 2| true| false| 2336960| 50| 49| true|
|39391| true| 60| true| 280| 11| 280|0.039285714285714285| 1.0| 1| 280.0| 54| 5.185185185185185| 1220351847| 3| true| false| 329| 14| 13| true|
|40622| true| 50| true| 774| 14| 756| 0.01808785529715762|0.9767441860465116| 10| 77.4| 148| 5.22972972972973| 1220381512| 2| false| false| 27382| 6| 2| true|
|42954| false| 33| true| 225| 3| 225|0.013333333333333334| 1.0| 1| 225.0| 47| 4.787234042553192| 1220482662| 2| false| false| 1471610| 12| 11| true|
|44617| false| 57| true| 826| 22| 798|0.026634382566585957|0.9661016949152542| 15| 55.06666666666667| 161| 5.130434782608695| 1220550306| 2| false| false| 965638| 6| 6| true|
|47169| true| 101| true| 455| 9| 451| 0.01978021978021978|0.9912087912087912| 3| 151.66666666666666| 79| 5.7594936708860756| 1220657872| 3| false| false| 824731| 103| 91| true|
|47217| false| 98| true| 509| 10| 497|0.019646365422396856|0.9764243614931237| 7| 72.71428571428571| 103| 4.941747572815534| 1220662519| 2| false| false| 890777| 3| 3| true|
|49080| true| 39| true| 271| 3| 269| 0.01107011070110701| 0.992619926199262| 2| 135.5| 60| 4.516666666666667| 1220835278| 2| false| false| 8491| 0| 0| true|
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+
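Several columns in this preview appear to be simple ratios of other columns. The sketch below recomputes them for the first row (`_Id` 19) using values copied from the table. The formulas are an assumption inferred from the preview, not taken from the feature-extraction code, and the `row` and `derived` names are only illustrative.

```python
# Values copied from the first row (_Id = 19) of the preview above.
row = {
    "number_of_characters": 3544,
    "number_of_interpunction_characters": 279,
    "number_of_emoji_characters": 3386,
    "number_of_lines": 80,
    "number_of_words": 1263,
}

# Assumed formulas (inferred from the preview, not from the project code).
derived = {
    "interpunction_ratio": row["number_of_interpunction_characters"] / row["number_of_characters"],
    "emoji_ratio": row["number_of_emoji_characters"] / row["number_of_characters"],
    "average_line_length": row["number_of_characters"] / row["number_of_lines"],
    "average_word_length": row["number_of_characters"] / row["number_of_words"],
}

for name, value in derived.items():
    print(name, value)
```

Under these assumed formulas, the recomputed values line up with the `interpunction_ratio`, `emoji_ratio`, `average_line_length` and `average_word_length` columns of that row.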
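Calculated using Spearman's rank correlation, which measures only monotonic relationships. This means that we cannot simply rely on the value shown here; we should also verify that the relationship between each feature and the label is monotonic.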
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/pair_correlation.parquet").show()
+--------------------+--------------------+----------+
| correlation| feature| label|
+--------------------+--------------------+----------+
| 0.04164706093979694|title_contains_qu...|has_answer|
| -0.034707835357526|title_number_of_c...|has_answer|
|-0.00491253326448154|number_of_characters|has_answer|
|0.019044663496529227|number_of_interpu...|has_answer|
|-0.00233838443165...| average_word_length|has_answer|
|-0.19493722195095384| creation_seconds|has_answer|
|-0.00404959346628...| number_of_tags|has_answer|
| 0.0701465992728538|contains_language...|has_answer|
|0.020167772743942793| punctuation_ratio|has_answer|
|6.850081805398476E-4| number_of_lines|has_answer|
|-0.00491454014554...| average_line_length|has_answer|
|-0.00317468875425...| number_of_words|has_answer|
| 0.2571830458177784| posts_amount|has_answer|
| 0.31784655302704|answered_posts_am...|has_answer|
|-0.03127949576091...|contains_platform...|has_answer|
|0.038255967261455755| user_age|has_answer|
+--------------------+--------------------+----------+
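To illustrate why a Spearman coefficient close to zero does not rule out a relationship, here is a minimal pure-Python sketch. It is not the code used to produce the table above; the helper names (`average_ranks`, `pearson`, `spearman`) and the toy data are assumptions made for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


xs = [1, 2, 3, 4, 5]
monotonic = [x ** 3 for x in xs]        # non-linear but strictly increasing
u_shaped = [(x - 3) ** 2 for x in xs]   # clearly related to x, but not monotonic

print(spearman(xs, monotonic))
print(spearman(xs, u_shaped))
```

The strictly increasing series gets a coefficient of 1 even though it is not linear, while the U-shaped series gets a coefficient of 0 although it is fully determined by `xs`. A small value in the table therefore only says that no monotonic trend was found.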
The node data of the saved decision tree regressor model:
>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/analysis/regressor_saved.parquet/data").show()
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
| id| prediction| impurity| impurityStats| gain|leftChild|rightChild| split|
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
| 31| 0.7547091181863644|0.18512326511272467|[11998.0, 9055.0,...| -1.0| -1| -1| [-1, [], -1]|
| 32|0.37486982240304373|0.23434243865455417|[1152272.0, 43195...|0.006136729293756998| 33| 48|[2, [1.36361601E9...|
| 33| 0.2512793183428513|0.18813802251600328|[330254.0, 82986....|0.004776013837556381| 34| 41| [8, [7.5], -1]|
| 34|0.35021977616075056|0.22756588454666435|[108292.0, 37926....|0.002296493147494...| 35| 38| [8, [2.5], -1]|
| 35| 0.4059573728463728|0.24115598427804388|[46027.0, 18685.0...|0.005602832480530534| 36| 37| [7, [4.5], -1]|
| 36|0.36253557117138047| 0.2311035308068214|[34438.0, 12485.0...| -1.0| -1| -1| [-1, [], -1]|
| 37| 0.5349900767969626|0.24877569452574266|[11589.0, 6200.0,...| -1.0| -1| -1| [-1, [], -1]|
| 38| 0.3090179073315667|0.21352584027998595|[62265.0, 19241.0...|0.005544332988142012| 39| 40| [7, [10.5], -1]|
| 15| 0.7243969948596283|0.19964598869796799|[10116.0, 7328.0,...| -1.0| -1| -1| [-1, [], -1]|
| 16| 0.8313752591568763| 0.1401904376187131|[4341.0, 3609.0, ...| -1.0| -1| -1| [-1, [], -1]|
| 17| 0.726157776534727|0.19885266011286845|[239554.0, 173954...|0.001002674480903...| 18| 25| [6, [445698.5], -1]|
| 18| 0.7766155482669976|0.17348383845694834|[67686.0, 52566.0...| 7.64981619419805E-4| 19| 22| [6, [686.0], -1]|
| 19| 0.8119030980666201| 0.1527164574164444|[25758.0, 20913.0...| 7.40050812104133E-4| 20| 21| [7, [0.5], -1]|
| 20| 0.8261989694807769| 0.1435942323096792|[20184.0, 16676.0...| -1.0| -1| -1| [-1, [], -1]|
| 21| 0.7601363473268747| 0.1823290807994316|[5574.0, 4237.0, ...| -1.0| -1| -1| [-1, [], -1]|
| 22| 0.7549370349170006|0.18500710822772806|[41928.0, 31653.0...|5.364548292366399E-4| 23| 24|[2, [1.575364746E...|
| 7| 0.7953811770548795|0.16274996024167385|[8054.0, 6406.0, ...|0.001592709755649...| 8| 9| [7, [4.5], -1]|
| 8| 0.7653239929947461|0.17960317874132392|[5139.0, 3933.0, ...| -1.0| -1| -1| [-1, [], -1]|
| 9| 0.8483704974271012| 0.1286379965223941|[2915.0, 2473.0, ...| -1.0| -1| -1| [-1, [], -1]|
| 10| 0.628404926205963| 0.2335121749260412|[181722.0, 114195...|0.001418629698850...| 11| 14| [7, [3.5], -1]|
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
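The rows above describe one tree node each. As a rough sketch of how such a table can be read, the snippet below walks three nodes copied from the preview (ids 7, 8 and 9). It assumes that `split` holds (feature index, [threshold], number of categories), that a value less than or equal to the threshold goes to `leftChild`, and that `-1` children mark a leaf. This matches how Spark ML persists continuous splits as far as we know, but treat it as an assumption. The `NODES` dictionary and the `predict` function are illustrative names, and which feature sits at index 7 is not shown in the preview.

```python
# id -> (prediction, leftChild, rightChild, split), copied from the preview above.
# split is assumed to be (feature index, [threshold], number of categories).
NODES = {
    7: (0.7953811770548795, 8, 9, (7, [4.5], -1)),
    8: (0.7653239929947461, -1, -1, (-1, [], -1)),
    9: (0.8483704974271012, -1, -1, (-1, [], -1)),
}


def predict(features, node_id, nodes=NODES):
    """Follow splits from node_id until a leaf is reached; return its prediction."""
    prediction, left, right, (feature_index, thresholds, _) = nodes[node_id]
    if left == -1 and right == -1:
        return prediction  # leaf node
    # Assumed rule for continuous splits: value <= threshold goes left.
    child = left if features[feature_index] <= thresholds[0] else right
    return predict(features, child, nodes)


# features maps a feature index to its value; only index 7 is needed here.
print(predict({7: 3.0}, 7))
print(predict({7: 5.0}, 7))
```

Starting from node 7, a value of 3.0 for feature 7 ends in leaf 8 and a value of 5.0 ends in leaf 9, so the two calls return the predictions stored for those leaves.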