Data previews

Easiest to read on GitHub: https://github.com/WeersProductions/resolverflow/blob/master/dataFramePreviews.md

Data previews for the following tables:

  1. Badges
  2. Comments
  3. PostHistory
  4. PostLinks
  5. Posts
  6. Tags
  7. Users
  8. Votes
  9. output_stackoverflow
  10. correlation
  11. decision tree (regressor)

You can find the database schema here: https://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede

Badges

>>> df = spark.read.parquet("/user/***REMOVED***/StackOverflow/Badges.parquet")
>>> df.show()
+------+--------------------+-------+----------------+---------+-------+
|_Class|               _Date|    _Id|           _Name|_TagBased|_UserId|
+------+--------------------+-------+----------------+---------+-------+
|     2|2011-02-02T17:42:...|1400008|Notable Question|    false|   3488|
|     2|2011-02-02T17:42:...|1400009|            Guru|    false| 353410|
|     3|2011-02-02T17:47:...|1400010|         Student|    false| 599272|
|     3|2011-02-02T17:47:...|1400011|         Student|    false| 557399|
|     3|2011-02-02T17:47:...|1400012|         Student|    false| 600168|
|     3|2011-02-02T17:47:...|1400013|          Editor|    false| 564626|
|     3|2011-02-02T17:47:...|1400014|  Autobiographer|    false| 296093|
|     3|2011-02-02T17:47:...|1400016|         Scholar|    false| 410592|
|     3|2011-02-02T17:47:...|1400017|         Scholar|    false| 463428|
|     3|2011-02-02T17:47:...|1400018|   Nice Question|    false|  32484|
|     3|2011-02-02T17:47:...|1400019|Popular Question|    false|   1512|
|     3|2011-02-02T17:47:...|1400020|Popular Question|    false|   7453|
|     3|2011-02-02T17:47:...|1400021|Popular Question|    false| 256239|
|     3|2011-02-02T17:47:...|1400022|Popular Question|    false| 326284|
|     3|2011-02-02T17:47:...|1400023|Popular Question|    false|  22215|
|     3|2011-02-02T17:47:...|1400024|      Tumbleweed|    false| 591013|
|     3|2011-02-02T17:47:...|1400025|      Tumbleweed|    false| 591011|
|     3|2011-02-02T17:52:...|1400026|         Student|    false| 543649|
|     3|2011-02-02T17:52:...|1400027|         Student|    false| 553638|
|     3|2011-02-02T17:52:...|1400028|         Student|    false| 155689|
+------+--------------------+-------+----------------+---------+-------+

Comments

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Comments.parquet").show()
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+
|_ContentLicense|       _CreationDate|    _Id|_PostId|_Score|               _Text|_UserDisplayName|_UserId|
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+
|   CC BY-SA 2.5|2010-07-02T21:17:...|3260274|3168284|     0|So, for each pati...|            null|  32632|
|   CC BY-SA 2.5|2010-07-02T21:17:...|3260275|3167981|     0|Another alternati...|            null|  29639|
|   CC BY-SA 2.5|2010-07-02T21:17:...|3260276|3169136|     3|And no info in Bo...|            null| 155356|
|   CC BY-SA 2.5|2010-07-02T21:17:...|3260277|3128496|     0|One thing to reme...|            null|  89761|
|   CC BY-SA 2.5|2010-07-02T21:17:...|3260279|3169068|     0|Not really relate...|            null|  74757|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260281|3168944|     0|javascript:var%20...|            null| 281705|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260282|3168667|     0|Yeah. But I think...|            null| 243943|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260283| 235067|     0|@Jon Skeet: Maybe...|            null| 135172|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260284|3161146|     0|This is a duplica...|            null| 340221|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260285|3161593|     0|Is the first_two_...|            null| 305555|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260286|3169115|     0|Isn't there an "E...|            null|  17028|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260287|3169126|     6|If your code is b...|            null| 189416|
|   CC BY-SA 2.5|2010-07-02T21:18:...|3260288|3169146|     2|Without code to e...|            null|  91768|
|   CC BY-SA 2.5|2010-07-02T21:19:...|3260289|3168175|     0|This is very nice...|            null| 312026|
|   CC BY-SA 2.5|2010-07-02T21:19:...|3260290|3167625|     0|I was able to fix...|            null|  87154|
|   CC BY-SA 2.5|2010-07-02T21:19:...|3260291|3163175|     0|The first solutio...|            null| 381675|
|   CC BY-SA 2.5|2010-07-02T21:19:...|3260292|3168299|     0|I think you can i...|            null|  69998|
|   CC BY-SA 2.5|2010-07-02T21:20:...|3260293| 204506|     0|@David Thornley -...|            null|   1737|
|   CC BY-SA 2.5|2010-07-02T21:20:...|3260294|3169159|     0|Is there somethin...|            null| 356292|
|   CC BY-SA 2.5|2010-07-02T21:20:...|3260295|3162018|     0|Okay what can i d...|            null| 365706|
+---------------+--------------------+-------+-------+------+--------------------+----------------+-------+

PostHistory

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/PostHistory.parquet").show()
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+
|            _Comment|_ContentLicense|       _CreationDate|      _Id|_PostHistoryTypeId| _PostId|       _RevisionGUID|               _Text|_UserDisplayName|_UserId|
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519365|                 5| 2613348|6499a078-4532-4d2...|When I went to Un...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519366|                 5| 2615297|26b49208-3be4-4eb...|If you have some ...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519367|                 5| 2619007|bc975e70-a69a-4f3...|How to create a s...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519368|                 5| 2617924|d529d0f4-15ed-4f1...|Yes - don't forge...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519369|                 5| 2617787|fb7ed36e-ee2b-42a...|Instead of nested...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519370|                 5| 2617507|bc1093c6-d4ca-435...|You said:  > ... ...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519371|                 5| 2617212|578031e6-c28a-4ff...|I am referred to ...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519372|                 5| 2617119|54556aab-9ea2-411...|i want get data f...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519373|                 5| 2616985|a600c321-4ab0-4c7...|Here is the issue...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519374|                 5| 2616839|0fb6470a-ca94-4d6...|The Wikipedia [Gr...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519375|                 5| 2616320|5db679ca-da0c-4ad...|I'm trying to cre...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519376|                 5| 2616314|301771e4-67a3-48c...|When defining you...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519377|                 5| 2616195|f81da27b-4153-49b...|This must be doab...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519378|                 5| 2618517|9ef29b6a-c1c3-432...|I'm racking my br...|            null|     -1|
|                null|   CC BY-SA 4.0|2020-06-20T09:45:...|224519379|                 2|62484164|11cc8566-d631-40a...|Your client code ...|            null| 256196|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519385|                 5| 2622179|1797efdd-e2fb-43e...|Sorry for answeri...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519386|                 5| 2621679|bc43b8a5-5d21-418...|With respect ther...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519387|                 5| 2621634|fcae8d6f-3af5-432...|From [Statement j...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519388|                 5| 2621174|59bb584d-129a-47c...|The general recur...|            null|     -1|
|Commonmark migration|   CC BY-SA 4.0|2020-06-20T09:12:...|224519389|                 5| 2621136|6f5e75ba-3413-4dc...|I have compiled t...|            null|     -1|
+--------------------+---------------+--------------------+---------+------------------+--------+--------------------+--------------------+----------------+-------+

PostLinks

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/PostLinks.parquet").show()
+--------------------+---+-----------+-------+--------------+
|       _CreationDate|_Id|_LinkTypeId|_PostId|_RelatedPostId|
+--------------------+---+-----------+-------+--------------+
|2010-04-26T02:59:...| 19|          1|    109|         32412|
|2010-04-26T02:59:...| 37|          1|   1970|        617600|
|2010-04-26T02:59:...| 42|          1|   2154|       2451138|
|2010-04-26T02:59:...| 52|          1|   2572|        209329|
|2010-04-26T02:59:...| 58|          1|   3376|          2187|
|2010-04-26T02:59:...| 59|          1|   3376|         18080|
|2010-04-26T02:59:...| 63|          1|   3859|        802573|
|2010-04-26T02:59:...| 69|          1|   4565|        583532|
|2010-04-26T02:59:...| 70|          1|   4582|       1955663|
|2010-04-26T02:59:...| 72|          1|   4850|       1369312|
|2010-04-26T02:59:...| 74|          1|   4952|          1607|
|2010-04-26T02:59:...| 75|          1|   4952|           173|
|2010-04-26T02:59:...| 76|          1|   4952|          6371|
|2010-04-26T02:59:...| 81|          1|   5724|        131955|
|2010-04-26T02:59:...| 84|          1|   6173|       1732348|
|2010-04-26T02:59:...| 90|          1|   7470|         82872|
|2010-04-26T02:59:...| 96|          1|   8170|        245395|
|2010-04-26T02:59:...|119|          1|   9705|        305694|
|2010-04-26T02:59:...|121|          1|   9751|           371|
|2010-04-26T02:59:...|126|          1|   9926|        619677|
+--------------------+---+-----------+-------+--------------+

Posts

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Posts.parquet").show()
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
|_AcceptedAnswerId|_AnswerCount|               _Body|_ClosedDate|_CommentCount|_CommunityOwnedDate|_ContentLicense|       _CreationDate|_FavoriteCount|   _Id|   _LastActivityDate|       _LastEditDate|_LastEditorDisplayName|_LastEditorUserId|_OwnerDisplayName|_OwnerUserId|_ParentId|_PostTypeId|_Score|               _Tags|              _Title|_ViewCount|
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
|             null|        null|<p>Have you tried...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:22:...|          null|821861|2009-05-06T13:38:...|2009-05-06T13:38:...|                  null|            95029|             null|       95029|   819238|          2|     2|                null|                null|      null|
|             null|        null|<p>It seems that ...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:22:...|          null|821862|2009-05-04T20:22:...|                null|                  null|             null|             null|       60315|   820218|          2|     0|                null|                null|      null|
|             null|        null|<blockquote> <p>F...|       null|            4|               null|   CC BY-SA 3.0|2009-05-04T20:23:...|          null|821864|2014-07-29T15:12:...|2020-06-20T09:12:...|                  null|               -1|             null|       55159|   311873|          2|     4|                null|                null|      null|
|             null|        null|<p>I would seriou...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:23:...|          null|821865|2009-05-04T21:43:...|2009-05-04T21:43:...|                  null|             2424|             null|        2424|   821598|          2|     1|                null|                null|      null|
|             null|        null|<pre><code>String...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:23:...|          null|821866|2009-05-04T20:23:...|                null|                  null|             null|             null|       89266|   821806|          2|     0|                null|                null|      null|
|           821986|           4|<p>Suppose I have...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:23:...|             2|821867|2009-05-08T07:18:...|2009-05-04T20:30:...|                  null|            16012|             null|       16012|     null|          1|     4|          <c#><linq>|how to process "p...|       640|
|             null|        null|<p>You have vario...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:23:...|          null|821868|2009-05-04T20:23:...|                null|                  null|             null|             null|       79294|   821780|          2|    10|                null|                null|      null|
|             null|        null|<p>Because that's...|       null|            1|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|          null|821869|2009-05-04T20:24:...|                null|                  null|             null|             null|       21886|   821839|          2|    20|                null|                null|      null|
|          1049025|           4|<p>I want to only...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|            40|821870|2017-05-09T10:09:...|                null|                  null|             null|             null|       36680|     null|          1|    26|<django><django-a...|How can I detect ...|     16019|
|             null|        null|<p>I had a deer-i...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|          null|821871|2009-05-04T20:24:...|                null|                  null|             null|             null|      101116|   821740|          2|     0|                null|                null|      null|
|             null|        null|<p>Implement ISer...|       null|            1|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|          null|821872|2009-05-04T20:24:...|                null|                  null|             null|             null|       12971|   821780|          2|     2|                null|                null|      null|
|           822032|           7|<p>You wouldn't i...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|             8|821873|2020-11-27T10:49:...|2009-05-04T20:56:...|                  null|            65336|             null|       65336|     null|          1|    57|<c++><windows><un...|How to open an st...|     57692|
|             null|        null|<p>On <strong>Lin...|       null|            7|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|          null|821874|2009-05-06T14:28:...|2009-05-06T14:28:...|                  null|            44434|             null|       44434|   821837|          2|    14|                null|                null|      null|
|             null|        null|<p>Interface prop...|       null|            2|               null|   CC BY-SA 2.5|2009-05-04T20:24:...|          null|821875|2009-05-04T20:24:...|                null|                  null|             null|             null|       16623|   821780|          2|    -1|                null|                null|      null|
|           822334|           4|<p>I've seen on s...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:25:...|             2|821877|2009-05-04T22:11:...|                null|                  null|             null|             null|      100884|     null|          1|     1|<php><web-applica...|Create a link bet...|       137|
|             null|        null|<p>The reason it ...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:25:...|          null|821879|2009-05-04T20:25:...|                null|                  null|             null|             null|       19131|   821839|          2|    12|                null|                null|      null|
|             null|        null|<p>Depending on y...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:25:...|          null|821881|2009-05-04T20:25:...|                null|                  null|             null|             null|       91872|   820807|          2|     5|                null|                null|      null|
|             null|        null|<p>One option is ...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:26:...|          null|821882|2009-05-04T20:40:...|2009-05-04T20:40:...|                  null|            41094|             null|       41094|   821844|          2|     8|                null|                null|      null|
|             null|        null|<p>Write a matrix...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:26:...|          null|821883|2009-05-04T20:26:...|                null|                  null|             null|             null|       32174|   819138|          2|    16|                null|                null|      null|
|             null|        null|<p>Convert the tw...|       null|            0|               null|   CC BY-SA 2.5|2009-05-04T20:26:...|          null|821884|2009-05-04T20:26:...|                null|                  null|             null|             null|       78259|   821423|          2|     1|                null|                null|      null|
+-----------------+------------+--------------------+-----------+-------------+-------------------+---------------+--------------------+--------------+------+--------------------+--------------------+----------------------+-----------------+-----------------+------------+---------+-----------+------+--------------------+--------------------+----------+
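The `_Tags` column above packs all tags of a question into a single angle-bracketed string, for example `<c#><linq>`, and is null for answers. A minimal sketch of splitting such a value into a list in plain Python follows. The helper name `split_tags` is hypothetical and not part of the project code.

```python
import re

# Matches one tag wrapped in angle brackets, e.g. "<c#>" captures "c#".
_TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(raw_tags):
    """Split a Posts._Tags value such as "<c#><linq>" into ["c#", "linq"].

    Returns an empty list for None (answers carry no tags in the preview).
    """
    if raw_tags is None:
        return []
    return _TAG_PATTERN.findall(raw_tags)


print(split_tags("<c#><linq>"))  # ['c#', 'linq']
print(split_tags(None))          # []
```

Note that `show()` truncates long cells, so values such as `<django><django-a...` in the preview are not complete tag strings and should not be fed to a parser as-is.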

Tags

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Tags.parquet").show()
+-------+--------------+---+----------+-----------+
| _Count|_ExcerptPostId|_Id|  _TagName|_WikiPostId|
+-------+--------------+---+----------+-----------+
| 303362|       3624959|  1|      .net|    3607476|
|1038358|       3673183|  2|      html|    3673182|
|2130783|       3624960|  3|javascript|    3607052|
| 694921|       3644670|  4|       css|    3644669|
|1381623|       3624936|  5|       php|    3607050|
| 345968|       3624961|  8|         c|    3607013|
|1450818|       3624962|  9|        c#|    3607007|
| 702947|       3624963| 10|       c++|    3606997|
| 216766|       3624964| 12|      ruby|    3607043|
|   6324|       3656743| 14|      lisp|    3656742|
|1597896|       3624965| 16|    python|    3607014|
|1735439|       3624966| 17|      java|    3607018|
| 235211|       3624967| 18|     regex|    3607017|
| 198437|       3624968| 19|       xml|    3607588|
| 613118|       3624969| 21|     mysql|    3607033|
| 570524|       3625226| 22|       sql|    3607304|
|  64654|       4777787| 23|      tsql|    4777786|
|     63|       8355939| 26|        j#|    8355938|
|  24176|       5388160| 27|   browser|    5388159|
|  80168|       4890031| 28|      linq|    4890030|
+-------+--------------+---+----------+-----------+

Users

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Users.parquet").show()
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+
|            _AboutMe|_AccountId|       _CreationDate|        _DisplayName|_DownVotes|     _Id|     _LastAccessDate|           _Location|    _ProfileImageUrl|_Reputation|_UpVotes|_Views|         _WebsiteUrl|
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+
|                null|  17439875|2020-01-02T16:08:...|               FLvME|         0|12642164|2020-01-07T05:58:...|             Vietnam|https://lh3.googl...|          1|       0|     0|                    |
|                null|  17439877|2020-01-02T16:08:...|            BAAGreen|         0|12642165|2020-10-04T17:18:...|                null|https://www.grava...|          1|       0|     0|                null|
|                null|  17439878|2020-01-02T16:09:...|           V Payment|         0|12642166|2020-01-02T16:24:...|                null|https://lh3.googl...|          1|       0|     0|                null|
|<p>In no way allo...|  17439879|2020-01-02T16:09:...|          terlysz306|         0|12642167|2020-01-02T16:09:...|                null|https://www.grava...|          1|       0|     0|http://www.wikido...|
|                null|  17439880|2020-01-02T16:09:...|       Sergey Lemish|         0|12642168|2020-12-04T16:58:...|     Dublin, Ireland|https://www.grava...|          1|       0|     0|                null|
|                null|  17439881|2020-01-02T16:09:...|Genesis Dominic I...|         0|12642169|2020-01-02T16:09:...|                null|https://graph.fac...|          1|       0|     0|                null|
|                null|  17439882|2020-01-02T16:09:...|     Ritesh Bhartiya|         0|12642170|2020-10-25T02:57:...|                null|                null|         21|       0|     6|                null|
|                null|  17439883|2020-01-02T16:09:...|        chmiga092020|         0|12642171|2020-01-16T15:56:...|                null|https://www.grava...|          1|       0|     0|                null|
|                    |  17439884|2020-01-02T16:09:...|            Lee Ford|         0|12642172|2020-12-05T20:31:...|                null|https://www.grava...|         13|       0|     1|https://www.lee-f...|
|                null|  17439886|2020-01-02T16:10:...|          MillerAuto|         0|12642173|2020-01-02T16:10:...|                null|https://www.grava...|          1|       0|     0|                null|
|                null|  17439888|2020-01-02T16:10:...|       Gowtham Reddy|         0|12642174|2020-03-23T18:41:...|                null|https://lh3.googl...|          1|       0|     0|                null|
|                null|  17439889|2020-01-02T16:10:...|        Hamza Soomro|         0|12642175|2020-07-12T11:41:...|   Karachi, Pakistan|https://lh3.googl...|          1|       0|     0|                null|
|                null|  17439890|2020-01-02T16:10:...|      Rajasiva singh|         0|12642176|2020-01-06T13:47:...|                null|https://lh3.googl...|          1|       0|     0|                null|
|                null|  17439762|2020-01-02T16:10:...|  Andrii Pereverziev|         0|12642177|2020-11-30T13:21:...|                null|https://www.grava...|          1|       0|     0|                null|
|                null|  17439892|2020-01-02T16:10:...|        Juned Shaikh|         0|12642179|2020-01-12T15:01:...|                null|https://lh6.googl...|          1|       0|     0|                null|
|                null|  17439893|2020-01-02T16:10:...|          james webb|         0|12642180|2020-01-02T16:10:...|                null|https://lh6.googl...|          1|       0|     0|                null|
|                null|  17439894|2020-01-02T16:11:...|            MasonDDD|         0|12642181|2020-12-03T10:00:...|Canberra ACT, Aus...|https://www.grava...|         25|       1|     3|                null|
|                null|  17439895|2020-01-02T16:11:...|       Satyam Mishra|         0|12642182|2020-12-04T08:11:...|Jabalpur, Madhya ...|https://lh3.googl...|          1|       0|     3|                null|
|                null|  17439896|2020-01-02T16:11:...|    orchocsearchterp|         0|12642183|2020-01-02T16:11:...|                   W|https://www.grava...|          1|       0|     0|                null|
|                    |  17439899|2020-01-02T16:12:...|     Prince_Patel486|         0|12642185|2020-05-30T12:05:...|                null|https://www.grava...|          1|       0|     0|                    |
+--------------------+----------+--------------------+--------------------+----------+--------+--------------------+--------------------+--------------------+-----------+--------+------+--------------------+

Votes

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/Votes.parquet").show()
+-------------+--------------------+---------+--------+-------+-----------+
|_BountyAmount|       _CreationDate|      _Id| _PostId|_UserId|_VoteTypeId|
+-------------+--------------------+---------+--------+-------+-----------+
|         null|2020-05-04T00:00:...|208377486|53104880|   null|          2|
|         null|2020-05-04T00:00:...|208377487| 1722518|   null|          2|
|         null|2020-05-04T00:00:...|208377489|58112953|   null|          2|
|         null|2020-05-04T00:00:...|208377490|55824926|   null|          2|
|         null|2020-05-04T00:00:...|208377491|61599528|   null|          2|
|         null|2020-05-04T00:00:...|208377492|48986909|   null|          2|
|         null|2020-05-04T00:00:...|208377493|55812717|   null|          2|
|         null|2020-05-04T00:00:...|208377494|36952823|   null|          3|
|         null|2020-05-04T00:00:...|208377495|34975690|   null|          2|
|         null|2020-05-04T00:00:...|208377496|48981817|   null|          2|
|         null|2020-05-04T00:00:...|208377497|14428111|   null|          2|
|         null|2020-05-04T00:00:...|208377499|14428084|   null|          2|
|         null|2020-05-04T00:00:...|208377500|21689505|   null|          2|
|         null|2020-05-04T00:00:...|208377501|12468641|   null|          2|
|         null|2020-05-04T00:00:...|208377503|61597117|   null|         16|
|         null|2020-05-04T00:00:...|208377504|26414140|   null|          2|
|         null|2020-05-04T00:00:...|208377506| 1402410|   null|          2|
|         null|2020-05-04T00:00:...|208377507|17991752|   null|          2|
|         null|2020-05-04T00:00:...|208377508|45945829|   null|          2|
|         null|2020-05-04T00:00:...|208377509|61588542|   null|          2|
+-------------+--------------------+---------+--------+-------+-----------+

output_stackoverflow

Last version: 15 January 2021
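In the rows previewed below, the ratio and average columns are consistent with simple divisions involving `number_of_characters`. The sketch below checks that reading against the first row (`_Id` 19). The function name `derived_columns` is hypothetical, and the formulas are inferred from the preview values rather than taken from the pipeline code.

```python
import math


def derived_columns(chars, interpunction, emoji, lines, words):
    """Recompute the ratio/average columns from raw counts (inferred formulas)."""
    return {
        "interpunction_ratio": interpunction / chars,
        "emoji_ratio": emoji / chars,
        "average_line_length": chars / lines,
        "average_word_length": chars / words,
    }


# Raw counts copied from the preview row with _Id 19.
row = derived_columns(chars=3544, interpunction=279, emoji=3386, lines=80, words=1263)

# Values shown in the preview for the same row.
expected = {
    "interpunction_ratio": 0.07872460496613995,
    "emoji_ratio": 0.9554176072234764,
    "average_line_length": 44.3,
    "average_word_length": 2.8060174188440223,
}
for name, value in expected.items():
    # A small tolerance guards against last-digit floating-point differences.
    assert math.isclose(row[name], value, rel_tol=1e-9), name
```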

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/output_stackoverflow.parquet").show()
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+
|  _Id|title_contains_questionmark|title_number_of_characters|has_answer|number_of_characters|number_of_interpunction_characters|number_of_emoji_characters| interpunction_ratio|       emoji_ratio|number_of_lines|average_line_length|number_of_words|average_word_length|creation_seconds|number_of_tags|contains_language_tag|contains_platform_tag|user_age|posts_amount|answered_posts_amount|is_question|
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+
|   19|                      false|                        30|      true|                3544|                               279|                      3386| 0.07872460496613995|0.9554176072234764|             80|               44.3|           1263| 2.8060174188440223|      1217560882|             4|                false|                 true|    3798|           9|                    8|       true|
| 6578|                      false|                        57|      true|                 537|                                12|                       529|  0.0223463687150838|0.9851024208566108|              5|              107.4|            102|  5.264705882352941|      1218244384|             4|                false|                 true|    3540|          73|                   47|       true|
| 9341|                      false|                        62|      true|                1472|                                37|                      1442|0.025135869565217392|0.9796195652173914|             16|               92.0|            286|  5.146853146853147|      1218579063|             2|                 true|                false|  791491|          54|                   44|       true|
| 9750|                      false|                        29|     false|                 258|                                 4|                       242|0.015503875968992248| 0.937984496124031|              9| 28.666666666666668|             68| 3.7941176470588234|      1218625248|             2|                 true|                false|  968362|          22|                   17|       true|
|11574|                       true|                        49|      true|                 336|                                 8|                       332|0.023809523809523808|0.9880952380952381|              3|              112.0|             64|               5.25|      1218735123|             4|                 true|                 true|  279371|           7|                    5|       true|
|16007|                      false|                        48|      true|                 451|                                25|                       429| 0.05543237250554324|0.9512195121951219|             12| 37.583333333333336|            159| 2.8364779874213837|      1219137895|             3|                false|                false|   79969|          17|                   15|       true|
|16828|                       true|                        77|      true|                 206|                                 3|                       198|0.014563106796116505|0.9611650485436893|              5|               41.2|             49|  4.204081632653061|      1219168833|             2|                false|                 true|  101662|          10|                    5|       true|
|19790|                      false|                        52|      true|                 274|                                 5|                       274| 0.01824817518248175|               1.0|              1|              274.0|             54|  5.074074074074074|      1219316456|             2|                false|                false|  270129|          52|                   40|       true|
|21288|                       true|                        69|      true|                  54|                                 1|                        54|0.018518518518518517|               1.0|              1|               54.0|             10|                5.4|      1219348583|             4|                 true|                false|  181970|          18|                   14|       true|
|26719|                      false|                        46|      true|                 597|                                21|                       581|0.035175879396984924|0.9731993299832495|              9|  66.33333333333333|            103|  5.796116504854369|      1219686691|             4|                 true|                false|  840137|          29|                   20|       true|
|33321|                      false|                        19|      true|                 234|                                 7|                       234|0.029914529914529916|               1.0|              1|              234.0|             44|  5.318181818181818|      1219948329|             3|                false|                false|  281269|          74|                   68|       true|
|37628|                      false|                        10|      true|                 144|                                 7|                       138| 0.04861111111111111|0.9583333333333334|              4|               36.0|             29| 4.9655172413793105|      1220251161|             2|                 true|                false| 2577458|          28|                   28|       true|
|37809|                       true|                        39|      true|                 298|                                14|                       282| 0.04697986577181208|0.9463087248322147|              9| 33.111111111111114|             61|  4.885245901639344|      1220262957|             2|                 true|                false| 2336960|          50|                   49|       true|
|39391|                       true|                        60|      true|                 280|                                11|                       280|0.039285714285714285|               1.0|              1|              280.0|             54|  5.185185185185185|      1220351847|             3|                 true|                false|     329|          14|                   13|       true|
|40622|                       true|                        50|      true|                 774|                                14|                       756| 0.01808785529715762|0.9767441860465116|             10|               77.4|            148|   5.22972972972973|      1220381512|             2|                false|                false|   27382|           6|                    2|       true|
|42954|                      false|                        33|      true|                 225|                                 3|                       225|0.013333333333333334|               1.0|              1|              225.0|             47|  4.787234042553192|      1220482662|             2|                false|                false| 1471610|          12|                   11|       true|
|44617|                      false|                        57|      true|                 826|                                22|                       798|0.026634382566585957|0.9661016949152542|             15|  55.06666666666667|            161|  5.130434782608695|      1220550306|             2|                false|                false|  965638|           6|                    6|       true|
|47169|                       true|                       101|      true|                 455|                                 9|                       451| 0.01978021978021978|0.9912087912087912|              3| 151.66666666666666|             79| 5.7594936708860756|      1220657872|             3|                false|                false|  824731|         103|                   91|       true|
|47217|                      false|                        98|      true|                 509|                                10|                       497|0.019646365422396856|0.9764243614931237|              7|  72.71428571428571|            103|  4.941747572815534|      1220662519|             2|                false|                false|  890777|           3|                    3|       true|
|49080|                       true|                        39|      true|                 271|                                 3|                       269| 0.01107011070110701| 0.992619926199262|              2|              135.5|             60|  4.516666666666667|      1220835278|             2|                false|                false|    8491|           0|                    0|       true|
+-----+---------------------------+--------------------------+----------+--------------------+----------------------------------+--------------------------+--------------------+------------------+---------------+-------------------+---------------+-------------------+----------------+--------------+---------------------+---------------------+--------+------------+---------------------+-----------+

pair_correlation

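Calculated using Spearman's rank correlation. Spearman only measures monotonic association, so we cannot rely on the coefficient alone: a value close to zero can still hide a strong non-monotonic relationship. We should also verify that each feature relates monotonically to the label.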

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/pair_correlation.parquet").show()
+--------------------+--------------------+----------+
|         correlation|             feature|     label|
+--------------------+--------------------+----------+
| 0.04164706093979694|title_contains_qu...|has_answer|
|  -0.034707835357526|title_number_of_c...|has_answer|
|-0.00491253326448154|number_of_characters|has_answer|
|0.019044663496529227|number_of_interpu...|has_answer|
|-0.00233838443165...| average_word_length|has_answer|
|-0.19493722195095384|    creation_seconds|has_answer|
|-0.00404959346628...|      number_of_tags|has_answer|
|  0.0701465992728538|contains_language...|has_answer|
|0.020167772743942793|   punctuation_ratio|has_answer|
|6.850081805398476E-4|     number_of_lines|has_answer|
|-0.00491454014554...| average_line_length|has_answer|
|-0.00317468875425...|     number_of_words|has_answer|
|  0.2571830458177784|        posts_amount|has_answer|
|    0.31784655302704|answered_posts_am...|has_answer|
|-0.03127949576091...|contains_platform...|has_answer|
|0.038255967261455755|            user_age|has_answer|
+--------------------+--------------------+----------+
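
The monotonicity caveat above can be illustrated with a small, self-contained sketch. This is plain Python, not the Spark job that produced the table; the helper names (`average_ranks`, `pearson`, `spearman`) and the toy data are assumptions made up for this example. It computes Spearman's coefficient as the Pearson correlation of the ranks and compares a monotonic relationship with a U-shaped one.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value is equal (a tie).
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical toy data, not taken from the StackOverflow dump.
xs = [-3, -2, -1, 0, 1, 2, 3]
monotonic = [x ** 3 for x in xs]  # always increasing in x
u_shaped = [x ** 2 for x in xs]   # fully determined by x, but not monotonic

rho_monotonic = spearman(xs, monotonic)
rho_u_shaped = spearman(xs, u_shaped)
print(rho_monotonic, rho_u_shaped)
```

In this toy case the monotonic series gets a coefficient of 1 while the U-shaped series gets 0, even though both depend entirely on `xs`. A small coefficient in the table above therefore does not rule out a non-monotonic effect of that feature.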

decision_tree (regression)

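Shows the node data of the saved model. Each row is one node of the tree; leaf nodes have leftChild and rightChild set to -1 and a gain of -1.0.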

>>> spark.read.parquet("/user/***REMOVED***/StackOverflow/analysis/regressor_saved.parquet/data").show()
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
| id|         prediction|           impurity|       impurityStats|                gain|leftChild|rightChild|               split|
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
| 31| 0.7547091181863644|0.18512326511272467|[11998.0, 9055.0,...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 32|0.37486982240304373|0.23434243865455417|[1152272.0, 43195...|0.006136729293756998|       33|        48|[2, [1.36361601E9...|
| 33| 0.2512793183428513|0.18813802251600328|[330254.0, 82986....|0.004776013837556381|       34|        41|      [8, [7.5], -1]|
| 34|0.35021977616075056|0.22756588454666435|[108292.0, 37926....|0.002296493147494...|       35|        38|      [8, [2.5], -1]|
| 35| 0.4059573728463728|0.24115598427804388|[46027.0, 18685.0...|0.005602832480530534|       36|        37|      [7, [4.5], -1]|
| 36|0.36253557117138047| 0.2311035308068214|[34438.0, 12485.0...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 37| 0.5349900767969626|0.24877569452574266|[11589.0, 6200.0,...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 38| 0.3090179073315667|0.21352584027998595|[62265.0, 19241.0...|0.005544332988142012|       39|        40|     [7, [10.5], -1]|
| 15| 0.7243969948596283|0.19964598869796799|[10116.0, 7328.0,...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 16| 0.8313752591568763| 0.1401904376187131|[4341.0, 3609.0, ...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 17|  0.726157776534727|0.19885266011286845|[239554.0, 173954...|0.001002674480903...|       18|        25| [6, [445698.5], -1]|
| 18| 0.7766155482669976|0.17348383845694834|[67686.0, 52566.0...| 7.64981619419805E-4|       19|        22|    [6, [686.0], -1]|
| 19| 0.8119030980666201| 0.1527164574164444|[25758.0, 20913.0...| 7.40050812104133E-4|       20|        21|      [7, [0.5], -1]|
| 20| 0.8261989694807769| 0.1435942323096792|[20184.0, 16676.0...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 21| 0.7601363473268747| 0.1823290807994316|[5574.0, 4237.0, ...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 22| 0.7549370349170006|0.18500710822772806|[41928.0, 31653.0...|5.364548292366399E-4|       23|        24|[2, [1.575364746E...|
|  7| 0.7953811770548795|0.16274996024167385|[8054.0, 6406.0, ...|0.001592709755649...|        8|         9|      [7, [4.5], -1]|
|  8| 0.7653239929947461|0.17960317874132392|[5139.0, 3933.0, ...|                -1.0|       -1|        -1|        [-1, [], -1]|
|  9| 0.8483704974271012| 0.1286379965223941|[2915.0, 2473.0, ...|                -1.0|       -1|        -1|        [-1, [], -1]|
| 10|  0.628404926205963| 0.2335121749260412|[181722.0, 114195...|0.001418629698850...|       11|        14|      [7, [3.5], -1]|
+---+-------------------+-------------------+--------------------+--------------------+---------+----------+--------------------+
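
The rows above can be read as a tree by following `leftChild` and `rightChild`. The sketch below walks a three-node subtree (nodes 35, 36 and 37, copied from the preview) in plain Python. It assumes that `split` holds (feature index, threshold list, number of categories), that a category count of -1 marks a continuous split, and that a row goes left when its feature value is less than or equal to the threshold, which is how Spark ML stores and evaluates continuous splits as far as we know. The `predict` helper and the feature values passed to it are hypothetical.

```python
def predict(nodes, root_id, features):
    """Follow continuous splits from root_id down to a leaf; return its prediction.

    nodes:    dict mapping node id -> row with prediction, leftChild, rightChild, split
    features: dict mapping feature index -> numeric value
    """
    node = nodes[root_id]
    while node["leftChild"] != -1:  # leaves have both children set to -1
        feature_index, thresholds, num_categories = node["split"]
        if num_categories != -1:
            # Assumption: -1 means a continuous split; anything else is categorical.
            raise ValueError("categorical splits are not handled in this sketch")
        go_left = features[feature_index] <= thresholds[0]
        node = nodes[node["leftChild"] if go_left else node["rightChild"]]
    return node["prediction"]


# Nodes 35, 36 and 37 as shown in the preview (unused columns omitted).
nodes = {
    35: {"prediction": 0.4059573728463728, "leftChild": 36, "rightChild": 37,
         "split": (7, [4.5], -1)},
    36: {"prediction": 0.36253557117138047, "leftChild": -1, "rightChild": -1,
         "split": (-1, [], -1)},
    37: {"prediction": 0.5349900767969626, "leftChild": -1, "rightChild": -1,
         "split": (-1, [], -1)},
}

# Hypothetical inputs: only feature 7 matters inside this subtree.
low = predict(nodes, 35, {7: 3.0})   # 3.0 <= 4.5, so the walk ends in node 36
high = predict(nodes, 35, {7: 6.0})  # 6.0 > 4.5, so the walk ends in node 37
print(low, high)
```

Which column feature index 7 refers to depends on the order in which the features were assembled for training, which this preview does not show.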