
Resolverflow

Stackoverflow, a programmer's best friend. Well, if you get an answer, and if it is a useful one.

We will perform a data analysis on the StackOverflow dataset to find out how you can best formulate your question. We aim to find features that help you get a resolution as quickly as possible. Features will be ordinal and categorical, taken from the literal dataset values but also from some custom NLP. Let's make some Stackoverflow clickbait 😎

Dataset

The https://archive.org/download/stackexchange dataset is uploaded to an HDFS. convert_dataset.py converts the XML dataset into a .parquet file that is already partitioned, which makes loading a lot faster. The Parquet structures of all generated files can be found at https://github.com/WeersProductions/resolverflow/blob/master/dataFramePreviews.md .
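For reference, the conversion boils down to reading the dump's row elements and writing them back out as partitioned Parquet. A minimal sketch, assuming the spark-xml package is available; the paths and partition count are illustrative, not the values used by convert_dataset.py:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("convert_dataset_sketch").getOrCreate()

# Each StackExchange dump file stores one record per <row .../> element;
# spark-xml exposes XML attributes as columns prefixed with "_" (e.g. _Id).
posts = (spark.read
         .format("xml")
         .option("rowTag", "row")
         .load("hdfs:///stackoverflow/Posts.xml"))

# Repartition before writing so later jobs can load the data in parallel.
posts.repartition(200).write.mode("overwrite").parquet("hdfs:///stackoverflow/Posts.parquet")
```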

Project overview

The project is divided into the following folders:

  • features
  • analysis
  • analysis/local
  • util

Features

Responsible for collecting features from the big StackOverflow dataset. Uses Spark to fetch the features. Each file contains a group of features and can be spark-submitted on its own to gather just those features. However, to run all features at once and combine them into one resulting dataset, run_all.py can be used. Users can define which feature groups should be extracted and run_all.py will combine them automatically.
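Conceptually, combining feature groups amounts to joining the per-group dataframes on the shared _Id column. A rough sketch of that idea (the function and variable names are hypothetical, not the actual run_all.py API):

```python
from functools import reduce

def combine_feature_groups(dataframes):
    """Join a list of per-group feature dataframes on the shared _Id column."""
    return reduce(lambda left, right: left.join(right, on="_Id", how="inner"), dataframes)

# combined = combine_feature_groups([title_features, tag_features, nlp_features])
# combined.write.mode("overwrite").parquet("output_stackoverflow.parquet")
```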

To add a feature, create a new file and add your function definition. It should receive a Spark context that can be used to interact with the Spark cluster. The function should return a dataframe with at least one column: _Id, the Id of the post. Note: if you are using PostHistory.parquet as a data source, be sure to use _PostId and rename that column to _Id.
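A minimal sketch of such a feature file, assuming the spark argument is a SparkSession; the function names, feature columns and Parquet paths are illustrative:

```python
from pyspark.sql import functions as F

def get_title_length(spark):
    """Return _Id plus one illustrative feature column."""
    posts = spark.read.parquet("Posts.parquet")
    return posts.select(
        F.col("_Id"),
        F.length(F.col("_Title")).alias("title_length"),
    )

def get_edit_count(spark):
    """PostHistory-based features must rename _PostId to _Id."""
    history = spark.read.parquet("PostHistory.parquet")
    return (history.groupBy("_PostId").count()
            .withColumnRenamed("_PostId", "_Id")
            .withColumnRenamed("count", "edit_count"))
```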

Analysis

Responsible for analyzing the features after feature collection has been done. These scripts read from an output_stackoverflow.parquet file, which contains the extracted features.

  • correlation.py
    Calculates the correlation between a feature and the label (see the sketch after this list).
  • decision_tree.py
    Contains code to train and evaluate a decision tree (either as a classifier or as a regressor). The features that should be used can be selected.
  • swashbuckler.py
    Bucketizes the input to be used for graphs.
  • vif.py
    Used to remove features whose VIF (variance inflation factor) is too high. Calculates the VIF of pairs of features and also the VIF obtained when a single feature is removed from the full feature set.
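As an illustration of the correlation step, a small sketch; the feature and label column names are assumptions and depend on which feature groups were extracted:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("correlation_sketch").getOrCreate()
features = spark.read.parquet("output_stackoverflow.parquet")

# Pearson correlation between one feature and a 0/1 "resolved" label.
features = features.withColumn("label", F.col("has_accepted_answer").cast("double"))
print(features.stat.corr("title_length", "label"))
```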

Analysis/local

These scripts are meant to be run on a local machine. They use .pickle files (small data) and can generate plots.

  • qq_plots_plot.py Generates qq plots for features. Different distributions can be plotted against a feature.
  • swashbuckler_plot.py Generates histograms of a feature for both resolved and unresolved questions (sketched below).
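A rough local sketch of the resolved-vs-unresolved histogram idea (file and column names are illustrative, not the actual swashbuckler_plot.py interface):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_pickle("output_stackoverflow.pickle")

# One histogram per label value, overlaid for comparison.
for resolved, group in df.groupby("has_accepted_answer"):
    plt.hist(group["title_length"], bins=30, alpha=0.5,
             label="resolved" if resolved else "unresolved")

plt.xlabel("title_length")
plt.ylabel("number of questions")
plt.legend()
plt.show()
```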

Util

Utility scripts. Used to e.g. convert parquet files to pickle files, or to join several .parquet files together into a single .parquet file.
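For example, a Parquet-to-pickle conversion for local plotting can be as small as the following sketch (paths are illustrative; the actual utility scripts live in util/):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet_to_pickle").getOrCreate()

# Collect the (small) feature table to the driver and store it for local plotting.
spark.read.parquet("output_stackoverflow.parquet").toPandas().to_pickle("output_stackoverflow.pickle")
```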
