This is a data research project on how blockchain development has changed over the years, based on data dumps from StackOverflow.com, the most popular programmers' forum.
If you want to use this notebook for your own research, you need to:
- Know Python and UNIX shell basics
- Have Python 3.11 installed
- Have Poetry installed
Check out the large files with git-lfs:
git clone ...
cd blockchain-stackoverflow
git lfs install
git lfs pull
Create the Python environment with Poetry:
poetry shell
poetry install
After you have the Python environment and the large files set up, you can open research.ipynb in your notebook editor (e.g. Visual Studio Code) and point the Python interpreter to the environment created by Poetry.
Alternatively, you can open the notebook with stock Jupyter in your web browser:
jupyter notebook research.ipynb
We supply ./blockchain-questions.parquet with the GitHub repository. You might want to update this dataset once StackOverflow starts re-publishing its data dumps.
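For example, here is a minimal sketch of loading the supplied dataset into Pandas (assuming pandas and a parquet engine such as pyarrow are available in the Poetry environment):

# Load the supplied dataset and peek at its contents
import pandas as pd

questions = pd.read_parquet("blockchain-questions.parquet")
print(questions.shape)    # number of questions and columns
print(questions.columns)  # see which columns the dataset carries
print(questions.head())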
To re-create the dataset you need ~200 GB of free disk space. We recommend working on a remote server using the Visual Studio Code remote extensions.
We need two datasets:
- Posts dataset
- Tags dataset
First we need to create a tag name -> primary key mapping we can use to navigate the StackOverflow posts dump.
Create a tags CSV file we can import into Pandas:
wget -O stackoverflow.com-Tags.7z https://archive.org/download/stackexchange/stackoverflow.com-Tags.7z
7z x stackoverflow.com-Tags.7z
./converter --source-path Tags.xml --result-format csv --store-to-dir csv
Then we create tags.parquet using our script:
python blockchain_stackoverflow/tag_map.py
This will create tags.parquet and also output post counts for our tags:
ethereum with 6681 posts
blockchain with 6637 posts
solidity with 6534 posts
svelte with 4932 posts
hyperledger with 3938 posts
smartcontracts with 2989 posts...
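Conceptually, the mapping step looks like the following sketch (illustrative only, not the actual tag_map.py; it assumes the converter wrote csv/Tags.csv with the Id, TagName and Count columns from the Tags.xml schema, and the tag list here is just an example subset):

# Illustrative sketch of building a tag name -> primary key mapping
import pandas as pd

tags = pd.read_csv("csv/Tags.csv")

# Example subset of tags we treat as blockchain-related
BLOCKCHAIN_TAGS = ["ethereum", "blockchain", "solidity", "hyperledger", "smartcontracts"]

selected = tags[tags["TagName"].isin(BLOCKCHAIN_TAGS)]
tag_ids = dict(zip(selected["TagName"], selected["Id"]))  # name -> primary key

selected.to_parquet("tags.parquet")
for _, row in selected.sort_values("Count", ascending=False).iterrows():
    print(f"{row['TagName']} with {row['Count']} posts")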
We now need to get all StackOverflow questions into a CSV file.
Download using BitTorrent, so that you do not die of old age waiting for the download to finish.
cd download
npm install
node_modules/.bin/webtorrent --select stackoverflow.com-Posts.7z stackexchange_archive.torrent
# 658 = index for Posts.7z
node_modules/.bin/webtorrent --select 658 stackexchange_archive.torrent
Or download over HTTPS:
wget -O download/stackexchange/stackoverflow.com-Posts.7z https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z
And then after two hours:
7z x download/stackexchange/stackoverflow.com-Posts.7z
./converter --source-path Posts.xml --result-format csv --store-to-dir csv
rm Posts.xml # Save 95 GB space
As the full posts dataset is too large to read into RAM, we use a chunked reader to create a smaller dataset, blockchain-posts.parquet, containing ~25k blockchain questions and weighing around 25 MB:
ipython create-reduced-dataset.ipynb # Or run in Visual Studio Code
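The chunked filtering works roughly as in the sketch below (illustrative only; the real notebook may differ, and it assumes the converter wrote csv/Posts.csv with questions carrying their tags in a Tags column formatted like <ethereum><solidity>):

# Illustrative sketch of chunked filtering of blockchain questions
import pandas as pd

BLOCKCHAIN_TAGS = ["ethereum", "blockchain", "solidity", "hyperledger", "smartcontracts"]
pattern = "|".join(f"<{tag}>" for tag in BLOCKCHAIN_TAGS)

matches = []
for chunk in pd.read_csv("csv/Posts.csv", chunksize=1_000_000):
    mask = chunk["Tags"].fillna("").str.contains(pattern, regex=True)
    matches.append(chunk[mask])

pd.concat(matches).to_parquet("blockchain-posts.parquet")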
Because StackOverflow itself is in decline, we need to separate StackOverflow's overall decline from a possible decline of blockchain.
For this purpose, we create a time-series that contains monthly binned question counts of all StackOverflow posts.
We do this with our notebook, which also displays a graph of the question counts:
ipython create-baseline.ipynb # Or use Visual Studio Code
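The baseline computation is roughly the following sketch (illustrative only; it assumes the Posts CSV has CreationDate and PostTypeId columns, with PostTypeId == 1 marking questions, and that matplotlib is installed for plotting):

# Illustrative sketch of monthly binned question counts for all of StackOverflow
import pandas as pd

monthly_counts = []
for chunk in pd.read_csv(
    "csv/Posts.csv",
    usecols=["CreationDate", "PostTypeId"],
    parse_dates=["CreationDate"],
    chunksize=1_000_000,
):
    questions = chunk[chunk["PostTypeId"] == 1]
    monthly_counts.append(questions.set_index("CreationDate").resample("M").size())

baseline = pd.concat(monthly_counts).groupby(level=0).sum()
baseline.plot(title="All StackOverflow questions per month")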
First, let's convert the notebook to static HTML:
jupyter nbconvert --to=html --no-input --embed-images --output-dir html-export research.ipynb
Then you can open html-export/research.html in your web browser and copy-paste the content into the Ghost blog post editor.