Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. Use cases don't need to be complex, and they can be proprietary to a particular application.

A wide variety of public datasets can be used for this challenge; a sampling is shown in the table below.

If you would like to use a dataset that is not listed here, please submit an issue to add it to this table. If you are using your own data that you are willing to make public but that has no source URL, write 'N/A' in the URL column.
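Before submitting an issue, it can help to check that a proposed entry has the same five columns as the table below. The following is a minimal sketch, not part of any Slingshot tooling: the helper name `parse_row`, the tab-separated layout, and the example dataset are assumptions made for illustration, and the actual `datasets.md` file may use a different table syntax.

```python
# Hypothetical helper (an assumption, not Slingshot tooling): checks that a
# proposed row has the five columns used by the table in this document.
COLUMNS = ["Name", "Description", "Size", "Format", "URL"]


def parse_row(line: str) -> dict:
    """Split a tab-separated row into the five table columns."""
    fields = [field.strip() for field in line.split("\t")]
    if len(fields) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
        )
    row = dict(zip(COLUMNS, fields))
    if not row["Name"]:
        raise ValueError("Name must not be empty")
    url = row["URL"]
    # 'N/A' is accepted for self-published data with no source URL.
    if url != "N/A" and not url.startswith(("http://", "https://")):
        raise ValueError(f"URL must be http(s) or 'N/A': {url!r}")
    return row


# Made-up example entry for data with no source URL.
example = "My Public Dataset\tSensor readings I collected\t12 GB\tCSV\tN/A"
print(parse_row(example))
```

A row with a missing column raises `ValueError`, which makes it easy to catch an entry that would break the table layout.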

Name	Description	Size	Format	URL
163 source Dataset	NetEase Open Source Mirror Station	-	iso	https://mirrors.163.com
COVID-19 Open Research Dataset	An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House	19 GB	JSON	https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge
Chest X-Ray Images (Pneumonia)	5,863 images, 2 categories	2.29 GB	JPEG	https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
Huge Stock Market Dataset	Historical daily prices and volumes of all U.S. stocks and ETFs	772 MB	CSV	https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
Condensed Movies	A large-scale video dataset, featuring clips from movies with detailed captions.	250 GB	Video	https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/
USENET (2005-2011)	Compressed USENET posts	36 GB	Text	http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html
Sloan Digital Sky Survey	Three dimensional view of the universe	273 TB	Various	https://www.sdss.org/
GHTorrent Project	A scalable, queryable, offline mirror of data offered through the GitHub REST API.	18 TB	MySQL	https://ghtorrent.org/
Free Music Archive	106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres	879 GB	MP3	https://github.com/mdeff/fma
Open Images Dataset	9 million URLs to images that have been annotated with labels spanning over 6000 categories	18 TB	PNG	https://storage.googleapis.com/openimages/web/index.html
Internet Archive	a digital library of Internet sites and other cultural artifacts in digital form	45 PB	Various	https://archive.org/
Common Crawl	An open repository of web crawl data	235 TB	WARC	https://commoncrawl.org/
Noisy speech database	Used for training speech enhancement algorithms and TTS models	14 GB	WAV	https://datashare.is.ed.ac.uk/handle/10283/2791
NFL play-by-play	The data has three tables: teams, players, and plays.	2.54 GB	Text	https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play
NYC Trip Record Data	Trip records with fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.	267 GB	CSV	https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
National Cancer Institute	Cancer data for analysis	18.46 TB	JSON	https://portal.gdc.cancer.gov/repository
Public Blockchain Datasets	Blockchain data from cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, IoTex.	9 TB	Various	https://github.com/blockchain-etl/public-datasets
Landsat 8	Multispectral time series satellite imagery of all land on Earth since 2013	1.3 PB (estimated)	GeoTIFF + metadata - sample scene	https://registry.opendata.aws/landsat-8/#usageexamples
Docker Images	Docker container images that are published on Docker Hub	167 TB	images	https://hub.docker.com/
Filecoin Proofs	-	224 GB	-	https://proofs.filecoin.io/
Filecoin Trusted Setup	-	2.05 TB	-	https://trusted-setup.filecoin.io/
Audius	-	GB	MP3	https://www.audius.com/
Flickr Commons	The key goal of The Commons is to share hidden treasures from the world's public photography archives.	50 TB	jpeg	https://www.flickr.com/commons
Arxiv	Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more.	-	PDF	https://arxiv.org/
Audius	An American decentralized music platform developing the first community-owned and artist-controlled Music sharing protocol.	-	MP3	https://audius.co/
Blackbird Dataset	A large-scale dataset for UAV perception in aggressive flight	4.79 TB	-	https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656
Linux ISO	Linux ISO Images	-	ISO	https://www.linuxlookup.com/linux_iso
ArchLinux	ArchLinux packages repository	56 GB	Various	https://wiki.archlinux.org/index.php/Mirrors
CentOS	CentOS packages repository	200 GB	Various	http://mirror.sesp.northwestern.edu/centos/
Data is Plural	A variety of public, structured data sets.	-	Various	https://tinyletter.com/data-is-plural/archive
Tencent Corpus for Chinese Words and Phrases	Meant for AI purposes	6.3 GB	Various	https://ai.tencent.com/ailab/nlp/en/embedding.html
R-fMRI Maps Project	Medical data from neurological imaging	-	Various	http://mrirc.psych.ac.cn/RfMRIMaps
National Palace Museum (Taiwan)	A variety of museum artifacts	-	Various	https://theme.npm.edu.tw/opendata/
Congressional Datasets	Videos of meetings as well as textual legislative data.	-	Various	https://www.congress.gov/
Unsplash	The internet’s source of freely-usable images.	931 MB	jpeg	https://unsplash.com/
Project Gutenberg	online library of free eBooks - english	60GB	various	https://www.gutenberg.org
Monolith VR Materials	Self filmed materials and the produced VR videos	800TB	Video	http://ipfsnb.io
Starry Sky in Yunnan	meteorological and astronomical data	10PiB	tar,fits	http://hlmxy.file123.pro:9006
ImageNet	an image database organized according to the WordNet hierarchy	1.2T	jpeg	http://www.image-net.org/
Github	Public code hosting platform	20TB	Git repositories / plain text	https://github.com
IPUMS	Global census data	-	Structured data	https://ipums.org/
Kaggle datasets	Various public datasets used for training machine learning models	-	Varies	https://www.kaggle.com/datasets
Amazon datasets	Various public datasets used for research	-	Varies	https://registry.opendata.aws/
Udacity Self-Driving Car data	Data used for training self-driving machine learning models	~285GB	-	https://github.com/udacity/self-driving-car/tree/master/datasets
Million Song Dataset	NSF-funded public music dataset for research	280GB	-	http://millionsongdataset.com/
The nuScenes dataset	The nuScenes dataset is a large-scale autonomous driving dataset.	350G	jpeg	https://www.nuscenes.org/nuscenes
The Boxy Vehicles Dataset	A large vehicle detection dataset with almost two million annotated vehicles for training and evaluating object detection methods for self-driving cars on freeways.	1T	image	https://boxy-dataset.com/boxy/
TrackingNet	A Large-Scale Dataset and Benchmark for Object Tracking in the Wild.	970G	image	https://tracking-net.org/
A2D2	The Audi Autonomous Driving Dataset (A2D2) to support startups and academic researchers working on autonomous driving.	1.9T	point cloud, image	https://www.a2d2.audi/a2d2/en.html
KITTI-raw data	Autonomous Driving	442G	point cloud, image	http://www.cvlibs.net/datasets/kitti/raw_data.php
NEAR-VI-Dataset	The NetEase AR Oriented Visual Inertial Dataset	175G	gif	https://github.com/EZXR-Research/NEAR-VI-Dataset
Top 100 Crypto Investor Dataset	Crypto price and project analytics	9 GB	Various	https://www.kaggle.com/georgemac510/top-100-crypto-dataset
Common Voice	Common Voice is Mozilla's initiative to help teach machines how real people speak.	100G	audio	https://commonvoice.mozilla.org/en/datasets
TAO	TAO is a federated dataset for Tracking Any Object, containing 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average.	225G	video	http://taodataset.org/
OTW	The Out the Window (OTW) dataset is a crowdsourced activity dataset containing 5,668 instances of 17 activities from the NIST Activities in Extended Video (ActEV) challenge.	48G	video	https://stresearch.github.io/otw/
Waymo	The Waymo Open Dataset is comprised of high resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. We are releasing this dataset publicly to aid the research community in making advancements in machine perception and self-driving technology.	1.2T	point cloud, image	https://waymo.com/open/
IMDB-WIKI	IMDB-WIKI – 500k+ face images with age and gender labels	276G	image	https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/
Genomic Data Commons	Genomic, epigenomic, transcriptomic, and proteomic data from the National Genome Atlas Program	2.5 PB	JSON	https://portal.gdc.cancer.gov
OpenStreetMap	A collaborative project to create a free editable map of the world	40 GB	JSON	https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?filter=solution-type%3Adataset&filter=category%3Atransportation&id=88e087d0-5f92-4407-8dcc-5577bd06d776
Wikipedia	A multilingual open-collaborative online encyclopedia created and maintained by a community of volunteer editors using a wiki-based editing system	18.9 GB	JSON	https://portal.gdc.cancer.gov
openFDA	Open datasets from the US Food and Drug Administration	N/A	JSON	https://open.fda.gov/data/downloads/
Amateur radio	Amateur Radio Software	60.0 GB TB	JSON	https://bigquery.cloud.google.com/table/dataproc-fun:wsprnet.all_wsprnet_data?pli=1&tab=details
Reddit	Collection of Reddit posts and comments	546 GB	JSON	https://console.cloud.google.com/bigquery?utm_source=bqui&utm_medium=link&utm_campaign=classic
Dota 2	Open data around the Dota Game platform	500 GB	JSON	https://www.opendota.com
AVSpeech: Large-scale Audio-Visual Speech Dataset	large-scale audio-visual dataset comprising speech video clips with no interfering background noises	1.50 TB GB	N/A	https://academictorrents.com/details/b078815ca447a3e4d17e8a2a34f13183ec5dec41
Google Open Images	9 million URLs to images that have been annotated with labels spanning over 6000 categories	456 GB	image	https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b
UC Berkeley Computer Science Courses	An archive of UC Berkeley Computer Science Courses	446 GB	Video	https://academictorrents.com/details/5e84be34f69b1a313f6dcb51667edf238d5d4412
Functional Map of the World	Satellite images of the world	352 GB	image	https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b
Netease Cloud Music	Online music service featuring playlists, social networking, brand recommendations, and music fingerprints	-	Audio	https://music.163.com
Movie Heaven	Movie Heaven is a large online movie broadcasting platform in China	-	Video	https://www.dytt8.net
COCO	COCO is a large-scale object detection, segmentation, and captioning dataset.	-	ZIP	https://cocodataset.org
Google Cloud Public Datasets	Uncover new insights with high-demand public datasets	-	Varies	https://cloud.google.com/public-datasets
