Curated Datasets for the Slingshot Competition

Slingshot uses curated datasets to ensure that meaningful data is stored on and retrieved from the Filecoin network. The use cases don’t need to be complex, and they can be proprietary in nature for applications.

A wide variety of public datasets can be used for this challenge; a sampling is shown in the table below.

If you would like to use a dataset that you don't see listed here, please submit an issue to add it to this table. If you are using your own data that you are willing to make public but that has no source URL, write 'N/A' in the URL column.

| Name | Description | Size | Format | URL |
| --- | --- | --- | --- | --- |
| 163 source Dataset | NetEase Open Source Mirror Station | - | ISO | https://mirrors.163.com |
| COVID-19 Open Research Dataset | An AI challenge with AI2, CZI, MSR, Georgetown, NIH & The White House | 19 GB | JSON | https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge |
| Chest X-Ray Images (Pneumonia) | 5,863 images, 2 categories | 2.29 GB | JPEG | https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia |
| Huge Stock Market Dataset | Historical daily prices and volumes of all U.S. stocks and ETFs | 772 MB | CSV | https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs |
| Condensed Movies | A large-scale video dataset featuring clips from movies with detailed captions | 250 GB | Video | https://www.robots.ox.ac.uk/~vgg/research/condensed-movies/ |
| USENET (2005-2011) | Compressed USENET posts | 36 GB | Text | http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html |
| Sloan Digital Sky Survey | Three-dimensional view of the universe | 273 TB | Various | https://www.sdss.org/ |
| GHTorrent Project | A scalable, queryable, offline mirror of data offered through the GitHub REST API | 18 TB | MySQL | https://ghtorrent.org/ |
| Free Music Archive | 106,574 tracks from 16,341 artists and 14,854 albums, arranged in a hierarchical taxonomy of 161 genres | 879 GB | MP3 | https://github.com/mdeff/fma |
| Open Images Dataset | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 18 TB | PNG | https://storage.googleapis.com/openimages/web/index.html |
| Internet Archive | A digital library of Internet sites and other cultural artifacts in digital form | 45 PB | Various | https://archive.org/ |
| Common Crawl | An open repository of web crawl data | 235 TB | WARC | https://commoncrawl.org/ |
| Noisy speech database | Used for training speech enhancement algorithms and TTS models | 14 GB | WAV | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| NFL play-by-play | The data has three tables: teams, players, and plays | 2.54 GB | Text | https://www.dolthub.com/repositories/Liquidata/nfl-play-by-play |
| NYC Trip Record Data | Includes fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts | 267 GB | CSV | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
| National Cancer Institute | Cancer data for analysis | 18.46 TB | JSON | https://portal.gdc.cancer.gov/repository |
| Public Blockchain Datasets | Blockchain data from the cryptocurrencies Bitcoin, Ethereum, Dogecoin, ZCash, Litecoin, Dash, Bitcoin Cash, Ethereum Classic, Tezos, Hedera Hashgraph, and IoTeX | 9 TB | Various | https://github.com/blockchain-etl/public-datasets |
| Landsat 8 | Multispectral time series satellite imagery of all land on Earth since 2013 | 1.3 PB (estimated) | GeoTIFF + metadata - sample scene | https://registry.opendata.aws/landsat-8/#usageexamples |
| Docker Images | Docker container images that are published on Docker Hub | 167 TB | Images | https://hub.docker.com/ |
| Filecoin Proofs | - | 224 GB | - | https://proofs.filecoin.io/ |
| Filecoin Trusted Setup | - | 2.05 TB | - | https://trusted-setup.filecoin.io/ |
| Audius | - | - | MP3 | https://www.audius.com/ |
| Flickr Commons | The key goal of The Commons is to share hidden treasures from the world's public photography archives | 50 TB | JPEG | https://www.flickr.com/commons |
| arXiv | Scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, and more | - | PDF | https://arxiv.org/ |
| Audius | An American decentralized music platform developing the first community-owned and artist-controlled music sharing protocol | - | MP3 | https://audius.co/ |
| Blackbird Dataset | A large-scale dataset for UAV perception in aggressive flight | 4.79 TB | - | https://academictorrents.com/details/eb542a231dbeb2125e4ec88ddd18841a867c2656 |
| Linux ISO | Linux ISO images | - | ISO | https://www.linuxlookup.com/linux_iso |
| ArchLinux | ArchLinux packages repository | 56 GB | Various | https://wiki.archlinux.org/index.php/Mirrors |
| CentOS | CentOS packages repository | 200 GB | Various | http://mirror.sesp.northwestern.edu/centos/ |
| Data is Plural | A variety of public, structured datasets | - | Various | https://tinyletter.com/data-is-plural/archive |
| Tencent Corpus for Chinese Words and Phrases | Meant for AI purposes | 6.3 GB | Various | https://ai.tencent.com/ailab/nlp/en/embedding.html |
| R-fMRI Maps Project | Medical data from neurological imaging | - | Various | http://mrirc.psych.ac.cn/RfMRIMaps |
| National Palace Museum (Taiwan) | A variety of museum artifacts | - | Various | https://theme.npm.edu.tw/opendata/ |
| Congressional Datasets | Videos of meetings as well as textual legislative data | - | Various | https://www.congress.gov/ |
| Unsplash | The internet’s source of freely usable images | 931 MB | JPEG | https://unsplash.com/ |
| Project Gutenberg | Online library of free eBooks - English | 60 GB | Various | https://www.gutenberg.org |
| Monolith VR Materials | Self-filmed materials and the produced VR videos | 800 TB | Video | http://ipfsnb.io |
| Starry Sky in Yunnan | Meteorological and astronomical data | 10 PiB | tar, FITS | http://hlmxy.file123.pro:9006 |
| ImageNet | An image database organized according to the WordNet hierarchy | 1.2 TB | JPEG | http://www.image-net.org/ |
| GitHub | Public code hosting platform | 20 TB | Git repositories / plain text | https://github.com |
| IPUMS | Global census data | - | Structured data | https://ipums.org/ |
| Kaggle datasets | Various public datasets used for training machine learning models | - | Varies | https://www.kaggle.com/datasets |
| Amazon datasets | Various public datasets used for research | - | Varies | https://registry.opendata.aws/ |
| Udacity Self-Driving Car data | Data used for training self-driving machine learning models | ~285 GB | - | https://github.com/udacity/self-driving-car/tree/master/datasets |
| Million Song Dataset | NSF-funded public music dataset for research | 280 GB | - | http://millionsongdataset.com/ |
| The nuScenes dataset | A large-scale autonomous driving dataset | 350 GB | JPEG | https://www.nuscenes.org/nuscenes |
| The Boxy Vehicles Dataset | A large vehicle detection dataset with almost two million annotated vehicles for training and evaluating object detection methods for self-driving cars on freeways | 1 TB | Image | https://boxy-dataset.com/boxy/ |
| TrackingNet | A large-scale dataset and benchmark for object tracking in the wild | 970 GB | Image | https://tracking-net.org/ |
| A2D2 | The Audi Autonomous Driving Dataset (A2D2), intended to support startups and academic researchers working on autonomous driving | 1.9 TB | Point cloud, image | https://www.a2d2.audi/a2d2/en.html |
| KITTI-raw data | Autonomous driving | 442 GB | Point cloud, image | http://www.cvlibs.net/datasets/kitti/raw_data.php |
| NEAR-VI-Dataset | The NetEase AR Oriented Visual Inertial Dataset | 175 GB | GIF | https://github.com/EZXR-Research/NEAR-VI-Dataset |
| Top 100 Crypto Investor Dataset | Crypto price and project analytics | 9 GB | Various | https://www.kaggle.com/georgemac510/top-100-crypto-dataset |
| Common Voice | Mozilla's initiative to help teach machines how real people speak | 100 GB | Audio | https://commonvoice.mozilla.org/en/datasets |
| TAO | A federated dataset for Tracking Any Object, containing 2,907 high-resolution videos captured in diverse environments, which are half a minute long on average | 225 GB | Video | http://taodataset.org/ |
| OTW | The Out the Window (OTW) dataset is a crowdsourced activity dataset containing 5,668 instances of 17 activities from the NIST Activities in Extended Video (ActEV) challenge | 48 GB | Video | https://stresearch.github.io/otw/ |
| Waymo | The Waymo Open Dataset comprises high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. It was released publicly to aid the research community in making advancements in machine perception and self-driving technology | 1.2 TB | Point cloud, image | https://waymo.com/open/ |
| IMDB-WIKI | 500k+ face images with age and gender labels | 276 GB | Image | https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/ |
| Genomic Data Commons | Genomic, epigenomic, transcriptomic, and proteomic data from the National Genome Atlas Program | 2.5 PB | JSON | https://portal.gdc.cancer.gov |
| OpenStreetMap | A collaborative project to create a free editable map of the world | 40 GB | JSON | https://console.cloud.google.com/marketplace/product/openstreetmap/geo-openstreetmap?filter=solution-type%3Adataset&filter=category%3Atransportation&id=88e087d0-5f92-4407-8dcc-5577bd06d776 |
| Wikipedia | A multilingual open-collaborative online encyclopedia created and maintained by a community of volunteer editors using a wiki-based editing system | 18.9 GB | JSON | https://portal.gdc.cancer.gov |
| openFDA | Open datasets from the US Food and Drug Administration | N/A | JSON | https://open.fda.gov/data/downloads/ |
| Amateur radio | Amateur radio software | 60.0 GB | JSON | https://bigquery.cloud.google.com/table/dataproc-fun:wsprnet.all_wsprnet_data?pli=1&tab=details |
| Reddit | Collection of Reddit posts and comments | 546 GB | JSON | https://console.cloud.google.com/bigquery?utm_source=bqui&utm_medium=link&utm_campaign=classic |
| Dota 2 | Open data around the Dota game platform | 500 GB | JSON | https://www.opendota.com |
| AVSpeech: Large-scale Audio-Visual Speech Dataset | Large-scale audio-visual dataset comprising speech video clips with no interfering background noises | 1.50 TB | N/A | https://academictorrents.com/details/b078815ca447a3e4d17e8a2a34f13183ec5dec41 |
| Google Open Images | 9 million URLs to images that have been annotated with labels spanning over 6000 categories | 456 GB | Image | https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b |
| UC Berkeley Computer Science Courses | An archive of UC Berkeley Computer Science courses | 446 GB | Video | https://academictorrents.com/details/5e84be34f69b1a313f6dcb51667edf238d5d4412 |
| Functional Map of the World | Satellite images of the world | 352 GB | Image | https://academictorrents.com/details/9e9194e21ce045deee8d811481b4cd676b20b06b |
| Netease Cloud Music | Online music service built around playlists, social networking, brand recommendations, and music fingerprints | - | Audio | https://music.163.com |
| Movie Heaven | Movie Paradise is a large online movie broadcasting platform in China | - | Video | https://www.dytt8.net |
| COCO | A large-scale object detection, segmentation, and captioning dataset | - | ZIP | https://cocodataset.org |
| Google Cloud Public Datasets | Uncover new insights with high-demand public datasets | - | Varies | https://cloud.google.com/public-datasets |