Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

Awesome Big Data
Other Awesome Lists

Frameworks

Apache Hadoop - framework for distributed processing. Integrated MapReduce, YARN and HDFS.

Distributed Programming

AddThis Hydra - distributed data processing and storage system.
AMPLab SIMR - run Spark on Hadoop MapReduce v1.
Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
Apache Gora - framework for in-memory data model and persistence.
Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Apache Pig - high level language to express data analysis programs for Hadoop.
Apache S4 - framework for stream processing, implementation of S4.
Apache Spark - framework for in-memory cluster computing.
Apache Spark Streaming - framework for stream processing, part of Spark.
Apache Storm - framework for stream processing by Twitter, also runs on YARN.
Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
Cascalog - data processing and querying library.
Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
Concurrent Cascading - framework for data management/analytics on Hadoop.
Damballa Parkour - MapReduce library for Clojure.
Datasalt Pangool - alternative MapReduce paradigm.
DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
Facebook Corona - Hadoop enhancement which removes single point of failure.
Facebook Peregrine - MapReduce framework.
Facebook Scuba - distributed in-memory datastore.
Google MapReduce - MapReduce framework.
Google MillWheel - fault tolerant stream processing framework.
HadoopDB - hybrid of MapReduce and DBMS.
JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
Metamarkets Druid - framework for real-time analysis of large datasets.
Netflix PigPen - MapReduce for Clojure which compiles to Apache Pig.
Nokia Disco - MapReduce framework developed by Nokia.
Pydoop - Python MapReduce and HDFS API for Hadoop.
Stratosphere - general purpose cluster computing framework.
Twitter Scalding - Scala library for MapReduce jobs, built on Cascading.
Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.

Distributed Filesystem

Apache HDFS - a way to store large files across multiple machines.
Ceph Filesystem - software storage platform.
BeeGFS - formerly FhGFS, parallel distributed file system.
Facebook Haystack - object storage system.
Google Colossus - distributed filesystem (GFS2).
Google GFS - distributed filesystem.
Google Megastore - scalable, highly available storage.
GridGain - GGFS, Hadoop compliant in-memory file system.
Lustre file system - high-performance distributed filesystem.
Quantcast File System QFS - open-source distributed file system.
Red Hat GlusterFS - scale-out network-attached storage file system.
Tachyon - reliable file sharing at memory speed across cluster frameworks.

Column Data Model

Actian Vector - column-oriented analytic database.
Apache Accumulo - distributed key/value store, built on Hadoop.
Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
Apache HBase - column-oriented distributed datastore, inspired by BigTable.
C-Store - column-oriented DBMS.
Facebook HydraBase - evolution of HBase made by Facebook.
Google BigTable - column-oriented distributed datastore.
Google Cloud Datastore - fully managed, schemaless database for storing non-relational data, built on BigTable.
Hypertable - column-oriented distributed datastore, inspired by BigTable.
InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
MonetDB - column store database.
OhmData C5 - improved version of HBase.
Parquet - columnar storage format for Hadoop.
Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Document Data Model

Crate Data - open source, massively scalable data store that requires zero administration.
Facebook Apollo - Facebook’s Paxos-like NoSQL database.
jumboDB - document oriented datastore over Hadoop.
LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
MongoDB - Document-oriented database system.
RethinkDB - document database that supports queries like table joins and group by.

Key-value Data Model

Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
Edis - protocol-compatible server replacement for Redis.
ElephantDB - Distributed database specialized in exporting data from Hadoop.
EventStore - distributed time series database.
LinkedIn Krati - simple persistent data store with very low latency and high throughput.
LinkedIn Voldemort - distributed key/value storage system.
OpenTSDB - distributed time series database on top of HBase.
Redis - in-memory key-value datastore.
Riak - a decentralized datastore.
Storehaus - library to work with asynchronous key value stores, by Twitter.
Tarantool - an efficient NoSQL database and a Lua application server.

Graph Data Model

Apache Giraph - implementation of Pregel, based on Hadoop.
Apache Spark Bagel - implementation of Pregel, part of Spark.
ArangoDB - multi-model distributed database.
Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
Gremlin - graph traversal language.
Google Cayley - open-source graph database.
Google Pregel - graph processing framework.
GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
GraphX - resilient Distributed Graph System on Spark.
Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
Neo4j - graph database written entirely in Java.
OrientDB - document and graph database.
Phoebus - framework for large scale graph processing.
Titan - distributed graph database, built over Cassandra.
Twitter FlockDB - distributed graph database.

NewSQL Databases

Amazon RedShift - data warehouse service, based on PostgreSQL.
BayesDB - statistic oriented SQL database.
FoundationDB - distributed database, inspired by F1.
Google F1 - distributed SQL database built on Spanner.
Google Spanner - globally distributed semi-relational database.
H-Store - experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
HandlerSocket - NoSQL plugin for MySQL/MariaDB.
InfiniSQL - infinitely scalable RDBMS.
MemSQL - in-memory SQL database with optimized columnar storage on flash.
NuoDB - SQL/ACID compliant distributed database.
Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
SAP HANA - SQL based in-memory database.
SenseiDB - distributed, realtime, semi-structured database.
Sky - database used for flexible, high performance analysis of behavioral data.
SymmetricDS - open source software for both file and database synchronization.

Time-Series Databases

TempoDB - cloud-based time series database.
InfluxDB - open-source distributed time series database.
OpenTSDB - time series database that uses HBase.
KairosDB - similar to OpenTSDB but allows for Cassandra.
Cube - uses MongoDB to store time series data.

SQL-like processing

AMPLAB Shark - data warehouse system for Spark.
Apache Drill - framework for interactive analysis, inspired by Dremel.
Apache HCatalog - table and storage management layer for Hadoop.
Apache Hive - SQL-like data warehouse system for Hadoop.
Apache Phoenix - SQL skin over HBase.
BlinkDB - massively parallel, approximate query engine.
Cloudera Impala - framework for interactive analysis, inspired by Dremel.
Concurrent Lingual - SQL-like query language for Cascading.
Datasalt Splout SQL - full SQL query engine for big datasets.
Facebook PrestoDB - distributed SQL query engine.
Google BigQuery - framework for interactive analysis, implementation of Dremel.
Pivotal HAWQ - SQL-like data warehouse system for Hadoop.
Spark Catalyst - query optimization framework for Spark and Shark.
SparkSQL - Manipulating Structured Data Using Spark.
Splice Machine - a full-featured SQL-on-Hadoop RDBMS with ACID transactions.
Stinger - interactive query for Hive.
Tajo - distributed data warehouse system on Hadoop.

Data Ingestion

Amazon Kinesis - real-time processing of streaming data at massive scale.
Apache Chukwa - data collection system.
Apache Flume - service to manage large amounts of log data.
Apache Kafka - distributed publish-subscribe messaging system.
Apache Samza - stream processing framework, based on Kafka and YARN.
Apache Sqoop - tool to transfer data between Hadoop and a structured datastore.
Cloudera Morphlines - framework that helps with ETL to Solr, HBase and HDFS.
Facebook Scribe - streamed log data aggregator.
Fluentd - tool to collect events and logs.
HIHO - framework for connecting disparate data sources with Hadoop.
Kestrel - distributed message queue system.
LinkedIn Databus - stream of change capture events for a database.
LinkedIn Kamikaze - utility package for compressing sorted integer arrays.
LinkedIn White Elephant - log aggregator and dashboard.
Netflix Suro - log aggregator like Storm and Samza, based on Chukwa.
Pinterest Secor - service implementing Kafka log persistence.

Integrated Development Environments

RStudio - IDE for R.

Service Programming

Akka Toolkit - runtime for distributed, fault-tolerant, event-driven applications on the JVM.
Apache Avro - data serialization system.
Apache Curator - Java libraries for Apache ZooKeeper.
Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
Apache Thrift - framework to build binary protocols.
Apache ZooKeeper - centralized service for process management.
Google Chubby - a lock service for loosely-coupled distributed systems.
LinkedIn Norbert - cluster manager.
OpenMPI - message passing framework.
Serf - decentralized solution for service discovery and orchestration.
[Spotify Luigi](https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more.
Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
Twitter Elephant Bird - libraries for working with LZOP-compressed data.
Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Apache Aurora - service scheduler that runs on top of Apache Mesos.
Apache Falcon - data management framework.
Apache Oozie - workflow job scheduler.
Chronos - distributed and fault-tolerant scheduler.
LinkedIn Azkaban - batch workflow job scheduler.
Sparrow - scheduling platform.

Machine Learning

Apache Mahout - machine learning library for Hadoop.
brain - Neural networks in JavaScript.
Cloudera Oryx - real-time large-scale machine learning.
Concurrent Pattern - machine learning library for Cascading.
convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
Decider - Flexible and Extensible Machine Learning in Ruby.
etcML - text classification with machine learning.
Etsy Conjecture - scalable Machine Learning in Scalding.
H2O - statistical, machine learning and math runtime for Hadoop.
MLbase - distributed machine learning libraries for the BDAS stack.
MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
scikit-learn - machine learning in Python.
Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
WEKA - suite of machine learning software.

Benchmarking

Apache Hadoop Benchmarking - micro-benchmarks for testing Hadoop performance.
Berkeley SWIM Benchmark - real-world big data workload benchmark.
Intel HiBench - a Hadoop benchmark suite.
PUMA Benchmarking - benchmark suite for MapReduce applications.
Yahoo Gridmix3 - Hadoop cluster benchmarking from the Yahoo engineering team.

Security

Apache Knox Gateway - single point of secure access for Hadoop clusters.
Apache Sentry - security module for data stored in Hadoop.

System Deployment

Apache Ambari - operational framework for Hadoop management.
Apache Bigtop - system deployment framework for the Hadoop ecosystem.
Apache Helix - cluster management framework.
Apache Mesos - cluster manager.
Apache Slider - YARN application to deploy existing distributed applications on YARN.
Apache Whirr - set of libraries for running cloud services.
Apache YARN - Cluster manager.
Brooklyn - library that simplifies application deployment and management.
Buildoop - similar to Apache Bigtop, based on the Groovy language.
Cloudera HUE - web application for interacting with Hadoop.
Facebook Prism - multi-datacenter replication system.
Google Borg - job scheduling and monitoring system.
Google Omega - job scheduling and monitoring system.
Hortonworks HOYA - application that can deploy an HBase cluster on YARN.
Marathon - Mesos framework for long-running services.

Applications

Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
Apache Nutch - open source web crawler.
Apache OODT - capturing, processing and sharing of data for NASA’s scientific archives.
Apache Tika - content analysis toolkit.
Eclipse BIRT - Eclipse-based reporting system.
Eventhub - open source event analytics platform.
HIPI Library - API for performing image processing tasks on Hadoop’s MapReduce.
Hunk - Splunk analytics for Hadoop.
MADlib - data-processing library of an RDBMS to analyze data.
PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
Qubole - auto-scaling Hadoop cluster, built-in data connectors.
Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
SparkR - R frontend for Spark.
Splunk - analyzer for machine-generated data.
Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

Apache Lucene - Search engine library.
Apache Solr - Search platform for Apache Lucene.
ElasticSearch - Search and analytics engine based on Apache Lucene.
Facebook Unicorn - social graph search platform.
Google Caffeine - continuous indexing system.
Google Percolator - continuous indexing system.
TeraGoogle - large search index.
HBase Coprocessor - implementation of Percolator, part of HBase.
LinkedIn Bobo - faceted search implementation written purely in Java, an extension to Apache Lucene.
LinkedIn Cleo - flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
LinkedIn Galene - search architecture at LinkedIn.
LinkedIn Zoie - realtime search/indexing system written in Java.
Sphinx Search Server - fulltext search engine.

MySQL forks and evolutions

Amazon RDS - MySQL databases in Amazon’s cloud.
Drizzle - evolution of MySQL 6.0.
Google Cloud SQL - MySQL databases in Google’s cloud.
MariaDB - enhanced, drop-in replacement for MySQL.
MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
Percona Server - enhanced, drop-in replacement for MySQL.
ProxySQL - High Performance Proxy for MySQL.
TokuDB - storage engine for MySQL and MariaDB.
WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

Memcached forks and evolutions

Facebook McDipper - key/value cache for flash storage.
Facebook Memcached - fork of Memcached.
Twemproxy - a fast, light-weight proxy for memcached and redis.
Twitter Fatcache - key/value cache for flash storage.
Twitter Twemcache - fork of Memcached.

Embedded Databases

BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
HanoiDB - Erlang LSM BTree Storage.
LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Chartio - lean business intelligence platform to visualize and explore your data.
Jaspersoft - powerful business intelligence suite.
Jedox Palo - customisable business intelligence platform.
Microsoft - business intelligence software and platform.
Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
Pentaho - business intelligence platform.
Qlik - business intelligence and analytics platform.
Tableau - business intelligence platform.
Spango BI - open source business intelligence platform.

Data Visualization

Arbor - graph visualization library using web workers and jQuery.
Chart.js - open source HTML5 Charts visualizations.
Cubism - JavaScript library for time series visualization.
D3 - JavaScript library for manipulating documents.
Envisionjs - dynamic HTML5 visualization.
Grafana - Graphite dashboard frontend, editor and graph composer.
Graphite - scalable real-time graphing.
Google Charts - simple charting API.
Highcharts - simple and flexible charting API.
Matplotlib - plotting with Python.
NVD3 - chart components for d3.js.
Peity - Progressive bar, line and pie charts.
Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
Sigma.js - JavaScript library dedicated to graph drawing.
Vega - a visualization grammar.

Interesting Readings

Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

Interesting Papers

2013 - 2014

2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
2013 - AMPLab - MLbase: A Distributed Machine-learning System.
2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
2013 - Google - Online, Asynchronous Schema Change in F1.
2013 - Google - F1: A Distributed SQL Database That Scales.
2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
2013 - Facebook - Scuba: Diving into Data at Facebook.
2013 - Facebook - Unicorn: A System for Searching the Social Graph.
2013 - Facebook - Scaling Memcache at Facebook.

2011 - 2012

2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
2012 - Microsoft - Paxos Made Parallel.
2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
2012 - Google - Processing a trillion cells per mouse click.
2012 - Google - Spanner: Google’s Globally-Distributed Database.
2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
2010 - AMPLab - Spark: Cluster Computing with Working Sets.
2010 - Google - Storage Architecture and Challenges.
2010 - Google - Pregel: A System for Large-Scale Graph Processing.
2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (the basis of Percolator and Caffeine).
2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
2010 - Yahoo - S4: Distributed Stream Computing Platform.
2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
2008 - AMPLab - Chukwa: A large-scale monitoring system.
2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
2003 - Google - The Google File System.

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
