Awesome Big Data

A curated list of awesome big data frameworks, resources and other awesomeness. Inspired by awesome-php, awesome-python, awesome-ruby, hadoopecosystemtable & big-data.

Your contributions are always welcome!

Frameworks

  • Apache Hadoop - framework for distributed processing. Integrated MapReduce, YARN and HDFS.

Distributed Programming

  • AddThis Hydra - distributed data processing and storage system.
  • AMPLab SIMR - run Spark on Hadoop MapReduce v1.
  • Apache Crunch - a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu - collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Gora - framework for in-memory data model and persistence.
  • Apache Hama - BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce - programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig - high level language to express data analysis programs for Hadoop.
  • Apache S4 - framework for stream processing, implementation of S4.
  • Apache Spark - framework for in-memory cluster computing.
  • Apache Spark Streaming - framework for stream processing, part of Spark.
  • Apache Storm - framework for stream processing by Twitter, also available on YARN.
  • Apache Tez - application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill - abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog - data processing and querying library.
  • Cheetah - High Performance, Custom Data Warehouse on Top of MapReduce.
  • Concurrent Cascading - framework for data management/analytics on Hadoop.
  • Damballa Parkour - MapReduce library for Clojure.
  • Datasalt Pangool - alternative MapReduce paradigm.
  • DataTorrent StrAM - real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  • Facebook Corona - Hadoop enhancement which removes single point of failure.
  • Facebook Peregrine - Map Reduce framework.
  • Facebook Scuba - distributed in-memory datastore.
  • Google MapReduce - map reduce framework.
  • Google MillWheel - fault tolerant stream processing framework.
  • HadoopDB - hybrid of MapReduce and DBMS.
  • JAQL - declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite - a set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Metamarkets Druid - framework for real-time analysis of large datasets.
  • Netflix PigPen - map-reduce for Clojure which compiles to Apache Pig.
  • Nokia Disco - MapReduce framework developed by Nokia.
  • Pydoop - Python MapReduce and HDFS API for Hadoop.
  • Stratosphere - general purpose cluster computing framework.
  • Twitter Scalding - Scala library for Map Reduce jobs, built on Cascading.
  • Twitter Summingbird - Streaming MapReduce with Scalding and Storm, by Twitter.

Distributed Filesystem

Column Data Model

  • Actian Vector - column-oriented analytic database.
  • Apache Accumulo - distributed key/value store, built on Hadoop.
  • Apache Cassandra - column-oriented distributed datastore, inspired by BigTable.
  • Apache HBase - column-oriented distributed datastore, inspired by BigTable.
  • C-Store - column oriented DBMS.
  • Facebook HydraBase - evolution of HBase made by Facebook.
  • Google BigTable - column-oriented distributed datastore.
  • Google Cloud Datastore - fully managed, schemaless database for storing non-relational data over BigTable.
  • Hypertable - column-oriented distributed datastore, inspired by BigTable.
  • InfiniDB - accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
  • MonetDB - column store database.
  • OhmData C5 - improved version of HBase.
  • Parquet - columnar storage format for Hadoop.
  • Twitter Manhattan - real-time, multi-tenant distributed database for Twitter scale.
  • Vertica - designed to manage large, fast-growing volumes of data and provide very fast query performance when used for data warehouses.

Document Data Model

  • Crate Data - open source, massively scalable data store that requires zero administration.
  • Facebook Apollo - Facebook’s Paxos-like NoSQL database.
  • jumboDB - document oriented datastore over Hadoop.
  • LinkedIn Espresso - horizontally scalable document-oriented NoSQL data store.
  • MarkLogic - Schema-agnostic Enterprise NoSQL database technology.
  • MongoDB - Document-oriented database system.
  • RethinkDB - document database that supports queries like table joins and group by.

Key-value Data Model

  • Amazon DynamoDB - distributed key/value store, implementation of Dynamo paper.
  • Edis - protocol-compatible server replacement for Redis.
  • ElephantDB - Distributed database specialized in exporting data from Hadoop.
  • EventStore - distributed time series database.
  • LinkedIn Krati - simple persistent data store with very low latency and high throughput.
  • LinkedIn Voldemort - distributed key/value storage system.
  • OpenTSDB - distributed time series database on top of HBase.
  • Redis - in memory key value datastore.
  • Riak - a decentralized datastore.
  • Storehaus - library to work with asynchronous key value stores, by Twitter.
  • Tarantool - an efficient NoSQL database and a Lua application server.

Graph Data Model

  • Apache Giraph - implementation of Pregel, based on Hadoop.
  • Apache Spark Bagel - implementation of Pregel, part of Spark.
  • ArangoDB - multi-model distributed database.
  • Facebook TAO - distributed data store widely used at Facebook to store and serve the social graph.
  • Gremlin - graph traversal language.
  • Google Cayley - open-source graph database.
  • Google Pregel - graph processing framework.
  • GraphLab PowerGraph - a core C++ GraphLab API and a collection of high-performance machine learning and data mining toolkits built on top of the GraphLab API.
  • GraphX - resilient Distributed Graph System on Spark.
  • Intel GraphBuilder - tools to construct large-scale graphs on top of Hadoop.
  • Neo4j - graph database written entirely in Java.
  • OrientDB - document and graph database.
  • Phoebus - framework for large scale graph processing.
  • Titan - distributed graph database, built over Cassandra.
  • Twitter FlockDB - distributed graph database.

NewSQL Databases

  • Amazon RedShift - data warehouse service, based on PostgreSQL.
  • BayesDB - statistic oriented SQL database.
  • FoundationDB - distributed database, inspired by F1.
  • Google F1 - distributed SQL database built on Spanner.
  • Google Spanner - globally distributed semi-relational database.
  • H-Store - experimental main-memory, parallel database management system optimized for on-line transaction processing (OLTP) applications.
  • Haeinsa - linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  • HandlerSocket - NoSQL plugin for MySQL/MariaDB.
  • InfiniSQL - infinitely scalable RDBMS.
  • MemSQL - in-memory SQL database with optimized columnar storage on flash.
  • NuoDB - SQL/ACID compliant distributed database.
  • Postgres-XL - Scalable Open Source PostgreSQL-based Database Cluster.
  • SAP HANA - SQL based in-memory database.
  • SenseiDB - distributed, realtime, semi-structured database.
  • Sky - database used for flexible, high performance analysis of behavioral data.
  • SymmetricDS - open source software for both file and database synchronization.

Time-Series Databases

  • TempoDB - cloud-based time series database.
  • InfluxDB - open source distributed time series database.
  • OpenTSDB - time series database that uses HBase.
  • KairosDB - similar to OpenTSDB but allows for Cassandra.
  • Cube - uses MongoDB to store time series data.

SQL-like processing

Data Ingestion

Integrated Development Environments

Service Programming

  • Akka Toolkit - runtime for distributed, and fault tolerant event-driven applications on the JVM.
  • Apache Avro - data serialization system.
  • Apache Curator - Java libraries for Apache ZooKeeper.
  • Apache Karaf - OSGi runtime that runs on top of any OSGi framework.
  • Apache Thrift - framework to build binary protocols.
  • Apache ZooKeeper - centralized service for process management.
  • Google Chubby - a lock service for loosely-coupled distributed systems.
  • LinkedIn Norbert - cluster manager.
  • OpenMPI - message passing framework.
  • Serf - decentralized solution for service discovery and orchestration.
  • Spotify Luigi (https://github.com/spotify/luigi) - a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more.
  • Spring XD - distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  • Twitter Elephant Bird - libraries for working with LZOP-compressed data.
  • Twitter Finagle - asynchronous network stack for the JVM.

Scheduling

Machine Learning

  • Apache Mahout - machine learning library for Hadoop.
  • brain - Neural networks in JavaScript.
  • Cloudera Oryx - real-time large-scale machine learning.
  • Concurrent Pattern - machine learning library for Cascading.
  • convnetjs - Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  • Decider - Flexible and Extensible Machine Learning in Ruby.
  • etcML - text classification with machine learning.
  • Etsy Conjecture - scalable Machine Learning in Scalding.
  • H2O - statistical, machine learning and math runtime for Hadoop.
  • MLbase - distributed machine learning libraries for the BDAS stack.
  • MLPNeuralNet - Fast multilayer perceptron neural network library for iOS and Mac OS X.
  • nupic - Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • PredictionIO - machine learning server built on Hadoop, Mahout and Cascading.
  • scikit-learn - machine learning in Python.
  • Spark MLlib - a Spark implementation of some common machine learning (ML) functionality.
  • Vowpal Wabbit - learning system sponsored by Microsoft and Yahoo!.
  • WEKA - suite of machine learning software.

Benchmarking

Security

System Deployment

Applications

  • Apache Kiji - framework to collect and analyze data in real-time, based on HBase.
  • Apache Nutch - open source web crawler.
  • Apache OODT - capturing, processing and sharing of data for NASA’s scientific archives.
  • Apache Tika - content analysis toolkit.
  • Eclipse BIRT - Eclipse-based reporting system.
  • Eventhub - open source event analytics platform.
  • HIPI Library - API for performing image processing tasks on Hadoop’s MapReduce.
  • Hunk - Splunk analytics for Hadoop.
  • MADlib - data-processing library of an RDBMS to analyze data.
  • PivotalR - R on Pivotal HD / HAWQ and PostgreSQL.
  • Qubole - auto-scaling Hadoop cluster, built-in data connectors.
  • Snowplow - enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
  • SparkR - R frontend for Spark.
  • Splunk - analyzer for machine-generated data.
  • Talend - unified open source environment for YARN, Hadoop, HBase, Hive, HCatalog & Pig.

Search engine and framework

MySQL forks and evolutions

  • Amazon RDS - MySQL databases in Amazon’s cloud.
  • Drizzle - evolution of MySQL 6.0.
  • Google Cloud SQL - MySQL databases in Google’s cloud.
  • MariaDB - enhanced, drop-in replacement for MySQL.
  • MySQL Cluster - MySQL implementation using NDB Cluster storage engine.
  • Percona Server - enhanced, drop-in replacement for MySQL.
  • ProxySQL - High Performance Proxy for MySQL.
  • TokuDB - storage engine for MySQL and MariaDB.
  • WebScaleSQL - collaboration among engineers from several companies that face similar challenges in running MySQL at scale.

Memcached forks and evolutions

Embedded Databases

  • BerkeleyDB - a software library that provides a high-performance embedded database for key/value data.
  • HanoiDB - Erlang LSM BTree Storage.
  • LevelDB - a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
  • LMDB - ultra-fast, ultra-compact key-value embedded data store developed by Symas.
  • RocksDB - embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

  • Chartio - lean business intelligence platform to visualize and explore your data.
  • Jaspersoft - powerful business intelligence suite.
  • Jedox Palo - customisable business intelligence platform.
  • Microsoft - business intelligence software and platform.
  • Microstrategy - software platforms for business intelligence, mobile intelligence, and network applications.
  • Pentaho - business intelligence platform.
  • Qlik - business intelligence and analytics platform.
  • Tableau - business intelligence platform.
  • SpagoBI - open source business intelligence platform.

Data Visualization

  • Arbor - graph visualization library using web workers and jQuery.
  • Chart.js - open source HTML5 Charts visualizations.
  • Cubism - JavaScript library for time series visualization.
  • D3 - JavaScript library for manipulating documents.
  • Envisionjs - dynamic HTML5 visualization.
  • Grafana - graphite dashboard frontend, editor and graph composer.
  • Graphite - scalable Realtime Graphing.
  • Google Charts - simple charting API.
  • Highcharts - simple and flexible charting API.
  • Matplotlib - plotting with Python.
  • NVD3 - chart components for d3.js.
  • Peity - Progressive bar, line and pie charts.
  • Recline - simple but powerful library for building data applications in pure JavaScript and HTML.
  • Sigma.js - JavaScript library dedicated to graph drawing.
  • Vega - a visualization grammar.

Interesting Readings

  • Big Data Benchmark - Benchmark of Redshift, Hive, Shark, Impala and Stinger/Tez.
  • NoSQL Comparison - Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Couchbase vs Neo4j vs Hypertable vs ElasticSearch vs Accumulo vs VoltDB vs Scalaris comparison.

Interesting Papers

2013 - 2014

  • 2013 - AMPLab - Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices.
  • 2013 - AMPLab - MLbase: A Distributed Machine-learning System.
  • 2013 - AMPLab - Shark: SQL and Rich Analytics at Scale.
  • 2013 - AMPLab - GraphX: A Resilient Distributed Graph System on Spark.
  • 2013 - Google - HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm.
  • 2013 - Microsoft - Scalable Progressive Analytics on Big Data in the Cloud.
  • 2013 - Metamarkets - Druid: A Real-time Analytical Data Store.
  • 2013 - Google - Online, Asynchronous Schema Change in F1.
  • 2013 - Google - F1: A Distributed SQL Database That Scales.
  • 2013 - Google - MillWheel: Fault-Tolerant Stream Processing at Internet Scale.
  • 2013 - Facebook - Scuba: Diving into Data at Facebook.
  • 2013 - Facebook - Unicorn: A System for Searching the Social Graph.
  • 2013 - Facebook - Scaling Memcache at Facebook.

2011 - 2012

  • 2012 - AMPLab - Blink and It’s Done: Interactive Queries on Very Large Data.
  • 2012 - AMPLab - Fast and Interactive Analytics over Hadoop Data with Spark.
  • 2012 - AMPLab - Shark: Fast Data Analysis Using Coarse-grained Distributed Memory.
  • 2012 - Microsoft - Paxos Replicated State Machines as the Basis of a High-Performance Data Store.
  • 2012 - Microsoft - Paxos Made Parallel.
  • 2012 - AMPLab - BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data.
  • 2012 - Google - Processing a trillion cells per mouse click.
  • 2012 - Google - Spanner: Google’s Globally-Distributed Database.
  • 2011 - AMPLab - Scarlett: Coping with Skewed Popularity Content in MapReduce Clusters.
  • 2011 - AMPLab - Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center.
  • 2011 - Google - Megastore: Providing Scalable, Highly Available Storage for Interactive Services.

2001 - 2010

  • 2010 - Facebook - Finding a needle in Haystack: Facebook’s photo storage.
  • 2010 - AMPLab - Spark: Cluster Computing with Working Sets.
  • 2010 - Google - Storage Architecture and Challenges.
  • 2010 - Google - Pregel: A System for Large-Scale Graph Processing.
  • 2010 - Google - Large-scale Incremental Processing Using Distributed Transactions and Notifications (basis of Percolator and Caffeine).
  • 2010 - Google - Dremel: Interactive Analysis of Web-Scale Datasets.
  • 2010 - Yahoo - S4: Distributed Stream Computing Platform.
  • 2009 - HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
  • 2008 - AMPLab - Chukwa: A large-scale monitoring system.
  • 2007 - Amazon - Dynamo: Amazon’s Highly Available Key-value Store.
  • 2006 - Google - The Chubby lock service for loosely-coupled distributed systems.
  • 2006 - Google - Bigtable: A Distributed Storage System for Structured Data.
  • 2004 - Google - MapReduce: Simplified Data Processing on Large Clusters.
  • 2003 - Google - The Google File System.

Other Awesome Lists

Other amazingly awesome lists can be found in the awesome-awesomeness list.
