Awesome Big Data — Project Awesome

Frameworks

general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.

Tigon 285 (archived)

High Throughput Real-time Stream Processing Framework.

Polyaxon 3.7k updated 3mo ago

A platform for reproducible and scalable machine learning and deep learning.

Smooks 415 updated 8mo ago

An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.

Distributed Programming

AddThis Hydra 436 (archived)

distributed data processing and storage system originally developed at AddThis.

Damballa Parkour 255 updated 10y ago

MapReduce library for Clojure.

Datasalt Pangool 56 updated 4y ago

alternative MapReduce paradigm.

Netflix PigPen 566 updated 3y ago

map-reduce for Clojure which compiles to Apache Pig.

Ray 41.9k updated 4mo ago

A fast and simple framework for building and running distributed applications.

Skale 397 (archived)

High performance distributed data processing in NodeJS.

streamsx.topology 28 updated 4y ago

Libraries to enable building IBM Streams application in Java, Python or Scala.

Tuktu 60 updated 8y ago

Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!

Twitter Heron 3.7k (archived)

Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.

Twitter Scalding 3.5k updated 3y ago

Scala library for Map Reduce jobs, built on Cascading.

Twitter Summingbird 2.1k (archived)

Streaming MapReduce with Scalding and Storm, by Twitter.

Distributed Filesystem

Ambry 1.8k updated 4mo ago

a distributed object store that supports storage of trillions of small immutable objects as well as billions of large objects.

Seaweed-FS 18 updated 4mo ago

simple and highly scalable distributed file system.

Baidu File System 2.9k updated 7y ago

distributed filesystem.

Distributed Index

Pilosa 2.5k (archived)

Open source distributed bitmap index that dramatically accelerates queries across multiple, massive data sets.

Key Map Data Model

Baidu Tera 1.9k updated 2y ago

an Internet-scale database, inspired by BigTable.

Facebook HydraBase

evolution of HBase made by Facebook.

InfiniDB 247 updated 8y ago

is accessed through a MySQL interface and uses massively parallel processing to parallelize queries.

Tephra 158 (archived)

Transactions for HBase.

Key-value Data Model

Bolt 14.6k (archived)

an embedded key-value database for Go.

BTDB 140 updated 2mo ago

Key Value Database in .Net with Object DB Layer, RPC, dynamic IL and much more

BuntDB 4.8k updated 1y ago

a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.

Edis 554 updated 11y ago

is a protocol-compatible Server replacement for Redis.

ElephantDB 558 updated 12y ago

Distributed database specialized in exporting data from Hadoop.

GhostDB 752 updated 5y ago

a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.

Graviton 423 updated 4y ago

a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).

GridDB 2.5k updated 4mo ago

suitable for sensor data stored in a timeseries.

HyperDex 1.4k updated 2y ago

a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.

LinkedIn Krati 26 updated 14y ago

is a simple persistent data store with very low latency and high throughput.

Riak 4.0k updated 2y ago

a decentralized datastore.

Storehaus 464 updated 6y ago

library to work with asynchronous key value stores, by Twitter.

SummitDB 1.4k (archived)

an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.

Tarantool 3.6k updated 4mo ago

an efficient NoSQL database and a Lua application server.

TiKV 16.7k updated 3mo ago

a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.

Tile38 9.6k updated 4mo ago

a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON

TreodeDB 175 updated 10y ago

key-value store that's replicated and sharded and provides atomic multirow writes.

Graph Data Model

Actionbase 212 updated 4mo ago

a database for user interactions (likes, views, follows) with precomputed reads, supports HBase.

DGraph 21.7k updated 2mo ago

A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.

EliasDB 1.0k updated 4y ago

a lightweight graph based database that does not require any third-party libraries.

GCHQ Gaffer 1.8k (archived)

Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.

Google Cayley 15.0k updated 8mo ago

open-source graph database.

Gremlin 2.0k updated 5y ago

graph traversal Language.

Infovore 148 updated 4y ago

RDF-centric Map/Reduce framework.

Microsoft Graph Engine 2.3k updated 1y ago

a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.

Phoebus 384 updated 14y ago

framework for large scale graph processing.

Twitter FlockDB 3.3k (archived)

distributed graph database.

Columnar Databases

IndexR 452 updated 3y ago

an open-source columnar storage format for fast, realtime analytics on big data.

LocustDB 1.6k updated 1y ago

an experimental analytics database aiming to set a new standard for query performance on commodity hardware.

NewSQL Databases

ActorDB 1.9k updated 3y ago

a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.

BayesDB 889 updated 10y ago

statistics-oriented SQL database.

Cockroach 32.0k updated 4mo ago

Scalable, Geo-Replicated, Transactional Datastore.

Comdb2 1.5k updated 3mo ago

a clustered RDBMS built on optimistic concurrency control techniques.

Haeinsa 159 updated 9y ago

linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.

KarelDB 387 updated 9mo ago

a relational database backed by Apache Kafka.

TiDB 39.9k updated 4mo ago

TiDB is a distributed SQL database inspired by the design of Google F1.

yugabyteDB 10.2k updated 4mo ago

open source, high-performance, distributed SQL database compatible with PostgreSQL.

Time-Series Databases

Kairosdb 1.8k updated 4mo ago

similar to OpenTSDB but allows for Cassandra.

TDengine 24.8k updated 4mo ago

a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data

Druid 14.0k updated 3mo ago

Column oriented distributed data store ideal for powering interactive applications

Akumuli 840 (archived)

Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".

Rhombus

A time-series object store for Cassandra that handles all the complexity of building wide row indexes.

Dalmatiner DB 692 updated 7y ago

Fast distributed metrics database

Blueflood 598 updated 1y ago

A distributed system designed to ingest and process time series data

Timely 388 updated 6mo ago

Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.

SiriDB 510 updated 4mo ago

Highly-scalable, robust and fast, open source time series database with cluster functionality.

Thanos 14.0k updated 3mo ago

Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.

VictoriaMetrics 16.9k updated 2mo ago

fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included

Beringei 3.2k (archived)

Facebook's in-memory time-series database.

SQL-like processing

Materialize 6.3k updated 4mo ago

is a streaming database for real-time applications using SQL for queries and supporting a large fraction of PostgreSQL.

Spark

Spark Catalyst 43.0k updated 4mo ago

is a Query Optimization Framework for Spark and Shark.

Data Ingestion

Apache Pulsar 15.2k updated 2mo ago

a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.

Facebook Scribe 3.9k (archived)

streamed log data aggregator.

Gazette 785 updated 4mo ago

Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.

Heka 3.4k (archived)

open source stream processing software system.

HIHO 92 updated 13y ago

framework for connecting disparate data sources with Hadoop.

Kestrel

distributed message queue system.

LinkedIn Kamikaze 22 updated 12y ago

utility package for compressing sorted integer arrays.

LinkedIn White Elephant 190 (archived)

log aggregator and dashboard.

Netflix Suro 797 (archived)

log aggregator like Storm and Samza based on Chukwa.

Pinterest Secor 1.9k updated 4mo ago

is a service implementing Kafka log persistence.

LinkedIn Gobblin 2.3k updated 4mo ago

LinkedIn's universal data ingestion framework.

Skizze 772 updated 10y ago

sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.

StreamSets Data Collector

continuous big data ingest infrastructure with a simple to use IDE.

RudderStack 4.4k updated 4mo ago

an open source customer data infrastructure (Segment, mParticle alternative) written in Go.

Zilla 682 updated 4mo ago

An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.

Service Programming

Hydrosphere Mist 325 updated 3mo ago

a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.

Mara 2.1k updated 2y ago

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

Spotify Luigi 18.7k updated 4mo ago

a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Spring XD 478 (archived)

distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.

Twitter Elephant Bird 1.1k updated 3y ago

libraries for working with LZOP-compressed data.

Scheduling

Apache Airflow 45.2k updated 2mo ago

a platform to programmatically author, schedule and monitor workflows.

Cronicle 5.6k updated 3mo ago

Distributed, easy to install, NodeJS based, task scheduler

Dagster 15.1k updated 4mo ago

a data orchestrator for machine learning, analytics, and ETL.

Schedoscope 97 (archived)

Scala DSL for agile scheduling of Hadoop jobs.

Sparrow 328 (archived)

scheduling platform.

Machine Learning

brain 8.0k (archived)

Neural networks in JavaScript.

Oryx 1.8k (archived)

Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning.

convnetjs 11.1k updated 3y ago

Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.

DataVec

A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.

Decider 383 updated 9y ago

Flexible and Extensible Machine Learning in Ruby.

Etsy Conjecture 359 (archived)

scalable Machine Learning in Scalding.

Feast 7.0k updated 2mo ago

A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.

H2O 7.5k updated 4mo ago

statistical, machine learning and math runtime with Hadoop. R and Python.

Karate Club 2.3k updated 2y ago

An unsupervised machine learning library for graph structured data. Python

Keras 64.0k updated 4mo ago

An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.

Lambdo 1 updated 7y ago

Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.

Little Ball of Fur 713 updated 7mo ago

A subsampling library for graph structured data. Python

MLPNeuralNet 903 updated 9y ago

Fast multilayer perceptron neural network library for iOS and Mac OS X.

ML Workspace 3.5k updated 2y ago

All-in-one web-based IDE specialized for machine learning and data science.

ND4J

A matrix library for the JVM. Numpy for Java.

nupic 6.4k updated 1y ago

Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.

PyTorch Geometric Temporal 3.0k updated 10mo ago

a temporal extension library for PyTorch Geometric.

RL4J

Reinforcement learning for Java and Scala. Includes Deep-Q learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.

scikit-learn 65.9k updated 2mo ago

scikit-learn: machine learning in Python.

Shapley 225 updated 6mo ago

A data-driven framework to quantify the value of classifiers in a machine learning ensemble.

TensorFlow 194.9k updated 2mo ago

Library from Google for machine learning using data flow graphs.

Velox 110 updated 9y ago

System for serving machine learning predictions.

Vowpal Wabbit 8.7k updated 4mo ago

learning system sponsored by Microsoft and Yahoo!.

BidMach 919 updated 3y ago

CPU and GPU-accelerated Machine Learning Library.

Benchmarking

Berkeley SWIM Benchmark

real-world big data workload benchmark.

Estuary Benchmark Report 2 updated 1y ago

reproducible, vendor-neutral data warehouse benchmark.

Intel HiBench 1.5k (archived)

a Hadoop benchmark suite.

Deeplearning4j Benchmarks

UCSB 60 updated 2y ago

extended Yahoo Cloud Serving Benchmark for NoSQL databases.

Security

BDA 104 (archived)

The vulnerability detector for Hadoop and Spark

System Deployment

Apache Slider 78 (archived)

is a YARN application to deploy existing distributed applications on YARN.

Marathon 4.0k (archived)

Mesos framework for long-running services.

Linkis 3.4k updated 4mo ago

Linkis helps easily connect to various back-end computation/storage engines.

Applications

411 968 (archived)

a web application for managing alerts generated by scheduled searches in Elasticsearch.

Adobe spindle 330 updated 11y ago

Next-generation web analytics processing with Scala, Spark, and Parquet.

Argus

Time series monitoring and alerting platform.

AthenaX 1.2k (archived)

a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).

Atlas 3.5k updated 4mo ago

a backend for managing dimensional time series data.

ElastAlert 8.0k updated 1y ago

ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.

Eventhub 1.3k updated 4y ago

open source event analytics platform.

Hermes 851 updated 4mo ago

asynchronous message broker built on top of Kafka.

Kapacitor 2.4k updated 4mo ago

an open source framework for processing, monitoring, and alerting on time series data.

PivotalR 127 (archived)

R on Pivotal HD / HAWQ and PostgreSQL.

Rakam 795 updated 4y ago

open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.

SnappyData 1.0k updated 3y ago

a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.

Snowplow 7.0k updated 2mo ago

enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.

Substation 397 updated 6mo ago

Substation is a cloud native data pipeline and transformation toolkit written in Go.

Search engine and framework

Elassandra 1.7k updated 1y ago

is a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.

LinkedIn Cleo 568 updated 12y ago

is a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.

LinkedIn Zoie 378 updated 3y ago

is a realtime search/indexing system written in Java.

Facebook Faiss 39.9k updated 2mo ago

is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.

Annoy 14.2k updated 9mo ago

is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Weaviate 15.9k updated 4mo ago

Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.

MySQL forks and evolutions

ProxySQL 25 updated 8mo ago

High Performance Proxy for MySQL.

Memcached forks and evolutions

Twemproxy 12.4k updated 2y ago

A fast, light-weight proxy for memcached and redis.

Twitter Fatcache 1.3k (archived)

key/value cache for flash storage.

Twitter Twemcache 935 (archived)

fork of Memcached.

Embedded Databases

HanoiDB 311 updated 10y ago

Erlang LSM BTree Storage.

LevelDB 38.9k updated 4mo ago

a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.

RocksDB

embeddable persistent key-value store for fast storage based on LevelDB.

Business Intelligence

Blazer 4.8k updated 4mo ago

business intelligence made simple.

Lightdash 5.8k updated 2mo ago

The open source Looker alternative built on dbt

Metabase 46.5k updated 4mo ago

The simplest, fastest way to get business intelligence and analytics to everyone in your company.

Data Visualization

Airpal 2.8k (archived)

Web UI for PrestoDB.

Arbor 2.7k updated 6y ago

graph visualization library using web workers and jQuery.

Banana 671 updated 11mo ago

visualize logs and time-stamped data stored in Solr. Port of Kibana.

Bloomery 18 updated 9y ago

Web UI for Impala.

CartoDB 2.8k updated 1y ago

open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.

Chartist.js 99 updated 2y ago

another open source HTML5 Charts visualization.

Cubism 4.9k updated 1y ago

JavaScript library for time series visualization.

Dash 24.5k updated 4mo ago

Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required

DevExtreme React Chart

High-performance plugin-based React chart for Bootstrap and Material Design.

Echarts 66.0k updated 4mo ago

Baidu's enterprise charts.

Envisionjs 1.6k updated 6y ago

dynamic HTML5 visualization.

Freeboard 6.5k updated 2y ago

open source real-time dashboard builder for IoT and other web mashups.

Gephi 6.5k updated 2mo ago

An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.

Matplotlib 22.7k updated 3mo ago

plotting with Python.

Plotly.js 18.2k updated 3mo ago

The open source JavaScript graphing library that powers plotly.

Recline 2.3k updated 3mo ago

simple but powerful library for building data applications in pure JavaScript and HTML.

Redash 28.3k updated 4mo ago

open-source platform to query and visualize data.

Sigma.js 12.0k updated 3mo ago

JavaScript library dedicated to graph drawing.

Superset 71.1k updated 4mo ago

a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.

Vega 11.9k updated 3mo ago

a visualization grammar.

Zeppelin 405 updated 9y ago

a notebook-style collaborative data analysis tool.

DataSphere Studio 3.3k updated 8mo ago

one-stop data application development management portal.

D3.compose 696 updated 3y ago

Compose complex, data-driven visualizations from reusable charts and components.

Peity 4.2k updated 2y ago

Progressive SVG bar, line and pie charts.

Internet of things and sensor data

NetLytics 9 updated 8y ago

Analytics platform to process network data on Spark.

Other Awesome Lists

awesome-awesomeness 33.3k updated 2y ago

awesome 459.8k updated 3mo ago

list 11.2k updated 4mo ago

awesome-awesome-awesome 2.2k updated 2y ago

awesome-analytics 4.3k updated 5mo ago

awesome-public-datasets 74.6k updated 2mo ago

awesome-graph-classification 4.8k updated 3y ago

awesome-network-embedding 2.6k updated 5y ago

awesome-community-detection 2.4k updated 7mo ago

awesome-decision-tree-papers 2.5k updated 7mo ago

awesome-fraud-detection-papers 1.8k updated 6mo ago

awesome-gradient-boosting-papers 1.0k updated 6mo ago

awesome-monte-carlo-tree-search-papers 696 updated 6mo ago

awesome-kafka 212 updated 5mo ago

Google Bigtable 54 updated 3y ago

Books

Distributed systems

Distributed Systems for fun and profit

Theory of distributed systems. Includes parts about time and ordering, replication, and impossibility results.

Streaming

Storm Applied 6.7k updated 2mo ago

Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.

Kafka Streams in Action 164 updated 2mo ago

Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.

Reactive Data Handling

Reactive Data Handling is a free eBook: a collection of five hand-picked chapters, selected by Manuel Bernhardt, that introduce you to building reactive applications capable of handling real-time processing with large data loads.

Azure Data Engineering

A book about data engineering in general and the Azure platform specifically

Graph Based approach

Graph-Powered Machine Learning 16.9k updated 3mo ago

Alessandro Negro. Combine graph theory and models to improve machine learning projects
