Big Data
Frameworks
A general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model that represents data via functions and processes data via column operations, rather than relying only on set operations as conventional approaches like MapReduce or SQL do.
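To make the column-oriented idea above concrete, here is a minimal sketch in plain Python. It is not the engine's actual API: the `calculate` helper and the sample columns are hypothetical names chosen for illustration. Each column is treated as a function from row id to value, and a new column is derived from existing columns without producing a new set of rows.

```python
# Each column is modeled as a function from row id to value (here: a dict).
# All names below are hypothetical and only illustrate the idea.
price = {0: 10.0, 1: 4.0, 2: 2.5}
quantity = {0: 3, 1: 5, 2: 4}


def calculate(func, *columns):
    """Column operation: derive a new column by applying func row by row."""
    row_ids = columns[0].keys()
    return {rid: func(*(col[rid] for col in columns)) for rid in row_ids}


# The result is another column over the same rows, not a new table.
amount = calculate(lambda p, q: p * q, price, quantity)
print(amount)
```

In a set-oriented approach the same result would typically come from a query that emits a new collection of rows; here the existing rows stay in place and only a column is added.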
Distributed Programming
A distributed data processing and storage system originally developed at AddThis.
A fast and simple framework for building and running distributed applications.
Libraries for building IBM Streams applications in Java, Python, or Scala.
Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Distributed Filesystem
Distributed Index
Key Map Data Model
Key-value Data Model
Key-value database in .NET with an object DB layer, RPC, dynamic IL, and much more.
a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
A simple persistent data store with very low latency and high throughput.
An in-memory NoSQL key/value database with disk persistence that uses the Raft consensus algorithm.
a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
Graph Data Model
A database for user interactions (likes, views, follows) with precomputed reads; supports HBase.
A scalable, distributed, low-latency, high-throughput graph database that aims to provide Google production-level scale and throughput, with latency low enough to serve real-time user queries over terabytes of structured data.
A lightweight graph-based database that does not require any third-party libraries.
Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
Columnar Databases
NewSQL Databases
Time-Series Databases
A time-series database written in C that exploits characteristics of IoT data to improve read/write throughput and reduce the space needed to store data.
Column-oriented distributed data store ideal for powering interactive applications.
Akumuli is a numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
Highly-scalable, robust and fast, open source time series database with cluster functionality.
Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
SQL-like processing
Data Ingestion
a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
A sketch data store for counting and sketching problems, built on probabilistic data structures.
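As an illustration of what a probabilistic counting structure looks like, the following is a minimal Count-Min Sketch in standard-library Python. It is a generic textbook sketch, not the implementation used by any project listed here; the class name and the `width` and `depth` parameters are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed-size table of counters."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One column per row, derived from a hash salted with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "big"),
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._indexes(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._indexes(item))


sketch = CountMinSketch()
for page in ["home", "home", "about", "home"]:
    sketch.add(page)
print(sketch.estimate("home"))  # never lower than the true count of 3
```

The trade-off is fixed memory in exchange for estimates that may overcount but never undercount; a larger `width` reduces the chance of collisions.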
Service Programming
A service for exposing Apache Spark analytics jobs and machine learning models as real-time, batch, or reactive web services.
A lightweight, opinionated ETL framework, halfway between plain scripts and Apache Airflow.
A Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command-line integration, and much more.
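Dependency resolution for batch jobs, as mentioned above, can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not the API of any package listed here; the job names and the `run_pipeline` helper are hypothetical, and a real workflow tool adds scheduling, retries, and state tracking on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_pipeline(deps, run_job):
    """Run each job once, only after all of its dependencies have run."""
    order = list(TopologicalSorter(deps).static_order())
    for job in order:
        run_job(job)
    return order


executed = run_pipeline(dependencies, lambda job: print(f"running {job}"))
```

`TopologicalSorter` raises `graphlib.CycleError` if the jobs depend on each other in a loop, which is the kind of misconfiguration a pipeline tool needs to reject up front.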
Scheduling
Machine Learning
Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning.
Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
An unsupervised machine learning library for graph-structured data (Python).
An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
Fast multilayer perceptron neural network library for iOS and Mac OS X.
All-in-one web-based IDE specialized for machine learning and data science.
Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
A temporal extension library for PyTorch Geometric.
Reinforcement learning for Java and Scala. Includes Deep Q-learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
Benchmarking
System Deployment
Applications
A web application for managing alerts generated by scheduled searches in Elasticsearch.
Next-generation web analytics processing with Scala, Spark, and Parquet.
A streaming analytics platform that enables users to run production-quality, large-scale streaming analytics using Structured Query Language (SQL).
ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
an open source framework for processing, monitoring, and alerting on time series data.
Open-source, real-time custom analytics platform powered by PostgreSQL, Kinesis, and PrestoDB.
a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
Search engine and framework
A fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
A flexible software library for rapid development of partial, out-of-order, and real-time typeahead search.
A library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
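For readers new to similarity search over dense vectors, the sketch below shows the exact brute-force baseline in plain Python. It deliberately does not use the Faiss API; the function names and sample vectors are hypothetical. A linear scan like this is what specialized libraries aim to speed up or approximate on large collections.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(vectors, query, k=2):
    """Return the indexes of the k vectors closest to query (exact linear scan)."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: l2_distance(vectors[i], query),
    )
    return ranked[:k]


vectors = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [0.9, 1.2],
]
result = nearest_neighbors(vectors, [1.0, 1.0], k=2)
print(result)  # [1, 3]
```

Each query compares against every stored vector, so the cost grows linearly with the collection size; that is acceptable for small sets and becomes the bottleneck for large ones.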
MySQL forks and evolutions
Memcached forks and evolutions
Embedded Databases
Business Intelligence
Data Visualization
open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
Analytical web apps for Python, R, Julia, and Jupyter. Built on top of Plotly; no JavaScript required.
An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
Simple but powerful library for building data applications in pure JavaScript and HTML.
a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
Internet of things and sensor data
Other Awesome Lists
Books
Distributed systems
Streaming
Storm Applied is a practical guide to using Apache Storm for the real-world tasks associated with processing and analyzing real-time data streams.
Kafka Streams in Action teaches you everything you need to know to implement stream processing on data flowing into your Kafka platform, allowing you to focus on getting more from your data without sacrificing time or effort.