Big Data
Frameworks
general-purpose data processing engine for both batch and stream analytics. It is based on a novel data model, which represents data via functions and processes data via column operations as opposed to having only set operations in conventional approaches like MapReduce or SQL.
High Throughput Real-time Stream Processing Framework.
A platform for reproducible and scalable machine learning and deep learning.
An extensible Java framework for building XML and non-XML (CSV, EDI, Java, etc.) streaming applications.
Distributed Programming
distributed data processing and storage system originally developed at AddThis.
MapReduce library for Clojure.
alternative MapReduce paradigm.
map-reduce for Clojure which compiles to Apache Pig.
A fast and simple framework for building and running distributed applications.
High performance distributed data processing in NodeJS.
Libraries to enable building IBM Streams application in Java, Python or Scala.
Easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
Heron is a realtime, distributed, fault-tolerant stream processing engine from Twitter replacing Storm.
Scala library for MapReduce jobs, built on Cascading.
Streaming MapReduce with Scalding and Storm, by Twitter.
Distributed Filesystem
Distributed Index
Key Map Data Model
an Internet-scale database, inspired by BigTable.
evolution of HBase made by Facebook.
accessed through a MySQL interface; uses massively parallel processing to parallelize queries.
Transactions for HBase.
Key-value Data Model
an embedded key-value database for Go.
Key-value database in .NET with Object DB Layer, RPC, dynamic IL and much more.
a fast, embeddable, in-memory key/value database for Go with custom indexing and geospatial support.
a protocol-compatible server replacement for Redis.
Distributed database specialized in exporting data from Hadoop.
a distributed, in-memory, general purpose key-value data store that delivers microsecond performance at any scale.
a simple, fast, versioned, authenticated, embeddable key-value store database in pure Go(lang).
suitable for sensor data stored in a timeseries.
a scalable, next generation key-value and document store with a wide array of features, including consistency, fault tolerance and high performance.
a simple persistent data store with very low latency and high throughput.
a decentralized datastore.
library to work with asynchronous key value stores, by Twitter.
an in-memory, NoSQL key/value database, with disk persistence and using the Raft consensus algorithm.
an efficient NoSQL database and a Lua application server.
a distributed key-value database powered by Rust and inspired by Google Spanner and HBase.
a geolocation data store, spatial index, and realtime geofence, supporting a variety of object types including latitude/longitude points, bounding boxes, XYZ tiles, Geohashes, and GeoJSON
key-value store that's replicated and sharded and provides atomic multirow writes.
Graph Data Model
a database for user interactions (likes, views, follows) with precomputed reads, supports HBase.
A scalable, distributed, low latency, high throughput graph database aimed at providing Google production level scale and throughput, with low enough latency to be serving real time user queries, over terabytes of structured data.
a lightweight graph-based database that does not require any third-party libraries.
Gaffer by GCHQ is a framework that makes it easy to store large-scale graphs in which the nodes and edges have statistics.
open-source graph database.
graph traversal language.
RDF-centric Map/Reduce framework.
a distributed in-memory data processing engine, underpinned by a strongly-typed in-memory key-value store and a general distributed computation engine.
framework for large scale graph processing.
distributed graph database.
Columnar Databases
NewSQL Databases
a distributed SQL database with the scalability of a KV store, while keeping the query capabilities of a relational database.
statistics-oriented SQL database.
Scalable, Geo-Replicated, Transactional Datastore.
a clustered RDBMS built on optimistic concurrency control techniques.
linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
a relational database backed by Apache Kafka.
TiDB is a distributed SQL database inspired by the design of Google F1.
open source, high-performance, distributed SQL database compatible with PostgreSQL.
Time-Series Databases
similar to OpenTSDB but allows for Cassandra.
a time series database in C utilizing unique features of IoT to improve read/write throughput and reduce space needed to store data.
Column-oriented distributed data store ideal for powering interactive applications.
Akumuli is a numeric time-series database. It can be used to capture, store and process time-series data in real-time. The word "akumuli" can be translated from Esperanto as "accumulate".
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Fast distributed metrics database
A distributed system designed to ingest and process time series data
Timely is a time series database application that provides secure access to time series data based on Accumulo and Grafana.
Highly-scalable, robust and fast, open source time series database with cluster functionality.
Thanos is a set of components to create a highly available metric system with unlimited storage capacity using multiple (existing) Prometheus deployments.
fast, scalable and resource-effective open-source TSDB compatible with Prometheus. Single-node and cluster versions included
Facebook's in-memory time-series database.
SQL-like processing
Data Ingestion
a distributed pub-sub messaging platform with a very flexible messaging model and an intuitive client API.
streamed log data aggregator.
Distributed streaming infrastructure built on cloud storage which makes it easy to mix and match batch and streaming paradigms.
open source stream processing software system.
framework for connecting disparate data sources with Hadoop.
distributed message queue system.
utility package for compressing sorted integer arrays.
log aggregator and dashboard.
log aggregator like Storm and Samza, based on Chukwa.
a service implementing Kafka log persistence.
LinkedIn's universal data ingestion framework.
sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.
continuous big data ingest infrastructure with a simple-to-use IDE.
an open source customer data infrastructure (Segment, mParticle alternative) written in Go.
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT and the native Kafka protocol.
Service Programming
a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.
A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
libraries for working with LZOP-compressed data.
Scheduling
a platform to programmatically author, schedule and monitor workflows.
Distributed, easy-to-install, NodeJS-based task scheduler.
a data orchestrator for machine learning, analytics, and ETL.
Scala DSL for agile scheduling of Hadoop jobs.
scheduling platform.
Machine Learning
Neural networks in JavaScript.
Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning.
Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
A vectorization and data preprocessing library for deep learning in Java and Scala. Part of the Deeplearning4j ecosystem.
Flexible and Extensible Machine Learning in Ruby.
scalable Machine Learning in Scalding.
A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
statistical, machine learning and math runtime with Hadoop. R and Python.
A Python library for unsupervised machine learning on graph-structured data.
An intuitive neural net API inspired by Torch that runs atop Theano and TensorFlow.
Lambdo is a workflow engine which significantly simplifies the analysis process by unifying feature engineering and machine learning operations.
A Python subsampling library for graph-structured data.
Fast multilayer perceptron neural network library for iOS and Mac OS X.
All-in-one web-based IDE specialized for machine learning and data science.
A matrix library for the JVM. Numpy for Java.
Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
a temporal extension library for PyTorch Geometric.
Reinforcement learning for Java and Scala. Includes Deep Q-learning and A3C algorithms, and integrates with OpenAI's Gym. Runs in the Deeplearning4j ecosystem.
scikit-learn: machine learning in Python.
A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
Library from Google for machine learning using data flow graphs.
System for serving machine learning predictions.
learning system sponsored by Microsoft and Yahoo!.
CPU and GPU-accelerated Machine Learning Library.
Benchmarking
real-world big data workload benchmark.
reproducible, vendor-neutral data warehouse benchmark.
a Hadoop benchmark suite.
extended Yahoo Cloud Serving Benchmark for NoSQL databases.
System Deployment
Applications
a web application for managing alerts generated by scheduled searches in Elasticsearch.
Next-generation web analytics processing with Scala, Spark, and Parquet.
Time series monitoring and alerting platform.
a streaming analytics platform that enables users to run production-quality, large scale streaming analytics using Structured Query Language (SQL).
a backend for managing dimensional time series data.
ElastAlert is a simple framework for alerting on anomalies, spikes, or other patterns of interest from data in Elasticsearch.
open source event analytics platform.
asynchronous message broker built on top of Kafka.
an open source framework for processing, monitoring, and alerting on time series data.
R on Pivotal HD / HAWQ and PostgreSQL.
open-source real-time custom analytics platform powered by PostgreSQL, Kinesis and PrestoDB.
a distributed in-memory data store for real-time operational analytics, delivering stream analytics, OLTP (online transaction processing) and OLAP (online analytical processing) built on Spark in a single integrated cluster.
enterprise-strength web and event analytics, powered by Hadoop, Kinesis, Redshift and Postgres.
Substation is a cloud native data pipeline and transformation toolkit written in Go.
Search engine and framework
a fork of Elasticsearch modified to run on top of Apache Cassandra in a scalable and resilient peer-to-peer architecture.
a flexible software library for enabling rapid development of partial, out-of-order and real-time typeahead search.
a realtime search/indexing system written in Java.
a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy.
a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.
Weaviate is a GraphQL-based semantic search engine with built-in (word) embeddings.
MySQL forks and evolutions
Memcached forks and evolutions
Embedded Databases
Business Intelligence
Data Visualization
Web UI for PrestoDB.
graph visualization library using web workers and jQuery.
visualize logs and time-stamped data stored in Solr. Port of Kibana.
Web UI for Impala.
open-source or freemium hosting for geospatial databases with powerful front-end editing capabilities and a robust API.
another open source HTML5 Charts visualization.
JavaScript library for time series visualization.
Analytical Web Apps for Python, R, Julia, and Jupyter. Built on top of plotly, no JS required.
High-performance plugin-based React chart for Bootstrap and Material Design.
Baidu's enterprise charts.
dynamic HTML5 visualization.
open source real-time dashboard builder for IoT and other web mashups.
An award-winning open-source platform for visualizing and manipulating large graphs and network connections. It's like Photoshop, but for graphs. Available for Windows and Mac OS X.
plotting with Python.
The open source JavaScript graphing library that powers plotly.
simple but powerful library for building data applications in pure JavaScript and HTML.
open-source platform to query and visualize data.
JavaScript library dedicated to graph drawing.
a data exploration platform designed to be visual, intuitive and interactive, making it easy to slice, dice and visualize data and perform analytics at the speed of thought.
a visualization grammar.
a notebook-style collaborative data analysis tool.
one-stop data application development management portal.
Compose complex, data-driven visualizations from reusable charts and components.