Data Engineering
Contents
Databases
Relational
Key-Value
Column
Graph
Timeseries
Scalable datastore for metrics, events, and real-time analytics.
A scalable, distributed Time Series Database.
Fast scalable time series database.
A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
Column oriented distributed data store ideal for powering interactive applications.
A numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Fast distributed metrics database.
A distributed system designed to ingest and process time series data.
A time series database application that provides secure access to time series data based on Accumulo and Grafana.
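Most of the time-series databases above ingest points shaped roughly as (measurement, tags, fields, timestamp). A minimal sketch rendering such a point as an InfluxDB-style line-protocol string (the helper name and sample values are illustrative, not any one database's API):

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one point as: measurement,tag=val field=val timestamp.
    Illustrative only; real clients also handle escaping and types."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "cpu_load", {"host": "web01", "region": "eu"}, {"value": 0.64},
    1_700_000_000_000_000_000,
)
print(line)
# cpu_load,host=web01,region=eu value=0.64 1700000000000000000
```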
Other
An in-memory database and application server.
The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
An open-source graph database, by Google.
OLTP + OLAP Database built on Apache Spark.
Data Comparison
A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
A high-performance Python library for comparing large datasets (CSV, Parquet) locally using Rust and Polars. It features zero-copy streaming to prevent OOM errors and generates interactive HTML data quality reports.
AI-powered data operations SDK for Python. Semantic deduplication, fuzzy table merging, and intelligent row ranking using LLM agents.
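The comparison tools above go beyond equality checks to report discrepancies at the row and column level. A stdlib-only sketch of that idea over two tables represented as lists of dicts (function name and output shape are assumptions, not any listed library's API):

```python
def diff_rows(left, right, key):
    """Join two tables (lists of dicts) on `key` and report
    per-column mismatches as (key, column, left_val, right_val)."""
    right_by_key = {row[key]: row for row in right}
    mismatches = []
    for row in left:
        other = right_by_key.get(row[key])
        if other is None:
            mismatches.append((row[key], "missing_in_right", None, None))
            continue
        for col, val in row.items():
            if other.get(col) != val:
                mismatches.append((row[key], col, val, other.get(col)))
    return mismatches

left = [{"id": 1, "amount": 10}, {"id": 2, "amount": 15}]
right = [{"id": 1, "amount": 10}, {"id": 2, "amount": 99}]
print(diff_rows(left, right, "id"))
# [(2, 'amount', 15, 99)]
```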
Data Ingestion
CLI tool to copy data between databases with a single command. Supports 50+ sources including PostgreSQL, MySQL, MongoDB, Salesforce, Shopify to any data warehouse.
Data Acquisition and Processing Made Easy. Deprecated.
Universal data ingestion framework for Hadoop from LinkedIn.
Utility belt to handle data on AWS.
Live import all your Google Sheets to your data warehouse.
Lightweight Node.js ETL framework for databases → data lakes/warehouses.
Polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.
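At their core, the ingestion tools above read rows from a source, optionally transform them, and load them into a target. A self-contained sketch of that extract-transform-load loop using only the standard library (the CSV content and table schema are made up for illustration):

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV stream like those the tools above consume.
raw = io.StringIO("id,email\n1,a@example.com\n2,b@example.com\n")

# Load into an in-memory SQLite "warehouse".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
rows = [(int(r["id"]), r["email"]) for r in csv.DictReader(raw)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```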
Kafka
Change data capture from PostgreSQL into Kafka. Deprecated.
Simplified command-line administration for Kafka brokers.
Generic command line non-JVM Apache Kafka producer and consumer.
A PostgreSQL extension to produce messages to Apache Kafka.
The Apache Kafka C/C++ library.
Kafka in Docker.
A tool for managing Apache Kafka.
Node.js client for Apache Kafka 0.8.
Pinterest's Kafka to S3 distributed consumer.
Kafka-winston logger for Node.js from Uber.
A Kafka Proxy, solving problems like encrypting your Kafka data at rest.
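The change-data-capture entry above streams PostgreSQL row changes into Kafka as structured events. A sketch of one such event envelope, serialized as JSON; the field names are assumptions loosely modelled on common CDC formats, not the tool's actual schema:

```python
import json

def cdc_event(op, table, before, after, lsn):
    """Build a change-data-capture message body (hypothetical shape)."""
    return json.dumps({
        "op": op,            # "c"reate / "u"pdate / "d"elete
        "table": table,
        "before": before,    # row image before the change (None on insert)
        "after": after,      # row image after the change (None on delete)
        "source_lsn": lsn,   # position in the source's write-ahead log
    }, sort_keys=True)

msg = cdc_event("u", "public.users",
                {"id": 7, "email": "old@example.com"},
                {"id": 7, "email": "new@example.com"}, 12345)
payload = json.loads(msg)
print(payload["op"], payload["after"]["email"])  # u new@example.com
```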
File System
A pure python HDFS client.
Utils for streaming large files (S3, HDFS, gzip, bz2).
A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
A bite-sized, lightweight HDFS compatible file system built over Cassandra.
Seaweed-FS is a simple and highly scalable distributed file system with two objectives: store billions of files, and serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS implements only a key-to-file mapping; much like the word "NoSQL", you could call it "NoFS".
A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
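The key-to-file model mentioned above replaces directories and POSIX metadata with a flat mapping from generated keys to file bytes. A minimal in-memory sketch of that model (class and key format are illustrative, not Seaweed-FS's API):

```python
class KeyFileStore:
    """Flat key -> file-bytes mapping instead of full POSIX semantics."""
    def __init__(self):
        self._blobs = {}
        self._next_id = 0

    def put(self, data: bytes) -> str:
        """Store a blob and return its generated file id."""
        key = f"fid-{self._next_id}"
        self._next_id += 1
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = KeyFileStore()
fid = store.put(b"hello")
print(fid, store.get(fid))  # fid-0 b'hello'
```

The trade-off: lookups and writes are O(1) with tiny metadata, but renames, directory listings, and permissions are out of scope.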
Serialization format
Stream Processing
An open source ETL framework to build fresh index for AI.
The Streaming SQL Database.
Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
The streaming database built for IoT data storage and real-time processing.
A lightweight IoT data analytics and streaming engine implemented in Go that can run on all kinds of resource-constrained edge devices.
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
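A core operation of the stream processors above is aggregating an event stream over fixed time windows. A sketch of tumbling-window sums over a finite iterable (real engines do this incrementally over unbounded streams, with watermarks and state stores):

```python
from collections import defaultdict

def tumbling_window_sums(events, window_s):
    """Sum (timestamp, key, value) events into fixed windows of
    `window_s` seconds, keyed by (window_start, key)."""
    sums = defaultdict(float)
    for ts, key, value in events:
        window_start = ts - (ts % window_s)
        sums[(window_start, key)] += value
    return dict(sums)

events = [(0, "a", 1.0), (5, "a", 2.0), (12, "a", 4.0), (13, "b", 1.0)]
print(tumbling_window_sums(events, 10))
# {(0, 'a'): 3.0, (10, 'a'): 4.0, (10, 'b'): 1.0}
```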
Batch Processing
Spark
A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes it via column operations, as opposed to the set-only operations of conventional approaches like MapReduce or SQL.
A cloud native data pipeline and transformation toolkit written in Go.
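The column-oriented model described above derives new data one column at a time rather than by transforming whole row sets. A toy sketch of that idea, with a table as a dict of equal-length columns (names and helper are illustrative):

```python
# A table as a dict of equal-length columns.
table = {
    "price": [10.0, 20.0, 30.0],
    "qty":   [1, 3, 2],
}

def add_column(tbl, name, fn):
    """Derive a new column by applying fn to each row index --
    a column operation, as opposed to a set operation over rows."""
    n_rows = len(next(iter(tbl.values())))
    tbl[name] = [fn(i) for i in range(n_rows)]
    return tbl

add_column(table, "total", lambda i: table["price"][i] * table["qty"][i])
print(table["total"])  # [10.0, 60.0, 60.0]
```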
Charts and Dashboards
Python helpers for building dashboards using Flask and React.
Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
A modern, enterprise-ready business intelligence web application.
The easy, open source way for everyone in your company to ask questions and learn from data.
Natural language database query interface with automatic chart generation, supporting Chinese and English queries.
Workflow
End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
A Python module that helps you build complex pipelines of batch jobs.
A cron-like job scheduling system. Used with Luigi. Deprecated.
A system to programmatically author, schedule, and monitor data pipelines.
DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
An open-source Python library for building data applications.
A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
The open-source reverse ETL, data activation platform for modern data teams.
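The common idea behind the workflow tools above is declaring tasks plus their dependencies as a DAG and letting a scheduler run them in order. A minimal runner using the standard library's `graphlib` (Python 3.9+); the extract/transform/load tasks are made up for illustration and this is not any one tool's API:

```python
from graphlib import TopologicalSorter

results = {}

def extract():   results["raw"] = [1, 2, 3]
def transform(): results["clean"] = [x * 10 for x in results["raw"]]
def load():      results["loaded"] = sum(results["clean"])

# Map each task to the set of tasks it depends on.
dag = {transform: {extract}, load: {transform}}

# static_order() yields tasks so that dependencies always run first.
for task in TopologicalSorter(dag).static_order():
    task()
print(results["loaded"])  # 60
```

Real orchestrators add what this sketch omits: scheduling, retries, parallelism, and state persistence between runs.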
Data Lake Management
An open source platform that delivers resilience and manageability to object-storage based data lakes.
A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
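"Git-like semantics" for a data lake means branching the catalog's table-to-snapshot mapping, committing changes in isolation, and merging back. A toy sketch of those semantics over an in-memory mapping (class, method names, and snapshot ids are all illustrative):

```python
class Catalog:
    """Branchable table -> snapshot-id mapping (illustrative sketch)."""
    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        """Create a branch as a cheap copy of the source's state."""
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        self.branches[target].update(self.branches[source])

cat = Catalog()
cat.commit("main", "orders", "snap-1")
cat.branch("dev")
cat.commit("dev", "orders", "snap-2")  # experiment in isolation
print(cat.branches["main"]["orders"])  # snap-1 (main unaffected)
cat.merge("dev")
print(cat.branches["main"]["orders"])  # snap-2
```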
ELK (Elasticsearch, Logstash, Kibana)
Docker
Package golang service into minimal Docker containers.
Easily manage Docker containers & their data.
Weaving Docker containers into applications.
A lightweight tool for easy deployment and rollback of dockerized applications.
Analyzes resource usage and performance characteristics of running containers.
Docker microservice for saving/restoring volume data to S3.
Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
Monitoring
Profiling
Testing
A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
A Snowflake-compatible emulator for local development and testing.
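The decorator-first contract style above validates a frame's shape at function boundaries so bad data fails fast. A stdlib-only sketch over a list-of-dicts "frame" (decorator name and error format are assumptions, not the listed library's API):

```python
import functools

def expect_columns(*required):
    """Decorator enforcing that the first argument (a list-of-dict
    'frame') contains the required columns before the function runs."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(frame, *args, **kwargs):
            missing = [c for c in required if frame and c not in frame[0]]
            if missing:
                raise ValueError(f"missing columns: {missing}")
            return fn(frame, *args, **kwargs)
        return wrapper
    return deco

@expect_columns("id", "amount")
def total(frame):
    return sum(row["amount"] for row in frame)

print(total([{"id": 1, "amount": 5}, {"id": 2, "amount": 7}]))  # 12
```

Calling `total` on a frame lacking an `amount` column raises `ValueError` at the boundary instead of failing deep inside the transform.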