Data Engineering
Contents
Databases
Relational
Key-Value
Column
Graph
Timeseries
Scalable datastore for metrics, events, and real-time analytics.
A scalable, distributed Time Series Database.
Fast scalable time series database.
A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
Column oriented distributed data store ideal for powering interactive applications.
A numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" can be translated from Esperanto as "accumulate".
A time-series object store for Cassandra that handles all the complexity of building wide row indexes.
Fast distributed metrics database.
A distributed system designed to ingest and process time series data.
A time series database application that provides secure access to time series data based on Accumulo and Grafana.
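Most of the time-series databases above ingest points shaped roughly as (measurement, tags, fields, timestamp). A minimal sketch rendering such a point as an InfluxDB-style line-protocol string (the helper name and sample values are illustrative, not any one database's API):

```python
def to_line_protocol(measurement, tags, fields, ts_ns):
    """Render one point as: measurement,tag=val field=val timestamp.
    Illustrative only; real clients also handle escaping and types."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = to_line_protocol(
    "cpu_load", {"host": "web01", "region": "eu"}, {"value": 0.64},
    1_700_000_000_000_000_000,
)
print(line)
# cpu_load,host=web01,region=eu value=0.64 1700000000000000000
```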
Other
An in-memory database and application server.
The Greenplum Database (GPDB) - An advanced, fully featured, open source data warehouse. It provides powerful and rapid analytics on petabyte scale data volumes.
An open-source graph database, by Google.
OLTP + OLAP Database built on Apache Spark.
Data Comparison
A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels.
Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
A high-performance Python library for comparing large datasets (CSV, Parquet) locally using Rust and Polars. It features zero-copy streaming to prevent OOM errors and generates interactive HTML data quality reports.
AI-powered data operations SDK for Python. Semantic deduplication, fuzzy table merging, and intelligent row ranking using LLM agents.
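The comparison tools above go beyond equality checks to report discrepancies at the row and column level. A stdlib-only sketch of that idea over two tables represented as lists of dicts (function name and output shape are assumptions, not any listed library's API):

```python
def diff_rows(left, right, key):
    """Join two tables (lists of dicts) on `key` and report
    per-column mismatches as (key, column, left_val, right_val)."""
    right_by_key = {row[key]: row for row in right}
    mismatches = []
    for row in left:
        other = right_by_key.get(row[key])
        if other is None:
            mismatches.append((row[key], "missing_in_right", None, None))
            continue
        for col, val in row.items():
            if other.get(col) != val:
                mismatches.append((row[key], col, val, other.get(col)))
    return mismatches

left = [{"id": 1, "amount": 10}, {"id": 2, "amount": 15}]
right = [{"id": 1, "amount": 10}, {"id": 2, "amount": 99}]
print(diff_rows(left, right, "id"))
# [(2, 'amount', 15, 99)]
```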
Data Ingestion
CLI tool to copy data between databases with a single command. Supports 50+ sources including PostgreSQL, MySQL, MongoDB, Salesforce, Shopify to any data warehouse.
Data Acquisition and Processing Made Easy. Deprecated.
Universal data ingestion framework for Hadoop from LinkedIn.
Utility belt to handle data on AWS.
Live import all your Google Sheets to your data warehouse.
Lightweight Node.js ETL framework for databases → data lakes/warehouses.
Polyglot document intelligence library with a Rust core and bindings for Python, TypeScript, Go, and more. Extracts text, tables, and metadata from 62+ document formats for data pipeline ingestion.
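At their core, the ingestion tools above read rows from a source, optionally transform them, and load them into a target. A self-contained sketch of that extract-transform-load loop using only the standard library (the CSV content and table schema are made up for illustration):

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV stream like those the tools above consume.
raw = io.StringIO("id,email\n1,a@example.com\n2,b@example.com\n")

# Load into an in-memory SQLite "warehouse".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
rows = [(int(r["id"]), r["email"]) for r in csv.DictReader(raw)]
conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 2
```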
Kafka
Change data capture from PostgreSQL into Kafka. Deprecated.
Simplified command-line administration for Kafka brokers.
Generic command line non-JVM Apache Kafka producer and consumer.
A PostgreSQL extension to produce messages to Apache Kafka.
The Apache Kafka C/C++ library.
Kafka in Docker.
A tool for managing Apache Kafka.
Node.js client for Apache Kafka 0.8.
Pinterest's Kafka to S3 distributed consumer.
Kafka-winston logger for Node.js from Uber.
A Kafka Proxy, solving problems like encrypting your Kafka data at rest.
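The change-data-capture entry above streams PostgreSQL row changes into Kafka as structured events. A sketch of one such event envelope, serialized as JSON; the field names are assumptions loosely modelled on common CDC formats, not the tool's actual schema:

```python
import json

def cdc_event(op, table, before, after, lsn):
    """Build a change-data-capture message body (hypothetical shape)."""
    return json.dumps({
        "op": op,            # "c"reate / "u"pdate / "d"elete
        "table": table,
        "before": before,    # row image before the change (None on insert)
        "after": after,      # row image after the change (None on delete)
        "source_lsn": lsn,   # position in the source's write-ahead log
    }, sort_keys=True)

msg = cdc_event("u", "public.users",
                {"id": 7, "email": "old@example.com"},
                {"id": 7, "email": "new@example.com"}, 12345)
payload = json.loads(msg)
print(payload["op"], payload["after"]["email"])  # u new@example.com
```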
File System
A pure python HDFS client.
Utils for streaming large files (S3, HDFS, gzip, bz2).
A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
A bite-sized, lightweight HDFS compatible file system built over Cassandra.
Seaweed-FS is a simple and highly scalable distributed file system with two objectives: store billions of files, and serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS implements only a key-to-file mapping; much like the word "NoSQL", you could call it "NoFS".
A file system that stores all its data online using storage services like Google Storage, Amazon S3, or OpenStack.
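The key-to-file model mentioned above replaces directories and POSIX metadata with a flat mapping from generated keys to file bytes. A minimal in-memory sketch of that model (class and key format are illustrative, not Seaweed-FS's API):

```python
class KeyFileStore:
    """Flat key -> file-bytes mapping instead of full POSIX semantics."""
    def __init__(self):
        self._blobs = {}
        self._next_id = 0

    def put(self, data: bytes) -> str:
        """Store a blob and return its generated file id."""
        key = f"fid-{self._next_id}"
        self._next_id += 1
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = KeyFileStore()
fid = store.put(b"hello")
print(fid, store.get(fid))  # fid-0 b'hello'
```

The trade-off: lookups and writes are O(1) with tiny metadata, but renames, directory listings, and permissions are out of scope.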
Serialization format
Stream Processing
An open source ETL framework to build fresh index for AI.
The Streaming SQL Database.
Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing.
The streaming database built for IoT data storage and real-time processing.
A lightweight IoT data analytics and streaming engine implemented in Go that can run on all kinds of resource-constrained edge devices.
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
A framework for building real-time streaming data processing applications that supports a wide range of ingestion sources.
Performant open-source Python ETL framework with Rust runtime, supporting 300+ data sources.
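A core operation of the stream processors above is aggregating an event stream over fixed time windows. A sketch of tumbling-window sums over a finite iterable (real engines do this incrementally over unbounded streams, with watermarks and state stores):

```python
from collections import defaultdict

def tumbling_window_sums(events, window_s):
    """Sum (timestamp, key, value) events into fixed windows of
    `window_s` seconds, keyed by (window_start, key)."""
    sums = defaultdict(float)
    for ts, key, value in events:
        window_start = ts - (ts % window_s)
        sums[(window_start, key)] += value
    return dict(sums)

events = [(0, "a", 1.0), (5, "a", 2.0), (12, "a", 4.0), (13, "b", 1.0)]
print(tumbling_window_sums(events, 10))
# {(0, 'a'): 3.0, (10, 'a'): 4.0, (10, 'b'): 1.0}
```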
Batch Processing
Spark
A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes it via column operations, as opposed to the set-only operations of conventional approaches like MapReduce or SQL.
A cloud native data pipeline and transformation toolkit written in Go.
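The column-oriented model described above derives new data one column at a time rather than by transforming whole row sets. A toy sketch of that idea, with a table as a dict of equal-length columns (names and helper are illustrative):

```python
# A table as a dict of equal-length columns.
table = {
    "price": [10.0, 20.0, 30.0],
    "qty":   [1, 3, 2],
}

def add_column(tbl, name, fn):
    """Derive a new column by applying fn to each row index --
    a column operation, as opposed to a set operation over rows."""
    n_rows = len(next(iter(tbl.values())))
    tbl[name] = [fn(i) for i in range(n_rows)]
    return tbl

add_column(table, "total", lambda i: table["price"][i] * table["qty"][i])
print(table["total"])  # [10.0, 60.0, 60.0]
```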
Charts and Dashboards
Python helpers for building dashboards using Flask and React.
Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
A modern, enterprise-ready business intelligence web application.
The easy, open source way for everyone in your company to ask questions and learn from data.
Natural language database query interface with automatic chart generation, supporting Chinese and English queries.
Workflow
End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
A Python module that helps you build complex pipelines of batch jobs.
A cron-like job scheduling system. Used with Luigi. Deprecated.
A system to programmatically author, schedule, and monitor data pipelines.
DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
An open-source Python library for building data applications.
A lightweight library to define data transformations as a directed-acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
An open source framework that allows you to enforce agreements on how data should be accessed, used, and transformed, regardless of the data platform (Snowflake, BigQuery, Databricks, etc.).
The open-source reverse ETL, data activation platform for modern data teams.
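The common idea behind the workflow tools above is declaring tasks plus their dependencies as a DAG and letting a scheduler run them in order. A minimal runner using the standard library's `graphlib` (Python 3.9+); the extract/transform/load tasks are made up for illustration and this is not any one tool's API:

```python
from graphlib import TopologicalSorter

results = {}

def extract():   results["raw"] = [1, 2, 3]
def transform(): results["clean"] = [x * 10 for x in results["raw"]]
def load():      results["loaded"] = sum(results["clean"])

# Map each task to the set of tasks it depends on.
dag = {transform: {extract}, load: {transform}}

# static_order() yields tasks so that dependencies always run first.
for task in TopologicalSorter(dag).static_order():
    task()
print(results["loaded"])  # 60
```

Real orchestrators add what this sketch omits: scheduling, retries, parallelism, and state persistence between runs.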
Data Lake Management
An open source platform that delivers resilience and manageability to object-storage based data lakes.
A Transactional Catalog for Data Lakes with Git-like semantics. Works with Apache Iceberg tables.
An open-source, unified metadata management for data lakes, data warehouses, and external catalogs.
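"Git-like semantics" for a data lake means branching the catalog's table-to-snapshot mapping, committing changes in isolation, and merging back. A toy sketch of those semantics over an in-memory mapping (class, method names, and snapshot ids are all illustrative):

```python
class Catalog:
    """Branchable table -> snapshot-id mapping (illustrative sketch)."""
    def __init__(self):
        self.branches = {"main": {}}

    def branch(self, name, source="main"):
        """Create a branch as a cheap copy of the source's state."""
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        self.branches[target].update(self.branches[source])

cat = Catalog()
cat.commit("main", "orders", "snap-1")
cat.branch("dev")
cat.commit("dev", "orders", "snap-2")  # experiment in isolation
print(cat.branches["main"]["orders"])  # snap-1 (main unaffected)
cat.merge("dev")
print(cat.branches["main"]["orders"])  # snap-2
```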
ELK (Elasticsearch, Logstash, Kibana)
Docker
Package golang service into minimal Docker containers.
Easily manage Docker containers & their data.
Weaving Docker containers into applications.
A lightweight tool for easy deployment and rollback of dockerized applications.
Analyzes resource usage and performance characteristics of running containers.
Docker microservice for saving/restoring volume data to S3.
Docker composition tool with idempotency features for deploying apps composed of multiple containers. Deprecated.
A cluster manager, designed for both long-lived services and short-lived batch processing workloads.
Monitoring
Profiling
Testing
A data catalog tool that integrates into your CI system exposing downstream impact testing of data changes. These tests prevent data changes which might break data pipelines or BI dashboards from making it to production.
An open-source data quality platform for the whole data platform lifecycle from profiling new data sources to applying full automation of data quality monitoring.
Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
A Snowflake-compatible emulator for local development and testing.
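The decorator-first contract style above validates a frame's shape at function boundaries so bad data fails fast. A stdlib-only sketch over a list-of-dicts "frame" (decorator name and error format are assumptions, not the listed library's API):

```python
import functools

def expect_columns(*required):
    """Decorator enforcing that the first argument (a list-of-dict
    'frame') contains the required columns before the function runs."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(frame, *args, **kwargs):
            missing = [c for c in required if frame and c not in frame[0]]
            if missing:
                raise ValueError(f"missing columns: {missing}")
            return fn(frame, *args, **kwargs)
        return wrapper
    return deco

@expect_columns("id", "amount")
def total(frame):
    return sum(row["amount"] for row in frame)

print(total([{"id": 1, "amount": 5}, {"id": 2, "amount": 7}]))  # 12
```

Calling `total` on a frame lacking an `amount` column raises `ValueError` at the boundary instead of failing deep inside the transform.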