Data Engineering
Contents
Databases
Relational
Key-Value
Column
Graph
Timeseries
A scalable time series database based on Cassandra and Elasticsearch, by Spotify.
Column-oriented distributed data store ideal for powering interactive applications.
A numeric time-series database. It can be used to capture, store, and process time-series data in real time. The word "akumuli" translates from Esperanto as "accumulate".
Data Comparison
A Python library that facilitates the comparison of two DataFrames in Pandas, Polars, Spark, and more. The library goes beyond basic equality checks by providing detailed insights into discrepancies at both row and column levels (a minimal pandas sketch follows this list).
Data Validation Tool compares data from source and target tables to ensure that they match. It provides column validation, row validation, schema validation, custom query validation, and ad hoc SQL exploration.
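A minimal sketch of the row/column-level diffing idea the first entry describes, using only pandas' built-in `DataFrame.compare` (available since pandas 1.1) rather than either tool's own API; the tools above add keyed joins and richer reports on top of the same idea:

```python
import pandas as pd

# Two versions of the "same" table, keyed by id.
source = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]}).set_index("id")
target = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.0, 30.0]}).set_index("id")

# One row per differing cell; columns 'self' (source) and 'other' (target)
# show the mismatching values, so discrepancies are visible per row and per column.
print(source.compare(target))
```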
Data Ingestion
CLI tool to copy data between databases with a single command. Supports 50+ sources, including PostgreSQL, MySQL, MongoDB, Salesforce, and Shopify, and can load into any data warehouse.
Kafka
File System
A high-performance Cloud-Native file system driven by object storage for large-scale data storage.
A bite-sized, lightweight HDFS-compatible file system built over Cassandra.
Seaweed-FS is a simple and highly scalable distributed file system with two objectives: to store billions of files, and to serve them fast. Instead of supporting full POSIX file system semantics, Seaweed-FS implements only a key-to-file mapping. Similar to the word "NoSQL", you can call it "NoFS".
Serialization Format
Stream Processing
Forever scalable event processing & in-memory durable K/V store as a library with asyncio & static typing (see the asyncio sketch after this list).
The streaming database built for IoT data storage and real-time processing.
A lightweight IoT data analytics and streaming engine for the edge, implemented in Go, that can run on all kinds of resource-constrained edge devices.
An API gateway built for event-driven architectures and streaming that supports standard protocols such as HTTP, SSE, gRPC, MQTT, and the native Kafka protocol.
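The first entry above points at the asyncio "agent" style of stream processing; the following dependency-free sketch (an illustration of the pattern, not that library's API) shows the idea: typed events flow through a queue and a long-running coroutine keeps in-memory keyed state. Real libraries back the queue with Kafka and make the state durable.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Order:
    account_id: str
    amount: float

async def process_orders(queue: "asyncio.Queue[Order]", totals: dict[str, float]) -> None:
    # The "agent": consume events forever, updating in-memory state per key.
    while True:
        order = await queue.get()
        totals[order.account_id] = totals.get(order.account_id, 0.0) + order.amount
        queue.task_done()

async def main() -> None:
    queue: "asyncio.Queue[Order]" = asyncio.Queue()
    totals: dict[str, float] = {}
    worker = asyncio.create_task(process_orders(queue, totals))
    for order in [Order("a", 10.0), Order("a", 5.0), Order("b", 2.5)]:
        await queue.put(order)
    await queue.join()   # wait until every queued event has been processed
    worker.cancel()
    print(totals)        # {'a': 15.0, 'b': 2.5}

asyncio.run(main())
```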
Batch Processing
Spark
A lightweight engine for general-purpose data processing, including both batch and stream analytics. It is based on a novel data model that represents data via functions and processes data via column operations, as opposed to only the set operations found in conventional approaches like MapReduce or SQL.
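To make the column-versus-set-operation contrast above concrete, here is a small pandas illustration (my own, not that engine's API): a column operation derives a new column as a function of existing ones, while a set operation produces a new set of rows.

```python
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"], "price": [10.0, 20.0, 5.0], "qty": [1, 2, 4]})

# Column operation: a function of existing columns, same number of rows out as in.
df["revenue"] = df["price"] * df["qty"]

# Set operation: SQL/MapReduce-style aggregation, a new (smaller) set of rows.
per_category = df.groupby("category", as_index=False)["revenue"].sum()
print(per_category)
```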
Charts and Dashboards
Flask, JS, and CSS boilerplate for interactive, web-based visualization apps in Python.
A modern, enterprise-ready business intelligence web application.
Workflow
End-to-end data pipeline tool that combines ingestion, transformation (SQL + Python), and data quality in a single CLI. Connects to BigQuery, Snowflake, PostgreSQL, Redshift, and more. Includes VS Code extension with live previews.
A system to programmatically author, schedule, and monitor data pipelines (see the DAG sketch after this list).
DAG-based workflow manager. Job flows are defined programmatically in Python. Supports output passing between jobs.
A lightweight library to define data transformations as a directed acyclic graph (DAG). If you like dbt for SQL transforms, you will like Hamilton for Python processing.
Scalable, event-driven, language-agnostic orchestration and scheduling platform to manage millions of workflows declaratively in code.
A warehouse-first Customer Data Platform that enables you to collect data from every application, website and SaaS platform, and then activate it in your warehouse and business tools.
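The "programmatically author, schedule, and monitor" entry above quotes Apache Airflow's tagline; assuming Airflow (2.4+ for the `schedule` argument), a minimal two-task DAG looks roughly like this, with illustrative task names and callables:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract() -> list:
    # The return value is pushed to XCom so downstream tasks can read it.
    return [1, 2, 3]

def load(ti) -> None:
    rows = ti.xcom_pull(task_ids="extract")
    print(f"loaded {len(rows)} rows")

with DAG(
    dag_id="example_pipeline",        # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # `schedule_interval` on Airflow < 2.4
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task         # load runs only after extract succeeds
```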
Data Lake Management
ELK (Elasticsearch, Logstash, Kibana)
Docker
A lightweight tool for easy deployment and rollback of dockerized applications.
Analyzes resource usage and performance characteristics of running containers.
Datasets
Realtime
Monitoring
Profiling
Data Profiler
Testing
A data catalog tool that integrates into your CI system, exposing downstream impact testing of data changes. These tests prevent data changes that might break data pipelines or BI dashboards from reaching production.
An open-source data quality platform covering the whole data lifecycle, from profiling new data sources to fully automated data quality monitoring.
Decorator-first DataFrame contracts/validation (columns/dtypes/constraints) at function boundaries. Supports Pandas/Polars/PyArrow/Modin.
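The last entry describes decorator-first DataFrame contracts; the sketch below shows the general idea with a hand-rolled decorator (not that library's API): declare the expected columns and dtypes once, and validate every DataFrame that crosses the function boundary.

```python
import functools
import pandas as pd

def expects(schema: dict[str, str]):
    """Fail fast if the first argument is missing columns or has wrong dtypes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(df: pd.DataFrame, *args, **kwargs):
            missing = set(schema) - set(df.columns)
            if missing:
                raise ValueError(f"missing columns: {sorted(missing)}")
            for col, dtype in schema.items():
                if str(df[col].dtype) != dtype:
                    raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

@expects({"user_id": "int64", "amount": "float64"})
def total_spend(df: pd.DataFrame) -> pd.Series:
    return df.groupby("user_id")["amount"].sum()

print(total_spend(pd.DataFrame({"user_id": [1, 1, 2], "amount": [3.0, 4.5, 2.0]})))
```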