Awesome Apache Spark — Project Awesome

Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. Its immutable data model encourages reproducible notebooks. It originated at Netflix.

sparkmagic 1.4k updated 10mo ago

Jupyter magics and kernels for working interactively with remote Spark clusters through Livy.
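To make "through Livy" concrete, here is a minimal sketch of the two REST requests an interactive Livy client typically issues: one to create a session and one to submit a statement. It is not sparkmagic's implementation. The server address `livy.example.com`, the helper `livy_post`, and the placeholder session id are assumptions for illustration; the endpoints `/sessions` and `/sessions/{id}/statements` come from Livy's REST API.

```python
import json
from urllib.request import Request

# Assumption: a Livy server at this hypothetical address
# (8998 is Livy's default port).
LIVY_URL = "http://livy.example.com:8998"


def livy_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Step 1: ask Livy to start an interactive PySpark session.
create_session = livy_post("/sessions", {"kind": "pyspark"})

# Step 2: submit code to that session. The real id comes from the
# response to step 1; 0 is only a placeholder here.
session_id = 0
run_statement = livy_post(
    f"/sessions/{session_id}/statements",
    {"code": "spark.range(10).count()"},
)

# Sending is left out so the sketch runs without a cluster;
# urllib.request.urlopen(create_session) would perform the call.
```

In a notebook, the magics and kernels hide these calls, so the user only writes the Spark code itself.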

General Purpose Libraries

itachi 61 updated 2y ago

A library that brings useful functions from modern database management systems to Apache Spark.

spark-daria 766 updated 10mo ago

A Scala library with essential Spark functions and extensions to make you more productive.

quinn 685 updated 1y ago

A native PySpark implementation of spark-daria.

Apache DataFu 122 updated 1y ago

A library of general-purpose functions and UDFs.

Joblib Apache Spark Backend 249 updated 4mo ago

joblib backend for running tasks on Spark clusters.

SQL Data Sources

Spark XML 512 (archived)

XML parser and writer.

Spark Cassandra Connector 1.9k updated 1y ago

Cassandra support, including a data source, an API, and arbitrary queries.

Mongo-Spark 726 updated 4mo ago

Official MongoDB connector.

Storage

Delta Lake 8.6k updated 4mo ago

Storage layer with ACID transactions.

Apache Hudi 6.2k updated 3mo ago

Upserts, deletes, and incremental processing on big data.

Apache Iceberg 8.7k updated 4mo ago

Open table format for huge analytic datasets.

lakeFS 5.3k updated 3mo ago

Integration with the lakeFS atomic versioned storage layer.

Bioinformatics

ADAM 1.0k updated 4mo ago

Set of tools designed to analyse genomics data.

Hail 1.1k updated 4mo ago

Genetic analysis framework.

GIS

Apache Sedona 2.3k updated 4mo ago

Cluster computing system for processing large-scale spatial data.

Graph Processing

GraphFrames 1.2k updated 3mo ago

DataFrame-based graph API.

neo4j-spark-connector 317 updated 4mo ago

Bolt-protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support.

Machine Learning Extension

JPMML-Spark 99 updated 5mo ago

PMML transformer library for Spark ML.

ModelDB 1.7k updated 2y ago

A system to manage machine learning models for spark.ml and scikit-learn.

Sparkling Water 977 updated 8mo ago

H2O interoperability layer.

BigDL 8.7k (archived)

Distributed Deep Learning library.

MLeap 1.5k updated 4mo ago

Execution engine and serialization format that supports deploying Spark ML (org.apache.spark.ml) models without a dependency on SparkSession.

Microsoft ML for Apache Spark 5.2k updated 3mo ago

A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, deep learning, Cognitive Services, and model deployment.

MLflow 24.9k updated 4mo ago

Machine learning orchestration platform.