Awesome Apache Spark — Project Awesome

Polynote: an IDE-inspired polyglot notebook. It supports mixing multiple languages in one notebook and sharing data between them seamlessly. It encourages reproducible notebooks with its immutable data model. It originated at Netflix.

sparkmagic 1.4k updated 6mo ago

Jupyter magics and kernels for working interactively with remote Spark clusters through Livy from Jupyter notebooks.

General Purpose Libraries

itachi 61 updated 2y ago

A library that brings useful functions from modern database management systems to Apache Spark.

spark-daria 766 updated 6mo ago

A Scala library with essential Spark functions and extensions to make you more productive.

quinn 685 updated 1y ago

A native PySpark implementation of spark-daria.

Apache DataFu 122 updated 10mo ago

A library of general-purpose functions and UDFs.

Joblib Apache Spark Backend 249 updated 9d ago

joblib backend for running tasks on Spark clusters.

SQL Data Sources

Spark XML 512 (archived)

XML parser and writer.

Spark Cassandra Connector 1.9k updated 11mo ago

Cassandra support, including a data source, an API, and support for arbitrary queries.

Mongo-Spark 726 updated 17d ago

Official MongoDB connector.

Storage

Delta Lake 8.6k updated 10d ago

Storage layer with ACID transactions.

Apache Hudi

Upserts, deletes, and incremental processing on big data.

Apache Iceberg 8.7k updated 9d ago

Open table format for huge analytic datasets.

lakeFS 5.2k updated 9d ago

Integration with the lakeFS atomic versioned storage layer.

Bioinformatics

ADAM 1.0k updated 16d ago

Set of tools designed to analyse genomics data.

Hail 1.1k updated 10d ago

Genetic analysis framework.

GIS

Apache Sedona 2.3k updated 10d ago

Cluster computing system for processing large-scale spatial data.

Graph Processing

GraphFrames 1.1k updated 16d ago

DataFrame-based graph API.

neo4j-spark-connector 317 updated 9d ago

Bolt-protocol-based Neo4j connector with RDD, DataFrame, and GraphX / GraphFrames support.

Machine Learning Extension

JPMML-Spark 99 updated 1mo ago

PMML transformer library for Spark ML.

ModelDB 1.7k updated 1y ago

A system to manage machine learning models for spark.ml and scikit-learn.

Sparkling Water 977 updated 4mo ago

H2O interoperability layer.

BigDL 8.7k (archived)

Distributed Deep Learning library.

MLeap 1.5k updated 23d ago

Execution engine and serialization format that supports deploying org.apache.spark.ml models without a dependency on SparkSession.

Microsoft ML for Apache Spark 5.2k updated 10d ago

A distributed ML library with support for LightGBM, Vowpal Wabbit, OpenCV, deep learning, Cognitive Services, and model deployment.

MLflow 24.9k updated 9d ago

Machine learning orchestration platform.