Hadoop
Framework for distributed storage and processing of very large data sets.
Hadoop
Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
Python MapReduce library written in Cython.
mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
HDFS-DU is an interactive visualization of the Hadoop distributed file system.
Hadoop log aggregator and dashboard
Genie provides RESTful APIs to run Hadoop, Hive and Pig jobs, to manage multiple Hadoop resources, and to submit jobs across them.
Go-based toolkit for ETL and feature extraction on Hadoop
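As a rough illustration of the kind of job the Hadoop Streaming tools above help you write, here is a minimal word-count sketch in plain Python. It does not use the API of mrjob or of any other library listed here. The function names (`map_lines`, `reduce_pairs`, `run_local`) are hypothetical, and the in-process `sorted` call only stands in for the shuffle-and-sort step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_pairs(pairs):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process (assumption:
    a real cluster runs these phases on separate nodes)."""
    return list(reduce_pairs(sorted(map_lines(lines))))
```

In a real Streaming job the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; `run_local` is only a convenience for trying the logic on a small list of lines.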
NoSQL
A developer-friendly Python library to interact with Apache HBase.
Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase.
Secondary Index for HBase
Data Management
Workflow, Lifecycle and Governance
Data Ingestion and Integration
DSL
Machine learning and natural language processing with Apache Pig
Open Source Big Data Security Analytics
Mozilla's utility library for Hadoop, HBase, Pig, etc.
Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
Libraries and Tools
Native Go clients for Apache Hadoop YARN.
A native Go client for HDFS.
Web tool for the Confluent Schema Registry to create, view, search, evolve and configure the Avro schemas of your Kafka cluster, and to view their history.
Misc.
Hive Plugins
WebUI for query engines: Hive and Presto
(Perl - HiveServer2)
Python interface to Hive and Presto
An open-source unit test framework for Hadoop Hive queries, based on JUnit4.
A super simple utility for testing Apache Hive scripts locally for non-Java developers.
Unit test framework for Hive and hive-service.
Flume Plugins
Packaging, Provisioning and Monitoring
Benchmark
The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.