Data Science
What is Data Science?
Agents
Frameworks
Tools
MCP server providing 13 data tools for AI agents: real-time crypto prices, IP geolocation, DNS lookups, web scraping to markdown, code execution, and screenshots. One API key for 40+ services.
61 production-ready AI API tools for data science workflows: code analysis, web scraping, NLP, image generation, crypto data, and search. REST API and MCP protocol support.
Research & Knowledge Retrieval
Training Resources
Tutorials
The Data Science Toolbox
General Machine Learning Packages
Deep Learning Packages
Miscellaneous Tools
A platform for reproducible and scalable machine learning and deep learning.
The Data Science Lifecycle Process is a process for taking data science teams from Idea to Value repeatedly and sustainably. The process is documented in this repo.
Template repository for data science lifecycle project
A graph sampling library for NetworkX with a Scikit-Learn like API.
An unsupervised machine learning extension library for NetworkX with a Scikit-Learn like API.
All-in-one web-based IDE for machine learning and data science. The workspace is deployed as a Docker container and is preloaded with a variety of popular data science libraries (e.g., TensorFlow, PyTorch) and dev tools (e.g., Jupyter, VS Code).
A Python-powered shell that enables integration, management and orchestration of data science libraries mostly written in Python, allowing you to build pipelines, code and command-based workflows. It can also be used as a kernel for Jupyter Notebook.
Lightweight Python library for fast and reproducible machine learning experimentation. Introduces a very simple interface that enables clean machine learning pipeline design.
Curated collection of neural networks, transformers, and models that make your machine learning work faster and more effective.
Fast DataFrame library for Rust and Python, designed as a faster alternative to Pandas
A service for exposing Apache Spark analytics jobs and machine learning models as real-time, batch, or reactive web services.
Intel Nervana reference deep learning framework committed to best performance on all hardware.
An open source data visualization platform helping everyone to create simple, correct and embeddable charts. Also at github.com
An open source framework for automated feature engineering written in Python.
Cleansing, pre-processing, feature engineering, exploratory data analysis and easy ML with PySpark backend.
A fast and framework-agnostic image augmentation library that implements a diverse set of augmentation techniques. Supports classification, segmentation, and detection out of the box. It was used to win a number of deep learning competitions at Kaggle and Topcoder, as well as competitions that were part of the CVPR workshops.
A workflow engine that significantly simplifies data analysis by combining in one analysis pipeline (i) feature engineering and machine learning, (ii) model training and prediction, and (iii) table population and column evaluation.
A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
Open-source data-intensive machine learning platform with a feature store. Ingest and manage features for both online (MySQL Cluster) and offline (Apache Hive) access, train and serve models at scale.
MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train, and use state-of-the-art ML models with as little as one line of code.
A PyTorch-based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly, with the goal of building predictive models with one line of code.
An open-source Python package that extends the power of the pandas library to AWS, connecting DataFrames and AWS data-related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, etc.).
An open source toolkit for using continuous integration in data science projects. Automatically train and test models in production-like environments with GitHub Actions & GitLab CI, and autogenerate visual reports on pull/merge requests.
A Julia-language backend combined with the Jupyter interactive environment
Platform to programmatically author, schedule, and monitor workflows
Open-source Python framework for creating reproducible, maintainable data science code
Lightweight library to author and manage reliable data transformations
Game theoretic approach to explain the output of any machine learning model
InterpretML implements the Explainable Boosting Machine (EBM), a modern, fully interpretable machine learning model based on Generalized Additive Models (GAMs). This open-source package also provides visualization tools for EBMs, other glass-box models, and black-box explanations
A Python library to ease preprocessing and feature engineering for tabular machine learning
An open-source project that automatically maps relationship networks by parsing public data using LLMs and visualizes it as an interactive graph.
Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, and more) and generating a terminal-style single-page HTML visualization.
Fast MATLAB-syntax runtime with automatic CPU/GPU execution and fused array kernels.
A terminal UI for experimenting with custom rule engines and selective LLM analysis on real-time data streams, without worrying about streaming infra or backpressure.
Open source “failure atlas” of 16 recurring issues in LLM and RAG pipelines, with observable symptoms and suggested fixes for data science teams.
An agentic LLM for autonomous data science that can complete a wide range of data science tasks without human intervention.
Python Data Science Handbook: full text in Jupyter Notebooks
A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
A Python library that helps you encode your unstructured data into embeddings.
Ever been frustrated with cleaning up long, messy Jupyter notebooks? With LineaPy, an open source Python library, it takes as little as two lines of code to transform messy development code into production pipelines.
Machine learning development environment for data science and AI/ML engineering teams
Python library for data-centric AI and automatically detecting various issues in ML datasets
AutoML to easily produce accurate predictions for image, text, tabular, time-series, and multi-modal data
An MLOps platform with experiment tracking, model production management, a model registry, and full data lineage to support your ML workflow from training straight through to production.
Evaluate, test, and ship LLM applications across your dev and production lifecycles.
Synthetic tabular data generation using GANs, Diffusion Models, and LLMs with adversarial filtering and privacy metrics.
Literature and Media
Bloggers
Blog for NLP and transfer learning!
Data science with esoteric programming languages
Journals, Publications and Magazines
YouTube Videos & Channels
Interviews with industry experts about production ML
Fun
Infographics
Datasets
Structured dataset tracking 92 AI-attributed workforce reduction events affecting 453,748 workers across 12 countries and 11 sectors. JSON and CSV formats. CC-BY-4.0 licensed.
The world's most comprehensive authoritative data source knowledge base. 210+ curated sources from governments, international organizations, and research institutions. MCP integration for AI agents. MIT licensed.
Other Awesome Lists
Table
Socialize
GitHub Groups
Data Science Central is the industry's single resource for Big Data practitioners.
Data Scientist, Author, Entrepreneur. Co-founder @DataCommunityDC. Founder @DistrictDataLab. #DataScience #BigData #DataDC
Twitter Accounts
Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University.
Data Scientist, Author, Entrepreneur. Co-founder @DataCommunityDC. Founder @DistrictDataLab. #DataScience #BigData #DataDC