Data Science
What is Data Science?
Agents
Tools
MCP server providing 13 data tools for AI agents: real-time crypto prices, IP geolocation, DNS lookups, web scraping to markdown, code execution, and screenshots. One API key for 40+ services.
61 production-ready AI API tools for data science workflows: code analysis, web scraping, NLP, image generation, crypto data, and search. REST API and MCP protocol support.
Research & Knowledge Retrieval
Training Resources
Tutorials
A weekly data project aimed at the R ecosystem.
Free Courses
Roadmap to becoming an Artificial Intelligence Expert
Slides, scripts and materials for the Machine Learning in Finance course at NYU Tandon, 2022.
A hands-on course to train and deploy a serverless API that predicts crypto prices.
The Data Science Toolbox
General Machine Learning Packages
A Java port of SciPy's signal processing module, offering filters, transformations, and other scientific computing utilities.
Deep Learning Packages
Free online meta-analysis platform with 11 interactive D3.js statistical charts (forest plot, funnel plot, Galbraith, L'Abbé, Baujat, etc.), 5 effect size measures, AI literature screening, and publication-ready report export.
Miscellaneous Tools
A platform for reproducible and scalable machine learning and deep learning.
The Data Science Lifecycle Process is a process for taking data science teams from idea to value repeatedly and sustainably. The process is documented in this repo.
Template repository for a data science lifecycle project
A general purpose recommender metrics library for fair evaluation.
A PyTorch based deep learning library for drug pair scoring.
Representation learning on dynamic graphs.
A graph sampling library for NetworkX with a Scikit-Learn like API.
An unsupervised machine learning extension library for NetworkX with a Scikit-Learn like API.
All-in-one web-based IDE for machine learning and data science. The workspace is deployed as a Docker container and is preloaded with a variety of popular data science libraries (e.g., TensorFlow, PyTorch) and dev tools (e.g., Jupyter, VS Code).
A Python-powered shell that enables integration, management and orchestration of data science libraries mostly written in Python, allowing you to build pipelines, code and command-based workflows. It can also be used as a kernel for Jupyter Notebook.
Lightweight Python library for fast and reproducible machine learning experimentation. Introduces a very simple interface that enables clean machine learning pipeline design.
Curated collection of the neural networks, transformers and models that make your machine learning work faster and more effective.
Pandas GUI
Fast DataFrame library for Rust and Python, designed as a faster alternative to Pandas
A service for exposing Apache Spark analytics jobs and machine learning models as real-time, batch, or reactive web services.
Intel Nervana reference deep learning framework committed to best performance on all hardware.
High-performance distributed data processing in Node.js
A machine learning package built for humans.
Intel Deep Learning Framework
An open source data visualization platform helping everyone to create simple, correct and embeddable charts.
An open source framework for automated feature engineering written in Python
Cleansing, pre-processing, feature engineering, exploratory data analysis and easy ML with PySpark backend.
A fast and framework-agnostic image augmentation library that implements a diverse set of augmentation techniques. Supports classification, segmentation, and detection out of the box. It has been used to win a number of deep learning competitions on Kaggle and Topcoder and at CVPR workshops.
Open-source version control system for machine learning projects
A workflow engine that significantly simplifies data analysis by combining in one analysis pipeline (i) feature engineering and machine learning, (ii) model training and prediction, and (iii) table population and column evaluation.
A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.
Auto-Magical Experiment Manager, Version Control & DevOps for AI
Open-source data-intensive machine learning platform with a feature store. Ingest and manage features for both online (MySQL Cluster) and offline (Apache Hive) access, train and serve models at scale.
MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state-of-the-art ML models with as little as one line of code.
A PyTorch-based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly, with the objective of building predictive models with one line of code.
An open-source Python package that extends the power of the Pandas library to AWS, connecting DataFrames and AWS data-related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, etc).
An open source toolkit for using continuous integration in data science projects. Automatically train and test models in production-like environments with GitHub Actions & GitLab CI, and autogenerate visual reports on pull/merge requests.
An in-process SQL OLAP database management system
A Julia-language backend combined with the Jupyter interactive environment
Platform to programmatically author, schedule, and monitor workflows
Workflow management system for modern data stacks
Open-source Python framework for creating reproducible, maintainable data science code
Lightweight library to author and manage reliable data transformations
Game theoretic approach to explain the output of any machine learning model
InterpretML implements the Explainable Boosting Machine (EBM), a modern, fully interpretable machine learning model based on Generalized Additive Models (GAMs). This open-source package also provides visualization tools for EBMs, other glass-box models, and black-box explanations
Explaining the predictions of any machine learning classifier
Workflow automation platform for machine learning
Data build tool
Supercharged IDE for Data Science
A Python library to ease preprocessing and feature engineering for tabular machine learning
An open-source project that automatically maps relationship networks by parsing public data using LLMs and visualizes it as an interactive graph.
Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, and more) and generating a terminal-style single-page HTML visualization.
Fast MATLAB-syntax runtime with automatic CPU/GPU execution and fused array kernels.
A terminal UI for experimenting with custom rule engines and selective LLM analysis on real-time data streams, without worrying about streaming infra or backpressure.
Open source “failure atlas” of 16 recurring issues in LLM and RAG pipelines, with observable symptoms and suggested fixes for data science teams.
An agentic LLM for autonomous data science that can complete a wide range of data science tasks without human intervention.
Python Data Science Handbook: full text in Jupyter Notebooks
A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
A Python library that helps you encode your unstructured data into embeddings.
Ever been frustrated with cleaning up long, messy Jupyter notebooks? With LineaPy, an open source Python library, it takes as little as two lines of code to transform messy development code into production pipelines.
Machine learning development environment for data science and AI/ML engineering teams
Version and deploy your ML models following GitOps principles
Python library for data-centric AI and automatically detecting various issues in ML datasets
AutoML to easily produce accurate predictions for image, text, tabular, time-series, and multi-modal data
An MLOps platform with experiment tracking, model production management, a model registry, and full data lineage to support your ML workflow from training straight through to production.
Evaluate, test, and ship LLM applications across your dev and production lifecycles.
Workflow tool to automatically organize data visualization output
App framework for Machine Learning and Data Science projects
Create customizable UI components around machine learning models
Experiment tracking, dataset versioning, and model management
Automatic hyperparameter optimization software framework
Scalable hyperparameter tuning library
ML powered analytics engine for outlier/anomaly detection and root cause analysis
Literature and Media
Books
https://chriswhong.github.io/nyctaxi/
A Machine Learning Craftsmanship Blog
Blog for understanding Neural Networks!
Dedicated to clear explanations of machine learning!
Data Science and AI notes
Blog for Evolutionary Algorithms
Review and extract key concepts from academic papers
Data science blog
MLU is developed by Amazon to help people in the ML space; you can learn everything from the basics here, with live diagrams.
Blog for NLP and transfer learning!
Data Science with Esoteric programming languages
Fun
Datasets
Structured dataset tracking 92 AI-attributed workforce reduction events affecting 453,748 workers across 12 countries and 11 sectors. JSON and CSV formats. CC-BY-4.0 licensed.
A knowledge base of authoritative data sources: 210+ curated sources from governments, international organizations, and research institutions. MCP integration for AI agents. MIT licensed.
Other Awesome Lists
A curated list of data analysis tools, libraries and resources.
Socialize
GitHub Groups
Data Science Central is the industry's single resource for Big Data practitioners.
Data science instructor, and founder of Data School
Data Scientist, Author, Entrepreneur. Co-founder @DataCommunityDC. Founder @DistrictDataLab. #DataScience #BigData #DataDC
Pandas (Python Data Analysis library).