Data Science


What is Data Science?

Microsoft is pleased to offer a 10-week, 20-lesson curriculum all about Data Science.

Agents

Frameworks

Production-ready AI agent development kit for Rust with model-agnostic design (Gemini, OpenAI, Anthropic), multiple agent types (LLM, Graph, Workflow), MCP support, and built-in telemetry.

Tools

Frostbyte MCP updated 27d ago

MCP server providing 13 data tools for AI agents: real-time crypto prices, IP geolocation, DNS lookups, web scraping to markdown, code execution, and screenshots. One API key for 40+ services.

Arch Tools

61 production-ready AI API tools for data science workflows: code analysis, web scraping, NLP, image generation, crypto data, and search. REST API and MCP protocol support.

Research & Knowledge Retrieval

BGPT MCP 10 updated 9d ago

MCP server that gives AI agents access to a database of scientific papers built from raw experimental data extracted from full-text studies. Returns 25+ structured fields per paper including methods, results, sample sizes, and quality scores.

Training Resources

Tutorials

#tidytuesday 8.0k updated 10d ago

A weekly data project aimed at the R ecosystem.

Data science your way 617 updated 5y ago

PySpark Cheatsheet 666 updated 3y ago

Tutorials of source code from the book Genetic Algorithms with Python by Clinton Sheppard 1.3k updated 3y ago

Tutorials to get started on signal processing for machine learning 84 updated 3y ago

Minimum Viable Study Plan for Machine Learning Interviews

Free Courses

AI Expert Roadmap 30.9k updated 6mo ago

Roadmap to becoming an Artificial Intelligence Expert

MLSys-NYU-2022 548 updated 3y ago

Slides, scripts and materials for the Machine Learning in Finance course at NYU Tandon, 2022.

Hands-on Train and Deploy ML

A hands-on course to train and deploy a serverless API that predicts crypto prices.

MOOCs

Data Science Specialization 4.1k updated 5y ago

Colleges

A list of colleges and universities offering degrees in data science. 159 updated 5y ago

The Data Science Toolbox

Comparison

datacompy 639 updated 19d ago

DataComPy is a package to compare two Pandas DataFrames.
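To illustrate what a keyed comparison of two tables reports, here is a minimal sketch in plain Python. It does not use or reproduce the DataComPy API; the function name `compare_tables`, the row format (lists of dicts), and the sample data are all assumptions made for illustration.

```python
# Plain-Python sketch of a keyed table comparison (NOT the datacompy API).
def compare_tables(left, right, key):
    """Compare two lists of row dicts joined on `key`."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Keys present on one side only.
    only_left = sorted(left_by_key.keys() - right_by_key.keys())
    only_right = sorted(right_by_key.keys() - left_by_key.keys())

    # For shared keys, record every shared column whose values differ.
    mismatches = []
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        l_row, r_row = left_by_key[k], right_by_key[k]
        for column in sorted(l_row.keys() & r_row.keys()):
            if column != key and l_row[column] != r_row[column]:
                mismatches.append((k, column, l_row[column], r_row[column]))

    return {"only_left": only_left, "only_right": only_right, "mismatches": mismatches}


# Hypothetical sample data.
original = [
    {"id": 1, "amount": 10.0, "city": "Oslo"},
    {"id": 2, "amount": 20.0, "city": "Lima"},
    {"id": 3, "amount": 30.0, "city": "Pune"},
]
updated = [
    {"id": 2, "amount": 25.0, "city": "Lima"},
    {"id": 3, "amount": 30.0, "city": "Pune"},
    {"id": 4, "amount": 40.0, "city": "Kyiv"},
]

report = compare_tables(original, updated, key="id")
print(report)
```

A dedicated package would typically add numeric tolerances, duplicate-key handling, and a readable report on top of this basic idea.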

General Machine Learning Packages

scikit-multilearn 955 updated 2y ago

sklearn-expertsys 489 updated 8y ago

scikit-feature

scikit-rebate 421 updated 3y ago

seqlearn 704 updated 3y ago

sklearn-bayes 523 updated 4y ago

sklearn-crfsuite 434 updated 1mo ago

sklearn-deap 772 updated 2y ago

sigopt_sklearn 75 (archived)

sklearn-evaluation 3 updated 3y ago

scikit-image 6.5k updated 9d ago

scikit-opt

scikit-posthocs 382 updated 1mo ago

pystruct 670 updated 4y ago

xLearn 3.1k updated 2y ago

cuML 5.2k updated 8d ago

causalml 5.8k updated 12d ago

mlpack 5.6k updated 9d ago

MLxtend 5.1k updated 2mo ago

modAL 2.3k updated 2y ago

Sparkit-learn 1.1k updated 5y ago

hyperlearn 2.4k updated 1y ago

dlib 14.4k updated 15d ago

imodels 1.6k updated 1mo ago

jSciPy

A Java port of SciPy's signal processing module, offering filters, transformations, and other scientific computing utilities.

RuleFit 442 updated 2y ago

pyGAM 989 updated 2mo ago

Deepchecks 4.0k updated 3mo ago

XGBoost 28.2k updated 9d ago

LightGBM 18.2k updated 8d ago

CatBoost

PerpetualBooster 666 updated 27d ago

JAX 35.2k updated 8d ago

Deep Learning Packages

PyTorch 98.5k updated 9d ago

torchvision 17.6k updated 9d ago

torchtext 3.6k (archived)

torchaudio 2.8k updated 9d ago

ignite 4.7k updated 8d ago

PyTorchNet 1.7k updated 9d ago

PyToune 579 updated 11mo ago

skorch 6.2k updated 1mo ago

PyVarInf 362 updated 6y ago

pytorch_geometric 23.6k updated 10d ago

GPyTorch 3.9k updated 21d ago

pyro 9.0k updated 8mo ago

Catalyst 3.4k updated 9mo ago

pytorch_tabular 1.6k updated 9d ago

Yolov3 10.6k updated 15d ago

Yolov5 57.1k updated 15d ago

Yolov8 54.9k updated 8d ago

TensorFlow 194.4k updated 9d ago

TensorLayer 7.4k updated 3y ago

TFLearn 9.6k updated 1y ago

Sonnet 9.9k updated 1mo ago

tensorpack 6.3k updated 2y ago

TRFL 3.1k updated 3y ago

NeuPy 734 updated 1y ago

tfdeploy 355 updated 1y ago

tensorflow-upstream

TensorFlow Fold 1.8k (archived)

tensorlm 60 updated 3y ago

TensorLight 11 updated 3y ago

Mesh TensorFlow 1.6k (archived)

Ludwig 11.7k updated 16d ago

TF-Agents 3.0k updated 2mo ago

TensorForce 3.3k updated 1y ago

keras-contrib 1.6k (archived)

Hyperas 2.2k updated 3y ago

Elephas 1.6k updated 2y ago

Hera 490 updated 8y ago

Spektral 2.4k updated 2y ago

qkeras 577 updated 1mo ago

keras-rl 5.6k updated 2y ago

Talos 1.6k updated 1y ago

Netron 32.6k updated 9d ago

Resseract Lite 7 updated 1y ago

vizzu 2.0k updated 15d ago

TensorWatch 3.5k updated 15d ago

MetaReview

Free online meta-analysis platform with 11 interactive D3.js statistical charts (forest plot, funnel plot, Galbraith, L'Abbé, Baujat, etc.), 5 effect size measures, AI literature screening, and publication-ready report export.

Miscellaneous Tools

Polyaxon 3.7k updated 24d ago

A platform for reproducible and scalable machine learning and deep learning.

The Data Science Lifecycle Process 527 updated 4y ago

The Data Science Lifecycle Process is a process for taking data science teams from Idea to Value repeatedly and sustainably. The process is documented in this repo.

Data Science Lifecycle Template Repo 200 updated 5y ago

Template repository for data science lifecycle project

RexMex 276 updated 2y ago

A general purpose recommender metrics library for fair evaluation.

ChemicalX

A PyTorch based deep learning library for drug pair scoring.

PyTorch Geometric Temporal 3.0k updated 6mo ago

Representation learning on dynamic graphs.

Little Ball of Fur 713 updated 3mo ago

A graph sampling library for NetworkX with a Scikit-Learn like API.

Karate Club 2.3k updated 1y ago

An unsupervised machine learning extension library for NetworkX with a Scikit-Learn like API.

ML Workspace 3.5k updated 1y ago

All-in-one web-based IDE for machine learning and data science. The workspace is deployed as a Docker container and is preloaded with a variety of popular data science libraries (e.g., TensorFlow, PyTorch) and dev tools (e.g., Jupyter, VS Code).

xonsh shell 9.3k updated 9d ago

A Python-powered shell that enables integration, management and orchestration of data science libraries mostly written in Python, allowing you to build pipelines, code and command-based workflows. It can also be used as a kernel for Jupyter Notebook.

steppy 136 (archived)

Lightweight Python library for fast and reproducible machine learning experimentation. Introduces a very simple interface that enables clean machine learning pipeline design.

steppy-toolkit 22 (archived)

Curated collection of the neural networks, transformers and models that make your machine learning work faster and more effective.

Pandas GUI 3.3k updated 10mo ago

Pandas GUI

Polars 37.8k updated 10d ago

Fast DataFrame library for Rust and Python, designed as a faster alternative to Pandas

Hydrosphere Mist 325 updated 5y ago

A service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch, or reactive web services.

Nervana's Python-based Deep Learning Framework 3.9k (archived)

Intel Nervana reference deep learning framework committed to best performance on all hardware.

Skale 397 (archived)

High performance distributed data processing in NodeJS

Aerosolve

A machine learning package built for humans.

Intel framework 313 (archived)

Intel Deep Learning Framework

Datawrapper 1.4k updated 1y ago

An open source data visualization platform helping everyone to create simple, correct and embeddable charts.

Featuretools 7.6k updated 1mo ago

An open source framework for automated feature engineering written in Python

Optimus 1.5k updated 1y ago

Cleansing, pre-processing, feature engineering, exploratory data analysis and easy ML with PySpark backend.

Albumentations 15.3k (archived)

A fast and framework-agnostic image augmentation library that implements a diverse set of augmentation techniques. Supports classification, segmentation, and detection out of the box. It has been used to win a number of deep learning competitions at Kaggle, Topcoder, and CVPR workshops.

DVC 15.5k updated 9d ago

Open-source version control system for machine learning projects

Lambdo 25 updated 5y ago

A workflow engine that significantly simplifies data analysis by combining in one analysis pipeline (i) feature engineering and machine learning, (ii) model training and prediction, and (iii) table population and column evaluation.

Feast 6.8k updated 10d ago

A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.

Trains 6.6k updated 10d ago

Auto-Magical Experiment Manager, Version Control & DevOps for AI

Hopsworks 1.3k updated 1y ago

Open-source data-intensive machine learning platform with a feature store. Ingest and manage features for both online (MySQL Cluster) and offline (Apache Hive) access, train and serve models at scale.

MindsDB 38.8k updated 10d ago

MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train, and use state-of-the-art ML models with as little as one line of code.

Lightwood 502 updated 1mo ago

A PyTorch-based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly, with the objective of building predictive models with one line of code.

AWS Data Wrangler 4.1k updated 10d ago

An open-source Python package that extends the power of the Pandas library to AWS, connecting DataFrames and AWS data-related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, etc.).

CML 4.2k updated 10mo ago

An open source toolkit for using continuous integration in data science projects. Automatically train and test models in production-like environments with GitHub Actions & GitLab CI, and autogenerate visual reports on pull/merge requests.

DuckDB 36.9k updated 9d ago

An in-process SQL OLAP database management system

IJulia 2.9k updated 23d ago

A Julia-language backend combined with the Jupyter interactive environment

Apache Airflow 44.8k updated 9d ago

Platform to programmatically author, schedule, and monitor workflows
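Authoring a workflow programmatically usually means declaring tasks and their dependencies as a directed acyclic graph, then letting a scheduler run them in a valid order. The sketch below shows only that ordering idea using Python's standard `graphlib` module (Python 3.9+). It is not the Airflow API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean"},
    "publish": {"train", "report"},
}


def run_order(deps):
    """Return one valid execution order for a dependency graph."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring around this core; tasks with no mutual dependency (here `train` and `report`) could also run in parallel.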

Prefect 21.9k updated 8d ago

Workflow management system for modern data stacks

Kedro 10.8k updated 9d ago

Open-source Python framework for creating reproducible, maintainable data science code

Hamilton 2.4k updated 10d ago

Lightweight library to author and manage reliable data transformations

SHAP 25.2k updated 21d ago

Game theoretic approach to explain the output of any machine learning model
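The game-theoretic idea is the Shapley value: a feature's attribution is its marginal contribution to the prediction, averaged over every order in which features could be added. The brute-force sketch below computes exact Shapley values for a toy value function. It is not the SHAP library's API, and `toy_model` and the feature names are assumptions for illustration; enumerating all orderings is only practical for a handful of features.

```python
from itertools import permutations


def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = set()
        for p in ordering:
            before = value_fn(frozenset(coalition))
            coalition.add(p)
            after = value_fn(frozenset(coalition))
            totals[p] += after - before
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_model(present):
    """Hypothetical model output given the set of features that are 'present'."""
    score = 0.0
    if "age" in present:
        score += 2.0
    if "income" in present:
        score += 1.0
    if "age" in present and "income" in present:
        score += 1.0  # interaction term shared between the two features
    return score


phi = shapley_values(toy_model, ["age", "income", "city"])
print(phi)
```

The attributions sum to the difference between the model output with all features and with none, which is the efficiency property that makes Shapley values attractive for explanations.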

InterpretML 6.8k updated 10d ago

InterpretML implements the Explainable Boosting Machine (EBM), a modern, fully interpretable machine learning model based on Generalized Additive Models (GAMs). This open-source package also provides visualization tools for EBMs, other glass-box models, and black-box explanations

LIME 12.1k updated 1y ago

Explaining the predictions of any machine learning classifier

flyte 6.9k updated 9d ago

Workflow automation platform for machine learning

dbt 12.5k updated 8d ago

Data build tool

zasper 2.3k updated 23d ago

Supercharged IDE for Data Science

skrub 1.6k updated 9d ago

A Python library to ease preprocessing and feature engineering for tabular machine learning

Chinese-Elite 68 updated 9d ago

An open-source project that automatically maps relationship networks by parsing public data using LLMs and visualizes it as an interactive graph.

dna-claude-analysis 25 updated 28d ago

Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, and more) and generating a terminal-style single-page HTML visualization.

RunMat

Fast MATLAB-syntax runtime with automatic CPU/GPU execution and fused array kernels.

Turbostream 16 updated 1mo ago

A terminal UI for experimenting with custom rule engines and selective LLM analysis on real-time data streams, without worrying about streaming infra or backpressure.

WFGY ProblemMap 1.7k updated 9d ago

Open source “failure atlas” of 16 recurring issues in LLM and RAG pipelines, with observable symptoms and suggested fixes for data science teams.

DeepAnalyze 3.9k updated 9d ago

An agentic LLM for autonomous data science that can complete a wide range of data science tasks without human intervention.

Python Data Science Handbook 47.1k updated 1y ago

Python Data Science Handbook: full text in Jupyter Notebooks

Shapley 224 updated 3mo ago

A data-driven framework to quantify the value of classifiers in a machine learning ensemble.

Towhee 3.5k updated 1y ago

A Python library that helps you encode your unstructured data into embeddings.

LineaPy 669 updated 1y ago

Ever been frustrated with cleaning up long, messy Jupyter notebooks? With LineaPy, an open source Python library, it takes as little as two lines of code to transform messy development code into production pipelines.

envd 2.2k updated 14d ago

Machine learning development environment for data science and AI/ML engineering teams

MLEM 718 (archived)

Version and deploy your ML models following GitOps principles

cleanlab 11.4k updated 2mo ago

Python library for data-centric AI and automatically detecting various issues in ML datasets

AutoGluon 10.1k updated 10d ago

AutoML to easily produce accurate predictions for image, text, tabular, time-series, and multi-modal data

Comet 171 updated 13d ago

An MLOps platform with experiment tracking, model production management, a model registry, and full data lineage to support your ML workflow from training straight through to production.

Opik 18.5k updated 8d ago

Evaluate, test, and ship LLM applications across your dev and production lifecycles.

teeplot 12 updated 1y ago

Workflow tool to automatically organize data visualization output

Streamlit 44.0k updated 8d ago

App framework for Machine Learning and Data Science projects

Gradio 42.1k updated 8d ago

Create customizable UI components around machine learning models

Weights & Biases 10.9k updated 9d ago

Experiment tracking, dataset versioning, and model management

Optuna 13.8k updated 8d ago

Automatic hyperparameter optimization software framework
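At its simplest, hyperparameter optimization is a loop that proposes parameter values, evaluates an objective, and keeps the best trial. The sketch below uses plain random search with the standard library. It is not Optuna's API or its sampling algorithms, and the objective, search space, and function names are hypothetical.

```python
import random


def random_search(objective, space, n_trials, seed=0):
    """Sample each parameter uniformly from its (low, high) range; keep the best trial."""
    rng = random.Random(seed)
    best_params, best_value = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        value = objective(params)
        if value < best_value:
            best_params, best_value = params, value
    return best_params, best_value


def objective(params):
    """Hypothetical loss with its minimum at x=2, y=-1."""
    return (params["x"] - 2.0) ** 2 + (params["y"] + 1.0) ** 2


space = {"x": (-5.0, 5.0), "y": (-5.0, 5.0)}
best_params, best_value = random_search(objective, space, n_trials=200)
print(best_params, best_value)
```

Dedicated frameworks generally improve on this by using earlier trial results to choose the next candidates and by stopping unpromising trials early.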

Ray Tune 41.8k updated 9d ago

Scalable hyperparameter tuning library

Chaos Genius 775 (archived)

ML powered analytics engine for outlier/anomaly detection and root cause analysis

Literature and Media

Books

Interpretable Machine Learning: A Guide for Making Black Box Models Explainable 5.2k updated 2mo ago

Free GitHub version

JavaScript for Data Science

Free HTML page

NYC Taxi Visualization Blog 456 updated 1y ago

https://chriswhong.github.io/nyctaxi/

i am trask

A Machine Learning Craftsmanship Blog

Colah's Blog

Blog for understanding Neural Networks!

Distill

Dedicated to clear explanations of machine learning!

Chris Albon's Website

Data Science and AI notes

floydhub

Blog for Evolutionary Algorithms

Jingles 14 updated 2y ago

Review and extract key concepts from academic papers

Loic Tetrel updated 2mo ago

Data science blog

Mlu github 73 updated 5mo ago

MLU is developed by Amazon to help people in the ML space; you can learn everything from the basics here, with live diagrams.

Sebastian's Blog

Blog for NLP and transfer learning!

Andrew Carr

Data Science with Esoteric programming languages

Presentations

How to Share Data with a Statistician 6.7k updated 1y ago

Fun

Infographics

Choosing the Right Estimator 65.5k updated 9d ago

From https://scikit-learn.org/1.5/machinelearningmap.html#choosing-the-right-estimator

Datasets

AI Displacement Tracker

Structured dataset tracking 92 AI-attributed workforce reduction events affecting 453,748 workers across 12 countries and 11 sectors. JSON and CSV formats. CC-BY-4.0 licensed.

Open Data Sources

Public Git Archive 343 updated 6y ago

NAYN.CO Turkish News with categories 3 updated 6y ago

Covid-19 1.2k updated 26d ago

Covid-19 Google 119 updated 4y ago

5000 Images of Clothes 115 updated 5y ago

FirstData 144 updated 9d ago

The world's most comprehensive authoritative data source knowledge base. 210+ curated sources from governments, international organizations, and research institutions. MCP integration for AI agents. MIT licensed.