Data Science

Collection 29.0k stars GitHub

What is Data Science?

Data Science For Beginners 35.2k updated 3mo ago

Microsoft are pleased to offer a 10-week, 20-lesson curriculum all about Data Science.

Agents

Frameworks

Production-ready AI agent development kit for Rust with model-agnostic design (Gemini, OpenAI, Anthropic), multiple agent types (LLM, Graph, Workflow), MCP support, and built-in telemetry.

Tools

Frostbyte MCP updated 4mo ago

MCP server providing 13 data tools for AI agents: real-time crypto prices, IP geolocation, DNS lookups, web scraping to markdown, code execution, and screenshots. One API key for 40+ services.

Arch Tools

61 production-ready AI API tools for data science workflows: code analysis, web scraping, NLP, image generation, crypto data, and search. REST API and MCP protocol support.

Not Human Search updated 2mo ago

Search engine for AI agents that indexes 9,000+ AI tools and APIs, scoring each on agentic readiness (llms.txt, OpenAPI, MCP, ai-plugin.json). REST API and MCP server for programmatic tool discovery.

DeepAlpha 7 updated 2mo ago

AI crypto trading framework using LightGBM + XGBoost ensemble with 72 ML features. 70.9% walk-forward validated accuracy on out-of-sample data. Supports Bybit and Binance. MIT licensed, available on PyPI.
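
Walk-forward validation, as mentioned above, evaluates a model only on data that comes after everything it was trained on. The following is a minimal sketch of how such splits can be generated, assuming an expanding training window. The function name `walk_forward_splits` and its parameters are hypothetical and are not taken from DeepAlpha.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_samples: int, initial_train: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Hypothetical helper for illustration: every test block lies strictly
    after the training block, so no future data leaks into training.
    """
    if initial_train <= 0 or test_size <= 0:
        raise ValueError("initial_train and test_size must be positive")
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        # Expand the training window to include the block just tested.
        train_end += test_size


splits = list(walk_forward_splits(n_samples=10, initial_train=4, test_size=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

A rolling (fixed-length) window is an equally common variant; it would drop the oldest samples from `train_idx` at each step instead of keeping them.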

Research & Knowledge Retrieval

BGPT MCP 17 updated 3mo ago

MCP server that gives AI agents access to a database of scientific papers built from raw experimental data extracted from full-text studies. Returns 25+ structured fields per paper including methods, results, sample sizes, and quality scores.

Training Resources

Tutorials

#tidytuesday 8.1k updated 2mo ago

A weekly data project aimed at the R ecosystem.

Data science your way 617 updated 5y ago

PySpark Cheatsheet 666 updated 3y ago

Tutorials of source code from the book Genetic Algorithms with Python by Clinton Sheppard 1.3k updated 3y ago

Tutorials to get started on signal processing for machine learning 84 updated 3y ago

Minimum Viable Study Plan for Machine Learning Interviews 12.5k updated 2y ago

Free Courses

AI Expert Roadmap 30.9k updated 10mo ago

Roadmap to becoming an Artificial Intelligence Expert

MLSys-NYU-2022 548 updated 3y ago

Slides, scripts and materials for the Machine Learning in Finance course at NYU Tandon, 2022.

Hands-on Train and Deploy ML 882 updated 2y ago

A hands-on course to train and deploy a serverless API that predicts crypto prices.

MOOCs

Data Science Specialization 4.1k updated 5y ago

Colleges

A list of colleges and universities offering degrees in data science. 159 updated 5y ago

The Data Science Toolbox

Comparison

datacompy 639 updated 4mo ago

DataComPy is a package to compare two Pandas DataFrames.
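
To illustrate what comparing two tables involves, here is a library-free sketch that matches rows on a key column and reports unmatched rows and differing values. It does not use DataComPy or Pandas; the function `compare_rows` and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def compare_rows(
    left: List[Dict[str, Any]], right: List[Dict[str, Any]], key: str
) -> Dict[str, Any]:
    """Compare two lists of row dicts joined on `key` (hypothetical helper)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Rows whose key appears on one side only.
    only_left = sorted(left_by_key.keys() - right_by_key.keys())
    only_right = sorted(right_by_key.keys() - left_by_key.keys())

    # For keys present on both sides, collect columns whose values differ.
    mismatches: Dict[Any, Dict[str, tuple]] = {}
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        l_row, r_row = left_by_key[k], right_by_key[k]
        diff = {
            col: (l_row.get(col), r_row.get(col))
            for col in sorted(set(l_row) | set(r_row))
            if l_row.get(col) != r_row.get(col)
        }
        if diff:
            mismatches[k] = diff

    return {"only_left": only_left, "only_right": only_right, "mismatches": mismatches}


left = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}, {"id": 3, "amount": 30.0}]
right = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}, {"id": 4, "amount": 40.0}]
report = compare_rows(left, right, key="id")
print(report)
```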

General Machine Learning Packages

scikit-multilearn 955 updated 2y ago

sklearn-expertsys 489 updated 9y ago

scikit-feature 1.6k updated 2y ago

scikit-rebate 421 updated 3y ago

seqlearn 704 updated 3y ago

sklearn-bayes 523 updated 4y ago

sklearn-crfsuite 434 updated 5mo ago

sklearn-deap 772 updated 2y ago

sigopt_sklearn 75 (archived)

sklearn-evaluation 3 updated 3y ago

scikit-image 6.5k updated 4mo ago

scikit-opt 6.5k updated 4mo ago

scikit-posthocs 382 updated 5mo ago

pystruct 670 updated 4y ago

xLearn 3.1k updated 2y ago

cuML 5.2k updated 4mo ago

causalml 5.8k updated 4mo ago

mlpack 5.6k updated 4mo ago

MLxtend 5.1k updated 6mo ago

modAL 2.3k updated 2y ago

Sparkit-learn 1.1k updated 5y ago

hyperlearn 2.4k updated 1y ago

dlib 14.4k updated 4mo ago

imodels 1.6k updated 2mo ago

jSciPy 20 updated 5mo ago

A Java port of SciPy's signal processing module, offering filters, transformations, and other scientific computing utilities.

RuleFit 442 updated 2y ago

pyGAM 989 updated 6mo ago

Deepchecks 4.0k updated 7mo ago

XGBoost 28.2k updated 4mo ago

LightGBM 18.3k updated 2mo ago

CatBoost 8.9k updated 2mo ago

PerpetualBooster 666 updated 4mo ago

JAX 35.5k updated 2mo ago

Deep Learning Packages

PyTorch 98.5k updated 4mo ago

torchvision 17.6k updated 4mo ago

torchtext 3.6k (archived)

torchaudio 2.8k updated 4mo ago

ignite 4.7k updated 4mo ago

PyTorchNet 1.7k updated 4mo ago

PyToune 579 updated 1y ago

skorch 6.2k updated 3mo ago

PyVarInf 362 updated 6y ago

pytorch_geometric 23.7k updated 2mo ago

GPyTorch 3.9k updated 4mo ago

pyro 9.0k updated 1y ago

Catalyst 3.4k updated 1y ago

pytorch_tabular 1.6k updated 4mo ago

Yolov3 10.6k updated 2mo ago

Yolov5 57.3k updated 2mo ago

Yolov8 56.5k updated 2mo ago

TensorFlow 194.9k updated 2mo ago

TensorLayer 7.4k updated 3y ago

TFLearn 9.6k updated 2y ago

Sonnet 9.9k updated 5mo ago

tensorpack 6.3k updated 3y ago

TRFL 3.1k updated 3y ago

NeuPy 734 updated 1y ago

tfdeploy 355 updated 1y ago

tensorflow-upstream 702 updated 2mo ago

TensorFlow Fold 1.8k (archived)

tensorlm 60 updated 4y ago

TensorLight 11 updated 3y ago

Mesh TensorFlow 1.6k (archived)

Ludwig 11.7k updated 2mo ago

TF-Agents 3.0k updated 6mo ago

TensorForce 3.3k updated 2y ago

keras-contrib 1.6k (archived)

Hyperas 2.2k updated 3y ago

Elephas 1.6k updated 3y ago

Hera 490 updated 9y ago

Spektral 2.4k updated 2y ago

qkeras 577 updated 5mo ago

keras-rl 5.6k updated 2y ago

Talos 1.6k updated 2y ago

Netron 32.6k updated 4mo ago

Resseract Lite 7 updated 1y ago

vizzu 2.0k updated 4mo ago

TensorWatch 3.5k updated 3mo ago

MetaReview

Free online meta-analysis platform with 11 interactive D3.js statistical charts (forest plot, funnel plot, Galbraith, L'Abbé, Baujat, etc.), 5 effect size measures, AI literature screening, and publication-ready report export.

Miscellaneous Tools

Polyaxon 3.7k updated 3mo ago

A platform for reproducible and scalable machine learning and deep learning.

The Data Science Lifecycle Process 527 updated 5y ago

The Data Science Lifecycle Process is a process for taking data science teams from Idea to Value repeatedly and sustainably. The process is documented in this repo.

Data Science Lifecycle Template Repo 200 updated 6y ago

Template repository for data science lifecycle project

RexMex 276 updated 2y ago

A general purpose recommender metrics library for fair evaluation.

ChemicalX 776 updated 2y ago

A PyTorch based deep learning library for drug pair scoring.

PyTorch Geometric Temporal 3.0k updated 10mo ago

Representation learning on dynamic graphs.

Little Ball of Fur 713 updated 7mo ago

A graph sampling library for NetworkX with a Scikit-Learn like API.

Karate Club 2.3k updated 2y ago

An unsupervised machine learning extension library for NetworkX with a Scikit-Learn like API.

ML Workspace 3.5k updated 2y ago

All-in-one web-based IDE for machine learning and data science. The workspace is deployed as a Docker container and is preloaded with a variety of popular data science libraries (e.g., Tensorflow, PyTorch) and dev tools (e.g., Jupyter, VS Code)

xonsh shell 9.3k updated 2mo ago

A Python-powered shell that enables integration, management and orchestration of data science libraries mostly written in Python, allowing you to build pipelines, code and command-based workflows. It can also be used as a kernel for Jupyter Notebook.

steppy 136 (archived)

Lightweight Python library for fast and reproducible machine learning experimentation. Introduces a very simple interface that enables clean machine learning pipeline design.

steppy-toolkit 22 (archived)

Curated collection of the neural networks, transformers and models that make your machine learning work faster and more effective.

Pandas GUI 3.3k updated 1y ago

Pandas GUI

Polars 37.8k updated 4mo ago

Fast DataFrame library for Rust and Python, designed as a faster alternative to Pandas

Hydrosphere Mist 325 updated 5y ago

A service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services.

Nervana's Python-based Deep Learning Framework 3.9k (archived)

Intel Nervana reference deep learning framework committed to best performance on all hardware.

Skale 397 (archived)

High performance distributed data processing in NodeJS

Aerosolve

A machine learning package built for humans.

Intel framework 313 (archived)

Intel Deep Learning Framework

Datawrapper 1.4k updated 1y ago

An open source data visualization platform helping everyone to create simple, correct and embeddable charts. Also at github.com

Featuretools 7.6k updated 5mo ago

An open source framework for automated feature engineering written in Python

Optimus 1.5k updated 1y ago

Cleansing, pre-processing, feature engineering, exploratory data analysis and easy ML with PySpark backend.

Albumentations 15.3k (archived)

A fast and framework-agnostic image augmentation library that implements a diverse set of augmentation techniques. Supports classification, segmentation, and detection out of the box. Was used to win a number of Deep Learning competitions at Kaggle, Topcoder and those that were a part of the CVPR workshops.

DVC 15.5k updated 4mo ago

Open-source version control system for machine learning projects

Lambdo 25 updated 5y ago

A workflow engine that significantly simplifies data analysis by combining in one analysis pipeline (i) feature engineering and machine learning, (ii) model training and prediction, and (iii) table population and column evaluation.

Feast 7.0k updated 2mo ago

A feature store for the management, discovery, and access of machine learning features. Feast provides a consistent view of feature data for both model training and model serving.

Trains 6.6k updated 4mo ago

Auto-Magical Experiment Manager, Version Control & DevOps for AI

Hopsworks 1.3k updated 1y ago

Open-source data-intensive machine learning platform with a feature store. Ingest and manage features for both online (MySQL Cluster) and offline (Apache Hive) access, train and serve models at scale.

MindsDB 39.1k updated 2mo ago

MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state-of-the-art ML models in as little as one line of code.

Lightwood 502 updated 5mo ago

A PyTorch-based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly, with the objective of building predictive models with one line of code.

AWS Data Wrangler 4.1k updated 2mo ago

An open-source Python package that extends the power of Pandas library to AWS connecting DataFrames and AWS data related services (Amazon Redshift, AWS Glue, Amazon Athena, Amazon EMR, etc).

CML 4.2k updated 1y ago

An open source toolkit for using continuous integration in data science projects. Automatically train and test models in production-like environments with GitHub Actions & GitLab CI, and autogenerate visual reports on pull/merge requests.

DuckDB 37.8k updated 2mo ago

An in-process SQL OLAP database management system

IJulia 2.9k updated 4mo ago

A Julia-language backend combined with the Jupyter interactive environment

Apache Airflow 45.2k updated 2mo ago

Platform to programmatically author, schedule, and monitor workflows

Prefect 21.9k updated 4mo ago

Workflow management system for modern data stacks

Kedro 10.9k updated 2mo ago

Open-source Python framework for creating reproducible, maintainable data science code

Hamilton 2.4k updated 4mo ago

Lightweight library to author and manage reliable data transformations

SHAP 25.4k updated 3mo ago

Game theoretic approach to explain the output of any machine learning model
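
The game-theoretic idea behind this kind of explanation is the Shapley value: each feature's contribution is its average marginal effect over all subsets of the other features. Below is a minimal sketch that computes exact Shapley values by brute force, assuming "absent" features are replaced by a baseline value. It does not use the SHAP library's API (which relies on faster approximations and model-specific algorithms); `shapley_values`, `toy_model`, and the sample inputs are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values by enumerating all feature subsets (exponential cost)."""
    n = len(x)

    def value(subset: frozenset) -> float:
        # Features in `subset` take their real value; the rest use the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features: List[float]) -> float:
    # Hypothetical linear model used only for illustration.
    return 2.0 * features[0] - 1.0 * features[1] + 0.5 * features[2]


x = [3.0, 1.0, 4.0]
baseline = [1.0, 1.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

For a linear model, each value reduces to the coefficient times the feature's distance from the baseline, and the values sum to the difference between the prediction at `x` and at `baseline`, which makes the toy case easy to check by hand.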

InterpretML 6.8k updated 2mo ago

InterpretML implements the Explainable Boosting Machine (EBM), a modern, fully interpretable machine learning model based on Generalized Additive Models (GAMs). This open-source package also provides visualization tools for EBMs, other glass-box models, and black-box explanations

LIME 12.1k updated 2y ago

Explaining the predictions of any machine learning classifier

flyte 7.0k updated 2mo ago

Workflow automation platform for machine learning

dbt 12.5k updated 4mo ago

Data build tool

zasper 2.3k updated 4mo ago

Supercharged IDE for Data Science

skrub 1.6k updated 4mo ago

A Python library to ease preprocessing and feature engineering for tabular machine learning

Chinese-Elite 68 updated 4mo ago

An open-source project that automatically maps relationship networks by parsing public data using LLMs and visualizes it as an interactive graph.

dna-claude-analysis 25 updated 4mo ago

Personal genome analysis toolkit with Python scripts analyzing raw DNA data across 17 categories (health risks, ancestry, pharmacogenomics, nutrition, psychology, and more) and generating a terminal-style single-page HTML visualization.

RunMat 208 updated 2mo ago

Fast MATLAB-syntax runtime with automatic CPU/GPU execution and fused array kernels.

Turbostream 16 updated 5mo ago

A terminal UI for experimenting with custom rule engines and selective LLM analysis on real-time data streams, without worrying about streaming infra or backpressure.

WFGY ProblemMap 1.7k updated 2mo ago

Open source “failure atlas” of 16 recurring issues in LLM and RAG pipelines, with observable symptoms and suggested fixes for data science teams.

DeepAnalyze 3.9k updated 4mo ago

An agentic LLM for data science that can complete a wide range of data science tasks without human intervention.

Python Data Science Handbook 47.1k updated 2y ago

Python Data Science Handbook: full text in Jupyter Notebooks

Shapley 224 updated 6mo ago

A data-driven framework to quantify the value of classifiers in a machine learning ensemble.

Towhee 3.5k updated 1y ago

A Python library that helps you encode your unstructured data into embeddings.

LineaPy 669 updated 1y ago

Ever been frustrated with cleaning up long, messy Jupyter notebooks? With LineaPy, an open source Python library, it takes as little as two lines of code to transform messy development code into production pipelines.

envd 2.2k updated 4mo ago

Machine learning development environment for data science and AI/ML engineering teams

MLEM 718 (archived)

Version and deploy your ML models following GitOps principles

cleanlab 11.4k updated 6mo ago

Python library for data-centric AI and automatically detecting various issues in ML datasets

AutoGluon 10.1k updated 4mo ago

AutoML to easily produce accurate predictions for image, text, tabular, time-series, and multi-modal data

Comet 171 updated 4mo ago

An MLOps platform with experiment tracking, model production management, a model registry, and full data lineage to support your ML workflow from training straight through to production.

Opik 19.1k updated 2mo ago

Evaluate, test, and ship LLM applications across your dev and production lifecycles.

teeplot 12 updated 3mo ago

Workflow tool to automatically organize data visualization output

Streamlit 44.4k updated 2mo ago

App framework for Machine Learning and Data Science projects

Gradio 42.1k updated 4mo ago

Create customizable UI components around machine learning models

Weights & Biases 11.0k updated 2mo ago

Experiment tracking, dataset versioning, and model management

Optuna 14.1k updated 2mo ago

Automatic hyperparameter optimization software framework

Ray Tune 41.8k updated 4mo ago

Scalable hyperparameter tuning library

TabGAN 568 updated 3mo ago

Synthetic tabular data generation using GANs, Diffusion Models, and LLMs with adversarial filtering and privacy metrics.

FileShot.io 26 updated 4mo ago

Secure zero-knowledge encrypted file sharing (AES-256-GCM in-browser). No account required, MIT licensed, self-hostable, optional link expiry.

Disco 6 updated 3mo ago

Superhuman exploratory data analysis. Finds the feature interactions and subgroup effects in tabular data that LLMs and manual exploration miss — with p-values, effect sizes, and literature citations. Free for public data.

Literature and Media

Interpretable Machine Learning: A Guide for Making Black Box Models Explainable 5.2k updated 6mo ago

Free GitHub version

Bloggers

NYC Taxi Visualization Blog 456 updated 2y ago

Colah's Blog

Blog for understanding Neural Networks!

Distill

Dedicated to clear explanations of machine learning!

Jingles 14 updated 3y ago

Review and extract key concepts from academic papers

Loic Tetrel updated 6mo ago

Data science blog

Mlu github 73 updated 9mo ago

Mlu was developed by Amazon to help people in the ML space; you can learn everything from the basics here, with live diagrams.

i am trask 150 updated 4y ago

A Machine Learning Craftsmanship Blog

Andrew Carr

Data Science with Esoteric programming languages

datascopeanalytics

Digital transformation

Sebastian's Blog

Blog for NLP and transfer learning!

Chris Albon's Website

Data Science and AI notes

floydhub

Blog for Evolutionary Algorithms

Presentations

How to Share Data with a Statistician 6.7k updated 1y ago

Books

JavaScript for Data Science

Free HTML page

Journals, Publications and Magazines

datatau.com/news

Like Hacker News, but for data

YouTube Videos & Channels

Mildlyoverfitted - Tutorials on intermediate ML/DL topics

mlops.community - Interviews of industry experts about production ML

ML Street Talk - Unabashedly technical and non-commercial, so you will hear no annoying pitches.

Neural networks by 3Blue1Brown

Neural networks from scratch by Sentdex

Manning Publications YouTube channel

Fun

Infographics

Choosing the Right Estimator 65.9k updated 2mo ago

https://scikit-learn.org/1.5/machinelearningmap.html#choosing-the-right-estimator

Datasets

AI Displacement Tracker

Structured dataset tracking 92 AI-attributed workforce reduction events affecting 453,748 workers across 12 countries and 11 sectors. JSON and CSV formats. CC-BY-4.0 licensed.

Open Data Sources 515 updated 8y ago

Covid-19 1.2k updated 4mo ago

Covid-19 Google 119 updated 5y ago

5000 Images of Clothes 115 updated 5y ago

FirstData 144 updated 4mo ago

The world's most comprehensive authoritative data source knowledge base. 210+ curated sources from governments, international organizations, and research institutions. MCP integration for AI agents. MIT licensed.

latamdata-py updated 3mo ago

Python package for one-line access to 38 open research datasets from Latin America (health, neuroscience, mental health, economics). pip install latamdata-py.

ZipCheckup 1 updated 3mo ago

Free ZIP-level environmental safety data for 42,000+ US ZIP codes: water quality, air quality, PFAS contamination, radon, lead, flood risk, and 11 more verticals. Public REST API, npm/PyPI packages, CC BY 4.0.

Fun

Public Git Archive 343 updated 6y ago

NAYN.CO Turkish News with categories 3 updated 6y ago

Other Awesome Lists

Other amazingly awesome lists can be found in the [awesome-awesomeness](https://github.com/bayandin/awesome-awesomeness) 33.3k updated 2y ago

Awesome Machine Learning 72.3k updated 3mo ago

lists 11.1k updated 4mo ago

awesome-dataviz 4.3k updated 2y ago

awesome-python 295.4k updated 2mo ago

Data Science IPython Notebooks 28.9k updated 2y ago

awesome-r 6.4k updated 10mo ago

awesome-datasets 74.6k updated 2mo ago

awesome-Machine Learning & Deep Learning Tutorials 17.7k updated 2y ago

Awesome Data Science Ideas 692 (archived)

Machine Learning for Software Engineers 28.7k updated 2y ago

Awesome Machine Learning On Source Code 6.5k updated 5y ago

Awesome Community Detection 2.4k updated 7mo ago

Awesome Graph Classification 4.8k updated 3y ago

Awesome Decision Tree Papers 2.5k updated 6mo ago

Awesome Fraud Detection Papers 1.8k updated 6mo ago

Awesome Gradient Boosting Papers 1.0k updated 6mo ago

Awesome Computer Vision Models 541 updated 5y ago

Awesome Monte Carlo Tree Search 696 updated 6mo ago

100 NLP Papers 3.9k updated 5y ago

Awesome Game Datasets 1.0k updated 4mo ago

Data Science Interviews Questions 9.8k updated 5mo ago

Awesome Explainable Graph Reasoning 2.0k updated 4y ago

Awesome Drug Synergy, Interaction and Polypharmacy Prediction 98 updated 4y ago

Data Science Projects 2.6k updated 2y ago

Awesome Data Analysis 866 updated 4mo ago

A curated list of data analysis tools, libraries and resources.

Awesome Evidence Synthesis 5 updated 3mo ago

A curated list of open-source tools for systematic reviews, meta-analysis, and evidence synthesis.

Hobby

Awesome Music Production 1.4k updated 3mo ago

Table

Chaos Genius 775 (archived)

ML powered analytics engine for outlier/anomaly detection and root cause analysis

Grid Studio 8.8k (archived)

Grid studio is a web-based spreadsheet application with full integration of the Python programming language.

Desbordante 476 updated 2mo ago

An open-source data profiler specifically focused on discovery and validation of complex patterns, such as numerical association rules, differential dependencies, denial constraints, and more.