Hi there! You might find this guide helpful if:
For some great alternatives, jump to the end or check out Nam Vu's guide, Machine Learning for Software Engineers.
Of course, there is no easy path to expertise. Also, I'm not an expert! I just want to connect you with some great resources from experts. Applications of ML are all around us. I think it's in the public interest for more people to learn more about ML, especially hands-on, because there are many different ways to learn.
Whatever motivates you to dive into machine learning, if you know a bit of Python, these days you can get hands-on with a machine learning "Hello World!" in minutes.
You can install Python 3 and all of these packages in a few clicks with the Anaconda Python distribution. Anaconda is popular in Data Science and Machine Learning communities. (Use whichever tool you want.)
Some options you can use from your browser:
For other options, see:
Now, follow along with this brief exercise: An introduction to machine learning with scikit-learn. Do it in IPython or a Jupyter Notebook, coding along and executing each step yourself.
You just classified some hand-written digits using scikit-learn. Neat, huh?
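If you'd like a compact recap of that kind of exercise, here is a minimal sketch. It is not the tutorial's exact code: the train/test split, the `classify_digits` helper name, and the `gamma` and `C` values are my own illustrative assumptions. It only uses scikit-learn's bundled digits dataset and a support vector classifier.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split


def classify_digits(random_state=0):
    """Train a classifier on the bundled digits data; return held-out accuracy.

    Hypothetical helper for illustration only.
    """
    # 8x8 grayscale images of hand-written digits, flattened to 64 features.
    digits = datasets.load_digits()

    # Hold out a quarter of the samples so we evaluate on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=random_state
    )

    # Parameter values are illustrative assumptions, not tuned choices.
    clf = svm.SVC(gamma=0.001, C=100.0)
    clf.fit(X_train, y_train)

    # Fraction of held-out digits classified correctly.
    return clf.score(X_test, y_test)


if __name__ == "__main__":
    print(f"Held-out accuracy: {classify_digits():.3f}")
```

Evaluating on held-out data, rather than on the data you trained on, gives a more honest picture of how the model handles digits it has never seen.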
Let's learn a bit more about Machine Learning, and a couple of common ideas and concerns. Read "A Visual Introduction to Machine Learning, Part 1" by Stephanie Yee and Tony Chu.
It won't take long. It's a beautiful introduction ... Try not to drool too much!
OK. Let's dive deeper.
Read "A Few Useful Things to Know about Machine Learning" by Prof. Pedro Domingos. It's densely packed with valuable information, but not opaque.
Take a little time with this one. Take notes. Don't worry if you don't understand it all yet.
The whole paper is packed with value, but I want to call out two points:
When you work on a real Machine Learning problem, you should focus your efforts on your domain knowledge and data before optimizing your choice of algorithms. Prefer to do simple things until you have to increase complexity. You should not rush into neural networks because you think they're cool. To improve your model, get more data. Then use your knowledge of the problem to explore and process the data. You should only optimize the choice of algorithms after you have gathered enough data, and you've processed it well.
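One way to "do simple things first" is to measure a trivial baseline before reaching for anything fancy. Here's a minimal sketch in plain Python. The function name and the toy spam/ham labels are made up for illustration, and this is just one possible baseline, not a method taken from the paper.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    # Find the label that appears most often in the training data.
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    # Count how many test labels that single constant guess gets right.
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


if __name__ == "__main__":
    # Toy, made-up labels purely for illustration.
    train = ["ham"] * 8 + ["spam"] * 2
    test = ["ham", "ham", "spam", "ham"]
    print(majority_baseline_accuracy(train, test))
```

If a more complex model barely beats this number, that's a hint to spend your effort on the data and your understanding of the problem before tuning algorithms.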
First, download an interview with Prof. Domingos on the _Data Skeptic_ podcast (2018). Prof. Domingos wrote the paper we read earlier. You might also start reading his book, The Master Algorithm, a clear and accessible overview of machine learning. (It's available as an audiobook too.)
Next, subscribe to more machine learning and data science podcasts! These are great, low-effort resources that you can casually learn more from. To learn effectively, listen over time, with plenty of headspace. By the way, don't speed up technical podcasts; that can hinder your comprehension.
Subscribe to Talking Machines.
I suggest this listening order:
OK! Take a break, come back refreshed.
Next, play along with one or more notebooks.
Find more great Jupyter Notebooks when you're ready:
Pick one of the courses below and start on your way.
Also, it's recommended to grab a textbook to use as an in-depth reference. The two I saw recommended most often were Understanding Machine Learning and Elements of Statistical Learning. You only need to use one of the two options as your main reference; here's some context/comparison to help you pick which one is right for you. You can download each book free as PDFs at those links - so grab them!
It's hard to make time available every week. So, you can try to study more effectively within the time you have available. Here are some ways to do that:
I am not a machine learning expert. I'm just a software developer and these resources/tips were useful to me as I learned some ML on the side.
microsoft/Data-Science-For-Beginners (added in 2021): "10-week, 20-lesson curriculum all about Data Science. Each lesson includes pre-lesson and post-lesson quizzes, written instructions to complete the lesson, a solution, and an assignment. Our project-based pedagogy allows you to learn while building, a proven way for new skills to 'stick'."
Start with the support forums and chats related to the course(s) you're taking.
Check out datascience.stackexchange.com and stats.stackexchange.com – such as the tag, machine-learning. There are some subreddits, like /r/LearningMachineLearning and /r/MachineLearning.
Don't forget about meetups. Also, nowadays there are many active and helpful online communities around the ML ecosystem. Look for chat invitations on project pages and so on.
You'll want to get more familiar with Pandas.
dask: A Pandas-like interface, but for larger-than-memory data and "under the hood" parallelism.
Some good cheat sheets I've come across. (Please submit a Pull Request to add other useful cheat sheets.)
wzchen/probability-cheatsheet: "This cheatsheet is a 10-page reference in probability that covers a semester's worth of introductory probability. The cheatsheet is based off of Harvard's introductory probability course, Stat 110. It is co-authored by former Stat 110 Teaching Fellow William Chen and Stat 110 Professor Joe Blitzstein."
"Machine learning systems automatically learn programs from data." Pedro Domingos, in "A Few Useful Things to Know about Machine Learning." The programs you generate will require maintenance. Like any way of creating programs faster, you can rack up technical debt.
Here is the abstract of Machine Learning: The High-Interest Credit Card of Technical Debt:
Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.
If you're following this guide, you should read that paper. You can also listen to a podcast episode interviewing one of the authors of this paper.
That's not a comprehensive list, only a collection of starting-points to learn more.
What are some ways to practice?
You need practice. On Hacker News, user olympus commented to say you could use competitions to practice and evaluate yourself. Kaggle and ChaLearn are hubs for Machine Learning competitions. (You can find more competitions here or here.)
You also need understanding. You should review what Kaggle competition winners say about their solutions, for example, the "No Free Hunch" blog. These might be over your head at first but once you're starting to understand and appreciate these, you know you're getting somewhere.
Competitions and challenges are just one way to practice! Machine Learning isn't just about Kaggle competitions.
Here's a complementary way to practice: do practice studies.
How can you come up with interesting questions? Here's one way. Pick a day each week to look for public datasets and write down some questions that come to mind. Also, sign up for Data is Plural, a newsletter of interesting datasets. When a question inspires you, try exploring it with the skills you're learning.
I think the best advice is to tell people to always present their methods clearly and to avoid over-interpreting their results. Part of being an expert is knowing that there's rarely a clear answer, especially when you're working with real data.
As you repeat this process, your practice studies will become more scientific, interesting, and focused. Also, here's a video about the scientific method in data science.
ossu/data-science has a Discord server and newsletter.
OpenReview.net "aims to promote openness in scientific communication, particularly the peer review process."
- Open Peer Review: We provide a configurable platform for peer review that generalizes over many subtle gradations of openness, allowing conference organizers, journals, and other "reviewing entities" to configure the specific policy of their choice. We intend to act as a testbed for different policies, to help scientific communities experiment with open scholarship while addressing legitimate concerns regarding confidentiality, attribution, and bias.
- Open Publishing: Track submissions, coordinate the efforts of editors, reviewers and authors, and host… Sharded and distributed for speed and reliability.
- Open Access: Free access to papers for all, free paper submissions. No fees.
- Open Discussion: Hosting of accepted papers, with their reviews, comments. Continued discussion forum associated with the paper post acceptance. Publication venue chairs/editors can control structure of review/comment forms, read/write access, and its timing.
- Open Directory: Collection of people, with conflict-of-interest information, including institutions and relations, such as co-authors, co-PIs, co-workers, advisors/advisees, and family connections.
- Open Recommendations: Models of scientific topics and expertise. Directory of people includes scientific expertise. Reviewer-paper matching for conferences with thousands of submissions, incorporating expertise, bidding, constraints, and reviewer balancing of various sorts. Paper recommendation to users.
- Open API: We provide a simple REST API [...]
- Open Source: We are committed to open source. Many parts of OpenReview are already in the OpenReview organization on GitHub. Some further releases are pending a professional security review of the codebase.
OpenReview.net is created by Andrew McCallum’s Information Extraction and Synthesis Laboratory in the College of Information and Computer Sciences at University of Massachusetts Amherst.
OpenReview.net is built over an earlier version described in the paper Open Scholarship and Peer Review: a Time for Experimentation published in the ICML 2013 Peer Review Workshop.
OpenReview is a long-term project to advance science through improved peer review, with legal nonprofit status through Code for Science & Society. We gratefully acknowledge the support of the great diversity of OpenReview Sponsors––scientific peer review is sacrosanct, and should not be owned by any one sponsor.
If you are learning about MLOps but find it overwhelming, these resources might help you get your bearings:
Recommended awesomelists to save/star/watch:
Take note: some experts warn us not to get too far ahead of ourselves, and encourage learning ML fundamentals before moving on to deep learning. That's paraphrasing from some of the linked coursework in this guide. For example, Prof. Andrew Ng encourages building foundations in ML before studying DL. Perhaps you're ready for that now, or perhaps you'd like to get started soon and learn some DL in parallel with your other ML learning.
When you're ready to dive into Deep Learning, here are some helpful resources.
explosion/thinc is an interesting library that wraps PyTorch, TensorFlow and MXNet models.
fastai/fastbook by Jeremy Howard and Sylvain Gugger: "an introduction to deep learning, fastai and PyTorch."
labmlai/annotated_deep_learning_paper_implementations: "Implementations/tutorials of deep learning papers with side-by-side notes." 50+ of them! Really nicely annotated and explained.
cog, "containers for machine learning." It's an open-source tool for putting models into reproducible Docker containers.
Machine Learning can be powerful, but it is not magic.
Whenever you apply Machine Learning to solve a problem, you are going to be working in some specific problem domain. To get good results, you or your team will need "substantive expertise" (to re-use a phrase from earlier), which is related to "domain knowledge." Learn what you can, for yourself... But you should also collaborate with experts. You'll have better results if you collaborate with subject-matter experts and domain experts.
I couldn't say it better:
Machine learning won’t figure out what problems to solve. If you aren’t aligned with a human need, you’re just going to build a very powerful system to address a very small—or perhaps nonexistent—problem.
That quote is from "The UX of AI" by Josh Lovejoy. In other words, You Are Not The User. Suggested reading: Martin Zinkevich's "Rules of ML Engineering", Rule #23: "You are not a typical end user"
See also: the MLOps section!
If you are working with data-intensive applications at all, I'll recommend this book:
Here are some additional Data Science resources:
... Bayesian ideas have had a big impact in machine learning in the past 20 years or so because of the flexibility they provide in building structured models of real world phenomena. Algorithmic advances and increasing computational resources have made it possible to fit rich, highly structured models which were previously considered intractable.
This is just a small
These next two links are not related to ML. But since you're here, I have a hunch you might find them interesting too:
Here are some other guides to learning Machine Learning. They can be alternatives or supplements to this guide.