Most Machine Learning projects have a very similar structure. Following this structure will help you deal with the majority of ML challenges, and give you an overview of the things you need to be able to do to be a successful ML developer. Master each task separately, and then learn to put them together by applying this template to practical projects.
Here are the main steps of a Machine Learning project:
- Define the problem - look at the big picture.
- Get the data.
- Explore and analyze the data.
- Prepare data.
- Explore the possible models and short-list the best ones. Train and evaluate them to pick the best one.
- Fine-tune the selected model to improve the results.
- Present your results.
- Launch your model into production, monitor and maintain your system.
Some tasks can be combined, broken down into subtasks, or tweaked to suit your needs.
For your convenience, here's a to-do list you can copy-paste into your empty Jupyter notebook to help you get started:

# Load the libraries
# Load the dataset
# Set aside the validation dataset
# Analyze data with descriptive statistics
# Visualize data
# Clean data
# Select features
# Transform/standardize features
# Train, evaluate, and compare several models
# Fine-tune your model with grid search or ensemble methods
# Test your model on the validation dataset
# Train your model using the entire dataset
# Save the model, its parameters, etc.
Now let's go through each of the steps in more detail.
Define your objective in business terms. How will your solution be used, and what value will it provide? What are the current solutions or workarounds? How will you measure your results? What performance is necessary to reach your business objective? Have you encountered similar problems before, and can you reuse existing tools/libraries? Can you consult a domain expert? How would you solve this problem without machine learning?
Understand the kind of data you need, and decide how you're going to get it. Sometimes it's as easy as downloading a dataset; sometimes you need to set up a process for collecting it. Get the data and convert it into a convenient format. Set aside the test dataset right away.
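In a notebook, this step might look like the following sketch. The tiny synthetic DataFrame and its column names stand in for your real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny synthetic dataset; replace with your real data loading code
# (e.g. reading a CSV converted into a convenient format).
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [x * 2 for x in range(10)],
    "target":    [0, 1] * 5,
})
X = df.drop(columns=["target"])
y = df["target"]

# Set aside the test set right away; fix random_state for reproducibility,
# and stratify on the label to preserve the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 8 2
```

Setting the test set aside before any exploration helps you avoid accidentally fitting your decisions to it.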
Use descriptive statistics and data visualization to analyze and understand your dataset. Ask questions and form an understanding of your problem. Understand what each of the features means, what type it is (categorical, numerical, text, etc.), what values are missing, how noisy it is, how useful it is, and how it is distributed (normal, uniform, logarithmic, etc.).
Study the correlations between attributes using data visualization.
Identify data transformations/cleaning you might need to apply.
Think about if there's some extra data that would be useful that you can get access to.
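The exploration steps above can be sketched with pandas; the synthetic housing-style columns here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: "price" depends strongly on "area" by construction.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "rooms": rng.integers(1, 6, size=100),
    "area":  rng.normal(80, 20, size=100),
})
df["price"] = df["area"] * 1000 + rng.normal(0, 5000, size=100)

print(df.describe())    # count, mean, std, min, quartiles, max per feature
print(df.dtypes)        # feature types (numerical, categorical, etc.)
print(df.isna().sum())  # missing values per column
print(df.corr())        # pairwise correlations between attributes
```

For visualization, `df.hist()` and `pandas.plotting.scatter_matrix(df)` are quick ways to see distributions and pairwise relationships.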
Prepare the data to best expose the structure of the problem and relationships between input and output variables.
- Clean the data: remove duplicates, remove outliers, and fill in missing values.
- Do feature engineering to develop new features, and feature selection to remove the redundant ones.
- Transform features to standardize or normalize them.
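One common way to make these transformations reusable is scikit-learn's Pipeline and ColumnTransformer; the column names below are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["area", "rooms"]   # hypothetical numerical columns
categorical = ["city"]        # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill in missing values
        ("scale", StandardScaler()),                   # standardize features
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

df = pd.DataFrame({
    "area": [50.0, np.nan, 120.0],
    "rooms": [2, 3, 5],
    "city": ["a", "b", "a"],
})
X = preprocess.fit_transform(df)  # the same object can transform new data later
print(X.shape)  # (3, 4)
```

Because the whole preparation step is one fitted object, you can apply it unchanged to new data or reuse it in later projects.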
Make sure to write functions for all of your data transformations so you can easily apply them to new data or reuse them in future projects.
Train models from different categories (linear, naive Bayes, SVM, Random Forests, ANNs, etc.) with default parameters. Evaluate them and compare the results.
Make sure to use cross-validation, compute the mean and standard deviation of the performance measure.
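A sketch of this comparison, using the Iris dataset as a placeholder for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Models from different categories, all with (near-)default parameters.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(random_state=42),
}
for name, model in models.items():
    # 5-fold cross-validation; report mean and standard deviation.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")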
Analyze the most significant variables in each algorithm. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
Short-list the promising models, preferably ones that make different types of errors.
Use tools like grid search to find the optimal combination of hyperparameters. Remember that if you have written your data transformations as functions, you can search through their parameters as well (whether or not to scale the data or drop certain features, for example). Try using ensemble methods to combine predictions from multiple models.
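With scikit-learn, wrapping the transformations in a Pipeline lets GridSearchCV search over a preprocessing choice and model hyperparameters together; the grid below is a small illustrative example:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", SVC()),
])
param_grid = {
    "scale": [StandardScaler(), "passthrough"],  # search: scale data or not
    "model__C": [0.1, 1, 10],                    # model hyperparameter
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Note how a preprocessing step itself ("scale") is part of the search space, exactly because the transformation lives inside the pipeline.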
Once you've done as well as you can, measure your model's performance on the test set. Then create a standalone model using the optimal parameters you have found, and save it to a file for later use.
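For scikit-learn estimators, joblib is a common choice for saving the final model; the filename below is arbitrary:

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Standalone model trained on all available data with the chosen parameters.
final_model = LogisticRegression(max_iter=1000).fit(X, y)

joblib.dump(final_model, "final_model.joblib")   # save for later use
restored = joblib.load("final_model.joblib")     # reload in production code
print((restored.predict(X) == final_model.predict(X)).all())  # True
```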
Document what you have done and create a nice presentation. Explain how your solution achieves the business goal. Present things you've learned along the way, and list your assumptions and your system's limitations.
Plug your model into production data. Write unit tests. Write monitoring code to regularly check your model's performance and send you an alert when it drops. Monitor your input data as well (in case something goes wrong before it reaches your model).
Automate retraining your model regularly on new data.
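A minimal sketch of such a monitoring check, assuming you periodically score the model on recently labeled data; the threshold and function name are placeholders:

```python
def check_performance(live_score: float, baseline_score: float,
                      max_drop: float = 0.05) -> bool:
    """Return True if the model's live score is still acceptably close
    to the baseline measured at launch; False should trigger an alert
    (and possibly retraining)."""
    return live_score >= baseline_score - max_drop

assert check_performance(0.91, 0.93)       # small dip: still OK
assert not check_performance(0.80, 0.93)   # large drop: alert / retrain
```

In a real system this would run on a schedule and hook into your alerting, but the core check is this simple comparison.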
The first time, go through all the steps as quickly as possible, to make sure you have all the pieces in place and a baseline to improve upon.
Then iterate, cycling between the steps to improve performance.
Make sure to try each step: even if you aren't sure it's useful now, you can build on top of it later.
Remember that your goal is to maximize your chosen performance metric. Treat changes as experiments, and measure how they impact that metric.
Tweak and adapt this template to better fit your workflow and the specific project you're working on.