Machine Learning Workflow
To demonstrate the end-to-end process of building a predictive model (supervised machine learning), we have created an easy-to-follow workflow. The first step is to design the problem; next, we source and prepare the data, then code the model for training and evaluation, and, finally, deploy the model. Within the scope of this chapter, we will keep the explanation of models to a bare minimum, as they are covered in detail in Chapters 4 and 5.
The following figure describes the workflow for building a predictive model, from preparing the data to deploying the model:
Figure 3.5: Machine learning workflow.
Design the Problem
Once we identify the domain of work, we brainstorm the design of the problem. The idea is to first frame the problem as either a regression or a classification problem. Once that is done, we choose the right target variable and identify the features. The target variable is important because it determines how training takes place: a supervised learning algorithm keeps the target variable at the center while it tries to find patterns in the given set of features.
Source and Prepare Data
Data gathering and preparation is a painstaking job, especially when the data sources are diverse and numerous. Each data source presents different challenges, so the time taken to process it varies. Data sources with tabular data are the easiest to process, provided they do not contain a lot of garbage, whereas textual data is the hardest to clean because of its free-flowing nature.
Code the Model
Once the data is prepared and ready, we take up the task of choosing the right model. Most often, experts first build a baseline model to gauge the predictive power of the input features against the target variable. Then, one can either directly try state-of-the-art algorithms or proceed by trial and error, trying all the plausible models. One must understand that there is no universally right or wrong model; everything depends on the data. In code, the data is randomly divided into training and testing sets. The model is trained on the training set and evaluated on the testing set. This ensures that the model does not underperform when it is deployed in the real world.
Train and Evaluate
Model evaluation is the part of the workflow where the model's usability in practice is decided. Based on a given set of evaluation metrics, we choose the best model after all the trial and error. In each iteration, metrics such as the R-squared value, accuracy, precision, and F-score are computed. Usually, the entire dataset is divided into training and testing data (a third split for a validation set is also often included). The model is trained on the training data and tested on the testing data. This separation ensures that the model is not simply memorizing the training examples; in more technical terms, it is not overfitting (more on this in the Evaluation Metrics section of this chapter). At this stage of the workflow, one could decide to go back, include more variables, and retrain and re-evaluate the model. The process is repeated until the accuracy (or the other metrics of importance) reaches a plateau.
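To make the metrics above concrete, here is a minimal, self-contained sketch of how accuracy, precision, recall, and the F-score can be computed in base R from a confusion matrix. The `actual` and `predicted` vectors are hypothetical stand-ins for a model's test-set labels (1 = positive class, 0 = negative class):

```r
# Hypothetical test-set labels: 1 = positive class, 0 = negative class
actual    <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0)

# Confusion-matrix counts
tp <- sum(predicted == 1 & actual == 1)  # true positives
fp <- sum(predicted == 1 & actual == 0)  # false positives
fn <- sum(predicted == 0 & actual == 1)  # false negatives
tn <- sum(predicted == 0 & actual == 0)  # true negatives

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_score   <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision,
        recall = recall, f_score = f_score))
```

With a real model, `predicted` would come from `predict()` on the testing data; the formulas themselves are standard and do not change.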
We use a random number generator function such as sample() in R to split the data randomly into parts, as done in step 2 of the next exercise.
Exercise 41: Creating a Randomly Generated Train-and-Test Dataset from the Beijing PM2.5 Dataset
In this exercise, we will create a randomly generated train-and-test dataset from the Beijing PM2.5 dataset. We will reuse the PM25 object created in the earlier exercise.
Perform the following steps:
- Create a num_index variable and set it equal to the number of observations in the Beijing PM2.5 dataset:
num_index <- nrow(PM25)
- Using the sample() function, randomly select 70% of the values in 1:num_index and store them in train_index (floor() guards against a non-integer sample size):
train_index <- sample(1:num_index, floor(0.7 * num_index))
- Use train_index to select a random subset of rows from the Beijing PM2.5 dataset and store them in a DataFrame named PM25_Train:
PM25_Train <- PM25[train_index,]
- Store the remaining observation into a DataFrame named PM25_Test:
PM25_Test <- PM25[-train_index,]
The exercise shows a simple example of creating the train-and-test sets. Randomly selecting the training and testing sets helps ensure that the model is not biased toward any particular portion of the data and learns from all the possible examples before being used on unseen data in the real world.
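The split created above feeds directly into training and evaluation. The following self-contained sketch walks through the whole loop on a small synthetic data frame that stands in for the PM25 object (with the real data, you would use PM25_Train and PM25_Test and the dataset's own column names, such as pm2.5, TEMP, and DEWP):

```r
set.seed(42)                                 # reproducible split

# Synthetic stand-in for the PM25 data frame
n  <- 200
df <- data.frame(TEMP = rnorm(n), DEWP = rnorm(n))
df$target <- 3 * df$TEMP - 2 * df$DEWP + rnorm(n, sd = 0.5)

# 70/30 train-and-test split, as in the exercise
train_index <- sample(1:n, floor(0.7 * n))
train <- df[train_index, ]
test  <- df[-train_index, ]

# Baseline model: ordinary linear regression on the training set
model <- lm(target ~ TEMP + DEWP, data = train)

# Evaluate on the held-out rows the model never saw during training
pred <- predict(model, newdata = test)

# R-squared on the test set: 1 - SSE/SST
sse <- sum((test$target - pred)^2)
sst <- sum((test$target - mean(test$target))^2)
r_squared <- 1 - sse / sst
print(r_squared)
```

Because the test rows were held out of training, the R-squared computed here reflects how the model generalizes rather than how well it memorized the training data.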
Deploy the Model
Once the best model is selected, the next step is to make the model's output available to business applications. A common approach is to host the model as a REpresentational State Transfer (REST) API. Such an API exposes a web application as an endpoint that listens for requests, calls the model, and usually returns a JSON object as the response.
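As an illustration, here is a minimal sketch of such an endpoint using the plumber package for R. This is one possible stack, not necessarily the one used later in the book; the file name `model.rds` and the predictor names `TEMP` and `DEWP` are hypothetical, standing in for a model saved earlier with saveRDS():

```r
# plumber.R -- a minimal REST endpoint around a saved model
library(plumber)

model <- readRDS("model.rds")   # hypothetical fitted model saved earlier

#* Predict the target from query parameters
#* @param TEMP:numeric temperature
#* @param DEWP:numeric dew point
#* @get /predict
function(TEMP, DEWP) {
  newdata <- data.frame(TEMP = as.numeric(TEMP),
                        DEWP = as.numeric(DEWP))
  # The returned list is serialized to JSON automatically by plumber
  list(prediction = predict(model, newdata = newdata))
}
```

The API can then be started with `plumber::plumb("plumber.R")$run(port = 8000)`, after which a request such as `GET /predict?TEMP=10&DEWP=2` returns the model's prediction as JSON.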
Deployment is becoming an essential part of machine learning projects in the industry. A model that cannot be deployed is of little use to a company and perhaps merely serves the purpose of R&D. An increasing number of professionals specialize in model deployment, which can be a tedious and complicated process. To give model deployment its due importance, we have dedicated a chapter to it: Chapter 8, Model Deployment.