上QQ阅读APP看书，第一时间看更新

Supervised learning

Supervised learning is a type of learning where the model is fed with enough information and knowledge and closely supervised to learn, so that, based on the learning it has done, it can predict the outcome for a new dataset.

Here, the model is trained in supervision mode, similar to supervision by teachers, where we feed the model with enough training data containing the input/predictors and train it and show the correct answers or output. So, based on this, it learns and will become capable of predicting the output for unseen data that may come in the future.

A classic example of this would be the standard Iris dataset. The Iris dataset consists of three species of iris and for each species, the sepal length, sepal width, petal length, and petal width is given. And for a specific pattern of the four parameters, the label is provided as to what species such a set should belong to. With this learning in place, the model will be able to predict the label—in this case, the iris species, based on the feature set—in this case, the four parameters.

Supervised learning algorithms try to model relationships and dependencies between the target prediction output and the input features such that we can predict the output values for new data based on those relationships which it learned from the previous datasets.

The following diagram will give you an idea of what supervised learning is. The data with labels is given as input to build the model through supervised learning algorithms. This is the training phase. Then the model is used to predict the class label for any input data without the label. This is the testing phase:

Again, in supervised learning algorithms, the predicted output could be a discrete/categorical value or it could be a continuous value based on the type of scenario considered and the dataset taken into consideration. If the output predicted is a discrete/categorical value, such algorithms fall under the classification algorithms, and if the output predicted is a continuous value, such algorithms fall under the regression algorithms.

If there is a set of emails and you want to learn from them and be able to tell which emails belong to the spam category and which emails belong to the non-spam category, then the algorithm to be used for this purpose will be a supervised learning algorithm belonging to the classification type. Here, you need to feed the model with a set of emails and feed enough knowledge to the model about the attributes, based on which it would segregate the email to either the spam category or the non-spam category. So the predicted output would be a categorical value, that is, spam or non-spam.

Let's take the use case where based on a given set of parameters, we need to predict what would be the price of a house in a given area. This cannot be a categorical value. It is going to be a range or a continuous value and also be subject to change on a regular basis. In this problem, the model also needs to be provided with sufficient knowledge, based on which it is going to predict the pricing value. This type of algorithm belongs to the supervised learning regression category of algorithms.

There are various algorithms belonging to the supervised category of the machine learning family:

K-nearest neighbors
Naive Bayes
Decision trees
Linear regression
Logistic regression
Support vector machines
Random forest