# Machine Learning Classification in Python – Part 2: Model Implementation and Performance Determination

This is the second part of our series **Automated Classification in Python** where we use several methods to classify the UCI Machine Learning record “Adult”. In this article, after generating data from Part 1, we will discuss in more detail how we implement the different models and then compare their performance.

**Classification**

We use different scikit-learn models for the classification. The scikit-learn library is open-source and provides a variety of tools for data analysis. Our goal is to compare the performance of the following models: Logistic Regression, Decision Trees, Gradient Boosted Trees, k-Nearest Neighbors, Naive Bayes, and a neural net. Below we briefly summarize the methods used.

In general, the models use the training record to create a function that describes the relationship between features and targets. After training, we can apply this function to data outside the training dataset to predict its targets. More about how these feature works are summarized in our blog. Below we show our implementation of a model using the following code example:

def do_decision_tree(self): """ Return a decision tree classifier object. """ clf = tree.DecisionTreeClassifier() return clf.fit(self.x_train, self.y_train)

We create a scikit-learn Decision Tree Classifier object and then train it on our training dataset.

With the scikit-learn method “predict,” it is possible to determine a model for single data or an entire data set. Based on the results obtained, we determine the performance of the model.

**Classification Models**

In this section, we briefly summarize the operation of each procedure. For a more detailed explanation of the model, we have linked the appropriate article from our blog and the scikit-learn documentation.

**Lo****gistic Regression**

Logistic regression transforms a linear regression into a sigmoid function. Thus, logistic regression always outputs values between 0 and 1. In general, coefficients of the individual independent features are adjusted so that the separate objects can be assigned to their targets as accurately as possible.

**Decision Tree**

Decision tree models are simple decision trees in which each node examines an attribute, after which the next node retrieves an attribute. This happens until a “leaf” is reached that indicates the respective classification case. For more information on Decision Trees, see our article on classification.

**Gradient Boosted Trees**

Gradient Boosted Trees are models that consist of several individual decision trees. The goal is to design a complex representation out of many simple models. The loss function is optimized for a differentiable loss function by the coefficients of the individual trees so that the loss function is as small as possible.

**k-nearest Neighbors**

In the k-nearest Neighbors model, the individual objects are viewed in a vector space and classified in relation to the k closest neighbors. A detailed explanation of this model can be found in our article on classification.

**Naive Bayes**

The Naive Bayes model considers objects as vectors whose entries each correspond to a feature. The probability of each entry belonging to the target is determined and summed up. The object is assigned to the target whose summed probability is highest. Further information on the Naive Bayes model can also be found in our article on classification.

**Neural network**

Neural networks consist of one or more layers of individual neurons. When training a neural network, the weighted connections are adapted in such a way that a loss function is minimized. We have summarized more information on the structure and function of neural networks in our article, “What are artificial neural networks?”

**Performance Determination**

To determine the performance of our models, we wrote a function, “get_classification_report”. This function gives us the metrics; precision, recall, F-Score, and the area under the ROC curve, as well as the required key figures, which we describe in detail in our article, by calling the scikit-learn function, “metrics.classification_report” and “metrics.confusion_matrix”.

def get_classification_report(self, clf, x, y, label=""): """ Return a ClassificationReport object. Arguments: clf: The classifier to be reported. x: The feature values. y: The target values to validate the predictions. label: (optionally) sets a label.""" roc_auc = roc_auc_score(y, clf.predict_proba(x)[:,1]) fpr, tpr, thresholds = roc_curve(y, clf.predict_proba(x)[:,1]) return ClassificationReport(metrics.confusion_matrix(y, clf.predict(x)), metrics.classification_report(y, clf.predict(x)), roc_auc, fpr, tpr, thresholds , label)

Only the receiver operating characteristic is obtained in the form of a graph that is generated with the help of the Python library matplotlib and the scikit-learn functions “roc_auc_score” and “roc_curve”.

**Evaluation**

Finally, we discuss how the models have performed on our data.

Klassifikator | Precision | Recall | F-Score | Area under Curve | Trainingszeit (Sekunden) |

Logistic Regression | 83% | 84% | 83% | 87% | 0,8 |

Decision Tree | 82% | 82% | 82% | 77% | 0,5 |

Gradient Boosted Trees | 88% | 88% | 87% | 93% | 142,9 (2 Minuten) |

k-nearest Neighbors | 84% | 84% | 84% | 88% | 2,1 |

Naive Bayes | 84% | 83% | 83% | 89% | 0,1 |

Neuronal Networks | 83% | 83% | 83% | 89% | 1.746 (29 Minuten) |

A direct comparison shows that neural networks need very long training times in order to achieve adequate results. With our settings, the training lasted about 29 minutes and a precision of 83% was achieved. In contrast, the Naive Bayes classifier achieved the same accuracy in just one-tenth of a second.

The best result was achieved with the Gradient Boosted Trees. This classifier managed to achieve a precision of 88% in 143 seconds. In addition to the relatively poor result of the neuronal net for the required training duration, it must be said that our model was most likely not chosen optimally.

In addition, the training of the neural net and the gradient boosted trees weren’t done on GPUs, which would have reduced the training duration.

Photo by **Christina Morillo **from **Pexels**

# This Might Also Interest You

### Machine Learning Clustering in Python

In this article, we show different methods for clustering in Python. Clustering is the combination of different objects in groups of similar objects. For example, the segmentation of different groups of buyers in retail.

### Machine Learning Performance Indicators

In this article, we present a set of metrics that can be used to compare and evaluate different methods and trained models for classification problems. In our article, Machine Learning Classification in Python - Part 1: Data Profiling and Preprocessing, we introduce different classification methods whose performance can be compared with the metrics described here.

### Machine Learning Classification in Python – Part 1: Data Profiling and Preprocessing

This is the first part of the series, Automated Classification in Python, in which we demonstrate how to classify a given data set using machine learning classification techniques. In the following article, we show the analysis and processing of the freely available "Adult" data set for classification. We have also published our script together with the data sets and documentation on GitHub. The dataset comes from the Machine Learning Repository of the University of California Irvine. This currently contains 473 datasets (last accessed: May 10, 2019) that are available for machine learning applications. The "Adult" data set is based on US census data. The goal is to use the given data to determine whether a person earns more or less than $ 50,000 a year.

### Model Validation and Overfitting

The model validation procedure describes the method of checking the performance of a statistical or data-analytical model.A common method for validating neural networks is k-fold cross-validation. In doing so one divides training data set into k subsets. One of the subsets represents the test set.

The remaining subsets then serve as the training set. The training set is used to teach the model. By the ratio of the correct results on the test set, it is possible to determine the degree of generalization of the model. The test set is then swapped with a training set and the performance is determined again until each set has finally functioned as a test set. At the end of the process, the average degree of generalization is calculated to estimate the performance of the model. The advantage of this method is that you get a relatively variant-free performance estimate. The reason for this is to prevent important structures in the training data from being excluded.

### Clustering with Machine Learning

The clustering problem is the grouping of objects or data into clusters. In contrast to classification, they are not predetermined but result from the similarity of the different objects considered. Similarity means the spatial distance of the objects, represented as vectors. There are different methods to determine this distance, e.g. the Euclidean or Minkowski metrics. Already when determining the similarity one notices that there is no clear cluster determination.

### Machine Learning Method for Data Classification

Classification methods divide objects into predefined categories according to their characteristics with the help from classifiers. A classifier is a function that maps an input to a class and its aim is to find suitable rules to which the data can be assigned to the respective class. Normally this is done in machine learning by using a supervised learning approach. In the following article, we briefly introduce the most common machine learning methods for solving classification problems.

### What are Artificial Neural Networks?

An artificial neural network is a mathematical model based on biological neural networks. Within the network, groups of neurons are combined in multiple layers and then connected to each other. A neural network always has at least one input layer and one output layer. Theoretically, there could be any number of layers or even hidden layers in between; this is called deep learning. The more hidden layers a network has means a higher degree of complexity and depth in addition to higher computing power.

### Where Does the Machine Learning and AI Megatrend Come From?

The Ideas of Machine Learning (ML), neural networks, and Artificial Intelligence (AI) are often a hot topic and appear to be discussed everywhere. In this article, we briefly summarize the development of the last ten years and explain why this trend will be applied more and more in all economic sectors.

### What is Machine Learning?

Machine Learning enables computers to learn knowledge from data without explicitly programming it. This knowledge is a function that assigns a suitable output to an input. An algorithm adapts the function until it achieves the desired results. One speaks of the training of the network. In recent years, a number of training procedures have been established that only lead to good results with the availability of very large data sets. The phrase "data is the new oil" often refers to the fact that companies with richer data sets can train more powerful models.