In data science, we often need to solve classification problems, and machine learning offers many algorithms for the task. In this article, we will explore eight of the most popular machine learning algorithms for classification.
What is a Classification Problem?
If you are new to data science, you should first understand the types of machine learning algorithms before you jump into classification. There are mainly two types of algorithms in machine learning:
- Supervised algorithm
- Unsupervised algorithm
Classification comes under supervised learning, where we know what to predict. Supervised learning problems can in turn be divided into two kinds:
- Regression problem
- Classification problem
In classification, the goal is to assign data points to predefined categories or classes (such as “Yes” or “No”). If your output variable is discrete, you should use a classification algorithm. Examples include classifying emails as spam or not spam, identifying whether an image contains a cat or a dog, and predicting whether a patient has a disease.
On the other hand, if your output variable contains continuous numerical values, you need to use a regression algorithm. This type of algorithm only predicts continuous variables. For instance, predicting house prices based on features like area in square feet, number of bedrooms, and location.
If you are new to the data science field and want to build a career in it, you should first get your basic concepts clear. For that, I highly suggest this course: Machine Learning Specialization – Andrew Ng.
Popular Classification Algorithms
Now let me list eight of the best and most popular machine learning algorithms for classification problems. If you learn these algorithms, you will feel confident tackling most classification tasks.
1. Logistic Regression
The very first classification algorithm you should learn is logistic regression. It is one of the simplest and most widely used machine learning algorithms for classification.
Logistic regression is a linear model that assumes that the probability of a data instance belonging to a certain class is a logistic (sigmoid) function of a linear combination of its features. This classification algorithm can handle binary (two-class) or multi-class classification problems by using different variants, such as one-vs-all or softmax.
Logistic regression is easy to implement, interpret, and scale to large datasets. However, it may not perform well when the data is not linearly separable or has complex relationships among the features.
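To make the "function of a linear combination" idea concrete, here is a minimal sketch of how a binary logistic regression model could turn features into a probability. The weights, bias, and feature values are made-up numbers for illustration, not values learned from real data.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one data instance."""
    # Linear combination of the features: w1*x1 + w2*x2 + ... + b
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and bias (assumed for illustration only)
weights = [0.8, -0.4]
bias = -0.2

probability = predict_proba([1.5, 2.0], weights, bias)
label = 1 if probability >= 0.5 else 0
print(round(probability, 3), label)
```

Training a real model means finding the weights and bias that make these probabilities match the labels in the training data as closely as possible.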
2. K-Nearest Neighbors (KNN)
KNN is a non-parametric and lazy machine learning algorithm for classification. It does not learn any explicit model from the training data, but instead stores all the data instances and their labels.
When a new data instance is given, it finds the K most similar instances (based on some distance metric) from the stored data and assigns the majority label among them to the new instance.
Think of it as finding friends who are most similar to you. In KNN, we look at the neighbors (data points) closest to us and see what category they belong to. If most of your neighbors are doing something, you might do it too.
KNN can also handle both binary and multi-class classification problems and can adapt to non-linear data. However, because every prediction compares the new instance with the stored training data, it may suffer from high computational costs, memory requirements, and sensitivity to noise and irrelevant features.
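The neighbor-voting procedure described above can be sketched in a few lines of plain Python. This is a toy version that assumes Euclidean distance and a tiny made-up dataset; the function name `knn_predict` is hypothetical and not part of any library.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every stored training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two small clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # near the first cluster
print(knn_predict(train_points, train_labels, (5.1, 5.0)))  # near the second cluster
```

Notice that there is no training step at all: the work happens at prediction time, which is why KNN gets slow on large datasets.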
3. Support Vector Machine (SVM)
SVM is a powerful and popular machine learning algorithm for classification. It is based on the idea of finding the hyperplane that separates the data instances of different classes with the maximum margin.
It’s like drawing a line to separate things in a clear way. Imagine you have two types of fruits, and SVM helps you draw a line to keep them separate, like distinguishing between apples and oranges.
SVM can handle both linear and non-linear data by using different kernels, such as linear, polynomial, radial basis function (RBF), or sigmoid. It can also handle multi-class classification problems by using strategies such as one-vs-one or one-vs-all.
SVM has high accuracy, robustness, and generalization ability. However, it may be computationally expensive, sensitive to hyperparameters and kernel choice, and difficult to interpret.
Because of the computational cost, I only use SVM for small datasets with a high number of variables.
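To see why the kernel choice matters, here is a small sketch using XOR-style toy data, where no straight line can separate the two classes. The data and the `C` and `gamma` values are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: no straight line separates the two classes
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([0, 0, 1, 1])

# C and gamma are illustrative values, not tuned recommendations
linear_svm = SVC(kernel="linear", C=10).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=10, gamma=2).fit(X, y)

linear_train_acc = linear_svm.score(X, y)
rbf_train_acc = rbf_svm.score(X, y)
print("Linear kernel training accuracy:", linear_train_acc)
print("RBF kernel training accuracy:", rbf_train_acc)
```

The linear kernel cannot classify all four points correctly, while the RBF kernel can, because it effectively separates the points in a higher-dimensional space.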
4. Decision Tree
A decision tree is a hierarchical and intuitive machine learning algorithm for classification. It is based on the idea of splitting the data into smaller subsets based on some criterion (such as information gain or the Gini index) until each subset contains only one class or a predefined limit is reached.
Think of it as a game of 20 questions. You start with a big question and keep asking smaller questions to figure out what something is. Imagine you’re trying to guess what kind of pet your friend has. You can ask a series of yes-or-no questions to figure it out, just like a decision tree.
You start with a big question: “Is it a mammal?” If the answer is “No,” you immediately know it’s not a dog or a cat, so maybe it’s a fish or a bird. You then continue with smaller and smaller questions. By asking these questions one after the other, you narrow down your guess step by step until you have a good idea of what kind of pet your friend has.
A decision tree can be represented as a tree-like structure, where each node represents a feature test, each branch represents an outcome of the test, and each leaf represents a class label.
This classification algorithm can handle both binary and multi-class classification problems and can handle both numerical and categorical features. It is easy to understand, implement, and visualize. However, it may be prone to overfitting, instability, and bias.
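The Gini index mentioned above as a splitting criterion can be computed by hand. The sketch below uses made-up pet labels and two hypothetical helper functions (`gini` and `weighted_gini`) to show why a tree prefers one split over another: a lower weighted impurity means purer child nodes.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for a more mixed group."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Toy pet labels (made up): a mixed parent node and two candidate splits
parent = ["cat", "cat", "dog", "dog"]
perfect_split = (["cat", "cat"], ["dog", "dog"])
useless_split = (["cat", "dog"], ["cat", "dog"])

print(gini(parent))                    # impurity before splitting
print(weighted_gini(*perfect_split))   # each child contains one class
print(weighted_gini(*useless_split))   # children are as mixed as the parent
```

A tree-building algorithm evaluates many candidate splits like these and keeps the one that reduces impurity the most.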
5. Random Forest
Random forest is an ensemble machine learning algorithm for classification that combines multiple decision trees. It is based on the idea of creating a diverse set of decision trees by training each tree on a bootstrap sample of the data instances (known as bagging or bootstrap aggregating) and considering only a random subset of features at each split, then aggregating the trees' predictions by majority voting or averaging.
Picture it as a group of friends (trees) who vote on a decision. Each friend has their opinion, and you listen to everyone’s advice to make the best choice. It’s like asking a bunch of people for their thoughts.
Random forest can handle both binary and multi-class classification problems and can handle both numerical and categorical features. Random forest has high accuracy, robustness, and generalization ability. It can also provide feature importance and out-of-bag error estimates.
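As a quick illustration of the feature importance and out-of-bag estimates mentioned above, the following sketch fits a random forest on the Iris dataset that ships with scikit-learn. The number of trees and the random seed are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# oob_score=True asks scikit-learn to score each tree on the samples
# left out of its bootstrap sample; the other settings are illustrative
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(iris.data, iris.target)

print("Out-of-bag accuracy estimate:", forest.oob_score_)
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The out-of-bag score gives a rough accuracy estimate without a separate test set, and the importances sum to 1 so you can compare features directly.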
6. Naive Bayes
Naive Bayes is a probabilistic machine learning algorithm for classification that is based on Bayes’ theorem. It assumes that the probability that a data instance belongs to a certain class is proportional to the product of the prior probability of that class and the conditional probabilities of each feature given that class.
It also assumes that the features are independent of each other given the class label (hence naive). Naive Bayes can handle both binary and multi-class classification problems and can handle both numerical and categorical features (by using different variants such as Gaussian or multinomial).
Naive Bayes is fast, simple, and scalable to large datasets. However, it may not perform well when the data violates the independence assumption or has zero-frequency values.
Imagine it as a detective who uses clues to solve a mystery. Naive Bayes looks at different pieces of evidence and combines them to make a decision, like figuring out if an email is spam or not based on certain words.
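Here is a small worked sketch of that "prior times likelihoods" calculation for the spam example. All probabilities are hypothetical numbers picked for illustration, not estimates from a real email corpus, and `naive_bayes_scores` is a made-up helper name.

```python
def naive_bayes_scores(words, priors, likelihoods):
    """Unnormalized score per class: prior times the product of word likelihoods."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # "Naive" step: treat each word as independent given the class
            score *= likelihoods[label][word]
        scores[label] = score
    return scores

# Hypothetical probabilities (assumed for illustration, not estimated from data)
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham": {"free": 0.03, "meeting": 0.20},
}

scores = naive_bayes_scores(["free"], priors, likelihoods)
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
prediction = max(posteriors, key=posteriors.get)
print(posteriors, prediction)
```

Even though "ham" has the higher prior here, the word "free" is so much more likely in spam that the spam class wins.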
7. Neural Network
A neural network is a complex and powerful machine learning algorithm that is inspired by the structure and function of biological neurons. It consists of multiple layers of artificial neurons (also known as nodes or units) that are connected by weighted links (also known as edges or synapses).
Each neuron receives inputs from other neurons or external sources, computes a weighted sum of them, applies an activation function (such as sigmoid or ReLU) to that sum, and produces an output. The output of one layer serves as the input of the next layer, until the final layer predicts the class label.
A neural network can handle both binary and multi-class classification problems and can handle both numerical and categorical features (by using different encoding schemes such as one-hot or embedding).
A neural network can learn complex and non-linear patterns from the data and can reach high accuracy and generalization ability. However, it may be computationally expensive, sensitive to hyperparameters and initialization, and difficult to interpret.
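The layer-by-layer computation described above can be sketched with NumPy. This is only a forward pass through a tiny network with random, untrained weights (an assumption for illustration), so the prediction itself is meaningless; the point is to show how inputs flow through the layers.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    # Subtract the max for numerical stability before exponentiating
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

rng = np.random.default_rng(0)

# Randomly initialized weights (untrained, for illustration only)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # 4 inputs -> 8 hidden units
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # 8 hidden units -> 3 classes

x = np.array([5.1, 3.5, 1.4, 0.2])              # one sample with four features

hidden = relu(x @ W1 + b1)                      # weighted sum, then activation
probabilities = softmax(hidden @ W2 + b2)       # one probability per class
predicted_class = int(np.argmax(probabilities))
print(probabilities, predicted_class)
```

Training a network means adjusting `W1`, `b1`, `W2`, and `b2` so that these output probabilities match the true labels.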
8. XGBoost
XGBoost is an ensemble learning method that combines the predictions of multiple weak models to produce a stronger prediction. XGBoost stands for “Extreme Gradient Boosting”, and it has become one of the most popular and widely used machine learning algorithms because it handles large datasets well and often achieves state-of-the-art performance in machine learning tasks such as classification and regression.
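The idea of combining many weak models can be sketched with scikit-learn's `GradientBoostingClassifier`. Note that this is a different gradient boosting implementation, swapped in here so the sketch needs only scikit-learn; it is not XGBoost itself, and the settings are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Depth-1 trees ("stumps") are deliberately weak learners
booster = GradientBoostingClassifier(
    n_estimators=50, max_depth=1, learning_rate=0.1, random_state=0
)
booster.fit(X, y)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 rounds
staged_accuracy = [accuracy_score(y, pred) for pred in booster.staged_predict(X)]
print("Training accuracy after 1 round:", staged_accuracy[0])
print("Training accuracy after 50 rounds:", staged_accuracy[-1])
```

Each boosting round adds small trees that focus on the errors the current ensemble still makes, which is how a collection of weak stumps becomes a useful classifier.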
Now let me quickly show you how you can implement each of those algorithms in Python. For this demo tutorial, I am going to use the Iris dataset. It contains information about three different species of iris flowers: setosa, versicolor, and virginica. Each iris sample has four features: sepal length, sepal width, petal length, and petal width, all measured in centimeters.
You can load the Iris dataset using the load_iris() function from the sklearn.datasets module in Python. Let’s load this dataset and split it into train and test datasets.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Split the dataset into 70% training and 30% testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Logistic Regression in Python
Let’s first implement logistic regression in Python. The sklearn package makes it easy to implement almost all popular machine learning algorithms in Python. You just need to install this package using the command:
pip install scikit-learn
    # Logistic Regression
    from sklearn.linear_model import LogisticRegression

    # Create and fit the logistic regression model
    logistic_model = LogisticRegression()
    logistic_model.fit(X_train, y_train)

    # Make predictions
    logistic_predictions = logistic_model.predict(X_test)

    # Show the predicted class labels (displayed automatically in a notebook)
    logistic_predictions
K-Nearest Neighbors (KNN) in Python
In a similar way, you can use the KNeighborsClassifier class of the sklearn library to implement the K-nearest neighbors (KNN) algorithm in Python. Below is the code to do that.
    # K-Nearest Neighbors (KNN)
    from sklearn.neighbors import KNeighborsClassifier

    # Create and fit the KNN model
    knn_model = KNeighborsClassifier(n_neighbors=3)
    knn_model.fit(X_train, y_train)

    # Make predictions
    knn_predictions = knn_model.predict(X_test)
Support Vector Machine (SVM) in Python
SVC is the class in the sklearn library that you can use to experiment with the support vector machine algorithm in Python.
    # Support Vector Machine (SVM)
    from sklearn.svm import SVC

    # Create and fit the SVM model
    svm_model = SVC(kernel='linear')
    svm_model.fit(X_train, y_train)

    # Make predictions
    svm_predictions = svm_model.predict(X_test)
Decision Tree Python Implementation
In a similar manner, you can also implement a decision tree using sklearn in Python. One thing you might notice is that the coding style is the same for all algorithms when using the sklearn library: create the model, fit it, then predict.
    # Decision Tree
    from sklearn.tree import DecisionTreeClassifier

    # Create and fit the decision tree model
    decision_tree_model = DecisionTreeClassifier()
    decision_tree_model.fit(X_train, y_train)

    # Make predictions
    decision_tree_predictions = decision_tree_model.predict(X_test)
Random Forest in Python
Below is the Python code to use the Random Forest algorithm with the sklearn library.
    # Random Forest
    from sklearn.ensemble import RandomForestClassifier

    # Create and fit the random forest model
    random_forest_model = RandomForestClassifier()
    random_forest_model.fit(X_train, y_train)

    # Make predictions
    random_forest_predictions = random_forest_model.predict(X_test)
Naive Bayes Algorithm in Python
This is the Python code to implement the Naive Bayes algorithm using the sklearn library. If you want a more in-depth implementation of the Naive Bayes algorithm, you can read this tutorial: Naive Bayes algorithm in Machine Learning with Python.
    # Naive Bayes
    from sklearn.naive_bayes import GaussianNB

    # Create and fit the Naive Bayes model
    naive_bayes_model = GaussianNB()
    naive_bayes_model.fit(X_train, y_train)

    # Make predictions
    naive_bayes_predictions = naive_bayes_model.predict(X_test)
XGBoost Python Implementation
XGBoost is not part of the sklearn library. To implement this machine learning algorithm for classification, you need to install the Python package below.
pip install xgboost
Once you have installed that package, you can use the code below to implement the XGBoost algorithm in Python.
    # XGBoost
    from xgboost import XGBClassifier

    # Create and fit the XGBoost model
    xgboost_model = XGBClassifier()
    xgboost_model.fit(X_train, y_train)

    # Make predictions
    xgboost_predictions = xgboost_model.predict(X_test)
Neural Network in Python
Implementing a neural network is a little different from the other machine learning algorithms. To implement a neural network in Python, you need to install the TensorFlow library using this command:
pip install tensorflow
At the time of writing, the TensorFlow library supported Python 3.7 or higher. Newer TensorFlow releases require a more recent Python version, so check the TensorFlow installation guide for the release you install. If you want to install TensorFlow with GPU support, I suggest reading this article: Install TensorFlow GPU with Jupiter Notebook for Windows.
    import numpy as np
    import tensorflow as tf
    from sklearn.preprocessing import StandardScaler
    from sklearn.metrics import accuracy_score

    # Standardize the data (important for neural networks)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Create a simple neural network with Keras
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(X_train.shape[1],)),  # Input layer: one value per feature
        tf.keras.layers.Dense(64, activation='relu'),  # Hidden layer with 64 neurons and ReLU activation
        tf.keras.layers.Dense(3, activation='softmax')  # Output layer with softmax activation for multi-class classification
    ])

    # Compile the model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    # Train the model
    model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)

    # Make predictions: pick the class with the highest predicted probability
    y_pred_prob = model.predict(X_test)
    y_pred = np.argmax(y_pred_prob, axis=1)

    # Calculate accuracy
    nn_accuracy = accuracy_score(y_test, y_pred)
Calculate and Print Accuracy
Finally, we can calculate and print the accuracy of all the machine learning algorithms listed above and see which one performs best on our classification task.
    from sklearn.metrics import accuracy_score

    # Evaluate model performance (example: accuracy)
    logistic_accuracy = accuracy_score(y_test, logistic_predictions)
    knn_accuracy = accuracy_score(y_test, knn_predictions)
    svm_accuracy = accuracy_score(y_test, svm_predictions)
    decision_tree_accuracy = accuracy_score(y_test, decision_tree_predictions)
    random_forest_accuracy = accuracy_score(y_test, random_forest_predictions)
    naive_bayes_accuracy = accuracy_score(y_test, naive_bayes_predictions)
    xgboost_accuracy = accuracy_score(y_test, xgboost_predictions)

    # Print accuracy for each model
    print("Logistic Regression Accuracy:", logistic_accuracy)
    print("K-Nearest Neighbors Accuracy:", knn_accuracy)
    print("SVM Accuracy:", svm_accuracy)
    print("Decision Tree Accuracy:", decision_tree_accuracy)
    print("Random Forest Accuracy:", random_forest_accuracy)
    print("Naive Bayes Accuracy:", naive_bayes_accuracy)
    print("XGBoost Accuracy:", xgboost_accuracy)
    print("Neural Network Accuracy:", nn_accuracy)

Output:

    Logistic Regression Accuracy: 0.9777777777777777
    K-Nearest Neighbors Accuracy: 0.9555555555555556
    SVM Accuracy: 0.9555555555555556
    Decision Tree Accuracy: 0.9111111111111111
    Random Forest Accuracy: 0.9333333333333333
    Naive Bayes Accuracy: 0.9333333333333333
    XGBoost Accuracy: 0.9333333333333333
    Neural Network Accuracy: 0.8888888888888888
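As you can see, Logistic Regression gives the best accuracy among these algorithms on our classification task (the Iris dataset). Keep in mind that the test set has only 45 samples, so each misclassified flower changes the accuracy by about 2 percentage points, and small gaps between models are not conclusive.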
Choosing the Right Algorithm
The right algorithm depends on your data and your goal. For example, if your dataset has a small number of observations (rows), a neural network is unlikely to perform well. In that case, you have to consider a different algorithm.
If you want to interpret your model, in other words to understand how much each variable contributes to a prediction, Logistic Regression is a strong choice, because its coefficients directly show the direction and strength of each feature's effect.
If your dataset is small and has a high number of variables, you should consider the Support Vector Machine (SVM) algorithm. It often gives higher accuracy than other algorithms in this setting.
If you are working with a categorical dataset, you can try a decision tree, Random Forest, or XGBoost as a default choice. You can also apply techniques like one-hot encoding and then try the other classification algorithms.
If you are working on a text classification problem and your dataset is small, you can try the Naive Bayes algorithm. It is often surprisingly effective for this kind of task.
Finally, if you are not sure which machine learning algorithm to use for your classification task, you can apply all of them and compare evaluation metrics such as F1 score, precision, recall, and the confusion matrix.
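As a small illustration of one-hot encoding, the sketch below converts a made-up categorical column into numeric columns with scikit-learn's `OneHotEncoder`. The color values are hypothetical toy data.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categorical feature (made up for illustration)
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # categories in sorted order
print(encoded)                 # one column per category, one row per sample
```

After this step, each category has its own 0/1 column, so algorithms that expect numeric input, such as logistic regression or SVM, can use the feature.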
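One possible way to run such a comparison is sketched below with three of the models on the Iris dataset. The choice of models, the macro averaging, and settings such as `max_iter=1000` are assumptions for illustration; swap in whichever models and metrics matter for your task.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A few candidate models; the settings are illustrative, not tuned
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

results = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        # Macro averaging gives each class equal weight
        "precision": precision_score(y_test, predictions, average="macro"),
        "recall": recall_score(y_test, predictions, average="macro"),
        "f1": f1_score(y_test, predictions, average="macro"),
        "confusion_matrix": confusion_matrix(y_test, predictions),
    }
    print(name, results[name])
```

The confusion matrix is especially useful because it shows which classes get mixed up with each other, something a single accuracy number hides.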
This is it for this article. If you have any questions or suggestions regarding this article, please let me know in the comment section below.
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.