Classification Models with Python Code from Scratch
This is a step-by-step machine learning classification tutorial with Python code, written from scratch using the scikit-learn library. In this tutorial, I explain every step with code and its compiled result. To evaluate the models we use the publicly available Kaggle salary classification dataset. It has two classes and 15 features. There are a few null values, and we use several methods to fill them; we also apply a feature selection method and seven popular machine learning models to cover the problem completely.
In this section, we import the important libraries: numpy, pandas, sklearn, seaborn, and matplotlib.
# Import important libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler  # scalers to normalize features
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, auc, precision_recall_curve, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
In this part we read the dataset and find its number of rows and columns.
df = pd.read_csv("salary.csv")
nRow, nCol = df.shape
print(nRow)
print(nCol)
Here we perform some preprocessing to organize the data. The tolist() function converts the column index into a Python list.
df = pd.DataFrame(df)
df_1 = df.columns.tolist()
print(df_1)
This shows the first five rows, which helps to get a quick look at the whole dataset.
df.head(5)
This function reports the number of null values in each feature.
df.isnull().sum()
This function converts categorical data into numerical values using LabelEncoder, so the models can interpret it easily.
from sklearn.preprocessing import LabelEncoder

def labelencoder(df):
    # Convert categorical/string columns into numerical values
    # so the models can interpret them easily.
    for c in df.columns:
        if df[c].dtype == 'object':
            df[c] = df[c].fillna('N')
            lbl = LabelEncoder()
            lbl.fit(list(df[c].values))
            df[c] = lbl.transform(df[c].values)
    return df
Visualization of the data after applying the label encoding.
data1 = labelencoder(df)
data1
Here we separate the salary column as the target value and keep all other columns as features.
Labels = data1['salary']
dataX = data1.drop('salary', axis=1)
There are various methods to handle missing and NaN values. Here we use SimpleImputer to fill the missing values with the mean strategy.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(dataX)
X = imputer.transform(dataX)
For feature selection, recursive feature elimination (RFE) and Boruta are extensively used. In this tutorial we use Boruta with a random forest classifier. It keeps the useful features and drops the less useful ones.
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

model = RandomForestClassifier(max_depth=1)
feat_selector = BorutaPy(model, n_estimators='auto', verbose=1, random_state=101)
feat_selector.fit(X, Labels.values)  # BorutaPy expects NumPy arrays, so pass the underlying array
print(feat_selector.support_)
print(feat_selector.ranking_)
X_filtered1 = feat_selector.transform(X)
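To see which columns survived, the boolean mask in support_ can be mapped back to the original column names (a small sketch using the dataX frame from earlier; the column order matches X because X came from imputer.transform(dataX)):

# Map the Boruta support mask back to the original column names
selected_cols = [col for col, keep in zip(dataX.columns, feat_selector.support_) if keep]
print("Selected features:", selected_cols)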
Here we normalize the selected features using MinMaxScaler. MinMaxScaler rescales each column as Final_selected = (Final_selected – Final_selected.min()) / (Final_selected.max() – Final_selected.min()), which maps the values into the range 0 to 1. (StandardScaler, by contrast, subtracts the mean and divides by the standard deviation.)
Final_selected = X_filtered1
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
Final_selected = scaler.fit_transform(Final_selected)
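As a quick sanity check, the scaler's output matches the per-column formula above. A minimal sketch on a toy array (np is the NumPy import from the top):

# Toy demonstration: MinMaxScaler matches (x - min) / (max - min) per column
toy = np.array([[1.0], [3.0], [5.0]])
manual = (toy - toy.min(axis=0)) / (toy.max(axis=0) - toy.min(axis=0))
print(manual.ravel())                             # [0.  0.5 1. ]
print(MinMaxScaler().fit_transform(toy).ravel())  # [0.  0.5 1. ]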
This step splits the dataset into training and testing sets, keeping the test size at 20%.
X_train, X_test, y_train, y_test = train_test_split(Final_selected, Labels, test_size=0.20, random_state=9)
print('training features =', X_train.shape)
print('testing features =', X_test.shape)
print('training labels =', y_train.shape)
print('testing labels =', y_test.shape)
Now our dataset is well organized and ready to feed to the models. Preprocessing is often the most difficult part of an AI and machine learning workflow. Next we try different models one by one and look at their output performance in terms of accuracy and the confusion matrix. Note that this is a two-class classification problem. Each model also reports precision, recall, and F1-score. Let's start.
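Since every model below is evaluated the same way, the repeated metric code could be wrapped in a small helper; evaluate is a hypothetical function name, not part of the original tutorial. Each section below keeps the explicit code for clarity.

from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

def evaluate(model, X_test, y_test):
    # Print classification report, accuracy, and confusion matrix for a fitted model.
    prediction = model.predict(X_test)
    print(classification_report(y_test, prediction))
    print('Accuracy:', accuracy_score(y_test, prediction))
    print(confusion_matrix(y_test, prediction))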
Random Forest (RF) Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
Support Vector Machine (SVM) Classifier
from sklearn import svm
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# model = svm.LinearSVC(multi_class="ovr")
model = svm.SVC(kernel='rbf', gamma=7.9, C=20, decision_function_shape='ovo')
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
Decision Tree (DT) Classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
Gradient Boosting Machine (GBM) Classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

model = GradientBoostingClassifier(random_state=101)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
K-Nearest Neighbors (KNN) Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

model = KNeighborsClassifier(n_neighbors=2)
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
eXtreme Gradient Boosting (XGBoost) Classifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

model = XGBClassifier()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
Multi-Layer Perceptron (MLP) Classifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

MLP = MLPClassifier(max_iter=1500, activation='relu', learning_rate_init=0.001,
                    shuffle=True, learning_rate='constant', beta_1=0.999, beta_2=0.9,
                    momentum=0.88, power_t=0.9, solver='lbfgs', alpha=1e-6,
                    random_state=101)
MLP.fit(X_train, y_train)
prediction = MLP.predict(X_test)
print(classification_report(y_test, prediction))
print(accuracy_score(y_test, prediction))
print(confusion_matrix(y_test, prediction))
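The metric imports at the top also include roc_curve and roc_auc_score, which are not used above. As a sketch, here is how an ROC curve could be drawn for the fitted MLP; any of the classifiers above that support predict_proba would work the same way, and column 1 is assumed to be the positive class of the binary labels:

from sklearn.metrics import roc_curve, roc_auc_score

# ROC curve for the fitted MLP (any classifier with predict_proba works)
probs = MLP.predict_proba(X_test)[:, 1]   # probability of the positive class
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='AUC = %.3f' % roc_auc_score(y_test, probs))
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()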
Conclusion:
In this tutorial, I tried to cover all the traditional machine learning classifiers, with complete Python code, to explain the classification problem. It should be very helpful for beginners and machine learning aspirants, and it will help build your basics in the field of machine learning and data science. Throughout, I discussed data loading, data preparation, and model development. This was a classification problem; we will share a regression problem with you in the coming days. Keep in touch.
If you have any question, idea, or anything else about this tutorial, please comment.