This is my final project assignment for the IBM online course Machine Learning with Python on Coursera.
In this notebook we build a classifier to predict whether a loan case will be paid off or not. We load a historical dataset of previous loan applications, clean the data, and apply different classification algorithms to it. The result is reported as the accuracy of each classifier, using the following metrics where applicable:
| Algorithms | Evaluation |
|---|---|
| K Nearest Neighbor (KNN), Decision Tree (DT), Support Vector Machine (SVM), Logistic Regression (LR) | Jaccard index, F1-score, LogLoss |
Table of Contents
In this notebook we try to practice all the classification algorithms that we learned in this course.
We load a dataset using the Pandas library, apply the following algorithms, and find the best one for this specific dataset using accuracy-based evaluation methods:
K Nearest Neighbor (KNN)
Decision Tree (DT)
Support Vector Machine (SVM)
Logistic Regression (LR)
1. Environment And Dataset

Let's first load the required libraries:

```python
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
```
1.1. About Dataset

This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:
| Field | Description |
|---|---|
| Loan_status | Whether a loan is paid off or in collection |
| Principal | Basic principal loan amount at origination |
| Terms | Origination terms, which can be a weekly (7 days), biweekly, or monthly payoff schedule |
| Effective_date | When the loan was originated and took effect |
| Due_date | Since it is a one-time payoff schedule, each loan has one single due date |
| Age | Age of applicant |
| Education | Education of applicant |
| Gender | The gender of applicant |
Let's download the dataset:

```python
!wget -O loan_train.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv
```
1.2. Load Data From CSV File

```python
df = pd.read_csv('loan_train.csv')
df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | High School or Below | male |
| 1 | 2 | 2 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | Bechalor | female |
| 2 | 3 | 3 | PAIDOFF | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
| 3 | 4 | 4 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
| 4 | 6 | 6 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |

The full data frame has 346 rows and 10 columns: `(346, 10)`.
1.3. Convert To Date Time Object

```python
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | male |
| 1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | female |
| 2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male |
| 3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female |
| 4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male |
2. Pre-processing With Data Visualization

Libraries used in this section:

```python
from sklearn import preprocessing
import seaborn as sns
```
2.1. Visualizing The Data

Let's see how many of each class are in our data set:

```python
df['loan_status'].value_counts()
```
```
PAIDOFF       260
COLLECTION     86
Name: loan_status, dtype: int64
```
260 people have paid off the loan on time, while 86 have gone into collection. Note that a majority-class baseline would therefore already reach 260/346 ≈ 75% accuracy, which is worth keeping in mind when judging the models below.
Let's plot some columns to understand the data better:

```python
# Please install the seaborn library if you have not done so already.
# Note: installing seaborn might take a few minutes.
!conda install -c anaconda seaborn -y
```
Let's see the loan status for different principal amounts, per gender:

```python
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```
Let's see the loan status for different ages, per gender:

```python
bins = np.linspace(df.age.min(), df.age.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```
Let's look at the day of the week people get the loan:

```python
df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
g = sns.FacetGrid(df, col="Gender", hue="loan_status", palette="Set1", col_wrap=2)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```
We see that people who get the loan at the end of the week tend not to pay it off, so let's use feature binarization to create a weekend flag with a threshold at day 4 (dayofweek > 3, i.e. Friday through Sunday):

```python
df['weekend'] = df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | male | 3 | 0 |
| 1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | female | 3 | 0 |
| 2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male | 3 | 0 |
| 3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female | 4 | 1 |
| 4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male | 4 | 1 |
2.2. Categorical Encoding

Categorical variables need to be converted into numeric form (encoded) before they can be fed to our machine learning algorithms. While there are many ways to implement this, we introduce two common approaches here: Label Encoding for the gender feature and One Hot Encoding for the education feature, respectively.
You may find more information here: Guide to Encoding Categorical Values in Python
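To make the distinction concrete, here is a minimal sketch on a toy data frame (the toy data and column names below are illustrative, not from the dataset; note also that LabelEncoder assigns integer codes alphabetically, whereas the cell further down maps male to 0 and female to 1 by hand):

```python
import pandas as pd
from sklearn import preprocessing

toy = pd.DataFrame({'Gender': ['male', 'female', 'male'],
                    'education': ['college', 'Bechalor', 'Master or Above']})

# Label Encoding: each category becomes a single integer code
le = preprocessing.LabelEncoder()
toy['Gender_code'] = le.fit_transform(toy['Gender'])  # 'female' -> 0, 'male' -> 1

# One Hot Encoding: one binary column per category
toy = pd.concat([toy, pd.get_dummies(toy['education'])], axis=1)
print(toy)
```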
2.2.1. Gender

Let's look at gender first:

```python
df.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
```
```
Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64
```
86% of females pay off their loans, while only 73% of males pay off theirs.
Let's convert male to 0 and female to 1:
```python
df['Gender'].replace(to_replace=['male','female'], value=[0,1], inplace=True)
df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | 0 | 3 | 0 |
| 1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | 1 | 3 | 0 |
| 2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | 0 | 3 | 0 |
| 3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | 1 | 4 | 1 |
| 4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | 0 | 4 | 1 |
2.2.2. Education

Education takes four different values in the data set, so we will apply One Hot Encoding to this feature.

```python
df.groupby(['education'])['loan_status'].value_counts(normalize=True)
```
```
education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64
```
Before One Hot Encoding, the data frame looks like this:
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | High School or Below | 0 | 3 | 0 |
| 1 | 2 | 2 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | Bechalor | 1 | 3 | 0 |
| 2 | 3 | 3 | PAIDOFF | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | 0 | 3 | 0 |
| 3 | 4 | 4 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | 1 | 4 | 1 |
| 4 | 6 | 6 | PAIDOFF | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | 0 | 4 | 1 |
Summary statistics of the numeric columns:

|   | Unnamed: 0 | Unnamed: 0.1 | Principal | terms | age | Gender | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|
| count | 346.000000 | 346.000000 | 346.000000 | 346.000000 | 346.000000 | 346.000000 | 346.000000 | 346.000000 |
| mean | 202.167630 | 202.167630 | 943.641618 | 22.653179 | 30.939306 | 0.150289 | 3.682081 | 0.592486 |
| std | 115.459715 | 115.459715 | 109.425530 | 7.991006 | 6.039418 | 0.357872 | 2.614912 | 0.492084 |
| min | 0.000000 | 0.000000 | 300.000000 | 7.000000 | 18.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 107.250000 | 107.250000 | 900.000000 | 15.000000 | 27.000000 | 0.000000 | 0.250000 | 0.000000 |
| 50% | 204.500000 | 204.500000 | 1000.000000 | 30.000000 | 30.000000 | 0.000000 | 5.000000 | 1.000000 |
| 75% | 298.750000 | 298.750000 | 1000.000000 | 30.000000 | 35.000000 | 0.000000 | 6.000000 | 1.000000 |
| max | 399.000000 | 399.000000 | 1000.000000 | 30.000000 | 51.000000 | 1.000000 | 6.000000 | 1.000000 |
```python
features = ['Principal','terms','age','Gender','education']
df[features].head()
```
|   | Principal | terms | age | Gender | education |
|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | High School or Below |
| 1 | 1000 | 30 | 33 | 1 | Bechalor |
| 2 | 1000 | 15 | 27 | 0 | college |
| 3 | 1000 | 30 | 28 | 1 | college |
| 4 | 1000 | 30 | 29 | 0 | college |
One Hot Encoding

Use the One Hot Encoding technique to convert categorical variables to binary variables and append them to the feature data frame. Here we use the pandas function get_dummies:

```python
Feature = df[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis=1)
Feature.drop(['Master or Above'], axis=1, inplace=True)
Feature.head()
```
|   | Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1000 | 30 | 33 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1000 | 30 | 28 | 1 | 1 | 0 | 0 | 1 |
| 4 | 1000 | 30 | 29 | 0 | 1 | 0 | 0 | 1 |
2.3. Feature Selection

Let's define our feature set, X:
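The code cell that produced the preview below appears to have been lost; a minimal sketch of what it presumably contained, assuming X is simply the encoded feature frame from above:

```python
X = Feature  # the one-hot-encoded feature frame built in section 2.2.2
X[0:5]
```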
|   | Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1000 | 30 | 33 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1000 | 30 | 28 | 1 | 1 | 0 | 0 | 1 |
| 4 | 1000 | 30 | 29 | 0 | 1 | 0 | 0 | 1 |
What are our labels?

```python
y = df['loan_status'].values
y[0:5]
```
```
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)
```
2.4. Normalize Data

Data standardization gives the data zero mean and unit variance (technically this should be done after the train/test split, so that no information from the test set leaks into the scaling):

```python
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5]
```
```
array([[ 0.51578458,  0.92071769,  2.33152555, -0.42056004, -1.20577805,
        -0.38170062,  1.13639374, -0.86968108],
       [ 0.51578458,  0.92071769,  0.34170148,  2.37778177, -1.20577805,
         2.61985426, -0.87997669, -0.86968108],
       [ 0.51578458, -0.95911111, -0.65321055, -0.42056004, -1.20577805,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.48739188,  2.37778177,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679],
       [ 0.51578458,  0.92071769, -0.3215732 , -0.42056004,  0.82934003,
        -0.38170062, -0.87997669,  1.14984679]])
```
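As the parenthetical above hints, the leak-free pattern is to fit the scaler on the training split only and reuse its statistics on the test split. A minimal sketch, assuming X and y as defined above (this is an aside; the rest of the notebook keeps the original approach):

```python
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)

scaler = preprocessing.StandardScaler().fit(X_tr)  # statistics from the training split only
X_tr = scaler.transform(X_tr)
X_te = scaler.transform(X_te)  # test data scaled with training statistics
```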
3. Classification

Now it is your turn: use the training set to build an accurate model, then use the test set to report the accuracy of the model. You should use the following algorithms:
K Nearest Neighbor(KNN)
Decision Tree
Support Vector Machine
Logistic Regression
Common libraries needed for all models:

```python
from sklearn.model_selection import train_test_split as split
from sklearn import metrics
```
3.1. K Nearest Neighbor (KNN)

Notice: you should find the best k to build the model with the best accuracy.
Warning: you should not use loan_test.csv for finding the best k; however, you can split loan_train.csv into train and test sets to find the best k.
```python
from sklearn.neighbors import KNeighborsClassifier
```
```python
X_train, X_test, y_train, y_test = split(X, y, test_size=0.2, random_state=4)

def knn_test_accuracy(k):
    """Fit a KNN model with n_neighbors=k and return its hold-out accuracy."""
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    return metrics.accuracy_score(y_test, yhat)
```
```python
Ks = 15
Test_accuracies = np.zeros(Ks)
best_accuracy = 0
best_k = 0
for k in range(1, Ks):
    accuracy = knn_test_accuracy(k)
    Test_accuracies[k] = accuracy
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_k = k
best_k
```
```
7
```

The hold-out accuracies for each k (the Test_accuracies array; index 0 is unused):

```
array([0.        , 0.67142857, 0.65714286, 0.71428571, 0.68571429,
       0.75714286, 0.71428571, 0.78571429, 0.75714286, 0.75714286,
       0.67142857, 0.7       , 0.72857143, 0.7       , 0.7       ])
```
```python
plt.plot(range(1, Ks), Test_accuracies[1:], 'b')
plt.annotate('Best K:{}, accuracy:{:.3f}'.format(best_k, best_accuracy),
             xy=(best_k, best_accuracy),
             xytext=(best_k + 2, best_accuracy),
             arrowprops=dict(facecolor='black', shrink=0.1))
plt.plot((best_k, best_k), (min(Test_accuracies[1:]), best_accuracy), 'r')
plt.ylabel('Accuracy')
plt.xlabel('K')
plt.show()
```
```python
kNeighbor = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
sum(y_test == kNeighbor.predict(X_test)) / len(y_test)
```
```
0.7857142857142857
```

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=7, p=2,
                     weights='uniform')
```
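Because a single 80/20 split makes the chosen k sensitive to which rows land in the hold-out set, a more robust alternative is to score each k with cross-validation on the training file only. A minimal sketch, assuming X and y as defined above (cross_val_score is scikit-learn's standard helper):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

cv_means = []
for k in range(1, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X, y, cv=5, scoring='accuracy')  # 5-fold CV accuracy
    cv_means.append(scores.mean())

best_k_cv = int(np.argmax(cv_means)) + 1  # +1 because k starts at 1
print('best k by CV:', best_k_cv)
```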
3.2. Decision Tree

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
```
```python
loanTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
loanTree.fit(X_train, y_train)
yhat = loanTree.predict(X_train)
sum(y_train == yhat) / len(y_train)  # accuracy on the training split
```
```
0.7463768115942029
```
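Note that the accuracy above is measured on the training split itself. For a fair comparison with the other models, a quick hold-out check on X_test would look like this (same split as before):

```python
yhat_test = loanTree.predict(X_test)
print(sum(y_test == yhat_test) / len(y_test))  # hold-out accuracy of the tree
```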
Visualization:

```python
# note: sklearn.externals.six was removed in newer scikit-learn;
# there, use io.StringIO directly instead
from sklearn.externals.six import StringIO
import pydotplus
import matplotlib.image as mpimg
```

```python
dot_data = StringIO()
filename = 'loanTree.png'
featureNames = Feature.columns
targetNames = ['PAIDOFF', 'COLLECTION']
out = tree.export_graphviz(loanTree,
                           feature_names=featureNames,
                           out_file=dot_data,
                           class_names=np.unique(y),
                           filled=True,
                           special_characters=True,
                           rotate=False)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(filename)
img = mpimg.imread(filename)
plt.figure(figsize=(100, 200))
plt.imshow(img, interpolation='nearest')
```
```
<matplotlib.image.AxesImage at 0x7f9e2f1b37f0>
```
3.3. Support Vector Machine

```python
from sklearn import svm

svm_Model = svm.SVC(kernel='rbf')
svm_Model.fit(X_train, y_train)
yhat = svm_Model.predict(X_test)
sum(y_test == yhat) / len(y_test)
```
```
0.7428571428571429
```
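The RBF kernel is used here without comparison; if you want to check whether another kernel suits this data better, a small sweep such as the following sketch would do (the kernel names are scikit-learn's built-in options):

```python
for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    model = svm.SVC(kernel=kernel).fit(X_train, y_train)
    acc = sum(y_test == model.predict(X_test)) / len(y_test)
    print('{:<8} hold-out accuracy: {:.3f}'.format(kernel, acc))
```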
3.4. Logistic Regression

```python
from sklearn.linear_model import LogisticRegression
```

```python
LR_Model = LogisticRegression(C=0.01, solver='liblinear')
LR_Model.fit(X_train, y_train)
lrYhat = LR_Model.predict(X_test)
lrYhat_prob = LR_Model.predict_proba(X_test)
sum(lrYhat == y_test) / len(y_test)
```
```
0.6857142857142857
```
4. Model Evaluation Using Test Data Set

```python
from sklearn.metrics import jaccard_similarity_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import jaccard_score
```
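For reference, the metrics used below are the standard ones. With label sets $y$ (true) and $\hat{y}$ (predicted), and predicted probabilities $p_i$ for the positive class:

$$J(y,\hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1-y_i)\log(1-p_i)\right]$$

Log loss is only defined for models that output probabilities, which is why it is reported for Logistic Regression alone.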
First, download and load the test set:

```python
!wget -O loan_test.csv https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv
```

4.1. Load Test Data Set For Evaluation

```python
test_df = pd.read_csv('loan_test.csv')
test_df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 50 | Bechalor | female |
| 1 | 5 | 5 | PAIDOFF | 300 | 7 | 9/9/2016 | 9/15/2016 | 35 | Master or Above | male |
| 2 | 21 | 21 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 43 | High School or Below | female |
| 3 | 24 | 24 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 26 | college | male |
| 4 | 35 | 35 | PAIDOFF | 800 | 15 | 9/11/2016 | 9/25/2016 | 29 | Bechalor | male |
4.2. Pre-processing Test Data Set

```python
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 50 | Bechalor | female |
| 1 | 5 | 5 | PAIDOFF | 300 | 7 | 2016-09-09 | 2016-09-15 | 35 | Master or Above | male |
| 2 | 21 | 21 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 43 | High School or Below | female |
| 3 | 24 | 24 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 26 | college | male |
| 4 | 35 | 35 | PAIDOFF | 800 | 15 | 2016-09-11 | 2016-09-25 | 29 | Bechalor | male |
```python
test_df['loan_status'].value_counts()
```
```
PAIDOFF       40
COLLECTION    14
Name: loan_status, dtype: int64
```
```python
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if x > 3 else 0)
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1], inplace=True)
test_df.head()
```
|   | Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | PAIDOFF | 1000 | 30 | 2016-09-08 | 2016-10-07 | 50 | Bechalor | 1 | 3 | 0 |
| 1 | 5 | 5 | PAIDOFF | 300 | 7 | 2016-09-09 | 2016-09-15 | 35 | Master or Above | 0 | 4 | 1 |
| 2 | 21 | 21 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 43 | High School or Below | 1 | 5 | 1 |
| 3 | 24 | 24 | PAIDOFF | 1000 | 30 | 2016-09-10 | 2016-10-09 | 26 | college | 0 | 5 | 1 |
| 4 | 35 | 35 | PAIDOFF | 800 | 15 | 2016-09-11 | 2016-09-25 | 29 | Bechalor | 0 | 6 | 1 |
```python
test_Feature = test_df[['Principal','terms','age','Gender','weekend']]
test_Feature = pd.concat([test_Feature, pd.get_dummies(test_df['education'])], axis=1)
test_Feature.drop(['Master or Above'], axis=1, inplace=True)
test_Feature.head()
```
|   | Principal | terms | age | Gender | weekend | Bechalor | High School or Below | college |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 50 | 1 | 0 | 1 | 0 | 0 |
| 1 | 300 | 7 | 35 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1000 | 30 | 43 | 1 | 1 | 0 | 1 | 0 |
| 3 | 1000 | 30 | 26 | 0 | 1 | 0 | 0 | 1 |
| 4 | 800 | 15 | 29 | 0 | 1 | 1 | 0 | 0 |
```python
X_test = preprocessing.StandardScaler().fit(test_Feature).transform(test_Feature)
X_test[0:5]
```
```
array([[ 0.49362588,  0.92844966,  3.05981865,  1.97714211, -1.30384048,
         2.39791576, -0.79772404, -0.86135677],
       [-3.56269116, -1.70427745,  0.53336288, -0.50578054,  0.76696499,
        -0.41702883, -0.79772404, -0.86135677],
       [ 0.49362588,  0.92844966,  1.88080596,  1.97714211,  0.76696499,
        -0.41702883,  1.25356634, -0.86135677],
       [ 0.49362588,  0.92844966, -0.98251057, -0.50578054,  0.76696499,
        -0.41702883, -0.79772404,  1.16095912],
       [-0.66532184, -0.78854628, -0.47721942, -0.50578054,  0.76696499,
         2.39791576, -0.79772404, -0.86135677]])
```
```python
y_test = test_df['loan_status'].values
y_test[0:5]
```
```
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'],
      dtype=object)
```

For reference, the full test label array (likely the output of a cell displaying y_test):

```
array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF',
       'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'COLLECTION',
       'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION',
       'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION',
       'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION',
       'COLLECTION'], dtype=object)
```
```python
kNByhat = kNeighbor.predict(X_test)
dTreeYhat = loanTree.predict(X_test)
svmYhat = svm_Model.predict(X_test)
lrYhat = LR_Model.predict(X_test)
lrYhat_prob = LR_Model.predict_proba(X_test)
```
```python
data = []
idx = ['KNN', 'Decision Tree', 'SVM', 'LogisticRegression']
for yhat, Model_Name in zip([kNByhat, dTreeYhat, svmYhat, lrYhat], idx):
    jaccard = jaccard_similarity_score(y_test, yhat)
    F1 = f1_score(y_test, yhat, average='weighted')
    data.append({'Jaccard': jaccard, 'F1-score': F1})
    print('{}'.format('-' * 42))
    print('{:<21} Jaccard index: {:.2f}'.format(Model_Name, jaccard))
    print('{:<21} F1-score: {:.2f}'.format('', F1))
data[-1]['LogLoss'] = log_loss(y_test, lrYhat_prob)
print('{:<21} LogLoss: {:.2f}'.format('', data[-1]['LogLoss']))
```
```
------------------------------------------
KNN                   Jaccard index: 0.67
                      F1-score: 0.63
------------------------------------------
Decision Tree         Jaccard index: 0.72
                      F1-score: 0.74
------------------------------------------
SVM                   Jaccard index: 0.80
                      F1-score: 0.76
------------------------------------------
LogisticRegression    Jaccard index: 0.74
                      F1-score: 0.66
                      LogLoss: 0.57
```
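A side note on the imports above: jaccard_similarity_score was deprecated and later removed from scikit-learn, which is presumably why jaccard_score is also imported. In newer versions, the per-class equivalent would look something like this (the pos_label choice is ours):

```python
# binary Jaccard for the PAIDOFF class (newer scikit-learn API)
jaccard = jaccard_score(y_test, kNByhat, pos_label='PAIDOFF')
```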
5. Summary

You should be able to report the accuracy of the built models using different evaluation metrics:

```python
accuracy_report = pd.DataFrame(data,
                               index=['KNN', 'Decision Tree', 'SVM', 'LogisticRegression'],
                               columns=['Jaccard', 'F1-score', 'LogLoss'])
```

```python
pd.set_option('precision', 2)
accuracy_report
```
|   | Jaccard | F1-score | LogLoss |
|---|---|---|---|
| KNN | 0.67 | 0.63 | NaN |
| Decision Tree | 0.72 | 0.74 | NaN |
| SVM | 0.80 | 0.76 | NaN |
| LogisticRegression | 0.74 | 0.66 | 0.57 |
Want to learn more?
IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems, and by your enterprise as a whole. A free trial is available through this course: SPSS Modeler
Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is IBM’s leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to collaborate on their projects without having to install anything. Join the fast-growing community of Watson Studio users today with a free account at Watson Studio
Thanks for completing this lesson!
Saeed Aghabozorgi, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large datasets.
Copyright © 2018 Cognitive Class . This notebook and its source code are released under the terms of the MIT License .