WEBVTT
1
00:00:03.120 --> 00:00:04.230
In this guide,
2
00:00:04.230 --> 00:00:07.680
we're gonna be talking about our next classification method,
3
00:00:07.680 --> 00:00:12.003
called k-nearest neighbor, or k-NN for short.
4
00:00:13.400 --> 00:00:17.380
Like the decision tree, k-NN can either be a regression
5
00:00:17.380 --> 00:00:20.120
or a classification algorithm.
6
00:00:20.120 --> 00:00:22.490
But as a classification algorithm,
7
00:00:22.490 --> 00:00:26.740
k-nearest neighbor works by assuming similar objects exist
8
00:00:26.740 --> 00:00:28.690
in close proximity.
9
00:00:28.690 --> 00:00:30.680
Or if we were to visualize it,
10
00:00:30.680 --> 00:00:33.640
similar things are near each other.
11
00:00:33.640 --> 00:00:37.230
This also means that because k-NN relies on distance
12
00:00:37.230 --> 00:00:39.740
to place objects for classification,
13
00:00:39.740 --> 00:00:42.910
normalizing the training data is very important
14
00:00:42.910 --> 00:00:45.703
and can improve the overall accuracy dramatically.
15
00:00:47.160 --> 00:00:49.594
Like all the other algorithms that we've worked with,
16
00:00:49.594 --> 00:00:52.730
k-NN is a form of supervised learning,
17
00:00:52.730 --> 00:00:55.380
so it requires a labeled training
18
00:00:55.380 --> 00:00:58.360
and testing set to learn how new observations
19
00:00:58.360 --> 00:00:59.753
should be classified.
20
00:01:00.750 --> 00:01:02.290
Once the training is complete,
21
00:01:02.290 --> 00:01:05.294
k-NN stores what it's learned into memory
22
00:01:05.294 --> 00:01:07.850
and then extracts that information
23
00:01:07.850 --> 00:01:09.763
to make predictions when needed.
24
00:01:10.990 --> 00:01:12.470
Because k-nearest neighbor
25
00:01:12.470 --> 00:01:15.320
is making classification decisions based strictly
26
00:01:15.320 --> 00:01:16.860
on feature similarities,
27
00:01:16.860 --> 00:01:20.070
it doesn't really care about how the data was generated.
28
00:01:20.070 --> 00:01:24.033
Thus we consider k-NN to be a discriminative model.
29
00:01:25.680 --> 00:01:26.993
Along with all of that,
30
00:01:26.993 --> 00:01:31.840
k-nearest neighbor is also what we call a lazy learner,
31
00:01:31.840 --> 00:01:35.160
meaning it only makes localized approximations
32
00:01:35.160 --> 00:01:37.390
and defers all of the actual computation
33
00:01:37.390 --> 00:01:39.563
until the function is evaluated.
34
00:01:41.750 --> 00:01:44.870
Overall, k-nearest neighbor is a simple
35
00:01:44.870 --> 00:01:47.630
and intuitive algorithm that is often used
36
00:01:47.630 --> 00:01:50.380
for tasks like recommendation systems,
37
00:01:50.380 --> 00:01:53.453
stock prediction and risk analysis.
38
00:01:54.840 --> 00:01:58.560
Machine learning developers also like using k-NN
39
00:01:58.560 --> 00:02:02.290
because it's fully capable of handling both binary
40
00:02:02.290 --> 00:02:04.393
and multi-class problems.
41
00:02:05.290 --> 00:02:10.290
Not only that, because k-NN uses a memory-based approach,
42
00:02:10.410 --> 00:02:14.168
it can immediately adapt when new training data comes in,
43
00:02:14.168 --> 00:02:16.690
allowing the algorithm to respond quickly
44
00:02:16.690 --> 00:02:20.143
to changes in the input during real-time use.
45
00:02:21.840 --> 00:02:26.280
Unfortunately, k-NN also has a couple glaring shortcomings
46
00:02:26.280 --> 00:02:27.313
to be aware of.
47
00:02:28.580 --> 00:02:32.070
The first of which is that as the dataset grows,
48
00:02:32.070 --> 00:02:36.453
the efficiency and speed of the algorithm decline rapidly.
49
00:02:37.300 --> 00:02:40.180
A k-nearest neighbor model also will struggle
50
00:02:40.180 --> 00:02:42.370
when the data is unbalanced.
51
00:02:42.370 --> 00:02:45.510
So if we were to use the spam and ham dataset
52
00:02:45.510 --> 00:02:47.410
from the last example,
53
00:02:47.410 --> 00:02:49.560
the k-NN model would struggle
54
00:02:49.560 --> 00:02:52.783
because of the overrepresentation of ham entries.
55
00:02:53.920 --> 00:02:57.570
Now, the last and maybe the biggest limitation
56
00:02:57.570 --> 00:03:01.510
is something that we call the curse of dimensionality,
57
00:03:01.510 --> 00:03:05.780
meaning k-NN works well with a small number
58
00:03:05.780 --> 00:03:10.300
of input variables but as the number of variables grows,
59
00:03:10.300 --> 00:03:13.163
k-NN really begins to struggle.
60
00:03:15.810 --> 00:03:18.590
Knowing all of that, let's go ahead
61
00:03:18.590 --> 00:03:21.900
and look at how we can implement k-nearest neighbor
62
00:03:21.900 --> 00:03:23.563
to help make predictions.
63
00:03:25.230 --> 00:03:27.380
For this example, I thought it'd be fun
64
00:03:27.380 --> 00:03:30.240
to go through one of the very first research projects
65
00:03:30.240 --> 00:03:32.220
that I worked on when I was learning
66
00:03:32.220 --> 00:03:34.730
how to do oncology modeling.
67
00:03:34.730 --> 00:03:37.370
And the project was to build a model
68
00:03:37.370 --> 00:03:41.003
that could help determine if a tumor is benign or malignant.
69
00:03:42.030 --> 00:03:45.760
Mind you, we didn't actually create the original model,
70
00:03:45.760 --> 00:03:47.802
we just used the existing dataset
71
00:03:47.802 --> 00:03:50.270
as a learning tool and introduction
72
00:03:50.270 --> 00:03:51.893
into oncology modeling.
73
00:03:54.340 --> 00:03:56.050
But to get into it,
74
00:03:56.050 --> 00:03:58.190
in our dataset, we have a matrix
75
00:03:58.190 --> 00:04:00.250
containing nine different features,
76
00:04:00.250 --> 00:04:02.300
all of which are used to predict
77
00:04:02.300 --> 00:04:05.223
what class a tumor most likely belongs to.
78
00:04:10.360 --> 00:04:12.830
And looking at just the class,
79
00:04:12.830 --> 00:04:17.830
there are 444 benign samples and 239 malignant samples.
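A rough sketch of loading and inspecting the dataset might look like this (the file name and column layout are assumptions, not the guide's exact code):

```python
import pandas as pd

# Hypothetical file name; the real dataset may be loaded differently.
df = pd.read_csv('tumor_data.csv')

print(df.shape)                       # nine feature columns plus the class column
print(df.iloc[:, -1].value_counts())  # samples per class (benign vs. malignant)
```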
80
00:04:20.820 --> 00:04:22.650
Similar to the last guide,
81
00:04:22.650 --> 00:04:24.954
we're gonna need to import pandas,
82
00:04:24.954 --> 00:04:27.423
the train_test_split function,
83
00:04:28.460 --> 00:04:30.480
and a few different metric methods
84
00:04:30.480 --> 00:04:33.163
that we're gonna be using to evaluate the model.
85
00:04:34.810 --> 00:04:36.330
Other than that, we're gonna need
86
00:04:36.330 --> 00:04:38.630
to import the StandardScaler,
87
00:04:38.630 --> 00:04:41.880
which we talked about back in the feature scaling guide,
88
00:04:41.880 --> 00:04:44.433
as well as the KNeighborsClassifier.
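Roughly, those imports might look like this (the module paths are standard scikit-learn; the exact selection is an assumption based on what's described above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
```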
89
00:04:46.580 --> 00:04:49.290
The feature vector is made up of every column,
90
00:04:49.290 --> 00:04:50.703
except the final one.
91
00:04:51.690 --> 00:04:54.340
And that's because we're gonna be using that column
92
00:04:54.340 --> 00:04:57.023
for the target, or class, variable.
93
00:04:58.660 --> 00:04:59.750
Before we run this,
94
00:04:59.750 --> 00:05:02.520
I'm gonna also need to pass in .values
95
00:05:02.520 --> 00:05:04.033
for both of these.
96
00:05:06.600 --> 00:05:08.550
That way, they're both NumPy arrays
97
00:05:08.550 --> 00:05:10.340
before we go and try to split them
98
00:05:10.340 --> 00:05:12.353
between training and testing sets.
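A minimal sketch of that selection, assuming the class label sits in the last column:

```python
# Every column except the last one becomes the feature matrix;
# the last column becomes the target. .values converts both to NumPy arrays.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
```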
99
00:05:13.800 --> 00:05:15.910
There's not gonna be anything new
100
00:05:15.910 --> 00:05:18.750
with the training and testing split.
101
00:05:18.750 --> 00:05:20.800
So once we run through that,
102
00:05:20.800 --> 00:05:23.713
we can move forward with building the model.
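The split itself might look like this (the test size and random state are assumptions):

```python
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```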
103
00:05:24.980 --> 00:05:26.305
In the last example,
104
00:05:26.305 --> 00:05:29.240
before we could build the classification model,
105
00:05:29.240 --> 00:05:31.680
we had to tokenize the training set
106
00:05:31.680 --> 00:05:32.923
to get a word count.
107
00:05:33.940 --> 00:05:35.667
We don't have to do that here
108
00:05:35.667 --> 00:05:39.200
but instead, we're gonna need to standardize the data
109
00:05:39.200 --> 00:05:40.943
before we can do anything else.
110
00:05:42.150 --> 00:05:43.690
If you don't remember how to do that
111
00:05:43.690 --> 00:05:45.350
from the feature scaling guide,
112
00:05:45.350 --> 00:05:48.710
the first step is to build the scaler object
113
00:05:48.710 --> 00:05:51.293
and assign it to the StandardScaler.
114
00:05:58.440 --> 00:06:01.430
Then we'll do a reassignment with x_train
115
00:06:01.430 --> 00:06:03.593
and call the fit_transform method.
116
00:06:11.640 --> 00:06:14.400
And then we'll do the same thing with the test set
117
00:06:14.400 --> 00:06:16.650
but here, we don't need to do any fitting.
118
00:06:16.650 --> 00:06:19.883
So we'll only need to use the transform method.
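Put together, the scaling step might look like this:

```python
# Fit the scaler on the training data only, then apply the same
# transformation to the test data without refitting.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```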
119
00:06:25.070 --> 00:06:27.350
With the feature scaling all taken care of,
120
00:06:27.350 --> 00:06:30.890
we can go ahead and build the classification model.
121
00:06:30.890 --> 00:06:35.020
And we'll start off by creating the classifier object
122
00:06:35.020 --> 00:06:37.220
and assigning it to the KNeighborsClassifier.
123
00:06:41.460 --> 00:06:43.370
Then the last thing that we'll need to do
124
00:06:43.370 --> 00:06:46.370
is use the fit method to create the model
125
00:06:46.370 --> 00:06:48.270
and pass in both of the training sets.
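A sketch of building and fitting the model:

```python
# KNeighborsClassifier defaults to k = 5 nearest neighbors.
classifier = KNeighborsClassifier()
classifier.fit(x_train, y_train)
```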
126
00:07:01.320 --> 00:07:03.100
Before we finish up,
127
00:07:03.100 --> 00:07:05.140
we're going to evaluate the model
128
00:07:05.140 --> 00:07:07.610
by using three of the evaluation methods
129
00:07:07.610 --> 00:07:09.483
that we used in the last guide.
130
00:07:10.730 --> 00:07:13.910
But real quick, let's create a prediction object
131
00:07:13.910 --> 00:07:15.153
before we do that.
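Something like:

```python
# Predictions for the held-out test set.
y_pred = classifier.predict(x_test)
```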
132
00:07:23.380 --> 00:07:26.260
Okay, so the first evaluation method
133
00:07:26.260 --> 00:07:28.633
is going to be the confusion_matrix.
134
00:07:40.160 --> 00:07:42.553
Next is the f1_score.
135
00:07:56.147 --> 00:07:58.933
And the third is the accuracy_score.
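The three evaluation calls might look like this (if the class labels aren't coded as 0/1, f1_score would also need a pos_label argument for the malignant class):

```python
print(confusion_matrix(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```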
136
00:08:09.390 --> 00:08:11.040
Moving into the console,
137
00:08:11.040 --> 00:08:14.003
let's again go through what each of these mean.
138
00:08:15.700 --> 00:08:17.440
So the first one that we'll go through
139
00:08:17.440 --> 00:08:19.918
is the confusion_matrix
140
00:08:19.918 --> 00:08:21.740
and I should probably mention
141
00:08:21.740 --> 00:08:25.140
that it's not always a two-by-two matrix.
142
00:08:25.140 --> 00:08:27.180
It's just that both of the examples
143
00:08:27.180 --> 00:08:28.430
that we've worked with
144
00:08:28.430 --> 00:08:31.153
have been binary classification examples.
145
00:08:32.130 --> 00:08:34.880
If instead we had three class labels,
146
00:08:34.880 --> 00:08:37.923
we'd be getting a three-by-three matrix instead.
147
00:08:39.700 --> 00:08:42.010
Anyway, first and foremost,
148
00:08:42.010 --> 00:08:45.690
the matrix is always broken down by category.
149
00:08:45.690 --> 00:08:48.740
True negative, false negative,
150
00:08:48.740 --> 00:08:52.083
false positive and true positive.
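As a side note, for a binary problem those four counts can be pulled straight out of scikit-learn's matrix (rows are true labels, columns are predictions):

```python
# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
```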
151
00:08:53.370 --> 00:08:58.010
And in this case, a benign tumor constitutes a negative result
152
00:08:58.010 --> 00:09:01.570
and a malignant tumor is a positive result.
153
00:09:01.570 --> 00:09:05.240
So a true negative means the model predicted a negative
154
00:09:05.240 --> 00:09:09.040
or a benign result and was correct.
155
00:09:09.040 --> 00:09:10.960
A false negative is the outcome
156
00:09:10.960 --> 00:09:12.810
that we really want to limit
157
00:09:12.810 --> 00:09:15.390
because this is where the model predicted the tumor
158
00:09:15.390 --> 00:09:18.323
was benign but it was actually malignant.
159
00:09:19.580 --> 00:09:22.658
A false positive is another incorrect result
160
00:09:22.658 --> 00:09:25.220
but this time, it's when the model predicted
161
00:09:25.220 --> 00:09:27.260
that the tumor is malignant
162
00:09:27.260 --> 00:09:28.993
but it was actually benign.
163
00:09:29.850 --> 00:09:33.580
In this case, the misdiagnosis isn't as dangerous
164
00:09:33.580 --> 00:09:35.890
because more testing will be done,
165
00:09:35.890 --> 00:09:39.163
which will eventually uncover the misdiagnosis.
166
00:09:40.980 --> 00:09:44.510
Then the final category is true positive
167
00:09:44.510 --> 00:09:47.693
where a malignant tumor was predicted correctly.
168
00:09:49.060 --> 00:09:51.230
Getting back to our evaluation,
169
00:09:51.230 --> 00:09:53.510
it looks like our model did a pretty good job
170
00:09:53.510 --> 00:09:56.593
of limiting the number of misdiagnosed cases.
171
00:09:58.340 --> 00:10:00.430
Now for the f1_score,
172
00:10:00.430 --> 00:10:03.180
it's a little more complicated than how I described it.
173
00:10:04.530 --> 00:10:07.720
But it's essentially the accuracy of the model
174
00:10:07.720 --> 00:10:10.680
and it takes into account all of the correct predictions,
175
00:10:10.680 --> 00:10:14.303
as well as the false positives and false negatives.
176
00:10:15.450 --> 00:10:19.810
Whereas the accuracy_score only calculates the percentage
177
00:10:19.810 --> 00:10:21.493
of correct predictions.
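To make that distinction concrete, here is how the two scores relate to the confusion-matrix counts (illustrative arithmetic, not part of the guide's code):

```python
precision = tp / (tp + fp)                          # predicted positives that were correct
recall = tp / (tp + fn)                             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # plain share of correct predictions
```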
178
00:10:23.200 --> 00:10:25.070
Before we finish the guide up,
179
00:10:25.070 --> 00:10:28.180
there's one last thing that I'd like to go over.
180
00:10:28.180 --> 00:10:32.350
And that's how to change the k-value of the model.
181
00:10:32.350 --> 00:10:34.870
We're gonna talk about the k-value a lot more
182
00:10:34.870 --> 00:10:36.333
in the next couple guides
183
00:10:36.333 --> 00:10:38.770
but for now, you just need to know
184
00:10:38.770 --> 00:10:40.960
that k is a parameter referring
185
00:10:40.960 --> 00:10:43.230
to the number of nearest neighbors
186
00:10:43.230 --> 00:10:44.853
that vote on the outcome.
187
00:10:46.770 --> 00:10:49.630
The default value for k is five
188
00:10:49.630 --> 00:10:51.630
but we can easily change that
189
00:10:51.630 --> 00:10:55.647
by passing in n_neighbors,
190
00:10:55.647 --> 00:10:57.933
followed by the new number.
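For example, a model that lets three neighbors vote might be built like this (the value 3 is just an illustration):

```python
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
```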
191
00:11:00.140 --> 00:11:03.420
Again this is gonna be a really important topic,
192
00:11:03.420 --> 00:11:05.840
so we'll be spending a lot more time on this
193
00:11:05.840 --> 00:11:07.333
in the next few guides.
194
00:11:08.870 --> 00:11:11.360
But for now, I will wrap it up
195
00:11:11.360 --> 00:11:13.010
and I'll see you in the next one.