WEBVTT
1
00:00:03.120 --> 00:00:04.230
In this guide,
2
00:00:04.230 --> 00:00:07.680
we're gonna be talking about our next classification method,
3
00:00:07.680 --> 00:00:12.003
called k-nearest neighbor, or k-NN for short.
4
00:00:13.400 --> 00:00:17.380
Like the decision tree, k-NN can either be a regression
5
00:00:17.380 --> 00:00:20.120
or a classification algorithm.
6
00:00:20.120 --> 00:00:22.490
But as a classification algorithm,
7
00:00:22.490 --> 00:00:26.740
k-nearest neighbor works by assuming similar objects exist
8
00:00:26.740 --> 00:00:28.690
in close proximity.
9
00:00:28.690 --> 00:00:30.680
Or if we were to visualize it,
10
00:00:30.680 --> 00:00:33.640
similar things are near each other.
11
00:00:33.640 --> 00:00:37.230
This also means that because k-NN relies on distance
12
00:00:37.230 --> 00:00:39.740
to place objects for classification,
13
00:00:39.740 --> 00:00:42.910
normalizing the training data is very important
14
00:00:42.910 --> 00:00:45.703
and can improve the overall accuracy dramatically.
15
00:00:47.160 --> 00:00:49.594
Like all the other algorithms that we've worked with,
16
00:00:49.594 --> 00:00:52.730
k-NN is a form of supervised learning,
17
00:00:52.730 --> 00:00:55.380
so it requires a labeled training
18
00:00:55.380 --> 00:00:58.360
and testing set to learn how new observations
19
00:00:58.360 --> 00:00:59.753
should be classified.
20
00:01:00.750 --> 00:01:02.290
Once the training is complete,
21
00:01:02.290 --> 00:01:05.294
k-NN stores what it's learned into memory
22
00:01:05.294 --> 00:01:07.850
and then extracts that information
23
00:01:07.850 --> 00:01:09.763
to make predictions when needed.
24
00:01:10.990 --> 00:01:12.470
Because k-nearest neighbor
25
00:01:12.470 --> 00:01:15.320
is making classification decisions based strictly
26
00:01:15.320 --> 00:01:16.860
on feature similarities,
27
00:01:16.860 --> 00:01:20.070
it doesn't really care about how the data was generated.
28
00:01:20.070 --> 00:01:24.033
Thus we consider k-NN to be a discriminative model.
29
00:01:25.680 --> 00:01:26.993
Along with all of that,
30
00:01:26.993 --> 00:01:31.840
k-nearest neighbor is also what we call a lazy learner,
31
00:01:31.840 --> 00:01:35.160
meaning it only makes localized approximations
32
00:01:35.160 --> 00:01:37.390
and defers all of the actual computation
33
00:01:37.390 --> 00:01:39.563
until the function is evaluated.
34
00:01:41.750 --> 00:01:44.870
Overall, k-nearest neighbor is a simple
35
00:01:44.870 --> 00:01:47.630
and intuitive algorithm that is often used
36
00:01:47.630 --> 00:01:50.380
for tasks like recommendation systems,
37
00:01:50.380 --> 00:01:53.453
stock prediction and risk analysis.
38
00:01:54.840 --> 00:01:58.560
Machine learning developers also like using k-NN
39
00:01:58.560 --> 00:02:02.290
because it's fully capable of handling both binary
40
00:02:02.290 --> 00:02:04.393
and multi-class problems.
41
00:02:05.290 --> 00:02:10.290
Not only that, because k-NN uses a memory-based approach,
42
00:02:10.410 --> 00:02:14.168
it can immediately adapt when new training data comes in,
43
00:02:14.168 --> 00:02:16.690
allowing the algorithm to respond quickly
44
00:02:16.690 --> 00:02:20.143
to changes in the input during real-time use.
45
00:02:21.840 --> 00:02:26.280
Unfortunately, k-NN also has a couple glaring shortcomings
46
00:02:26.280 --> 00:02:27.313
to be aware of.
47
00:02:28.580 --> 00:02:32.070
The first of which is that as the dataset grows,
48
00:02:32.070 --> 00:02:36.453
the efficiency and speed of the algorithm decline rapidly.
49
00:02:37.300 --> 00:02:40.180
A k-nearest neighbor model also will struggle
50
00:02:40.180 --> 00:02:42.370
when the data is unbalanced.
51
00:02:42.370 --> 00:02:45.510
So if we were to use the spam and ham dataset
52
00:02:45.510 --> 00:02:47.410
from the last example,
53
00:02:47.410 --> 00:02:49.560
the k-NN model would struggle
54
00:02:49.560 --> 00:02:52.783
because of the overrepresentation of ham entries.
55
00:02:53.920 --> 00:02:57.570
Now, the last and maybe the biggest limitation
56
00:02:57.570 --> 00:03:01.510
is something that we call the curse of dimensionality,
57
00:03:01.510 --> 00:03:05.780
meaning k-NN works well with a small number
58
00:03:05.780 --> 00:03:10.300
of input variables but as the number of variables grows,
59
00:03:10.300 --> 00:03:13.163
k-NN really begins to struggle.
60
00:03:15.810 --> 00:03:18.590
Knowing all of that, let's go ahead
61
00:03:18.590 --> 00:03:21.900
and look at how we can implement k-nearest neighbor
62
00:03:21.900 --> 00:03:23.563
to help make predictions.
63
00:03:25.230 --> 00:03:27.380
For this example, I thought it'd be fun
64
00:03:27.380 --> 00:03:30.240
to go through one of the very first research projects
65
00:03:30.240 --> 00:03:32.220
that I worked on when I was learning
66
00:03:32.220 --> 00:03:34.730
how to do oncology modeling.
67
00:03:34.730 --> 00:03:37.370
And the project was to build a model
68
00:03:37.370 --> 00:03:41.003
that could help determine if a tumor is benign or malignant.
69
00:03:42.030 --> 00:03:45.760
Mind you, we didn't actually create the original model,
70
00:03:45.760 --> 00:03:47.802
we just used the existing dataset
71
00:03:47.802 --> 00:03:50.270
as a learning tool and introduction
72
00:03:50.270 --> 00:03:51.893
into oncology modeling.
73
00:03:54.340 --> 00:03:56.050
But to get into it,
74
00:03:56.050 --> 00:03:58.190
in our dataset, we have a matrix
75
00:03:58.190 --> 00:04:00.250
containing nine different features,
76
00:04:00.250 --> 00:04:02.300
all of which are used to predict
77
00:04:02.300 --> 00:04:05.223
what class a tumor most likely belongs to.
78
00:04:10.360 --> 00:04:12.830
And looking at just the class,
79
00:04:12.830 --> 00:04:17.830
there are 444 benign samples and 239 malignant samples.
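A rough sketch of loading and inspecting the dataset might look like this (the file name and column layout are assumptions, not the guide's exact code):

```python
import pandas as pd

# Hypothetical file name; the real dataset may be loaded differently.
df = pd.read_csv('tumor_data.csv')

print(df.shape)                       # nine feature columns plus the class column
print(df.iloc[:, -1].value_counts())  # samples per class (benign vs. malignant)
```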
80
00:04:20.820 --> 00:04:22.650
Similar to the last guide,
81
00:04:22.650 --> 00:04:24.954
we're gonna need to import pandas,
82
00:04:24.954 --> 00:04:27.423
the train_test_split function,
83
00:04:28.460 --> 00:04:30.480
and a few different metric methods
84
00:04:30.480 --> 00:04:33.163
that we're gonna be using to evaluate the model.
85
00:04:34.810 --> 00:04:36.330
Other than that, we're gonna need
86
00:04:36.330 --> 00:04:38.630
to import the StandardScaler,
87
00:04:38.630 --> 00:04:41.880
which we talked about back in the feature scaling guide,
88
00:04:41.880 --> 00:04:44.433
as well as the KNeighborsClassifier.
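Roughly, those imports might look like this (the module paths are standard scikit-learn; the exact selection is an assumption based on what's described above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
```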
89
00:04:46.580 --> 00:04:49.290
The feature vector is made up of every column,
90
00:04:49.290 --> 00:04:50.703
except the final one.
91
00:04:51.690 --> 00:04:54.340
And that's because we're gonna be using that column
92
00:04:54.340 --> 00:04:57.023
for the target, or class, variable.
93
00:04:58.660 --> 00:04:59.750
Before we run this,
94
00:04:59.750 --> 00:05:02.520
I'm gonna also need to pass in .values
95
00:05:02.520 --> 00:05:04.033
for both of these.
96
00:05:06.600 --> 00:05:08.550
That way, they're both NumPy arrays
97
00:05:08.550 --> 00:05:10.340
before we go and try to split them
98
00:05:10.340 --> 00:05:12.353
between training and testing sets.
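A minimal sketch of that selection, assuming the class label sits in the last column:

```python
# Every column except the last one becomes the feature matrix;
# the last column becomes the target. .values converts both to NumPy arrays.
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
```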
99
00:05:13.800 --> 00:05:15.910
There's not gonna be anything new
100
00:05:15.910 --> 00:05:18.750
with the training and testing split.
101
00:05:18.750 --> 00:05:20.800
So once we run through that,
102
00:05:20.800 --> 00:05:23.713
we can move forward with building the model.
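The split itself might look like this (the test size and random state are assumptions):

```python
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```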
103
00:05:24.980 --> 00:05:26.305
In the last example,
104
00:05:26.305 --> 00:05:29.240
before we could build the classification model,
105
00:05:29.240 --> 00:05:31.680
we had to tokenize the training set
106
00:05:31.680 --> 00:05:32.923
to get a word count.
107
00:05:33.940 --> 00:05:35.667
We don't have to do that here
108
00:05:35.667 --> 00:05:39.200
but instead, we're gonna need to standardize the data
109
00:05:39.200 --> 00:05:40.943
before we can do anything else.
110
00:05:42.150 --> 00:05:43.690
If you don't remember how to do that
111
00:05:43.690 --> 00:05:45.350
from the feature scaling guide,
112
00:05:45.350 --> 00:05:48.710
the first step is to build the scaler object
113
00:05:48.710 --> 00:05:51.293
and assign it to the StandardScaler.
114
00:05:58.440 --> 00:06:01.430
Then we'll do a reassignment with x_train
115
00:06:01.430 --> 00:06:03.593
and call the fit_transform method.
116
00:06:11.640 --> 00:06:14.400
And then we'll do the same thing with the test set
117
00:06:14.400 --> 00:06:16.650
but here, we don't need to do any fitting.
118
00:06:16.650 --> 00:06:19.883
So we'll only need to use the transform method.
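Put together, the scaling step might look like this:

```python
# Fit the scaler on the training data only, then apply the same
# transformation to the test data without refitting.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
```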
119
00:06:25.070 --> 00:06:27.350
With the feature scaling all taken care of,
120
00:06:27.350 --> 00:06:30.890
we can go ahead and build the classification model.
121
00:06:30.890 --> 00:06:35.020
And we'll start off by creating the classifier object
122
00:06:35.020 --> 00:06:37.220
and assigning it to the KNeighborsClassifier.
123
00:06:41.460 --> 00:06:43.370
Then the last thing that we'll need to do
124
00:06:43.370 --> 00:06:46.370
is use the fit method to create the model
125
00:06:46.370 --> 00:06:48.270
and pass in both of the training sets.
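A sketch of building and fitting the model:

```python
# KNeighborsClassifier defaults to k = 5 nearest neighbors.
classifier = KNeighborsClassifier()
classifier.fit(x_train, y_train)
```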
126
00:07:01.320 --> 00:07:03.100
Before we finish up,
127
00:07:03.100 --> 00:07:05.140
we're going to evaluate the model
128
00:07:05.140 --> 00:07:07.610
by using three of the evaluation methods
129
00:07:07.610 --> 00:07:09.483
that we used in the last guide.
130
00:07:10.730 --> 00:07:13.910
But real quick, let's create a prediction object
131
00:07:13.910 --> 00:07:15.153
before we do that.
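Something like:

```python
# Predictions for the held-out test set.
y_pred = classifier.predict(x_test)
```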
132
00:07:23.380 --> 00:07:26.260
Okay, so the first evaluation method
133
00:07:26.260 --> 00:07:28.633
is going to be the confusion_matrix.
134
00:07:40.160 --> 00:07:42.553
Next is the f1_score.
135
00:07:56.147 --> 00:07:58.933
And the third is the accuracy_score.
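The three evaluation calls might look like this (if the class labels aren't coded as 0/1, f1_score would also need a pos_label argument for the malignant class):

```python
print(confusion_matrix(y_test, y_pred))
print(f1_score(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
```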
136
00:08:09.390 --> 00:08:11.040
Moving into the console,
137
00:08:11.040 --> 00:08:14.003
let's again go through what each of these mean.
138
00:08:15.700 --> 00:08:17.440
So the first one that we'll go through
139
00:08:17.440 --> 00:08:19.918
is the confusion_matrix
140
00:08:19.918 --> 00:08:21.740
and I should probably mention
141
00:08:21.740 --> 00:08:25.140
that it's not always a two-by-two matrix.
142
00:08:25.140 --> 00:08:27.180
It's just that both of the examples
143
00:08:27.180 --> 00:08:28.430
that we've worked with
144
00:08:28.430 --> 00:08:31.153
have been binary classification examples.
145
00:08:32.130 --> 00:08:34.880
If instead we had three class labels,
146
00:08:34.880 --> 00:08:37.923
we'd be getting a three-by-three matrix instead.
147
00:08:39.700 --> 00:08:42.010
Anyway, first and foremost,
148
00:08:42.010 --> 00:08:45.690
the matrix is always broken down by category.
149
00:08:45.690 --> 00:08:48.740
True negative, false negative,
150
00:08:48.740 --> 00:08:52.083
false positive and true positive.
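As a side note, for a binary problem those four counts can be pulled straight out of scikit-learn's matrix (rows are true labels, columns are predictions):

```python
# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
```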
151
00:08:53.370 --> 00:08:58.010
And in this case, a benign tumor constitutes a negative result
152
00:08:58.010 --> 00:09:01.570
and a malignant tumor is a positive result.
153
00:09:01.570 --> 00:09:05.240
So a true negative means the model predicted a negative
154
00:09:05.240 --> 00:09:09.040
or a benign result and was correct.
155
00:09:09.040 --> 00:09:10.960
A false negative is the outcome
156
00:09:10.960 --> 00:09:12.810
that we really want to limit
157
00:09:12.810 --> 00:09:15.390
because this is where the model predicted the tumor
158
00:09:15.390 --> 00:09:18.323
was benign but it was actually malignant.
159
00:09:19.580 --> 00:09:22.658
A false positive is another incorrect result
160
00:09:22.658 --> 00:09:25.220
but this time, it's when the model predicted
161
00:09:25.220 --> 00:09:27.260
that the tumor is malignant
162
00:09:27.260 --> 00:09:28.993
but it was actually benign.
163
00:09:29.850 --> 00:09:33.580
In this case, the misdiagnosis isn't as dangerous
164
00:09:33.580 --> 00:09:35.890
because more testing will be done,
165
00:09:35.890 --> 00:09:39.163
which will eventually uncover the misdiagnosis.
166
00:09:40.980 --> 00:09:44.510
Then the final category is true positive
167
00:09:44.510 --> 00:09:47.693
where a malignant tumor was predicted correctly.
168
00:09:49.060 --> 00:09:51.230
Getting back to our evaluation,
169
00:09:51.230 --> 00:09:53.510
it looks like our model did a pretty good job
170
00:09:53.510 --> 00:09:56.593
of limiting the number of misdiagnosed cases.
171
00:09:58.340 --> 00:10:00.430
Now for the f1_score,
172
00:10:00.430 --> 00:10:03.180
it's a little more complicated than how I described it.
173
00:10:04.530 --> 00:10:07.720
But it's essentially the accuracy of the model
174
00:10:07.720 --> 00:10:10.680
and it takes into account all of the correct predictions,
175
00:10:10.680 --> 00:10:14.303
as well as the false positives and false negatives.
176
00:10:15.450 --> 00:10:19.810
Whereas the accuracy_score only calculates the percentage
177
00:10:19.810 --> 00:10:21.493
of correct predictions.
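To make that distinction concrete, here is how the two scores relate to the confusion-matrix counts (illustrative arithmetic, not part of the guide's code):

```python
precision = tp / (tp + fp)                          # predicted positives that were correct
recall = tp / (tp + fn)                             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)          # plain share of correct predictions
```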
178
00:10:23.200 --> 00:10:25.070
Before we finish the guide up,
179
00:10:25.070 --> 00:10:28.180
there's one last thing that I'd like to go over.
180
00:10:28.180 --> 00:10:32.350
And that's how to change the k-value of the model.
181
00:10:32.350 --> 00:10:34.870
We're gonna talk about the k-value a lot more
182
00:10:34.870 --> 00:10:36.333
in the next couple guides
183
00:10:36.333 --> 00:10:38.770
but for now, you just need to know
184
00:10:38.770 --> 00:10:40.960
that k is a parameter referring
185
00:10:40.960 --> 00:10:43.230
to the number of nearest neighbors
186
00:10:43.230 --> 00:10:44.853
that vote on the outcome.
187
00:10:46.770 --> 00:10:49.630
The default value for k is five
188
00:10:49.630 --> 00:10:51.630
but we can easily change that
189
00:10:51.630 --> 00:10:55.647
by passing in n_neighbors,
190
00:10:55.647 --> 00:10:57.933
followed by the new number.
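For example, a model that lets three neighbors vote might be built like this (the value 3 is just an illustration):

```python
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(x_train, y_train)
```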
191
00:11:00.140 --> 00:11:03.420
Again this is gonna be a really important topic,
192
00:11:03.420 --> 00:11:05.840
so we'll be spending a lot more time on this
193
00:11:05.840 --> 00:11:07.333
in the next few guides.
194
00:11:08.870 --> 00:11:11.360
But for now, I will wrap it up
195
00:11:11.360 --> 00:11:13.010
and I'll see you in the next one.