In this guide, we're going to continue our discussion on k-nearest neighbors, but this time we're going to focus more on what k means and how to create a k-nearest neighbors visualization.
Hopefully you remember from the last guide that I mentioned k-nearest neighbors works by assuming similar objects exist in close proximity. So when a new data point comes in, it's classified into the group that it's closest to.
Now, that's all fine and dandy when an input obviously belongs to a certain class, but how does the k-nearest neighbors algorithm determine the class when the choice isn't so obvious? Well, that's precisely when the number of neighbors comes into play.
Let's say we're working with this smaller dataset and we have a new observation that's pretty close to the middle of both classes. My question to you is: what class should the data point belong to? The answer isn't so obvious, because it all depends on what the k value is.
If the k value is one, k-nearest neighbors will calculate the distance from the new observation to the nearest point of each class, and whichever distance is the closest will indicate what class the observation should be assigned to, which in this case is Class B.
But if the k value is set to three, the three nearest neighbors will be used to decide the class. So KNN will find the three closest points for each class, and whichever class has the smallest total distance will be the assigned class. And in this case, it looks like it'll be Class A.
Ultimately, the class the new observation is assigned to will be determined by what k value is chosen. And unfortunately for us, there's no hard and fast rule to help us choose what k value to use. Like with the other algorithms, you just need to make sure you avoid underfitting and overfitting.
Now that you have a better overall understanding of what the k value means, let's take a look at how to create a visualization.
The libraries that we're going to start with are NumPy, pyplot, train_test_split, KNeighborsClassifier, and the accuracy_score function from sklearn.metrics.
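As a minimal sketch, those imports look like this (assuming the usual np and plt aliases):

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```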
And for this example, I thought it would be a good idea to use one of scikit-learn's practice datasets. Just like any other import, we'll say from sklearn.datasets import make_moons.
And then to be able to use the dataset, we'll need to assign it to an object, so we'll say x, y equals make_moons. Then inside the parens, let's pass in n_samples and set that to 150. We'll make the noise equal to .35, which will give us a little bit of variance in the data. And then we'll also pass in random_state equals zero.
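Put together, the dataset line looks something like this:

```python
from sklearn.datasets import make_moons

# 150 samples in two interleaving half-moon classes, with some noise added
x, y = make_moons(n_samples=150, noise=0.35, random_state=0)
```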
And since none of this is going to be new, I already went through and set up the training and testing split, as well as the classifier and accuracy_score.
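That setup isn't shown line by line here, but it looks roughly like this; the n_neighbors value is just a starting point for the k value we'll be changing later:

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

classifier = KNeighborsClassifier(n_neighbors=1)  # k value; placeholder choice
classifier.fit(x_train, y_train)

predictions = classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
```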
Then I'm going to run what we have so far, just to make sure that everything works. And we're good there, so let's go through what we need to graph this.
We're going to start out like normal and pass in plt.scatter. Then I'll open this up so you can see the x_train object array. What we're doing is plotting the first feature column along the x-axis, and the second feature column will be along the y-axis.
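In code, that's just the two columns of x_train:

```python
# First feature column on the x-axis, second feature column on the y-axis
plt.scatter(x_train[:, 0], x_train[:, 1])
plt.show()
```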
When we run it again, we have all of the training observations in place, but the obvious problem is that we don't know what class any of these points belongs to. To fix that, we'll just have to assign the color parameter to the y_train set.
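In Matplotlib's scatter, that's the c argument:

```python
# Color each training point by its class label
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
```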
And now when we run it, we have two distinct classes. Then if you want to change the colors from the default purple and yellow, there are a couple of ways of doing that as well. If you want to choose your own specific colors, you can do that by importing ListedColormap from Matplotlib, then creating your own colormap object with a list of the colors that you want to use.
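A sketch of that approach; the two color names here are just examples:

```python
from matplotlib.colors import ListedColormap

# One color per class, in label order
custom_cmap = ListedColormap(['tomato', 'steelblue'])
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap=custom_cmap)
```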
The other option, which is a little bit easier, is to use one of the predefined colormaps that Matplotlib already offers. And if you look through the documentation, there are a bunch of different options to choose from.
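For example, any named Matplotlib colormap can be passed straight to cmap ('coolwarm' here is just one option):

```python
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='coolwarm')
```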
Now if we go and change the k value, we can see the accuracy of the model change, but there's no apparent change to the actual graph. So to help with that, we can create a visual representation of the decision boundaries. And bear with me as we go through this, because it's a little complicated.
The first thing that we need to do is create something called a numpy meshgrid, and that gives us the ability to create a rectangular grid that represents either Cartesian or matrix indexing. So if we start with two separate one-dimensional arrays, the meshgrid function returns two two-dimensional arrays representing the x and y coordinates of all of the points. And notice how the x_1 array now has three rows, and that's because the original y array had a length of three. And the y_1 array has four columns because the original x array had a length of four.
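Here's a small example of that behavior; the actual values are just placeholders for whatever is on screen:

```python
x = np.array([0, 1, 2, 3])  # length 4
y = np.array([0, 1, 2])     # length 3

x_1, y_1 = np.meshgrid(x, y)
print(x_1.shape)  # (3, 4) -> three rows, because y had length 3
print(y_1.shape)  # (3, 4) -> four columns, because x had length 4
```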
Now, getting back to our code, the first step to setting up the meshgrid is defining how big it needs to be. We'll make the x_min and x_max parameters first. And remember, the data we're graphing along the x-axis comes from the first feature column, so we'll pass that in. Then we can apply NumPy's min method to return the minimum along the x-axis, and since we want to extend the boundaries a little bit beyond the minimum value, we'll also subtract .5. Then we're going to do the same thing, but this time with the max method, followed by adding .5.
Next, we'll just copy and paste all of this, then replace every x with a y, and every zero with a one.
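So the boundary lines end up looking something like this (whether you compute them from the full dataset or just the training split doesn't change the idea):

```python
# Pad the grid by 0.5 beyond the data on every side
x_min, x_max = x[:, 0].min() - 0.5, x[:, 0].max() + 0.5
y_min, y_max = x[:, 1].min() - 0.5, x[:, 1].max() + 0.5
```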
Once we have the parameters all set up, we can go ahead and create the actual meshgrid. So let's start with xx and yy and assign that to the meshgrid function. Then inside the parentheses, the documentation calls for us to pass in both one-dimensional arrays that represent the coordinates of the meshgrid.
Applying that to our code, we'll start by passing in the arange function, followed by a start and stop parameter. Then we'll pass in a step parameter of .1. Now we'll go and do the same thing for the y-axis.
Now that's done, so we can go ahead and run the code again.
161
00:10:42.300 --> 00:10:45.490
And looking at the xx object array,
162
00:10:45.490 --> 00:10:48.090
you can see that every feature column value
163
00:10:48.090 --> 00:10:50.953
is repeated down all 40 rows.
164
00:10:53.930 --> 00:10:56.330
Okay, so now we can move forward
165
00:10:56.330 --> 00:10:58.740
with plotting the decision boundary.
166
00:10:58.740 --> 00:11:03.330
And to do that, we're gonna use contourf from Matplotlib.
Looking at this documentation, we're going to need an array object for x, y, and z. And according to this, x and y need to be two-dimensional and have the same shape as z. Well, we know that we're going to be using xx for x and yy for y, because that's what's going to give us the coordinate system on the x-y plane.
So that just means we're going to need to create an object array for z, and that's going to indicate what class an observation belongs to based on its x-y coordinate. So if you think about what z needs to do, it's essentially going to be predicting the class of an observation based on its location on the meshgrid.
And that's something we already know how to do. We just need to use the classifier's predict function. Normally we'd do this by passing in the x_test set, but here we kind of run into an issue, because we need a way of representing both xx and yy as a single object.
So to fix that, we're going to use another function from NumPy called c, or c_ I guess.
I'll try to make this explanation a little quicker. So if we have array a containing the values one, two, and three, and array b, which contains four, five, and six, we can use c_, and essentially what it does is transpose both of the arrays and then concatenate the two, giving us a three-by-two matrix.
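A quick look at that in code:

```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.c_[a, b])
# [[1 4]
#  [2 5]
#  [3 6]]  -> a three-by-two matrix
```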
So when we go ahead and apply that to our code, we'll pass in np.c_, followed by square brackets and then xx and yy. And when we run this, we get an error stating that the dimensions have to match the training data dimensions. This is actually a pretty easy fix when we use the ravel function.
And after we run it again, z ends up as a one-dimensional array. But the documentation said z needs to have the same shape as xx and yy. To fix that issue, we'll do a reassignment and reshape z to match the shape of xx. We'll run it again, and this time z, xx, and yy all have the same shape.
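Put together, those two steps look roughly like this:

```python
# Flatten the grid, predict a class for every grid point, then reshape back to the grid
z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
```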
To add a couple of final touches, we can change the color by using the colormap again. And to make it so we can actually see the data points, we can pass in alpha, and let's go with something like .3. And that's okay, but it seems a little too light for me, so let's go back and try .5.
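So the finished plotting code looks something like this, sticking with the example colormap from earlier:

```python
# Shade the decision regions, then draw the training points on top
plt.contourf(xx, yy, z, cmap='coolwarm', alpha=0.5)
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='coolwarm')
plt.show()
```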
Fantastic. The graphing part is all done. Now let's go through and try a couple of different k values to mess with the decision boundary.
00:15:08.130 --> 00:15:09.520
And I'm kind of torn
231
00:15:09.520 --> 00:15:13.410
between k equals five and k equals seven.
232
00:15:13.410 --> 00:15:14.810
They both look pretty good
233
00:15:14.810 --> 00:15:17.753
and have a fairly good score_accuracy for both.
234
00:15:18.700 --> 00:15:20.410
So you could really choose either one
235
00:15:20.410 --> 00:15:22.260
until you get a little bit more data.
236
00:15:24.420 --> 00:15:27.210
Okay, so to wrap this guide up,
237
00:15:27.210 --> 00:15:29.133
I'm gonna go through two more things.
238
00:15:30.470 --> 00:15:32.760
The first might be a little annoying
239
00:15:32.760 --> 00:15:36.010
because it would have saved us a bunch of time
240
00:15:36.010 --> 00:15:37.670
but I thought it was important
241
00:15:37.670 --> 00:15:40.080
to show you how to create the decision boundaries
242
00:15:40.080 --> 00:15:42.373
by only using the core libraries.
So there's a library named mlxtend, and it does all of this for you in pretty much one line of code. I already installed it on my computer, so I'll just show you how it works. Just like all the other libraries, importing works the exact same way. I'll just say from mlxtend.plotting import plot_decision_regions.
00:16:17.490 --> 00:16:18.700
All we're gonna need to do
253
00:16:18.700 --> 00:16:20.193
is pass in the function,
254
00:16:21.730 --> 00:16:25.650
followed by x-train, y_train
255
00:16:28.010 --> 00:16:30.993
and the classifier and that's it.
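A minimal sketch of that call:

```python
from mlxtend.plotting import plot_decision_regions

# Scatter plot plus shaded decision regions in a single call
plot_decision_regions(x_train, y_train, clf=classifier)
plt.show()
```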
To make it a little easier to compare these, I'll just pass in a subplot real quick. And there you go.
Now, the last thing that I want to do is create an actual confusion matrix. So from sklearn.metrics, we're going to import plot_confusion_matrix.
Then making the actual matrix is super simple. You just have to pass in the plot_confusion_matrix function, followed by the classifier and the x and y test sets. Then we can use the colormap to assign a color scheme.
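A sketch of that call, reusing the example colormap. One note: plot_confusion_matrix has since been removed from newer scikit-learn releases in favor of ConfusionMatrixDisplay.from_estimator, which takes the same arguments:

```python
from sklearn.metrics import plot_confusion_matrix

# Confusion matrix for the test set predictions
plot_confusion_matrix(classifier, x_test, y_test, cmap='coolwarm')
plt.show()
```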
We'll go ahead and run it one last time, and there you have your very own confusion matrix. I know this was one of the longer guides, but we finally made it to the end. So I will wrap it up and see you in the next one.