In this guide, we're going to continue our discussion on k-nearest neighbors, but this time we're going to focus more on what k means and how to create a k-nearest neighbors visualization.
Hopefully you remember from the last guide that I mentioned k-nearest neighbors works by assuming similar objects exist in close proximity. So when a new data point comes in, it's classified into the group that it's closest to.
Now, that's all fine and dandy when an input obviously belongs to a certain class, but how does the k-nearest neighbors algorithm determine the class when the choice isn't so obvious? Well, that's precisely when the number of neighbors comes into play.
Let's say we're working with this smaller dataset and we have a new observation that's pretty close to the middle of both classes. My question to you is: what class should the data point belong to? The answer isn't so obvious, because it all depends on what the k value is.
If the k value is one, k-nearest neighbors will calculate the distance from the new observation to the nearest point of each class, and whichever distance is the closest will indicate what class the observation should be assigned to, which in this case is Class B.
But if the k value is set to three, the three nearest neighbors will be used to decide the class. So KNN will find the three closest points for each class, and whichever class has the smallest total distance will be the assigned class. And in this case, it looks like it'll be Class A.
Ultimately, the class the new observation is assigned to will be determined by what k value is chosen. And unfortunately for us, there's no hard and fast rule to help us choose what k value to use. Like with the other algorithms, you just need to make sure you avoid underfitting and overfitting.
Now that you have a better overall understanding of what the k value means, let's take a look at how to create a visualization.
The libraries that we're going to start with are NumPy, pyplot, train_test_split, KNeighborsClassifier, and the accuracy_score function from sklearn.metrics.
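As a minimal sketch, those imports look like this (assuming the usual np and plt aliases):

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
```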
And for this example, I thought it would be a good idea to use one of scikit-learn's practice datasets. Just like any other import, we'll say from sklearn.datasets import make_moons.
And then to be able to use the dataset, we'll need to assign it to an object, so we'll say x, y equals make_moons. Then inside the parens, let's pass in n_samples and set that to 150. We'll make the noise equal to .35, which will give us a little bit of variance in the data. And then we'll also pass in random_state equals zero.
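Put together, the dataset line looks something like this:

```python
from sklearn.datasets import make_moons

# 150 samples in two interleaving half-moon classes, with some noise added
x, y = make_moons(n_samples=150, noise=0.35, random_state=0)
```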
And since none of this is going to be new, I already went through and set up the training and testing split, as well as the classifier and accuracy_score.
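That setup isn't shown line by line here, but it looks roughly like this; the n_neighbors value is just a starting point for the k value we'll be changing later:

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

classifier = KNeighborsClassifier(n_neighbors=1)  # k value; placeholder choice
classifier.fit(x_train, y_train)

predictions = classifier.predict(x_test)
print(accuracy_score(y_test, predictions))
```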
Then I'm going to run what we have so far, just to make sure that everything works. And we're good there, so let's go through what we need to graph this.
We're going to start out like normal and pass in plt.scatter. Then I'll open this up so you can see the x_train object array. What we're doing is plotting the first feature column along the x-axis, and the second feature column will be along the y-axis.
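In code, that's just the two columns of x_train:

```python
# First feature column on the x-axis, second feature column on the y-axis
plt.scatter(x_train[:, 0], x_train[:, 1])
plt.show()
```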
When we run it again, we have all of the training observations in place, but the obvious problem is that we don't know what class any of these points belongs to. To fix that, we'll just have to assign the color parameter to the y_train set.
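In Matplotlib's scatter, that's the c argument:

```python
# Color each training point by its class label
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
```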
And now when we run it, we have two distinct classes. Then if you want to change the colors from the default purple and yellow, there are a couple of ways of doing that as well. If you want to choose your own specific colors, you can do that by importing ListedColormap from Matplotlib, then creating your own colormap object with a list of the colors that you want to use.
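A sketch of that approach; the two color names here are just examples:

```python
from matplotlib.colors import ListedColormap

# One color per class, in label order
custom_cmap = ListedColormap(['tomato', 'steelblue'])
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap=custom_cmap)
```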
The other option, which is a little bit easier, is to use one of the predefined colormaps that Matplotlib already offers. And if you look through the documentation, there are a bunch of different options to choose from.
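For example, any named Matplotlib colormap can be passed straight to cmap ('coolwarm' here is just one option):

```python
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='coolwarm')
```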
Now if we go and change the k value, we can see the accuracy of the model change, but there's no apparent change to the actual graph. So to help with that, we can create a visual representation of the decision boundaries. And bear with me as we go through this, because it's a little complicated.
The first thing that we need to do is create something called a numpy meshgrid, and that gives us the ability to create a rectangular grid that represents either Cartesian or matrix indexing. So if we start with two separate one-dimensional arrays, the meshgrid function returns two two-dimensional arrays representing the x and y coordinates of all of the points. And notice how the x_1 array now has three rows, and that's because the original y array had a length of three. And the y_1 array has four columns because the original x array had a length of four.
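Here's a small example of that behavior; the actual values are just placeholders for whatever is on screen:

```python
x = np.array([0, 1, 2, 3])  # length 4
y = np.array([0, 1, 2])     # length 3

x_1, y_1 = np.meshgrid(x, y)
print(x_1.shape)  # (3, 4) -> three rows, because y had length 3
print(y_1.shape)  # (3, 4) -> four columns, because x had length 4
```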
Now, getting back to our code, the first step to setting up the meshgrid is defining how big it needs to be. We'll make the x_min and x_max parameters first. And remember, the data we're graphing along the x-axis comes from the first feature column, so we'll pass that in. Then we can apply NumPy's min method to return the minimum along the x-axis, and since we want to extend the boundaries a little bit beyond the minimum value, we'll also subtract .5. Then we're going to do the same thing, but this time with the max method, followed by adding .5.
Next, we'll just copy and paste all of this, then replace every x with a y, and every zero with a one.
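So the boundary lines end up looking something like this (whether you compute them from the full dataset or just the training split doesn't change the idea):

```python
# Pad the grid by 0.5 beyond the data on every side
x_min, x_max = x[:, 0].min() - 0.5, x[:, 0].max() + 0.5
y_min, y_max = x[:, 1].min() - 0.5, x[:, 1].max() + 0.5
```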
Once we have the parameters all set up, we can go ahead and create the actual meshgrid. So let's start with xx and yy and assign that to the meshgrid function. Then inside the parentheses, the documentation calls for us to pass in both one-dimensional arrays that represent the coordinates of the meshgrid.
Applying that to our code, we'll start by passing in the arange function, followed by a start and stop parameter. Then we'll pass in a step parameter of .1. Now we'll go and do the same thing for the y-axis.
Now that's done, so we can go ahead and run the code again.
161
00:10:42.300 --> 00:10:45.490
And looking at the xx object array,
162
00:10:45.490 --> 00:10:48.090
you can see that every feature column value
163
00:10:48.090 --> 00:10:50.953
is repeated down all 40 rows.
164
00:10:53.930 --> 00:10:56.330
Okay, so now we can move forward
165
00:10:56.330 --> 00:10:58.740
with plotting the decision boundary.
166
00:10:58.740 --> 00:11:03.330
And to do that, we're gonna use contourf from Matplotlib.
Looking at this documentation, we're going to need an array object for x, y, and z. And according to this, x and y need to be two-dimensional and have the same shape as z. Well, we know that we're going to be using xx for x and yy for y, because that's what's going to give us the coordinate system on the x-y plane.
So that just means we're going to need to create an object array for z, and that's going to indicate what class an observation belongs to based on its x-y coordinate. So if you think about what z needs to do, it's essentially going to be predicting the class of an observation based on its location on the meshgrid.
And that's something we already know how to do. We just need to use the classifier's predict function. Normally we'd do this by passing in the x_test set, but here we kind of run into an issue, because we need a way of representing both xx and yy as a single object.
So to fix that, we're going to use another function from NumPy called c, or c_ I guess.
I'll try to make this explanation a little quicker. So if we have array a containing the values one, two, and three, and array b, which contains four, five, and six, we can use c_, and essentially what it does is transpose both of the arrays and then concatenate the two, giving us a three-by-two matrix.
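A quick look at that in code:

```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print(np.c_[a, b])
# [[1 4]
#  [2 5]
#  [3 6]]  -> a three-by-two matrix
```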
So when we go ahead and apply that to our code, we'll pass in np.c_, followed by square brackets and then xx and yy. And when we run this, we get an error stating that the dimensions have to match the training data dimensions. This is actually a pretty easy fix when we use the ravel function.
And after we run it again, z ends up as a one-dimensional array. But the documentation said z needs to have the same shape as xx and yy. To fix that issue, we'll do a reassignment and reshape z to match the shape of xx. We'll run it again, and this time z, xx, and yy all have the same shape.
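Put together, those two steps look roughly like this:

```python
# Flatten the grid, predict a class for every grid point, then reshape back to the grid
z = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
z = z.reshape(xx.shape)
```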
To add a couple of final touches, we can change the color by using the colormap again. And to make it so we can actually see the data points, we can pass in alpha, and let's go with something like .3. And that's okay, but it seems a little too light for me, so let's go back and try .5.
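So the finished plotting code looks something like this, sticking with the example colormap from earlier:

```python
# Shade the decision regions, then draw the training points on top
plt.contourf(xx, yy, z, cmap='coolwarm', alpha=0.5)
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, cmap='coolwarm')
plt.show()
```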
Fantastic. The graphing part is all done. Now let's go through and try a couple of different k values to mess with the decision boundary.
00:15:08.130 --> 00:15:09.520
And I'm kind of torn
231
00:15:09.520 --> 00:15:13.410
between k equals five and k equals seven.
232
00:15:13.410 --> 00:15:14.810
They both look pretty good
233
00:15:14.810 --> 00:15:17.753
and have a fairly good score_accuracy for both.
234
00:15:18.700 --> 00:15:20.410
So you could really choose either one
235
00:15:20.410 --> 00:15:22.260
until you get a little bit more data.
236
00:15:24.420 --> 00:15:27.210
Okay, so to wrap this guide up,
237
00:15:27.210 --> 00:15:29.133
I'm gonna go through two more things.
238
00:15:30.470 --> 00:15:32.760
The first might be a little annoying
239
00:15:32.760 --> 00:15:36.010
because it would have saved us a bunch of time
240
00:15:36.010 --> 00:15:37.670
but I thought it was important
241
00:15:37.670 --> 00:15:40.080
to show you how to create the decision boundaries
242
00:15:40.080 --> 00:15:42.373
by only using the core libraries.
So there's a library named mlxtend, and it does all of this for you in pretty much one line of code. I already installed it on my computer, so I'll just show you how it works. Just like all the other libraries, importing works the exact same way. I'll just say from mlxtend.plotting import plot_decision_regions.
00:16:17.490 --> 00:16:18.700
All we're gonna need to do
253
00:16:18.700 --> 00:16:20.193
is pass in the function,
254
00:16:21.730 --> 00:16:25.650
followed by x-train, y_train
255
00:16:28.010 --> 00:16:30.993
and the classifier and that's it.
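A minimal sketch of that call:

```python
from mlxtend.plotting import plot_decision_regions

# Scatter plot plus shaded decision regions in a single call
plot_decision_regions(x_train, y_train, clf=classifier)
plt.show()
```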
To make it a little easier to compare these, I'll just pass in a subplot real quick. And there you go.
Now, the last thing that I want to do is create an actual confusion matrix. So from sklearn.metrics, we're going to import plot_confusion_matrix.
Then making the actual matrix is super simple. You just have to pass in the plot_confusion_matrix function, followed by the classifier and the x and y test sets. Then we can use the colormap to assign a color scheme.
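A sketch of that call, reusing the example colormap. One note: plot_confusion_matrix has since been removed from newer scikit-learn releases in favor of ConfusionMatrixDisplay.from_estimator, which takes the same arguments:

```python
from sklearn.metrics import plot_confusion_matrix

# Confusion matrix for the test set predictions
plot_confusion_matrix(classifier, x_test, y_test, cmap='coolwarm')
plt.show()
```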
We'll go ahead and run it one last time, and there you have your very own confusion matrix. I know this was one of the longer guides, but we finally made it to the end. So I will wrap it up and see you in the next one.