How to Build a K-Means Clustering Visualization

WEBVTT

1
00:00:04.320 --> 00:00:05.153
Now that we've gone through

2
00:00:05.153 --> 00:00:06.720
dimensionality reduction,

3
00:00:06.720 --> 00:00:09.300
we can move forward with how to create a visualization

4
00:00:09.300 --> 00:00:11.800
for each of the clustering models.

5
00:00:11.800 --> 00:00:12.880
As we move through the guide,

6
00:00:12.880 --> 00:00:15.330
we're also gonna be making a couple different visuals

7
00:00:15.330 --> 00:00:16.900
that we can use to help us choose

8
00:00:16.900 --> 00:00:18.803
the best K value for a model.

9
00:00:21.530 --> 00:00:23.530
When I was originally putting the guide together,

10
00:00:23.530 --> 00:00:25.210
I planned on having it all set up

11
00:00:25.210 --> 00:00:27.720
so we could do our own exploratory analysis

12
00:00:27.720 --> 00:00:29.890
using real-world data.

13
00:00:29.890 --> 00:00:31.350
But the further I got into it,

14
00:00:31.350 --> 00:00:33.560
the more I realized how much information

15
00:00:33.560 --> 00:00:35.860
we're actually gonna be covering.

16
00:00:35.860 --> 00:00:37.650
So instead of splitting our time between

17
00:00:37.650 --> 00:00:39.200
figuring out what the data means

18
00:00:39.200 --> 00:00:41.840
and trying to cover every key concept,

19
00:00:41.840 --> 00:00:43.110
I thought you'd be better served

20
00:00:43.110 --> 00:00:46.030
having a data frame with a really obvious result.

21
00:00:46.030 --> 00:00:47.840
That way, we can use it to help explain

22
00:00:47.840 --> 00:00:49.300
all of the new concepts

23
00:00:49.300 --> 00:00:51.690
instead of using some ambiguous result

24
00:00:51.690 --> 00:00:54.397
that might end up not making any sense at all.

25
00:00:55.800 --> 00:00:57.000
Now, with that being said,

26
00:00:57.000 --> 00:00:59.620
there's really not much to the dataset.

27
00:00:59.620 --> 00:01:00.820
What we're gonna be working with

28
00:01:00.820 --> 00:01:04.650
is made up of 1,000 samples with 10 features.

29
00:01:04.650 --> 00:01:06.420
And then by setting the centers to three,

30
00:01:06.420 --> 00:01:08.610
it makes it so the data is gonna be separated

31
00:01:08.610 --> 00:01:10.403
into three distinct blobs.
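
As a rough sketch, a dataset like this can be generated with scikit-learn's make_blobs (the variable names and random_state here are assumptions, not the exact code from the guide):

```python
# A minimal sketch of generating the dataset described above
# (variable names and random_state are assumptions).
from sklearn.datasets import make_blobs

# 1,000 samples, 10 features, separated into 3 distinct blobs
X, y = make_blobs(n_samples=1000, n_features=10, centers=3, random_state=42)
```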

32
00:01:12.100 --> 00:01:13.250
Now, unfortunately,

33
00:01:13.250 --> 00:01:15.890
I can't really show you what the data looks like yet

34
00:01:15.890 --> 00:01:17.460
because I honestly have no clue

35
00:01:17.460 --> 00:01:20.820
what a blob in 10 dimensional space would look like.

36
00:01:20.820 --> 00:01:23.030
So to fix that, we're gonna need to perform

37
00:01:23.030 --> 00:01:27.000
some sort of dimensionality reduction like PCA.

38
00:01:27.000 --> 00:01:29.080
And like I mentioned in the last guide,

39
00:01:29.080 --> 00:01:31.400
before we can apply the PCA algorithm,

40
00:01:31.400 --> 00:01:34.100
we need to make sure the data has already been scaled.

41
00:01:35.310 --> 00:01:38.950
Now, all of the PCA code should look relatively familiar

42
00:01:38.950 --> 00:01:41.490
because it's pretty much the exact same thing

43
00:01:41.490 --> 00:01:43.600
that we did in the last guide.

44
00:01:43.600 --> 00:01:46.830
We're just gonna be fitting the model with our scaled data,

45
00:01:46.830 --> 00:01:49.790
then using the explained variance ratio attribute

46
00:01:49.790 --> 00:01:52.000
to find out how much variance can be explained

47
00:01:52.000 --> 00:01:53.573
by each principal component.

48
00:01:55.020 --> 00:01:56.270
And I'm not sure how many of you

49
00:01:56.270 --> 00:01:58.400
noticed my mistake in the last guide,

50
00:01:58.400 --> 00:02:00.070
but I'm pretty sure I said that

51
00:02:00.070 --> 00:02:01.550
two of the principal components

52
00:02:01.550 --> 00:02:04.380
explain less than one-tenth of a percent,

53
00:02:04.380 --> 00:02:07.790
when in reality, I forgot to move the decimal place over.

54
00:02:07.790 --> 00:02:10.160
So one of them was actually supposed to be closer

55
00:02:10.160 --> 00:02:12.327
to something like 4%,

56
00:02:12.327 --> 00:02:15.023
and I think the other one was around half a percent.

57
00:02:16.130 --> 00:02:17.790
To avoid making the same mistake,

58
00:02:17.790 --> 00:02:19.960
I added an element-wise multiplication

59
00:02:19.960 --> 00:02:22.520
to essentially convert all the elements of the array

60
00:02:22.520 --> 00:02:24.690
into their actual percentages.

61
00:02:24.690 --> 00:02:26.010
That way we don't have to worry about

62
00:02:26.010 --> 00:02:27.410
doing any of the conversion.
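
Putting that together, the first cell might look something like this sketch, assuming a StandardScaler for the scaling step and the variable names X and X_scaled:

```python
# Sketch of the scaling + PCA step described above (names are assumptions).
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Scale the data before applying PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA on the scaled data
pca = PCA()
pca.fit(X_scaled)

# Element-wise multiplication converts each ratio into a percentage
print(pca.explained_variance_ratio_ * 100)
```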

63
00:02:29.610 --> 00:02:31.480
All right, now that we have all of that covered,

64
00:02:31.480 --> 00:02:33.683
I'm gonna go ahead and run the first cell.

65
00:02:36.580 --> 00:02:37.800
And as expected,

66
00:02:37.800 --> 00:02:39.840
what we have returned is an array containing

67
00:02:39.840 --> 00:02:41.690
the percentage of explained variance

68
00:02:41.690 --> 00:02:43.173
for each principal component.

69
00:02:44.230 --> 00:02:47.500
But more importantly, it looks like component one and two

70
00:02:47.500 --> 00:02:50.940
are able to explain just about 80% of the variance,

71
00:02:50.940 --> 00:02:53.660
which is a really solid percentage.

72
00:02:53.660 --> 00:02:55.250
And if you looked ahead, you already know

73
00:02:55.250 --> 00:02:57.220
that those are the only two principal components

74
00:02:57.220 --> 00:02:58.780
we're gonna be using.

75
00:02:58.780 --> 00:03:00.840
Oh, it's also probably worth mentioning

76
00:03:00.840 --> 00:03:03.290
that to actually reduce the number of dimensions,

77
00:03:03.290 --> 00:03:05.480
we need to use the fit transform method

78
00:03:05.480 --> 00:03:07.160
instead of just using the fit method

79
00:03:07.160 --> 00:03:08.760
like we did in the PCA guide.

80
00:03:10.210 --> 00:03:12.760
So now I'm gonna go ahead and run it one more time.

81
00:03:13.730 --> 00:03:15.640
And before we switch over to the plot pane,

82
00:03:15.640 --> 00:03:18.600
you can already tell that the reduction process worked,

83
00:03:18.600 --> 00:03:20.790
since x scaled has 10 columns

84
00:03:20.790 --> 00:03:23.213
and the x reduced matrix only has two.
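
Continuing the sketch from above, reducing down to two components and checking the shapes might look like this (n_components=2 and the X_reduced name are assumptions):

```python
# Reduce to two principal components with fit_transform instead of fit
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_scaled.shape)   # (1000, 10)
print(X_reduced.shape)  # (1000, 2)
```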

85
00:03:24.320 --> 00:03:27.130
So, now when we move it to the plot pane,

86
00:03:27.130 --> 00:03:28.170
it's pretty clear

87
00:03:28.170 --> 00:03:30.570
that we end up getting three distinct blobs,

88
00:03:30.570 --> 00:03:32.940
or in our case, clusters.
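
A minimal plotting sketch, assuming matplotlib, could look like this:

```python
# Scatter plot of the two principal components (a plotting sketch)
import matplotlib.pyplot as plt

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], s=10)
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()
```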

89
00:03:32.940 --> 00:03:34.380
But we're gonna pretend for a minute

90
00:03:34.380 --> 00:03:36.520
that it's not overly obvious,

91
00:03:36.520 --> 00:03:39.200
and instead use something called the elbow method,

92
00:03:39.200 --> 00:03:41.760
which is one of the more common methods we can implement

93
00:03:41.760 --> 00:03:44.223
to help figure out the best K value to use.

94
00:03:45.690 --> 00:03:47.050
It'll make sense when you see it,

95
00:03:47.050 --> 00:03:50.290
but the elbow method works by plotting explained variation

96
00:03:50.290 --> 00:03:52.880
as a function of the number of clusters.

97
00:03:52.880 --> 00:03:55.860
Then wherever we end up seeing the elbow of the curve,

98
00:03:55.860 --> 00:03:58.610
that'll indicate the value that we should choose for K.

99
00:04:00.180 --> 00:04:02.340
This is another time where I don't think the math

100
00:04:02.340 --> 00:04:04.790
is gonna be overly important for you to know.

101
00:04:04.790 --> 00:04:07.050
But the three major components that we're gonna be using

102
00:04:07.050 --> 00:04:08.910
are the total sum of squares,

103
00:04:08.910 --> 00:04:10.360
which is the total distance

104
00:04:10.360 --> 00:04:13.180
of every data point from the global mean,

105
00:04:13.180 --> 00:04:15.430
then we have the within sum of squares,

106
00:04:15.430 --> 00:04:17.590
which is the distance of every point

107
00:04:17.590 --> 00:04:20.530
from the centroid of their respective cluster,

108
00:04:20.530 --> 00:04:23.230
and finally, we have the between sum of squares,

109
00:04:23.230 --> 00:04:26.260
which is essentially the weighted distance of each centroid

110
00:04:26.260 --> 00:04:27.303
to the global mean.

111
00:04:28.530 --> 00:04:30.290
Now it's pretty self-explanatory,

112
00:04:30.290 --> 00:04:32.820
but those three are also related to each other

113
00:04:32.820 --> 00:04:34.370
because the total sum of squares

114
00:04:34.370 --> 00:04:37.503
is equal to the within sum added to the between sum.
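
Written out symbolically (with $x_i$ the observations, $\bar{x}$ the global mean, and $\mu_k$, $n_k$, $C_k$ the centroid, size, and members of cluster $k$), that relationship is:

$$
\underbrace{\sum_{i=1}^{n}\lVert x_i-\bar{x}\rVert^2}_{\text{total}}
=\underbrace{\sum_{k=1}^{K}\sum_{x_i\in C_k}\lVert x_i-\mu_k\rVert^2}_{\text{within}}
+\underbrace{\sum_{k=1}^{K}n_k\lVert \mu_k-\bar{x}\rVert^2}_{\text{between}}
$$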

115
00:04:40.970 --> 00:04:42.920
Alrighty, getting back to the code,

116
00:04:42.920 --> 00:04:45.770
the setup for the elbow method is pretty straightforward.

117
00:04:46.880 --> 00:04:47.950
The first thing we need to do

118
00:04:47.950 --> 00:04:50.823
is define the range of K values we wanna test.

119
00:04:51.690 --> 00:04:53.890
After that, we're using list comprehension

120
00:04:53.890 --> 00:04:56.720
to make a list object containing every K-means model

121
00:04:56.720 --> 00:04:58.403
within the defined K range.
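
As a sketch, those first two steps might look like this (the exact K range is an assumption):

```python
# Elbow-method setup: a fitted K-means model for every K in the range
from sklearn.cluster import KMeans

k_range = range(1, 11)  # assumed range of K values to test
kmeans_models = [KMeans(n_clusters=k).fit(X_reduced) for k in k_range]
```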

122
00:04:59.580 --> 00:05:00.940
Then for the next three lines,

123
00:05:00.940 --> 00:05:02.050
we're gonna be calculating

124
00:05:02.050 --> 00:05:05.253
the within, total, and between sum of squares.

125
00:05:06.810 --> 00:05:07.950
To get the within sum,

126
00:05:07.950 --> 00:05:10.340
we're gonna be using list comprehension again,

127
00:05:10.340 --> 00:05:14.260
along with the inertia attribute from the K-means model,

128
00:05:14.260 --> 00:05:16.823
which if you check the documentation,

129
00:05:17.770 --> 00:05:19.940
it lets us know that the inertia attribute

130
00:05:19.940 --> 00:05:22.670
calculates the sum of squared distances of samples

131
00:05:22.670 --> 00:05:24.790
to the closest centroid.

132
00:05:24.790 --> 00:05:26.600
And I don't think we need to go back,

133
00:05:26.600 --> 00:05:28.390
but that was the exact definition

134
00:05:28.390 --> 00:05:30.060
of the within sum of squares.

135
00:05:30.060 --> 00:05:31.733
So we're all good for this one.
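
So the within sum of squares can be pulled straight from the fitted models, roughly like this:

```python
# Within sum of squares: one inertia_ value per fitted model
wss = [model.inertia_ for model in kmeans_models]
```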

136
00:05:32.900 --> 00:05:34.210
Now, for the total sum,

137
00:05:34.210 --> 00:05:35.920
we're gonna be using a new function

138
00:05:35.920 --> 00:05:37.850
called pairwise distance,

139
00:05:37.850 --> 00:05:40.600
which we're getting from SciPy, the scientific Python library.

140
00:05:41.570 --> 00:05:44.450
And I'm starting to feel like it's a trend in this guide,

141
00:05:44.450 --> 00:05:45.770
but the name of the function

142
00:05:45.770 --> 00:05:47.920
is again pretty self-explanatory

143
00:05:47.920 --> 00:05:49.390
since we use it to calculate

144
00:05:49.390 --> 00:05:52.510
the pairwise distance between observations,

145
00:05:52.510 --> 00:05:55.170
which in our code, we're using to calculate the distance

146
00:05:55.170 --> 00:05:58.453
between every observation in the X reduced matrix.

147
00:06:01.890 --> 00:06:03.630
So the pairwise distance function

148
00:06:03.630 --> 00:06:05.930
starts with the first observation

149
00:06:05.930 --> 00:06:09.103
and calculates the distance to the second observation.

150
00:06:10.010 --> 00:06:12.740
Then it saves the result to a distance matrix.

151
00:06:12.740 --> 00:06:13.720
And then after that,

152
00:06:13.720 --> 00:06:17.850
it goes from the first observation to the third observation,

153
00:06:17.850 --> 00:06:20.620
and then from the first to the fourth and so on.

154
00:06:20.620 --> 00:06:23.910
And it'll do that until it gets through the entire matrix.

155
00:06:23.910 --> 00:06:24.830
Once it's done with that,

156
00:06:24.830 --> 00:06:27.100
it's gonna go back up to the second observation

157
00:06:27.100 --> 00:06:29.050
and calculate the distance to the third

158
00:06:29.050 --> 00:06:31.550
and second to fourth and so on.

159
00:06:31.550 --> 00:06:33.350
And it keeps following that process

160
00:06:33.350 --> 00:06:35.673
until every unique distance is calculated.

161
00:06:38.930 --> 00:06:40.150
But anyway, after that,

162
00:06:40.150 --> 00:06:42.400
we're gonna be squaring all the distances,

163
00:06:42.400 --> 00:06:44.940
calculating the sum of all the distances,

164
00:06:44.940 --> 00:06:47.550
then dividing by the original number of observations,

165
00:06:47.550 --> 00:06:49.073
which for us was 1,000.

166
00:06:50.100 --> 00:06:51.750
And finally, all we have to do

167
00:06:51.750 --> 00:06:53.160
to calculate the between sum

168
00:06:53.160 --> 00:06:55.733
is subtract the within sum from the total sum.
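
A sketch of those last two calculations, assuming SciPy's pdist and the variable names used earlier:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Square every unique pairwise distance, sum them, and divide by
# the number of observations (1,000) to get the total sum of squares
tss = np.sum(pdist(X_reduced) ** 2) / X_reduced.shape[0]

# Between sum of squares is the total sum minus the within sum
bss = tss - np.array(wss)
```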

169
00:06:57.020 --> 00:07:00.000
Alrighty, now that we have all of that taken care of,

170
00:07:00.000 --> 00:07:02.670
we are all set up to build the elbow graph.

171
00:07:02.670 --> 00:07:04.540
And I think I already mentioned it,

172
00:07:04.540 --> 00:07:07.470
but the elbow graph works by plotting explained variation

173
00:07:07.470 --> 00:07:10.110
as a function of the number of clusters.

174
00:07:10.110 --> 00:07:13.220
So the range we're testing is gonna go along the x-axis

175
00:07:13.220 --> 00:07:16.010
and the explained variation along the y.

176
00:07:16.010 --> 00:07:17.370
And to be honest with you,

177
00:07:17.370 --> 00:07:18.670
it doesn't really matter

178
00:07:18.670 --> 00:07:21.860
if you decide to use the within sum or the between sum

179
00:07:21.860 --> 00:07:23.600
because as you'll see in a second,

180
00:07:23.600 --> 00:07:26.070
they'll both end up giving you the same result.

181
00:07:26.070 --> 00:07:29.530
But personally, I find myself using within sum

182
00:07:29.530 --> 00:07:32.130
just because it's the first thing we're calculating.
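
The elbow graph itself is then just a line plot, something along these lines:

```python
# Elbow graph: K values along the x-axis, within sum of squares along the y
plt.plot(k_range, wss, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Within sum of squares')
plt.title('Elbow method')
plt.show()
```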

183
00:07:37.020 --> 00:07:39.040
All right, so this is one of the reasons

184
00:07:39.040 --> 00:07:41.970
we're using a data set with a super obvious result,

185
00:07:41.970 --> 00:07:44.470
because what we refer to as the elbow

186
00:07:44.470 --> 00:07:46.600
isn't always this pronounced.

187
00:07:46.600 --> 00:07:48.580
So the way you interpret the elbow graph

188
00:07:48.580 --> 00:07:50.950
is by thinking of it as an arm.

189
00:07:50.950 --> 00:07:53.670
So you're gonna have the upper half right here

190
00:07:53.670 --> 00:07:56.160
and then the lower half down here.

191
00:07:56.160 --> 00:07:59.123
And most importantly, we're gonna have the elbow.

192
00:08:00.210 --> 00:08:03.660
So whichever K value looks most like the elbow

193
00:08:03.660 --> 00:08:05.350
is gonna be the suggested value

194
00:08:05.350 --> 00:08:07.210
that we should use in the model.

195
00:08:07.210 --> 00:08:08.270
And in our example,

196
00:08:08.270 --> 00:08:10.750
it's pretty obvious that K value of three

197
00:08:10.750 --> 00:08:12.303
looks the most like an elbow.

198
00:08:17.650 --> 00:08:18.930
For the actual K-means model,

199
00:08:18.930 --> 00:08:21.480
there isn't really a whole lot that we need to go through

200
00:08:21.480 --> 00:08:23.990
since most of it we've already done.

201
00:08:23.990 --> 00:08:27.090
At the top we have all of our imports and data.

202
00:08:27.090 --> 00:08:29.760
Then we made a variable for the K-means algorithm

203
00:08:29.760 --> 00:08:32.043
and set the number of clusters to three.

204
00:08:32.920 --> 00:08:36.080
After that, we created the model by using the fit method

205
00:08:36.080 --> 00:08:37.830
along with the X reduced matrix

206
00:08:37.830 --> 00:08:40.750
that we made by using the PCA algorithm.

207
00:08:40.750 --> 00:08:43.200
From there, we have the same old mesh grid

208
00:08:43.200 --> 00:08:45.950
that we've already made a few other times,

209
00:08:45.950 --> 00:08:47.240
followed by all of the code

210
00:08:47.240 --> 00:08:49.090
that we're gonna need for the visual.
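
Sketched out, the model and visual might look roughly like this (the grid step size, colors, and plotting details are assumptions):

```python
# K-means model with three clusters, fit on the PCA-reduced data
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_reduced)

# Mesh grid covering the reduced feature space
x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1
y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict a cluster for every grid point to draw the boundaries
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                     # cluster boundaries
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], s=10)    # the data points
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100)                # centroid locations
plt.show()
```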

211
00:08:50.260 --> 00:08:52.193
So, now when we go ahead and run it,

212
00:08:55.040 --> 00:08:57.960
we end up with a clear and easy to read visual,

213
00:08:57.960 --> 00:09:00.470
showing us the location of the three centroids

214
00:09:00.470 --> 00:09:02.223
along with all of the boundaries.

215
00:09:04.130 --> 00:09:07.360
Alrighty, so the last topic we're gonna cover in this guide

216
00:09:07.360 --> 00:09:08.783
is the silhouette score.

217
00:09:12.590 --> 00:09:14.280
I'm pretty sure a couple of guides ago,

218
00:09:14.280 --> 00:09:15.560
I mentioned that clustering

219
00:09:15.560 --> 00:09:18.920
is often used for exploring unlabeled data.

220
00:09:18.920 --> 00:09:20.900
Meaning we have a bunch of features,

221
00:09:20.900 --> 00:09:23.563
but really no clue as to how they should be grouped.

222
00:09:24.640 --> 00:09:26.760
It obviously wasn't a problem in this example

223
00:09:26.760 --> 00:09:29.250
because the clusters were well separated.

224
00:09:29.250 --> 00:09:30.290
But a lot of times,

225
00:09:30.290 --> 00:09:32.410
the data is gonna be jumbled together

226
00:09:32.410 --> 00:09:34.400
making it really difficult to decipher

227
00:09:34.400 --> 00:09:36.763
the optimal number of clusters we should use.

228
00:09:37.640 --> 00:09:39.420
Now, we already covered the elbow method,

229
00:09:39.420 --> 00:09:41.600
which for this example worked really well

230
00:09:41.600 --> 00:09:45.060
since there was really just one obvious choice.

231
00:09:45.060 --> 00:09:47.440
But as you'd expect with real-world data,

232
00:09:47.440 --> 00:09:50.180
it's not always gonna be that straightforward.

233
00:09:50.180 --> 00:09:52.130
There's gonna be times where you might have

234
00:09:52.130 --> 00:09:54.230
two, three, or four different points

235
00:09:54.230 --> 00:09:56.500
that all look like they could be the elbow.

236
00:09:56.500 --> 00:09:59.333
So to help solve that, we can use the silhouette score.

237
00:10:06.410 --> 00:10:08.770
If you wanna take a look at the official documentation,

238
00:10:08.770 --> 00:10:09.670
you can.

239
00:10:09.670 --> 00:10:11.760
But basically, the silhouette score

240
00:10:11.760 --> 00:10:14.430
indicates how close clusters are to each other

241
00:10:14.430 --> 00:10:15.910
by returning a coefficient

242
00:10:15.910 --> 00:10:18.730
that ranges from negative one to one.

243
00:10:18.730 --> 00:10:21.740
The absolute best value we can get is one,

244
00:10:21.740 --> 00:10:23.500
the worst is negative one.

245
00:10:23.500 --> 00:10:24.900
And if zero is returned,

246
00:10:24.900 --> 00:10:27.100
that's when things start getting a little muddy

247
00:10:27.100 --> 00:10:29.823
because it's an indication clusters are overlapping.

248
00:10:30.700 --> 00:10:32.850
We'll come back to those numbers in just a minute,

249
00:10:32.850 --> 00:10:34.440
but for now let's move on to the code

250
00:10:34.440 --> 00:10:36.440
so we can run through that really quick.

251
00:10:38.290 --> 00:10:39.810
So a lot of this is kind of similar

252
00:10:39.810 --> 00:10:41.600
to what we did for the elbow method,

253
00:10:41.600 --> 00:10:43.150
but this time we're gonna be using a loop

254
00:10:43.150 --> 00:10:45.830
to iterate over a range of K values.

255
00:10:45.830 --> 00:10:48.230
After that, we're gonna be using the fit predict method

256
00:10:48.230 --> 00:10:51.643
to make cluster predictions for each model in that K range.

257
00:10:52.560 --> 00:10:54.610
Then we're using the X reduced matrix

258
00:10:54.610 --> 00:10:56.120
and the predicted cluster labels

259
00:10:56.120 --> 00:10:58.023
to calculate every silhouette score.

260
00:10:59.030 --> 00:11:01.340
And to make it a little easier to analyze,

261
00:11:01.340 --> 00:11:03.310
we're also throwing in a pandas data frame

262
00:11:03.310 --> 00:11:04.863
to help organize everything.

263
00:11:05.960 --> 00:11:07.260
Then finally for the graph,

264
00:11:07.260 --> 00:11:09.230
we're just plotting each silhouette score

265
00:11:09.230 --> 00:11:11.283
against their corresponding K value.
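
A sketch of that loop and plot, assuming scikit-learn's silhouette_score and a K range starting at two (the score needs at least two clusters):

```python
import pandas as pd
from sklearn.metrics import silhouette_score

# Silhouette score for each K in the range
scores = []
for k in range(2, 11):
    labels = KMeans(n_clusters=k).fit_predict(X_reduced)
    scores.append(silhouette_score(X_reduced, labels))

# Organize the results in a data frame, then plot score against K
results = pd.DataFrame({'K': range(2, 11), 'silhouette_score': scores})
print(results)

plt.plot(results['K'], results['silhouette_score'], marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Silhouette score')
plt.show()
```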

266
00:11:20.090 --> 00:11:21.390
And since we're still pretending

267
00:11:21.390 --> 00:11:23.400
we don't know anything about the data,

268
00:11:23.400 --> 00:11:24.850
the graph and the data frame

269
00:11:24.850 --> 00:11:27.670
suggest we use a K value of three

270
00:11:27.670 --> 00:11:31.280
because it's yielding a silhouette score of 0.85,

271
00:11:31.280 --> 00:11:33.463
which is the closest value to one.

272
00:11:34.310 --> 00:11:38.040
Then in contrast, the worst K value we could use is nine,

273
00:11:38.040 --> 00:11:40.863
which only has a silhouette score of 0.33.

274
00:11:41.960 --> 00:11:44.970
And if we go back into the K-means visual,

275
00:11:44.970 --> 00:11:46.830
it's fairly apparent how well spread out

276
00:11:46.830 --> 00:11:48.580
each of the clusters and centroids are

277
00:11:48.580 --> 00:11:50.563
when we're using three for the K value.

278
00:11:51.550 --> 00:11:54.083
But now if we were to go and switch it to nine,

279
00:12:02.990 --> 00:12:05.000
it's also pretty easy to see

280
00:12:05.000 --> 00:12:10.000
how each of these three groups should be merged into one.

281
00:12:10.480 --> 00:12:12.420
And that's basically what I was talking about

282
00:12:12.420 --> 00:12:15.000
when I said we start to see clusters overlap

283
00:12:15.000 --> 00:12:17.850
when silhouette scores get closer to zero.

284
00:12:17.850 --> 00:12:19.890
So like this part right here

285
00:12:19.890 --> 00:12:23.080
is gonna be overlapping with what should be the blue,

286
00:12:23.080 --> 00:12:24.630
and then this third group over here

287
00:12:24.630 --> 00:12:27.530
is also gonna be overlapping with what should be the blue.

288
00:12:33.600 --> 00:12:34.690
And as I promised,

289
00:12:34.690 --> 00:12:37.830
that was the last topic we needed to cover in the guide.

290
00:12:37.830 --> 00:12:40.220
So now assuming everything made sense,

291
00:12:40.220 --> 00:12:41.560
you should know how to perform

292
00:12:41.560 --> 00:12:44.490
dimensionality reduction using PCA,

293
00:12:44.490 --> 00:12:46.760
how to apply the elbow method and silhouette score

294
00:12:46.760 --> 00:12:49.280
to help choose an appropriate K value,

295
00:12:49.280 --> 00:12:51.000
and how to build a K-means model

296
00:12:51.000 --> 00:12:52.923
with an easy to interpret visual.

297
00:12:53.920 --> 00:12:55.010
Now, in the next guide,

298
00:12:55.010 --> 00:12:56.850
we're gonna finish off the clustering section

299
00:12:56.850 --> 00:12:59.410
by creating a hierarchical clustering visual.

300
00:12:59.410 --> 00:13:00.520
But for now,

301
00:13:00.520 --> 00:13:03.270
I'll wrap things up and I'll see you in the next guide.