WEBVTT
1
00:00:04.320 --> 00:00:05.153
Now that we've gone through
2
00:00:05.153 --> 00:00:06.720
dimensionality reduction,
3
00:00:06.720 --> 00:00:09.300
we can move forward with how to create a visualization
4
00:00:09.300 --> 00:00:11.800
for each of the clustering models.
5
00:00:11.800 --> 00:00:12.880
As we move through the guide,
6
00:00:12.880 --> 00:00:15.330
we're also gonna be making a couple different visuals
7
00:00:15.330 --> 00:00:16.900
that we can use to help us choose
8
00:00:16.900 --> 00:00:18.803
the best K value for a model.
9
00:00:21.530 --> 00:00:23.530
When I was originally putting the guide together,
10
00:00:23.530 --> 00:00:25.210
I planned on having it all set up
11
00:00:25.210 --> 00:00:27.720
so we could do our own exploratory analysis
12
00:00:27.720 --> 00:00:29.890
using real-world data.
13
00:00:29.890 --> 00:00:31.350
But the further I got into it,
14
00:00:31.350 --> 00:00:33.560
the more I realized how much information
15
00:00:33.560 --> 00:00:35.860
we're actually gonna be covering.
16
00:00:35.860 --> 00:00:37.650
So instead of splitting our time between
17
00:00:37.650 --> 00:00:39.200
figuring out what the data means
18
00:00:39.200 --> 00:00:41.840
and trying to cover every key concept,
19
00:00:41.840 --> 00:00:43.110
I thought you'd be better served
20
00:00:43.110 --> 00:00:46.030
having a data frame with a really obvious result.
21
00:00:46.030 --> 00:00:47.840
That way, we can use it to help explain
22
00:00:47.840 --> 00:00:49.300
all of the new concepts
23
00:00:49.300 --> 00:00:51.690
instead of using some ambiguous result
24
00:00:51.690 --> 00:00:54.397
that might end up not making any sense at all.
25
00:00:55.800 --> 00:00:57.000
Now, with that being said,
26
00:00:57.000 --> 00:00:59.620
there's really not much to the dataset.
27
00:00:59.620 --> 00:01:00.820
What we're gonna be working with
28
00:01:00.820 --> 00:01:04.650
is made up of 1,000 samples with 10 features.
29
00:01:04.650 --> 00:01:06.420
And then by setting the centers to three,
30
00:01:06.420 --> 00:01:08.610
it makes it so the data is gonna be separated
31
00:01:08.610 --> 00:01:10.403
into three distinct blobs.
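For reference, a dataset like the one described could be generated with scikit-learn's make_blobs; this is just a sketch with an assumed random seed, not necessarily the guide's exact code:

```python
# Sketch: a toy dataset matching the description --
# 1,000 samples, 10 features, separated into three blobs.
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=1000,   # 1,000 samples
    n_features=10,    # 10 features
    centers=3,        # three distinct blobs
    random_state=42,  # assumed seed, purely for reproducibility
)
print(X.shape)  # (1000, 10)
```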
32
00:01:12.100 --> 00:01:13.250
Now, unfortunately,
33
00:01:13.250 --> 00:01:15.890
I can't really show you what the data looks like yet
34
00:01:15.890 --> 00:01:17.460
because I honestly have no clue
35
00:01:17.460 --> 00:01:20.820
what a blob in 10 dimensional space would look like.
36
00:01:20.820 --> 00:01:23.030
So to fix that, we're gonna need to perform
37
00:01:23.030 --> 00:01:27.000
some sort of dimensionality reduction like PCA.
38
00:01:27.000 --> 00:01:29.080
And like I mentioned in the last guide,
39
00:01:29.080 --> 00:01:31.400
before we can apply the PCA algorithm,
40
00:01:31.400 --> 00:01:34.100
we need to make sure the data has already been scaled.
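A minimal sketch of that scaling step, assuming scikit-learn's StandardScaler and the X array from the sketch above:

```python
# Sketch: standardize the features before running PCA.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X from the make_blobs sketch above
```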
41
00:01:35.310 --> 00:01:38.950
Now, all of the PCA code should look relatively familiar
42
00:01:38.950 --> 00:01:41.490
because it's pretty much the exact same thing
43
00:01:41.490 --> 00:01:43.600
that we did in the last guide.
44
00:01:43.600 --> 00:01:46.830
We're just gonna be fitting the model with our scaled data,
45
00:01:46.830 --> 00:01:49.790
then using the explained variance ratio attribute
46
00:01:49.790 --> 00:01:52.000
to find out how much variance can be explained
47
00:01:52.000 --> 00:01:53.573
by each principal component.
48
00:01:55.020 --> 00:01:56.270
And I'm not sure how many of you
49
00:01:56.270 --> 00:01:58.400
noticed my mistake in the last guide,
50
00:01:58.400 --> 00:02:00.070
but I'm pretty sure I said that
51
00:02:00.070 --> 00:02:01.550
two of the principal components
52
00:02:01.550 --> 00:02:04.380
explain less than one-tenth of a percent,
53
00:02:04.380 --> 00:02:07.790
when in reality, I forgot to move the decimal place over.
54
00:02:07.790 --> 00:02:10.160
So one of them was actually supposed to be closer
55
00:02:10.160 --> 00:02:12.327
to something like 4%,
56
00:02:12.327 --> 00:02:15.023
and I think the other one was around half a percent.
57
00:02:16.130 --> 00:02:17.790
To avoid making the same mistake,
58
00:02:17.790 --> 00:02:19.960
I added an element-wise multiplication
59
00:02:19.960 --> 00:02:22.520
to essentially convert all the elements of the array
60
00:02:22.520 --> 00:02:24.690
into their actual percentage.
61
00:02:24.690 --> 00:02:26.010
That way we don't have to worry about
62
00:02:26.010 --> 00:02:27.410
doing any of the conversion.
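That first cell presumably looks something like this; a sketch assuming scikit-learn's PCA and the X_scaled array from above:

```python
# Sketch: fit PCA on the scaled data, then report each component's
# explained variance as a percentage instead of a ratio.
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(X_scaled)

# Element-wise multiplication converts the ratios into percentages,
# so there's no decimal place to move by hand.
explained_pct = pca.explained_variance_ratio_ * 100
print(explained_pct)
```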
63
00:02:29.610 --> 00:02:31.480
All right, now that we have all of that covered,
64
00:02:31.480 --> 00:02:33.683
I'm gonna go ahead and run the first cell.
65
00:02:36.580 --> 00:02:37.800
And as expected,
66
00:02:37.800 --> 00:02:39.840
what we have returned is an array containing
67
00:02:39.840 --> 00:02:41.690
the percentage of explained variance
68
00:02:41.690 --> 00:02:43.173
for each principal component.
69
00:02:44.230 --> 00:02:47.500
But more importantly, it looks like component one and two
70
00:02:47.500 --> 00:02:50.940
are able to explain just about 80% of the variance,
71
00:02:50.940 --> 00:02:53.660
which is a really solid percentage.
72
00:02:53.660 --> 00:02:55.250
And if you looked ahead, you already know
73
00:02:55.250 --> 00:02:57.220
that those are the only two principal components
74
00:02:57.220 --> 00:02:58.780
we're gonna be using.
75
00:02:58.780 --> 00:03:00.840
Oh, it's also probably worth mentioning
76
00:03:00.840 --> 00:03:03.290
that to actually reduce the number of dimensions,
77
00:03:03.290 --> 00:03:05.480
we need to use the fit transform method
78
00:03:05.480 --> 00:03:07.160
instead of just using the fit method
79
00:03:07.160 --> 00:03:08.760
like we did in the PCA guide.
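As a sketch, the reduction step with fit_transform might look like this, keeping the two components described above:

```python
# Sketch: reduce the scaled data down to the first two principal components.
from sklearn.decomposition import PCA

pca_2 = PCA(n_components=2)
X_reduced = pca_2.fit_transform(X_scaled)

print(X_scaled.shape)   # (1000, 10)
print(X_reduced.shape)  # (1000, 2)
```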
80
00:03:10.210 --> 00:03:12.760
So now I'm gonna go ahead and run it one more time.
81
00:03:13.730 --> 00:03:15.640
And before we switch over to the plot pane,
82
00:03:15.640 --> 00:03:18.600
you can already tell that the reduction process worked,
83
00:03:18.600 --> 00:03:20.790
since x scaled has 10 columns
84
00:03:20.790 --> 00:03:23.213
and the x reduced matrix only has two.
85
00:03:24.320 --> 00:03:27.130
So, now when we move it to the plot pane,
86
00:03:27.130 --> 00:03:28.170
it's pretty clear
87
00:03:28.170 --> 00:03:30.570
that we end up getting three distinct blobs,
88
00:03:30.570 --> 00:03:32.940
or in our case, clusters.
89
00:03:32.940 --> 00:03:34.380
But we're gonna pretend for a minute
90
00:03:34.380 --> 00:03:36.520
that it's not overly obvious,
91
00:03:36.520 --> 00:03:39.200
and instead use something called the elbow method,
92
00:03:39.200 --> 00:03:41.760
which is one of the more common methods we can implement
93
00:03:41.760 --> 00:03:44.223
to help figure out the best K value to use.
94
00:03:45.690 --> 00:03:47.050
It'll make sense when you see it,
95
00:03:47.050 --> 00:03:50.290
but the elbow method works by plotting explained variation
96
00:03:50.290 --> 00:03:52.880
as a function of the number of clusters.
97
00:03:52.880 --> 00:03:55.860
Then wherever we end up seeing the elbow of the curve,
98
00:03:55.860 --> 00:03:58.610
that'll indicate the value that we should choose for K.
99
00:04:00.180 --> 00:04:02.340
This is another time where I don't think the math
100
00:04:02.340 --> 00:04:04.790
is gonna be overly important for you to know.
101
00:04:04.790 --> 00:04:07.050
But the three major components that we're gonna be using
102
00:04:07.050 --> 00:04:08.910
are the total sum of squares,
103
00:04:08.910 --> 00:04:10.360
which is the total squared distance
104
00:04:10.360 --> 00:04:13.180
of every data point from the global mean,
105
00:04:13.180 --> 00:04:15.430
then we have the within sum of squares,
106
00:04:15.430 --> 00:04:17.590
which is the squared distance of every point
107
00:04:17.590 --> 00:04:20.530
from the centroid of their respective cluster,
108
00:04:20.530 --> 00:04:23.230
and finally, we have the between sum of squares,
109
00:04:23.230 --> 00:04:26.260
which is essentially the weighted squared distance of each centroid
110
00:04:26.260 --> 00:04:27.303
to the global mean.
111
00:04:28.530 --> 00:04:30.290
Now it's pretty self-explanatory,
112
00:04:30.290 --> 00:04:32.820
but those three are also related to each other
113
00:04:32.820 --> 00:04:34.370
because the total sum of squares
114
00:04:34.370 --> 00:04:37.503
is equal to the within sum added to the between sum.
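Written out, with $\bar{x}$ as the global mean, $c_k$ the centroid of cluster $k$, $C_k$ its members, and $n_k$ its size, that relationship is:

$$
\underbrace{\sum_{i} \lVert x_i - \bar{x} \rVert^2}_{\text{total}}
\;=\;
\underbrace{\sum_{k} \sum_{i \in C_k} \lVert x_i - c_k \rVert^2}_{\text{within}}
\;+\;
\underbrace{\sum_{k} n_k \lVert c_k - \bar{x} \rVert^2}_{\text{between}}
$$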
115
00:04:40.970 --> 00:04:42.920
Alrighty, getting back to the code,
116
00:04:42.920 --> 00:04:45.770
the setup for the elbow method is pretty straightforward.
117
00:04:46.880 --> 00:04:47.950
The first thing we need to do
118
00:04:47.950 --> 00:04:50.823
is define the range of K values we wanna test.
119
00:04:51.690 --> 00:04:53.890
After that, we're using list comprehension
120
00:04:53.890 --> 00:04:56.720
to make a list object containing every K-means model
121
00:04:56.720 --> 00:04:58.403
within the defined K range.
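A sketch of that setup, assuming scikit-learn's KMeans and the X_reduced matrix from the PCA step (the exact K range is an assumption):

```python
# Sketch: build and fit one K-means model for every K we want to test.
from sklearn.cluster import KMeans

k_range = range(1, 11)
kmeans_models = [KMeans(n_clusters=k, random_state=42).fit(X_reduced)
                 for k in k_range]
```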
122
00:04:59.580 --> 00:05:00.940
Then for the next three lines,
123
00:05:00.940 --> 00:05:02.050
we're gonna be calculating
124
00:05:02.050 --> 00:05:05.253
the within, total, and between sum of squares.
125
00:05:06.810 --> 00:05:07.950
To get the within sum,
126
00:05:07.950 --> 00:05:10.340
we're gonna be using list comprehension again,
127
00:05:10.340 --> 00:05:14.260
along with the inertia attribute from the K-means model,
128
00:05:14.260 --> 00:05:16.823
which if you check the documentation,
129
00:05:17.770 --> 00:05:19.940
it lets us know that the inertia attribute
130
00:05:19.940 --> 00:05:22.670
calculates the sum of squared distances of samples
131
00:05:22.670 --> 00:05:24.790
to the closest centroid.
132
00:05:24.790 --> 00:05:26.600
And I don't think we need to go back,
133
00:05:26.600 --> 00:05:28.390
but that was the exact definition
134
00:05:28.390 --> 00:05:30.060
of the within sum of squares.
135
00:05:30.060 --> 00:05:31.733
So we're all good for this one.
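So the within sum for every tested K might be collected like this, building on the list of fitted models sketched above:

```python
# Sketch: within-cluster sum of squares for each model,
# taken straight from the fitted models' inertia_ attribute.
wss = [model.inertia_ for model in kmeans_models]
```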
136
00:05:32.900 --> 00:05:34.210
Now, for the total sum,
137
00:05:34.210 --> 00:05:35.920
we're gonna be using a new function
138
00:05:35.920 --> 00:05:37.850
called pairwise distance,
139
00:05:37.850 --> 00:05:40.600
which we're getting from the SciPy library.
140
00:05:41.570 --> 00:05:44.450
And I'm starting to feel like it's a trend in this guide,
141
00:05:44.450 --> 00:05:45.770
but the name of the function
142
00:05:45.770 --> 00:05:47.920
is again pretty self-explanatory
143
00:05:47.920 --> 00:05:49.390
since we use it to calculate
144
00:05:49.390 --> 00:05:52.510
the pairwise distance between observations,
145
00:05:52.510 --> 00:05:55.170
which in our code, we're using to calculate the distance
146
00:05:55.170 --> 00:05:58.453
between every observation in the X reduced matrix.
147
00:06:01.890 --> 00:06:03.630
So the pairwise distance function
148
00:06:03.630 --> 00:06:05.930
starts with the first observation
149
00:06:05.930 --> 00:06:09.103
and calculates the distance to the second observation.
150
00:06:10.010 --> 00:06:12.740
Then it saves the result to a distance matrix.
151
00:06:12.740 --> 00:06:13.720
And then after that,
152
00:06:13.720 --> 00:06:17.850
it goes from the first observation to the third observation,
153
00:06:17.850 --> 00:06:20.620
and then from the first to the fourth and so on.
154
00:06:20.620 --> 00:06:23.910
And it'll do that until it gets through the entire matrix.
155
00:06:23.910 --> 00:06:24.830
Once it's done with that,
156
00:06:24.830 --> 00:06:27.100
it's gonna go back up to the second observation
157
00:06:27.100 --> 00:06:29.050
and calculate the distance to the third
158
00:06:29.050 --> 00:06:31.550
and from the second to the fourth and so on.
159
00:06:31.550 --> 00:06:33.350
And it keeps following that process
160
00:06:33.350 --> 00:06:35.673
until every unique distance is calculated.
161
00:06:38.930 --> 00:06:40.150
But anyway, after that,
162
00:06:40.150 --> 00:06:42.400
we're gonna be squaring all the distances,
163
00:06:42.400 --> 00:06:44.940
calculating the sum of all the distances,
164
00:06:44.940 --> 00:06:47.550
then dividing by the original number of observations,
165
00:06:47.550 --> 00:06:49.073
which for us was 1,000.
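Assuming the pairwise distance function is SciPy's pdist, that calculation might look like this sketch:

```python
# Sketch: total sum of squares -- square every unique pairwise distance,
# add them up, and divide by the number of observations (1,000 here).
import numpy as np
from scipy.spatial.distance import pdist

tss = np.sum(pdist(X_reduced) ** 2) / X_reduced.shape[0]
```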
166
00:06:50.100 --> 00:06:51.750
And finally, all we have to do
167
00:06:51.750 --> 00:06:53.160
to calculate the between sum
168
00:06:53.160 --> 00:06:55.733
is subtract the within sum from the total sum.
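Which is just one more line in the sketch:

```python
# Sketch: between sum of squares follows from total = within + between.
bss = [tss - w for w in wss]
```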
169
00:06:57.020 --> 00:07:00.000
Alrighty, now that we have all of that taken care of,
170
00:07:00.000 --> 00:07:02.670
we are all set up to build the elbow graph.
171
00:07:02.670 --> 00:07:04.540
And I think I already mentioned it,
172
00:07:04.540 --> 00:07:07.470
but the elbow graph works by plotting explained variation
173
00:07:07.470 --> 00:07:10.110
as a function of the number of clusters.
174
00:07:10.110 --> 00:07:13.220
So the range we're testing is gonna go along the x-axis
175
00:07:13.220 --> 00:07:16.010
and the explained variation along the y.
176
00:07:16.010 --> 00:07:17.370
And to be honest with you,
177
00:07:17.370 --> 00:07:18.670
it doesn't really matter
178
00:07:18.670 --> 00:07:21.860
if you decide to use the within sum or the between sum
179
00:07:21.860 --> 00:07:23.600
because as you'll see in a second,
180
00:07:23.600 --> 00:07:26.070
they'll both end up giving you the same result.
181
00:07:26.070 --> 00:07:29.530
But personally, I find myself using within sum
182
00:07:29.530 --> 00:07:32.130
just because it's the first thing we're calculating.
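A sketch of the elbow graph itself, plotting the within sum against the tested K range (matplotlib is an assumption):

```python
# Sketch: elbow graph -- within sum of squares as a function of K.
import matplotlib.pyplot as plt

plt.plot(list(k_range), wss, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Within sum of squares")
plt.title("Elbow method")
plt.show()
```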
183
00:07:37.020 --> 00:07:39.040
All right, so this is one of the reasons
184
00:07:39.040 --> 00:07:41.970
we're using a data set with a super obvious result,
185
00:07:41.970 --> 00:07:44.470
because what we refer to as the elbow
186
00:07:44.470 --> 00:07:46.600
isn't always this pronounced.
187
00:07:46.600 --> 00:07:48.580
So the way you interpret the elbow graph
188
00:07:48.580 --> 00:07:50.950
is by thinking of it as an arm.
189
00:07:50.950 --> 00:07:53.670
So you're gonna have the upper half right here
190
00:07:53.670 --> 00:07:56.160
and then the lower half down here.
191
00:07:56.160 --> 00:07:59.123
And most importantly, we're gonna have the elbow.
192
00:08:00.210 --> 00:08:03.660
So whichever K value looks most like the elbow
193
00:08:03.660 --> 00:08:05.350
is gonna be the suggested value
194
00:08:05.350 --> 00:08:07.210
that we should use in the model.
195
00:08:07.210 --> 00:08:08.270
And in our example,
196
00:08:08.270 --> 00:08:10.750
it's pretty obvious that K value of three
197
00:08:10.750 --> 00:08:12.303
looks the most like an elbow.
198
00:08:17.650 --> 00:08:18.930
For the actual K-means model,
199
00:08:18.930 --> 00:08:21.480
there isn't really a whole lot that we need to go through
200
00:08:21.480 --> 00:08:23.990
since we've already done most of it.
201
00:08:23.990 --> 00:08:27.090
At the top we have all of our imports and data.
202
00:08:27.090 --> 00:08:29.760
Then we made a variable for the K-means algorithm
203
00:08:29.760 --> 00:08:32.043
and set the number of clusters to three.
204
00:08:32.920 --> 00:08:36.080
After that, we created the model by using the fit method
205
00:08:36.080 --> 00:08:37.830
along with the X reduced matrix
206
00:08:37.830 --> 00:08:40.750
that we made by using the PCA algorithm.
207
00:08:40.750 --> 00:08:43.200
From there, we have the same old mesh grid
208
00:08:43.200 --> 00:08:45.950
that we've already made a few other times,
209
00:08:45.950 --> 00:08:47.240
followed by all of the code
210
00:08:47.240 --> 00:08:49.090
that we're gonna need for the visual.
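The guide's exact plotting code isn't shown here, but a sketch of that whole setup, assuming scikit-learn's KMeans and matplotlib, might look like this:

```python
# Sketch: K-means with three clusters on the reduced data, plus a mesh
# grid so we can shade each cluster's region and mark the centroids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X_reduced)  # X_reduced from the PCA sketch above

# Mesh grid covering the reduced feature space.
x_min, x_max = X_reduced[:, 0].min() - 1, X_reduced[:, 0].max() + 1
y_min, y_max = X_reduced[:, 1].min() - 1, X_reduced[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict a cluster for every grid point to draw the boundaries.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], s=10)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)  # the three centroids
plt.show()
```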
211
00:08:50.260 --> 00:08:52.193
So, now when we go ahead and run it,
212
00:08:55.040 --> 00:08:57.960
we end up with a clear and easy to read visual,
213
00:08:57.960 --> 00:09:00.470
showing us the location of the three centroids
214
00:09:00.470 --> 00:09:02.223
along with all of the boundaries.
215
00:09:04.130 --> 00:09:07.360
Alrighty, so the last topic we're gonna cover in this guide
216
00:09:07.360 --> 00:09:08.783
is the silhouette score.
217
00:09:12.590 --> 00:09:14.280
I'm pretty sure a couple of guides ago,
218
00:09:14.280 --> 00:09:15.560
I mentioned that clustering
219
00:09:15.560 --> 00:09:18.920
is often used for exploring unlabeled data.
220
00:09:18.920 --> 00:09:20.900
Meaning we have a bunch of features,
221
00:09:20.900 --> 00:09:23.563
but really no clue as to how they should be grouped.
222
00:09:24.640 --> 00:09:26.760
It obviously wasn't a problem in this example
223
00:09:26.760 --> 00:09:29.250
because the clusters were well separated.
224
00:09:29.250 --> 00:09:30.290
But a lot of times,
225
00:09:30.290 --> 00:09:32.410
the data is gonna be jumbled together
226
00:09:32.410 --> 00:09:34.400
making it really difficult to decipher
227
00:09:34.400 --> 00:09:36.763
the optimal number of clusters we should use.
228
00:09:37.640 --> 00:09:39.420
Now, we already covered the elbow method,
229
00:09:39.420 --> 00:09:41.600
which for this example worked really well
230
00:09:41.600 --> 00:09:45.060
since there was really just one obvious choice.
231
00:09:45.060 --> 00:09:47.440
But as you'd expect with real-world data,
232
00:09:47.440 --> 00:09:50.180
it's not always gonna be that straightforward.
233
00:09:50.180 --> 00:09:52.130
There's gonna be times where you might have
234
00:09:52.130 --> 00:09:54.230
two, three, or four different points
235
00:09:54.230 --> 00:09:56.500
that all look like they could be the elbow.
236
00:09:56.500 --> 00:09:59.333
So to help solve that, we can use the silhouette score.
237
00:10:06.410 --> 00:10:08.770
If you wanna take a look at the official documentation,
238
00:10:08.770 --> 00:10:09.670
you can.
239
00:10:09.670 --> 00:10:11.760
But basically, the silhouette score
240
00:10:11.760 --> 00:10:14.430
indicates how close clusters are to each other
241
00:10:14.430 --> 00:10:15.910
by returning a coefficient
242
00:10:15.910 --> 00:10:18.730
that ranges from negative one to one.
243
00:10:18.730 --> 00:10:21.740
The absolute best value we can get is one,
244
00:10:21.740 --> 00:10:23.500
the worst is negative one.
245
00:10:23.500 --> 00:10:24.900
And if zero is returned,
246
00:10:24.900 --> 00:10:27.100
that's when things start getting a little muddy
247
00:10:27.100 --> 00:10:29.823
because it's an indication clusters are overlapping.
248
00:10:30.700 --> 00:10:32.850
We'll come back to those numbers in just a minute,
249
00:10:32.850 --> 00:10:34.440
but for now let's move on to the code
250
00:10:34.440 --> 00:10:36.440
so we can run through that really quick.
251
00:10:38.290 --> 00:10:39.810
So a lot of this is kind of similar
252
00:10:39.810 --> 00:10:41.600
to what we did for the elbow method,
253
00:10:41.600 --> 00:10:43.150
but this time we're gonna be using a loop
254
00:10:43.150 --> 00:10:45.830
to iterate over a range of K values.
255
00:10:45.830 --> 00:10:48.230
After that, we're gonna be using the fit predict method
256
00:10:48.230 --> 00:10:51.643
to make cluster predictions for each model in that K range.
257
00:10:52.560 --> 00:10:54.610
Then we're using the X reduced matrix
258
00:10:54.610 --> 00:10:56.120
and the predicted cluster labels
259
00:10:56.120 --> 00:10:58.023
to calculate every silhouette score.
260
00:10:59.030 --> 00:11:01.340
And to make it a little easier to analyze,
261
00:11:01.340 --> 00:11:03.310
we're also throwing in a pandas data frame
262
00:11:03.310 --> 00:11:04.863
to help organize everything.
263
00:11:05.960 --> 00:11:07.260
Then finally for the graph,
264
00:11:07.260 --> 00:11:09.230
we're just plotting each silhouette score
265
00:11:09.230 --> 00:11:11.283
against their corresponding K value.
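A sketch of that whole block, assuming scikit-learn's silhouette_score, pandas, and matplotlib (the K range and variable names are mine):

```python
# Sketch: silhouette score for each tested K, organized in a DataFrame
# and plotted against the corresponding K value.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = []
k_values = list(range(2, 11))  # silhouette needs at least 2 clusters
for k in k_values:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_reduced)
    scores.append(silhouette_score(X_reduced, labels))

sil_df = pd.DataFrame({"k": k_values, "silhouette": scores})
print(sil_df)

plt.plot(sil_df["k"], sil_df["silhouette"], marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Silhouette score")
plt.show()
```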
266
00:11:20.090 --> 00:11:21.390
And since we're still pretending
267
00:11:21.390 --> 00:11:23.400
we don't know anything about the data,
268
00:11:23.400 --> 00:11:24.850
the graph and the data frame
269
00:11:24.850 --> 00:11:27.670
suggest we use a K value of three
270
00:11:27.670 --> 00:11:31.280
because it's yielding a silhouette score of 0.85,
271
00:11:31.280 --> 00:11:33.463
which is the closest value to one.
272
00:11:34.310 --> 00:11:38.040
Then in contrast, the worst K value we could use is nine,
273
00:11:38.040 --> 00:11:40.863
which only has a silhouette score of 0.33.
274
00:11:41.960 --> 00:11:44.970
And if we go back into the K-means visual,
275
00:11:44.970 --> 00:11:46.830
it's fairly apparent how well spread out
276
00:11:46.830 --> 00:11:48.580
each of the clusters and centroids are
277
00:11:48.580 --> 00:11:50.563
when we're using three for the K value.
278
00:11:51.550 --> 00:11:54.083
But now if we were to go and switch it to nine,
279
00:12:02.990 --> 00:12:05.000
it's also pretty easy to see
280
00:12:05.000 --> 00:12:10.000
how each of these three groups should be merged into one.
281
00:12:10.480 --> 00:12:12.420
And that's basically what I was talking about
282
00:12:12.420 --> 00:12:15.000
when I said we start to see clusters overlap
283
00:12:15.000 --> 00:12:17.850
when silhouette scores get closer to zero.
284
00:12:17.850 --> 00:12:19.890
So like this part right here
285
00:12:19.890 --> 00:12:23.080
is gonna be overlapping with what should be the blue,
286
00:12:23.080 --> 00:12:24.630
and then this third one over here
287
00:12:24.630 --> 00:12:27.530
is also gonna be overlapping with what should be the blue.
288
00:12:33.600 --> 00:12:34.690
And as I promised,
289
00:12:34.690 --> 00:12:37.830
that was the last topic we needed to cover in the guide.
290
00:12:37.830 --> 00:12:40.220
So now assuming everything made sense,
291
00:12:40.220 --> 00:12:41.560
you should know how to perform
292
00:12:41.560 --> 00:12:44.490
dimensionality reduction using PCA,
293
00:12:44.490 --> 00:12:46.760
how to apply the elbow method and silhouette score
294
00:12:46.760 --> 00:12:49.280
to help choose an appropriate K value,
295
00:12:49.280 --> 00:12:51.000
and how to build a K-means model
296
00:12:51.000 --> 00:12:52.923
with an easy to interpret visual.
297
00:12:53.920 --> 00:12:55.010
Now, in the next guide,
298
00:12:55.010 --> 00:12:56.850
we're gonna finish off the clustering section
299
00:12:56.850 --> 00:12:59.410
by creating a hierarchical clustering visual.
300
00:12:59.410 --> 00:13:00.520
But for now,
301
00:13:00.520 --> 00:13:03.270
I'll wrap things up and I'll see you in the next guide.