WEBVTT
1
00:00:05.340 --> 00:00:06.950
All right, in this guide,
2
00:00:06.950 --> 00:00:09.090
we're gonna be wrapping up the clustering section
3
00:00:09.090 --> 00:00:10.590
by going through how to build
4
00:00:10.590 --> 00:00:12.780
an agglomerative clustering model
5
00:00:12.780 --> 00:00:14.950
as well as how to create a dendrogram,
6
00:00:14.950 --> 00:00:18.053
which is the name of the visual for hierarchical clustering.
7
00:00:19.726 --> 00:00:21.350
Now, before we jump into the code,
8
00:00:21.350 --> 00:00:22.990
I wanna do a quick breakdown
9
00:00:22.990 --> 00:00:24.920
of some of the more important aspects
10
00:00:24.920 --> 00:00:26.400
of the dendrogram.
11
00:00:26.400 --> 00:00:29.300
I figure it's probably better to get it out of the way.
12
00:00:29.300 --> 00:00:30.850
That way we don't have to keep stopping
13
00:00:30.850 --> 00:00:33.310
when we're trying to go through all the code.
14
00:00:33.310 --> 00:00:37.100
So first and foremost, a dendrogram is a branching diagram
15
00:00:37.100 --> 00:00:39.750
that represents the relationships of similarity
16
00:00:39.750 --> 00:00:41.832
among a group of observations.
17
00:00:41.832 --> 00:00:43.733
Each branch is called a clade.
18
00:00:44.810 --> 00:00:48.180
The terminal end of each clade is called a leaf.
19
00:00:48.180 --> 00:00:50.436
Clades can have as few as one leaf
20
00:00:50.436 --> 00:00:52.170
but more often than not,
21
00:00:52.170 --> 00:00:54.110
they'll have at least two leaves
22
00:00:54.110 --> 00:00:56.553
and can go up to as many leaves as they want.
23
00:00:58.090 --> 00:01:00.280
Now, the last thing that you need to know about
24
00:01:00.280 --> 00:01:02.690
is the arrangements of each clade.
25
00:01:02.690 --> 00:01:04.730
It's probably the most important aspect
26
00:01:04.730 --> 00:01:05.640
of the dendrogram
27
00:01:05.640 --> 00:01:06.880
because it tells us which leaves
28
00:01:06.880 --> 00:01:09.660
are gonna be the most similar to each other.
29
00:01:09.660 --> 00:01:13.040
In this case, leaf one is most similar to leaf two
30
00:01:13.040 --> 00:01:16.615
because they're the first to form a clade with each other.
31
00:01:16.615 --> 00:01:18.990
Then the same is gonna be true for leaves four and five
32
00:01:18.990 --> 00:01:22.300
because they're the first to join with each other.
33
00:01:22.300 --> 00:01:23.640
We can even go beyond that
34
00:01:23.640 --> 00:01:27.117
and say that leaves one, two, four and five
35
00:01:27.117 --> 00:01:29.050
are more similar to each other
36
00:01:29.050 --> 00:01:31.000
than they are with leaf three
37
00:01:31.000 --> 00:01:33.620
because they're all forming a clade before they form one
38
00:01:33.620 --> 00:01:34.563
with leaf three.
39
00:01:35.760 --> 00:01:38.830
And finally, the height of each branch indicates
40
00:01:38.830 --> 00:01:41.920
how similar or different everything is from each other.
41
00:01:41.920 --> 00:01:44.470
The greater the height, the greater the difference.
42
00:01:45.340 --> 00:01:47.070
So we already know that leaves one and two
43
00:01:47.070 --> 00:01:49.860
are most closely related to each other.
44
00:01:49.860 --> 00:01:52.597
And the same is true for leaves four and five.
45
00:01:52.597 --> 00:01:54.270
Now, we can also say
46
00:01:54.270 --> 00:01:57.080
that leaves one and two are more closely related
47
00:01:57.080 --> 00:01:59.210
to each other than leaves four and five
48
00:01:59.210 --> 00:02:01.440
because the branch for leaves one and two
49
00:02:01.440 --> 00:02:04.780
is shorter than the branch for leaves four and five.
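To put numbers on the idea that branch height tracks difference, here's a minimal sketch using scipy's linkage function. The five positions are made up purely so the merges follow the same order as the picture (one and two first, then four and five, then those four together, and leaf three last); they are not from the guide's data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical 1-D positions for leaves 1..5 (row 0 is leaf one, row 4 is leaf five).
points = np.array([[0.0], [1.0], [20.0], [5.0], [7.0]])

# Each row of Z describes one merge: [cluster_a, cluster_b, height, number_of_leaves].
Z = linkage(points, method="single")

heights = Z[:, 2]
for a, b, height, size in Z:
    print(f"merge {int(a)} + {int(b)} at height {height:.1f} ({int(size)} leaves)")
```

The first row is the pair with the lowest height, which is the pair that is most alike, and the last row is where leaf three finally joins everything else.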
50
00:02:04.780 --> 00:02:06.380
All right, now that you know the basics
51
00:02:06.380 --> 00:02:07.550
of the dendrogram,
52
00:02:07.550 --> 00:02:09.500
let's get into some of the actual code.
53
00:02:16.180 --> 00:02:17.050
So to start off,
54
00:02:17.050 --> 00:02:18.360
we're gonna need to take a quick look
55
00:02:18.360 --> 00:02:19.910
at the data we're working with.
56
00:02:21.430 --> 00:02:23.050
And it looks like the data frame we're using
57
00:02:23.050 --> 00:02:26.070
has just under 600 rows with eight columns,
58
00:02:26.070 --> 00:02:28.780
including state_id, state_name,
59
00:02:28.780 --> 00:02:31.690
county_name, latitude, longitude,
60
00:02:31.690 --> 00:02:34.083
population, density and timezone.
61
00:02:35.360 --> 00:02:37.140
And you'll see why I'm doing it this way
62
00:02:37.140 --> 00:02:38.113
in just a minute.
63
00:02:39.700 --> 00:02:41.930
But we're also replacing the original index
64
00:02:41.930 --> 00:02:43.440
with the city column.
65
00:02:43.440 --> 00:02:45.820
That way, we can use it to label the dendrogram
66
00:02:45.820 --> 00:02:47.470
when we finally get to that part.
67
00:02:48.310 --> 00:02:50.313
Now in terms of the features,
68
00:02:53.010 --> 00:02:55.860
we're only gonna be using latitude and longitude,
69
00:02:55.860 --> 00:02:57.650
so all of the clustering decisions
70
00:02:57.650 --> 00:03:00.393
are based quite literally on distance or location.
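As a rough sketch of that setup, this is how the index swap and the feature selection might look in pandas. The column names (city, latitude, longitude) and the five rows of approximate coordinates are assumptions for illustration, not the actual file used in the guide.

```python
import pandas as pd

# Made-up rows with approximate coordinates; the column names are assumptions.
df = pd.DataFrame({
    "city": ["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"],
    "state_id": ["NV", "NV", "CA", "CA", "CA"],
    "latitude": [36.17, 36.11, 33.87, 33.88, 33.75],
    "longitude": [-115.14, -115.24, -117.93, -117.57, -116.97],
})

# Replace the default integer index with the city names so they can label the dendrogram later.
df = df.set_index("city")

# Keep only the two location columns as clustering features.
X = df[["latitude", "longitude"]]
print(X)
```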
71
00:03:02.430 --> 00:03:04.140
And we'll get to the rest of the code
72
00:03:04.140 --> 00:03:05.293
in just a second.
73
00:03:06.990 --> 00:03:08.080
But I thought it would be fun
74
00:03:08.080 --> 00:03:11.160
to throw in this new package called Cartopy,
75
00:03:11.160 --> 00:03:12.810
which is a pretty cool graphing tool
76
00:03:12.810 --> 00:03:15.200
that works directly with matplotlib.
77
00:03:15.200 --> 00:03:16.270
And by their own description, it
78
00:03:16.270 --> 00:03:19.890
is a package designed for geospatial data processing
79
00:03:19.890 --> 00:03:21.543
in order to produce maps.
80
00:03:22.540 --> 00:03:24.951
I've only used the package a couple times
81
00:03:24.951 --> 00:03:27.190
but it looks like they have a bunch
82
00:03:27.190 --> 00:03:28.840
of different ways it can be used.
83
00:03:29.800 --> 00:03:32.143
So if it looks like something you might wanna mess with,
84
00:03:32.143 --> 00:03:34.320
feel free to do the install.
85
00:03:34.320 --> 00:03:36.470
But this is the only time that we're gonna be using it,
86
00:03:36.470 --> 00:03:38.570
so it's really not that important to have.
87
00:03:40.250 --> 00:03:41.810
Now, to get back to it,
88
00:03:41.810 --> 00:03:43.120
let's take a quick look
89
00:03:43.120 --> 00:03:44.540
at what we've built so far
90
00:03:44.540 --> 00:03:45.990
and then go through the code.
91
00:03:48.530 --> 00:03:50.073
And that's pretty cool, right?
92
00:03:52.490 --> 00:03:54.960
All righty, so starting with the imports,
93
00:03:54.960 --> 00:03:56.960
we're gonna be using the crs module
94
00:03:56.960 --> 00:04:00.820
to define and transform the map's coordinate system.
95
00:04:00.820 --> 00:04:02.870
And then the features module allows us
96
00:04:02.870 --> 00:04:05.520
to add in all of the different features we wanna use.
97
00:04:08.400 --> 00:04:11.570
Then the three lines right here define
98
00:04:11.570 --> 00:04:14.080
what section of the map we're looking at.
99
00:04:14.080 --> 00:04:16.070
The first two lines are where we want the map
100
00:04:16.070 --> 00:04:17.540
to be centered
101
00:04:17.540 --> 00:04:19.670
and the third line sets the latitude
102
00:04:19.670 --> 00:04:22.540
and longitude min, max values for the map.
103
00:04:22.540 --> 00:04:24.790
So this is kinda like setting the window
104
00:04:24.790 --> 00:04:26.040
that we'll be looking at.
105
00:04:27.480 --> 00:04:30.263
Now, the rest of it works on top of matplotlib.
106
00:04:31.200 --> 00:04:33.350
So after we adjust the figure size,
107
00:04:33.350 --> 00:04:36.100
we're projecting the map along the matplotlib axis,
108
00:04:36.100 --> 00:04:38.513
using the coordinate system we just defined.
109
00:04:40.450 --> 00:04:42.070
Then in the next five lines,
110
00:04:42.070 --> 00:04:43.720
we're just adding in some features
111
00:04:43.720 --> 00:04:45.843
like oceans and state borders.
112
00:04:48.050 --> 00:04:50.400
Now, to actually overlay the scatterplot,
113
00:04:50.400 --> 00:04:52.760
we have to make sure to transform the data
114
00:04:52.760 --> 00:04:56.040
so it's using the same coordinate system as the map.
115
00:04:56.040 --> 00:04:58.620
Then we change what's called the zorder
116
00:04:58.620 --> 00:05:01.120
to make sure the scatterplot is the top layer
117
00:05:01.120 --> 00:05:02.063
of the visual.
118
00:05:04.190 --> 00:05:05.440
We'll be coming back to the map
119
00:05:05.440 --> 00:05:07.050
to use as a reference.
120
00:05:07.050 --> 00:05:08.680
But for now, we're gonna be moving on
121
00:05:08.680 --> 00:05:10.240
to the hierarchical model
122
00:05:10.240 --> 00:05:11.853
and building the dendrogram.
123
00:05:15.770 --> 00:05:17.680
All right, so to build the model,
124
00:05:17.680 --> 00:05:20.700
we're gonna first need to import the agglomerative algorithm
125
00:05:20.700 --> 00:05:23.410
from the cluster module in scikit-learn.
126
00:05:23.410 --> 00:05:26.120
We're also gonna be going back to the scipy library
127
00:05:26.120 --> 00:05:29.023
to import the linkage and dendrogram functions.
128
00:05:34.310 --> 00:05:35.930
Now, if you wanna take a closer look
129
00:05:35.930 --> 00:05:38.400
at the linkage documentation, you can
130
00:05:38.400 --> 00:05:39.910
but its general purpose
131
00:05:39.910 --> 00:05:42.240
is to give us a way of creating an object
132
00:05:42.240 --> 00:05:44.580
that represents the hierarchical clustering
133
00:05:44.580 --> 00:05:47.733
of a dataset encoded as a linkage matrix.
134
00:05:49.200 --> 00:05:50.760
And just in case you forgot,
135
00:05:50.760 --> 00:05:52.960
linkage refers to where in the cluster
136
00:05:52.960 --> 00:05:54.780
we're measuring distance from,
137
00:05:54.780 --> 00:05:56.160
so for a single linkage,
138
00:05:56.160 --> 00:05:58.500
which happens to be the default method,
139
00:05:58.500 --> 00:06:00.680
it works by measuring the shortest distance
140
00:06:00.680 --> 00:06:03.323
between a pair of observations in two clusters.
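Here's a small sketch of what the linkage matrix looks like, assuming five cities with approximate coordinates (illustrative values, not the guide's dataset). It checks that with single linkage the first merge sits at the smallest pairwise distance in the data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Approximate (latitude, longitude) pairs, for illustration only.
cities = ["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"]
X = np.array([
    [36.17, -115.14],
    [36.11, -115.24],
    [33.87, -117.93],
    [33.88, -117.57],
    [33.75, -116.97],
])

# method="single" is scipy's default; it is spelled out here for clarity.
Z = linkage(X, method="single", metric="euclidean")

# n observations give n - 1 merges, each stored as [cluster_a, cluster_b, distance, size].
print(Z.shape)

# The first merge joins the two observations with the smallest pairwise distance.
closest = pdist(X).min()
first_a, first_b = int(Z[0, 0]), int(Z[0, 1])
print(cities[first_a], "+", cities[first_b], "at distance", round(Z[0, 2], 3))
```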
141
00:06:04.760 --> 00:06:07.733
Now, for the dendrogram documentation,
142
00:06:10.220 --> 00:06:11.410
it says that it can be used
143
00:06:11.410 --> 00:06:14.373
to plot the hierarchical clustering as a dendrogram.
144
00:06:15.680 --> 00:06:17.310
Now, in order to do that,
145
00:06:17.310 --> 00:06:19.940
the very first parameter the function calls for
146
00:06:19.940 --> 00:06:21.730
is the linkage matrix
147
00:06:21.730 --> 00:06:24.303
that we just so conveniently happen to make.
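A minimal sketch of that call, again on five approximate city coordinates that are only stand-ins for the real data. Passing no_plot=True makes dendrogram return the layout as a dictionary without drawing anything; leaving it out draws the tree with matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Approximate (latitude, longitude) pairs, for illustration only.
cities = ["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"]
X = np.array([
    [36.17, -115.14],
    [36.11, -115.24],
    [33.87, -117.93],
    [33.88, -117.57],
    [33.75, -116.97],
])

Z = linkage(X, method="single")

# The linkage matrix is the first argument; labels names each leaf.
tree = dendrogram(Z, labels=cities, no_plot=True)

# "ivl" holds the leaf labels in the order they appear along the axis.
leaf_order = tree["ivl"]
print(leaf_order)
```

Leaves that merge first end up next to each other in that order, which is why the closest cities sit side by side on the plot.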
148
00:06:25.510 --> 00:06:28.320
And I think before we start talking about the parameters,
149
00:06:28.320 --> 00:06:30.220
we should probably run the code
150
00:06:30.220 --> 00:06:32.743
so we can start talking about the actual result.
151
00:06:43.170 --> 00:06:45.730
All right, so we're gonna switch back and forth
152
00:06:45.730 --> 00:06:47.340
between this and the map
153
00:06:47.340 --> 00:06:49.320
but according to the dendrogram,
154
00:06:49.320 --> 00:06:51.460
it ended up creating two main clusters,
155
00:06:51.460 --> 00:06:54.460
which I guess we can just call the orange and green clusters.
156
00:06:55.940 --> 00:06:57.730
But regardless of the cluster,
157
00:06:57.730 --> 00:06:59.590
the two cities that appear to be the closest
158
00:06:59.590 --> 00:07:03.120
to each other are Las Vegas and Spring Valley
159
00:07:03.120 --> 00:07:04.380
and we can make that assumption
160
00:07:04.380 --> 00:07:06.280
because they have the shortest branch.
161
00:07:11.190 --> 00:07:12.890
I don't know if that helped at all
162
00:07:12.890 --> 00:07:15.440
but it seems like it might be a little better
163
00:07:15.440 --> 00:07:18.000
but anyway, Vegas and Spring Valley
164
00:07:18.000 --> 00:07:20.170
are pretty much completely overlapping,
165
00:07:20.170 --> 00:07:22.970
so it makes sense that those two cities are the closest.
166
00:07:23.980 --> 00:07:26.490
Then after that it was Fullerton, Corona
167
00:07:26.490 --> 00:07:28.160
and I think it was Hemet
168
00:07:28.160 --> 00:07:30.470
if that's even how you pronounce it.
169
00:07:30.470 --> 00:07:33.163
But that's gonna be this little group over here.
170
00:07:34.060 --> 00:07:37.090
And kind of like it was with Vegas and Spring Valley,
171
00:07:37.090 --> 00:07:40.060
since Fullerton and Corona are so close to each other,
172
00:07:40.060 --> 00:07:41.300
it makes it really hard
173
00:07:41.300 --> 00:07:43.890
to see all three of the observations.
174
00:07:43.890 --> 00:07:45.190
Then if you really wanted to,
175
00:07:45.190 --> 00:07:46.930
you could go through and do the same thing
176
00:07:46.930 --> 00:07:48.410
with all of the other cities
177
00:07:48.410 --> 00:07:50.853
but I think you get the gist of what's going on.
178
00:08:00.220 --> 00:08:02.580
So to finish off with the dendrogram code,
179
00:08:02.580 --> 00:08:04.950
the two other parameters I wanted to talk about
180
00:08:04.950 --> 00:08:07.620
are p and truncate_mode.
181
00:08:07.620 --> 00:08:09.840
It's not a problem in this example
182
00:08:09.840 --> 00:08:11.520
but a lot of the times when you're working
183
00:08:11.520 --> 00:08:13.310
with a really large dataset,
184
00:08:13.310 --> 00:08:15.510
you're gonna need to condense the dendrogram
185
00:08:15.510 --> 00:08:16.983
in order to make it readable.
186
00:08:17.990 --> 00:08:19.730
And the easiest way of doing that
187
00:08:19.730 --> 00:08:22.500
is by using the truncate_mode parameter.
188
00:08:22.500 --> 00:08:24.850
There's a couple of different ways of doing it.
189
00:08:28.540 --> 00:08:30.030
But the first one that we'll go over
190
00:08:30.030 --> 00:08:31.183
is called lastp.
191
00:08:34.670 --> 00:08:36.550
And when we use lastp,
192
00:08:36.550 --> 00:08:39.440
it only gives us the last p merges.
193
00:08:39.440 --> 00:08:42.793
So if I leave p at 20, and run the code,
194
00:08:44.620 --> 00:08:45.630
nothing changes
195
00:08:45.630 --> 00:08:49.590
because the last 20 merges still include every merge
196
00:08:50.500 --> 00:08:53.253
but if I move p down to 10,
197
00:08:58.800 --> 00:09:02.420
the dendrogram only shows the last 10 merges.
198
00:09:02.420 --> 00:09:05.390
Then the numbers inside the parentheses represent the number
199
00:09:05.390 --> 00:09:07.163
of observations in each leaf.
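To see lastp on something big enough to need it, here's a sketch on 30 random points (synthetic data, not the city dataset). It assumes scipy's usual behavior that the condensed tree has p leaves and that a leaf written as a number in parentheses stands for that many observations.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic 2-D points, only here to give the tree some depth.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
Z = linkage(points, method="single")

# Full tree: one leaf per observation.
full = dendrogram(Z, no_plot=True)

# Condensed tree: everything below the last merges is collapsed into p leaves.
condensed = dendrogram(Z, p=10, truncate_mode="lastp", no_plot=True)
print(condensed["ivl"])

# A label like "(4)" is a collapsed leaf holding 4 observations; a plain label is one observation.
leaf_sizes = [int(lbl.strip("()")) if lbl.startswith("(") else 1 for lbl in condensed["ivl"]]
print(sum(leaf_sizes))
```

Adding up the leaf sizes gets you back to the original number of observations, so nothing is dropped, it's just hidden inside the collapsed leaves.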
200
00:09:09.750 --> 00:09:12.393
The other option that we have is called level.
201
00:09:16.310 --> 00:09:18.950
And basically, level's gonna be truncating based
202
00:09:18.950 --> 00:09:20.610
on the number of nodes.
203
00:09:20.610 --> 00:09:23.010
So right now, if we count the number of nodes
204
00:09:23.010 --> 00:09:25.573
on the left-hand side of the green cluster,
205
00:09:26.480 --> 00:09:30.730
we're gonna have one, two, three, four, five,
206
00:09:30.730 --> 00:09:32.570
six and seven.
207
00:09:32.570 --> 00:09:34.883
So if I change the p value to seven,
208
00:09:42.280 --> 00:09:44.120
nothing happens.
209
00:09:44.120 --> 00:09:46.983
But if we move it down to six,
210
00:09:49.930 --> 00:09:53.123
we end up losing a level along that same pathway.
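The level option can be sketched the same way, again on 30 synthetic points rather than the guide's data. The only thing this example relies on is that a p deeper than the tree changes nothing, while a small p collapses the deeper branches into fewer leaves.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Synthetic 2-D points, only here to give the tree some depth.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
Z = linkage(points, method="single")

# A p larger than the depth of the tree leaves every observation as its own leaf.
deep = dendrogram(Z, p=30, truncate_mode="level", no_plot=True)

# A small p keeps only the top few levels and collapses whatever sits below them.
shallow = dendrogram(Z, p=2, truncate_mode="level", no_plot=True)

print(len(deep["ivl"]), len(shallow["ivl"]))
```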
211
00:09:54.350 --> 00:09:56.800
But since we're using such a small dataset,
212
00:09:56.800 --> 00:09:59.160
it's not something we're really gonna have to worry about.
213
00:09:59.160 --> 00:10:00.793
So we can just leave it at None.
214
00:10:11.580 --> 00:10:13.050
All right, so that leaves us
215
00:10:13.050 --> 00:10:15.290
with just one last topic to cover
216
00:10:15.290 --> 00:10:17.663
and that's setting up the agglomerative model.
217
00:10:18.840 --> 00:10:20.340
A lot of the same stuff we've covered
218
00:10:20.340 --> 00:10:22.160
in the other guides is still gonna hold true
219
00:10:22.160 --> 00:10:23.960
for what we're doing here.
220
00:10:23.960 --> 00:10:25.770
So just like we've always done,
221
00:10:25.770 --> 00:10:28.580
we're assigning the algorithm to a variable.
222
00:10:28.580 --> 00:10:31.223
Then using the fit function to generate the model.
223
00:10:32.070 --> 00:10:34.040
And in the last guide, we even talked
224
00:10:34.040 --> 00:10:36.300
about a couple different methods we can implement
225
00:10:36.300 --> 00:10:39.240
to help us choose the best k value for a model.
226
00:10:39.240 --> 00:10:41.140
But I don't think it's really necessary
227
00:10:41.140 --> 00:10:43.720
for us to do those in this example.
228
00:10:43.720 --> 00:10:46.670
So there honestly isn't too much for us to go through
229
00:10:46.670 --> 00:10:48.760
other than comparing the results of the model
230
00:10:48.760 --> 00:10:49.923
to the dendrogram.
231
00:10:50.760 --> 00:10:52.700
Now, before we go over to the console,
232
00:10:52.700 --> 00:10:54.090
it probably wouldn't hurt
233
00:10:54.090 --> 00:10:56.230
to go over these last few lines of code,
234
00:10:56.230 --> 00:10:58.080
especially since we haven't done too much
235
00:10:58.080 --> 00:10:59.183
with pandas lately.
236
00:11:01.280 --> 00:11:02.910
So after we fit the model,
237
00:11:02.910 --> 00:11:04.580
we're using the labels_ attribute
238
00:11:04.580 --> 00:11:06.630
to get the array of cluster labels
239
00:11:06.630 --> 00:11:08.653
that each observation was assigned to.
240
00:11:09.760 --> 00:11:12.550
Then we're creating a new data frame called X_clustered
241
00:11:13.560 --> 00:11:15.550
and using pandas' assign method
242
00:11:15.550 --> 00:11:18.513
to add a new label column to the features data frame.
243
00:11:19.730 --> 00:11:21.040
Now to finish things off,
244
00:11:21.040 --> 00:11:23.000
we're making one last data frame
245
00:11:23.000 --> 00:11:25.030
that's made up of all of the observations
246
00:11:25.030 --> 00:11:26.173
that were labeled one.
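Those steps can be sketched roughly like this. The five cities and their approximate coordinates are stand-ins for the real data, and n_clusters=2 with linkage="single" are assumptions chosen to mirror the dendrogram rather than settings confirmed in the guide.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Approximate coordinates, for illustration only.
X = pd.DataFrame(
    {
        "latitude": [36.17, 36.11, 33.87, 33.88, 33.75],
        "longitude": [-115.14, -115.24, -117.93, -117.57, -116.97],
    },
    index=pd.Index(
        ["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"], name="city"
    ),
)

# Assign the algorithm to a variable, then fit it (both settings are assumptions).
model = AgglomerativeClustering(n_clusters=2, linkage="single")
model.fit(X)

# labels_ is an array with one cluster label per observation, in row order.
labels = model.labels_

# assign returns a new data frame with the extra column; X itself is left unchanged.
X_clustered = X.assign(label=labels)

# Keep only the observations that were labeled one.
cluster_one = X_clustered[X_clustered["label"] == 1]
print(cluster_one)
```

Which group ends up as label one is arbitrary, so it's worth looking at the members of each group and not at the number itself.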
247
00:11:28.320 --> 00:11:31.083
All right, so now we can move back over to the console.
248
00:11:40.840 --> 00:11:42.450
And when we compare the eight cities
249
00:11:42.450 --> 00:11:44.850
that were labeled one by the agglomerative model
250
00:11:44.850 --> 00:11:47.170
to the orange cluster of the dendrogram,
251
00:11:47.170 --> 00:11:49.283
we end up getting the exact same result.
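If you'd rather check that agreement in code than by eye, one option (not something done in the video) is to cut the scipy tree into two groups with fcluster and compare that to the scikit-learn labels. This sketch assumes both sides use single linkage and uses the same approximate, made-up coordinates as a stand-in for the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Approximate (latitude, longitude) pairs, for illustration only.
X = np.array([
    [36.17, -115.14],
    [36.11, -115.24],
    [33.87, -117.93],
    [33.88, -117.57],
    [33.75, -116.97],
])

# Cut the dendrogram's linkage matrix into at most two flat clusters.
Z = linkage(X, method="single")
tree_labels = fcluster(Z, t=2, criterion="maxclust")

# Fit the agglomerative model with the same linkage and the same number of clusters.
model_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# The label numbers can differ between libraries, so compare the groupings instead.
score = adjusted_rand_score(tree_labels, model_labels)
print(score)
```

A score of 1.0 means both approaches put exactly the same observations together.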
252
00:11:50.400 --> 00:11:52.460
So there you have it.
253
00:11:52.460 --> 00:11:53.560
You now officially know
254
00:11:53.560 --> 00:11:56.010
how to perform hierarchical clustering
255
00:11:56.010 --> 00:11:58.380
by either creating an agglomerative model
256
00:11:58.380 --> 00:12:00.270
or by making a dendrogram.
257
00:12:02.550 --> 00:12:04.900
And I know you're probably super sad
258
00:12:04.900 --> 00:12:08.060
but that does bring us to the end of the clustering section
259
00:12:08.060 --> 00:12:09.670
but fortunately for us,
260
00:12:09.670 --> 00:12:11.580
it also means that we're finally breaking
261
00:12:11.580 --> 00:12:13.780
into the neural network section.
262
00:12:13.780 --> 00:12:16.620
So until then, I'll wrap things up
263
00:12:16.620 --> 00:12:18.370
and I'll see you in the next guide.