Guide to Hierarchical Visualizations

WEBVTT

1
00:00:05.340 --> 00:00:06.950
All right, in this guide,

2
00:00:06.950 --> 00:00:09.090
we're gonna be wrapping up the clustering section

3
00:00:09.090 --> 00:00:10.590
by going through how to build

4
00:00:10.590 --> 00:00:12.780
an agglomerative clustering model

5
00:00:12.780 --> 00:00:14.950
as well as how to create a dendrogram,

6
00:00:14.950 --> 00:00:18.053
which is the name of the visual for hierarchical clustering.

7
00:00:19.726 --> 00:00:21.350
Now, before we jump into the code,

8
00:00:21.350 --> 00:00:22.990
I wanna do a quick breakdown

9
00:00:22.990 --> 00:00:24.920
of some of the more important aspects

10
00:00:24.920 --> 00:00:26.400
of the dendrogram.

11
00:00:26.400 --> 00:00:29.300
I figure it's probably better to get it out of the way.

12
00:00:29.300 --> 00:00:30.850
That way we don't have to keep stopping

13
00:00:30.850 --> 00:00:33.310
when we're trying to go through all the code.

14
00:00:33.310 --> 00:00:37.100
So first and foremost, a dendrogram is a branching diagram

15
00:00:37.100 --> 00:00:39.750
that represents the relationships of similarity

16
00:00:39.750 --> 00:00:41.832
among a group of observations.

17
00:00:41.832 --> 00:00:43.733
Each branch is called a clade.

18
00:00:44.810 --> 00:00:48.180
The terminal end of each clade is called a leaf.

19
00:00:48.180 --> 00:00:50.436
Clades can have as few as one leaf

20
00:00:50.436 --> 00:00:52.170
but more often than not,

21
00:00:52.170 --> 00:00:54.110
they'll have at least two leaves

22
00:00:54.110 --> 00:00:56.553
and can go up to as many leaves as they want.

23
00:00:58.090 --> 00:01:00.280
Now, the last thing that you need to know about

24
00:01:00.280 --> 00:01:02.690
is the arrangements of each clade.

25
00:01:02.690 --> 00:01:04.730
It's probably the most important aspect

26
00:01:04.730 --> 00:01:05.640
of the dendrogram

27
00:01:05.640 --> 00:01:06.880
because it tells us which leaves

28
00:01:06.880 --> 00:01:09.660
are gonna be the most similar to each other.

29
00:01:09.660 --> 00:01:13.040
In this case, leaf one is most similar to leaf two

30
00:01:13.040 --> 00:01:16.615
because they're the first to form a clade with each other.

31
00:01:16.615 --> 00:01:18.990
Then the same is gonna be true for leaves four and five

32
00:01:18.990 --> 00:01:22.300
because they're the first to join with each other.

33
00:01:22.300 --> 00:01:23.640
We can even go beyond that

34
00:01:23.640 --> 00:01:27.117
and say that leaves one, two, four and five

35
00:01:27.117 --> 00:01:29.050
are more similar to each other

36
00:01:29.050 --> 00:01:31.000
than they are with leaf three

37
00:01:31.000 --> 00:01:33.620
because they're all forming a clade before they form one

38
00:01:33.620 --> 00:01:34.563
with leaf three.

39
00:01:35.760 --> 00:01:38.830
And finally, the height of each branch indicates

40
00:01:38.830 --> 00:01:41.920
how similar or different everything is from each other.

41
00:01:41.920 --> 00:01:44.470
The greater the height, the greater the difference.

42
00:01:45.340 --> 00:01:47.070
So we already know that leaves one and two

43
00:01:47.070 --> 00:01:49.860
are most closely related to each other.

44
00:01:49.860 --> 00:01:52.597
And the same is true for leaves four and five.

45
00:01:52.597 --> 00:01:54.270
Now, we can also say

46
00:01:54.270 --> 00:01:57.080
that leaves one and two are more closely related

47
00:01:57.080 --> 00:01:59.210
to each other than leaves four and five

48
00:01:59.210 --> 00:02:01.440
because the branch for leaves one and two

49
00:02:01.440 --> 00:02:04.780
is shorter than the branch for leaves four and five.
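
NOTE
Quick sketch, not the code from the video: here's one way to see those merge heights as numbers.
The five 1-D values are made up so that leaves one and two join first (height 1), then four and five (height 2),
then those two clades together (height 3), and leaf three last (height 14).
```python
# Sketch only: the five 1-D values below are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage
# Row order is leaf 1, leaf 2, leaf 3, leaf 4, leaf 5 (row indices 0 to 4).
points = np.array([[0.0], [1.0], [20.0], [4.0], [6.0]])
# Each row of Z records one merge: [cluster_a, cluster_b, height, size].
Z = linkage(points, method="single")
for a, b, height, size in Z:
    print(f"merge {int(a)} + {int(b)} at height {height:.1f} ({int(size)} leaves)")
```
Ids 5 and up in the printout are the clades created by earlier merges, so the lower the height, the more similar the things being joined.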

50
00:02:04.780 --> 00:02:06.380
All right, now that you know the basics

51
00:02:06.380 --> 00:02:07.550
of the dendrogram,

52
00:02:07.550 --> 00:02:09.500
let's get into some of the actual code.

53
00:02:16.180 --> 00:02:17.050
So to start off,

54
00:02:17.050 --> 00:02:18.360
we're gonna need to take a quick look

55
00:02:18.360 --> 00:02:19.910
at the data we're working with.

56
00:02:21.430 --> 00:02:23.050
And it looks like the data frame we're using

57
00:02:23.050 --> 00:02:26.070
has just under 600 rows with eight columns,

58
00:02:26.070 --> 00:02:28.780
including state_id, state_name,

59
00:02:28.780 --> 00:02:31.690
county_name, latitude, longitude,

60
00:02:31.690 --> 00:02:34.083
population, density and timezone.

61
00:02:35.360 --> 00:02:37.140
And you'll see why I'm doing it this way

62
00:02:37.140 --> 00:02:38.113
in just a minute.

63
00:02:39.700 --> 00:02:41.930
But we're also replacing the original index

64
00:02:41.930 --> 00:02:43.440
with the city column.

65
00:02:43.440 --> 00:02:45.820
That way, we can use it to label the dendrogram

66
00:02:45.820 --> 00:02:47.470
when we finally get to that part.

67
00:02:48.310 --> 00:02:50.313
Now in terms of the features,

68
00:02:53.010 --> 00:02:55.860
we're only gonna be using latitude and longitude,

69
00:02:55.860 --> 00:02:57.650
so all of the clustering decisions

70
00:02:57.650 --> 00:03:00.393
are based quite literally on distance or location.
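
NOTE
A minimal sketch of that setup, assuming a pandas data frame with a city column.
The three rows and their coordinates are approximate stand-ins, not the real dataset, and the names cities and X are just illustrative.
```python
import pandas as pd
# Hypothetical rows standing in for the real data; coordinates are approximate.
cities = pd.DataFrame({
    "city": ["Las Vegas", "Spring Valley", "Fullerton"],
    "state_id": ["NV", "NV", "CA"],
    "latitude": [36.17, 36.11, 33.87],
    "longitude": [-115.14, -115.24, -117.93],
})
# Use the city name as the index so it can label the dendrogram later.
cities = cities.set_index("city")
# Keep only the two location columns as clustering features.
X = cities[["latitude", "longitude"]]
print(X)
```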

71
00:03:02.430 --> 00:03:04.140
And we'll get to the rest of the code

72
00:03:04.140 --> 00:03:05.293
in just a second.

73
00:03:06.990 --> 00:03:08.080
But I thought it would be fun

74
00:03:08.080 --> 00:03:11.160
to throw in this new package called Cartopy,

75
00:03:11.160 --> 00:03:12.810
which is a pretty cool graphing tool

76
00:03:12.810 --> 00:03:15.200
that works directly with matplotlib.

77
00:03:15.200 --> 00:03:16.270
And by their own description

78
00:03:16.270 --> 00:03:19.890
it's a package designed for geospatial data processing

79
00:03:19.890 --> 00:03:21.543
in order to produce maps.

80
00:03:22.540 --> 00:03:24.951
I've only used the package a couple times

81
00:03:24.951 --> 00:03:27.190
but it looks like they have a bunch

82
00:03:27.190 --> 00:03:28.840
of different ways it can be used.

83
00:03:29.800 --> 00:03:32.143
So if it looks like something you might wanna mess with,

84
00:03:32.143 --> 00:03:34.320
feel free to do the install.

85
00:03:34.320 --> 00:03:36.470
But this is the only time that we're gonna be using it,

86
00:03:36.470 --> 00:03:38.570
so it's really not that important to have.

87
00:03:40.250 --> 00:03:41.810
Now, to get back to it,

88
00:03:41.810 --> 00:03:43.120
let's take a quick look

89
00:03:43.120 --> 00:03:44.540
at what we've built so far

90
00:03:44.540 --> 00:03:45.990
and then go through the code.

91
00:03:48.530 --> 00:03:50.073
And that's pretty cool, right?

92
00:03:52.490 --> 00:03:54.960
All righty, so starting with the imports,

93
00:03:54.960 --> 00:03:56.960
we're gonna be using the crs module

94
00:03:56.960 --> 00:04:00.820
to define and transform the map's coordinate system.

95
00:04:00.820 --> 00:04:02.870
And then the features module allows us

96
00:04:02.870 --> 00:04:05.520
to add in all of the different features we wanna use.

97
00:04:08.400 --> 00:04:11.570
Then the three lines right here define

98
00:04:11.570 --> 00:04:14.080
what section of the map we're looking at.

99
00:04:14.080 --> 00:04:16.070
The first two lines are where we want the map

100
00:04:16.070 --> 00:04:17.540
to be centered

101
00:04:17.540 --> 00:04:19.670
and the third line sets the latitude

102
00:04:19.670 --> 00:04:22.540
and longitude min, max values for the map.

103
00:04:22.540 --> 00:04:24.790
So this is kinda like setting the window

104
00:04:24.790 --> 00:04:26.040
that we'll be looking at.

105
00:04:27.480 --> 00:04:30.263
Now, the rest of it works on top of matplotlib.

106
00:04:31.200 --> 00:04:33.350
So after we adjust the figure size,

107
00:04:33.350 --> 00:04:36.100
we're projecting the map along the matplotlib axis,

108
00:04:36.100 --> 00:04:38.513
using the coordinate system we just defined.

109
00:04:40.450 --> 00:04:42.070
Then in the next five lines,

110
00:04:42.070 --> 00:04:43.720
we're just adding in some features

111
00:04:43.720 --> 00:04:45.843
like oceans and state borders.

112
00:04:48.050 --> 00:04:50.400
Now, to actually overlay the scatterplot,

113
00:04:50.400 --> 00:04:52.760
we have to make sure to transform the data

114
00:04:52.760 --> 00:04:56.040
so it's using the same coordinate system as the map.

115
00:04:56.040 --> 00:04:58.620
Then change what's called the z order

116
00:04:58.620 --> 00:05:01.120
to make sure the scatterplot is the top layer

117
00:05:01.120 --> 00:05:02.063
of the visual.
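
NOTE
Here's a rough sketch of what a map like that could look like in code. It is not the exact script from the video.
The function name draw_city_map, the window numbers, the projection choice and the styling are all assumptions.
Cartopy is imported inside the function so the rest of a script still loads if it isn't installed.
```python
import matplotlib.pyplot as plt
def draw_city_map(lons, lats, extent=(-125.0, -110.0, 31.0, 43.0)):
    """Plot points on a Cartopy map; extent is (lon_min, lon_max, lat_min, lat_max)."""
    # Imported here because Cartopy is optional for the rest of this guide.
    import cartopy.crs as ccrs
    import cartopy.feature as cfeature
    # Center the projection on the middle of the window.
    central_lon = (extent[0] + extent[1]) / 2
    central_lat = (extent[2] + extent[3]) / 2
    projection = ccrs.LambertConformal(
        central_longitude=central_lon, central_latitude=central_lat
    )
    fig = plt.figure(figsize=(10, 8))
    ax = plt.axes(projection=projection)
    # The window is given in plain longitude/latitude, hence PlateCarree.
    ax.set_extent(list(extent), crs=ccrs.PlateCarree())
    # Five background layers, drawn underneath the points.
    ax.add_feature(cfeature.OCEAN)
    ax.add_feature(cfeature.LAND)
    ax.add_feature(cfeature.COASTLINE)
    ax.add_feature(cfeature.BORDERS)
    ax.add_feature(cfeature.STATES)
    # transform says the points are lon/lat; a high zorder keeps them on top.
    ax.scatter(lons, lats, transform=ccrs.PlateCarree(), zorder=10, color="red", s=20)
    return fig, ax
```
To try it, pass in the longitude and latitude columns and call plt.show(). Cartopy downloads its map layers the first time they're drawn, so that first run needs a network connection.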

118
00:05:04.190 --> 00:05:05.440
We'll be coming back to the map

119
00:05:05.440 --> 00:05:07.050
to use as a reference.

120
00:05:07.050 --> 00:05:08.680
But for now, we're gonna be moving on

121
00:05:08.680 --> 00:05:10.240
to the hierarchical model

122
00:05:10.240 --> 00:05:11.853
and building the dendrogram.

123
00:05:15.770 --> 00:05:17.680
All right, so to build the model,

124
00:05:17.680 --> 00:05:20.700
we're gonna first need to import the agglomerative algorithm

125
00:05:20.700 --> 00:05:23.410
from the cluster module in scikit-learn.

126
00:05:23.410 --> 00:05:26.120
We're also gonna be going back to the scipy library

127
00:05:26.120 --> 00:05:29.023
to import the linkage and dendrogram functions.

128
00:05:34.310 --> 00:05:35.930
Now, if you wanna take a closer look

129
00:05:35.930 --> 00:05:38.400
at the linkage documentation, you can

130
00:05:38.400 --> 00:05:39.910
but its general purpose

131
00:05:39.910 --> 00:05:42.240
is to give us a way of creating an object

132
00:05:42.240 --> 00:05:44.580
that represents the hierarchical clustering

133
00:05:44.580 --> 00:05:47.733
of a dataset encoded as a linkage matrix.

134
00:05:49.200 --> 00:05:50.760
And just in case you forgot,

135
00:05:50.760 --> 00:05:52.960
linkage refers to where in the cluster

136
00:05:52.960 --> 00:05:54.780
we're measuring distance from,

137
00:05:54.780 --> 00:05:56.160
so for a single linkage,

138
00:05:56.160 --> 00:05:58.500
which happens to be the default method,

139
00:05:58.500 --> 00:06:00.680
it works by measuring the shortest distance

140
00:06:00.680 --> 00:06:03.323
between a pair of observations in two clusters.
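
NOTE
To make the linkage matrix a little more concrete, here's a small sketch using approximate coordinates for five cities as stand-in data.
It checks that calling linkage with no method gives the same result as asking for single linkage explicitly.
```python
import numpy as np
from scipy.cluster.hierarchy import linkage
# Approximate (latitude, longitude) pairs; any (n, 2) array works here.
X = np.array([
    [36.17, -115.14],  # Las Vegas
    [36.11, -115.24],  # Spring Valley
    [33.87, -117.93],  # Fullerton
    [33.88, -117.57],  # Corona
    [33.75, -116.97],  # Hemet
])
# With no method argument, scipy falls back to single linkage.
Z_default = linkage(X)
Z_single = linkage(X, method="single")
# n observations always produce n - 1 merges, one per row, four columns each.
print(Z_default.shape)
print(np.allclose(Z_default, Z_single))
```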

141
00:06:04.760 --> 00:06:07.733
Now, for the dendrogram documentation,

142
00:06:10.220 --> 00:06:11.410
it says that it can be used

143
00:06:11.410 --> 00:06:14.373
to plot the hierarchical clustering as a dendrogram.

144
00:06:15.680 --> 00:06:17.310
Now, in order to do that,

145
00:06:17.310 --> 00:06:19.940
the very first parameter the function calls for

146
00:06:19.940 --> 00:06:21.730
is the linkage matrix

147
00:06:21.730 --> 00:06:24.303
that we just so conveniently happen to make.

148
00:06:25.510 --> 00:06:28.320
And I think before we start talking about the parameters,

149
00:06:28.320 --> 00:06:30.220
we should probably run the code

150
00:06:30.220 --> 00:06:32.743
so we can start talking about the actual result.
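
NOTE
A sketch of the dendrogram call itself, again with approximate stand-in coordinates instead of the real data.
The figure size, label rotation and axis label are assumptions; the key part is that the linkage matrix goes in first and the city index supplies the labels.
```python
import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
# Approximate coordinates, used only as stand-in data.
X = pd.DataFrame(
    {
        "latitude": [36.17, 36.11, 33.87, 33.88, 33.75],
        "longitude": [-115.14, -115.24, -117.93, -117.57, -116.97],
    },
    index=["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"],
)
Z = linkage(X.to_numpy(), method="single")
fig, ax = plt.subplots(figsize=(8, 4))
# The linkage matrix is the first argument; labels come from the city index.
result = dendrogram(Z, labels=list(X.index), leaf_rotation=90, ax=ax)
ax.set_ylabel("distance (degrees)")
fig.tight_layout()
# ivl is the list of leaf labels in the order they are drawn.
print(result["ivl"])
```
Add plt.show() at the end if you're running this interactively and want to see the plot.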

151
00:06:43.170 --> 00:06:45.730
All right, so we're gonna switch back and forth

152
00:06:45.730 --> 00:06:47.340
between this and the map

153
00:06:47.340 --> 00:06:49.320
but according to the dendrogram,

154
00:06:49.320 --> 00:06:51.460
it ended up creating two main clusters,

155
00:06:51.460 --> 00:06:54.460
which I guess we can just call the orange and green clusters.

156
00:06:55.940 --> 00:06:57.730
But regardless of the cluster,

157
00:06:57.730 --> 00:06:59.590
the two cities that appear to be the closest

158
00:06:59.590 --> 00:07:03.120
to each other are Las Vegas and Spring Valley

159
00:07:03.120 --> 00:07:04.380
and we can make that assumption

160
00:07:04.380 --> 00:07:06.280
because they have the shortest branch.

161
00:07:11.190 --> 00:07:12.890
I don't know if that helped at all

162
00:07:12.890 --> 00:07:15.440
but it seems like it might be a little better

163
00:07:15.440 --> 00:07:18.000
but anyway, Vegas and Spring Valley

164
00:07:18.000 --> 00:07:20.170
are pretty much completely overlapping,

165
00:07:20.170 --> 00:07:22.970
so it makes sense that those two cities are the closest.

166
00:07:23.980 --> 00:07:26.490
Then after that it was Fullerton, Corona

167
00:07:26.490 --> 00:07:28.160
and I think it was Hemet

168
00:07:28.160 --> 00:07:30.470
if that's even how you pronounce it.

169
00:07:30.470 --> 00:07:33.163
But that's gonna be this little group over here.

170
00:07:34.060 --> 00:07:37.090
And kind of like it was with Vegas and Spring Valley,

171
00:07:37.090 --> 00:07:40.060
since Fullerton and Corona are so close to each other,

172
00:07:40.060 --> 00:07:41.300
it makes it really hard

173
00:07:41.300 --> 00:07:43.890
to see all three of the observations.

174
00:07:43.890 --> 00:07:45.190
Then if you really wanted to,

175
00:07:45.190 --> 00:07:46.930
you could go through and do the same thing

176
00:07:46.930 --> 00:07:48.410
with all of the other cities

177
00:07:48.410 --> 00:07:50.853
but I think you get the gist of what's going on.

178
00:08:00.220 --> 00:08:02.580
So to finish off with the dendrogram code,

179
00:08:02.580 --> 00:08:04.950
the two other parameters I wanted to talk about

180
00:08:04.950 --> 00:08:07.620
are p and truncate_mode.

181
00:08:07.620 --> 00:08:09.840
It's not a problem in this example

182
00:08:09.840 --> 00:08:11.520
but a lot of the times when you're working

183
00:08:11.520 --> 00:08:13.310
with a really large dataset,

184
00:08:13.310 --> 00:08:15.510
you're gonna need to condense the dendrogram

185
00:08:15.510 --> 00:08:16.983
in order to make it readable.

186
00:08:17.990 --> 00:08:19.730
And the easiest way of doing that

187
00:08:19.730 --> 00:08:22.500
is by using the truncate_mode parameter.

188
00:08:22.500 --> 00:08:24.850
There's a couple of different ways of doing it.

189
00:08:28.540 --> 00:08:30.030
But the first one that we'll go over

190
00:08:30.030 --> 00:08:31.183
is called lastp.

191
00:08:34.670 --> 00:08:36.550
And when we use lastp,

192
00:08:36.550 --> 00:08:39.440
it only gives us the last p merges.

193
00:08:39.440 --> 00:08:42.793
So if I leave p at 20, and run the code,

194
00:08:44.620 --> 00:08:45.630
nothing changes

195
00:08:45.630 --> 00:08:49.590
because the last 20 merges still include every merge

196
00:08:50.500 --> 00:08:53.253
but if I move p down to 10,

197
00:08:58.800 --> 00:09:02.420
the dendrogram only shows the last 10 merges.

198
00:09:02.420 --> 00:09:05.390
Then the numbers inside the parentheses represent the number

199
00:09:05.390 --> 00:09:07.163
of observations in each leaf.
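
NOTE
Here's a small sketch of lastp on 20 random points, which are stand-in data and not the cities from the video.
It uses no_plot=True so it only returns the layout, which makes it easy to count leaves.
As far as I can tell, p=10 leaves you with 10 leaves on the plot, and the ones in parentheses are condensed groups.
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
# 20 random 2-D points as stand-in data (seeded so the run is repeatable).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Z = linkage(X, method="single")
# no_plot=True skips drawing and just returns the layout dictionary.
full = dendrogram(Z, no_plot=True)
same = dendrogram(Z, truncate_mode="lastp", p=20, no_plot=True)
short = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)
# Leaf counts for the untruncated tree, p=20 and p=10.
print(len(full["ivl"]), len(same["ivl"]), len(short["ivl"]))
# Labels like "(3)" mark a leaf that stands for 3 original observations.
print(short["ivl"])
```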

200
00:09:09.750 --> 00:09:12.393
The other option that we have is called level.

201
00:09:16.310 --> 00:09:18.950
And basically, level's gonna be truncating based

202
00:09:18.950 --> 00:09:20.610
on the number of nodes.

203
00:09:20.610 --> 00:09:23.010
So right now, if we count the number of nodes

204
00:09:23.010 --> 00:09:25.573
on the left-hand side of the green cluster,

205
00:09:26.480 --> 00:09:30.730
we're gonna have one, two, three, four, five,

206
00:09:30.730 --> 00:09:32.570
six and seven.

207
00:09:32.570 --> 00:09:34.883
So if I change the p value to seven,

208
00:09:42.280 --> 00:09:44.120
nothing happens.

209
00:09:44.120 --> 00:09:46.983
But if we move it down to six,

210
00:09:49.930 --> 00:09:53.123
we end up losing a level along that same pathway.
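
NOTE
And a matching sketch for level, on the same kind of random stand-in data.
The p values 1, 3 and 50 are arbitrary picks to show a shallow cut, a medium cut and no cut at all.
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
# 20 random 2-D points as stand-in data (seeded so the run is repeatable).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Z = linkage(X, method="single")
full = dendrogram(Z, no_plot=True)
# Try a few p values and watch how many leaves survive the truncation.
for p in (1, 3, 50):
    cut = dendrogram(Z, truncate_mode="level", p=p, no_plot=True)
    print(p, len(cut["ivl"]))
```
Once p is at least as large as the deepest path in the tree, nothing gets condensed, which is why going from seven to six was the first change that did anything in the video.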

211
00:09:54.350 --> 00:09:56.800
But since we're using such a small dataset,

212
00:09:56.800 --> 00:09:59.160
it's not something we're really gonna have to worry about.

213
00:09:59.160 --> 00:10:00.793
So we can just leave it at None.

214
00:10:11.580 --> 00:10:13.050
All right, so that leaves us

215
00:10:13.050 --> 00:10:15.290
with just one last topic to cover

216
00:10:15.290 --> 00:10:17.663
and that's setting up the agglomerative model.

217
00:10:18.840 --> 00:10:20.340
A lot of the same stuff we've covered

218
00:10:20.340 --> 00:10:22.160
in the other guides is still gonna hold true

219
00:10:22.160 --> 00:10:23.960
for what we're doing here.

220
00:10:23.960 --> 00:10:25.770
So just like we've always done,

221
00:10:25.770 --> 00:10:28.580
we're assigning the algorithm to a variable.

222
00:10:28.580 --> 00:10:31.223
Then using the fit function to generate the model.

223
00:10:32.070 --> 00:10:34.040
And in the last guide, we even talked

224
00:10:34.040 --> 00:10:36.300
about a couple different methods we can implement

225
00:10:36.300 --> 00:10:39.240
to help us choose the best k value for a model.

226
00:10:39.240 --> 00:10:41.140
But I don't think it's really necessary

227
00:10:41.140 --> 00:10:43.720
for us to do those in this example.

228
00:10:43.720 --> 00:10:46.670
So there honestly isn't too much for us to go through

229
00:10:46.670 --> 00:10:48.760
other than comparing the results of the model

230
00:10:48.760 --> 00:10:49.923
to the dendrogram.

231
00:10:50.760 --> 00:10:52.700
Now, before we go over to the console,

232
00:10:52.700 --> 00:10:54.090
it probably wouldn't hurt

233
00:10:54.090 --> 00:10:56.230
to go over these last few lines of code,

234
00:10:56.230 --> 00:10:58.080
especially since we haven't done too much

235
00:10:58.080 --> 00:10:59.183
with pandas lately.

236
00:11:01.280 --> 00:11:02.910
So after we fit the model,

237
00:11:02.910 --> 00:11:04.580
we're using the labels_ attribute

238
00:11:04.580 --> 00:11:06.630
to return a list of cluster labels

239
00:11:06.630 --> 00:11:08.653
that each observation was assigned to.

240
00:11:09.760 --> 00:11:12.550
Then we're creating a new data frame called X_clustered

241
00:11:13.560 --> 00:11:15.550
and using the pandas assign function

242
00:11:15.550 --> 00:11:18.513
to add a new label column to the features data frame.

243
00:11:19.730 --> 00:11:21.040
Now to finish things off,

244
00:11:21.040 --> 00:11:23.000
we're making one last data frame

245
00:11:23.000 --> 00:11:25.030
that's made up of all of the observations

246
00:11:25.030 --> 00:11:26.173
that were labeled one.
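
NOTE
Those last few steps could look something like this. It's a sketch with approximate stand-in coordinates.
The n_clusters=2 and linkage="single" settings are assumptions picked to mirror the dendrogram, and cluster_one is just an illustrative name.
```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
# Approximate coordinates, used only as stand-in data.
X = pd.DataFrame(
    {
        "latitude": [36.17, 36.11, 33.87, 33.88, 33.75],
        "longitude": [-115.14, -115.24, -117.93, -117.57, -116.97],
    },
    index=["Las Vegas", "Spring Valley", "Fullerton", "Corona", "Hemet"],
)
# Assumed settings: two clusters, single linkage.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
model.fit(X)
# labels_ holds one cluster id per row, in the same order as X.
labels = model.labels_
# assign returns a new data frame with the extra label column.
X_clustered = X.assign(label=labels)
# Keep only the rows that ended up in cluster 1.
cluster_one = X_clustered[X_clustered["label"] == 1]
print(cluster_one)
```
Which group ends up being called 0 and which one 1 is arbitrary, so don't read anything into the number itself.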

247
00:11:28.320 --> 00:11:31.083
All right, so now we can move back over to the console.

248
00:11:40.840 --> 00:11:42.450
And when we compare the eight cities

249
00:11:42.450 --> 00:11:44.850
that were labeled one by the agglomerative model

250
00:11:44.850 --> 00:11:47.170
to the orange cluster of the dendrogram,

251
00:11:47.170 --> 00:11:49.283
we end up getting the exact same result.
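
NOTE
If you'd rather check that match in code than by eye, one option is to cut the scipy tree into two flat clusters with fcluster and compare the grouping with the model's labels.
This is a sketch on approximate stand-in coordinates. Keep in mind the dendrogram colors come from scipy's color threshold, so they won't always line up with a two-cluster cut.
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
# Approximate (latitude, longitude) pairs, used only as stand-in data.
X = np.array([
    [36.17, -115.14],  # Las Vegas
    [36.11, -115.24],  # Spring Valley
    [33.87, -117.93],  # Fullerton
    [33.88, -117.57],  # Corona
    [33.75, -116.97],  # Hemet
])
# Cut the scipy tree into two flat clusters (ids start at 1).
Z = linkage(X, method="single")
tree_labels = fcluster(Z, t=2, criterion="maxclust")
# Fit the scikit-learn model with the same linkage (ids start at 0).
model_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X).labels_
# The ids themselves can differ, so compare which rows share a cluster.
same_tree = tree_labels[:, None] == tree_labels[None, :]
same_model = model_labels[:, None] == model_labels[None, :]
print(np.array_equal(same_tree, same_model))
```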

252
00:11:50.400 --> 00:11:52.460
So there you have it.

253
00:11:52.460 --> 00:11:53.560
You now officially know

254
00:11:53.560 --> 00:11:56.010
how to perform hierarchical clustering

255
00:11:56.010 --> 00:11:58.380
by either creating an agglomerative model

256
00:11:58.380 --> 00:12:00.270
or by making a dendrogram.

257
00:12:02.550 --> 00:12:04.900
And I know you're probably super sad

258
00:12:04.900 --> 00:12:08.060
but that does bring us to the end of the clustering section

259
00:12:08.060 --> 00:12:09.670
but fortunately for us,

260
00:12:09.670 --> 00:12:11.580
it also means that we're finally breaking

261
00:12:11.580 --> 00:12:13.780
into the neural network section.

262
00:12:13.780 --> 00:12:16.620
So until then, I'll wrap things up

263
00:12:16.620 --> 00:12:18.370
and I'll see you in the next guide.