Guide to Hierarchical Clustering

WEBVTT

1
00:00:03.950 --> 00:00:05.540
In this guide, we'll be covering

2
00:00:05.540 --> 00:00:07.230
our second clustering technique

3
00:00:07.230 --> 00:00:09.910
which is Hierarchical Clustering.

4
00:00:09.910 --> 00:00:12.940
And as a heads up, I'm for sure gonna mispronounce

5
00:00:12.940 --> 00:00:15.350
hierarchical a few times throughout the guide,

6
00:00:15.350 --> 00:00:17.100
because for whatever reason

7
00:00:17.100 --> 00:00:19.193
I struggle horribly with that word.

8
00:00:20.490 --> 00:00:23.980
But to jump right in, hierarchical clustering can be achieved

9
00:00:23.980 --> 00:00:25.240
in two ways.

10
00:00:25.240 --> 00:00:27.560
The first is called Agglomerative

11
00:00:27.560 --> 00:00:30.690
which basically means to collect or form.

12
00:00:30.690 --> 00:00:33.020
Then the second is Divisive,

13
00:00:33.020 --> 00:00:35.920
and that means to pull apart or separate.

14
00:00:35.920 --> 00:00:38.690
But regardless of the method, the sole purpose

15
00:00:38.690 --> 00:00:41.090
of both hierarchical clustering methods

16
00:00:41.090 --> 00:00:44.220
is to build out a hierarchy of unique clusters

17
00:00:44.220 --> 00:00:45.753
based on similarities.

18
00:00:47.020 --> 00:00:49.410
And to the best of my knowledge Scikit-learn

19
00:00:49.410 --> 00:00:52.050
still only offers the Agglomerative class.

20
00:00:52.050 --> 00:00:54.520
So we'll probably end up focusing on that method

21
00:00:54.520 --> 00:00:55.493
a little bit more.
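As a quick sanity check, this is the import you'd reach for — a minimal sketch assuming a reasonably recent scikit-learn install:

```python
# scikit-learn exposes the bottom-up (agglomerative) approach;
# there is no built-in divisive estimator.
from sklearn.cluster import AgglomerativeClustering
```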

22
00:00:56.610 --> 00:00:59.390
If you're not familiar with hierarchical structures,

23
00:00:59.390 --> 00:01:01.770
it's really just an arrangement of items

24
00:01:01.770 --> 00:01:03.950
based on levels of importance.

25
00:01:03.950 --> 00:01:06.360
So you can think of something like the United States

26
00:01:06.360 --> 00:01:09.580
Federal Government, or my all-time favorite

27
00:01:09.580 --> 00:01:11.823
The Swanson Pyramid of Greatness.

28
00:01:13.230 --> 00:01:14.930
Hierarchy structures can be found

29
00:01:14.930 --> 00:01:17.270
in nearly every aspect of life,

30
00:01:17.270 --> 00:01:19.430
but specific to machine learning

31
00:01:19.430 --> 00:01:21.400
you'll typically find them being used

32
00:01:21.400 --> 00:01:24.710
for data analysis projects like social networking,

33
00:01:24.710 --> 00:01:27.703
or in the biomedical field where they're widely used.

34
00:01:28.680 --> 00:01:31.410
Now to go back to what I said in the beginning

35
00:01:31.410 --> 00:01:35.470
hierarchical clustering algorithms fall into two categories,

36
00:01:35.470 --> 00:01:37.553
agglomerative and divisive.

37
00:01:38.630 --> 00:01:40.850
Agglomerative hierarchical algorithms

38
00:01:40.850 --> 00:01:43.150
follow a bottom-up approach.

39
00:01:43.150 --> 00:01:45.260
Meaning the algorithm begins by treating

40
00:01:45.260 --> 00:01:49.310
each individual data point as its own single cluster.

41
00:01:49.310 --> 00:01:51.280
Then pairs of clusters are merged

42
00:01:51.280 --> 00:01:53.003
as it moves up the hierarchy.

43
00:01:53.950 --> 00:01:56.640
In contrast, divisive hierarchical algorithms

44
00:01:56.640 --> 00:01:59.110
implement a top-down approach,

45
00:01:59.110 --> 00:02:02.690
where all the observations are treated as one big cluster

46
00:02:02.690 --> 00:02:05.253
before being divided into smaller clusters.

47
00:02:06.750 --> 00:02:08.820
I've alluded to it in the past couple of guides

48
00:02:08.820 --> 00:02:11.870
but the most important aspect of hierarchical clustering

49
00:02:11.870 --> 00:02:14.020
and really any clustering algorithm,

50
00:02:14.020 --> 00:02:15.940
is the distance between data points

51
00:02:15.940 --> 00:02:17.403
and how it's being measured.

52
00:02:18.740 --> 00:02:21.670
We're gonna start off with just six data points,

53
00:02:21.670 --> 00:02:23.510
and just by eyeballing the data,

54
00:02:23.510 --> 00:02:27.100
we can get a pretty good idea of how many clusters to use,

55
00:02:27.100 --> 00:02:29.563
and how the data will probably be grouped.

56
00:02:30.610 --> 00:02:33.260
More than likely we'll end up having three clusters,

57
00:02:33.260 --> 00:02:36.510
where X1 and X2 are grouped together,

58
00:02:36.510 --> 00:02:39.190
X3 and X4 are in the second cluster,

59
00:02:39.190 --> 00:02:42.820
and X5 and X6 are in the third cluster.

60
00:02:42.820 --> 00:02:45.560
But for Scikit-learn to actually yield that result

61
00:02:45.560 --> 00:02:48.270
there obviously needs to be a much more technical

62
00:02:48.270 --> 00:02:50.203
and mathematically based approach.
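Here's a minimal sketch of that idea. The six coordinates below are made up for illustration, not the exact points from the video:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six made-up 2-D points that visually form three pairs.
X = np.array([
    [1.0, 1.0], [1.5, 1.2],   # roughly where X1 and X2 might sit
    [5.0, 5.0], [5.3, 4.8],   # X3 and X4
    [9.0, 1.0], [9.2, 1.4],   # X5 and X6
])

# Defaults: Euclidean distance with Ward's linkage.
model = AgglomerativeClustering(n_clusters=3)
labels = model.fit_predict(X)
print(labels)  # something like [0 0 1 1 2 2]; the label numbers themselves are arbitrary
```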

63
00:02:51.270 --> 00:02:53.670
Going back to the documentation really quick

64
00:02:53.670 --> 00:02:55.730
we're gonna be using two separate parameters

65
00:02:55.730 --> 00:02:58.240
to help with calculating distance.

66
00:02:58.240 --> 00:03:01.550
The first is Affinity, which defines how the algorithm

67
00:03:01.550 --> 00:03:05.160
measures the actual physical distance between points.

68
00:03:05.160 --> 00:03:06.630
We'll be going through a couple of these

69
00:03:06.630 --> 00:03:10.240
but Scikit-learn uses Euclidean distance as the default,

70
00:03:10.240 --> 00:03:14.050
but we also have the option of using L1, L2,

71
00:03:14.050 --> 00:03:17.853
Manhattan, Cosine or a pre-computed distance.

72
00:03:18.700 --> 00:03:20.640
But since Euclidean is the default method

73
00:03:20.640 --> 00:03:22.990
I guess it makes the most sense to start there.

74
00:03:23.950 --> 00:03:26.120
So the quick explanation of Euclidean distance

75
00:03:26.120 --> 00:03:29.000
is that it works much like finding the length of a vector.

76
00:03:29.000 --> 00:03:30.660
So it's going to calculate the difference

77
00:03:30.660 --> 00:03:33.690
between the X and Y components of the two points,

78
00:03:33.690 --> 00:03:36.090
then use the Pythagorean theorem to solve for the length

79
00:03:36.090 --> 00:03:37.763
or magnitude of the hypotenuse.
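In code, that's just a couple of lines — a small sketch with two made-up points:

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Difference along each axis, then the Pythagorean theorem.
euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt(3**2 + 4**2) = 5.0
print(euclidean)
```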

80
00:03:39.580 --> 00:03:41.480
Now the second option offered in Scikit-learn

81
00:03:41.480 --> 00:03:43.380
is Manhattan distance.

82
00:03:43.380 --> 00:03:46.110
Unlike Euclidean, Manhattan doesn't calculate

83
00:03:46.110 --> 00:03:48.420
the straight line distance between points.

84
00:03:48.420 --> 00:03:51.680
Instead it measures distance by calculating the sum

85
00:03:51.680 --> 00:03:54.250
of the absolute difference of the Cartesian coordinates

86
00:03:54.250 --> 00:03:55.373
of both points.

87
00:03:56.250 --> 00:03:59.990
So in this example, we can see three different paths taken

88
00:03:59.990 --> 00:04:02.730
but they all end up with the same result.

89
00:04:02.730 --> 00:04:04.270
And sometimes you'll hear this method

90
00:04:04.270 --> 00:04:06.690
referred to as Taxi Cab Geometry,

91
00:04:06.690 --> 00:04:08.440
because it mimics the shortest route

92
00:04:08.440 --> 00:04:09.860
that a taxi cab can travel through

93
00:04:09.860 --> 00:04:11.603
the grid layout of Manhattan.
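Using the same two made-up points as before, the taxicab version is just the sum of the absolute differences:

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Sum of absolute coordinate differences: |1 - 4| + |2 - 6| = 7.0
manhattan = np.sum(np.abs(p - q))
print(manhattan)
```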

94
00:04:13.130 --> 00:04:15.410
The last built-in option that Scikit-learn offers

95
00:04:15.410 --> 00:04:17.640
is Cosine similarity.

96
00:04:17.640 --> 00:04:19.980
The biggest difference between Cosine similarity

97
00:04:19.980 --> 00:04:22.290
and Euclidean or Manhattan distance

98
00:04:22.290 --> 00:04:25.240
is that it doesn't use magnitude or the length

99
00:04:25.240 --> 00:04:27.280
to measure distance.

100
00:04:27.280 --> 00:04:30.610
Instead it treats each instance as a unit vector

101
00:04:30.610 --> 00:04:32.510
originating from the origin,

102
00:04:32.510 --> 00:04:36.250
and measures the angle between the two vectors.

103
00:04:36.250 --> 00:04:38.333
This is done by using the dot product.

104
00:04:39.550 --> 00:04:42.240
In our example, we can say that vectors A and C

105
00:04:42.240 --> 00:04:44.660
are more similar than vectors A and B,

106
00:04:44.660 --> 00:04:47.730
because the angle between them is smaller.

107
00:04:47.730 --> 00:04:50.300
And typically when we use cosine similarity

108
00:04:50.300 --> 00:04:52.300
it's limited to the first quadrant,

109
00:04:52.300 --> 00:04:55.430
meaning it usually only deals with positive space.

110
00:04:55.430 --> 00:04:57.900
And because of that, the largest possible angle

111
00:04:57.900 --> 00:05:00.223
between any two points is 90 degrees.

112
00:05:01.550 --> 00:05:04.220
Now, one of the rules or characteristics

113
00:05:04.220 --> 00:05:07.040
or whatever you wanna call it with vectors,

114
00:05:07.040 --> 00:05:09.470
is that they're considered to be the most similar

115
00:05:09.470 --> 00:05:10.840
if they're parallel,

116
00:05:10.840 --> 00:05:13.643
and maximally different if they're perpendicular.
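To make that concrete, here's a rough sketch with three made-up vectors standing in for A, B, and C:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); closer to 1 means a smaller angle.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 1.0])
b = np.array([0.0, 1.0])
c = np.array([2.0, 1.8])

print(cosine_similarity(a, c))  # ~0.999 -> tiny angle, very similar
print(cosine_similarity(a, b))  # ~0.707 -> 45-degree angle, less similar
```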

117
00:05:15.560 --> 00:05:17.710
All right, now that we've covered a few different ways

118
00:05:17.710 --> 00:05:20.980
clustering algorithms can measure the actual distance,

119
00:05:20.980 --> 00:05:23.690
the other parameter that we need to take into consideration

120
00:05:23.690 --> 00:05:27.830
is where exactly in the cluster we should be measuring from.

121
00:05:27.830 --> 00:05:30.700
And this parameter is referred to as Linkage,

122
00:05:30.700 --> 00:05:34.023
because it's how the algorithm joins or links clusters.

123
00:05:34.900 --> 00:05:37.050
For this one, the options that we can choose from

124
00:05:37.050 --> 00:05:40.390
in Scikit-learn are single linkage, complete linkage,

125
00:05:40.390 --> 00:05:42.363
average linkage, and Ward's linkage.
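As a sketch of how those are selected (assuming scikit-learn 0.20 or later, which added the single option):

```python
from sklearn.cluster import AgglomerativeClustering

single   = AgglomerativeClustering(n_clusters=3, linkage="single")
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
average  = AgglomerativeClustering(n_clusters=3, linkage="average")
ward     = AgglomerativeClustering(n_clusters=3, linkage="ward")  # the default
```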

126
00:05:44.060 --> 00:05:45.810
We're gonna start off with the newest addition

127
00:05:45.810 --> 00:05:46.643
to Scikit-learn,

128
00:05:46.643 --> 00:05:50.010
which is Single Linkage or Nearest Neighbor.

129
00:05:50.010 --> 00:05:52.480
This method works by measuring the shortest distance

130
00:05:52.480 --> 00:05:55.730
between a pair of observations in two clusters.

131
00:05:55.730 --> 00:05:59.010
It's a simple and fast method that works well with a variety

132
00:05:59.010 --> 00:06:00.560
of data sets,

133
00:06:00.560 --> 00:06:04.210
but unfortunately it does struggle with noisy data

134
00:06:04.210 --> 00:06:06.910
and is also known to produce long chains

135
00:06:06.910 --> 00:06:09.100
because of its tendency to add observations

136
00:06:09.100 --> 00:06:12.523
to existing clusters rather than creating new clusters.

137
00:06:14.040 --> 00:06:15.900
The second strategy is Complete

138
00:06:15.900 --> 00:06:18.445
or Furthest Neighbor Linkage.

139
00:06:18.445 --> 00:06:20.740
And I guess you'd consider this to be the opposite

140
00:06:20.740 --> 00:06:21.920
of single linkage,

141
00:06:21.920 --> 00:06:23.740
because it measures the furthest distance

142
00:06:23.740 --> 00:06:26.910
between observations in different clusters.

143
00:06:26.910 --> 00:06:30.010
After it calculates all of the possible max distances,

144
00:06:30.010 --> 00:06:33.363
it chooses the smallest of those maximums to merge the clusters.

145
00:06:35.020 --> 00:06:36.260
Unlike single linkage,

146
00:06:36.260 --> 00:06:39.600
complete linkage is able to handle noise a little bit better

147
00:06:39.600 --> 00:06:42.630
and it avoids chaining, but the downside

148
00:06:42.630 --> 00:06:45.253
is that it's more computationally expensive.

149
00:06:47.550 --> 00:06:50.020
The third option is Average Linkage,

150
00:06:50.020 --> 00:06:52.160
and it's kind of the middle ground between single

151
00:06:52.160 --> 00:06:53.900
and complete linkage.

152
00:06:53.900 --> 00:06:55.950
It's a pretty self-explanatory method

153
00:06:55.950 --> 00:06:58.340
because it's literally just the average distance

154
00:06:58.340 --> 00:07:00.693
between all of the elements in each cluster.

155
00:07:01.580 --> 00:07:03.600
The benefit of using this strategy

156
00:07:03.600 --> 00:07:06.980
is that it's less affected by noise and outliers,

157
00:07:06.980 --> 00:07:10.543
but it does tend to be biased towards really dense clusters.
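Since single, complete, and average linkage are really just the min, max, and mean of the same pairwise distances, here's a small sketch with two made-up clusters:

```python
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.5, 1.2]])
cluster_b = np.array([[5.0, 5.0], [5.3, 4.8]])

# Every pairwise Euclidean distance between a point in A and a point in B.
d = np.array([np.linalg.norm(a - b) for a in cluster_a for b in cluster_b])

print(d.min())   # single linkage: nearest neighbor
print(d.max())   # complete linkage: furthest neighbor
print(d.mean())  # average linkage: mean of all pairs
```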

158
00:07:12.420 --> 00:07:14.490
Now, the last option that we're gonna cover

159
00:07:14.490 --> 00:07:18.450
which also happens to be the default method, is Ward's Method

160
00:07:18.450 --> 00:07:20.483
or the Minimum Variance Method.

161
00:07:21.540 --> 00:07:23.500
It's gonna be a little more complex

162
00:07:23.500 --> 00:07:27.120
but this strategy works by pretending to merge two clusters,

163
00:07:27.120 --> 00:07:29.610
then estimating the location of the centroid

164
00:07:29.610 --> 00:07:31.483
for each of the possible clusters.

165
00:07:32.320 --> 00:07:34.770
Once it estimates the location of the centroid

166
00:07:34.770 --> 00:07:37.120
it calculates the sum of square deviations

167
00:07:37.120 --> 00:07:39.950
of all of the points from the new centroid,

168
00:07:39.950 --> 00:07:41.430
then chooses whichever merge

169
00:07:41.430 --> 00:07:44.200
would result in the smallest increase in total variance

170
00:07:44.200 --> 00:07:45.253
within the cluster.

171
00:07:46.440 --> 00:07:48.667
I know this method is a little more confusing

172
00:07:48.667 --> 00:07:51.210
but in our example, the blue and yellow clusters

173
00:07:51.210 --> 00:07:53.230
would end up being the first to merge,

174
00:07:53.230 --> 00:07:55.480
because they're gonna have the lowest total variance

175
00:07:55.480 --> 00:07:56.433
after they join.
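Here's a back-of-the-envelope sketch of that idea with made-up blue, yellow, and red clusters — not scikit-learn's exact implementation, just the "pretend to merge and measure the variance" step:

```python
import numpy as np

def sse(points):
    # Sum of squared deviations from the cluster's own centroid.
    return np.sum((points - points.mean(axis=0)) ** 2)

def merge_increase(a, b):
    # How much the within-cluster variance grows if we merge a and b.
    return sse(np.vstack([a, b])) - sse(a) - sse(b)

blue   = np.array([[1.0, 1.0], [1.2, 0.9]])
yellow = np.array([[1.5, 1.1], [1.6, 1.3]])
red    = np.array([[6.0, 6.0], [6.3, 5.8]])

print(merge_increase(blue, yellow))  # small increase -> merged first
print(merge_increase(blue, red))     # large increase -> postponed
```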

176
00:07:57.400 --> 00:07:59.670
Then, similar to a couple of the other options,

177
00:07:59.670 --> 00:08:03.260
this method is pretty good at avoiding noise or outliers,

178
00:08:03.260 --> 00:08:05.680
but it can also be biased towards globular

179
00:08:05.680 --> 00:08:07.313
or highly dense clusters.

180
00:08:09.450 --> 00:08:12.850
And that about brings us to the end of the guide.

181
00:08:12.850 --> 00:08:14.580
I'm assuming you probably wanna go

182
00:08:14.580 --> 00:08:16.450
through some of the actual code.

183
00:08:16.450 --> 00:08:18.740
So in the next couple of guides, we'll be moving forward

184
00:08:18.740 --> 00:08:20.360
with the modeling portion,

185
00:08:20.360 --> 00:08:22.940
along with how we can build and utilize visuals

186
00:08:22.940 --> 00:08:25.130
to help with decision-making.

187
00:08:25.130 --> 00:08:27.600
But for now I will wrap things up

188
00:08:27.600 --> 00:08:29.450
and I will see you in the next guide.