K-means Clustering Overview

In this guide, we'll be discussing our first clustering algorithm: k-means clustering. I like starting with k-means because it's probably the most well-known clustering algorithm, and it's arguably the easiest to implement as well.

A couple of other reasons I prefer k-means: it's capable of handling really large datasets while maintaining efficiency, it's easily adjustable, and it delivers results that are easy to interpret.

Now, to move into how it works: the objective of k-means is pretty straightforward. It's to group similar data points together to discover unknown patterns by looking for a fixed number of k clusters in an unlabeled dataset. That's accomplished by reducing variance in the data, partitioning it into distinct, non-overlapping subgroups.
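
The narration never writes this down, but the objective it's describing is the standard within-cluster sum of squares: choose the partition of the data into k clusters that minimizes the total squared distance from each point to its cluster's mean. In notation:

    \arg\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2
    \quad \text{where } \mu_j \text{ is the mean (centroid) of cluster } C_j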

To give you a better understanding of how it works, let's start with the dataset we want clustered. Unless specified otherwise, the algorithm is going to start by placing k centroids in random locations; for us, k will be two.

Then it iterates through the dataset, calculating the distance between each data point and both of the centroids. From there, each data point is assigned to the cluster that the nearest centroid represents.

Once the algorithm has iterated through the entire dataset, the centroids are adjusted to the center of each cluster. That's done by calculating the mean position of all of the instances in each cluster.

But more importantly, now that the centroids have been adjusted, the algorithm is going to start all over again.

So just like before, it's going to iterate through the entire dataset, measuring the distance between each point and the two centroids, and then assign each point to the appropriate cluster.

And because there was such a big change in position for the centroid of the red class, we end up seeing one of the data points change from blue to red.

We're not going to keep stepping through it, but the same process continues until the data points are stable and you stop seeing them switch between clusters.

And since I brought it up a few guides ago, it's probably worth mentioning again: this is also the point where the algorithm officially converges. Like before, we're not going to spend too much time talking about convergence, but it is an aspect of k-means that you absolutely need to be aware of, because even though we don't have to handle it ourselves, mathematically it's how clusters are actually separated.
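
To make that loop concrete, here's a minimal sketch of the process we just walked through, written in plain NumPy. This is illustrative code, not the implementation we'll use later: it places the centroids, assigns every point to its nearest centroid, moves each centroid to its cluster's mean, and stops once the centroids (and therefore the assignments) stop changing.

    import numpy as np

    def kmeans(X, k=2, max_iters=100, seed=0):
        """Plain k-means: assign each point to its nearest centroid,
        move each centroid to its cluster's mean, repeat until stable."""
        rng = np.random.default_rng(seed)
        # Start by placing the k centroids at randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Distance from every point to every centroid: shape (n, k).
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            # Each point joins the cluster of its nearest centroid.
            labels = distances.argmin(axis=1)
            # Move each centroid to the mean position of its cluster
            # (no empty-cluster handling here, for brevity).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Converged: the centroids stopped moving, so the labels
            # will stop switching between clusters.
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

Calling kmeans(X) on a two-column array of points reproduces the two-centroid walkthrough above.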

Now, we kind of talked about this general concept already, but because a data point can only be assigned to one class, or in this case one cluster, the algorithm has to eliminate the possibility of any shared space between clusters. To make sure that's the case, it establishes boundaries, ensuring that a cluster can come really, really close to another cluster without actually touching it.

And that about covers all the basic principles of the k-means algorithm that you'll need to know. So to finish off the guide, we'll go through some of the drawbacks that are pretty much guaranteed to come your way when you start to use k-means.

The first one we'll talk about is determining the best k value to use, which is easily the most important thing to get right, but also arguably the most annoying to figure out. Unfortunately, this is a process that needs to be done manually, and it's not always clear which value you should go with.

We're not going to go through it right now, but there are some general suggestions you can use to figure out your k value; there just isn't a hard and fast rule you can turn to.
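
One of those suggestions, previewed here even though the guide doesn't cover it, is the elbow method: fit k-means for several candidate values of k and watch where the inertia (the within-cluster sum of squared distances) stops dropping sharply. A rough sketch using scikit-learn, with toy data standing in for a real feature matrix:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data purely for illustration; substitute your own feature matrix.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

    # Fit k-means for a range of candidate k values and record the inertia
    # (within-cluster sum of squared distances) for each one.
    for k in range(1, 11):
        model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        print(f"k={k}: inertia={model.inertia_:.1f}")

Plotted, the inertia curve bends like an elbow; the k at the bend is usually a reasonable choice.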

The second issue with k-means is that it tends to be sensitive to initialization. In this visual, you can see that the location of the initial seed can have a fairly substantial effect on how the data ends up being clustered.
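
In practice this is usually mitigated rather than solved. For example, scikit-learn's KMeans supports smarter seeding and multiple restarts; the toy points below are hypothetical, just to illustrate the parameters:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical toy points forming two obvious groups.
    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.5, 7.5]])

    # init="k-means++" spreads the starting centroids apart instead of
    # dropping them at purely random spots, and n_init=10 reruns the whole
    # algorithm from ten different seeds, keeping the lowest-inertia result.
    model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)
    print(model.labels_)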

The third shortcoming is that it's also sensitive to outliers. This is kind of related to choosing the appropriate number of clusters, because if there's a single point that's too far from the rest of the dataset, a lot of the time it'll end up being placed in its own cluster.
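
A quick way to see this for yourself, with made-up numbers that are purely illustrative: give k-means a tight group of points plus one distant straggler and ask for two clusters, and the straggler will typically be handed a cluster of its own.

    import numpy as np
    from sklearn.cluster import KMeans

    # A tight group of four points plus one distant outlier.
    X = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [50, 50]], dtype=float)

    # With k=2, the outlier is so far away that it typically becomes a
    # cluster of one instead of the real group being split in two.
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    print(labels)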

The final drawback to k-means is that it's highly sensitive to distance. So depending on the data, feature scaling can be incredibly important.

It's not a huge deal if you skip scaling when you know that all of your data is using the same unit of measure, because the only real difference you'll probably see is in processing speed. But that does lead us into the other reason why you'd want to scale your data: on the off chance you end up working with a really big dataset that also happens to contain data measured at a really high order of magnitude, it's probably going to be beneficial to shrink down that measuring distance to save yourself a lot of processing time.
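
Here's roughly what that looks like in scikit-learn, with made-up salary and age features chosen only because their scales differ so wildly:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Made-up features on very different scales: salary in dollars, age in years.
    X = np.array([[45000, 25], [52000, 31], [120000, 45], [118000, 52]], dtype=float)

    # StandardScaler rescales each feature to zero mean and unit variance,
    # so the large-magnitude salary column no longer dominates the distances.
    pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=42))
    print(pipeline.fit_predict(X))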

Now, with all that being said, I think it's time to officially wrap this guide up. So as always, I will see you in the next guide.