WEBVTT
1
00:00:04.290 --> 00:00:05.340
In this guide,
2
00:00:05.340 --> 00:00:07.990
we'll be discussing our first clustering algorithm,
3
00:00:07.990 --> 00:00:10.320
which is k-means clustering.
4
00:00:10.320 --> 00:00:11.740
I like starting with k-means
5
00:00:11.740 --> 00:00:13.200
because it's probably the most
6
00:00:13.200 --> 00:00:15.270
well-known clustering algorithm,
7
00:00:15.270 --> 00:00:18.423
and is arguably the easiest to implement as well.
8
00:00:19.540 --> 00:00:22.130
A couple other reasons why I prefer to use k-means
9
00:00:22.130 --> 00:00:25.900
are that it's also capable of handling really large datasets
10
00:00:25.900 --> 00:00:27.950
while maintaining efficiency.
11
00:00:27.950 --> 00:00:29.690
It's easily adjustable
12
00:00:29.690 --> 00:00:32.653
and it delivers results that are easy to interpret.
13
00:00:33.780 --> 00:00:35.520
Now, to move into how it works,
14
00:00:35.520 --> 00:00:38.510
the objective of k-means is pretty straightforward.
15
00:00:38.510 --> 00:00:40.760
It's to group similar data points together
16
00:00:40.760 --> 00:00:42.610
to discover unknown patterns
17
00:00:42.610 --> 00:00:45.330
by looking for a fixed number of k clusters
18
00:00:45.330 --> 00:00:47.620
in an unlabeled dataset.
19
00:00:47.620 --> 00:00:48.750
And that's accomplished
20
00:00:48.750 --> 00:00:50.780
through reducing variance in the data
21
00:00:50.780 --> 00:00:54.693
by partitioning it into distinct non-overlapping subgroups.
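A minimal sketch of that objective, assuming X is a NumPy array of data points, labels holds each point's cluster assignment, and centroids holds the k cluster centers:

```python
import numpy as np

def inertia(X, labels, centroids):
    # Within-cluster sum of squared distances -- the variance-style
    # quantity k-means drives down as it partitions the data
    # into non-overlapping subgroups.
    return sum(
        np.sum((X[labels == k] - c) ** 2)
        for k, c in enumerate(centroids)
    )
```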
22
00:00:55.990 --> 00:00:58.160
To give you a better understanding of how it works,
23
00:00:58.160 --> 00:01:01.260
let's start with the dataset we want clustered.
24
00:01:01.260 --> 00:01:02.870
Now, unless specified,
25
00:01:02.870 --> 00:01:04.160
the algorithm is gonna start
26
00:01:04.160 --> 00:01:07.520
by placing k centroids in random locations,
27
00:01:07.520 --> 00:01:08.883
which for us will be two.
28
00:01:09.830 --> 00:01:11.740
Then it iterates through the dataset
29
00:01:11.740 --> 00:01:14.290
calculating the distance between each data point
30
00:01:14.290 --> 00:01:15.937
and both of the centroids.
31
00:01:17.120 --> 00:01:19.520
From there, the data point is assigned to the cluster
32
00:01:19.520 --> 00:01:21.573
that the nearest centroid represents.
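As a rough sketch, that assignment step looks something like this in NumPy, assuming X is an (n, d) array of points and centroids is a (k, d) array:

```python
import numpy as np

def assign_clusters(X, centroids):
    # Distance from every point to every centroid, shape (n_points, k).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Each point joins the cluster of its nearest centroid.
    return np.argmin(distances, axis=1)
```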
33
00:01:22.700 --> 00:01:25.470
Once the algorithm iterates through the entire dataset,
34
00:01:25.470 --> 00:01:29.320
the centroids are adjusted to the center of each cluster.
35
00:01:29.320 --> 00:01:31.800
And that's done by calculating the mean position
36
00:01:31.800 --> 00:01:34.203
of all of the instances in each cluster.
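That update step is just a per-cluster mean. A minimal sketch, assuming every cluster keeps at least one point (a library implementation handles the empty-cluster edge case for you):

```python
import numpy as np

def update_centroids(X, labels, k):
    # Move each centroid to the mean position of the points
    # currently assigned to its cluster.
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```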
37
00:01:35.290 --> 00:01:36.500
But more importantly,
38
00:01:36.500 --> 00:01:38.670
now that the centroid has been adjusted,
39
00:01:38.670 --> 00:01:41.023
the algorithm is gonna start all over again.
40
00:01:42.260 --> 00:01:43.370
So just like before,
41
00:01:43.370 --> 00:01:45.580
it's gonna iterate through the entire dataset
42
00:01:45.580 --> 00:01:46.610
measuring the distance
43
00:01:46.610 --> 00:01:48.843
between each point and the two centroids.
44
00:01:49.770 --> 00:01:53.260
Then it will assign each point to the appropriate cluster.
45
00:01:53.260 --> 00:01:54.240
And due to the fact
46
00:01:54.240 --> 00:01:56.160
that there was such a big change of position
47
00:01:56.160 --> 00:01:58.130
for the centroid in the red class,
48
00:01:58.130 --> 00:01:59.870
we end up seeing one of the data points
49
00:01:59.870 --> 00:02:02.210
change from blue to red.
50
00:02:02.210 --> 00:02:03.690
We're not gonna keep going through it,
51
00:02:03.690 --> 00:02:05.620
but the same process will continue
52
00:02:05.620 --> 00:02:07.590
until the data points are stable
53
00:02:07.590 --> 00:02:10.363
and you stop seeing them switch between clusters.
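Putting the assignment and update steps together, the whole loop is short. This is a toy sketch, not production code (a real implementation like scikit-learn's KMeans adds smarter seeding, restarts, and empty-cluster handling):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start the k centroids at randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = np.argmin(distances, axis=1)
        # Stable: no point switched clusters, so we've converged.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its cluster.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids
```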
54
00:02:11.390 --> 00:02:13.720
And since I brought it up a few guides ago,
55
00:02:13.720 --> 00:02:16.070
it's probably worth mentioning again,
56
00:02:16.070 --> 00:02:17.580
but this is also the point
57
00:02:17.580 --> 00:02:20.920
where the algorithm officially converges as well.
58
00:02:20.920 --> 00:02:22.230
And like before,
59
00:02:22.230 --> 00:02:23.650
we're not gonna spend too much time
60
00:02:23.650 --> 00:02:25.490
talking about convergence,
61
00:02:25.490 --> 00:02:27.260
but it is an aspect of k-means
62
00:02:27.260 --> 00:02:29.260
that you absolutely need to be aware of
63
00:02:29.260 --> 00:02:31.460
because even though we don't have to do it ourselves,
64
00:02:31.460 --> 00:02:34.463
mathematically, it's how clusters are actually separated.
65
00:02:35.750 --> 00:02:38.910
Now, we kind of talked about this general concept already,
66
00:02:38.910 --> 00:02:40.240
but because a data point
67
00:02:40.240 --> 00:02:42.070
can only be assigned to one class,
68
00:02:42.070 --> 00:02:44.250
or in this case, one cluster,
69
00:02:44.250 --> 00:02:46.600
the algorithm has to eliminate the possibility
70
00:02:46.600 --> 00:02:49.380
of any shared space between clusters.
71
00:02:49.380 --> 00:02:52.720
So to make sure that's the case, it establishes limits,
72
00:02:52.720 --> 00:02:54.110
ensuring a cluster can come
73
00:02:54.110 --> 00:02:56.060
really, really close to another cluster
74
00:02:56.060 --> 00:02:57.483
without actually touching.
75
00:02:58.480 --> 00:03:01.330
And I think that about covers all the basic principles
76
00:03:01.330 --> 00:03:04.660
of the k-means algorithm that you'll need to know.
77
00:03:04.660 --> 00:03:06.330
So I guess to finish off the guide,
78
00:03:06.330 --> 00:03:08.040
we'll go through some of the drawbacks
79
00:03:08.040 --> 00:03:10.230
that are pretty much guaranteed to come your way
80
00:03:10.230 --> 00:03:11.793
when you start to use k-means.
81
00:03:13.180 --> 00:03:14.450
The first one we'll talk about
82
00:03:14.450 --> 00:03:17.220
is determining the best k value to use,
83
00:03:17.220 --> 00:03:19.920
which is easily the most important to get right,
84
00:03:19.920 --> 00:03:23.600
but also arguably the most annoying to figure out.
85
00:03:23.600 --> 00:03:25.510
Unfortunately, this is a process
86
00:03:25.510 --> 00:03:27.340
that needs to be done manually,
87
00:03:27.340 --> 00:03:30.493
and it's not always clear what value you should go with.
88
00:03:31.460 --> 00:03:33.210
We're not gonna be going through it right now,
89
00:03:33.210 --> 00:03:35.870
but there are some general suggestions that you can use
90
00:03:35.870 --> 00:03:38.090
to actually figure out your k value.
91
00:03:38.090 --> 00:03:40.990
But there isn't a hard and fast rule that you can turn to.
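One of the most common of those suggestions is the elbow method: fit k-means for a range of k values and look for the bend where the inertia curve flattens out. A sketch with scikit-learn, assuming X is your feature matrix:

```python
from sklearn.cluster import KMeans

# Record the within-cluster sum of squares for each candidate k.
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)
# Plot k against inertias; the "elbow" where adding clusters
# stops paying off is a reasonable choice for k.
```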
92
00:03:41.870 --> 00:03:43.370
The second issue with k-means
93
00:03:43.370 --> 00:03:45.963
is that it tends to be sensitive to initialization.
94
00:03:46.870 --> 00:03:47.710
In this visual,
95
00:03:47.710 --> 00:03:49.760
you can see that depending on the location
96
00:03:49.760 --> 00:03:51.080
of the initial seed,
97
00:03:51.080 --> 00:03:53.180
it can have a fairly substantial effect
98
00:03:53.180 --> 00:03:55.203
on how the data ends up being clustered.
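In practice, that sensitivity is usually softened by smarter seeding plus multiple restarts, which scikit-learn supports out of the box:

```python
from sklearn.cluster import KMeans

# k-means++ spreads the initial seeds apart, and n_init reruns
# the algorithm from several starting points, keeping the run
# with the lowest inertia.
model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = model.fit_predict(X)
```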
99
00:03:56.530 --> 00:03:57.630
The third shortcoming
100
00:03:57.630 --> 00:04:00.740
is that it's also sensitive to outliers.
101
00:04:00.740 --> 00:04:02.010
This is kind of related
102
00:04:02.010 --> 00:04:04.290
to choosing the appropriate number of clusters
103
00:04:04.290 --> 00:04:05.810
because if there's a single point
104
00:04:05.810 --> 00:04:08.170
that's too far from the rest of the dataset,
105
00:04:08.170 --> 00:04:09.080
a lot of the time
106
00:04:09.080 --> 00:04:11.603
it'll end up being placed as its own cluster.
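Here's a tiny made-up illustration of that failure mode: one extreme point captures a centroid all by itself, and the two real groups get merged.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two natural groups near 0 and 5, plus one outlier at 100.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0], [100.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the outlier lands in a cluster of its own
```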
107
00:04:12.750 --> 00:04:14.310
The final drawback to k-means
108
00:04:14.310 --> 00:04:17.030
is that it's highly sensitive to distance.
109
00:04:17.030 --> 00:04:18.820
So depending on the data,
110
00:04:18.820 --> 00:04:21.193
feature scaling can be incredibly important.
111
00:04:22.030 --> 00:04:24.640
It's not a huge deal if you skip over scaling
112
00:04:24.640 --> 00:04:26.130
if you know that all of your data
113
00:04:26.130 --> 00:04:28.260
is using the same unit of measure,
114
00:04:28.260 --> 00:04:30.810
because the only real difference you'll probably see
115
00:04:30.810 --> 00:04:33.400
is just in the processing speed.
116
00:04:33.400 --> 00:04:35.730
But that does lead us into the other reason
117
00:04:35.730 --> 00:04:37.990
why you'd wanna scale your data.
118
00:04:37.990 --> 00:04:39.030
On the off chance
119
00:04:39.030 --> 00:04:41.340
you end up working with a really big dataset
120
00:04:41.340 --> 00:04:43.280
that also happens to contain data
121
00:04:43.280 --> 00:04:45.880
measured at a really high order of magnitude,
122
00:04:45.880 --> 00:04:47.570
it's probably gonna be beneficial
123
00:04:47.570 --> 00:04:49.420
to shrink down that measurement scale
124
00:04:49.420 --> 00:04:51.573
to save you a lot of processing time.
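Standardizing the features before clustering covers both of those cases. A minimal sketch, assuming X is your raw feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Put every feature on the same scale (zero mean, unit variance)
# so no single large-magnitude feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)
```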
125
00:04:52.990 --> 00:04:54.450
Now, with all that being said,
126
00:04:54.450 --> 00:04:56.990
I think it's time to officially wrap this guide up.
127
00:04:56.990 --> 00:04:59.543
So as always, I will see you in the next guide.