WEBVTT
1
00:00:04.560 --> 00:00:06.250
Over the course of the next few guides,
2
00:00:06.250 --> 00:00:07.750
we're gonna be spending our time talking
3
00:00:07.750 --> 00:00:10.840
about clustering, and depending on the task at hand,
4
00:00:10.840 --> 00:00:14.050
scikit-learn offers a few different clustering classes
5
00:00:14.050 --> 00:00:15.203
for us to choose from.
6
00:00:16.210 --> 00:00:19.610
So whether you're working on spam or fraud detection,
7
00:00:19.610 --> 00:00:22.040
cyber profiling, recommender systems,
8
00:00:22.040 --> 00:00:24.080
identifying fake news articles
9
00:00:24.080 --> 00:00:26.730
or mining data for a ride-sharing company,
10
00:00:26.730 --> 00:00:28.060
you'll have plenty of options
11
00:00:28.060 --> 00:00:29.660
within the scikit-learn library.
12
00:00:31.000 --> 00:00:32.580
If you're unfamiliar with clustering,
13
00:00:32.580 --> 00:00:35.250
it's an unsupervised machine learning technique
14
00:00:35.250 --> 00:00:37.730
that attempts to form groups or clusters
15
00:00:37.730 --> 00:00:40.260
of unlabeled data points.
16
00:00:40.260 --> 00:00:41.950
So when we're given a DataFrame,
17
00:00:41.950 --> 00:00:44.920
we can use clustering to help find latent relationships
18
00:00:44.920 --> 00:00:48.020
within the set by grouping each data point based
19
00:00:48.020 --> 00:00:48.963
on its features.
20
00:00:50.190 --> 00:00:52.800
Traditionally, there have really only been two reasons
21
00:00:52.800 --> 00:00:54.670
to implement clustering.
22
00:00:54.670 --> 00:00:57.500
The first is for exploratory analysis.
23
00:00:57.500 --> 00:01:00.250
So if you're looking at it from the data mining perspective,
24
00:01:00.250 --> 00:01:02.530
it can be used as a standalone tool
25
00:01:02.530 --> 00:01:04.920
to better understand a dataset.
26
00:01:04.920 --> 00:01:07.590
And secondly, it can also act as a component
27
00:01:07.590 --> 00:01:09.710
of a supervised learning process
28
00:01:09.710 --> 00:01:11.190
for when you want data clusters
29
00:01:11.190 --> 00:01:12.803
before feeding the data to a model.
30
00:01:14.352 --> 00:01:15.560
So over the next few guides,
31
00:01:15.560 --> 00:01:17.450
we'll be exploring the two most common
32
00:01:17.450 --> 00:01:19.920
clustering algorithms in machine learning:
33
00:01:19.920 --> 00:01:23.123
K-means clustering and hierarchical clustering.
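
As a rough preview of the two algorithms named above, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The toy points and variable names are made up for illustration and are not from the guide. It groups six unlabeled 2-D points into two clusters, once with `KMeans` and once with `AgglomerativeClustering`, which is scikit-learn's hierarchical clustering class.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Toy, made-up data: two well-separated groups of unlabeled 2-D points.
X = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [8.1, 7.9], [7.8, 8.0],
])

# K-means: partitions the points around k centroids.
# random_state is fixed only so repeated runs give the same result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Hierarchical (agglomerative) clustering: starts with every point as its
# own cluster and repeatedly merges the closest pair of clusters.
agglo = AgglomerativeClustering(n_clusters=2)
agglo_labels = agglo.fit_predict(X)

print("K-means labels:     ", kmeans_labels)
print("Hierarchical labels:", agglo_labels)
```

The label numbers themselves (0 or 1) are arbitrary. What matters is which points end up sharing a label, and on data this cleanly separated both approaches should put the first three points in one cluster and the last three in the other.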