Principal Component Analysis Overview

WEBVTT

1
00:00:05.750 --> 00:00:07.910
Apparently I lied to you in the last guide,

2
00:00:07.910 --> 00:00:09.200
because before we get into

3
00:00:09.200 --> 00:00:11.250
building the actual clustering model,

4
00:00:11.250 --> 00:00:13.110
we need to briefly go through something called

5
00:00:13.110 --> 00:00:17.530
a matrix decomposition and principal component analysis.

6
00:00:17.530 --> 00:00:19.820
This is a fairly complex topic.

7
00:00:19.820 --> 00:00:21.120
So I thought for this guide,

8
00:00:21.120 --> 00:00:22.360
it would probably be best

9
00:00:22.360 --> 00:00:24.900
to only take a surface level approach.

10
00:00:24.900 --> 00:00:27.250
That way you can at least start to familiarize yourself

11
00:00:27.250 --> 00:00:30.470
with the terms, but also avoid having to dive headfirst

12
00:00:30.470 --> 00:00:34.250
into a bunch of linear algebra and probability theory.

13
00:00:34.250 --> 00:00:35.640
So to start things off,

14
00:00:35.640 --> 00:00:39.150
you should probably know why we use tools like PCA.

15
00:00:39.150 --> 00:00:41.210
Well, the big overarching reason

16
00:00:41.210 --> 00:00:42.840
is to lower the number of features

17
00:00:42.840 --> 00:00:45.350
an algorithm uses to create a model.

18
00:00:45.350 --> 00:00:46.540
Which in machine learning,

19
00:00:46.540 --> 00:00:49.093
we refer to as dimensionality reduction.

20
00:00:50.020 --> 00:00:52.140
In general, there's only a couple of reasons

21
00:00:52.140 --> 00:00:53.370
a machine learning developer

22
00:00:53.370 --> 00:00:56.320
would implement dimensionality reduction.

23
00:00:56.320 --> 00:00:58.900
The first and probably the most common

24
00:00:58.900 --> 00:01:01.200
is for data compression.

25
00:01:01.200 --> 00:01:04.870
Which will not only help save on memory or disk space

26
00:01:04.870 --> 00:01:08.540
but it also speeds up how quickly an algorithm can learn.

27
00:01:08.540 --> 00:01:11.420
The second reason is what we're gonna be using it for.

28
00:01:11.420 --> 00:01:13.320
And that's to help us create a visual.

29
00:01:14.917 --> 00:01:17.350
As many times as I've done this,

30
00:01:17.350 --> 00:01:18.670
I still haven't figured out

31
00:01:18.670 --> 00:01:20.320
the best way to start.

32
00:01:20.320 --> 00:01:22.970
So I'm just gonna lead it off with a couple diagrams

33
00:01:22.970 --> 00:01:24.793
to show you how PCA works.

34
00:01:25.770 --> 00:01:27.040
So in the first example,

35
00:01:27.040 --> 00:01:28.350
we have a tiny data set

36
00:01:28.350 --> 00:01:30.730
that's in two dimensional space.

37
00:01:30.730 --> 00:01:32.550
Well, what PCA aims to do

38
00:01:32.550 --> 00:01:35.930
is work towards finding a lower dimensional subspace

39
00:01:35.930 --> 00:01:38.370
to project the higher dimensional data onto,

40
00:01:38.370 --> 00:01:41.560
with the goal of minimizing the projection error.

41
00:01:41.560 --> 00:01:44.500
In our case, the data started in two dimensions.

42
00:01:44.500 --> 00:01:47.150
So PCA really has no other choice

43
00:01:47.150 --> 00:01:49.200
than to look for a one dimensional line

44
00:01:49.200 --> 00:01:51.840
that best reduces the distances.

45
00:01:51.840 --> 00:01:55.300
So if we go ahead and rotate the original data,

46
00:01:55.300 --> 00:01:56.880
it makes it a little easier

47
00:01:56.880 --> 00:01:59.180
to imagine how the data will be projected

48
00:01:59.180 --> 00:02:01.113
onto a one dimensional number line.
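To make that 2-D-to-1-D projection concrete, here is a minimal sketch with made-up points (not the diagram's data) of how scikit-learn would do it:

```python
# Minimal sketch: project a tiny 2-D data set onto the 1-D line PCA finds.
# The points below are invented for illustration.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]])

pca = PCA(n_components=1)      # ask for a single direction (a 1-D subspace)
X_1d = pca.fit_transform(X)    # each point's coordinate along that line

print(pca.components_)         # the direction of the best-fit line
print(X_1d)                    # the projected, one-dimensional data
```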

49
00:02:02.360 --> 00:02:03.770
Now, once we start working

50
00:02:03.770 --> 00:02:05.750
in dimensions greater than two,

51
00:02:05.750 --> 00:02:08.410
PCA has a few more options to choose from.

52
00:02:08.410 --> 00:02:11.510
In this example, we're working in three dimensional space.

53
00:02:11.510 --> 00:02:13.850
So we can use PCA to reduce the data

54
00:02:13.850 --> 00:02:16.510
to either one or two dimensions.

55
00:02:16.510 --> 00:02:18.680
This one's a little harder for me to draw

56
00:02:18.680 --> 00:02:21.060
but for reduction into two dimensional space,

57
00:02:21.060 --> 00:02:23.620
it's gonna work about the same way.

58
00:02:23.620 --> 00:02:26.560
Only this time, PCA needs to find two vectors

59
00:02:26.560 --> 00:02:28.700
that are able to represent the plane

60
00:02:28.700 --> 00:02:30.350
that the three-dimensional data

61
00:02:30.350 --> 00:02:32.100
is gonna be projected on.

62
00:02:32.100 --> 00:02:34.190
And like before, the projection error

63
00:02:34.190 --> 00:02:35.780
is still gonna be used.

64
00:02:35.780 --> 00:02:39.120
But this time, PCA measures the orthogonal distance

65
00:02:39.120 --> 00:02:40.383
from point to plane.

66
00:02:41.330 --> 00:02:44.810
And if you're curious about how vector projection works,

67
00:02:44.810 --> 00:02:47.180
it's directly related to the dot product.

68
00:02:47.180 --> 00:02:49.480
And there's a bunch of really good resources out there

69
00:02:49.480 --> 00:02:50.880
if you wanna check them out.
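If you want a concrete starting point, here is a small sketch of vector projection with the dot product (my own example, not something shown in the video):

```python
# Project a point p onto the direction of a unit vector u, then measure
# the orthogonal projection error. The values are placeholders.
import numpy as np

p = np.array([3.0, 4.0])                 # a data point
u = np.array([1.0, 1.0])
u = u / np.linalg.norm(u)                # unit vector spanning the 1-D subspace

projection = (p @ u) * u                 # dot product gives p's length along u
error = np.linalg.norm(p - projection)   # orthogonal distance from p to the line

print(projection, error)
```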

70
00:02:52.150 --> 00:02:54.490
Now, the last chunk of information I'd like to cover

71
00:02:54.490 --> 00:02:56.560
before we run through some of the code,

72
00:02:56.560 --> 00:02:59.280
is how exactly PCA is able to figure out

73
00:02:59.280 --> 00:03:01.663
what vectors are gonna work best for the data.

74
00:03:02.710 --> 00:03:04.150
And since this is the point

75
00:03:04.150 --> 00:03:06.860
where the math starts to get a little more complicated,

76
00:03:06.860 --> 00:03:10.620
we're gonna basically just do an overview of the procedures.

77
00:03:10.620 --> 00:03:12.260
So we already know

78
00:03:12.260 --> 00:03:15.580
that PCA works towards dimensionality reduction,

79
00:03:15.580 --> 00:03:16.700
but to do that,

80
00:03:16.700 --> 00:03:18.540
it first needs to compute something called

81
00:03:18.540 --> 00:03:20.033
the covariance matrix.

82
00:03:21.130 --> 00:03:22.900
I'm assuming that covariance matrix

83
00:03:22.900 --> 00:03:26.240
is probably gonna be a new concept for most of you.

84
00:03:26.240 --> 00:03:28.710
So I think the easiest way to understand it,

85
00:03:28.710 --> 00:03:31.043
is just to relate it to regular variance.

86
00:03:32.160 --> 00:03:33.410
If regular variance

87
00:03:33.410 --> 00:03:36.650
measures the variation of a single random variable,

88
00:03:36.650 --> 00:03:39.950
like the height of a person within a population,

89
00:03:39.950 --> 00:03:42.290
then covariance is the measure of how much

90
00:03:42.290 --> 00:03:45.240
two random variables vary together.

91
00:03:45.240 --> 00:03:46.920
So instead of just height,

92
00:03:46.920 --> 00:03:48.480
it might be height and weight

93
00:03:48.480 --> 00:03:50.283
of a person within a population.

94
00:03:51.180 --> 00:03:54.150
Now, applying that to principal component analysis,

95
00:03:54.150 --> 00:03:56.040
the covariance matrix is nothing more

96
00:03:56.040 --> 00:03:57.860
than a square-shaped matrix

97
00:03:57.860 --> 00:03:59.300
that's made up of the covariance

98
00:03:59.300 --> 00:04:00.730
between each pair of elements

99
00:04:00.730 --> 00:04:02.660
in a random vector.

100
00:04:02.660 --> 00:04:04.160
And if you go searching for it,

101
00:04:04.160 --> 00:04:07.510
you'll probably see it written a ton of different ways.

102
00:04:07.510 --> 00:04:11.510
But this is the general equation for a covariance matrix.
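The equation itself only appears on screen, so it isn't captured in the transcript; one common way to write the covariance matrix for n samples with mean x-bar is

C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{T}

and, as a quick sanity check (my own snippet, not the video's code), NumPy's np.cov computes the same thing:

```python
# Check that the hand-built covariance matrix matches NumPy's np.cov.
# Rows are samples, columns are features; the data is random.
import numpy as np

X = np.random.default_rng(0).normal(size=(100, 3))
X_centered = X - X.mean(axis=0)

C_manual = (X_centered.T @ X_centered) / (len(X) - 1)
print(np.allclose(C_manual, np.cov(X, rowvar=False)))   # True
```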

103
00:04:11.510 --> 00:04:14.110
The last point I wanna make about the covariance matrix

104
00:04:14.110 --> 00:04:15.720
is that in linear algebra,

105
00:04:15.720 --> 00:04:18.100
a matrix and vector have a relationship

106
00:04:18.100 --> 00:04:19.220
that's similar to a function

107
00:04:19.220 --> 00:04:21.123
and its input variable in calculus.

108
00:04:22.100 --> 00:04:24.800
So for example, if we pass in x,

109
00:04:24.800 --> 00:04:26.793
the output we get is f of x.

110
00:04:27.820 --> 00:04:31.060
Well, the same can be true for a matrix and a vector.

111
00:04:31.060 --> 00:04:34.283
If vector u goes in, vector Au comes out.

112
00:04:35.150 --> 00:04:37.120
Now when it comes to PCA,

113
00:04:37.120 --> 00:04:39.300
the vectors that are most important to us

114
00:04:39.300 --> 00:04:40.470
are the vectors that come out

115
00:04:40.470 --> 00:04:42.440
pointing in the exact same direction

116
00:04:42.440 --> 00:04:43.793
they were originally in.

117
00:04:44.750 --> 00:04:47.130
Any vector that can maintain this direction,

118
00:04:47.130 --> 00:04:48.643
we refer to as an eigenvector.

119
00:04:51.622 --> 00:04:54.240
In this animation, you can get a much better idea

120
00:04:54.240 --> 00:04:56.100
as to how it all works.

121
00:04:56.100 --> 00:04:57.850
If you watch the pink vectors,

122
00:04:57.850 --> 00:04:59.310
you'll notice they're able to maintain

123
00:04:59.310 --> 00:05:02.360
their parallel relationship with the pink axis,

124
00:05:02.360 --> 00:05:04.460
even as everything is being stretched out.

125
00:05:06.690 --> 00:05:08.050
And the same is gonna be true

126
00:05:08.050 --> 00:05:11.173
for all of the blue vectors in regard to the blue axis.

127
00:05:12.640 --> 00:05:15.290
So we'd consider both the pink and blue vectors

128
00:05:15.290 --> 00:05:16.240
to be eigenvectors.

129
00:05:18.040 --> 00:05:21.200
In contrast, if you only watch the red vectors,

130
00:05:21.200 --> 00:05:24.810
you can see how their orientation to the gray line changes,

131
00:05:24.810 --> 00:05:27.060
and that's because they are not eigenvectors.

132
00:05:28.600 --> 00:05:30.500
Now that parallel relationship

133
00:05:30.500 --> 00:05:32.810
can be explained through this equation,

134
00:05:32.810 --> 00:05:36.570
which states that the product of matrix A and vector u

135
00:05:36.570 --> 00:05:39.543
is equal to the product of vector u and Lambda.

136
00:05:40.490 --> 00:05:44.410
We already know that matrix A is the covariance matrix.

137
00:05:44.410 --> 00:05:46.640
But for reasons we don't need to get into

138
00:05:46.640 --> 00:05:48.210
when it's used in this equation

139
00:05:48.210 --> 00:05:51.723
it's also a matrix representation of a linear transformation.

140
00:05:52.750 --> 00:05:54.350
We also have vector u

141
00:05:54.350 --> 00:05:57.220
which we already know is an eigenvector.

142
00:05:57.220 --> 00:05:58.320
Then we have Lambda,

143
00:05:58.320 --> 00:06:00.970
which is more commonly referred to as the eigenvalue.

144
00:06:02.100 --> 00:06:04.610
And that's used as a scaling factor,

145
00:06:04.610 --> 00:06:07.160
allowing vector u to adjust its magnitude

146
00:06:07.160 --> 00:06:09.160
without having to change its direction.
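To make that equation concrete, here's a hedged NumPy sketch (random, made-up data) showing that an eigenvector u of a covariance matrix A really does satisfy A u = lambda u:

```python
# Verify A u = lambda u for a covariance matrix using NumPy's
# eigendecomposition. The data is random and only for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
A = np.cov(X, rowvar=False)                 # the covariance matrix

eigenvalues, eigenvectors = np.linalg.eig(A)
u = eigenvectors[:, 0]                      # one eigenvector
lam = eigenvalues[0]                        # its eigenvalue (lambda)

print(np.allclose(A @ u, lam * u))          # True: same direction, scaled by lambda
```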

147
00:06:10.110 --> 00:06:12.000
Now, the final step to all of this

148
00:06:12.000 --> 00:06:14.573
is the computation of the principal components.

149
00:06:15.730 --> 00:06:17.060
For this we're actually gonna have to

150
00:06:17.060 --> 00:06:18.860
backtrack just a little bit

151
00:06:18.860 --> 00:06:21.803
and go back to the equation for the covariance matrix.

152
00:06:23.270 --> 00:06:25.450
The only part of the equation we're gonna be looking at,

153
00:06:25.450 --> 00:06:26.893
is this part right here.

154
00:06:28.460 --> 00:06:31.240
Where x-bar, which represents the mean,

155
00:06:31.240 --> 00:06:34.730
is being subtracted from the feature values.

156
00:06:34.730 --> 00:06:36.860
And the whole point of that is to make sure the data

157
00:06:36.860 --> 00:06:38.710
is gonna be centered around the mean.

158
00:06:39.910 --> 00:06:42.130
Okay, so I had to bring that up

159
00:06:42.130 --> 00:06:43.600
because the principal components

160
00:06:43.600 --> 00:06:45.490
that we're gonna label as T

161
00:06:45.490 --> 00:06:48.355
are equal to the product of the centered features

162
00:06:48.355 --> 00:06:49.680
and the eigenvectors,

163
00:06:49.680 --> 00:06:52.593
which in this equation we're also calling the loadings.

164
00:06:54.100 --> 00:06:56.100
I know it's not terribly obvious,

165
00:06:56.100 --> 00:06:58.630
but the principal components actually represent

166
00:06:58.630 --> 00:07:00.620
the directions of maximum variance

167
00:07:00.620 --> 00:07:02.620
of the decomposed matrix.

168
00:07:02.620 --> 00:07:03.880
And the eigenvectors,

169
00:07:03.880 --> 00:07:05.810
which we also call loadings

170
00:07:05.810 --> 00:07:07.910
can be viewed as the weights,

171
00:07:07.910 --> 00:07:10.960
indicating how closely related each original feature is

172
00:07:10.960 --> 00:07:12.313
to the principal component.
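Putting those two ideas into code, here is a rough sketch (invented data, and I'm assuming scikit-learn's default settings) of T = centered features times the loadings, checked against PCA's own transform:

```python
# Compute the principal components T by hand and compare with scikit-learn.
# X is made-up data; components_ holds the eigenvectors (loadings) as rows.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))

pca = PCA(n_components=3).fit(X)
W = pca.components_.T                    # loadings: eigenvectors as columns

X_centered = X - X.mean(axis=0)          # subtract the mean from each feature
T = X_centered @ W                       # the principal components (scores)

print(np.allclose(T, pca.transform(X)))  # True with default settings
```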

173
00:07:13.210 --> 00:07:15.400
Now, to finally wrap up the explanation,

174
00:07:15.400 --> 00:07:17.690
we can use the eigenvalue to help explain

175
00:07:17.690 --> 00:07:19.890
how much variance each principal component

176
00:07:19.890 --> 00:07:21.780
was able to capture.

177
00:07:21.780 --> 00:07:24.350
And that's actually a really important metric.

178
00:07:24.350 --> 00:07:26.320
So we'll be looking at that a little bit more

179
00:07:26.320 --> 00:07:28.120
now that we're moving into the code.

180
00:07:34.000 --> 00:07:36.580
All right, the only new import we need to make

181
00:07:36.580 --> 00:07:38.370
is the PCA algorithm,

182
00:07:38.370 --> 00:07:41.270
which we get from the decomposition module.

183
00:07:41.270 --> 00:07:44.520
And I'm not 100% sure if I mentioned it earlier,

184
00:07:44.520 --> 00:07:48.230
but in order for PCA to work, the data needs to be scaled.

185
00:07:48.230 --> 00:07:50.243
So I had to make that import as well.
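The transcript doesn't show the code itself, but based on this description the imports probably look something like this (the exact lines in the video may differ):

```python
from sklearn.decomposition import PCA             # the PCA algorithm
from sklearn.preprocessing import StandardScaler  # scaling, which PCA needs
```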

186
00:07:51.540 --> 00:07:52.660
When it comes to the data,

187
00:07:52.660 --> 00:07:55.120
we're gonna be working with three features,

188
00:07:55.120 --> 00:07:57.163
height, weight, and foot length.

189
00:07:58.560 --> 00:07:59.970
And over in the plot pane,

190
00:07:59.970 --> 00:08:02.370
you can see how the data looks when we graph it.

191
00:08:03.640 --> 00:08:05.810
Now for the PCA algorithm itself,

192
00:08:05.810 --> 00:08:07.640
the syntax is pretty much the same

193
00:08:07.640 --> 00:08:10.290
as all of the other algorithms we've worked with.

194
00:08:10.290 --> 00:08:12.900
We start by assigning it to a variable

195
00:08:12.900 --> 00:08:15.000
then depending on what we wanna do with it,

196
00:08:15.000 --> 00:08:17.350
we can either fit and transform the data

197
00:08:17.350 --> 00:08:19.850
or we can just fit the data like we're doing here.
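A minimal sketch of what that setup might look like; the column names come from the video, but the values and variable names are placeholders I've assumed:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data with the three features mentioned in the video.
df = pd.DataFrame({
    'height':      [160, 172, 181, 168, 190],
    'weight':      [55,   70,  84,  62,  95],
    'foot_length': [23,   26,  28,  24,  30],
})

scaled = StandardScaler().fit_transform(df)   # PCA expects scaled features

pca = PCA(n_components=3)   # keep all three principal components for now
pca.fit(scaled)             # fit only; fit_transform would also project the data
```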

198
00:08:21.150 --> 00:08:22.350
We probably won't need it

199
00:08:22.350 --> 00:08:25.100
but let me just pull up the documentation really quick.

200
00:08:32.190 --> 00:08:33.410
There we go.

201
00:08:33.410 --> 00:08:36.890
And now you can see the official definition if you want.

202
00:08:36.890 --> 00:08:37.723
So to start off,

203
00:08:37.723 --> 00:08:40.760
the only parameter we passed in was n_components.

204
00:08:40.760 --> 00:08:41.850
And that's just indicating

205
00:08:41.850 --> 00:08:44.563
how many principal components we want PCA to use.

206
00:08:45.930 --> 00:08:47.100
The next thing we're gonna look at

207
00:08:47.100 --> 00:08:48.850
are the two attributes we're using.

208
00:08:52.100 --> 00:08:53.350
The first is components_.

209
00:08:53.350 --> 00:08:56.150
And that's just a reference to the principal components.

210
00:08:57.190 --> 00:08:58.840
Now I'm gonna jump over to the console,

211
00:08:58.840 --> 00:09:01.683
so we can actually take a look at what we're getting back.

212
00:09:08.580 --> 00:09:10.260
Okay, so what we're looking at

213
00:09:10.260 --> 00:09:11.730
are the directional coordinates

214
00:09:11.730 --> 00:09:14.783
for principal component one, two and three.

215
00:09:16.050 --> 00:09:17.770
So if you remember back in the guide,

216
00:09:17.770 --> 00:09:19.630
principal components point in the direction

217
00:09:19.630 --> 00:09:22.820
of maximum variance, and I'm pretty sure

218
00:09:22.820 --> 00:09:25.160
these are unit vectors as well.

219
00:09:25.160 --> 00:09:26.170
But just to double check,

220
00:09:26.170 --> 00:09:27.880
let's figure out the magnitude.

221
00:09:27.880 --> 00:09:29.920
Which we can do by calculating the square root

222
00:09:29.920 --> 00:09:30.913
of the sum of the squared components.
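Continuing from the fitted pca object in the sketch above, that magnitude check might look like this (my own version of the console work, not a verbatim copy):

```python
import numpy as np

# Each row of components_ is one principal component's direction vector.
magnitudes = np.sqrt((pca.components_ ** 2).sum(axis=1))
print(magnitudes)   # each value should come out very close to 1.0
```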

223
00:10:04.030 --> 00:10:05.150
That's pretty close to one.

224
00:10:05.150 --> 00:10:07.420
So yeah, all three of these principal components

225
00:10:07.420 --> 00:10:09.023
are in unit vector form as well.

226
00:10:10.380 --> 00:10:12.960
Okay, now that brings us to the second attribute

227
00:10:12.960 --> 00:10:15.900
which is explained_variance_ratio_.

228
00:10:15.900 --> 00:10:17.090
Now, this is the attribute

229
00:10:17.090 --> 00:10:18.650
that relates back to the eigenvalues

230
00:10:18.650 --> 00:10:20.400
that we were talking about earlier.

231
00:10:23.460 --> 00:10:25.740
The documentation actually has a pretty good explanation

232
00:10:25.740 --> 00:10:26.573
for this one.

233
00:10:26.573 --> 00:10:27.530
So what we're getting back

234
00:10:27.530 --> 00:10:29.010
is the percentage of variance

235
00:10:29.010 --> 00:10:32.060
explained by each of the principal components.

236
00:10:32.060 --> 00:10:33.540
So the first principal component,

237
00:10:33.540 --> 00:10:36.910
captures just under 96% of the variance.

238
00:10:36.910 --> 00:10:38.900
Then the second and third principal components

239
00:10:38.900 --> 00:10:41.560
each explain less than a tenth of a percent.

240
00:10:41.560 --> 00:10:44.810
So to me, those two are kind of pointless to even use.

241
00:10:44.810 --> 00:10:45.750
And if you think about it,

242
00:10:45.750 --> 00:10:47.400
it makes a lot of sense

243
00:10:47.400 --> 00:10:48.660
because weight and foot length

244
00:10:48.660 --> 00:10:50.860
are probably strongly correlated to height.

245
00:10:50.860 --> 00:10:52.850
So most of the variance can be explained

246
00:10:52.850 --> 00:10:54.773
by using just one principal component.
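Again continuing from the fitted pca above, reading those ratios is just one attribute access (the just-under-96% figure is specific to the video's data set, so numbers from the placeholder data will differ):

```python
# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
# With strongly correlated features, the first value dominates and the
# remaining values are close to zero.
```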

247
00:10:56.160 --> 00:10:59.180
Now this was obviously an oversimplified dataset.

248
00:10:59.180 --> 00:11:01.220
So just know that more often than not,

249
00:11:01.220 --> 00:11:02.620
you're probably gonna have to use

250
00:11:02.620 --> 00:11:04.400
more than one principal component

251
00:11:04.400 --> 00:11:05.983
to explain that much variance.

252
00:11:07.130 --> 00:11:09.750
Oh and one last thing before we wrap it up,

253
00:11:09.750 --> 00:11:11.850
there isn't really a predetermined percentage

254
00:11:11.850 --> 00:11:14.370
of explained variance that we're trying to get to.

255
00:11:14.370 --> 00:11:16.490
It's all gonna depend on your data

256
00:11:16.490 --> 00:11:18.520
and like everything else in machine learning,

257
00:11:18.520 --> 00:11:20.520
the trade offs you're willing to accept.

258
00:11:21.470 --> 00:11:24.130
If you're able to explain 75% of the variance

259
00:11:24.130 --> 00:11:26.220
with 10 principal components

260
00:11:26.220 --> 00:11:29.500
and 98% with 100 principal components,

261
00:11:29.500 --> 00:11:30.990
you'll need to assess the trade-off

262
00:11:30.990 --> 00:11:34.163
and decide what's more important, speed or accuracy.
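One common, rough way to weigh that trade-off is to keep the smallest number of components whose cumulative explained variance clears a threshold you choose; the 0.95 below and the fabricated wide data set are just an example, not a recommendation from the video:

```python
# Pick the smallest number of components explaining at least 95% of the
# variance, on a made-up data set with many correlated features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 5))                       # 5 hidden factors
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1          # smallest count clearing 95%
print(n_keep)

# scikit-learn can also do this selection directly: PCA(n_components=0.95)
# keeps just enough components to explain at least 95% of the variance.
```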

263
00:11:35.990 --> 00:11:39.040
All right, that finally brings us to the end of the guide.

264
00:11:39.040 --> 00:11:41.700
And I know this topic can be a little bit confusing,

265
00:11:41.700 --> 00:11:43.970
but hopefully this all comes together in the next guide,

266
00:11:43.970 --> 00:11:46.240
when we actually get to build a k-means model

267
00:11:46.240 --> 00:11:48.860
and then use PCA to help graph it.

268
00:11:48.860 --> 00:11:51.080
So in the words of a wise man

269
00:11:51.080 --> 00:11:52.370
I will wrap things up

270
00:11:52.370 --> 00:11:54.120
and I'll see you in the next guide.