WEBVTT
1
00:00:03.980 --> 00:00:05.020
In this guide,
2
00:00:05.020 --> 00:00:06.580
I'd like to take some time to talk
3
00:00:06.580 --> 00:00:09.160
about one of the characteristics of machine learning
4
00:00:09.160 --> 00:00:11.330
that sometimes gets overlooked.
5
00:00:11.330 --> 00:00:14.433
It's the idea of a bias-variance trade-off.
6
00:00:15.360 --> 00:00:18.130
When we talk about bias or bias error,
7
00:00:18.130 --> 00:00:19.770
it's really referring to the difference
8
00:00:19.770 --> 00:00:21.430
between the expected value,
9
00:00:21.430 --> 00:00:24.520
which is what the predictor function estimates, and the true value,
10
00:00:24.520 --> 00:00:26.863
which is what the outcome should actually be.
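(Written out rather than spoken, that definition is usually expressed like this, where f-hat is the learned predictor and f is the true relationship; this formula isn't shown in the video, it's just the standard notation.)

```latex
\mathrm{Bias}\big[\hat{f}(x)\big] \;=\; \mathbb{E}\big[\hat{f}(x)\big] \;-\; f(x)
```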
11
00:00:28.040 --> 00:00:28.940
Now in this image,
12
00:00:28.940 --> 00:00:30.400
we're gonna be looking at the bias
13
00:00:30.400 --> 00:00:32.380
of a linear regression model
14
00:00:32.380 --> 00:00:36.100
and the red prediction function is the expected value
15
00:00:36.100 --> 00:00:39.133
and the black line represents the true value.
16
00:00:40.150 --> 00:00:43.830
You can also see that no matter how we adjust the fit line,
17
00:00:43.830 --> 00:00:46.410
it's never going to have the flexibility
18
00:00:46.410 --> 00:00:48.660
to accurately replicate the curvature
19
00:00:48.660 --> 00:00:50.233
of the true relationship.
20
00:00:51.180 --> 00:00:53.550
So the inability of the machine learning method
21
00:00:53.550 --> 00:00:55.650
to capture the true relationship
22
00:00:55.650 --> 00:00:57.763
is what we consider to be bias.
23
00:00:59.370 --> 00:01:00.410
But in contrast,
24
00:01:00.410 --> 00:01:03.460
we could use a squiggly line that's really flexible
25
00:01:03.460 --> 00:01:06.780
and able to adapt to the curves of the training data,
26
00:01:06.780 --> 00:01:09.363
essentially eliminating the bias altogether.
27
00:01:10.280 --> 00:01:12.480
But if we were to add testing data,
28
00:01:12.480 --> 00:01:14.300
the squiggly line suddenly appears
29
00:01:14.300 --> 00:01:17.440
to really struggle with making accurate predictions,
30
00:01:17.440 --> 00:01:18.963
even with no bias.
31
00:01:20.300 --> 00:01:22.640
Now, from that we'd consider the squiggly line
32
00:01:22.640 --> 00:01:24.610
to have high variability
33
00:01:24.610 --> 00:01:26.880
because the training and testing data
34
00:01:26.880 --> 00:01:29.473
have drastically different sums of squares.
35
00:01:30.510 --> 00:01:31.740
What that means for the model
36
00:01:31.740 --> 00:01:33.530
is that it becomes harder to predict
37
00:01:33.530 --> 00:01:37.020
how well the squiggly line will actually perform.
38
00:01:37.020 --> 00:01:39.830
Sometimes it might make a perfect prediction.
39
00:01:39.830 --> 00:01:42.683
Then other times it just might be really far off.
40
00:01:43.760 --> 00:01:44.960
On the other hand,
41
00:01:44.960 --> 00:01:48.590
we know the straight line has significantly higher bias
42
00:01:48.590 --> 00:01:51.140
because it's not able to capture the true curvature
43
00:01:51.140 --> 00:01:52.223
of the training data.
44
00:01:53.380 --> 00:01:54.700
But using this visual,
45
00:01:54.700 --> 00:01:58.340
we can see it has a low variance because the sums of squares
46
00:01:58.340 --> 00:02:02.163
for both the training and testing sets are fairly similar.
47
00:02:03.030 --> 00:02:04.700
So this model might only give us
48
00:02:04.700 --> 00:02:06.990
good predictions, not great.
49
00:02:06.990 --> 00:02:08.600
But at the very least,
50
00:02:08.600 --> 00:02:11.543
we know they'll be consistently good predictions.
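As a rough sketch of what the figures are showing (this is my own illustration with made-up data, not code from the course), we can fit a plain straight line and a very flexible high-degree polynomial to the same noisy curve and compare their sums of squares on the training and testing sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    # Made-up "true" relationship, just for illustration
    return np.sin(1.5 * x)

x_train = np.sort(rng.uniform(0, 5, 20))
y_train = true_curve(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 5, 20))
y_test = true_curve(x_test) + rng.normal(0, 0.2, x_test.size)

def sum_of_squares(coeffs, x, y):
    """Sum of squared residuals for a fitted numpy polynomial."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

line = np.polyfit(x_train, y_train, deg=1)       # rigid straight line
squiggle = np.polyfit(x_train, y_train, deg=15)  # flexible "squiggly" line

print("straight line train SSE:", sum_of_squares(line, x_train, y_train))
print("straight line test  SSE:", sum_of_squares(line, x_test, y_test))
print("squiggly line train SSE:", sum_of_squares(squiggle, x_train, y_train))
print("squiggly line test  SSE:", sum_of_squares(squiggle, x_test, y_test))

# Typically the straight line's two SSEs are similar (low variance) but never
# small (bias), while the squiggle's training SSE is tiny and its testing SSE
# is far larger (low bias, high variance).
```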
51
00:02:13.040 --> 00:02:14.270
It's not always true,
52
00:02:14.270 --> 00:02:17.060
but typically algorithms that are high in bias
53
00:02:17.060 --> 00:02:19.610
are able to learn a little bit faster.
54
00:02:19.610 --> 00:02:21.600
So if your primary goal is speed
55
00:02:21.600 --> 00:02:25.120
and you're not too concerned with really accurate results,
56
00:02:25.120 --> 00:02:28.543
an algorithm that's high in bias is probably okay to use.
57
00:02:29.410 --> 00:02:31.480
But if you're modeling a complex function
58
00:02:31.480 --> 00:02:33.620
and you need pinpoint results
59
00:02:33.620 --> 00:02:35.370
an algorithm lower in bias
60
00:02:35.370 --> 00:02:37.193
is probably the better option.
61
00:02:38.420 --> 00:02:39.500
At this point in the course,
62
00:02:39.500 --> 00:02:41.870
we've already covered the two most common examples
63
00:02:41.870 --> 00:02:43.870
of high bias algorithms,
64
00:02:43.870 --> 00:02:47.500
which are linear regression and logistic regression.
65
00:02:47.500 --> 00:02:48.690
And we've also covered
66
00:02:48.690 --> 00:02:51.550
the two most common low bias algorithms as well,
67
00:02:51.550 --> 00:02:54.263
which are decision trees and K-nearest neighbors.
68
00:02:55.200 --> 00:02:56.820
I should probably also mention
69
00:02:56.820 --> 00:02:58.350
that it's pretty much impossible
70
00:02:58.350 --> 00:03:01.730
to avoid some level of variance in an algorithm,
71
00:03:01.730 --> 00:03:04.640
but ideally speaking there shouldn't be too much change
72
00:03:04.640 --> 00:03:06.793
from one training set to the next.
73
00:03:08.000 --> 00:03:11.500
Unfortunately that's not always gonna be the case
74
00:03:11.500 --> 00:03:13.230
because some algorithms happen
75
00:03:13.230 --> 00:03:16.963
to be more strongly influenced by the specifics of their data sets.
76
00:03:18.500 --> 00:03:21.500
Usually nonlinear algorithms like the decision tree
77
00:03:21.500 --> 00:03:24.880
or K-nearest neighbors have a lot of flexibility,
78
00:03:24.880 --> 00:03:27.290
resulting in higher variance.
79
00:03:27.290 --> 00:03:30.220
Meanwhile, the more rigid algorithms like linear regression
80
00:03:30.220 --> 00:03:33.233
and logistic regression tend to have lower variance.
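One way to see that sensitivity (again, my own rough sketch using scikit-learn, not code from the course) is to refit a rigid model and a flexible model on many different training samples and watch how much their predictions at a single point move around:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_training_set(n=40):
    """Draw a fresh, made-up training set from the same noisy curve."""
    x = rng.uniform(0, 5, n)
    y = np.sin(1.5 * x) + rng.normal(0, 0.3, n)
    return x.reshape(-1, 1), y

x_query = np.array([[2.5]])   # fixed point where the predictions are compared
linear_preds, tree_preds = [], []

for _ in range(200):          # refit both models on 200 different training sets
    X, y = make_training_set()
    linear_preds.append(LinearRegression().fit(X, y).predict(x_query)[0])
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_query)[0])

# The spread of predictions across training sets is the variance being described.
print("linear regression prediction std:", np.std(linear_preds))
print("decision tree     prediction std:", np.std(tree_preds))
# Typically the fully grown tree's predictions jump around far more from one
# training set to the next than the straight line's do.
```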
81
00:03:34.400 --> 00:03:35.540
Even though it's impossible
82
00:03:35.540 --> 00:03:38.020
to avoid bias and variance altogether,
83
00:03:38.020 --> 00:03:41.130
every good supervised machine learning algorithm works
84
00:03:41.130 --> 00:03:43.680
towards reducing the two as much as it can
85
00:03:43.680 --> 00:03:45.983
to maximize the predictive performance.
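As a side note that isn't in the spoken guide: for squared-error problems this push-and-pull can actually be written down. The expected test error at a point splits into the two pieces we've been discussing plus irreducible noise, although the individual terms can't be computed in practice because the true relationship is unknown:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  \;=\; \mathrm{Bias}\big[\hat{f}(x)\big]^2
  \;+\; \mathrm{Var}\big[\hat{f}(x)\big]
  \;+\; \sigma^2
```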
86
00:03:47.010 --> 00:03:48.340
And for whatever reason,
87
00:03:48.340 --> 00:03:51.110
I've always looked at the bias-variance trade-off
88
00:03:51.110 --> 00:03:53.963
kind of like an optimization problem for a business.
89
00:03:54.900 --> 00:03:56.420
And really what I mean by that
90
00:03:56.420 --> 00:03:57.890
is let's pretend for a minute
91
00:03:57.890 --> 00:04:00.090
that we're the owners of a coffee shop
92
00:04:00.090 --> 00:04:04.000
and we decide we wanna increase our overall income.
93
00:04:04.000 --> 00:04:05.960
To do that, we decide that we wanna lower
94
00:04:05.960 --> 00:04:08.010
our operating expenses.
95
00:04:08.010 --> 00:04:11.240
So the first thing we do is cut down on labor.
96
00:04:11.240 --> 00:04:13.050
Let's say a little bit of time goes by
97
00:04:13.050 --> 00:04:15.740
and the plan seems to be working really well.
98
00:04:15.740 --> 00:04:18.030
So now we decide to reduce our overhead
99
00:04:18.030 --> 00:04:20.230
by ordering less product.
100
00:04:20.230 --> 00:04:22.930
Initially we'll say the plan works,
101
00:04:22.930 --> 00:04:24.410
but after a few more months,
102
00:04:24.410 --> 00:04:25.880
we notice that we keep running out
103
00:04:25.880 --> 00:04:28.010
of all of our essential goods.
104
00:04:28.010 --> 00:04:29.120
And not only that,
105
00:04:29.120 --> 00:04:31.040
but now our staff hates us
106
00:04:31.040 --> 00:04:32.420
and we're losing customers
107
00:04:32.420 --> 00:04:34.033
because we're so understaffed.
108
00:04:35.210 --> 00:04:38.600
So to offset that, we decide to add a little bit more labor
109
00:04:38.600 --> 00:04:42.010
and increase our ordering just a little bit more.
110
00:04:42.010 --> 00:04:44.750
And we finally end up finding our sweet spot,
111
00:04:44.750 --> 00:04:46.590
which maximizes our margins
112
00:04:46.590 --> 00:04:49.673
all while keeping both the staff and the customers happy.
113
00:04:50.980 --> 00:04:51.860
And that to me
114
00:04:51.860 --> 00:04:55.080
is how I like to look at the bias-variance trade-off.
115
00:04:55.080 --> 00:04:57.400
It's gonna be impossible to change one
116
00:04:57.400 --> 00:05:00.690
without affecting the other one, at least a little bit.
117
00:05:00.690 --> 00:05:02.970
And unfortunately, if you lose sight of that,
118
00:05:02.970 --> 00:05:05.570
you're going to end up with a really horrible model.
119
00:05:06.620 --> 00:05:08.730
Now with all of that being said,
120
00:05:08.730 --> 00:05:09.910
the last thing that I'll mention
121
00:05:09.910 --> 00:05:11.480
before I wrap the guide up
122
00:05:11.480 --> 00:05:13.850
is that unfortunately there's no real way
123
00:05:13.850 --> 00:05:16.770
of calculating the bias-variance trade-off,
124
00:05:16.770 --> 00:05:19.070
but as a framework, bias and variance
125
00:05:19.070 --> 00:05:21.020
give us a few more tools that we can use
126
00:05:21.020 --> 00:05:24.330
to better understand the behavior of an algorithm,
127
00:05:24.330 --> 00:05:26.020
which hopefully results in a model
128
00:05:26.020 --> 00:05:28.210
that's capable of making really strong
129
00:05:28.210 --> 00:05:30.610
and accurate predictions.
130
00:05:30.610 --> 00:05:32.910
And for now, I will wrap this up
131
00:05:32.910 --> 00:05:34.813
and I will see you in the next guide.