WEBVTT
1
00:00:03.980 --> 00:00:05.020
In this guide,
2
00:00:05.020 --> 00:00:06.580
I'd like to take some time to talk
3
00:00:06.580 --> 00:00:09.160
about one of the characteristics of machine learning
4
00:00:09.160 --> 00:00:11.330
that sometimes gets overlooked.
5
00:00:11.330 --> 00:00:14.433
It's the idea of a bias-variance trade-off.
6
00:00:15.360 --> 00:00:18.130
When we talk about bias or bias error,
7
00:00:18.130 --> 00:00:19.770
it's really referring to the difference
8
00:00:19.770 --> 00:00:21.430
between the expected value,
9
00:00:21.430 --> 00:00:24.520
which is what the predictor function estimates, and the true value,
10
00:00:24.520 --> 00:00:26.863
which is what the outcome should actually be.
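(Written out rather than spoken, that definition is usually expressed like this, where f-hat is the learned predictor and f is the true relationship; this formula isn't shown in the video, it's just the standard notation.)

```latex
\mathrm{Bias}\big[\hat{f}(x)\big] \;=\; \mathbb{E}\big[\hat{f}(x)\big] \;-\; f(x)
```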
11
00:00:28.040 --> 00:00:28.940
Now in this image,
12
00:00:28.940 --> 00:00:30.400
we're gonna be looking at the bias
13
00:00:30.400 --> 00:00:32.380
of a linear regression model
14
00:00:32.380 --> 00:00:36.100
and the red prediction function is the expected value
15
00:00:36.100 --> 00:00:39.133
and the black line represents the true value.
16
00:00:40.150 --> 00:00:43.830
You can also see that no matter how we adjust the fit line,
17
00:00:43.830 --> 00:00:46.410
it's never going to have the flexibility
18
00:00:46.410 --> 00:00:48.660
to accurately replicate the curvature
19
00:00:48.660 --> 00:00:50.233
of the true relationship.
20
00:00:51.180 --> 00:00:53.550
So the inability of the machine learning method
21
00:00:53.550 --> 00:00:55.650
to capture the true relationship
22
00:00:55.650 --> 00:00:57.763
is what we consider to be bias.
23
00:00:59.370 --> 00:01:00.410
But in contrast,
24
00:01:00.410 --> 00:01:03.460
we could use a squiggly line that's really flexible
25
00:01:03.460 --> 00:01:06.780
and able to adapt to the curves of the training data,
26
00:01:06.780 --> 00:01:09.363
essentially eliminating the bias altogether.
27
00:01:10.280 --> 00:01:12.480
But if we were to add testing data,
28
00:01:12.480 --> 00:01:14.300
the squiggly line suddenly appears
29
00:01:14.300 --> 00:01:17.440
to really struggle with making accurate predictions,
30
00:01:17.440 --> 00:01:18.963
even with no bias.
31
00:01:20.300 --> 00:01:22.640
Now, from that we'd consider the squiggly line
32
00:01:22.640 --> 00:01:24.610
to have high variability
33
00:01:24.610 --> 00:01:26.880
because the training and testing data
34
00:01:26.880 --> 00:01:29.473
have drastically different sums of squares.
35
00:01:30.510 --> 00:01:31.740
What that means for the model
36
00:01:31.740 --> 00:01:33.530
is that it becomes harder to predict
37
00:01:33.530 --> 00:01:37.020
how well the squiggly line will actually perform.
38
00:01:37.020 --> 00:01:39.830
Sometimes it might make a perfect prediction.
39
00:01:39.830 --> 00:01:42.683
Then other times it just might be really far off.
40
00:01:43.760 --> 00:01:44.960
On the other hand,
41
00:01:44.960 --> 00:01:48.590
we know the straight line has significantly higher bias
42
00:01:48.590 --> 00:01:51.140
because it's not able to capture the true curvature
43
00:01:51.140 --> 00:01:52.223
of the training data.
44
00:01:53.380 --> 00:01:54.700
But using this visual,
45
00:01:54.700 --> 00:01:58.340
we can see it has a low variance because the sums of squares
46
00:01:58.340 --> 00:02:02.163
for both the training and testing sets are fairly similar.
47
00:02:03.030 --> 00:02:04.700
So this model might only give us
48
00:02:04.700 --> 00:02:06.990
good predictions, not great.
49
00:02:06.990 --> 00:02:08.600
But at the very least,
50
00:02:08.600 --> 00:02:11.543
we know they'll be consistently good predictions.
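As a rough sketch of what the figures are showing (this is my own illustration with made-up data, not code from the course), we can fit a plain straight line and a very flexible high-degree polynomial to the same noisy curve and compare their sums of squares on the training and testing sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_curve(x):
    # Made-up "true" relationship, just for illustration
    return np.sin(1.5 * x)

x_train = np.sort(rng.uniform(0, 5, 20))
y_train = true_curve(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 5, 20))
y_test = true_curve(x_test) + rng.normal(0, 0.2, x_test.size)

def sum_of_squares(coeffs, x, y):
    """Sum of squared residuals for a fitted numpy polynomial."""
    return float(np.sum((y - np.polyval(coeffs, x)) ** 2))

line = np.polyfit(x_train, y_train, deg=1)       # rigid straight line
squiggle = np.polyfit(x_train, y_train, deg=15)  # flexible "squiggly" line

print("straight line train SSE:", sum_of_squares(line, x_train, y_train))
print("straight line test  SSE:", sum_of_squares(line, x_test, y_test))
print("squiggly line train SSE:", sum_of_squares(squiggle, x_train, y_train))
print("squiggly line test  SSE:", sum_of_squares(squiggle, x_test, y_test))

# Typically the straight line's two SSEs are similar (low variance) but never
# small (bias), while the squiggle's training SSE is tiny and its testing SSE
# is far larger (low bias, high variance).
```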
51
00:02:13.040 --> 00:02:14.270
It's not always true,
52
00:02:14.270 --> 00:02:17.060
but typically algorithms that are high in bias
53
00:02:17.060 --> 00:02:19.610
are able to learn a little bit faster.
54
00:02:19.610 --> 00:02:21.600
So if your primary goal is speed
55
00:02:21.600 --> 00:02:25.120
and you're not too concerned with really accurate results,
56
00:02:25.120 --> 00:02:28.543
an algorithm that's high in bias is probably okay to use.
57
00:02:29.410 --> 00:02:31.480
But if you're modeling a complex function
58
00:02:31.480 --> 00:02:33.620
and you need pinpoint results
59
00:02:33.620 --> 00:02:35.370
an algorithm lower in bias
60
00:02:35.370 --> 00:02:37.193
is probably the better option.
61
00:02:38.420 --> 00:02:39.500
At this point in the course,
62
00:02:39.500 --> 00:02:41.870
we've already covered the two most common examples
63
00:02:41.870 --> 00:02:43.870
of high bias algorithms,
64
00:02:43.870 --> 00:02:47.500
which are linear regression and logistic regression.
65
00:02:47.500 --> 00:02:48.690
And we've also covered
66
00:02:48.690 --> 00:02:51.550
the two most common low bias algorithms as well,
67
00:02:51.550 --> 00:02:54.263
which are decision trees and K-nearest neighbors.
68
00:02:55.200 --> 00:02:56.820
I should probably also mention
69
00:02:56.820 --> 00:02:58.350
that it's pretty much impossible
70
00:02:58.350 --> 00:03:01.730
to avoid some level of variance in an algorithm,
71
00:03:01.730 --> 00:03:04.640
but ideally speaking there shouldn't be too much change
72
00:03:04.640 --> 00:03:06.793
from one training set to the next.
73
00:03:08.000 --> 00:03:11.500
Unfortunately that's not always gonna be the case
74
00:03:11.500 --> 00:03:13.230
because some algorithms happen
75
00:03:13.230 --> 00:03:16.963
to be more strongly influenced by the specifics of their data sets.
76
00:03:18.500 --> 00:03:21.500
Usually nonlinear algorithms like the decision tree
77
00:03:21.500 --> 00:03:24.880
or K-nearest neighbors have a lot of flexibility,
78
00:03:24.880 --> 00:03:27.290
resulting in higher variance.
79
00:03:27.290 --> 00:03:30.220
Meanwhile, the more rigid algorithms like linear regression
80
00:03:30.220 --> 00:03:33.233
and logistic regression tend to have lower variance.
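One way to see that sensitivity (again, my own rough sketch using scikit-learn, not code from the course) is to refit a rigid model and a flexible model on many different training samples and watch how much their predictions at a single point move around:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_training_set(n=40):
    """Draw a fresh, made-up training set from the same noisy curve."""
    x = rng.uniform(0, 5, n)
    y = np.sin(1.5 * x) + rng.normal(0, 0.3, n)
    return x.reshape(-1, 1), y

x_query = np.array([[2.5]])   # fixed point where the predictions are compared
linear_preds, tree_preds = [], []

for _ in range(200):          # refit both models on 200 different training sets
    X, y = make_training_set()
    linear_preds.append(LinearRegression().fit(X, y).predict(x_query)[0])
    tree_preds.append(DecisionTreeRegressor().fit(X, y).predict(x_query)[0])

# The spread of predictions across training sets is the variance being described.
print("linear regression prediction std:", np.std(linear_preds))
print("decision tree     prediction std:", np.std(tree_preds))
# Typically the fully grown tree's predictions jump around far more from one
# training set to the next than the straight line's do.
```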
81
00:03:34.400 --> 00:03:35.540
Even though it's impossible
82
00:03:35.540 --> 00:03:38.020
to avoid bias and variance altogether,
83
00:03:38.020 --> 00:03:41.130
every good supervised machine learning algorithm works
84
00:03:41.130 --> 00:03:43.680
towards reducing the two as much as it can
85
00:03:43.680 --> 00:03:45.983
to maximize the predictive performance.
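As a side note that isn't in the spoken guide: for squared-error problems this push-and-pull can actually be written down. The expected test error at a point splits into the two pieces we've been discussing plus irreducible noise, although the individual terms can't be computed in practice because the true relationship is unknown:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  \;=\; \mathrm{Bias}\big[\hat{f}(x)\big]^2
  \;+\; \mathrm{Var}\big[\hat{f}(x)\big]
  \;+\; \sigma^2
```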
86
00:03:47.010 --> 00:03:48.340
And for whatever reason,
87
00:03:48.340 --> 00:03:51.110
I've always looked at the bias-variance trade-off
88
00:03:51.110 --> 00:03:53.963
kind of like an optimization problem for a business.
89
00:03:54.900 --> 00:03:56.420
And really what I mean by that
90
00:03:56.420 --> 00:03:57.890
is let's pretend for a minute
91
00:03:57.890 --> 00:04:00.090
that we're the owners of a coffee shop
92
00:04:00.090 --> 00:04:04.000
and we decide we wanna increase our overall income.
93
00:04:04.000 --> 00:04:05.960
To do that, we decide that we wanna lower
94
00:04:05.960 --> 00:04:08.010
our operating expenses.
95
00:04:08.010 --> 00:04:11.240
So the first thing we do is cut down on labor.
96
00:04:11.240 --> 00:04:13.050
Let's say a little bit of time goes by
97
00:04:13.050 --> 00:04:15.740
and the plan seems to be working really well.
98
00:04:15.740 --> 00:04:18.030
So now we decide to reduce our overhead
99
00:04:18.030 --> 00:04:20.230
by ordering less product.
100
00:04:20.230 --> 00:04:22.930
Initially we'll say the plan works,
101
00:04:22.930 --> 00:04:24.410
but after a few more months,
102
00:04:24.410 --> 00:04:25.880
we notice that we keep running out
103
00:04:25.880 --> 00:04:28.010
of all of our essential goods.
104
00:04:28.010 --> 00:04:29.120
And not only that,
105
00:04:29.120 --> 00:04:31.040
but now our staff hates us
106
00:04:31.040 --> 00:04:32.420
and we're losing customers
107
00:04:32.420 --> 00:04:34.033
because we're so understaffed.
108
00:04:35.210 --> 00:04:38.600
So to offset that, we decide to add a little bit more labor
109
00:04:38.600 --> 00:04:42.010
and increase our ordering just a little bit more.
110
00:04:42.010 --> 00:04:44.750
And we finally end up finding our sweet spot,
111
00:04:44.750 --> 00:04:46.590
which maximizes our margins
112
00:04:46.590 --> 00:04:49.673
all while keeping both the staff and the customers happy.
113
00:04:50.980 --> 00:04:51.860
And that to me
114
00:04:51.860 --> 00:04:55.080
is how I like to look at the bias-variance trade-off.
115
00:04:55.080 --> 00:04:57.400
It's gonna be impossible to change one
116
00:04:57.400 --> 00:05:00.690
without affecting the other one, at least a little bit.
117
00:05:00.690 --> 00:05:02.970
And unfortunately, if you lose sight of that,
118
00:05:02.970 --> 00:05:05.570
you're going to end up with a really horrible model.
119
00:05:06.620 --> 00:05:08.730
Now with all of that being said,
120
00:05:08.730 --> 00:05:09.910
the last thing that I'll mention
121
00:05:09.910 --> 00:05:11.480
before I wrap the guide up
122
00:05:11.480 --> 00:05:13.850
is that unfortunately there's no real way
123
00:05:13.850 --> 00:05:16.770
of calculating the bias-variance trade-off,
124
00:05:16.770 --> 00:05:19.070
but as a framework, bias and variance
125
00:05:19.070 --> 00:05:21.020
give us a few more tools that we can use
126
00:05:21.020 --> 00:05:24.330
to better understand the behavior of an algorithm,
127
00:05:24.330 --> 00:05:26.020
which hopefully results in a model
128
00:05:26.020 --> 00:05:28.210
that's capable of making really strong
129
00:05:28.210 --> 00:05:30.610
and accurate predictions.
130
00:05:30.610 --> 00:05:32.910
And for now, I will wrap this up
131
00:05:32.910 --> 00:05:34.813
and I will see you in the next guide.