WEBVTT
1
00:00:03.470 --> 00:00:04.790
In this guide,
2
00:00:04.790 --> 00:00:07.750
we'll be covering two really important topics.
3
00:00:07.750 --> 00:00:08.583
The first
4
00:00:08.583 --> 00:00:11.960
is another extremely popular classification algorithm
5
00:00:11.960 --> 00:00:14.280
called logistic regression.
6
00:00:14.280 --> 00:00:16.960
And the second is a group of algorithms
7
00:00:16.960 --> 00:00:19.340
that logistic regression belongs to,
8
00:00:19.340 --> 00:00:22.423
which we refer to as generalized linear models.
9
00:00:24.430 --> 00:00:26.190
Before we get into the material,
10
00:00:26.190 --> 00:00:27.330
you should know that this guide
11
00:00:27.330 --> 00:00:30.530
is gonna be a little more technical than the others.
12
00:00:30.530 --> 00:00:33.360
And I know I've said it a few times already,
13
00:00:33.360 --> 00:00:35.830
but it's definitely worth repeating.
14
00:00:35.830 --> 00:00:39.220
By no means do I expect you to master any of these concepts
15
00:00:39.220 --> 00:00:40.910
by the end of the course.
16
00:00:40.910 --> 00:00:43.210
And if I'm being completely honest,
17
00:00:43.210 --> 00:00:44.330
most of the time,
18
00:00:44.330 --> 00:00:45.510
you won't even need to know
19
00:00:45.510 --> 00:00:47.660
how to do the majority of the math
20
00:00:47.660 --> 00:00:49.700
because you'll have tools like scikit-learn
21
00:00:49.700 --> 00:00:50.803
to do it for you.
22
00:00:51.930 --> 00:00:55.350
However, I do think it's incredibly important
23
00:00:55.350 --> 00:00:56.730
that you have a really solid
24
00:00:56.730 --> 00:00:58.963
conceptual understanding of the material.
25
00:01:00.360 --> 00:01:01.630
With that being said,
26
00:01:01.630 --> 00:01:04.003
let's jump right into logistic regression.
27
00:01:05.130 --> 00:01:07.610
In the intro, I mentioned logistic regression
28
00:01:07.610 --> 00:01:11.210
is another really popular algorithm for machine learning.
29
00:01:11.210 --> 00:01:13.270
And I think that's because of its ability
30
00:01:13.270 --> 00:01:16.253
to answer questions that have two discrete outcomes.
31
00:01:17.810 --> 00:01:19.650
Thinking back, we could have used it
32
00:01:19.650 --> 00:01:22.173
on pretty much everything we've worked on so far.
33
00:01:23.450 --> 00:01:26.660
You can use it to decide if an email is spam or ham,
34
00:01:26.660 --> 00:01:28.653
if an image is a dog or a cat,
35
00:01:29.490 --> 00:01:32.770
if a student will pass or fail an exam,
36
00:01:32.770 --> 00:01:35.523
or to find out if a tumor is benign or malignant.
37
00:01:37.740 --> 00:01:40.840
But to get into some of the more technical stuff,
38
00:01:40.840 --> 00:01:45.130
at its core, logistic regression is a discriminative model,
39
00:01:45.130 --> 00:01:46.930
meaning it doesn't care so much
40
00:01:46.930 --> 00:01:48.900
about how the data was generated,
41
00:01:48.900 --> 00:01:52.453
it really just wants to categorize any new observation.
42
00:01:53.300 --> 00:01:55.710
And to build on that a little bit more,
43
00:01:55.710 --> 00:01:58.950
we also consider it to be a generalized linear model
44
00:01:58.950 --> 00:02:02.563
that uses a logistic function to model a binary dependent variable.
45
00:02:03.800 --> 00:02:05.510
To help you understand what it means
46
00:02:05.510 --> 00:02:08.480
to classify objects using a logistic function,
47
00:02:08.480 --> 00:02:10.323
we can take a look at this visual.
48
00:02:11.180 --> 00:02:13.370
So every logistic regression model
49
00:02:13.370 --> 00:02:15.920
is gonna look something like this.
50
00:02:15.920 --> 00:02:19.440
You'll probably find some variation in the overall shape,
51
00:02:19.440 --> 00:02:21.410
but every logistic regression model
52
00:02:21.410 --> 00:02:24.720
will have the same basic sigmoid or S curve,
53
00:02:24.720 --> 00:02:26.963
which represents the actual function.
54
00:02:28.000 --> 00:02:30.860
You'll also see two horizontal asymptotes
55
00:02:30.860 --> 00:02:33.040
with the upper boundary set at one
56
00:02:33.040 --> 00:02:35.183
and the lower boundary at zero.
57
00:02:36.250 --> 00:02:40.950
The dashed line at 0.5 is gonna be another important marker,
58
00:02:40.950 --> 00:02:42.793
but we'll talk about that later on.
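For reference, the S curve I'm describing is the standard logistic, or sigmoid, function, which squeezes any real-valued input into the range between zero and one:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \lim_{z \to -\infty}\sigma(z) = 0, \qquad \lim_{z \to +\infty}\sigma(z) = 1, \qquad \sigma(0) = 0.5

That's where the two horizontal asymptotes and the dashed line at 0.5 come from.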
59
00:02:45.130 --> 00:02:46.790
Now, really quick,
60
00:02:46.790 --> 00:02:49.780
I'm gonna need to go back in time for just a second,
61
00:02:49.780 --> 00:02:51.720
because when we were going through it,
62
00:02:51.720 --> 00:02:54.250
I purposefully didn't bring this up,
63
00:02:54.250 --> 00:02:55.970
but simple linear regression
64
00:02:55.970 --> 00:02:58.660
was actually the first generalized linear model
65
00:02:58.660 --> 00:03:00.230
that we worked with.
66
00:03:00.230 --> 00:03:02.230
But it's such a straightforward concept
67
00:03:02.230 --> 00:03:05.363
that I hate making it more complicated than it needs to be.
68
00:03:06.730 --> 00:03:10.200
But anyway, when it comes to a generalized linear model,
69
00:03:10.200 --> 00:03:12.000
the most important thing to know
70
00:03:12.000 --> 00:03:15.840
is that they're all made up of three main components:
71
00:03:15.840 --> 00:03:17.860
a probability distribution,
72
00:03:17.860 --> 00:03:20.573
a linear predictor, and a link function.
73
00:03:21.710 --> 00:03:24.800
So usually when I'm explaining logistic regression,
74
00:03:24.800 --> 00:03:27.683
I like to break it up into those three categories.
75
00:03:29.340 --> 00:03:31.700
Starting with probability distribution,
76
00:03:31.700 --> 00:03:33.900
let's think back to linear regression
77
00:03:33.900 --> 00:03:36.070
where one of the basic assumptions
78
00:03:36.070 --> 00:03:37.500
was that the data had to have
79
00:03:37.500 --> 00:03:39.663
a normal, or Gaussian, distribution.
80
00:03:40.710 --> 00:03:44.660
In logistic regression, that type of distribution won't work
81
00:03:44.660 --> 00:03:47.420
because we're no longer modeling a continuous variable
82
00:03:47.420 --> 00:03:50.620
with essentially an infinite number of possible outcomes.
83
00:03:50.620 --> 00:03:53.580
Instead, we're trying to create a model
84
00:03:53.580 --> 00:03:56.873
that makes predictions based on two discrete outcomes.
85
00:03:57.990 --> 00:03:59.990
So to remedy that problem,
86
00:03:59.990 --> 00:04:03.320
logistic regression and other binary classifiers
87
00:04:03.320 --> 00:04:05.830
use something called the Bernoulli distribution,
88
00:04:05.830 --> 00:04:08.000
which helps determine the probability
89
00:04:08.000 --> 00:04:09.363
of a preferred outcome.
90
00:04:10.320 --> 00:04:12.490
When it comes to choosing the preferred outcome,
91
00:04:12.490 --> 00:04:14.240
it's fairly arbitrary
92
00:04:14.240 --> 00:04:16.640
and by no means is what I'm about to tell you
93
00:04:16.640 --> 00:04:18.860
a hard and fast rule.
94
00:04:18.860 --> 00:04:21.220
But generally in machine learning,
95
00:04:21.220 --> 00:04:23.810
the preferred or successful outcome
96
00:04:23.810 --> 00:04:26.410
is associated with whatever outcome
97
00:04:26.410 --> 00:04:29.113
had the higher probability in the training data.
98
00:04:30.900 --> 00:04:33.110
After the preferred outcome is chosen,
99
00:04:33.110 --> 00:04:35.640
it's assigned the value of one,
100
00:04:35.640 --> 00:04:39.463
and the non-preferred or failure outcome is given zero.
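Written out, the Bernoulli distribution for a single outcome coded this way is:

P(Y = y) = p^{y}(1 - p)^{1 - y}, \qquad y \in \{0, 1\}

so the probability of the preferred outcome is P(Y = 1) = p, and the probability of the failure outcome is P(Y = 0) = 1 - p.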
101
00:04:42.270 --> 00:04:45.130
The second component of the generalized linear model
102
00:04:45.130 --> 00:04:46.643
is the linear predictor.
103
00:04:47.700 --> 00:04:49.220
And for logistic regression,
104
00:04:49.220 --> 00:04:51.950
it's pretty much the same as the predictor that we used
105
00:04:51.950 --> 00:04:53.940
in multiple linear regression,
106
00:04:53.940 --> 00:04:56.963
which, if you remember, follows this equation.
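If you don't have the visual handy, the linear predictor takes the standard multiple regression form:

\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n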
107
00:04:58.370 --> 00:05:01.120
In case the specifics are a little fuzzy,
108
00:05:01.120 --> 00:05:03.220
each beta coefficient
109
00:05:03.220 --> 00:05:06.610
is a parameter that acts like a weight for its feature,
110
00:05:06.610 --> 00:05:10.090
and whichever features have the largest coefficients
111
00:05:10.090 --> 00:05:13.910
typically have the strongest influence on the fitted line.
112
00:05:13.910 --> 00:05:16.440
Now, there is a fairly significant difference
113
00:05:16.440 --> 00:05:18.120
in logistic regression
114
00:05:18.120 --> 00:05:20.760
because the goal is no longer value prediction,
115
00:05:20.760 --> 00:05:22.083
it's class prediction.
116
00:05:23.730 --> 00:05:27.470
And just by looking at the two models, it's fairly apparent
117
00:05:27.470 --> 00:05:28.370
there's no way
118
00:05:28.370 --> 00:05:31.340
we can manipulate the parameters of the continuous function
119
00:05:31.340 --> 00:05:33.460
to get a good fit line.
120
00:05:33.460 --> 00:05:35.530
So instead what we'll need to do
121
00:05:35.530 --> 00:05:38.980
is transform the continuous function in such a way
122
00:05:38.980 --> 00:05:40.220
that it's able to relate
123
00:05:40.220 --> 00:05:42.383
to the discrete probability function.
124
00:05:44.750 --> 00:05:47.690
And that leads us into the third and final component
125
00:05:47.690 --> 00:05:49.770
of the generalized linear model,
126
00:05:49.770 --> 00:05:51.343
which is the link function.
127
00:05:53.140 --> 00:05:55.680
Now, as I just mentioned,
128
00:05:55.680 --> 00:05:57.870
link functions are what give us the ability
129
00:05:57.870 --> 00:06:01.940
to relate the distribution function to the linear predictor.
130
00:06:01.940 --> 00:06:04.840
And the link function that we use in logistic regression
131
00:06:04.840 --> 00:06:08.460
is called the logit or log-odds function.
132
00:06:08.460 --> 00:06:11.773
And that's used to calculate the logarithm of the odds.
133
00:06:12.860 --> 00:06:16.240
And really quick, if you're not familiar with the term odds,
134
00:06:16.240 --> 00:06:19.360
it's technically not a straight-up probability.
135
00:06:19.360 --> 00:06:22.603
It's really referring to a ratio of probabilities.
136
00:06:23.630 --> 00:06:26.240
As an example, you know that when we flip a coin,
137
00:06:26.240 --> 00:06:29.530
the probability of it landing on heads is 1/2,
138
00:06:29.530 --> 00:06:32.693
and the probability of it landing on tails is also 1/2.
139
00:06:33.900 --> 00:06:37.903
Well, that equates to a 1:1 ratio of heads to tails.
140
00:06:39.200 --> 00:06:41.610
When we apply that to logistic regression,
141
00:06:41.610 --> 00:06:43.400
the odds that need to be calculated
142
00:06:43.400 --> 00:06:46.360
relate back to the Bernoulli distribution
143
00:06:46.360 --> 00:06:48.150
where the probability of success
144
00:06:48.150 --> 00:06:50.693
is divided by the probability of failure.
145
00:06:51.920 --> 00:06:55.610
Then when we take the logarithm or natural log of the odds,
146
00:06:55.610 --> 00:06:58.313
it gives us the equation of the logit function.
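Putting that together, the odds are the probability of success over the probability of failure, and the logit is the natural log of that ratio:

\text{odds} = \frac{p}{1 - p}, \qquad \text{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)

As a quick check with the coin flip, the odds are 0.5 / 0.5 = 1, and \ln(1) = 0.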
147
00:06:59.810 --> 00:07:02.330
That just leaves us with how the link function
148
00:07:02.330 --> 00:07:04.650
relates to the predictor function,
149
00:07:04.650 --> 00:07:07.400
which is something we've talked about in a prior guide.
150
00:07:09.010 --> 00:07:11.360
I'm not exactly sure which one it was,
151
00:07:11.360 --> 00:07:13.170
but it was in one of the regression guides
152
00:07:13.170 --> 00:07:14.998
where I showed you how to transform
153
00:07:14.998 --> 00:07:17.630
quadratics into a linear model.
154
00:07:17.630 --> 00:07:19.250
Basically, all we had to do
155
00:07:19.250 --> 00:07:20.560
was take the natural log
156
00:07:20.560 --> 00:07:23.510
of both the independent and dependent variables.
157
00:07:23.510 --> 00:07:24.513
And that was it.
158
00:07:25.940 --> 00:07:28.410
Then if we wanted to transform the linear function
159
00:07:28.410 --> 00:07:30.070
back to the quadratic,
160
00:07:30.070 --> 00:07:32.470
all we had to do was exponentiate with Euler's number, e,
161
00:07:32.470 --> 00:07:34.963
since the exponential function is the inverse of the natural log.
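As a quick sketch of that idea, assuming a power relationship like y = a x^2, taking the natural log of both variables gives a linear form, and exponentiating undoes it:

\ln y = \ln a + 2\ln x, \qquad y = e^{\ln a + 2\ln x} = a x^2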
162
00:07:37.230 --> 00:07:39.240
Using that same principle,
163
00:07:39.240 --> 00:07:41.780
the logarithm in the logit function
164
00:07:41.780 --> 00:07:44.550
transforms the original prediction function
165
00:07:44.550 --> 00:07:45.900
into a function that's able
166
00:07:45.900 --> 00:07:47.963
to relate to the distribution function.
167
00:07:49.500 --> 00:07:52.220
Now, if we assume there's a linear relationship
168
00:07:52.220 --> 00:07:55.430
between the logit function and the linear predictor,
169
00:07:55.430 --> 00:07:57.793
it allows us to postulate this equation.
170
00:07:59.260 --> 00:08:02.560
And if we apply the inverse function to both sides,
171
00:08:02.560 --> 00:08:05.000
it results in this equation,
172
00:08:05.000 --> 00:08:07.990
which can then be used to link predicted probabilities
173
00:08:07.990 --> 00:08:10.713
to a linear combination of input variables.
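For reference, the postulated equation sets the log-odds equal to the linear predictor:

\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n

and applying the inverse, the exponential function, to both sides and solving for p gives back the sigmoid:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}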
174
00:08:12.540 --> 00:08:13.700
Before we finish up,
175
00:08:13.700 --> 00:08:15.300
there's really just one last thing
176
00:08:15.300 --> 00:08:16.790
that we need to talk about,
177
00:08:16.790 --> 00:08:18.543
and that's the decision boundary.
178
00:08:19.510 --> 00:08:20.790
I kind of alluded to this
179
00:08:20.790 --> 00:08:22.840
at the very beginning of the guide,
180
00:08:22.840 --> 00:08:25.640
but the decision boundary in logistic regression
181
00:08:25.640 --> 00:08:28.093
is set, by default, at the 0.5 marker.
182
00:08:29.040 --> 00:08:31.160
And that's where you can kind of think of it
183
00:08:31.160 --> 00:08:33.370
as an imaginary divide
184
00:08:33.370 --> 00:08:37.033
where the class decision is really just a 50-50 guess.
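In equation form, the default decision rule is:

\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{otherwise} \end{cases}

and since \sigma(0) = 0.5, that boundary sits exactly where the linear predictor \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n equals zero.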
185
00:08:39.080 --> 00:08:40.960
And I think that about covers
186
00:08:40.960 --> 00:08:43.490
everything that we needed to go through.
187
00:08:43.490 --> 00:08:46.260
And hopefully most of it made sense
188
00:08:46.260 --> 00:08:47.890
because in the next guide,
189
00:08:47.890 --> 00:08:50.830
we'll be moving forward by using scikit-learn
190
00:08:50.830 --> 00:08:53.480
to build a logistic regression model.
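If you'd like a rough preview, here's a minimal sketch of that workflow on a made-up dataset; the data and variable names are just placeholders, not what we'll use in the next guide.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Made-up binary classification data: two discrete outcomes, a few features.
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # scikit-learn fits the beta coefficients of the linear predictor for us.
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # predict_proba returns the sigmoid output (a probability for each class);
    # predict applies the default 0.5 decision boundary to pick a class.
    print(model.predict_proba(X_test[:5]))
    print(model.predict(X_test[:5]))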
191
00:08:53.480 --> 00:08:56.590
But for now, I will wrap things up
192
00:08:56.590 --> 00:08:58.440
and I will see you in the next guide.