WEBVTT
1
00:00:03.470 --> 00:00:04.790
In this guide,
2
00:00:04.790 --> 00:00:07.750
we'll be covering two really important topics.
3
00:00:07.750 --> 00:00:08.583
The first
4
00:00:08.583 --> 00:00:11.960
is another extremely popular classification algorithm
5
00:00:11.960 --> 00:00:14.280
called logistic regression.
6
00:00:14.280 --> 00:00:16.960
And the second is a group of algorithms
7
00:00:16.960 --> 00:00:19.340
that logistic regression belongs to,
8
00:00:19.340 --> 00:00:22.423
which we refer to as generalized linear models.
9
00:00:24.430 --> 00:00:26.190
Before we get into the material,
10
00:00:26.190 --> 00:00:27.330
you should know that this guide
11
00:00:27.330 --> 00:00:30.530
is gonna be a little more technical than the others.
12
00:00:30.530 --> 00:00:33.360
And I know I've said it a few times already,
13
00:00:33.360 --> 00:00:35.830
but it's definitely worth repeating.
14
00:00:35.830 --> 00:00:39.220
By no means do I expect you to master any of these concepts
15
00:00:39.220 --> 00:00:40.910
by the end of the course.
16
00:00:40.910 --> 00:00:43.210
And if I'm being completely honest,
17
00:00:43.210 --> 00:00:44.330
most of the time,
18
00:00:44.330 --> 00:00:45.510
you won't even need to know
19
00:00:45.510 --> 00:00:47.660
how to do the majority of the math
20
00:00:47.660 --> 00:00:49.700
because you'll have tools like scikit-learn
21
00:00:49.700 --> 00:00:50.803
to do it for you.
22
00:00:51.930 --> 00:00:55.350
However, I do think it's incredibly important
23
00:00:55.350 --> 00:00:56.730
that you have a really solid
24
00:00:56.730 --> 00:00:58.963
conceptual understanding of the material.
25
00:01:00.360 --> 00:01:01.630
With that being said,
26
00:01:01.630 --> 00:01:04.003
let's jump right into logistic regression.
27
00:01:05.130 --> 00:01:07.610
In the intro, I mentioned logistic regression
28
00:01:07.610 --> 00:01:11.210
is another really popular algorithm for machine learning.
29
00:01:11.210 --> 00:01:13.270
And I think that's because of its ability
30
00:01:13.270 --> 00:01:16.253
to answer questions that have two discrete outcomes.
31
00:01:17.810 --> 00:01:19.650
Thinking back, we could have used it
32
00:01:19.650 --> 00:01:22.173
on pretty much everything we've worked on so far.
33
00:01:23.450 --> 00:01:26.660
You can use it to decide if an email is spam or ham,
34
00:01:26.660 --> 00:01:28.653
if an image is a dog or a cat,
35
00:01:29.490 --> 00:01:32.770
if a student will pass or fail an exam,
36
00:01:32.770 --> 00:01:35.523
or to find out if a tumor is benign or malignant.
37
00:01:37.740 --> 00:01:40.840
But to get into some of the more technical stuff,
38
00:01:40.840 --> 00:01:45.130
at its core, logistic regression is a discriminative model,
39
00:01:45.130 --> 00:01:46.930
meaning it doesn't care so much
40
00:01:46.930 --> 00:01:48.900
about how the data was generated,
41
00:01:48.900 --> 00:01:52.453
it really just wants to categorize any new observation.
42
00:01:53.300 --> 00:01:55.710
And to build on that a little bit more,
43
00:01:55.710 --> 00:01:58.950
we also consider it to be a generalized linear model
44
00:01:58.950 --> 00:02:02.563
that uses a logistic function to model a binary dependent variable.
45
00:02:03.800 --> 00:02:05.510
To help you understand what it means
46
00:02:05.510 --> 00:02:08.480
to classify objects using a logistic function,
47
00:02:08.480 --> 00:02:10.323
we can take a look at this visual.
48
00:02:11.180 --> 00:02:13.370
So every logistic regression model
49
00:02:13.370 --> 00:02:15.920
is gonna look something like this.
50
00:02:15.920 --> 00:02:19.440
You'll probably find some variation in the overall shape,
51
00:02:19.440 --> 00:02:21.410
but every logistic regression model
52
00:02:21.410 --> 00:02:24.720
will have the same basic sigmoid or S curve,
53
00:02:24.720 --> 00:02:26.963
which represents the actual function.
54
00:02:28.000 --> 00:02:30.860
You'll also see two horizontal asymptotes
55
00:02:30.860 --> 00:02:33.040
with the upper boundary set at one
56
00:02:33.040 --> 00:02:35.183
and the lower boundary at zero.
57
00:02:36.250 --> 00:02:40.950
The dashed line at 0.5 is gonna be another important marker,
58
00:02:40.950 --> 00:02:42.793
but we'll talk about that later on.
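For reference, the S curve I'm describing is the standard logistic, or sigmoid, function, which squeezes any real-valued input into the range between zero and one:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \lim_{z \to -\infty}\sigma(z) = 0, \qquad \lim_{z \to +\infty}\sigma(z) = 1, \qquad \sigma(0) = 0.5

That's where the two horizontal asymptotes and the dashed line at 0.5 come from.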
59
00:02:45.130 --> 00:02:46.790
Now, really quick,
60
00:02:46.790 --> 00:02:49.780
I'm gonna need to go back in time for just a second,
61
00:02:49.780 --> 00:02:51.720
because when we were going through it,
62
00:02:51.720 --> 00:02:54.250
I purposefully didn't bring this up,
63
00:02:54.250 --> 00:02:55.970
but simple linear regression
64
00:02:55.970 --> 00:02:58.660
was actually the first generalized linear model
65
00:02:58.660 --> 00:03:00.230
that we worked with.
66
00:03:00.230 --> 00:03:02.230
But it's such a straightforward concept
67
00:03:02.230 --> 00:03:05.363
that I hate making it more complicated than it needs to be.
68
00:03:06.730 --> 00:03:10.200
But anyway, when it comes to a generalized linear model,
69
00:03:10.200 --> 00:03:12.000
the most important thing to know
70
00:03:12.000 --> 00:03:15.840
is that they're all made up of three main components:
71
00:03:15.840 --> 00:03:17.860
a probability distribution,
72
00:03:17.860 --> 00:03:20.573
a linear predictor, and a link function.
73
00:03:21.710 --> 00:03:24.800
So usually when I'm explaining logistic regression,
74
00:03:24.800 --> 00:03:27.683
I like to break it up into those three categories.
75
00:03:29.340 --> 00:03:31.700
Starting with probability distribution,
76
00:03:31.700 --> 00:03:33.900
let's think back to linear regression
77
00:03:33.900 --> 00:03:36.070
where one of the basic assumptions
78
00:03:36.070 --> 00:03:37.500
was that the data had to have
79
00:03:37.500 --> 00:03:39.663
a normal, or Gaussian, distribution.
80
00:03:40.710 --> 00:03:44.660
In logistic regression, that type of distribution won't work
81
00:03:44.660 --> 00:03:47.420
because we're no longer modeling a continuous variable
82
00:03:47.420 --> 00:03:50.620
with essentially an infinite number of possible outcomes.
83
00:03:50.620 --> 00:03:53.580
Instead, we're trying to create a model
84
00:03:53.580 --> 00:03:56.873
that makes predictions based on two discrete outcomes.
85
00:03:57.990 --> 00:03:59.990
So to remedy that problem,
86
00:03:59.990 --> 00:04:03.320
logistic regression and other binary classifiers
87
00:04:03.320 --> 00:04:05.830
use something called the Bernoulli distribution,
88
00:04:05.830 --> 00:04:08.000
which helps determine the probability
89
00:04:08.000 --> 00:04:09.363
of a preferred outcome.
90
00:04:10.320 --> 00:04:12.490
When it comes to choosing the preferred outcome,
91
00:04:12.490 --> 00:04:14.240
it's fairly arbitrary
92
00:04:14.240 --> 00:04:16.640
and by no means is what I'm about to tell you
93
00:04:16.640 --> 00:04:18.860
a hard and fast rule.
94
00:04:18.860 --> 00:04:21.220
But generally in machine learning,
95
00:04:21.220 --> 00:04:23.810
the preferred or successful outcome
96
00:04:23.810 --> 00:04:26.410
is associated with whatever outcome
97
00:04:26.410 --> 00:04:29.113
had the higher probability in the training data.
98
00:04:30.900 --> 00:04:33.110
After the preferred outcome is chosen,
99
00:04:33.110 --> 00:04:35.640
it's assigned the value of one,
100
00:04:35.640 --> 00:04:39.463
and the non-preferred or failure outcome is given zero.
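Written out, the Bernoulli distribution for a single outcome coded this way is:

P(Y = y) = p^{y}(1 - p)^{1 - y}, \qquad y \in \{0, 1\}

so the probability of the preferred outcome is P(Y = 1) = p, and the probability of the failure outcome is P(Y = 0) = 1 - p.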
101
00:04:42.270 --> 00:04:45.130
The second component of the generalized linear model
102
00:04:45.130 --> 00:04:46.643
is the linear predictor.
103
00:04:47.700 --> 00:04:49.220
And for logistic regression,
104
00:04:49.220 --> 00:04:51.950
it's pretty much the same as the predictor that we used
105
00:04:51.950 --> 00:04:53.940
in multiple linear regression,
106
00:04:53.940 --> 00:04:56.963
which, if you remember, follows this equation.
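If you don't have the visual handy, the linear predictor takes the standard multiple regression form:

\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n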
107
00:04:58.370 --> 00:05:01.120
In case the specifics are a little fuzzy,
108
00:05:01.120 --> 00:05:03.220
each beta coefficient
109
00:05:03.220 --> 00:05:06.610
is a parameter that acts like a weight for its feature,
110
00:05:06.610 --> 00:05:10.090
and whichever features have the largest coefficients
111
00:05:10.090 --> 00:05:13.910
typically have the strongest influence on the fitted line.
112
00:05:13.910 --> 00:05:16.440
Now, there is a fairly significant difference
113
00:05:16.440 --> 00:05:18.120
in logistic regression
114
00:05:18.120 --> 00:05:20.760
because the goal is no longer value prediction,
115
00:05:20.760 --> 00:05:22.083
it's class prediction.
116
00:05:23.730 --> 00:05:27.470
And just by looking at the two models, it's fairly apparent
117
00:05:27.470 --> 00:05:28.370
there's no way
118
00:05:28.370 --> 00:05:31.340
we can manipulate the parameters of the continuous function
119
00:05:31.340 --> 00:05:33.460
to get a good fit line.
120
00:05:33.460 --> 00:05:35.530
So instead what we'll need to do
121
00:05:35.530 --> 00:05:38.980
is transform the continuous function in such a way
122
00:05:38.980 --> 00:05:40.220
that it's able to relate
123
00:05:40.220 --> 00:05:42.383
to the discrete probability function.
124
00:05:44.750 --> 00:05:47.690
And that leads us into the third and final component
125
00:05:47.690 --> 00:05:49.770
of the generalized linear model,
126
00:05:49.770 --> 00:05:51.343
which is the link function.
127
00:05:53.140 --> 00:05:55.680
Now, as I just mentioned,
128
00:05:55.680 --> 00:05:57.870
link functions are what give us the ability
129
00:05:57.870 --> 00:06:01.940
to relate the distribution function to the linear predictor.
130
00:06:01.940 --> 00:06:04.840
And the link function that we use in logistic regression
131
00:06:04.840 --> 00:06:08.460
is called the logit or log-odds function.
132
00:06:08.460 --> 00:06:11.773
And that's used to calculate the logarithm of the odds.
133
00:06:12.860 --> 00:06:16.240
And really quick, if you're not familiar with the term odds,
134
00:06:16.240 --> 00:06:19.360
it's technically not a straight-up probability.
135
00:06:19.360 --> 00:06:22.603
It's really referring to a ratio of probabilities.
136
00:06:23.630 --> 00:06:26.240
As an example, you know that when we flip a coin,
137
00:06:26.240 --> 00:06:29.530
the probability of it landing on heads is 1/2,
138
00:06:29.530 --> 00:06:32.693
and the probability of it landing on tails is also 1/2.
139
00:06:33.900 --> 00:06:37.903
Well, that equates to a 1:1 ratio of heads to tails.
140
00:06:39.200 --> 00:06:41.610
When we apply that to logistic regression,
141
00:06:41.610 --> 00:06:43.400
the odds that need to be calculated
142
00:06:43.400 --> 00:06:46.360
relate back to the Bernoulli distribution
143
00:06:46.360 --> 00:06:48.150
where the probability of success
144
00:06:48.150 --> 00:06:50.693
is divided by the probability of failure.
145
00:06:51.920 --> 00:06:55.610
Then when we take the logarithm or natural log of the odds,
146
00:06:55.610 --> 00:06:58.313
it gives us the equation of the logit function.
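Putting that together, the odds are the probability of success over the probability of failure, and the logit is the natural log of that ratio:

\text{odds} = \frac{p}{1 - p}, \qquad \text{logit}(p) = \ln\!\left(\frac{p}{1 - p}\right)

As a quick check with the coin flip, the odds are 0.5 / 0.5 = 1, and \ln(1) = 0.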
147
00:06:59.810 --> 00:07:02.330
That just leaves us with how the link function
148
00:07:02.330 --> 00:07:04.650
relates to the predictor function,
149
00:07:04.650 --> 00:07:07.400
which is something we've talked about in a prior guide.
150
00:07:09.010 --> 00:07:11.360
I'm not exactly sure which one it was,
151
00:07:11.360 --> 00:07:13.170
but it was in one of the regression guides
152
00:07:13.170 --> 00:07:14.998
where I showed you how to transform
153
00:07:14.998 --> 00:07:17.630
quadratics into a linear model.
154
00:07:17.630 --> 00:07:19.250
Basically, all we had to do
155
00:07:19.250 --> 00:07:20.560
was take the natural log
156
00:07:20.560 --> 00:07:23.510
of both the independent and dependent variables.
157
00:07:23.510 --> 00:07:24.513
And that was it.
158
00:07:25.940 --> 00:07:28.410
Then if we wanted to transform the linear function
159
00:07:28.410 --> 00:07:30.070
back to the quadratic,
160
00:07:30.070 --> 00:07:32.470
all we had to do was exponentiate with Euler's number, e,
161
00:07:32.470 --> 00:07:34.963
since the exponential function is the inverse of the natural log.
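As a quick sketch of that idea, assuming a power relationship like y = a x^2, taking the natural log of both variables gives a linear form, and exponentiating undoes it:

\ln y = \ln a + 2\ln x, \qquad y = e^{\ln a + 2\ln x} = a x^2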
162
00:07:37.230 --> 00:07:39.240
Using that same principle,
163
00:07:39.240 --> 00:07:41.780
the logarithm in the logit function
164
00:07:41.780 --> 00:07:44.550
transforms the original prediction function
165
00:07:44.550 --> 00:07:45.900
into a function that's able
166
00:07:45.900 --> 00:07:47.963
to relate to the distribution function.
167
00:07:49.500 --> 00:07:52.220
Now, if we assume there's a linear relationship
168
00:07:52.220 --> 00:07:55.430
between the logit function and the linear predictor,
169
00:07:55.430 --> 00:07:57.793
it allows us to postulate this equation.
170
00:07:59.260 --> 00:08:02.560
And if we apply the inverse function to both sides,
171
00:08:02.560 --> 00:08:05.000
it results in this equation,
172
00:08:05.000 --> 00:08:07.990
which can then be used to link predicted probabilities
173
00:08:07.990 --> 00:08:10.713
to a linear combination of input variables.
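For reference, the postulated equation sets the log-odds equal to the linear predictor:

\ln\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n

and applying the inverse, the exponential function, to both sides and solving for p gives back the sigmoid:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}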
174
00:08:12.540 --> 00:08:13.700
Before we finish up,
175
00:08:13.700 --> 00:08:15.300
there's really just one last thing
176
00:08:15.300 --> 00:08:16.790
that we need to talk about,
177
00:08:16.790 --> 00:08:18.543
and that's the decision boundary.
178
00:08:19.510 --> 00:08:20.790
I kind of alluded to this
179
00:08:20.790 --> 00:08:22.840
at the very beginning of the guide,
180
00:08:22.840 --> 00:08:25.640
but the decision boundary in logistic regression
181
00:08:25.640 --> 00:08:28.093
is set, by default, at the 0.5 marker.
182
00:08:29.040 --> 00:08:31.160
And that's where you can kind of think of it
183
00:08:31.160 --> 00:08:33.370
as an imaginary divide
184
00:08:33.370 --> 00:08:37.033
where the class decision is really just a 50-50 guess.
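In equation form, the default decision rule is:

\hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{otherwise} \end{cases}

and since \sigma(0) = 0.5, that boundary sits exactly where the linear predictor \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n equals zero.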
185
00:08:39.080 --> 00:08:40.960
And I think that about covers
186
00:08:40.960 --> 00:08:43.490
everything that we needed to go through.
187
00:08:43.490 --> 00:08:46.260
And hopefully most of it made sense
188
00:08:46.260 --> 00:08:47.890
because in the next guide,
189
00:08:47.890 --> 00:08:50.830
we'll be moving forward by using scikit-learn
190
00:08:50.830 --> 00:08:53.480
to build a logistic regression model.
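If you'd like a rough preview, here's a minimal sketch of that workflow on a made-up dataset; the data and variable names are just placeholders, not what we'll use in the next guide.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Made-up binary classification data: two discrete outcomes, a few features.
    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # scikit-learn fits the beta coefficients of the linear predictor for us.
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # predict_proba returns the sigmoid output (a probability for each class);
    # predict applies the default 0.5 decision boundary to pick a class.
    print(model.predict_proba(X_test[:5]))
    print(model.predict(X_test[:5]))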
191
00:08:53.480 --> 00:08:56.590
But for now, I will wrap things up
192
00:08:56.590 --> 00:08:58.440
and I will see you in the next guide.