WEBVTT
1
00:00:03.010 --> 00:00:04.250
In this guide,
2
00:00:04.250 --> 00:00:05.610
we're gonna use what we've learned
3
00:00:05.610 --> 00:00:07.167
about the Naive Bayes classifier
4
00:00:07.167 --> 00:00:10.450
and extend that into how it can be used
5
00:00:10.450 --> 00:00:12.163
to create a spam filter.
6
00:00:13.520 --> 00:00:15.010
As a quick refresher,
7
00:00:15.010 --> 00:00:18.010
Naive Bayes is based on Bayes' rule,
8
00:00:18.010 --> 00:00:20.080
which gives us a really easy way
9
00:00:20.080 --> 00:00:22.530
to determine the posterior probability
10
00:00:22.530 --> 00:00:23.963
of an event occurring.
11
00:00:25.700 --> 00:00:28.930
In the last example, it was the probability
12
00:00:28.930 --> 00:00:31.230
of a family having twin girls,
13
00:00:31.230 --> 00:00:33.913
knowing that the first was already a girl.
14
00:00:35.090 --> 00:00:39.010
And when we expand that to more than one feature,
15
00:00:39.010 --> 00:00:41.720
the posterior probability for each feature
16
00:00:41.720 --> 00:00:44.420
is calculated and then the probabilities
17
00:00:44.420 --> 00:00:45.840
are multiplied together
18
00:00:45.840 --> 00:00:47.663
to get the final probability.
19
00:00:48.700 --> 00:00:50.510
Then the final probability
20
00:00:50.510 --> 00:00:53.310
is calculated for the other class as well
21
00:00:53.310 --> 00:00:56.350
and whichever has the greater probability
22
00:00:56.350 --> 00:00:59.393
determines what class the object belongs to.
23
00:01:01.750 --> 00:01:03.926
In the example that we'll be working through next,
24
00:01:03.926 --> 00:01:06.270
the object is an email
25
00:01:06.270 --> 00:01:09.360
and the features are all of the unique words
26
00:01:09.360 --> 00:01:10.603
in that email.
27
00:01:11.600 --> 00:01:14.920
And what that means is that the posterior probability
28
00:01:14.920 --> 00:01:19.710
has to be calculated for every unique word in the email.
29
00:01:19.710 --> 00:01:22.833
And that determines if it's spam or not spam.
30
00:01:24.940 --> 00:01:27.410
And it might sound like a tall task
31
00:01:27.410 --> 00:01:30.010
but it's actually pretty simple
32
00:01:30.010 --> 00:01:31.923
when we get to use Sklearn.
33
00:01:34.490 --> 00:01:35.950
Just like before,
34
00:01:35.950 --> 00:01:39.173
we're gonna start off by importing everything that we need.
35
00:01:40.030 --> 00:01:42.540
And the only core library we're gonna need
36
00:01:42.540 --> 00:01:45.480
in this example is pandas
37
00:01:45.480 --> 00:01:48.821
but we'll also be using the train_test_split function,
38
00:01:48.821 --> 00:01:51.550
the multinomial Naive Bayes class,
39
00:01:51.550 --> 00:01:54.810
which we can use for text classification.
40
00:01:54.810 --> 00:01:57.330
We also have a CountVectorizer
41
00:01:57.330 --> 00:02:00.390
and a bunch of different metric functions
42
00:02:00.390 --> 00:02:02.670
and I'll explain what all of those do
43
00:02:02.670 --> 00:02:04.543
as we work through the example.
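The import block isn't shown in the transcript; here is a sketch of what it likely looks like, using the scikit-learn names mentioned above (pandas, train_test_split, MultinomialNB, CountVectorizer, and the metric functions used later in the guide):

```python
# Sketch of the imports described above; the metric list matches the
# functions used later in the guide.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score
```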
44
00:02:05.780 --> 00:02:07.640
The DataFrame itself is made up
45
00:02:07.640 --> 00:02:11.763
of just over 5,500 rows and two columns.
46
00:02:12.890 --> 00:02:16.370
The first column contains the class label,
47
00:02:16.370 --> 00:02:19.440
which is either going to be spam or ham,
48
00:02:19.440 --> 00:02:22.690
which is another way of saying not spam.
49
00:02:22.690 --> 00:02:26.783
Then the second column contains all of the actual emails.
50
00:02:28.900 --> 00:02:32.220
Our input x will contain every row
51
00:02:32.220 --> 00:02:33.720
of the last column,
52
00:02:33.720 --> 00:02:35.653
which contains all of the emails.
53
00:02:36.850 --> 00:02:39.410
And the output, which is y,
54
00:02:39.410 --> 00:02:41.440
contains all of the class labels
55
00:02:41.440 --> 00:02:44.523
that indicate if an email is spam or ham.
56
00:02:46.610 --> 00:02:48.500
Unlike some of our other examples,
57
00:02:48.500 --> 00:02:50.680
we don't need to reshape the data
58
00:02:50.680 --> 00:02:52.333
to create a column vector.
59
00:02:53.310 --> 00:02:56.230
Instead, I just added the values attribute
60
00:02:56.230 --> 00:02:57.970
to remove the column labels
61
00:02:57.970 --> 00:02:59.943
and only return the values.
62
00:03:00.950 --> 00:03:02.781
Notice when I add the values attribute
63
00:03:02.781 --> 00:03:06.140
and run it again, the object type changes
64
00:03:06.140 --> 00:03:09.623
from a pandas series to a NumPy array.
65
00:03:15.310 --> 00:03:17.530
Next, let's move into the console
66
00:03:17.530 --> 00:03:20.963
and do a quick overview of our class labels.
67
00:03:22.360 --> 00:03:26.360
First, let's pass in df followed by square brackets
68
00:03:26.360 --> 00:03:27.383
and the column name.
69
00:03:29.050 --> 00:03:31.163
Then .unique.
70
00:03:33.700 --> 00:03:35.640
And what we get back is an array
71
00:03:35.640 --> 00:03:38.500
with all of the unique values in the column
72
00:03:38.500 --> 00:03:40.590
and there aren't any nan values,
73
00:03:40.590 --> 00:03:42.323
so that's a good thing.
74
00:03:44.470 --> 00:03:46.963
Now, let's pass in the same thing again.
75
00:03:49.860 --> 00:03:52.423
But this time, we'll finish up with value_counts.
76
00:03:58.300 --> 00:03:59.910
And this time, we get a Series
77
00:03:59.910 --> 00:04:03.093
with the total number of ham and spam entries.
78
00:04:04.660 --> 00:04:06.540
Without doing the exact math,
79
00:04:06.540 --> 00:04:08.430
it looks like there's about six
80
00:04:08.430 --> 00:04:10.870
to seven times as many ham values
81
00:04:10.870 --> 00:04:12.673
as there are spam values.
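Those two console checks can be sketched like this (toy data; the 'Category' column name is an assumption):

```python
import pandas as pd

# Toy stand-in; the real dataset has roughly 6-7x more ham than spam.
df = pd.DataFrame({'Category': ['ham', 'ham', 'spam', 'ham']})

print(df['Category'].unique())        # unique labels, no NaN values
print(df['Category'].value_counts())  # how many of each label
```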
82
00:04:13.800 --> 00:04:15.460
That'll be something that we can talk about
83
00:04:15.460 --> 00:04:16.593
later on as well.
84
00:04:18.020 --> 00:04:19.470
But for now, let's move on
85
00:04:19.470 --> 00:04:21.573
with building the classification model.
86
00:04:23.940 --> 00:04:25.830
In the first Naive Bayes example
87
00:04:25.830 --> 00:04:27.070
that we went through,
88
00:04:27.070 --> 00:04:30.730
we started out by creating the classifier object.
89
00:04:30.730 --> 00:04:33.330
But this time, it's gonna be a little bit different.
90
00:04:34.270 --> 00:04:37.100
And that's because before we can generate a model,
91
00:04:37.100 --> 00:04:39.100
we need to go through all of the emails
92
00:04:39.100 --> 00:04:41.000
to figure out the probability
93
00:04:41.000 --> 00:04:44.973
of every word belonging to spam or ham emails.
94
00:04:46.770 --> 00:04:49.040
To do that, we're going to leverage
95
00:04:49.040 --> 00:04:51.310
the CountVectorizer class.
96
00:04:51.310 --> 00:04:54.690
So we'll start by creating a vectorizer variable
97
00:04:55.610 --> 00:04:58.763
and assigning that to the CountVectorizer class.
98
00:05:03.460 --> 00:05:07.130
Then we'll create a second variable named counts,
99
00:05:07.130 --> 00:05:10.140
which will be assigned to the fit_transform method
100
00:05:10.140 --> 00:05:11.973
of the CountVectorizer class.
101
00:05:19.960 --> 00:05:23.550
This is unlike anything we've used so far in the course
102
00:05:23.550 --> 00:05:26.410
but what the fit_transform method is doing
103
00:05:26.410 --> 00:05:28.173
is tokenizing the data.
104
00:05:29.490 --> 00:05:31.100
And what tokenizing means
105
00:05:31.100 --> 00:05:33.260
is that all of the meaningful data,
106
00:05:33.260 --> 00:05:35.140
which are the words in an email,
107
00:05:35.140 --> 00:05:37.410
are being replaced by a number,
108
00:05:37.410 --> 00:05:39.683
which is commonly referred to as a token.
109
00:05:41.410 --> 00:05:43.370
So when we use the CountVectorizer
110
00:05:43.370 --> 00:05:45.630
to fit and transform the data,
111
00:05:45.630 --> 00:05:47.550
it goes through each email,
112
00:05:47.550 --> 00:05:51.130
counting how many times each word was used.
113
00:05:51.130 --> 00:05:53.520
Then it iterates back through the dataset
114
00:05:53.520 --> 00:05:55.170
and replaces each email
115
00:05:55.170 --> 00:05:57.513
with a vector of those word counts.
116
00:05:59.640 --> 00:06:01.860
And once we have that word count,
117
00:06:01.860 --> 00:06:04.733
that's what we'll be using to calculate probabilities.
118
00:06:06.390 --> 00:06:07.910
So now we can go ahead
119
00:06:07.910 --> 00:06:10.440
and create our classifier object
120
00:06:10.440 --> 00:06:14.233
and assign that to the multinomial Naive Bayes class.
121
00:06:18.700 --> 00:06:21.760
The next step is to create the classification model
122
00:06:21.760 --> 00:06:23.093
by using the fit method.
123
00:06:27.370 --> 00:06:29.000
And this is where we wanna pass
124
00:06:29.000 --> 00:06:31.220
in the transformed training set
125
00:06:31.220 --> 00:06:32.623
that we just created.
126
00:06:33.530 --> 00:06:36.443
And we'll follow that up with the output training set.
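A sketch of creating the classifier and fitting it; the toy emails and labels stand in for the real transformed training set and output training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the training split.
x_train = ['free prize now', 'see you at lunch', 'claim your free prize']
y_train = ['spam', 'ham', 'spam']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(x_train)

classifier = MultinomialNB()
classifier.fit(counts, y_train)  # fit the model on token counts and labels

print(classifier.classes_)  # the labels the model learned
```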
127
00:06:39.720 --> 00:06:41.880
Okay, now we are done building
128
00:06:41.880 --> 00:06:43.783
the actual classification model.
129
00:06:44.660 --> 00:06:47.163
But we still need a way of evaluating it.
130
00:06:48.120 --> 00:06:50.040
So the first thing that we wanna do
131
00:06:50.040 --> 00:06:52.280
is tokenize the test set
132
00:06:52.280 --> 00:06:55.010
and we can do that using the transform method
133
00:06:55.010 --> 00:06:56.633
from the vectorizer class.
134
00:07:09.410 --> 00:07:12.330
Then we'll finish by creating a prediction object
135
00:07:12.330 --> 00:07:14.580
that will make class predictions based
136
00:07:14.580 --> 00:07:16.313
on the transformed test set.
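Those two evaluation-prep steps, sketched with the same toy pipeline; note the test set uses transform, not fit_transform, so it shares the vocabulary learned from the training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the train/test split.
x_train = ['free prize now', 'see you at lunch', 'claim your free prize']
y_train = ['spam', 'ham', 'spam']
x_test = ['free prize waiting', 'lunch tomorrow']

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(x_train), y_train)

# transform (not fit_transform) keeps the already-fitted vocabulary.
test_counts = vectorizer.transform(x_test)
predictions = classifier.predict(test_counts)
print(predictions)
```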
137
00:07:32.420 --> 00:07:35.780
Just like with regression, Sklearn gives us a bunch
138
00:07:35.780 --> 00:07:39.053
of different tools we can use for model evaluation.
139
00:07:41.500 --> 00:07:43.540
The first option that we'll go through
140
00:07:43.540 --> 00:07:46.690
uses the accuracy_score function
141
00:07:46.690 --> 00:07:48.960
and this will return the percentage
142
00:07:48.960 --> 00:07:51.970
of correct predictions made by the model
143
00:07:51.970 --> 00:07:54.203
from the test set that we passed in.
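A sketch of the accuracy check, with toy stand-ins for y_test and the prediction object:

```python
from sklearn.metrics import accuracy_score

# Toy stand-ins for y_test and the model's predictions.
y_test = ['ham', 'spam', 'ham', 'ham', 'spam']
predictions = ['ham', 'spam', 'ham', 'spam', 'spam']

# Fraction of predictions that match the true labels (4 of 5 here).
print(accuracy_score(y_test, predictions))
```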
144
00:08:02.020 --> 00:08:03.650
The next evaluation method
145
00:08:03.650 --> 00:08:05.211
that I'd like us to talk about
146
00:08:05.211 --> 00:08:07.683
is called the confusion_matrix.
147
00:08:21.840 --> 00:08:25.040
So after we pass in both of the test sets,
148
00:08:25.040 --> 00:08:26.853
let's run it again.
149
00:08:34.340 --> 00:08:37.543
And I'll walk you through what this two-by-two matrix means.
150
00:08:39.220 --> 00:08:40.980
Depending how your data was split
151
00:08:40.980 --> 00:08:43.260
during the train/test split step,
152
00:08:43.260 --> 00:08:46.900
your confusion matrix will probably have different values.
153
00:08:46.900 --> 00:08:49.720
But it's all gonna mean the same thing.
154
00:08:49.720 --> 00:08:52.420
So what the confusion matrix does
155
00:08:52.420 --> 00:08:56.373
is split the prediction results into four main categories.
156
00:08:58.110 --> 00:08:59.980
The element in the first row
157
00:08:59.980 --> 00:09:02.520
of the first column indicates the number
158
00:09:02.520 --> 00:09:04.730
of ham predictions the model made
159
00:09:04.730 --> 00:09:07.303
that were correctly classified as ham.
160
00:09:08.670 --> 00:09:11.930
The value in the first column, second row indicates
161
00:09:11.930 --> 00:09:14.090
how many ham predictions were made
162
00:09:14.090 --> 00:09:15.723
that were actually spam.
163
00:09:18.790 --> 00:09:20.780
Then over in the second column,
164
00:09:20.780 --> 00:09:23.690
the first row indicates the number of spam predictions
165
00:09:23.690 --> 00:09:25.590
that were actually ham
166
00:09:25.590 --> 00:09:27.400
and the second row is the number
167
00:09:27.400 --> 00:09:29.943
of spam predictions that were correctly predicted.
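The same toy labels run through confusion_matrix; with string labels, scikit-learn orders rows and columns alphabetically (ham first, then spam), with rows being the actual class and columns the predicted class:

```python
from sklearn.metrics import confusion_matrix

# Toy stand-ins for y_test and the model's predictions.
y_test = ['ham', 'spam', 'ham', 'ham', 'spam']
predictions = ['ham', 'spam', 'ham', 'spam', 'spam']

# [[actual ham predicted ham,  actual ham predicted spam],
#  [actual spam predicted ham, actual spam predicted spam]]
cm = confusion_matrix(y_test, predictions)
print(cm)
```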
168
00:09:32.050 --> 00:09:34.608
A confusion matrix can be a good tool to use
169
00:09:34.608 --> 00:09:36.652
when one of the incorrect predictions
170
00:09:36.652 --> 00:09:39.303
is way more important than the other.
171
00:09:40.870 --> 00:09:44.680
Say, for example, instead of this being a spam filter,
172
00:09:44.680 --> 00:09:47.173
it was classifying possible tumors.
173
00:09:48.260 --> 00:09:50.050
Incorrectly classifying someone
174
00:09:50.050 --> 00:09:53.080
with a tumor into the non-tumor group
175
00:09:53.080 --> 00:09:55.448
has much more severe consequences
176
00:09:55.448 --> 00:09:57.038
than if a non-tumor patient
177
00:09:57.038 --> 00:10:00.403
was incorrectly classified as having a tumor.
178
00:10:01.450 --> 00:10:02.810
Because at that point,
179
00:10:02.810 --> 00:10:05.210
they can always go in for further testing
180
00:10:05.210 --> 00:10:07.070
where it will eventually be determined
181
00:10:07.070 --> 00:10:09.460
it was a misdiagnosis.
182
00:10:09.460 --> 00:10:13.200
But obviously, if someone is told they don't have a tumor,
183
00:10:13.200 --> 00:10:14.850
and they actually do,
184
00:10:14.850 --> 00:10:18.393
the outcome has the potential to be much more costly.
185
00:10:20.410 --> 00:10:22.490
So moving on to the next method,
186
00:10:22.490 --> 00:10:24.720
we have the f1 score
187
00:10:24.720 --> 00:10:27.270
and this combines precision and recall
188
00:08:27.270 --> 00:08:29.253
into one score for a binary classification model.
189
00:10:38.160 --> 00:10:40.823
And let's see what this error message says.
190
00:10:42.340 --> 00:10:44.260
Oh, so for this, we're gonna need
191
00:10:44.260 --> 00:10:46.220
to declare what class label
192
00:10:46.220 --> 00:10:48.410
is the positive result
193
00:10:48.410 --> 00:10:50.140
and since we're checking to see
194
00:10:50.140 --> 00:10:52.040
if an email is spam,
195
00:10:52.040 --> 00:10:54.090
that's what will yield a positive result.
196
00:10:56.260 --> 00:10:57.970
So let's run this again
197
00:10:57.970 --> 00:11:00.223
and now we have our F1 score.
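A sketch of the F1 check; pos_label='spam' is the argument the error message was asking for:

```python
from sklearn.metrics import f1_score

# Toy stand-ins for y_test and the model's predictions.
y_test = ['ham', 'spam', 'ham', 'ham', 'spam']
predictions = ['ham', 'spam', 'ham', 'spam', 'spam']

# pos_label declares which string label is the positive class.
print(f1_score(y_test, predictions, pos_label='spam'))
```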
198
00:11:02.360 --> 00:11:04.060
The last method that we'll be covering
199
00:11:04.060 --> 00:11:05.920
is the precision_score,
200
00:11:05.920 --> 00:11:09.453
which measures what fraction of the model's positive predictions were actually correct.
201
00:11:17.070 --> 00:11:19.133
And it looks like I forgot the pos_label argument again.
202
00:11:25.150 --> 00:11:27.003
Now let's go ahead and run it.
203
00:11:32.630 --> 00:11:34.170
And it looks like we have a score
204
00:11:34.170 --> 00:11:38.063
of just under 0.99, which is a really good score.
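And the precision check, sketched with the same toy labels and the same pos_label argument:

```python
from sklearn.metrics import precision_score

# Toy stand-ins for y_test and the model's predictions.
y_test = ['ham', 'spam', 'ham', 'ham', 'spam']
predictions = ['ham', 'spam', 'ham', 'spam', 'spam']

# Of everything predicted spam, what fraction was actually spam?
print(precision_score(y_test, predictions, pos_label='spam'))
```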
205
00:11:39.500 --> 00:11:41.010
To finally wrap this guide up,
206
00:11:41.010 --> 00:11:44.150
let's go ahead and make a couple new emails
207
00:11:44.150 --> 00:11:46.163
to see how the model responds.
208
00:11:50.910 --> 00:11:54.080
The first will be for Free Viagra!!!
209
00:12:00.750 --> 00:12:03.203
And we'll make the second message from a friend.
210
00:12:07.600 --> 00:12:10.270
Before we can use the model to make a prediction,
211
00:12:10.270 --> 00:12:12.520
we're gonna have to tokenize both new_emails.
212
00:12:26.490 --> 00:12:28.903
Then we'll apply that to the predict function.
213
00:12:44.170 --> 00:12:45.470
And now we can go ahead
214
00:12:45.470 --> 00:12:47.943
and check the new email class predictions.
215
00:12:54.660 --> 00:12:57.323
And it looks like the model did a really good job.
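The wrap-up can be sketched end to end; the toy training data and the exact wording of the second message are assumptions, but the two new emails follow the guide:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the real dataset.
x_train = ['Free Viagra!!!', 'free prize waiting', 'lunch at noon?', 'see you tomorrow']
y_train = ['spam', 'spam', 'ham', 'ham']

vectorizer = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectorizer.fit_transform(x_train), y_train)

# Two new emails to test the model, as in the guide; the second
# message's wording is an assumption.
new_emails = ['Free Viagra!!!', 'Hey, are we still on for lunch?']

# Tokenize with the already-fitted vectorizer, then predict.
new_counts = vectorizer.transform(new_emails)
print(classifier.predict(new_counts))
```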
216
00:12:59.670 --> 00:13:02.580
Well, that finally brings us to the end of this guide.
217
00:13:02.580 --> 00:13:04.333
So I will see you in the next one.