How to Implement Naive Bayes in Python

WEBVTT

1
00:00:03.210 --> 00:00:04.530
In this guide,

2
00:00:04.530 --> 00:00:07.700
we're gonna be covering our first classification technique,

3
00:00:07.700 --> 00:00:09.203
which is Naive Bayes.

4
00:00:10.220 --> 00:00:13.240
When I was first learning about the Bayes classifier,

5
00:00:13.240 --> 00:00:15.880
I was slightly shocked or maybe confused

6
00:00:15.880 --> 00:00:18.150
when I found out that the Bayes classifiers

7
00:00:18.150 --> 00:00:20.123
aren't necessarily Bayesian.

8
00:00:20.970 --> 00:00:23.230
In fact, Bayesian methods

9
00:00:23.230 --> 00:00:27.360
weren't even named as a tribute to Thomas Bayes or his theorem.

10
00:00:27.360 --> 00:00:30.780
It was actually a statistician named Ronald Fisher,

11
00:00:30.780 --> 00:00:34.090
who mockingly referred to the inverse probability method

12
00:00:34.090 --> 00:00:36.010
as Bayesian.

13
00:00:36.010 --> 00:00:38.993
And for whatever reason, the name stuck.

14
00:00:41.020 --> 00:00:43.390
Now Naive Bayes is an algorithm

15
00:00:43.390 --> 00:00:46.190
that's based on modeling the joint distribution

16
00:00:46.190 --> 00:00:50.000
and belongs to a group of probabilistic classifiers

17
00:00:50.000 --> 00:00:53.810
that use Bayes' theorem along with the Naive assumption

18
00:00:53.810 --> 00:00:57.410
that every feature is independent of the others, given the class.
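That independence assumption is what earns the method its "naive" label, and it's worth seeing what it buys us. As a reference formula (this isn't shown in the captions, but it's the standard statement of the model), the assumption lets the class probability factor into one term per feature:

\[
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
\]

and the classifier simply predicts whichever class y makes the right-hand side largest.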

19
00:00:57.410 --> 00:01:00.900
So in order for us to understand how Naive Bayes works,

20
00:01:00.900 --> 00:01:04.003
it's pretty important to understand Bayes's theorem.

21
00:01:05.870 --> 00:01:08.390
If you're not familiar with Bayes's theorem,

22
00:01:08.390 --> 00:01:10.270
it's essentially an application

23
00:01:10.270 --> 00:01:12.940
of the chain rule of probability.

24
00:01:12.940 --> 00:01:16.550
And it's just set up to give us a really easy way

25
00:01:16.550 --> 00:01:19.143
of determining posterior probabilities.

26
00:01:20.260 --> 00:01:23.110
Or in other words, it's a simplified way

27
00:01:23.110 --> 00:01:26.740
to estimate the likelihood of an unknown event occurring

28
00:01:26.740 --> 00:01:30.163
based on the probabilities of other known events.

29
00:01:31.310 --> 00:01:33.700
The concept of posterior probability

30
00:01:33.700 --> 00:01:35.810
can be a little confusing.

31
00:01:35.810 --> 00:01:39.020
But it really clicked for me when I started looking at it,

32
00:01:39.020 --> 00:01:40.663
like it's a game of Jeopardy.

33
00:01:41.830 --> 00:01:44.030
For us to use Bayes's theorem,

34
00:01:44.030 --> 00:01:46.380
we need to know all of the probabilities

35
00:01:46.380 --> 00:01:48.600
that make up the answer.

36
00:01:48.600 --> 00:01:50.860
Then that will allow us to work backwards,

37
00:01:50.860 --> 00:01:52.513
to figure out the question.

38
00:01:53.560 --> 00:01:56.010
And mathematically, Bayes's theorem

39
00:01:56.010 --> 00:01:58.503
is represented by the following equation.
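The equation itself only appears on screen in the video, so here it is in its standard form, with A standing for the event we want the probability of and B for the event we've already observed:

\[
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\]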

40
00:02:01.940 --> 00:02:04.310
To use a really simple example,

41
00:02:04.310 --> 00:02:07.120
let's say a family is having twins,

42
00:02:07.120 --> 00:02:09.233
and we learn that at least one of the two is a girl.

43
00:02:10.500 --> 00:02:13.350
The question is, what is the probability

44
00:02:13.350 --> 00:02:15.540
of the couple having twin girls

45
00:02:15.540 --> 00:02:17.573
given that at least one of them is a girl?

46
00:02:19.730 --> 00:02:21.880
Using Bayes' theorem, we'll say that

47
00:02:21.880 --> 00:02:24.590
the probability of having two girls

48
00:02:24.590 --> 00:02:27.310
given that at least one girl was born

49
00:02:27.310 --> 00:02:31.470
is equal to the probability of having at least two girls

50
00:02:31.470 --> 00:02:34.530
multiplied by the probability of having at least one girl,

51
00:02:34.530 --> 00:02:36.093
given that both are girls.

52
00:02:37.420 --> 00:02:40.470
And all of that is divided by the probability

53
00:02:40.470 --> 00:02:42.563
of at least one girl being born.

54
00:02:44.180 --> 00:02:46.240
Right away we know that the probability

55
00:02:46.240 --> 00:02:50.300
of the family having at least one girl given that two girls were born,

56
00:02:50.300 --> 00:02:53.283
has to be 100%, or one.

57
00:02:54.580 --> 00:02:58.680
We also know that there's only four baby combinations,

58
00:02:58.680 --> 00:03:03.453
girl girl, girl boy, boy girl, and boy boy.

59
00:03:04.870 --> 00:03:06.540
So looking at that,

60
00:03:06.540 --> 00:03:10.120
we know that the probability of having at least two girls

61
00:03:10.120 --> 00:03:12.380
is one out of four.

62
00:03:12.380 --> 00:03:15.420
And the probability of having at least one girl

63
00:03:15.420 --> 00:03:16.923
is three out of four.

64
00:03:18.080 --> 00:03:21.830
So the overall probability of the family having two girls,

65
00:03:21.830 --> 00:03:24.140
knowing that at least one is a girl,

66
00:03:24.140 --> 00:03:27.933
is one out of three, or right around 33%.
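Plugging the numbers from the example into the theorem (with G standing for girl), the calculation works out like this:

\[
P(GG \mid \text{at least one } G)
= \frac{P(\text{at least one } G \mid GG)\, P(GG)}{P(\text{at least one } G)}
= \frac{1 \times \tfrac{1}{4}}{\tfrac{3}{4}}
= \frac{1}{3} \approx 33\%
\]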

67
00:03:29.600 --> 00:03:32.730
Now for us as machine learning developers,

68
00:03:32.730 --> 00:03:35.053
the math isn't the most important part.

69
00:03:35.900 --> 00:03:37.900
To me, it's much more valuable

70
00:03:37.900 --> 00:03:41.050
to have an understanding of the core concept.

71
00:03:41.050 --> 00:03:43.870
Because having the ability to apply your knowledge

72
00:03:43.870 --> 00:03:46.313
will take you way further in the industry.

73
00:03:48.020 --> 00:03:51.070
Now, in terms of what Scikit-learn offers,

74
00:03:51.070 --> 00:03:53.573
there are five different Naive Bayes methods,

75
00:03:55.220 --> 00:03:56.773
Gaussian Naive Bayes,

76
00:03:58.380 --> 00:04:00.223
Multinomial Naive Bayes,

77
00:04:02.830 --> 00:04:04.453
Complement Naive Bayes,

78
00:04:06.808 --> 00:04:08.313
Bernoulli Naive Bayes,

79
00:04:10.980 --> 00:04:13.163
and Categorical Naive Bayes.
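All five of those classifiers live in scikit-learn's naive_bayes module, so a sketch of the imports (not necessarily the exact line from the instructor's notebook) looks like this:

```python
# The Naive Bayes variants that scikit-learn provides
from sklearn.naive_bayes import (
    GaussianNB,      # continuous features modeled with a normal distribution per class
    MultinomialNB,   # count features, e.g. word counts in text classification
    ComplementNB,    # variant of MultinomialNB that copes better with imbalanced classes
    BernoulliNB,     # binary/boolean features
    CategoricalNB,   # categorically encoded features
)
```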

80
00:04:14.940 --> 00:04:15.830
In a couple of guides,

81
00:04:15.830 --> 00:04:18.570
we'll be doing a more in-depth example,

82
00:04:18.570 --> 00:04:19.430
but for now,

83
00:04:19.430 --> 00:04:21.973
let's go through some of the basic functionality.

84
00:04:23.170 --> 00:04:25.050
I've done the importing already.

85
00:04:25.050 --> 00:04:27.590
And the newest tool that we'll be applying

86
00:04:27.590 --> 00:04:31.453
is the Gaussian Naive Bayes, or GaussianNB, class.

87
00:04:33.340 --> 00:04:38.340
And what we have for the data is a 12-by-5 grade matrix,

88
00:04:38.490 --> 00:04:42.160
where each row represents a student,

89
00:04:42.160 --> 00:04:45.720
and column one represents exam one grades,

90
00:04:45.720 --> 00:04:47.700
column two exam two,

91
00:04:47.700 --> 00:04:49.870
column three is exam three,

92
00:04:49.870 --> 00:04:52.773
and column four, our final exam grades.

93
00:04:53.640 --> 00:04:56.960
Then in the final column, instead of a course grade,

94
00:04:56.960 --> 00:04:59.273
there's either a zero or one.

95
00:05:00.350 --> 00:05:04.410
So any student who had a course grade below 70%

96
00:05:04.410 --> 00:05:06.250
was given a zero,

97
00:05:06.250 --> 00:05:10.093
and any student above a 70 was assigned a one.

98
00:05:11.980 --> 00:05:13.500
In terms of our variables,

99
00:05:13.500 --> 00:05:17.120
the feature variable is made up of all of the exam scores

100
00:05:18.320 --> 00:05:20.120
and the target variable contains

101
00:05:20.120 --> 00:05:22.503
the final column of the grade matrix.

102
00:05:23.420 --> 00:05:27.530
And even though the sample we're working with is pretty tiny,

103
00:05:27.530 --> 00:05:29.470
as a formality, I decided to add

104
00:05:29.470 --> 00:05:31.683
the train and test split anyway.
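The exact numbers in the instructor's notebook aren't shown in the captions, so the grade matrix below is a made-up stand-in, but the shape and the variable setup follow what's described above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical grade matrix: 12 students x (exam 1, exam 2, exam 3, final exam, 0/1 label)
grades = np.array([
    [88, 92, 79, 85, 1],
    [54, 61, 58, 49, 0],
    [91, 87, 94, 90, 1],
    [67, 72, 64, 66, 0],
    [78, 81, 85, 88, 1],
    [45, 52, 60, 55, 0],
    [83, 79, 90, 86, 1],
    [59, 65, 62, 58, 0],
    [95, 96, 91, 97, 1],
    [62, 58, 70, 64, 0],
    [74, 80, 77, 82, 1],
    [50, 47, 55, 52, 0],
])

X = grades[:, :4]   # feature variable: all four exam-score columns
y = grades[:, 4]    # target variable: the final column of the grade matrix

# Train/test split, kept in as a formality even though the sample is tiny
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```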

105
00:05:33.500 --> 00:05:35.720
Now you probably already noticed

106
00:05:35.720 --> 00:05:37.550
that the setup for Naive Bayes,

107
00:05:37.550 --> 00:05:40.023
is pretty similar to a regression model.

108
00:05:42.050 --> 00:05:45.270
Instead of a regressor, we have a classifier,

109
00:05:45.270 --> 00:05:48.340
but we still use the fit function and training data

110
00:05:48.340 --> 00:05:49.543
to generate a model.

111
00:05:50.600 --> 00:05:52.330
We can still use the score function

112
00:05:52.330 --> 00:05:54.163
to check the accuracy of the model.
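Continuing the sketch from above (the variable names are mine, not necessarily the instructor's), fitting and scoring look like this:

```python
# Fit a Gaussian Naive Bayes classifier on the training data
model = GaussianNB()
model.fit(X_train, y_train)

# score() reports mean accuracy on the held-out test data;
# with a sample this small, the number will bounce around between runs
print(model.score(X_test, y_test))
```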

113
00:05:55.960 --> 00:05:58.830
And again, because our sample is so small,

114
00:05:58.830 --> 00:06:00.930
your results will probably differ from mine,

115
00:06:00.930 --> 00:06:03.530
so keep that in mind as you're working through this.

116
00:06:04.640 --> 00:06:05.580
All right.

117
00:06:05.580 --> 00:06:08.550
The prediction function works the same as well,

118
00:06:08.550 --> 00:06:11.350
but this time it'll predict the class an input

119
00:06:11.350 --> 00:06:13.430
most likely belongs to,

120
00:06:13.430 --> 00:06:15.463
instead of some numerical value.

121
00:06:17.090 --> 00:06:20.370
And for me, the student is in the class of students

122
00:06:20.370 --> 00:06:23.563
whose course grade was below 70%.
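The student being classified here isn't listed in the captions, so the scores below are placeholders; the call itself is just predict() on a row of exam scores:

```python
# Predict which class a (hypothetical) student's exam scores belong to
new_student = [[65, 70, 62, 60]]      # exam 1, exam 2, exam 3, final exam
print(model.predict(new_student))     # prints an array holding a single 0 or 1
```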

123
00:06:24.950 --> 00:06:27.983
And finally, we can check the class probability.

124
00:06:31.860 --> 00:06:34.530
So there is a 78% chance

125
00:06:34.530 --> 00:06:37.010
the student belongs to the class of students

126
00:06:37.010 --> 00:06:40.320
who received a grade lower than 70,

127
00:06:40.320 --> 00:06:43.060
and a 22% chance they belong to the class

128
00:06:43.060 --> 00:06:45.283
where students scored above a 70.
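Those percentages come from the instructor's particular data and split; with the placeholder data above the numbers will differ, but the call is predict_proba(), which returns one probability per class:

```python
# Class membership probabilities for the same hypothetical student:
# column 0 is the below-70% class, column 1 is the 70%-and-above class
print(model.predict_proba(new_student))
```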

129
00:06:47.250 --> 00:06:50.260
And like I said, we'll be doing more of this later on,

130
00:06:50.260 --> 00:06:53.513
but those are really the basics of how Naive Bayes works.

131
00:06:55.500 --> 00:06:59.400
Overall, the Naive Bayes classifier can be a great tool

132
00:06:59.400 --> 00:07:02.503
that offers a simple solution for classification modeling.

133
00:07:03.940 --> 00:07:06.730
And a few of the reasons why people like to use it

134
00:07:06.730 --> 00:07:08.960
are that it doesn't need as much training data

135
00:07:08.960 --> 00:07:10.973
as other classification methods.

136
00:07:11.910 --> 00:07:15.550
It can also handle continuous and discrete data.

137
00:07:15.550 --> 00:07:19.340
It's highly scalable with a large number of features,

138
00:07:19.340 --> 00:07:22.763
and it's even fast enough for real-time predictions.

139
00:07:25.290 --> 00:07:27.330
And so with all that being said,

140
00:07:27.330 --> 00:07:28.720
I will wrap this guide up

141
00:07:28.720 --> 00:07:30.517
and I will see you in the next one.