In this guide, we'll be going through three of the most popular Python libraries used in data science: Numerical Python, Matplotlib, and Pandas. All three of these libraries come as pre-installed packages in Anaconda, which makes getting started really easy.
Before we get into it, I'd like to show you where the official documentation is for each library. When you open Anaconda, you'll start at the Home screen. Then all you have to do is click on the Learning tab. So if you ever have any syntax questions, or just want to explore the libraries a little bit more, these are great resources.
The first library that we're going to go through is Numerical Python, which is more commonly referred to as just NumPy. This is the core library for scientific computing in Python, and NumPy's primary data type is an array, which is very similar to a Python list. The reason NumPy is so popular is that it gives us a bunch of useful features for operations on arrays and matrices, along with a fairly large collection of mathematical functions. NumPy is also fantastic at processing large collections of data quickly and efficiently. So a lot of processes that would normally take a bunch of lines of code to write are built directly into the library, and NumPy lets us simply call them right in our own program. It's something you'll find incredibly helpful once we start implementing machine learning algorithms, or when you start building out complex APIs that deal with large collections of data.
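Just to make that concrete, here's a quick sketch (not part of the guide's project code) comparing a plain list operation to the equivalent built-in array operation:

import numpy as np

nums = [1, 2, 3, 4]
doubled_list = [n * 2 for n in nums]  # plain Python: write the loop yourself
doubled_array = np.array(nums) * 2   # NumPy: element-wise math is built in

print(doubled_list)   # [2, 4, 6, 8]
print(doubled_array)  # [2 4 6 8]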
To start, let's go ahead and launch Spyder. Then, just like every library that we'll be using, NumPy needs to be imported before we can actually use it. So we'll do that by assigning it to the alias np. The first thing that I'd like to show you is how closely related an array is to a list. So let's actually make a regular Python list containing the numbers zero to 15. Then we'll create a second variable named num_array and use NumPy to convert the original list into an array by using the array function. To see the result, we can either run the code first and then move into the IPython console to pass in the variables, or we can use the print function and run it with the rest of the code. Either way you choose to do it, we get the same basic result.
import numpy as np

num_list = list(range(16))
num_array = np.array(num_list)

print(num_list)
print(num_array)
The next thing we're going to do is create the same array, but this time we'll be using NumPy's linspace function. I'm also going to use the NumPy documentation, so you can actually get some practice using it. Let's go back to the Anaconda Navigator and select the Learning tab. Then go into the NumPy section and click on the documentation link up at the top. From here, we can either manually go through the documentation or search by keyword.
The documentation can be a little confusing. So when I'm using something for the first time, I'll read through the description to make sure it's the right tool for the job. Now if we scroll down just a little bit, there's usually a "see also" section listing functions that typically have similar functionality. So I'll go through those just to make sure there's not a better fit for what I need. In this situation, we could easily use the arange function to perform the same task.
The last thing I look at are the required and optional parameters, along with the default input for any optional parameter that I'll be using. If the default is already what I need, I just don't include that parameter in my code.
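As an aside, here's what the arange alternative mentioned above would look like, using the np alias from before. Unlike linspace, arange with integer inputs produces integers and works over a half-open interval, so the stop value has to be 16 to include 15:

arange_array = np.arange(0, 16)
print(arange_array)  # [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]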
With all that being said, let's go ahead and make a third variable and name it linspace_array. Then we'll call linspace through the np alias. The only required parameters are the start and stop values of a closed interval, so we just need to pass in zero and 15. Then the only optional parameter that we actually need to include is num, whose default is 50. So if we were to run it with the default, our zero to 15 interval would be chopped up into 50 evenly spaced float values. That's obviously not what we're looking for, so we can go ahead and change num from 50 to 16. When we run it again, we end up with 16 evenly spaced elements.
linspace_array = np.linspace(0, 15, 16)
Now, using the linspace array, let's break it down into subgroups. This is actually one of the really handy features in NumPy, because it gives us a way of nesting arrays inside a master array. So let's say that we want four nested arrays, each containing four elements. We can do that by creating a reshape variable. Then we just call the reshape function from NumPy, passing in the array that we want to reshape, along with a tuple containing the number of nested arrays and the number of elements in each of those arrays. By nesting the arrays into the master array, we've essentially created a two-dimensional matrix with four rows and four columns. This is a really important skill to have, because when we start implementing machine learning algorithms, some of them require us to perform steps just like this. We'll have to create an entire collection of data, and then slice it up into usable components.
reshape_array = np.reshape(linspace_array, (4, 4))
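As a quick sanity check, printing reshape_array should show the four nested rows. Remember that linspace produced floats, so the values come back with decimal points:

print(reshape_array)
# [[ 0.  1.  2.  3.]
#  [ 4.  5.  6.  7.]
#  [ 8.  9. 10. 11.]
#  [12. 13. 14. 15.]]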
I think we've done enough with NumPy for now, so let's move on to Matplotlib, which is a plotting library that helps us create visual representations of data. Matplotlib is a pretty diverse library, but we'll mainly use it for histograms, power spectra, bar charts, error charts, and scatter plots. In my experience, the most effective way of learning Matplotlib is just by using it. So we're going to run through a project that highlights the practicality of Matplotlib and how it can be used in a classroom to help improve student performance.
Let's say we've been hired by a university that's considering the implementation of machine learning across its entire curriculum. But before they commit to spending all that money, they want us to build a system that gives real-time data to professors, allowing them to dynamically adjust course material based on student comprehension. Basically, how it's going to work is that each class is broken down into different topics. At the end of each topic, the professor posts a question that each student has to answer with a connected response remote. Once the answers are fed into the database, it's our job to build a system that can present the results to the professor in real time. The university also wants 80% class comprehension before the professor can move to the next topic. We're obviously not going to build out this entire system. Instead, we'll just be using conditional statements, along with some histograms, to notify the professor on whether or not more time is needed on a topic.
So, kind of like we did before, we're going to start by importing Matplotlib, along with the pyplot interface, and assign it to the alias plt. We're also going to use NumPy again, so we'll just import that right now as well. Normally we'd be pulling this data from an API, so we'll hard-code an array named api_data to use as our input. The elements in the array are going to be 50, 54, 57, 58, 58, 60, 61, 61, 62, 63, 65, 66, 67, 68, 68, 71, 72, 72, 76, and 82.
import matplotlib.pyplot as plt
import numpy as np

api_data = np.array([50, 54, 57, 58, 58, 60, 61, 61, 62, 63,
                     65, 66, 67, 68, 68, 71, 72, 72, 76, 82])
Alrighty. Next we'll set up the conditional statement. So we'll start with: if the mean is greater than or equal to 80. We don't actually need an expression for the else condition. We want both branches to produce a histogram, so let's start by putting those together. To use the histogram function, we just pass in plt.hist, followed by parentheses. The first argument the hist function requires is an array, so we can just go ahead and pass in api_data. The next argument we're going to add is range. Since the lowest possible comprehension percentage is zero and the highest is 100, we'll use that as our actual range. The third argument that we'll be passing in will be the number of bins, and let's make that 20. We're also going to add an rwidth argument, which controls the relative bar width and isn't really necessary; I just think it makes it easier to distinguish the bins from one another. The last thing we'll do is add a color argument that will act as a signal to the professor. Since we're working on the histogram that indicates the professor can move on, we'll use the color green for go.
The first histogram is pretty much done, so we can go ahead and copy it into the else branch. We just have to make sure that we change the color from green to red. And that pretty much takes care of everything we need for the histograms. So let's take a look at everything we've done so far. It looks like everything worked.
if np.mean(api_data) >= 80:
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.93, color='g')
else:
    plt.hist(api_data, range=(0, 100), bins=20, rwidth=0.93, color='r')
Now we can finish up by adding a title and axis labels, and do some fine-tuning to the tick marks. These are all pretty straightforward function calls, so we'll start with the title function. To be nice to the professor, we're going to add in the comprehension average, make the text a little bit bigger, then use bold font as well. The first argument that we'll pass in calls for a string, so we can use an f-string to make sure the dynamic percentage is included. In the next argument, fontsize is going to be equal to 16. And finally, fontweight equals bold.
plt.title(f'Mean Comprehension: {np.mean(api_data)} %', fontsize=16, fontweight='bold')
Okay, the title's done, so we can move on to labeling the x-axis. Using plt, we'll call the xlabel function and then add our string, and for that we'll say Comprehension Percentage. And for fun, let's go ahead and make that bold as well. Next, we'll do the exact same thing for the y-axis, but we'll change the string to Number of Students. Then the last adjustment that we'll have to make is for the tick marks. The xticks function requires an array, so I'll use NumPy's arange function that I mentioned earlier. We want the ticks to go from zero to 100, but because arange works over a half-open interval, we'll have to pass in 101 as our actual stop value. Then we'll set the step to 10. Now let's run this and see how it looks.
plt.xlabel('Comprehension (%)', fontweight='bold')
plt.ylabel('Number of Students', fontweight='bold')
plt.xticks(np.arange(0, 101, step=10))
To finish this up, let's comment that variable out, and I'm going to create another api_data variable right below it with a new list of elements. In this one we'll be using 75, 80, 82, 78, 69, 94, 87, 90, 72, 81, 75, 70, 100, 86, 88, 82, 81, 76, 83, and 74.
api_data = np.array([75, 80, 82, 78, 69, 94, 87, 90, 72, 81, 75, 70, 100, 86, 88, 82, 81, 76, 83, 74])
And when we run it, everything comes out perfectly.
The last library that we'll look at is the Pandas library, which is actually built on top of NumPy. Pandas is great because it gives us an easy way to manipulate, examine, clean up, and merge tabular data. You'll find yourself using Pandas quite often in any data-driven field, because most of the time you don't actually need an entire DataFrame. While it's nice to have a large, diverse collection of data, not all of it will be relevant to the project you're working on.
We're going to start this out the same way as the first two examples. We'll import the Pandas library and assign it to the alias pd. It's not always the case, but it's pretty much common practice to use the variable df to represent a DataFrame. So, to follow common practice, let's create a variable named df and assign it to the DataFrame we'll be using. Pandas can import a handful of data types, but in our case we'll be working with a CSV file named dynamic_learning. If you already know the file path, you can pass it directly in, or you can drag the file down into the terminal and then copy and paste the path information. Either way, you just have to make sure that you pass it in as a string.
import pandas as pd

df = pd.read_csv('YOUR FILE PATH')
If you work in a data-driven field long enough, eventually you're going to come across a dataset that you're completely unfamiliar with. Or, you're going to be working on so many different projects, you just don't remember which is which. Either way, Pandas has a few really helpful tools that you can use to solve the problem. To start, we can take a quick look at the shape, size, and organizational structure.
Let's move into the IPython console. Then, using our imported data, let's start by passing in df.columns, and what's returned is a Pandas Index containing the column labels. Next we can pass in df.shape, which returns a tuple containing the number of rows and columns in the DataFrame. Or we can pass in df.size to find out the total number of elements in the DataFrame.
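Since we don't have the dynamic_learning file in front of us here, the comments below only describe the general shape of what each call returns rather than the actual output:

df.columns  # Index([...], dtype='object') -- the column labels
df.shape    # a (rows, columns) tuple
df.size     # rows * columns -- the total number of elements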
We of course still want to take a look at the actual data, but let's say that we don't want to return the entire DataFrame. There are a few functions that can help us with that as well. To start, let's take a look at the first five rows of the DataFrame. For that we can leverage the head function. So let's stay over in the console, and we can pass in df.head followed by parentheses. When we take a look at this, there's actually one thing that I'd like to point out: Pandas doesn't use any of the columns as the index. Instead, it basically adds its own zero-based index column, which you can see on the far left. It wouldn't make any sense in this example, but we do have the ability to set or change the index if we ever want to. To set the index from the get-go, all you have to do is include that in the initial import code. So we'll go back and add index_col and have it equal year. Then when we run it, the zero-based index is replaced by the year column.
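Here's a sketch of those two steps. The file path is still a placeholder, and the year column name comes from the walkthrough's dataset:

df.head()  # first five rows, with the zero-based index on the far left

df = pd.read_csv('YOUR FILE PATH', index_col='year')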
But like I said earlier, this doesn't make any sense here, and it pretty much defeats the whole purpose of using an index. So let's get rid of that, and I'll show you how to change the index after an initial import. Normally I would probably create a new variable, but for this, let's just do a reassignment. We can use set_index, followed by parentheses, and then pass in the column name semester. And now that we realize this is also a horrible way to index, we can go back to the original index by using the reset_index function.
df = df.set_index('semester')
df = df.reset_index()
The last topic I'd like to talk about in this guide is how we can use Pandas' location-based indexing for object selection. In this example we'll go over a couple of different ways of extracting rows and columns from a DataFrame. Let's say the same university from the Matplotlib section decided to hire us. Assuming enough information has been gathered, we want to take a look to see if average grades have actually gone up in the classes that use dynamic learning. We don't need the entire DataFrame to do that. We really only need the average_grade column and the dynamic_learning column. So we'll go over two ways of extracting the columns that we need.
The first method we'll go through uses the iloc function, which utilizes integer location-based indexing. Using iloc, we'll start by passing in the DataFrame we want to use, followed by iloc and square brackets. One of the really nice things about this function is that it lets us use an array, integer, or slice as an input. The first argument that we'll be passing in defines which rows we want to include, and since we want the entire range, we'll just use a slice here. Then the second argument is for the columns that we want included. We could use another slice, but let's use an array instead. And since we're using an array, we'll need to include another set of square brackets, followed by the integer locations corresponding to average_grade and dynamic_learning. I already know the index locations: they're five and six. But if you ever have to check, you can use get_loc, and that returns the integer location.
using_iloc = df.iloc[:, [5, 6]]
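If you ever do need to look those positions up, get_loc on the columns index returns the integer location. The values in the comments assume the column order described above:

df.columns.get_loc('average_grade')     # 5
df.columns.get_loc('dynamic_learning')  # 6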
When we run this, you can see that the result still has the original 48 rows, but now we just have the two columns. We can also do the exact same thing using the loc function. This works pretty much the same way as iloc, but instead it uses labels to access rows and columns. So let's create a using_loc variable. Then we can pass in the DataFrame we want to use, followed by loc, and then square brackets again. Just like before, the first argument will be the entire range of rows. Then the second argument is going to be an array, but this time we'll pass in the actual label names. Now when we run this, we get the exact same result.
using_loc = df.loc[:, ['average_grade', 'dynamic_learning']]
The last feature that I want to show you using loc uses booleans. What we're going to do is break the DataFrame down into two smaller DataFrames. The first is going to be made up of all the rows where dynamic_learning was used. Then the second DataFrame will consist of all the rows where dynamic_learning wasn't used. So for yes_dynamic_learning, we'll pass in the DataFrame, the loc function, and square brackets again. But this time we only want to include the rows where dynamic_learning equals Yes. In the second argument, we want to keep both of the columns, so we can just add a colon. Then for no_dynamic_learning, we're going to do the exact same thing, but this time we'll replace the Yes with a No. Then when we run it, we have our two separate DataFrames.
yes_dynamic_learning = df.loc[df.dynamic_learning == 'Yes', :]
no_dynamic_learning = df.loc[df.dynamic_learning == 'No', :]
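As a natural follow-up (this part isn't in the walkthrough itself), comparing the two groups' averages to answer the university's question is just one more line each:

print(yes_dynamic_learning['average_grade'].mean())
print(no_dynamic_learning['average_grade'].mean())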
Those are just a few of the basic functions available to you in the Pandas library. And as we move through the course, I highly encourage you to take some time to go through these user guides to gain a more in-depth understanding of the functionality, and how to take advantage of what these libraries do best.