In this guide, we're going to quickly go through how to import data as well as how to segment independent and dependent variables. We already worked through a lot of this in the Pandas section on library imports, but this should still act as a good review.
If you remember back, we used the read_csv function in Pandas to import data into a DataFrame. Well, we're going to start off the exact same way in this example. We'll create a DataFrame variable called df. Then we'll use the read_csv function and pass in the path to the file we want to work with. I'd also like to quickly mention, while we've only worked with CSV files, Pandas also has the ability to read other text formats, like JSON and HTML.
Once we have the DataFrame imported, we can go ahead and check the label names. If you don't remember how to do that, all you have to do is run df.columns. So looking at the column labels, we'll be making average_grade the dependent variable, and all the other columns are going to be the independent or feature variables.
import numpy as np
import pandas as pd

df = pd.read_csv('YOUR FILE PATH')
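If you'd like to try the import step without a file on disk, here's a minimal sketch. It assumes a small made-up dataset: every column name except average_grade is hypothetical, the values are invented, and io.StringIO simply stands in for the real file path.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file. Only the average_grade label
# (at position 5) comes from the guide; the other names and all values are made up.
csv_text = """hours_studied,classes_attended,homework_score,quiz_score,midterm_score,average_grade,final_exam
10,28,88,90,85,87.5,91
4,20,70,65,72,69.0,68
7,25,80,78,81,79.5,83
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Check the column labels
print(df.columns)

# Look up the integer position of the dependent variable
position = df.columns.get_loc("average_grade")
print(position)
```

With a real file, you'd pass the path to read_csv in place of the StringIO object.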
For the first variable, we're going to go ahead and name it features, which will include all the independent variables. After we reference df, you can use either loc, which selects by label, or iloc, which selects by integer position, followed by square brackets. Since I'm working with index positions here, I'll use iloc. Then we'll pass in a slice for the rows we want to access, which in this case is going to be all of them. Next, we use a list and I'll pass in the index locations of all the columns I want included.
features = df.iloc[:, [0, 1, 2, 3, 4, 6]]
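To compare the two indexers, here's a rough sketch that makes the same selection once by position with iloc and once by label with loc. The DataFrame is a made-up example: only average_grade comes from this guide, and the other column names and values are assumptions.

```python
import pandas as pd

# Hypothetical data; only the average_grade label is taken from the guide
df = pd.DataFrame({
    "hours_studied": [10, 4, 7],
    "classes_attended": [28, 20, 25],
    "homework_score": [88, 70, 80],
    "quiz_score": [90, 65, 78],
    "midterm_score": [85, 72, 81],
    "average_grade": [87.5, 69.0, 79.5],
    "final_exam": [91, 68, 83],
})

# Position-based: all rows, every column position except 5
by_position = df.iloc[:, [0, 1, 2, 3, 4, 6]]

# Label-based: all rows, every column label except average_grade
feature_names = [name for name in df.columns if name != "average_grade"]
by_label = df.loc[:, feature_names]

# Both approaches select the same columns
print(by_position.equals(by_label))
```

Selecting by label tends to hold up better if the column order in the file ever changes, while selecting by position is shorter to type.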
The second variable is going to be called dependent_variables. We're going to reference the DataFrame again, and use iloc again. We want all the rows, but now we only want average_grade, so I'll pass in the related index integer, which is five. Then when we run it, we end up with two separate objects: a DataFrame of features and a Series holding the dependent variable.
dependent_variables = df.iloc[:, 5]
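One detail worth checking here is what type of object comes back. As a small sketch on made-up data (the column names other than average_grade are hypothetical), passing a single integer to iloc gives back a one-dimensional Series, while wrapping that integer in a list keeps a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data; only the average_grade label is taken from the guide
df = pd.DataFrame({
    "hours_studied": [10, 4, 7],
    "classes_attended": [28, 20, 25],
    "homework_score": [88, 70, 80],
    "quiz_score": [90, 65, 78],
    "midterm_score": [85, 72, 81],
    "average_grade": [87.5, 69.0, 79.5],
    "final_exam": [91, 68, 83],
})

# A single integer selects one column as a Series
as_series = df.iloc[:, 5]

# A list with one integer keeps the result as a DataFrame
as_frame = df.iloc[:, [5]]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```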
But what I'd like to do is actually convert them into NumPy arrays, which allows us to work with nested arrays instead. And since Pandas was written on top of NumPy, this is a really simple conversion. All we have to do is create our variable and reference the object that we want to convert. Then call to_numpy, followed by parentheses. Then we'll do that one more time for the dependent variable. Now when we run it, we end up with two sets of arrays. And most of the time, these are the steps that you'll be taking when you're working with your own data.
features = features.to_numpy()
dependent_variables_array = dependent_variables.to_numpy()
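To see what the conversion produces, here's a minimal sketch that runs the same steps on made-up data and inspects the results. Apart from average_grade, the column names and all of the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the average_grade label is taken from the guide
df = pd.DataFrame({
    "hours_studied": [10, 4, 7],
    "classes_attended": [28, 20, 25],
    "homework_score": [88, 70, 80],
    "quiz_score": [90, 65, 78],
    "midterm_score": [85, 72, 81],
    "average_grade": [87.5, 69.0, 79.5],
    "final_exam": [91, 68, 83],
})

# Same steps as in the guide, chained for brevity
features = df.iloc[:, [0, 1, 2, 3, 4, 6]].to_numpy()
dependent_variables_array = df.iloc[:, 5].to_numpy()

# The features become a 2-D array (rows x columns);
# the dependent variable becomes a 1-D array (one value per row)
print(type(features).__name__, features.shape)
print(type(dependent_variables_array).__name__, dependent_variables_array.shape)
```

Note that the column labels are dropped in the conversion, so it helps to keep track of which position holds which feature.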
Remember, NumPy is the primary library we'll be using for mathematical operations. So when we do this conversion, we're getting the data ready to be worked with and analyzed. And that's really about all we needed to cover in this guide.
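As a quick illustration of the kind of math the arrays are now ready for, here's a small sketch using made-up data. The column names other than average_grade, and the particular operations chosen, are assumptions and not steps from the guide itself.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the average_grade label is taken from the guide
df = pd.DataFrame({
    "hours_studied": [10, 4, 7],
    "classes_attended": [28, 20, 25],
    "homework_score": [88, 70, 80],
    "quiz_score": [90, 65, 78],
    "midterm_score": [85, 72, 81],
    "average_grade": [87.5, 69.0, 79.5],
    "final_exam": [91, 68, 83],
})

features = df.iloc[:, [0, 1, 2, 3, 4, 6]].to_numpy()
dependent_variables_array = df.iloc[:, 5].to_numpy()

# Mean of each feature column (axis=0 averages down the rows)
column_means = features.mean(axis=0)

# Mean of the dependent variable, then each grade's distance from that mean
grade_mean = np.mean(dependent_variables_array)
centered_grades = dependent_variables_array - grade_mean

print(column_means)
print(grade_mean)
print(centered_grades)
```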
Code
import numpy as np
import pandas as pd

df = pd.read_csv('YOUR FILE PATH')

features = df.iloc[:, [0, 1, 2, 3, 4, 6]]
dependent_variables = df.iloc[:, 5]

features = features.to_numpy()
dependent_variables_array = dependent_variables.to_numpy()