Hey guys, welcome to the next episode of “How to Learn Machine Learning Tutorials”. In the previous episode, we learned how to create and process datasets using R and Python. We also got an introduction to dependent and independent variables.
In case you missed it, I definitely recommend checking it out at the following link: https://cyberhulk.net/learn-machine-learning-intro-datasets/.
The Matrix of Features
Today I want you to understand a fundamental rule in Machine Learning. We need to distinguish between two terms which will be used frequently throughout this set of tutorials:
- Matrix of features
- Dependent variables vector
Let’s understand what the “matrix of features” is. Open Spyder and click on the dataset.
The matrix of features is a term used in machine learning to describe the set of columns that contain the independent variables, taken across all lines in the dataset. These lines of the dataset are called lines of observations.
We are going to create the matrix of features for the four independent variables across the ten lines of our dataset, which we call the “ten lines of observations”.
Let’s jump into creating our matrix of features.
In Spyder, type the following line of code:
x = dataset.iloc[:, :-1].values
You might ask, what did we just do here? We declared a variable “x”, took our dataset variable called “dataset”, and used “iloc” with specific parameters to select all the columns of independent variables. Note that “iloc” selects by integer position, so it works with integers and integer slices rather than column names. The colon on the left means we have selected all of the lines of observations, whereas “:-1” on the right selects all of the columns except the last one, which is logical in our case because we want only the independent variables with their lines.
In other words, we took all the columns and excluded the last one, which contains our dependent variable. Here “.values” is a pandas attribute that returns the selected data as a NumPy array instead of a DataFrame. Let’s see what we receive as a result of this operation. Type “x” in the Spyder console.
As you can see, we have selected the first four columns and excluded the last one.
For now, we are done with the matrix of features; let’s move on to the next term, “The Dependent Variables Vector”.
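If you do not have the dataset at hand, the selection can be tried on a small stand-in. This is only a sketch: the column names and values below are made up for illustration and are not the ones from the tutorial's CSV file. The only things it keeps from our dataset are the shape (ten lines of observations, four independent variables) and the “self-employed” column in the last position.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's dataset. The column names and
# values are invented; only the layout (4 features + "self-employed") matters.
dataset = pd.DataFrame({
    "country": ["US", "UK", "DE", "FR", "US", "DE", "UK", "FR", "US", "DE"],
    "age": [25, 32, 41, 29, 37, 45, 23, 51, 34, 28],
    "salary": [48000, 52000, 61000, 45000, 58000, 67000, 39000, 72000, 55000, 47000],
    "years_experience": [2, 7, 15, 4, 11, 20, 1, 25, 9, 3],
    "self-employed": ["no", "yes", "no", "no", "yes", "yes", "no", "yes", "no", "no"],
})

# First colon: all lines of observations.
# ":-1": every column except the last one (the dependent variable).
x = dataset.iloc[:, :-1].values

print(x.shape)  # (10, 4): ten observations, four independent variables
print(x[0])     # the first line of observations, without "self-employed"
```

Because the selection is positional, it does not care what the columns are called, only that the dependent variable sits in the last column.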
Dependent Variables Vector
The dependent variables vector is a term used in Machine Learning for the column of dependent variable values in the dataset. Here we also have lines of observations: one value of the dependent variable per line.
We are going to create the “dependent variables vector” from the last column, named “self-employed”, which consists of ten lines of observations.
In Spyder, type the following line:
y = dataset.iloc[:, 4].values
Let’s go over this line of code. We have created a variable “y” and used the same “iloc” as in the matrix of features. With the left colon, we have selected all of the lines, which we call lines of observations. On the right, instead of a slice with a colon, this time we put the single integer “4”, which selects the last column: it is the fifth column, so its index is 4. Remember that indexes in Python, as in most scripting languages, start at 0.
This operation selected the last column, which is our dependent variable “self-employed”, with all 10 lines of observations.
To test this, type “y” in the Spyder console.
As you can see, we have selected the dependent variable values, each one either “yes” or “no”, indicating whether a person is self-employed or not.
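One detail worth checking yourself is the difference between the integer 4 and the slice :4, because they look similar and select very different things. The sketch below uses a made-up five-column stand-in (the feature names “f1” to “f4” and all values are hypothetical, not taken from the tutorial's CSV file) to compare the two, and also shows -1 as another way to pick the last column.

```python
import pandas as pd

# Hypothetical stand-in: four invented feature columns plus "self-employed".
dataset = pd.DataFrame({
    "f1": list(range(10)),
    "f2": list(range(10, 20)),
    "f3": list(range(20, 30)),
    "f4": list(range(30, 40)),
    "self-employed": ["yes", "no"] * 5,
})

# A single integer selects exactly one column: index 4 is the fifth column.
y = dataset.iloc[:, 4].values

# -1 also means "the last column", however many columns there are.
y_last = dataset.iloc[:, -1].values

# A slice ":4" selects columns 0, 1, 2 and 3, i.e. the features, not the target.
first_four = dataset.iloc[:, :4].values

print(y.shape)           # (10,): one value per line of observations
print(first_four.shape)  # (10, 4)
print(list(y) == list(y_last))  # True
```

Using -1 is a reasonable alternative when you only know that the dependent variable is in the last column and do not want to count columns.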
So that’s it for Python; time to switch to R. Open RStudio. You will see that operations there are much more straightforward, because we do not have to make a distinction between the matrix of features and the dependent variables vector.
We have to set a working directory in R as well. In RStudio, go to the Files section in the bottom right of the screen, navigate to your working directory, click “More” and choose the “Set As Working Directory” option.
If all is correct, you should see output like this in the console:
setwd("~/Dev/AI/Intro to datasets")
Of course, you will see your own working directory path in place of this one.
Now we are ready to start importing the dataset. To do this in R, follow these simple steps.
Create a new R file, name it “data_prep_draft.r” and save it to the same working directory as the other files created previously.
Then we just need one line of code. As in Python, we are going to call the variable that holds the dataset “dataset”, and use the function read.csv to read and import the CSV file created earlier.
dataset = read.csv('self-employed.csv')
After performing this operation, you should be able to see the imported dataset.
There are two clear distinctions that you should know:
- Unlike in Python, indexes in R start at 1, so in our lines of observations you should see ten lines, indexed from 1 to 10
- You do not have to programmatically separate the matrix of features from the dependent variables vector in R
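To see the indexing difference from the Python side, here is a small sketch. The one-column stand-in and its values are hypothetical. It prints the default row labels pandas assigns, then relabels a copy so that the lines are numbered the way RStudio shows them.

```python
import pandas as pd

# Hypothetical one-column stand-in with ten lines of observations.
dataset = pd.DataFrame({"self-employed": ["yes", "no"] * 5})

# pandas labels rows starting at 0 by default.
print(list(dataset.index))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Relabel a copy so the rows are numbered 1..10, as in R.
r_style = dataset.copy()
r_style.index = range(1, len(r_style) + 1)
print(list(r_style.index))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Positions are unaffected by labels: iloc[0] is still the first row.
print(r_style.iloc[0, 0])   # yes
```

Relabeling only changes what is displayed as the row name; “iloc” keeps counting positions from 0 either way.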
This will start to make perfect sense as we dive deeper in the next tutorials. That’s it for today. Stay tuned for the upcoming tutorials, and I hope my explanation was easy to follow. My wish is to make the learning curve as gentle as possible in the complex world of machine learning and data science.
In the next tutorial, we are going to learn how to take care of missing data, because sometimes your dataset has gaps and you have to deal with them.
As always, feel free to ask questions in the comments below; I will be happy to answer.