In the previous tutorial, we set up a working environment with all the tools you need to start your journey in the data science world.
If you missed it, you also missed installing the environment, which is essential if you want to follow the examples, so I recommend checking it out at the following link: https://cyberhulk.net/learn-machine-learning/.
Data is the King
Data plays a big part in machine learning, and it is essential to understand and use the right terminology when talking about it.
In this tutorial, you will discover precisely how to describe and talk about data in machine learning. We are going to walk through datasets which will significantly help you with understanding machine learning algorithms in general.
Creating a Dataset
As always, let’s jump straight into practice.
Head over to http://www.convertcsv.com/csv-viewer-editor.htm and use the grid to create a CSV file.
I have created a table with several European locations and the distribution of salaries by age and occupation; what is most interesting, however, is whether each person is self-employed or not. Ideally, I would like to find patterns across location, age, salary and occupation and understand which factors drive the decision to become self-employed in IT. Here is the catch: with just 11 rows like mine, it is relatively easy to spot a pattern, but imagine having billions of rows of data.
You get the point. Moreover, we have just covered the first pair of terms in machine learning – “dependent” and “independent” variables.
OK, so what is the difference? Easy: the independent variables are the first four columns of the table, which are location, age, salary and occupation.
So the sole purpose of the independent variables is to predict the status or condition of the dependent variable, which in our case is whether the person is self-employed or not.
By feeding in lots of data with independent variables, we will be able to predict the dependent variable. Easy, right? Do not rush; we have some prep work to do before we begin the magic of predictions.
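To make the split concrete, here is a minimal sketch in plain Python. The rows and key names below are made up for illustration and are not the values from my table; the point is only that the first four fields are the independent variables and the last one is the dependent variable.

```python
# Hypothetical rows: the keys and values are assumptions for illustration only.
rows = [
    {"location": "Berlin", "age": 29, "salary": 52000,
     "occupation": "Developer", "self_employed": "No"},
    {"location": "Madrid", "age": 41, "salary": 48000,
     "occupation": "Designer", "self_employed": "Yes"},
    {"location": "Oslo", "age": 35, "salary": 61000,
     "occupation": "Tester", "self_employed": "No"},
]

# Independent variables: the inputs we feed in.
independent_names = ["location", "age", "salary", "occupation"]
# Dependent variable: the value we want to predict.
dependent_name = "self_employed"

# One list of input values per row.
independent = [[row[name] for name in independent_names] for row in rows]
# One value to predict per row.
dependent = [row[dependent_name] for row in rows]

print(independent)
print(dependent)
```

Each inner list in independent lines up with one entry in dependent, which is the pairing a prediction model learns from.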
We will need the two environments installed and configured in the first tutorial: Python (in Spyder) and R (in RStudio).
Open the Spyder IDE installed last time and create a new file; name it however you prefer. I will name it “data_prep_draft.py”.
Why do we need to import a library? First, we need to understand what a library is.
A library is a tool that you can use to complete a specific task. The benefit here is that you do not have to write code that someone else has already written and tested.
In our case, you just have to provide the specific input, and the library will return the desired output.
During this set of tutorials, we are going to rely mostly on libraries to save time, keep your learning process fun and make the machine learning models as efficient as possible.
This time I am going to use three essential libraries:
- NumPy – import it under its usual shortcut “np” by typing
import numpy as np
NumPy is mostly used to perform math-based operations during the machine learning process.
- Matplotlib – import the sublibrary “pyplot” under the shortcut “plt” by typing
import matplotlib.pyplot as plt
We will use this library to plot charts in Python.
- Pandas – import the pandas library under the shortcut “pd” by typing
import pandas as pd
We are going to use pandas to import datasets and manage them.
As a result, you should be able to run the import statements in Spyder without errors.
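Put together, the top of the file could look like the sketch below. The two checks after the imports are my own addition: a quick way to confirm that each shortcut points at the right library, plus one tiny NumPy calculation on made-up numbers.

```python
import numpy as np               # math-based operations
import matplotlib.pyplot as plt  # plotting charts
import pandas as pd              # importing and managing datasets

# Each shortcut should resolve to the library it stands for.
print(np.__name__, plt.__name__, pd.__name__)

# A tiny NumPy operation on made-up ages, just to see the library respond.
mean_age = np.array([20, 30, 40]).mean()
print(mean_age)
```

If any of the three lines at the top raises an ImportError, the library is missing from your environment and the installation steps from the previous tutorial are worth revisiting.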
So, for now, we are done with Python; the next step is to open R. The good news is that we do not need to import any libraries for this tutorial: the required ones are already available out of the box in RStudio.
For datasets, we will have to use both environments once again.
Let’s start with Python.
Before we jump to importing datasets, you will need a working directory. To set a working directory, open the file explorer in Spyder and choose a target folder on your computer.
I have placed the “data_prep_draft.py” file and the previously created CSV file in a folder called “Intro to datasets” and pointed Spyder to read from that directory.
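If you want to confirm from code which folder Python is actually reading from, a small check like the one below can help. This is only a sketch using the standard pathlib module; it assumes your CSV file is called self_employed.csv, so adjust the name if yours differs.

```python
from pathlib import Path

# Folder that relative file names are resolved against.
working_dir = Path.cwd()
print("Working directory:", working_dir)

# Assumed file name; change it if you saved the CSV under another name.
csv_path = working_dir / "self_employed.csv"
print("CSV found:", csv_path.exists())
```

If the second line prints False, Spyder is pointed at a different folder than the one holding your CSV file.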
Declare a variable “dataset” and assign it the result of calling the “read_csv” method through the pandas shortcut “pd”, passing the file name as a parameter, to read the data from the CSV file in your working directory.
dataset = pd.read_csv('self_employed.csv')
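To test that the function is working correctly, add a log statement by typing
print(dataset)
and head over to the console on the right to see the output. Note that there are no quotes around dataset; with quotes, Python would print the word itself rather than the table.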
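Beyond a plain print, pandas offers a few quick ways to look at what was loaded. The sketch below is self-contained, so it reads from an in-memory string instead of the real file; the three rows and the column names are hypothetical stand-ins for self_employed.csv.

```python
import io

import pandas as pd

# Stand-in for self_employed.csv: rows and column names are hypothetical.
csv_text = """location,age,salary,occupation,self_employed
Berlin,29,52000,Developer,No
Madrid,41,48000,Designer,Yes
Oslo,35,61000,Tester,No
"""

# read_csv accepts a file path or a file-like object such as StringIO.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)   # (number of rows, number of columns)
print(dataset.head())  # the first rows of the table
print(dataset.dtypes)  # the data type pandas inferred for each column
```

With your own file, the only change is to pass the file name to read_csv, exactly as in the line above.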
You can also double-check in the Variable explorer tab that the dataset is right there, waiting for you to play with it.
If you double-click on it, you can see the dataset in a nicely formatted way.
Do not be surprised that indexing starts at zero; that is typical behaviour for Python and many other programming languages. However, you will see that it is done in a completely different way in R.
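Here is a small sketch of what zero-based indexing means in practice. The two-column table is made up for illustration; the same idea applies to any dataset loaded with pandas.

```python
import pandas as pd

# Hypothetical table used only to illustrate indexing.
dataset = pd.DataFrame({
    "age": [29, 41, 35],
    "self_employed": ["No", "Yes", "No"],
})

# Row labels are generated automatically, starting at zero.
print(list(dataset.index))

# Position 0 is the first row; the last row sits at position len - 1.
first_row = dataset.iloc[0]
last_row = dataset.iloc[len(dataset) - 1]
print(first_row["age"], last_row["age"])
```

So a table with three rows has positions 0, 1 and 2, and asking for position 3 would raise an error.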
That’s it for today; I hope you enjoyed it. In the next tutorial, we are going to go deeper into the terminology and discuss two critical concepts in machine learning:
- The Matrix of Features
- The Dependent Variables Vector
As always, feel free to ask questions in the comments; I will be happy to answer.