
How to Learn Machine Learning? – Intro to Datasets


In the previous tutorial, we set up a working environment with all the tools you need to start your journey in the data science world.

If you missed it, you also missed the installation of the environment, which is essential if you want to follow the examples. I recommend checking it out at the following link: https://cyberhulk.net/learn-machine-learning/.

Data is the King

Data plays a big part in machine learning, and it is essential to understand and use the right terminology when talking about it.

In this tutorial, you will discover precisely how to describe and talk about data in machine learning. We are going to walk through datasets, which will significantly help you understand machine learning algorithms in general.

Creating a Dataset

As always, let’s jump straight into practice.

Head over to http://www.convertcsv.com/csv-viewer-editor.htm and use the grid to create a CSV file.


I have created a table with several European locations and the distribution of salaries by age and occupation. What is most interesting, however, is whether each person is self-employed or not. Ideally, I would like to find patterns across location, age, salary and occupation, and understand which factors are key in people’s decision to become self-employed in IT. Here is the catch: with just 11 rows like mine, it is relatively easy to spot a pattern by eye, but imagine having billions of rows of data.

You get the point. Moreover, we have just covered the first piece of machine learning terminology: “dependent” and “independent” variables.

OK, so what is the difference? Easy: the independent variables are the first four columns of the table, which are:

  1. Country
  2. Age
  3. Salary
  4. Occupation

So the sole purpose of the independent variables is to predict the status or condition of the dependent variable, which in our case is whether the person is self-employed or not.

By feeding in lots of data with independent variables, we will be able to predict the dependent variable. Easy, right? Do not rush: we have some prep work to do before we begin the magic of predictions.
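To make the split concrete, here is a minimal sketch in plain Python. The column names follow the table described above, but the exact header spelling and the two rows are assumptions made up for illustration; they are not the actual contents of my CSV file.

```python
# Column names mirror the table described above; the two rows are made up
# purely for illustration and are not the real data.
rows = [
    {"Country": "Germany", "Age": 34, "Salary": 52000,
     "Occupation": "Developer", "Self-employed": "No"},
    {"Country": "Spain", "Age": 41, "Salary": 38000,
     "Occupation": "Designer", "Self-employed": "Yes"},
]

independent_columns = ["Country", "Age", "Salary", "Occupation"]
dependent_column = "Self-employed"

# Inputs: what we know about each person.
independent = [[row[name] for name in independent_columns] for row in rows]

# Output: what we would like to predict for each person.
dependent = [row[dependent_column] for row in rows]

print(independent)
print(dependent)
```

Each inner list in independent holds the four input values for one person, and dependent holds the matching answer we would like a model to predict.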

Importing Libraries

We will need to use two different environments installed and configured in the first tutorial:

  1. Python
  2. R

Open the Spyder IDE installed last time and create a new file. Name it however you prefer; I will name it “data_prep_draft.py”.


Why do we need to import a library? First, we need to understand what a library is.

A library is a tool that you can use to complete a specific task. The benefit here is that you do not have to write code that someone else has already written and tested.

In our case, you just have to provide the specific input, and the library will return the desired output.

Throughout this set of tutorials, we are going to rely mostly on libraries to save time, keep your learning process fun, and make the machine learning models as efficient as possible.

This time I am going to use three essential libraries:

  1. NumPy – import it by typing import numpy as np, which makes it available under the shortcut np.
    NumPy is mostly used to perform math-based operations during the machine learning process.
  2. Matplotlib – import the sublibrary “pyplot” by typing import matplotlib.pyplot as plt, which makes it available under the shortcut plt. We will use it to plot charts in Python.
  3. pandas – import the pandas library under a shortcut by typing import pandas as pd. We are going to use pandas to import datasets and manage them.

As a result, you should be able to run the import statements in Spyder without errors.
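Put together, the three imports described in the list look like this. The version printout at the end is only an optional sanity check I have added, not a required step.

```python
import numpy as np               # math operations on arrays
import matplotlib.pyplot as plt  # plotting charts
import pandas as pd              # importing and managing datasets

# Optional sanity check: if the imports worked, the versions print without errors.
print(np.__version__, pd.__version__)
```

If any of these lines raises an ImportError, the library is most likely missing from the environment set up in the previous tutorial.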


So, for now, we are done with Python. The next step is to open R. The good news is that we do not need to import any libraries for this tutorial; the required ones are already available out of the box in RStudio.


Importing Datasets

For datasets we will have to use both environments once again:

  1. Python
  2. R

Let’s start with Python.

Before we jump to importing datasets, you will need a working directory. To set a working directory, you need to open file explorer in Spyder and choose a target folder on your OS.

I have placed the “data_prep_draft.py” file and the previously created CSV file in a folder called “Intro to datasets” and pointed Spyder to that directory.
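If you prefer to check or set the working directory from code rather than through the file explorer, a minimal sketch with Python's standard os module could look like the following. It assumes the folder “Intro to datasets” sits inside the directory you are currently in; adjust the path to wherever your folder actually lives.

```python
import os

# "Intro to datasets" is the folder name used in this tutorial;
# replace it with the path to your own folder.
target_folder = "Intro to datasets"

# Only switch if the folder exists, so the script does not crash elsewhere.
if os.path.isdir(target_folder):
    os.chdir(target_folder)

working_directory = os.getcwd()
print(working_directory)
```

Whatever path is printed is where Python will look for files referenced by name only, such as the CSV file.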


Declare a variable “dataset” and assign it the result of calling the “read_csv” method on the pandas shortcut “pd”, passing the filename as a parameter. This reads the data from the CSV file in your working directory.

dataset = pd.read_csv('self_employed.csv')


To test that the function is working correctly, add a log statement by typing print(dataset) and head over to the console on the right to see the output.
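Here is a self-contained sketch of the read-and-print step that you can run even without the CSV file on disk. It feeds read_csv an in-memory text buffer instead of a filename; the header names and the two rows are assumptions made up for illustration, so your real file will print different values.

```python
import io

import pandas as pd

# Stand-in for self_employed.csv: the header names and rows are assumptions.
csv_text = """Country,Age,Salary,Occupation,Self-employed
Germany,34,52000,Developer,No
Spain,41,38000,Designer,Yes
"""

# read_csv accepts a file path or a file-like object such as StringIO.
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset)                 # the full table
print(dataset.shape)           # (number of rows, number of columns)
print(list(dataset.columns))   # the column names read from the header
```

With the real file in your working directory, you would pass 'self_employed.csv' to read_csv instead of the buffer.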


You can also double-check in the variable explorer tab that the dataset is right there, waiting for you to play with it.


If you double-click on it, you can see the dataset in a nicely formatted view.


Do not be surprised that indexing starts at zero; that is typical behaviour for Python and many other programming languages. However, you will see that it is done differently in R, where indexing starts at one.
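You can see the zero-based index for yourself with a tiny example. The three rows below are made up for illustration and are not taken from my CSV file.

```python
import pandas as pd

# A made-up three-row table, just to look at the row labels.
dataset = pd.DataFrame({
    "Country": ["Germany", "Spain", "France"],
    "Age": [34, 41, 29],
})

first_label = dataset.index[0]   # label of the first row
first_row = dataset.iloc[0]      # the first row, selected by position

print(first_label)
print(first_row["Country"])
```

The first row carries the label 0, and position 0 in iloc selects that same first row.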

That’s it for today, I hope you enjoyed it. In the next tutorial we are going to go deeper into terminology and discuss two critical concepts in machine learning:

  1. The Matrix of Features
  2. The Dependent Variables Vector

As always, feel free to ask questions in the comments; I will be happy to answer.

 
