
A Guide to Machine Learning with Python




Machine Learning is undoubtedly one of the hottest trends in the IT sector today. Whether you are a software developer, a product manager, or a business analyst, machine learning has the power to change the way you work and take your career to a new level.

Over the past decade, Machine Learning has been steadily transforming sectors from healthcare and consumer electronics to retail. If you work in this busy industry, you are probably looking for a way to build a solid understanding of Machine Learning that is practical and rigorous, yet fast and concise. This guide is meant to help you do that.



There are many resources available for learning Machine Learning, and Python is a good language to learn it with. Python is one of the most commonly used languages for machine learning because it is easy to read and quick to write.




Unlike more specialized languages such as R, Python is a general-purpose language, which means the same platform can be used for both research and production development. A Machine Learning project typically consists of the following steps, which together make up the overall workflow.


Machine Learning with Python


  • Define Objective- First, define your objective, such as recognizing an image or a piece of text.
  • Collection of Data- Next, collect all the data associated with your machine learning task.
  • Preparing the Data- Then, load the data and prepare it as required for the later steps.
  • Develop/Choose a Model- Now, choose a model from among those developed over the years by data scientists and researchers.
  • Training- This is the actual machine learning step, where an algorithm trains the model on the prepared data.
  • Evaluate/Analyze- In this step, you evaluate, analyze, or test whether the model and the overall workflow are working well.
  • Hyper Parameter Tuning- In this step, you tune the model's settings to improve on the results of the training step.
  • Prediction- This is the final step, where the trained model predicts values for new data.
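
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the workflow followed in the rest of this guide: it assumes scikit-learn is already installed, uses the copy of the iris data bundled with it (`load_iris`) rather than a downloaded file, and picks a k-nearest-neighbours model purely for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Collect and prepare the data (assumption: the iris copy bundled with scikit-learn).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=7)

# Choose a model; n_neighbors is the kind of hyperparameter one would tune later.
model = KNeighborsClassifier(n_neighbors=5)

# Train on the training part only.
model.fit(X_train, y_train)

# Predict on data the model has not seen, then evaluate.
predictions = model.predict(X_test)
workflow_accuracy = accuracy_score(y_test, predictions)
print("accuracy on held-out data:", workflow_accuracy)
```

Each comment corresponds to one or two of the steps in the list; the sections below go through the same steps in more detail.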


Now, let us walk through step-by-step instructions on how Python can be used for Machine Learning.


Installing the Python and SciPy platform


Python is a high-level, interpreted, general-purpose programming language, and SciPy is an open source library designed for Python. SciPy is used for scientific and technical computing, and together with its related libraries it is very useful for doing Machine Learning in Python.

To get started, you need to install Python and the SciPy libraries on your system. To install Python, visit the official Python website (python.org), where you will find different versions and can choose the one that suits your system. Once Python is installed, you need to install the required libraries. There are 5 of them:


  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn (installed under the package name scikit-learn)

There are many ways to install these libraries, but we suggest you start with the scipy installation page. It provides guidance and instructions for installing them on different platforms, including Mac OS X, Windows, and Linux.
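
Before moving on, it can help to confirm which of these libraries Python can actually find. The sketch below is one possible way to do that using only the standard library; the helper name `find_missing` is our own and not part of any of the libraries listed.

```python
import importlib.util

# Import names used in this guide; scikit-learn is imported as "sklearn".
REQUIRED = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

This only checks that each library can be located; the version check in the next section confirms that they also import correctly.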



 

Start Python and Check Versions


It is a good idea to check that your Python environment was installed successfully and works as expected. The script below tests your environment: it imports each required library and prints its version. Open a command line, start the Python interpreter, and then type or paste in the script:

Script:


# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

If your Python environment is working, the script will print output like that shown below. Compare it with your own output; your version numbers may differ, and newer versions are generally fine. If you get an error message instead, look it up and resolve it before going any further.



Output:

Python: 3.6.8 (default, Dec 30 2018, 13:01:55)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.1.0
numpy: 1.15.4
matplotlib: 3.0.2
pandas: 0.23.4
sklearn: 0.20.2

 

Load The Data

To keep the example easy to follow, we use the iris flowers dataset. It is famous as the "hello world" dataset of statistics and machine learning, and it is used by practically everyone learning the subject.




The dataset contains 150 observations of iris flowers. Four columns hold measurements of the flowers in centimetres, and a fifth column gives the species of the observed flower. Loading the data takes two steps, shown below:


Import Libraries


First of all, we need to import all the modules, functions, and objects that we are going to use. You can do this with the following script:

Script:


# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Everything should load without any error. You need a working SciPy environment before going further, so if you do get an error, resolve it first.




Load Dataset

We can load the data directly from a URL. The iris dataset comes from the UCI Machine Learning Repository, a collection of databases, domain theories, and data generators made available over the internet for analyzing machine learning algorithms; the script below reads a CSV copy of it hosted on GitHub. We use pandas (a data analysis library for Python) to load the data. Pandas will also be used later to explore the data with both descriptive statistics and data visualization. You can use the following script to get this done without any hassle.

Script:


# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

Here, we also specify the name of each column while loading the data, which will be helpful later when we explore it. The dataset should load without any error. If you have network problems, you can download the iris.csv file to your system and load it in the same way; just replace the URL with the local file name and location.
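
As a small sketch of the local-file variant, the example below passes pandas a file-like object instead of a URL. The in-memory `io.StringIO` object is only a stand-in so the example runs without a download, and the file name `iris.csv` in the comment is an assumed location, not something the dataset prescribes.

```python
import io

import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# Stand-in for a downloaded file: the first three rows of the dataset.
# With a real local copy you would pass its path instead, for example
# pandas.read_csv("iris.csv", names=names)  (path is an assumption).
local_file = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

sample = pandas.read_csv(local_file, names=names)
print(sample.shape)
print(sample.dtypes)
```

Because the file has no header row, the `names` argument supplies the column labels, exactly as in the URL version.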


Summarize the Dataset


In this part, we look at the dataset in several different ways. This step is essential for understanding the structure of the data. We will examine it in 4 ways, each described below:


Dimensions of Dataset

Here, we get a quick idea of how many instances (rows) and attributes (columns) the dataset contains. Use the command shown below:

Command:


# shape
print(dataset.shape)


You should see 150 rows (instances) and 5 columns (attributes), reported like this:


Output:

(150, 5) 

Peek at the Data

It is always a good idea to actually eyeball your data. Use the command shown here:

Command:
 

# head
print(dataset.head(20))


This command shows the first 20 rows of the dataset. You will get output like this:


Output:

    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

 

Statistical Summary

In this step, we look at a summary of each attribute. It includes the count, mean, standard deviation, and min and max values, along with a few percentiles. Use the command below:


Command:


# descriptions
print(dataset.describe())

The output is shown here. We can see that all of the numerical attributes are on the same scale (centimetres) and fall in a similar range, between 0 and 8 cm.

Output:

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
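
To make the rows of this summary less opaque, here is a rough sketch of the same statistics computed by hand with Python's `statistics` module. It uses only the first five sepal-length values from the head() output above, so the numbers will not match the full-dataset table; our understanding is that `describe()` reports the sample standard deviation and linearly interpolated percentiles by default, which is what this sketch mirrors.

```python
import statistics

# First five sepal-length values from the head() output above.
values = [5.1, 4.9, 4.7, 4.6, 5.0]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "std": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
    "min": min(values),
    "max": max(values),
}

# Quartiles, interpolating linearly between neighbouring data points.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
summary.update({"25%": q1, "50%": median, "75%": q3})

for key, value in summary.items():
    print(f"{key:>5}: {value:.6f}")
```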

Class Distribution

Here, we look at the number of rows (instances) that belong to each class. You can get the exact counts with the command shown here:

Command:
 

# class distribution
print(dataset.groupby('class').size())


The output shows that each class has the same number of instances:

 
Output:
 

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50

 

Data Visualization

Now that we have a basic idea of the data, we can extend it with some visualizations. We will look at two types of plots:

  • Univariate Plots

Univariate plots help you understand each attribute on its own. Because the input variables are numeric, we can create a box-and-whisker plot of each one. Use the command below:

Command:


# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()


This gives you a much clearer idea of the distribution of each input variable.

You can also create a histogram of each input variable to get an idea of its distribution. Use the command shown below:

Command:


# histograms
dataset.hist()
plt.show()


  • Multivariate Plots

Multivariate plots help you understand the relationships between attributes. You can look at the interactions between pairs of attributes with the command below:

Command:


# scatter plot matrix
scatter_matrix(dataset)
plt.show()


In the resulting scatter plot matrix, the diagonal grouping of some attribute pairs suggests a high correlation and a predictable relationship.
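
A correlation that shows up as a diagonal grouping in a scatter plot can also be checked numerically. The sketch below computes the Pearson correlation coefficient by hand; the two short lists are illustrative petal measurements we made up in the style of the dataset, not rows taken from it, and `pearson` is our own helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Illustrative values only (assumption): petal length and width in cm.
petal_length = [1.4, 1.5, 4.5, 4.7, 5.1, 6.0]
petal_width = [0.2, 0.2, 1.5, 1.4, 1.9, 2.5]

r = pearson(petal_length, petal_width)
print(f"correlation: {r:.3f}")
```

A value close to 1 means the points lie near a rising straight line, which is what the diagonal grouping in the plot suggests.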

Evaluate Some Algorithms

With the information gathered so far, it is now time to create some models of the data and estimate their accuracy on unseen data. This involves the 3 steps described below:

Create a Validation Dataset

First of all, we need to know whether the model we build is any good. Later, we will use statistical methods to estimate the accuracy of our models on unseen data. We also want a more concrete estimate of the accuracy of the best model, obtained by evaluating it on actual unseen data.

To do this, we hold back some of the data so that the algorithms never get to see it. We then use this data to get a second, independent idea of how accurate the best model actually is. We split the data into two parts: 80% is used to train our models and 20% is held back as a validation dataset. You can do this with the script below:


Script:

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

After running this script, the training data is stored in X_train and Y_train, ready for preparing models, while X_validation and Y_validation hold the data we will use later.
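
To illustrate what such a split does, here is a minimal sketch that shuffles row indices with a fixed seed and cuts them 80/20. It shows the idea only: `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that scikit-learn's `train_test_split` selects for the same seed.

```python
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices and split them into (train, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the same seed always gives the same order
    n_validation = round(n_samples * test_fraction)
    return indices[n_validation:], indices[:n_validation]


train_idx, validation_idx = split_indices(150, 0.20, seed=7)
print(len(train_idx), len(validation_idx))  # 120 30
```

No row appears in both lists, which is what keeps the validation data unseen during training.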


Test Harness

In this part, we use 10-fold cross-validation to estimate accuracy. The training data is split into 10 parts; the model is trained on 9 of them and tested on the remaining 1, and this is repeated until every part has served as the test set once. The settings below are used for this:

Command:


# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
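
The 10-fold procedure described above can be sketched without any library. The generator below yields the train and test row indices for each fold, using consecutive, unshuffled folds for simplicity; `k_fold_indices` is our own illustrative helper, and 120 is the size of the 80% training portion.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1  # spread any remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(120, 10))
for train, test in folds[:2]:
    print(len(train), len(test))  # 108 12
```

Every row ends up in exactly one test fold, so each model is scored on all of the training data without ever being tested on rows it was trained on in that round.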


The particular seed value does not matter; fixing it simply makes the results repeatable. We use the 'accuracy' metric to evaluate all the models. It is the number of correctly predicted instances divided by the total number of instances in the dataset; multiplying this ratio by 100 gives a percentage (for example, 0.95 is 95% accurate).
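
As a small worked example of this definition, the sketch below computes accuracy for four hand-written labels. The label lists are invented for illustration and `accuracy` is our own helper, not the scikit-learn function.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Invented labels for illustration: 3 of the 4 predictions are correct.
actual = ["setosa", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "versicolor", "versicolor", "virginica"]

score = accuracy(actual, predicted)
print(f"{score:.2f} ({score * 100:.0f}%)")  # 0.75 (75%)
```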

Build Models

We do not know in advance which algorithms will work well on this problem or what configurations to use. From the plots, we can see that some of the classes are partially linearly separable in some dimensions. We will evaluate the 6 algorithms listed below.

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbours (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, and SVM) algorithms. We reset the random seed before each run so that every algorithm is evaluated on exactly the same data splits, which makes the results directly comparable. You can build and evaluate the models with the script below:


Script:
 
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True so that random_state takes effect
    # (newer scikit-learn versions reject random_state without it)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

 

Make Predictions

KNN is a simple algorithm, and based on the tests above it was also an accurate one. Now we want to find out how accurate the model is on our validation set. This gives us an independent final check on the accuracy of the best model. Keeping a validation set is worthwhile in case something slipped during training, such as overfitting to the training set or a data leak, since both lead to an overly optimistic result. We can run the KNN model directly on the validation set with the following script:



Script:


# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

The script summarizes the results as a final accuracy score, a confusion matrix, and a classification report. In the output shown below, the accuracy is 0.9, or 90%, and the confusion matrix indicates the three errors that were made. Finally, the classification report breaks the results down for each class by precision, recall, f1-score, and support.


Output:
0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11

      micro avg       0.90      0.90      0.90        30
      macro avg       0.92      0.91      0.91        30
   weighted avg       0.90      0.90      0.90        30
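
To show how the numbers in this output relate to each other, the sketch below recomputes the accuracy and the per-class precision, recall, and support directly from the confusion matrix printed above. It assumes the usual scikit-learn layout, in which rows are the true classes and columns are the predicted classes.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Confusion matrix from the output above: rows = true class, columns = predicted class.
matrix = [
    [7, 0, 0],
    [0, 11, 1],
    [0, 2, 9],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
overall_accuracy = correct / total

per_class = {}
for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in matrix)  # column sum
    actually_label = sum(matrix[i])                     # row sum, i.e. the support
    precision = matrix[i][i] / predicted_as_label
    recall = matrix[i][i] / actually_label
    per_class[label] = (precision, recall, actually_label)
    print(f"{label:<16} precision={precision:.2f} recall={recall:.2f} support={actually_label}")

print(f"accuracy={overall_accuracy:.2f} errors={total - correct}")
```

The three off-diagonal entries (1 + 2) are the three errors, and 27 correct predictions out of 30 gives the accuracy of 0.9.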

 

Summary

Now, let us briefly summarize the entire process. We started by installing Python and the SciPy libraries on our system, and then checked that Python and the installed libraries were working and printed their versions. Next, we prepared to load the data by importing all the modules, functions, and objects used in the process. After that, we loaded the iris dataset, which originates from the UCI Machine Learning Repository, using pandas, either directly from a URL or from a downloaded local copy.

After loading the dataset, we summarized it by looking at it from several angles, which helped us understand its structure in detail. Then came data visualization, which can be done in two ways: univariate plots (to understand single attributes) and multivariate plots (to understand the relationships between two or more attributes). After this, we evaluated algorithms by creating a validation dataset, setting up a test harness, and building several models to choose from. Finally, we made predictions on the validation set with the model best suited to our problem. That is the entire process described in detail above.



 

Conclusion


Machine Learning is undeniably a revolutionary technology whose advances can change the way the world works, so building a career in it is a great idea. Python makes Machine Learning easier to learn and understand than many other approaches. This article has walked through a small demo of Machine Learning in Python; try it out yourself and see whether you want to go further. Choose wisely and learn smartly.


Ravindra Savaram
About The Author

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.

