Machine Learning – Gradient Descent


What is Machine Learrning

I decided to prepare and discuss about machine learning algorithms in different series which is valuable and can be unique throughout the internet. I calim that there is rare resource which is SIMPLE and COMPLETE in machine learning.


I want to explain how algorithms I machine learning are working by going through low level explanation instead of just have a short glance on high level. I believe this way can widen our perspective and prepare us to apply them correctly. Because having high level and superficial theory about nice algorithm lead us nowhere, in return if we know what is happening under the layer of these algorithm can help us to use them more properly.

For have a better understanding about machine learning I prefer to talk briefly about the whole story off machine learning.

Machine learning is a branch of computer science which has been extended from pattern recognition and artificial intelligence. Its meaning comes from where computer can learn how to predict phenomena by the aid of training data which are sample data have been occurred in the past and machine tries to anticipate what is the result of an event according to having specific conditions. This learning process occurs in two different ways. One is supervised and another is unsupervised.

Supervised learning is to teach alphabetic to kid under the monitor of his teacher and unsupervised learning is such as kid who tries to walk alone after trial and errors he will understand how to walk correctly. In a better instance, in supervised we tend to classify objects but in unsupervised we actually do not know what we are looking for, we just study and watch some objects and we try to cluster them according their similarity.


Classification – Supervised: Assume a father show a dog to his son and then if son sees another race of dog can detect that it is dog but from another kind of dog. The goal was clear; son should see some dogs and his father told him that “it is dog” then he should detect dog –even from other color or race or size- in future.


Clustering – Unsupervised: But now story is different, a father brings his son to the zoo and show him different animals without telling him that what is their name. So his son just can categorize in his brain the various animals according to their color, size and etc. If I future he sees animal which is similar to one of those animal, he can say that “this animal –which does not know its name- is similar to those animals that I have seen in the zoo on that cage.”  So the goal is here that son should cluster new animal to another animal that he has seen by now.

Supervised has f(X) = y and there is value or each features while in unsupervised there is just (X) without knowing its f(X).

Result: as a result, finally we encounter two different example of problem to solve:

Supervised – Classification: According past data which there are specific output value- target or dependent variable which is called “Y” for one or more than one features or independent variable which is called “X” or “X1, X2, …., Xn”. For example, there is data set of patient information and we want to find whether patient will get cancer or not; or what is price of some object in near futures according to their changing price in the past.

Linear Regression

In some problem we have no limitation for “Y” value such as price prediction, but in others such as when we want to know what is the probability of cancer in one patient, we have to give result between zero and one.


Y = aX + b

Y is target value, dependent, output

a is regression coefficient, slope

X is independent value, feature, input

b is constant value

I decided to make everything as KISS = Keep It Simple Stupid

Therefore I designed small data set by myself and we can study on it easily and then you can go to my github and study according to real tutorial data set from machine learning course at university of Stanford by Andrew NG. Click HERE for more info about this course. 

This data set is based on relation between study hours of students and their achieved grade. We draw a graph according to above data and assign study hours to horizontal axis and grade to vertical axis. You see I student study more they can gain better grade. So after looking at this graph, we want to predict if one student study 8 hours how much would be his grade?

Gradient Descent

We need a line as blue line to determine progress of changing grade based on study hours. Y and X are known by data set value, we just need a, b value to draw a blue line or “Prediction line”. No matter how much we are precise, it is sometimes impossible or difficult to draw a prediction line without error. Error rate in machine learning is inevitable; but we should find best a, b and minimize error rate to have an accurate output.

The vertical distance between actual value which is black star and predicted value on blue line which is yellow star is called “Prediction Error” or “Cost”. In order to calculate prediction error; we have to minus Prediction Error = Yactual – Ypredicted  Yprediction = aX + b

Because we have more than one record in data set, we need to generalize above formula to:

Sum of Squared Errors (SSE) = ½ Sum (YactualYpredicted)2

But we need to minimize SSE as far as possible, we learnt in high school mathematic that we should derivate formula to make it optimum. Because we have two variable so we need two differentiations. Because we are working on a, b, so we should make partial differentiations. In partial differentiation based on “a”, others variable such as “X”, “Y” and “b” in SSE should be kept as constant.

Before study deeply on above data set we need to standardize – normalize or scaling value in order to have equal range and/or variance. Because study hours are between (2,7) but grade is between (50,70), so there is variance in their scale makes some difficulty to compare them.

First step is that compute “cost function” based on Ө = [a, b], these “a” and “b” are values which we have selected randomly.

  1. Select Ө = [a, b], a is slope and b is intercept randomly.
  2. Compute “Cost Function”.
  3. Compute “Gradient Descent” for Ө = [a, b].
    1. anew = aold – r*∑ ∂SSE/∂a  r is learning rate
      • ∂SSE/∂a = -(Y-Yp)X
    2. bnew = bold – r*∑ ∂SSE/∂b
      •  ∂SSE/∂b = -(Y-Yp)
  4. Again Compute “Cost Function” Cost Function
  5. Compare if new Cost Function value is less than before; if “Yes” so you are in right direction, lets continue.
  6. Repeat and iterate step 3 to 5 until find best value.

** r is learning rate is the pace of learnning, cost function should be decreased in every iteration and get convergence . If cost function in each repeat with different a, b is decreased, so we selected suitable r.

Using the code MATLAB

I used MATLAB R2012a (

1. Start to call Function”executegd.m”;

Firstly there is “executegd.m”, I called all of function from here. In first part I loaded data frrom data.txt which is very simple text. Then I standardize data with mentioned formula above. Then I draw points according new coordinates.  Then I assume a=0, b=0 and computed “Cost Function”, then I started to calculate best a, b by the aid of “Gradient Descent” with learninng rater r=0.01 and 1500 iteration.

clear ; close all; clc

%% ======================= Part 1: Load Data ==============================
% Load Data
fprintf('Plotting Data ...\n')
data = load('data.txt');

%% ======================= Part 2: Standardize Data =======================
% Standardize Data
data = standardizeData(data);
X = data(:, 1);
y = data(:, 2);
m = length(y); % number of training examples

%% ======================= Part 3: Plotting ===============================
% Plot Data
plotData(X, y);

fprintf('Program paused. Press enter to continue.\n');

%% =================== Part 4: Prediction Error - Copmute Cost ============
fprintf('Running Gradient Descent ...\n')

X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1); % initialize fitting parameters

% Some gradient descent settings
iterations = 1500;
alpha = 0.01;

% compute and display initial cost
predictionError(X, y, theta)

%% =================== Part 5: Gradient descent ===========================
% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

% print theta to screen
fprintf('Theta found by gradient descent: ');
fprintf('%f %f \n', theta(1), theta(2));

% Plot the linear fit
hold on; % keep previous plot visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % don't overlay any more plots on this figure

2. Standardize “standardizeData.m”

function [stdata] = standardizeData(data)
%Standardize Data

X = data(:, 1);
Y = data(:, 2);
m = length(Y); % number of training examples
stdata = zeros(m, 2); % initialize fitting parameters

% StdDev = SQRT( sum[(X-mean)^2/(n-1)] )
meanX = mean(X);
stdX = std(X);

for i = 1:m
 X(i) = ((X(i) - meanX)/stdX);

%Standardize(X) = X-mean /Std(X)

meanY = mean(Y);
stdY = std(Y);

for i = 1:m
 Y(i) = ((Y(i) - meanY)/stdY); 

 stdata(:, 1)= X(:);
 stdata(:, 2)=Y(:);

3. Plot “plotData.m”

function plotData(x, y)

figure; % open a new figure window



4. Compute “Cost Function” “predictionError.m”

function J = predictionError(X, y, theta)
%COMPUTECOST Compute cost for linear regression
% J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
% parameter for linear regression to fit the data points in X and y

% Initialize some useful values

m = length(y); % number of training examples
%theta = zeros(2, 1); % initialize fitting parameters

% You need to return the following variables correctly

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
% You should set J to the cost.

J=1/(2*m)*sum((X*theta - y).^2);

% =========================================================================


5.  Compute theta(a, b) which build best predictio line by Gradient Descent, “gradientDescent.m”

function [theta] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
% theta = GRADIENTDESENT(X, y, theta, alpha, num_iters) updates theta by
% taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
%m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

m = length(y); % number of training examples

for iter = 1:num_iters

 if iter == 1
 J_history(iter) = predictionError(X, y, theta);
 elseif iter > 1

 DefSSEb = sum(B);
 DefSSEa = sum(C);


 theta(1,1) = (bold - (alpha*(DefSSEb/m)));
 theta(2,1) = (aold - (alpha*(DefSSEa/m)));

 J_history(iter) = predictionError(X, y, theta);

theta = theta(:);

Points of Interest

I found Machine Learning very exciting, I decided to work on it.

Gradient Descent is first and foremost step to learn machine learning. As summarize Machine learning is “getting data and work on data then give back result which is called its prediction”

Therefore, error will occur 100%, the goal is using machine for prediction –because of huge and big data, machine can do it faster- but by observing error and try to select better prediction by gradient descent.

Gradient Descent story will not finish here in linear regression (with just one X), you will encounter it in multi variable and neural network.  


Feel free to leave any feedback on this article; it is a pleasure to see your opinions and vote about this code. If you have any questions, please do not hesitate to ask me here.