Linear Regression - Cost Function Intuition

Single Variable

Let hθ(x) be the hypothesis that predicts the value of the target y:

hθ(x) = θ0 + θ1x

Our goal is to choose θ0 and θ1 such that the error between hθ(x) and the actual target y is minimized. Intuitively, the Mean Squared Error comes to mind.

Below is the cost function J(θ), where m is the number of training examples:

J(θ0, θ1) = (1 / 2m) · Σ ( hθ(x^(i)) − y^(i) )², summed over i = 1 … m
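
As a quick illustration (not part of the original notes), this cost can be computed directly in NumPy; the data below is made up for demonstration:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(θ0, θ1) = (1 / 2m) * sum((hθ(x) - y)^2) for the single-variable hypothesis."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # hθ(x) for every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy usage with made-up data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))  # 0.0: the line y = 2x fits this data perfectly
```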

Multiple Variables

Applying this to the more general problem with multiple variables:

hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn = θᵀx

where

  • θᵀ is a 1 by (n+1) row vector (so θᵀx is a scalar)
  • x is an (n+1) by 1 column vector
  • x0 = 1

θ = [θ0, θ1, …, θn]ᵀ, x = [x0, x1, …, xn]ᵀ, so hθ(x) = θᵀx
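
A minimal sketch of this vectorized form in NumPy (the values and shapes below are illustrative): stacking the examples as rows of a design matrix X, with x0 = 1 prepended, computes hθ(x) for every example in one matrix product.

```python
import numpy as np

# Design matrix: one row per training example, first column is x0 = 1 (illustrative values).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])      # shape (m, n+1)
theta = np.array([0.5, 1.0, -2.0])   # shape (n+1,)

predictions = X @ theta              # hθ(x) = θᵀx for every row of X
print(predictions)                   # [-3.5, -5.5, -7.5]
```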

Gradient Descent Intuition

Gradient descent is the method used to find θ such that the cost function J(θ) is minimized. The intuition can be understood using a simplified version of hθ(x).

Assume hθ(x) = θ1x and we have the following 3 training examples:

x      y
9      50
0      -10
-10    -34

J(θ1) = (1 / 2m) · Σ ( θ1·x^(i) − y^(i) )²

For example, if we let θ1 = 2:

J(2) = [ (9·2 − 50)² + (0·2 − (−10))² + ((−10)·2 − (−34))² ] / (2·3) = (1024 + 100 + 196) / 6 = 220

Below is a visualization of θ1 and J(θ1), for θ1 from 0 to 7.8 in increments of 0.2: cost-fun-visual.png
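
As a rough sketch (matplotlib is assumed here, it is not mentioned in the original notes), the same curve can be reproduced from the 3 training examples above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([9.0, 0.0, -10.0])
y = np.array([50.0, -10.0, -34.0])
m = len(y)

def J(theta1):
    # J(θ1) = (1 / 2m) * sum((θ1 * x - y)^2)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(2.0))  # 220.0, matching the worked example above

thetas = np.linspace(0.0, 7.8, 40)   # θ1 from 0 to 7.8 in steps of 0.2
plt.plot(thetas, [J(t) for t in thetas])
plt.xlabel("θ1")
plt.ylabel("J(θ1)")
plt.show()
```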

The graph shows that the cost function is convex: a single bowl shape with one global minimum. This is a nice property because, no matter where we initialize the parameters θ, gradient descent with an appropriate learning rate (α) will always converge to the global minimum.

Gradient Descent Algorithm

Reference taken from Andrew Ng's ML course on Coursera. Here is the formal update rule of (batch) gradient descent:

repeat until convergence:
    θj := θj − α · (1 / m) · Σ ( hθ(x^(i)) − y^(i) ) · xj^(i)    (simultaneously for all j = 0 … n)

Key points:

  • x^(i) is the ith training example (row) from the training set
  • xj is the jth feature (column), and x0 = 1
  • α is the learning rate, i.e. how big each step is
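
Below is a minimal NumPy sketch of batch gradient descent following the update rule above; the function name, default α, and iteration count are my own choices, not from the course:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=2000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with x0 = 1 in the first column
    y: (m,) target vector
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y            # hθ(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m     # (1/m) * Σ error_i * xj^(i) for each j
        theta = theta - alpha * gradient  # simultaneous update of every θj
    return theta

# Toy usage on the 3 training examples from the intuition section
X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])
y = np.array([50.0, -10.0, -34.0])
print(gradient_descent(X, y))  # approaches the least-squares fit, roughly [3.46, 4.38]
```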

About α

If α is too small, gradient descent takes a long time and many iterations to converge. If α is too large, gradient descent will overshoot the global minimum and fail to converge. To check whether α is a good choice, one can plot J(θ) as a function of the number of iterations: J(θ) should decrease quickly and eventually flatten out.
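
A small self-contained sketch of that diagnostic (matplotlib assumed; the α values and data are just examples): record J(θ) after each iteration and compare learning rates.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])  # illustrative design matrix
y = np.array([50.0, -10.0, -34.0])
m = len(y)

def cost_history(alpha, num_iters=100):
    """Run batch gradient descent and record J(θ) at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return history

# A well-chosen α makes J(θ) drop quickly and flatten; a tiny α converges slowly,
# and an overly large α would make the curve grow instead of shrink.
for alpha in (0.001, 0.01):
    plt.plot(cost_history(alpha), label=f"α = {alpha}")
plt.xlabel("iteration")
plt.ylabel("J(θ)")
plt.legend()
plt.show()
```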

Normal Equation

Instead of taking many iterations, the normal equation solves for θ in a single step: θ = (XᵀX)⁻¹Xᵀy

  • X is the m by (n+1) design matrix, and (XᵀX)⁻¹ is the inverse of the matrix XᵀX
  • y is the target vector
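
A one-step sketch in NumPy (data illustrative); np.linalg.solve is used here instead of forming the inverse explicitly, which is numerically safer but yields the same θ:

```python
import numpy as np

X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])  # design matrix with x0 = 1
y = np.array([50.0, -10.0, -34.0])

# θ = (XᵀX)⁻¹ Xᵀ y, solved as the linear system (XᵀX) θ = Xᵀ y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # the least-squares fit in a single step, roughly [3.46, 4.38]
```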

Gradient Descent vs Normal Equation

The normal equation does not require any iterations, nor does it require choosing the learning rate α.

However, the cost of the normal equation is O(n³), where n is the number of features, while gradient descent is O(kn²). Thus, when the data set has a huge number of features, e.g. 10,000, it is too costly to solve with the normal equation; gradient descent can still solve the problem in a reasonable amount of time.

Factors That Speed Up Gradient Descent

Gradient descent works best when all features are in a similar range:

  • between [-1, 1] or,
  • between [-0.5, 0.5]

Two techniques can be used:

  • feature scaling
  • mean normalization

xi := (xi − μi) / si

where μi is the average of feature i, and si is its range (max − min) or its standard deviation.
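
A minimal sketch of mean normalization applied per feature column (the feature values are made up, e.g. house size vs. number of rooms):

```python
import numpy as np

def mean_normalize(features):
    """Rescale each column to roughly [-0.5, 0.5]: (x - μ) / s, with s = max - min."""
    mu = features.mean(axis=0)                        # μi: per-feature average
    s = features.max(axis=0) - features.min(axis=0)   # si: per-feature range (std also works)
    return (features - mu) / s

features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 4.0],
                     [1416.0, 2.0]])
print(mean_normalize(features))   # every column now lies within roughly [-0.5, 0.5]
```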