Linear Regression - Cost Function Intuition

Single Variable

Let hθ(x) be the hypothesis that predicts the value of the target y:

hθ(x) = θ0 + θ1x

Our goal is to choose θ0 and θ1 such that the error between hθ(x) and the actual target y is minimized. Intuitively, the Mean Squared Error comes to mind.

Below is the cost function J(θ), where m is the number of training examples:

J(θ0, θ1) = (1 / 2m) · Σ ( hθ(x^(i)) − y^(i) )², summed over i = 1 … m
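
As a quick illustration (not part of the original notes), this cost can be computed directly in NumPy; the data below is made up for demonstration:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(θ0, θ1) = (1 / 2m) * sum((hθ(x) - y)^2) for the single-variable hypothesis."""
    m = len(y)                          # number of training examples
    predictions = theta0 + theta1 * x   # hθ(x) for every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy usage with made-up data
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))  # 0.0: the line y = 2x fits this data perfectly
```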

Multiple Variables

Applying this to the more general problem with multiple variables:

hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn = θᵀx

where

  • θᵀ is a 1 by (n+1) row vector (so θᵀx is a scalar)
  • x is an (n+1) by 1 column vector
  • x0 = 1

θ = [θ0, θ1, …, θn]ᵀ, x = [x0, x1, …, xn]ᵀ, so hθ(x) = θᵀx
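
A minimal sketch of this vectorized form in NumPy (the values and shapes below are illustrative): stacking the examples as rows of a design matrix X, with x0 = 1 prepended, computes hθ(x) for every example in one matrix product.

```python
import numpy as np

# Design matrix: one row per training example, first column is x0 = 1 (illustrative values).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0],
              [1.0, 6.0, 7.0]])      # shape (m, n+1)
theta = np.array([0.5, 1.0, -2.0])   # shape (n+1,)

predictions = X @ theta              # hθ(x) = θᵀx for every row of X
print(predictions)                   # [-3.5, -5.5, -7.5]
```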

Gradient Descent Intuition

Gradient descent is the method used to find θ such that the cost function J(θ) is minimized. The intuition can be understood using a simplified version of hθ(x).

Assume hθ(x) = θ1x and we have the following 3 training examples:

x      y
9      50
0      -10
-10    -34

J(θ1) = (1 / 2m) · Σ ( θ1·x^(i) − y^(i) )²

For example, if we let θ1 = 2:

J(2) = [ (9·2 − 50)² + (0·2 − (−10))² + ((−10)·2 − (−34))² ] / (2·3) = (1024 + 100 + 196) / 6 = 220

Below is a visualization of θ1 and J(θ1), for θ1 from 0 to 7.8 in increments of 0.2: cost-fun-visual.png
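
As a rough sketch (matplotlib is assumed here, it is not mentioned in the original notes), the same curve can be reproduced from the 3 training examples above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([9.0, 0.0, -10.0])
y = np.array([50.0, -10.0, -34.0])
m = len(y)

def J(theta1):
    # J(θ1) = (1 / 2m) * sum((θ1 * x - y)^2)
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

print(J(2.0))  # 220.0, matching the worked example above

thetas = np.linspace(0.0, 7.8, 40)   # θ1 from 0 to 7.8 in steps of 0.2
plt.plot(thetas, [J(t) for t in thetas])
plt.xlabel("θ1")
plt.ylabel("J(θ1)")
plt.show()
```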

The graph shows that the cost function is convex: a single bowl shape with one global minimum. This is a nice property because, no matter where we initialize the parameters θ, gradient descent with an appropriate learning rate (α) will always converge to the global minimum.

Gradient Descent Algorithm

Reference taken from Andrew Ng's ML course on Coursera. Here is the formal update rule of (batch) gradient descent:

repeat until convergence:
    θj := θj − α · (1 / m) · Σ ( hθ(x^(i)) − y^(i) ) · xj^(i)    (simultaneously for all j = 0 … n)

Key points:

  • x^(i) is the ith training example (row) from the training set
  • xj is the jth feature (column), and x0 = 1
  • α is the learning rate, i.e. how big each step is
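
Below is a minimal NumPy sketch of batch gradient descent following the update rule above; the function name, default α, and iteration count are my own choices, not from the course:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=2000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with x0 = 1 in the first column
    y: (m,) target vector
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y            # hθ(x^(i)) - y^(i) for all i
        gradient = (X.T @ errors) / m     # (1/m) * Σ error_i * xj^(i) for each j
        theta = theta - alpha * gradient  # simultaneous update of every θj
    return theta

# Toy usage on the 3 training examples from the intuition section
X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])
y = np.array([50.0, -10.0, -34.0])
print(gradient_descent(X, y))  # approaches the least-squares fit, roughly [3.46, 4.38]
```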

About α

If α is too small, gradient descent takes a long time and many iterations to converge. If α is too large, gradient descent will overshoot the global minimum and fail to converge. To check whether α is a good choice, one can plot J(θ) as a function of the number of iterations: J(θ) should decrease quickly and eventually flatten out.
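
A small self-contained sketch of that diagnostic (matplotlib assumed; the α values and data are just examples): record J(θ) after each iteration and compare learning rates.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])  # illustrative design matrix
y = np.array([50.0, -10.0, -34.0])
m = len(y)

def cost_history(alpha, num_iters=100):
    """Run batch gradient descent and record J(θ) at every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return history

# A well-chosen α makes J(θ) drop quickly and flatten; a tiny α converges slowly,
# and an overly large α would make the curve grow instead of shrink.
for alpha in (0.001, 0.01):
    plt.plot(cost_history(alpha), label=f"α = {alpha}")
plt.xlabel("iteration")
plt.ylabel("J(θ)")
plt.legend()
plt.show()
```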

Normal Equation

Instead of taking many iterations, the normal equation solves for θ in a single step: θ = (XᵀX)⁻¹Xᵀy

  • X is the m by (n+1) design matrix, and (XᵀX)⁻¹ is the inverse of the matrix XᵀX
  • y is the target vector
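
A one-step sketch in NumPy (data illustrative); np.linalg.solve is used here instead of forming the inverse explicitly, which is numerically safer but yields the same θ:

```python
import numpy as np

X = np.array([[1.0, 9.0], [1.0, 0.0], [1.0, -10.0]])  # design matrix with x0 = 1
y = np.array([50.0, -10.0, -34.0])

# θ = (XᵀX)⁻¹ Xᵀ y, solved as the linear system (XᵀX) θ = Xᵀ y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # the least-squares fit in a single step, roughly [3.46, 4.38]
```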

Gradient Descent vs Normal Equation

The normal equation does not require any iterations, nor does it require choosing the learning rate α.

However, the cost of the normal equation is O(n³), where n is the number of features, while gradient descent is O(kn²). Thus, when the data set has a huge number of features, e.g. 10,000, it is too costly to solve with the normal equation; gradient descent can still solve the problem in a reasonable amount of time.

Factors That Speed Up Gradient Descent

Gradient descent works best when all features are in a similar range:

  • between [-1, 1] or,
  • between [-0.5, 0.5]

Two techniques can be used:

  • feature scaling
  • mean normalization

xi := (xi − μi) / si

where μi is the average of feature i, and si is its range (max − min) or its standard deviation.
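
A minimal sketch of mean normalization applied per feature column (the feature values are made up, e.g. house size vs. number of rooms):

```python
import numpy as np

def mean_normalize(features):
    """Rescale each column to roughly [-0.5, 0.5]: (x - μ) / s, with s = max - min."""
    mu = features.mean(axis=0)                        # μi: per-feature average
    s = features.max(axis=0) - features.min(axis=0)   # si: per-feature range (std also works)
    return (features - mu) / s

features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 4.0],
                     [1416.0, 2.0]])
print(mean_normalize(features))   # every column now lies within roughly [-0.5, 0.5]
```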