import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

np.random.seed(0)
torch.manual_seed(0)

Maths for ML

  1. Given a vector $\epsilon$, we can calculate $\sum\epsilon_{i}^{2}$ using $\epsilon^{T} \epsilon$

# By convention, we take epsilon to be a 2D column vector
y = np.array([1.3, 2.5, 6.4, 8.1, 9.0]).reshape(-1, 1)
y_hat = np.array([1.5, 2.0, 5.9, 8.5, 9.0]).reshape(-1, 1)

epsilon = y - y_hat  # residual vector; no abs needed, squaring removes the sign anyway

epsilon_square_sum1 = np.sum(epsilon**2)
epsilon_square_sum2 = (epsilon.T @ epsilon).item()

assert np.allclose(epsilon_square_sum1, epsilon_square_sum2)
  2. $(AB)^{T} = B^{T}A^{T}$

A = np.random.randn(50, 10)
B = np.random.randn(10, 20)

ab_t = (A @ B).T
b_t_a_t = B.T @ A.T
assert np.allclose(ab_t, b_t_a_t)
  3. For a scalar $s$, $s = s^{T}$
  4. Derivative of a scalar $s$ with respect to (yes, I wrote w.r.t. in full here 😁) a vector $\theta$ $$\theta = \begin{bmatrix} \theta_{1} \\ \theta_{2} \\ \vdots \\ \theta_{n} \end{bmatrix}$$ $$\frac{\partial s}{\partial \theta} = \begin{bmatrix} \frac{\partial s}{\partial \theta_{1}} \\ \frac{\partial s}{\partial \theta_{2}} \\ \vdots \\ \frac{\partial s}{\partial \theta_{n}} \end{bmatrix} $$
  5. If $A$ is a matrix, $\theta$ is a vector, and $A\theta$ is a scalar, then $$ \frac{\partial (A \theta)}{\partial \theta} = A^{T} $$

🤔 Drawing an analogy with $a\theta$, where both $a$ and $\theta$ are scalars, my first thought is that the answer would be $A$. But the gradient must have shape $N \times 1$, so $A^{T}$ is my guess before starting any calculations.

N = 20
# A @ theta must be a scalar, so A.shape[0] should be 1
A = torch.randn((1, N))
theta = torch.randn((N, 1), requires_grad=True)
scalar = A @ theta
scalar.backward()
assert torch.allclose(theta.grad, A.T)
  6. Assume $Z$ is a matrix of the form $X^{T}X$, then $$ \frac{\partial (\theta^{T}Z\theta)}{\partial \theta} = 2Z^{T}\theta$$

🤔 Let me again make a guess before any calculation: if $\theta$ and $Z$ were both scalars, the derivative would look like $2Z\theta$. So my guess is $2Z\theta$, which is equal to $2Z^{T}\theta$ since $Z$ is symmetric.

X = torch.randn((N, N))
Z = X.T @ X
theta = torch.randn((N, 1), requires_grad=True)

scalar = theta.T @ Z @ theta
scalar.backward()

assert torch.allclose(theta.grad, 2 * Z.T @ theta)

Let's skip over the content of the Rank topic for now.

The maximum rank possible for an $R \times C$ matrix is $\min(R, C)$.

But an interesting question would be 🤔: what is the minimum rank possible for a matrix? Is it 0? Is it 1?

Ans: The minimum rank is zero, which happens only for the zero matrix.
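A quick numerical check of these rank facts, a minimal sketch using np.linalg.matrix_rank:

# a random R x C matrix almost surely has full rank min(R, C)
R, C = 5, 3
M = np.random.randn(R, C)
assert np.linalg.matrix_rank(M) == min(R, C)

# the zero matrix is the only matrix with rank 0
assert np.linalg.matrix_rank(np.zeros((R, C))) == 0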

Just a parting thought: if I had been a NumPy developer, I would not have allowed np.eye as a method for the identity matrix. Better to use np.identity only. 😞

Introduction to Linear Regression

Considering weight as a linear function of height:

  • $weight_{1} \approx \theta_{0} + \theta_{1} * height_{1}$
  • $weight_{2} \approx \theta_{0} + \theta_{1} * height_{2}$
  • $\vdots$
  • $weight_{N} \approx \theta_{0} + \theta_{1} * height_{N}$

Add an extra column of $1$s to account for the bias term $\theta_{0}$:

$$ W_{N\times1} = X_{N\times2} \, \theta_{2\times1} $$ where the feature matrix $X$, $X = \begin{bmatrix} 1 & height_{1} \\ 1 & height_{2} \\ \vdots & \vdots \\ 1 & height_{N} \end{bmatrix}$

  • $\theta_{0}$, Bias/Intercept term : (the value of $y$, when $x$ is set to zero)
  • $\theta_{1}$, Slope term : (the increase in $y$, when $x$ is increased by 1 unit)

Generalized Linear Regression

  • $N$: Number of training samples
  • $M$: Number of features
$$ \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{N} \\ \end{bmatrix} _{N \times 1} = \begin{bmatrix} 1 & x_{1, 1} & x_{1, 2} & \ldots & x_{1, M} \\ 1 & x_{2, 1} & x_{2, 2} & \ldots & x_{2, M} \\ \vdots & \vdots & \vdots & \ldots & \vdots \\ 1 & x_{N, 1} & x_{N, 2} & \ldots & x_{N, M} \\ \end{bmatrix} _{N \times (M + 1)} \begin{bmatrix} \theta_{0} \\ \theta_{1} \\ \vdots \\ \theta_{M} \end{bmatrix} _{(M + 1)\times 1} $$

$$ \hat{y} = X \theta $$
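A small shape check of $\hat{y} = X\theta$ with a random feature matrix (bias column of $1$s included); a minimal sketch, with names chosen just for illustration:

N, M = 8, 3
X_feat = np.random.randn(N, M)
X = np.hstack([np.ones((N, 1)), X_feat])   # prepend the bias column of 1s
theta = np.random.randn(M + 1, 1)

y_hat = X @ theta
assert X.shape == (N, M + 1) and y_hat.shape == (N, 1)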

Now, the task at hand is to estimate "good" values of $\theta$, which will give a "good" approximation to the actual values. But how do we decide whether one set of values of $\theta$ is "better" than another? We need an evaluation metric here.

Let $\epsilon_{i}$ be $y_{i} - \hat{y}_{i}$, where $\epsilon_{i} \sim \mathcal{N} (0, \sigma^{2})$. That is, we assume each $\epsilon_{i}$ comes from this normal distribution.

We want $|\epsilon_{1}|$, $|\epsilon_{2}|$, $|\epsilon_{3}|$ ... , $|\epsilon_{N}|$ to be small.

So we can try to minimize the L2 norm (squared error) or the L1 norm of $\epsilon$.

weight_height_df = pd.read_csv(
    "https://raw.githubusercontent.com/yadav-sachin/blog/master/_notebooks/assets/2022-02-17-machine-learning-quiz2-practice/weight-height.csv"
)
# take 30 points
sampled_idx = np.random.choice(np.arange(len(weight_height_df)), size=30, replace=False)
weight_height_df = weight_height_df.iloc[sampled_idx][["Height", "Weight"]].sort_values(
    by=["Height"]
)


def plot_func(theta0, theta1):
    x = weight_height_df["Height"]
    y = weight_height_df["Weight"]
    y_hat = theta0 + x * theta1
    fig, ax = plt.subplots(figsize = (10, 8))
    ax.scatter(x, y, label="Actual")
    ax.plot(x, y_hat, label="Pred", linestyle = "--")
    ax.legend()
    ax.set_ylim([50, 400])
    mse_val = np.mean((y - y_hat) ** 2)
    ax.set_title(rf"$\theta_{{0}}$={theta0}, $\theta_{{1}}$={theta1}    MSE val: {mse_val:.3f}")


interact(
    plot_func,
    theta0=widgets.FloatSlider(description="theta0 (bias)", value=-300, min=-1000, max=1000, step=1),
    theta1=widgets.FloatSlider(description="theta1 (slope)", value=7.5, min=-20, max=20, step=0.01),
)

Note: Run the notebook in Colab to view the interactive plot above, where we can manually change the parameters (using sliders) and fit a line through the training points, with Mean Squared Error as the guiding value.

Normal Equation

$$ y = X\theta + \epsilon$$ Objective: minimize $\epsilon^{T} \epsilon$ $$\epsilon^{T} \epsilon = y^{T} y - 2 y^{T}X\theta + \theta^{T}X^{T}X\theta$$ $$\frac{\partial (\epsilon^{T} \epsilon)}{\partial \theta} = -2X^{T}y + 2X^{T}X\theta$$ (we use some of our results from the previous chapter "Maths for ML")

Setting it to zero, $$ \theta^{*} = (X^{T}X)^{-1}X^{T}y$$
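A minimal sketch applying the normal equation to the weight-height sample loaded above (this reuses weight_height_df from the earlier cell); np.linalg.solve is used instead of an explicit inverse:

x = weight_height_df["Height"].to_numpy()
y = weight_height_df["Weight"].to_numpy().reshape(-1, 1)

X = np.column_stack([np.ones_like(x), x])        # bias column + height feature
theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y

print("theta0 (bias):", theta_star[0, 0])
print("theta1 (slope):", theta_star[1, 0])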

Geometric Interpretation of Linear Regression

We have $\hat{y} = X\theta$, where $X$ has shape $(N \times M)$, with $M$ the number of features and $N$ the number of samples.

When we multiply $X$ with the column vector $\theta$, we get a column vector which is a linear combination of the columns of the matrix $X$; the coefficients of the combination are decided by the parameters in $\theta$. So $X\theta$ lies in the span of the columns of $X$.

Our objective is to get a $\theta$ to minimize $\mathopen|| y - X\theta \mathclose||$.

The span of the columns of $X$ is a plane (subspace) in the $N$-dimensional space, and $X\theta$ lies on this plane. So the distance is least when $X\theta - y$ is perpendicular to this plane (the span of the columns of $X$).

Therefore $(y - \hat{y}) \bot (x_{j}) \forall j$ $$X^{T} (y - X\theta^{*}) = 0$$ We get, $$\theta^{*} = (X^{T}X)^{-1}X^{T}y$$
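A quick numerical check of this orthogonality condition on random data, a minimal sketch:

N, M = 50, 4
X = np.random.randn(N, M)
y = np.random.randn(N, 1)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)
residual = y - X @ theta_star

# the residual is (numerically) orthogonal to every column of X
assert np.allclose(X.T @ residual, 0, atol=1e-6)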

Linear Regression

Relation between #samples and #features in Linear Regression

Let $N$ be number of samples and $M$ be the number of variables/features.

Under-determined system: In Linear Algebra, if we have $N$ equations in $M$ variables and $N < M$, the system is called under-determined. In this case, we will typically have infinitely many solutions.

Over-determined system: When $N > M$, the system is over-determined. In this case the sum of absolute residuals $\sum \mathopen| \epsilon_{i} \mathclose|$ is greater than zero in most cases, unless we are able to get a perfect fit.

Variable Transformation

In case the output does not look like a linear combination of the variables, we can also use higher powers of the variables in the linear combination.

We can also use other transformations like logarithms, products of more than one variable, etc. We would still call it "Linear" Regression, as the linearity here is in the coefficient/parameter space ($\theta$).
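As an illustration, a minimal sketch fitting a quadratic with the same normal-equation machinery, just by adding an $x^{2}$ column (the model is still linear in $\theta$); the data here is synthetic:

x = np.linspace(-3, 3, 40)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * np.random.randn(len(x))   # noisy quadratic

# design matrix with transformed features: [1, x, x^2]
X = np.column_stack([np.ones_like(x), x, x**2])
theta_star = np.linalg.solve(X.T @ X, X.T @ y.reshape(-1, 1))

print("estimated theta:", theta_star.ravel())   # should be close to [1.0, 2.0, -0.5]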

Multi-collinearity

There can be situations where $X^{T}X$ is a singular matrix, such as $$ X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 4 \\ 1 & 3 & 6 \end{bmatrix}$$ for the data $$\begin{bmatrix} x_{1} & x_{2} & y \\ 1 & 2 & 4 \\ 2 & 4 & 6 \\ 3 & 6 & 8 \end{bmatrix} $$

In this case, $X^{T}X$ is not invertible, so the normal equation cannot be applied directly (there is no unique solution for $\theta$).

If we still want to use the normal equation, the workarounds are:

  • Regularise: We can add some jitter/noise to the diagonal entries to make the matrix invertible (see the sketch below).
  • Drop Variables: Here $x_{2} = 2 x_{1}$, so we may drop either one of these variables.
  • Using different subsets of the data samples might work.
  • Avoid the dummy variable trap.

This problem also arises due to the dummy variable trap.
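A minimal sketch of the regularisation workaround on the singular example above (adding $\lambda I$ to the diagonal, i.e. a ridge-style fix); the value of $\lambda$ is just an illustrative choice:

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([[4.0], [6.0], [8.0]])

print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X))   # 2 < 3, so X^T X is singular

lam = 1e-3   # small penalty added to the diagonal
theta_ridge = np.linalg.solve(X.T @ X + lam * np.identity(X.shape[1]), X.T @ y)
print("theta with regularisation:", theta_ridge.ravel())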

Dummy Variables

Let's assume we have a categorical variable in a linear regression setup, where air pollution is a function of (#Vehicles, Wind-speed, Wind-direction). As we cannot feed categorical values directly, we need corresponding numerical values.

For example, the wind-direction variable has 4 categories: North, West, East and South. If we go ahead and numerically encode them, North becomes $0$, West becomes $1$, and so on.

The problem here is that we should not impose such a numerical ordering on the categories, as numbers carry comparisons between values. For instance $2 = 2 \times 1$, but is the contribution due to East (encoded $2$) really $2 \times$ that of West (encoded $1$)?

C-1 Variable Encoding: To avoid this problem, we use $C-1$ variable encodings, where $C$ is the number of categories of the categorical variable. The logic is that, to specify the class of a sample, I need to ask at least $C-1$ binary questions.

  • Is it Class $1$?
  • Is it Class $2$?
  • $\vdots$
  • Is it Class $C-1$?

One-hot Variable Encoding: I can ask $C$ binary questions.

  • Is it Class $1$?
  • Is it Class $2$?
  • $\vdots$
  • Is it Class $C - 1$?
  • Is it Class $C$?

But which one is better? The $C^{th}$ class variable is redundant, as we can determine its value from the other $C-1$ class variables.

So One-hot encoding can cause multi-collinearity in Linear Regression, as one of the added variables is redundant and dependent on other variables.
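A minimal sketch with pd.get_dummies showing that full one-hot encoding (plus a bias column) makes $X^{T}X$ singular, while drop_first=True (the $C-1$ encoding) avoids it; the wind-direction data and the design_matrix helper here are made up for illustration:

wind = pd.Series(["North", "West", "East", "South", "North", "East", "West", "South"])

def design_matrix(drop_first):
    dummies = pd.get_dummies(wind, drop_first=drop_first).astype(float)
    return np.column_stack([np.ones(len(wind)), dummies.to_numpy()])   # bias column + encodings

X_onehot = design_matrix(drop_first=False)    # C = 4 dummy columns
X_c_minus_1 = design_matrix(drop_first=True)  # C - 1 = 3 dummy columns

print("one-hot: rank", np.linalg.matrix_rank(X_onehot.T @ X_onehot), "of", X_onehot.shape[1])
print("C-1:     rank", np.linalg.matrix_rank(X_c_minus_1.T @ X_c_minus_1), "of", X_c_minus_1.shape[1])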

Convexity

Convexity over an interval $[\alpha, \beta]$ means that the line segment joining any two points $(a, f(a))$, $(b, f(b))$ lies on or above the function at every point between $a$ and $b$, given $\alpha \leq a, b \leq \beta$.

One of the problems I noticed is that students/people give out more complex definitions of "convexity", which makes things look complicated when they are actually very simple. So let's not discuss other definitions (decreasing distances, Rolle's theorem, etc.) and keep the definition simple.

Prove $f(x) = x^{2}$ is Convex

Let the two chosen points be $(a, f(a))$ and $(b, f(b))$.

Then any point on the line segment joining the points can be described as $$(t \times a + (1 - t) \times b, t\times f(a) + (1 - t) \times f(b))$$

The corresponding point on the curve at the same $x$-coordinate is $$(t \times a + (1 - t) \times b, f(t \times a + (1 - t) \times b))$$ where $ 0 < t < 1$.

The difference in $y$ (line minus curve) is: $$t(1-t)(a - b)^{2}$$ which is greater than or equal to zero. Hence the line segment is always on or above the function at all points between $a$ and $b$, so the function is convex.
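A quick numerical check of this inequality for random $a$, $b$, $t$, a minimal sketch:

f = lambda x: x**2

a, b = np.random.randn(2)
t = np.random.rand()

line_y = t * f(a) + (1 - t) * f(b)   # point on the chord
curve_y = f(t * a + (1 - t) * b)     # point on the curve at the same x
assert line_y - curve_y >= 0         # equals t*(1-t)*(a-b)**2 >= 0
assert np.isclose(line_y - curve_y, t * (1 - t) * (a - b) ** 2)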

Double-Derivative Test

If the double-derivative w.r.t. $x$ is $\geq 0$ everywhere, then the function is convex.

Double-Derivative Test for multi-parameter functions: the analogue of the double derivative is the Hessian matrix. A function $f(x_{1}, x_{2}, x_{3}, \ldots, x_{n})$ is convex iff the Hessian matrix is positive semidefinite for all possible values of $(x_{1}, x_{2}, x_{3}, \ldots, x_{n})$.

$$(Hess f)_{ij} \equiv \frac{\partial^{2} f}{\partial x_{i} \partial x_{j} } $$

The Hessian matrix is of shape $(n \times n)$

Let $f(x_{1}, x_{2}) = x_{1}^{2} + x_{2}^{2}$

$$ Hess(f) = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} $$ This is positive semidefinite, so $x_{1}^{2} + x_{2}^{2}$ is convex function.
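We can have autograd compute this Hessian for us; a minimal sketch using torch.autograd.functional.hessian (available in recent PyTorch versions):

from torch.autograd.functional import hessian

def f(x):
    # f(x1, x2) = x1^2 + x2^2
    return (x ** 2).sum()

x = torch.randn(2)
H = hessian(f, x)
assert torch.allclose(H, 2 * torch.eye(2))   # Hessian is [[2, 0], [0, 2]] everywhere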

Tip: A quick way to check whether a matrix is positive definite (Sylvester's criterion) is to take the determinants of the $1\times1$, $2\times2$, ..., $n\times n$ leading sub-matrices (the top-left corners containing the first pivot) and check that they are all positive. For positive semidefiniteness, the safer check is that all eigenvalues are $\geq 0$.
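A minimal sketch of both checks on the Hessian above:

H = np.array([[2.0, 0.0],
              [0.0, 2.0]])

# leading principal minors (Sylvester's criterion for positive definiteness)
minors = [np.linalg.det(H[:k, :k]) for k in range(1, H.shape[0] + 1)]
print("leading principal minors:", minors)

# eigenvalue check for positive semidefiniteness
assert np.all(np.linalg.eigvalsh(H) >= 0)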

Convexity of Least Squares in Linear Regression

$$f(\theta) = \mathopen|| y - X\theta \mathclose||^{2}$$ This is equal to $y^{T} y - 2 y^{T}X\theta + \theta^{T}X^{T}X\theta$.

Differentiating twice with respect to $\theta$ gives the Hessian $2X^{T}X$, which is a positive semidefinite matrix.

So this is where we connect convexity to Linear Regression.
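A quick numerical check that $X^{T}X$ is indeed positive semidefinite for a random $X$, a minimal sketch:

X = np.random.randn(30, 5)
eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of a symmetric matrix, ascending order
assert np.all(eigvals >= -1e-10)        # all non-negative up to numerical error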

"Spongebob: Excited"

If $f(x)$ and $g(x)$ are convex,

  • $f(x) + g(x)$ is convex.
  • $kf(x)$ is convex, for $k \geq 0$.

A convex function has the nice property that any local minimum is also a global minimum (and for a strictly convex function, the minimum is unique).

Gradient Descent

Contour Plots and Gradients

The gradient is the direction of maximum increase in the function value.

To decrease the value of the function the most, we move in the opposite direction of gradient (negative of gradient vector).

Algorithm

  • Start with some $x_{0}$
  • Till convergence or iterations exhausted
    • $x_{i} = x_{i - 1} - \alpha\frac{\partial f}{\partial x}\Big|_{x = x_{i-1}}$, where $\alpha$ is the step-size or learning-rate
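A minimal sketch of gradient descent on the least-squares objective $\mathopen|| y - X\theta \mathclose||^{2}$, using the gradient $-2X^{T}y + 2X^{T}X\theta$ derived earlier; the synthetic data, learning rate and iteration count are illustrative choices:

N, M = 100, 3
X = np.column_stack([np.ones(N), np.random.randn(N, M - 1)])
true_theta = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_theta + 0.01 * np.random.randn(N, 1)

alpha = 0.001                # step size (kept small since the gradient is not averaged over N)
theta = np.zeros((M, 1))     # start with some theta_0

for _ in range(2000):
    grad = -2 * X.T @ y + 2 * X.T @ X @ theta
    theta = theta - alpha * grad

# gradient descent should land close to the normal-equation solution
theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(theta, theta_normal_eq, atol=1e-3)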

Quiz Day has arrived

As very little time remains before the quiz, I am moving on to the important points that I need to remember.