Linear Regression
Václav Šmídl
March 3, 2020
Linear regression and OLS
Fit by a linear function:

y_1 = a x_1 + b + e_1,
y_2 = a x_2 + b + e_2,
...

In matrix notation, with θ = [a, b]^T:

y = Xθ + e.

Minimize Σ_i e_i^2 = e^T e = ||y − Xθ||_2^2:

d(e^T e)/dθ = 0
d/dθ ((y − Xθ)^T (y − Xθ)) = 0
d/dθ (y^T y − θ^T X^T y − y^T Xθ + θ^T X^T Xθ) = 0
−X^T y + X^T Xθ = 0
[Figure: scatter plot of the data (x, y).]
Solution:
θ = (X^T X)^{-1} X^T y.
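As a sketch, the closed-form OLS solution above can be computed directly with NumPy; the data here are synthetic, generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.2 * x + 0.5 + rng.normal(0, 0.3, size=50)  # y = a*x + b + e

# Design matrix X with columns [x, 1], so that theta = [a, b]^T
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
a_hat, b_hat = theta
print(a_hat, b_hat)  # close to the true a = 1.2, b = 0.5
```

In practice `np.linalg.lstsq(X, y)` is usually preferred over forming X^T X explicitly, for numerical reasons discussed below.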
Example: The European Tracer Experiment (ETEX)
- Release of 340 kg of PMCH over 12 hours
- Direct modeling: with known meteorology and release rates, predict concentrations
- Inverse problem: find the release profile, i.e. how much was released and when

Least squares solution:
1. Define a time window of possible releases: t ∈ [t_1, ..., t_n], with magnitudes θ_i.
2. Compute the concentrations c_{j,i} at all measurement locations j ∈ [1, ..., m] for a release of 1 unit of tracer at time t_i, for all i.
3. Solve the linear problem y = Cθ, where y = [y_1, ..., y_m]^T is the vector of measurements, θ = [θ_1, ..., θ_n]^T is the vector of unknowns, and C = (c_{j,i}) is the m×n matrix of unit-release concentrations.

This fails: the problem is poorly conditioned.
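A small synthetic stand-in (not the real ETEX data) illustrates why the least squares step fails: releases at neighbouring times produce nearly identical concentration footprints, so the columns of the source-receptor matrix are almost collinear and the normal equations are badly conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)

# m measurements, n unknown release magnitudes; columns differ only slightly
m, n = 20, 8
base = rng.uniform(0, 1, size=(m, 1))
X = base + 1e-6 * rng.normal(size=(m, n))      # nearly collinear columns
theta_true = np.abs(rng.normal(size=n))
y = X @ theta_true + 1e-3 * rng.normal(size=m)

print(np.linalg.cond(X.T @ X))                 # enormous condition number

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(theta_ols))               # typically far larger than ||theta_true||
```

Small measurement noise is amplified along the near-null directions of X, which is exactly the instability that ridge regression addresses next.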
Ridge regression
Ordinary least squares,

θ = (X^T X)^{-1} X^T y,

is problematic for numerical stability: as min eig(X^T X) → 0, ||θ|| → ∞. Remedies:
- Replace the inverse by a pseudo-inverse (thresholding small eigenvalues),
- Increase the eigenvalues of X^T X:

  θ = (X^T X + αI)^{-1} X^T y,   (1)

  so that the minimal eigenvalue is at least α.
- Add a penalization for large values:

  θ = arg min_θ (||y − Xθ||_2^2 + α||θ||_2^2),

  where α is a suitably chosen coefficient. This yields (1).
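A minimal NumPy sketch of the ridge estimate (1); the data are synthetic, and the checks illustrate two standard properties: for α → 0 ridge approaches OLS, and larger α shrinks the coefficient norm:

```python
import numpy as np

def ridge(X, y, alpha):
    """Ridge estimate theta = (X^T X + alpha I)^{-1} X^T y, eq. (1)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge(X, y, 1e-10), theta_ols))                          # True
print(np.linalg.norm(ridge(X, y, 100.0)) < np.linalg.norm(theta_ols))      # True: shrinkage
```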
Ridge regression – selection of α
Selection of α:
1. Cross-validation (analytical solution for α)
2. L-curve: a plot of ||θ||_2^2 versus ||y − Xθ||_2^2
3. Joint estimation of α and θ
   - Bayes: setting of the tuning parameter

[Figure: L-curve for the ETEX ERA-40 data.]

Probability model:

p(y, θ|X, α) = p(y|θ, X, ω) p(θ|α) = N(Xθ, ωI) N(0, α^{-1}I)
             ∝ exp{ −(1/(2ω)) ||y − Xθ||_2^2 − (α/2) ||θ||_2^2 }
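The L-curve can be traced numerically by sweeping α and recording the two norms; a sketch on synthetic ill-conditioned data (the grid α = exp(−10:2) follows the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned synthetic problem (stand-in for ETEX-like data)
X = rng.normal(size=(40, 8))
X[:, 1] = X[:, 0] + 1e-4 * rng.normal(size=40)     # nearly collinear pair
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=40)

alphas = np.exp(np.arange(-10, 3))                  # alpha = exp(-10:2)
curve = []
for a in alphas:
    th = np.linalg.solve(X.T @ X + a * np.eye(8), X.T @ y)
    curve.append((np.sum((y - X @ th) ** 2), np.sum(th ** 2)))

# L-curve: residual ||y - X theta||^2 grows with alpha,
# while the solution size ||theta||^2 shrinks.
res, size = zip(*curve)
```

One then picks the α at the "corner" of the curve, where both quantities are moderate.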
Ridge regression – selection of α
Introduce a prior

p(α) = G(δ_0, γ_0),

and compute either

p(α|y, X)

or

p(θ|X, y, ω) = ∫ p(θ, α|X, y, ω) dα.

[Graphical model: hyperparameters γ, δ → α → θ → y, with X and ω as additional parents of y.]
Bayesian estimation of α
Joint likelihood:

p(y, θ, α|X) = p(y|θ, X, ω) p(θ|α) p(α) = N(Xθ, ωI) N(0, α^{-1}I) G(δ, γ),
p(θ, α|y, X) ∝ α^{δ + d/2 − 1} exp{ −(1/(2ω)) ||y − Xθ||_2^2 − (α/2) ||θ||_2^2 − γα }

Conditional for θ:

p(θ|α, y, X) = N(θ̂, (X^T X + αI)^{-1}),

Conditional for α:

p(α|θ, y, X) = G(δ_0 + d/2, γ_0 + ½||θ||_2^2).

Solution: EM, VB, Gibbs sampling, ...
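A minimal EM-style sketch alternates the two conditionals above (assuming ω = 1 and plugging in the posterior mean of the Gamma conditional for α; the data are synthetic):

```python
import numpy as np

def em_ridge(X, y, delta0=1e-8, gamma0=1e-8, iters=50):
    """Alternate the theta-step (conditional mean of theta) and the
    alpha-step (mean of G(delta0 + d/2, gamma0 + ||theta||^2 / 2))."""
    n, d = X.shape
    alpha = 1.0
    for _ in range(iters):
        theta = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
        # E[alpha | theta] = shape / rate for the Gamma conditional
        alpha = (delta0 + d / 2) / (gamma0 + 0.5 * theta @ theta)
    return theta, alpha

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
theta_true = np.array([2.0, 0.0, -1.0, 0.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

theta_hat, alpha_hat = em_ridge(X, y)
```

The iteration selects α automatically instead of tuning it by hand; VB and Gibbs sampling replace the point estimate of α by a distribution.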
Ridge regression – polynomial case
Define a degenerate case:
- true model: y = x^2 + e
- observations at x = [1, 2]
- fit a polynomial of 5th order
- use αI to restore the rank of the covariance matrix

[Figures: data generated from y = 0 + 0x + 1x^2 + 0.05e with polynomial fits; L-curve of ||θ|| versus ||y − Xθ|| for α = exp(−10:2); estimated coefficients of 1, x, x^2, x^3, x^4, x^5 as functions of α.]
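The degenerate setup can be reproduced in a few lines: with only 2 observations and 6 unknowns, X^T X is rank 2 and has no inverse, while adding αI makes the system solvable (noiseless data here, for simplicity):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = x ** 2                                # true model y = x^2

# 5th-order polynomial design matrix: columns 1, x, x^2, ..., x^5
X = np.vander(x, N=6, increasing=True)    # shape (2, 6), rank 2

print(np.linalg.matrix_rank(X.T @ X))     # 2: the OLS inverse does not exist

alpha = 1e-6
theta = np.linalg.solve(X.T @ X + alpha * np.eye(6), X.T @ y)
print(np.allclose(X @ theta, y, atol=1e-3))  # True: fits both data points
```

Among the infinitely many interpolating polynomials, ridge picks a small-norm one; it does not recover the true coefficients [0, 0, 1, 0, 0, 0].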
Sparse Linear Regression
The prior has a peak at zero and a heavy tail.

Spike-and-slab prior:

p(θ) = λ N(0, σ_0) + (1 − λ) N(0, σ_1),

Laplace prior:

p(θ) = (2b)^{-1} exp( −(1/(2b)) |θ| ),

with joint likelihood (LASSO):

p(y, θ|X, b) ∝ exp( −½ ||y − Xθ||_2^2 − (1/(2b)) ||θ||_1 ),

Horseshoe prior: p(θ) = N(0, λ), p(λ) = Cauchy(0, τ), p(τ) = Cauchy(0, 1).

ARD prior:

p(θ) = ∫ p(θ|α) p(α) dα = St(0, σ, ν),

with two possible hidden-variable formulations.
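The LASSO objective ½||y − Xθ||² + λ||θ||₁ has no closed form, but a short proximal-gradient (ISTA) sketch shows how the ℓ1 penalty drives coefficients exactly to zero; here λ plays the role of 1/(2b), and the data are synthetic with a sparse ground truth:

```python
import numpy as np

def ista(X, y, lam, iters=500):
    """Proximal gradient for 0.5 * ||y - X theta||^2 + lam * ||theta||_1."""
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ theta - y)           # gradient of the quadratic part
        z = theta - g / L
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return theta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[[2, 7]] = [3.0, -2.0]            # sparse ground truth
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta_hat = ista(X, y, lam=5.0)
print(np.nonzero(np.abs(theta_hat) > 1e-6)[0])  # mostly just indices 2 and 7
```

Unlike ridge, which only shrinks coefficients, the soft-threshold step sets small coefficients to exactly zero.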
Automatic relevance determination (ARD)
Probability model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, I) N(0, diag[α_1, ..., α_p]^{-1})
             ∝ exp{ −½ ||y − Xθ||_2^2 − ½ Σ_i α_i θ_i^2 }

Introduce priors

p(α_i) = G(δ_0, γ_0),  p(α) = Π_i p(α_i),

and compute

p(α|y, X)

or

p(θ|X, y) = ∫ p(θ, α|X, y) dα.

[Graphical model: γ_0, δ_0 → α_i → θ_i → y, with X and ω as additional parents of y; plate over i = 1..p.]
Variational Bayes for automatic relevance determination
Probability model:

p(y, θ, α|X) = N(Xθ, I) N(0, diag[α_1, ..., α_p]^{-1}) Π_i G(δ_0, γ_0)

Posterior factors:

p(α_i|y, X) = G(δ_0 + ½, γ_i),  γ_i = γ_0 + ½⟨θ_i^2⟩,
p(θ|y, X) = N(θ̂, Σ_θ),
θ̂ = (X^T X + diag⟨α⟩)^{-1} X^T y,
Σ_θ = (X^T X + diag⟨α⟩)^{-1}.

This is an iterated least squares scheme.
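A direct sketch of the iterated least squares scheme above, applied to the degenerate polynomial example (2 points, 6 basis functions); the update formulas follow the posterior factors, with ⟨θ_i²⟩ = θ̂_i² + (Σ_θ)_{ii}:

```python
import numpy as np

def vb_ard(X, y, delta0=1e-8, gamma0=1e-8, iters=100):
    """Variational Bayes ARD: alternate q(theta) and q(alpha_i) updates."""
    n, p = X.shape
    E_alpha = np.ones(p)                        # <alpha_i>, initial guess
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(iters):
        Sigma = np.linalg.inv(XtX + np.diag(E_alpha))
        theta = Sigma @ Xty
        E_theta2 = theta ** 2 + np.diag(Sigma)  # <theta_i^2>
        E_alpha = (delta0 + 0.5) / (gamma0 + 0.5 * E_theta2)
    return theta, E_alpha

# Degenerate polynomial example: 2 points, 6 basis functions
x = np.array([1.0, 2.0])
X = np.vander(x, N=6, increasing=True)
y = x ** 2

theta, E_alpha = vb_ard(X, y)
```

A large ⟨α_i⟩ after convergence means the i-th basis function has been pruned, which is how ARD induces sparsity.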
Variational Bayes for automatic relevance determination
[Figure: estimated coefficients θ̂ of 1, x, x^2, x^3, x^4, x^5 over 500 VB iterations.]
Revisiting toy example:

p(x_i|m) = N(m, 1),  i = 1, 2,
p(m|φ) = N(0, φ),
p(φ) = iG(γ_0, δ_0)

[Graphical model: φ → m → x_i (i = 1, 2), with µ = 0, s = 1 and hyperparameters α ≈ 0, β ≈ 0 on φ.]

Solution using:
- Variational Bayes
- Numerical evaluation on a grid
Revisiting toy example:
[Figures: joint posterior p(m, φ|x_1, x_2) with hyperparameters γ = δ = 1e-08, for data (x_1, x_2) = (10, 12), (3, 2.7), (2, 2), and (3, 0); each panel compares the exact marginals p(φ|x_1, x_2) and p(m|x_1, x_2) with their VB approximations q(φ|x_1, x_2) and q(m|x_1, x_2).]
Outliers
[Figures: data generated from y = 0 + 0x + 1x^2 + 0.05e with an outlier; OLS fits of 5th and 2nd order polynomials.]

- How to minimize the effect of an outlier?
- Outlier detection, robust statistics, etc.
- Hierarchical model?
Outliers – hierarchical model
Probability model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, β^{-1}I) N(0, α^{-1}I).

Is the variance of the noise homogeneous? New model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, diag[β_1, ..., β_n]^{-1}) N(0, α^{-1}I).

Prior:

p(β_i) = G(δ, γ),  p(β) = Π_i p(β_i).
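A hedged sketch of a VB scheme for this model (fixed prior precision α on θ, per-observation precisions β_i on the noise): observations with large residuals receive small ⟨β_i⟩ and are automatically down-weighted. The data mimic the quadratic example from the slides, with one gross outlier:

```python
import numpy as np

def vb_robust(X, y, alpha=1e-6, delta0=1e-2, gamma0=1e-2, iters=50):
    """VB for the noise model N(X theta, diag(beta)^-1): observations with
    large residuals get small <beta_i>, i.e. are down-weighted."""
    n, p = X.shape
    E_beta = np.ones(n)
    for _ in range(iters):
        # Weighted ridge step with weights <beta_i>
        W = np.diag(E_beta)
        Sigma = np.linalg.inv(X.T @ W @ X + alpha * np.eye(p))
        theta = Sigma @ X.T @ W @ y
        # Gamma conditional for beta_i, using <(y_i - x_i theta)^2>
        resid2 = (y - X @ theta) ** 2 + np.einsum('ij,jk,ik->i', X, Sigma, X)
        E_beta = (delta0 + 0.5) / (gamma0 + 0.5 * resid2)
    return theta, E_beta

# Quadratic data with one gross outlier
rng = np.random.default_rng(6)
x = np.linspace(1, 10, 20)
y = x ** 2 + 0.05 * rng.normal(size=20)
y[10] += 50.0                                  # the outlier
X = np.vander(x, N=3, increasing=True)         # columns 1, x, x^2

theta, E_beta = vb_robust(X, y)
print(np.argmin(E_beta))                       # 10: the outlier gets the smallest weight
```

The resulting fit is close to the true quadratic, whereas plain OLS is dragged toward the outlier.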
Outliers
[Figures: left, data from y = 0 + 0x + 1x^2 + 0.05e with an outlier and OLS fits of 5th and 2nd order; right, the same data with the VB-OLS fit.]
Outliers, both diagonal α and β
[Figures: left, estimated coefficients θ̂ of 1, x, x^2, x^3 over 100 iterations; right, estimated β over 100 iterations.]

- Local minima; unstable when both α and β are estimated jointly.
Conclusion
- Linear regression is solved by OLS.
- When the data are not informative, we need to regularize.
- Different prior assumptions yield different results:
  - ridge regression minimizes the magnitude of the coefficients,
  - a sparsity prior minimizes the number of non-zero coefficients.
- Non-Gaussian residuals:
  - Student-t residuals,
  - mixture residuals, etc.
ETEX data challenge
Estimate the source term of the ETEX experiment,

y = Xθ,

where θ is assumed to be sparse, piece-wise linear, with non-negative elements.

Task                            points
θ using a prior (e.g. ARD)      10
Outlier detection (e.g. ARD)    10