Linear Regression
Václav Šmídl
March 3, 2020
Linear regression and OLS
Fit by a linear function:

y_1 = a x_1 + b + e_1,
y_2 = a x_2 + b + e_2,
...

In matrix notation, with θ = [a, b]^T:

y = Xθ + e.

Minimize Σ_i e_i^2 = e^T e = ||y − Xθ||_2^2:

d(e^T e)/dθ = 0
d/dθ ((y − Xθ)^T (y − Xθ)) = 0
d/dθ (y^T y − θ^T X^T y − y^T Xθ + θ^T X^T Xθ) = 0
−X^T y + X^T Xθ = 0
[Figure: scatter plot of the data (x, y).]
Solution:
θ = (X^T X)^{-1} X^T y.
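As a sketch, the closed-form OLS solution above can be computed directly with NumPy; the data here are synthetic, generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.2 * x + 0.5 + rng.normal(0, 0.3, size=50)  # y = a*x + b + e

# Design matrix X with columns [x, 1], so that theta = [a, b]^T
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
a_hat, b_hat = theta
print(a_hat, b_hat)  # close to the true a = 1.2, b = 0.5
```

In practice `np.linalg.lstsq(X, y)` is usually preferred over forming X^T X explicitly, for numerical reasons discussed below.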
Example: The European Tracer Experiment (ETEX)
- Release of 340 kg of PMCH over 12 hours
- Direct modeling: with known meteorology and release rates, predict concentrations
- Inverse problem: find the release profile, i.e. how much was released and when

Least squares solution:
1. Define a time window of possible releases: t ∈ [t_1, ..., t_n], with magnitudes θ_i.
2. Compute the concentrations c_{j,i} at all measurement locations j ∈ [1, ..., m] for a release of 1 unit of tracer at time t_i, for all i.
3. Solve the linear problem y = Cθ, where y = [y_1, ..., y_m]^T is the vector of measurements, θ = [θ_1, ..., θ_n]^T is the vector of unknowns, and C = (c_{j,i}) is the m×n matrix of unit-release concentrations.

This fails: the problem is poorly conditioned.
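A small synthetic stand-in (not the real ETEX data) illustrates why the least squares step fails: releases at neighbouring times produce nearly identical concentration footprints, so the columns of the source-receptor matrix are almost collinear and the normal equations are badly conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)

# m measurements, n unknown release magnitudes; columns differ only slightly
m, n = 20, 8
base = rng.uniform(0, 1, size=(m, 1))
X = base + 1e-6 * rng.normal(size=(m, n))      # nearly collinear columns
theta_true = np.abs(rng.normal(size=n))
y = X @ theta_true + 1e-3 * rng.normal(size=m)

print(np.linalg.cond(X.T @ X))                 # enormous condition number

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(theta_ols))               # typically far larger than ||theta_true||
```

Small measurement noise is amplified along the near-null directions of X, which is exactly the instability that ridge regression addresses next.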
Ridge regression
Ordinary least squares,

θ = (X^T X)^{-1} X^T y,

is problematic for numerical stability: as min eig(X^T X) → 0, ||θ|| → ∞. Remedies:
- Replace the inverse by a pseudo-inverse (thresholding small eigenvalues),
- Increase the eigenvalues of X^T X:

  θ = (X^T X + αI)^{-1} X^T y,   (1)

  so that the minimal eigenvalue is at least α.
- Add a penalization for large values:

  θ = arg min_θ (||y − Xθ||_2^2 + α||θ||_2^2),

  where α is a suitably chosen coefficient. This yields (1).
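A minimal NumPy sketch of the ridge estimate (1); the data are synthetic, and the checks illustrate two standard properties: for α → 0 ridge approaches OLS, and larger α shrinks the coefficient norm:

```python
import numpy as np

def ridge(X, y, alpha):
    """Ridge estimate theta = (X^T X + alpha I)^{-1} X^T y, eq. (1)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(ridge(X, y, 1e-10), theta_ols))                          # True
print(np.linalg.norm(ridge(X, y, 100.0)) < np.linalg.norm(theta_ols))      # True: shrinkage
```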
Ridge regression – selection of α
Selection of α:
1. Cross-validation (analytical solution for α)
2. L-curve: a plot of ||θ||_2^2 versus ||y − Xθ||_2^2
3. Joint estimation of α and θ
   - Bayes: setting of the tuning parameter

[Figure: L-curve for the ETEX ERA-40 data.]

Probability model:

p(y, θ|X, α) = p(y|θ, X, ω) p(θ|α) = N(Xθ, ωI) N(0, α^{-1}I)
             ∝ exp{ −(1/(2ω)) ||y − Xθ||_2^2 − (α/2) ||θ||_2^2 }
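The L-curve can be traced numerically by sweeping α and recording the two norms; a sketch on synthetic ill-conditioned data (the grid α = exp(−10:2) follows the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ill-conditioned synthetic problem (stand-in for ETEX-like data)
X = rng.normal(size=(40, 8))
X[:, 1] = X[:, 0] + 1e-4 * rng.normal(size=40)     # nearly collinear pair
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=40)

alphas = np.exp(np.arange(-10, 3))                  # alpha = exp(-10:2)
curve = []
for a in alphas:
    th = np.linalg.solve(X.T @ X + a * np.eye(8), X.T @ y)
    curve.append((np.sum((y - X @ th) ** 2), np.sum(th ** 2)))

# L-curve: residual ||y - X theta||^2 grows with alpha,
# while the solution size ||theta||^2 shrinks.
res, size = zip(*curve)
```

One then picks the α at the "corner" of the curve, where both quantities are moderate.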
Ridge regression – selection of α
Introduce a prior

p(α) = G(δ_0, γ_0),

and compute either

p(α|y, X)

or

p(θ|X, y, ω) = ∫ p(θ, α|X, y, ω) dα.

[Graphical model: hyperparameters γ, δ → α → θ → y, with X and ω as additional parents of y.]
Bayesian estimation of α
Joint likelihood:

p(y, θ, α|X) = p(y|θ, X, ω) p(θ|α) p(α) = N(Xθ, ωI) N(0, α^{-1}I) G(δ, γ),
p(θ, α|y, X) ∝ α^{δ + d/2 − 1} exp{ −(1/(2ω)) ||y − Xθ||_2^2 − (α/2) ||θ||_2^2 − γα }

Conditional for θ:

p(θ|α, y, X) = N(θ̂, (X^T X + αI)^{-1}),

Conditional for α:

p(α|θ, y, X) = G(δ_0 + d/2, γ_0 + ½||θ||_2^2).

Solution: EM, VB, Gibbs sampling, ...
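A minimal EM-style sketch alternates the two conditionals above (assuming ω = 1 and plugging in the posterior mean of the Gamma conditional for α; the data are synthetic):

```python
import numpy as np

def em_ridge(X, y, delta0=1e-8, gamma0=1e-8, iters=50):
    """Alternate the theta-step (conditional mean of theta) and the
    alpha-step (mean of G(delta0 + d/2, gamma0 + ||theta||^2 / 2))."""
    n, d = X.shape
    alpha = 1.0
    for _ in range(iters):
        theta = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
        # E[alpha | theta] = shape / rate for the Gamma conditional
        alpha = (delta0 + d / 2) / (gamma0 + 0.5 * theta @ theta)
    return theta, alpha

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
theta_true = np.array([2.0, 0.0, -1.0, 0.0])
y = X @ theta_true + 0.1 * rng.normal(size=50)

theta_hat, alpha_hat = em_ridge(X, y)
```

The iteration selects α automatically instead of tuning it by hand; VB and Gibbs sampling replace the point estimate of α by a distribution.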
Ridge regression – polynomial case
Define a degenerate case:
- true model: y = x^2 + e
- observations at x = [1, 2]
- fit a polynomial of 5th order
- use αI to restore the rank of the covariance matrix

[Figures: data generated from y = 0 + 0x + 1x^2 + 0.05e with polynomial fits; L-curve of ||θ|| versus ||y − Xθ|| for α = exp(−10:2); estimated coefficients of 1, x, x^2, x^3, x^4, x^5 as functions of α.]
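The degenerate setup can be reproduced in a few lines: with only 2 observations and 6 unknowns, X^T X is rank 2 and has no inverse, while adding αI makes the system solvable (noiseless data here, for simplicity):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = x ** 2                                # true model y = x^2

# 5th-order polynomial design matrix: columns 1, x, x^2, ..., x^5
X = np.vander(x, N=6, increasing=True)    # shape (2, 6), rank 2

print(np.linalg.matrix_rank(X.T @ X))     # 2: the OLS inverse does not exist

alpha = 1e-6
theta = np.linalg.solve(X.T @ X + alpha * np.eye(6), X.T @ y)
print(np.allclose(X @ theta, y, atol=1e-3))  # True: fits both data points
```

Among the infinitely many interpolating polynomials, ridge picks a small-norm one; it does not recover the true coefficients [0, 0, 1, 0, 0, 0].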
Sparse Linear Regression
The prior has a peak at zero and a heavy tail.

Spike-and-slab prior:

p(θ) = λ N(0, σ_0) + (1 − λ) N(0, σ_1),

Laplace prior:

p(θ) = (2b)^{-1} exp( −(1/(2b)) |θ| ),

with joint likelihood (LASSO):

p(y, θ|X, b) ∝ exp( −½ ||y − Xθ||_2^2 − (1/(2b)) ||θ||_1 ),

Horseshoe prior: p(θ) = N(0, λ), p(λ) = Cauchy(0, τ), p(τ) = Cauchy(0, 1).

ARD prior:

p(θ) = ∫ p(θ|α) p(α) dα = St(0, σ, ν),

with two possible hidden-variable formulations.
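The LASSO objective ½||y − Xθ||² + λ||θ||₁ has no closed form, but a short proximal-gradient (ISTA) sketch shows how the ℓ1 penalty drives coefficients exactly to zero; here λ plays the role of 1/(2b), and the data are synthetic with a sparse ground truth:

```python
import numpy as np

def ista(X, y, lam, iters=500):
    """Proximal gradient for 0.5 * ||y - X theta||^2 + lam * ||theta||_1."""
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        g = X.T @ (X @ theta - y)           # gradient of the quadratic part
        z = theta - g / L
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return theta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[[2, 7]] = [3.0, -2.0]            # sparse ground truth
y = X @ theta_true + 0.1 * rng.normal(size=100)

theta_hat = ista(X, y, lam=5.0)
print(np.nonzero(np.abs(theta_hat) > 1e-6)[0])  # mostly just indices 2 and 7
```

Unlike ridge, which only shrinks coefficients, the soft-threshold step sets small coefficients to exactly zero.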
Automatic relevance determination (ARD)
Probability model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, I) N(0, diag[α_1, ..., α_p]^{-1})
             ∝ exp{ −½ ||y − Xθ||_2^2 − ½ Σ_i α_i θ_i^2 }

Introduce priors

p(α_i) = G(δ_0, γ_0),  p(α) = Π_i p(α_i),

and compute

p(α|y, X)

or

p(θ|X, y) = ∫ p(θ, α|X, y) dα.

[Graphical model: γ_0, δ_0 → α_i → θ_i → y, with X and ω as additional parents of y; plate over i = 1..p.]
Variational Bayes for automatic relevance determination
Probability model:

p(y, θ, α|X) = N(Xθ, I) N(0, diag[α_1, ..., α_p]^{-1}) Π_i G(δ_0, γ_0)

Posterior factors:

p(α_i|y, X) = G(δ_0 + ½, γ_i),  γ_i = γ_0 + ½⟨θ_i^2⟩,
p(θ|y, X) = N(θ̂, Σ_θ),
θ̂ = (X^T X + diag⟨α⟩)^{-1} X^T y,
Σ_θ = (X^T X + diag⟨α⟩)^{-1}.

This is an iterated least squares scheme.
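A direct sketch of the iterated least squares scheme above, applied to the degenerate polynomial example (2 points, 6 basis functions); the update formulas follow the posterior factors, with ⟨θ_i²⟩ = θ̂_i² + (Σ_θ)_{ii}:

```python
import numpy as np

def vb_ard(X, y, delta0=1e-8, gamma0=1e-8, iters=100):
    """Variational Bayes ARD: alternate q(theta) and q(alpha_i) updates."""
    n, p = X.shape
    E_alpha = np.ones(p)                        # <alpha_i>, initial guess
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(iters):
        Sigma = np.linalg.inv(XtX + np.diag(E_alpha))
        theta = Sigma @ Xty
        E_theta2 = theta ** 2 + np.diag(Sigma)  # <theta_i^2>
        E_alpha = (delta0 + 0.5) / (gamma0 + 0.5 * E_theta2)
    return theta, E_alpha

# Degenerate polynomial example: 2 points, 6 basis functions
x = np.array([1.0, 2.0])
X = np.vander(x, N=6, increasing=True)
y = x ** 2

theta, E_alpha = vb_ard(X, y)
```

A large ⟨α_i⟩ after convergence means the i-th basis function has been pruned, which is how ARD induces sparsity.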
Variational Bayes for automatic relevance determination
[Figure: estimated coefficients θ̂ of 1, x, x^2, x^3, x^4, x^5 over 500 VB iterations.]
Revisiting toy example:

p(x_i|m) = N(m, 1),  i = 1, 2,
p(m|φ) = N(0, φ),
p(φ) = iG(γ_0, δ_0)

[Graphical model: φ → m → x_i (i = 1, 2), with µ = 0, s = 1 and hyperparameters α ≈ 0, β ≈ 0 on φ.]

Solution using:
- Variational Bayes
- Numerical evaluation on a grid
Revisiting toy example:
[Figures: joint posterior p(m, φ|x_1, x_2) with hyperparameters γ = δ = 1e-08, for data (x_1, x_2) = (10, 12), (3, 2.7), (2, 2), and (3, 0); each panel compares the exact marginals p(φ|x_1, x_2) and p(m|x_1, x_2) with their VB approximations q(φ|x_1, x_2) and q(m|x_1, x_2).]
Outliers
[Figures: data generated from y = 0 + 0x + 1x^2 + 0.05e with an outlier; OLS fits of 5th and 2nd order polynomials.]

- How to minimize the effect of an outlier?
- Outlier detection, robust statistics, etc.
- Hierarchical model?
Outliers – hierarchical model
Probability model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, β^{-1}I) N(0, α^{-1}I).

Is the variance of the noise homogeneous? New model:

p(y, θ|X, α) = p(y|θ, X) p(θ|α) = N(Xθ, diag[β_1, ..., β_n]^{-1}) N(0, α^{-1}I).

Prior:

p(β_i) = G(δ, γ),  p(β) = Π_i p(β_i).
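A hedged sketch of a VB scheme for this model (fixed prior precision α on θ, per-observation precisions β_i on the noise): observations with large residuals receive small ⟨β_i⟩ and are automatically down-weighted. The data mimic the quadratic example from the slides, with one gross outlier:

```python
import numpy as np

def vb_robust(X, y, alpha=1e-6, delta0=1e-2, gamma0=1e-2, iters=50):
    """VB for the noise model N(X theta, diag(beta)^-1): observations with
    large residuals get small <beta_i>, i.e. are down-weighted."""
    n, p = X.shape
    E_beta = np.ones(n)
    for _ in range(iters):
        # Weighted ridge step with weights <beta_i>
        W = np.diag(E_beta)
        Sigma = np.linalg.inv(X.T @ W @ X + alpha * np.eye(p))
        theta = Sigma @ X.T @ W @ y
        # Gamma conditional for beta_i, using <(y_i - x_i theta)^2>
        resid2 = (y - X @ theta) ** 2 + np.einsum('ij,jk,ik->i', X, Sigma, X)
        E_beta = (delta0 + 0.5) / (gamma0 + 0.5 * resid2)
    return theta, E_beta

# Quadratic data with one gross outlier
rng = np.random.default_rng(6)
x = np.linspace(1, 10, 20)
y = x ** 2 + 0.05 * rng.normal(size=20)
y[10] += 50.0                                  # the outlier
X = np.vander(x, N=3, increasing=True)         # columns 1, x, x^2

theta, E_beta = vb_robust(X, y)
print(np.argmin(E_beta))                       # 10: the outlier gets the smallest weight
```

The resulting fit is close to the true quadratic, whereas plain OLS is dragged toward the outlier.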
Outliers
[Figures: left, data from y = 0 + 0x + 1x^2 + 0.05e with an outlier and OLS fits of 5th and 2nd order; right, the same data with the VB-OLS fit.]
Outliers, both diagonal α and β
[Figures: left, estimated coefficients θ̂ of 1, x, x^2, x^3 over 100 iterations; right, estimated β over 100 iterations.]

- Local minima; unstable when both α and β are estimated jointly.
Conclusion
- Linear regression is solved by OLS.
- When the data are not informative, we need to regularize.
- Different prior assumptions yield different results:
  - ridge regression minimizes the magnitude of the coefficients,
  - a sparsity prior minimizes the number of non-zero coefficients.
- Non-Gaussian residuals:
  - Student-t residuals,
  - mixture residuals, etc.
ETEX data challenge
Estimate the source term of the ETEX experiment,

y = Xθ,

where θ is assumed to be sparse, piece-wise linear, with non-negative elements.

Task                            points
θ using a prior (e.g. ARD)      10
Outlier detection (e.g. ARD)    10