Compressed Sensing and Machine Learning

Danny Panknin
Student ID number: 315728

August 9, 2012

Bachelor thesis in Applied Functional Analysis at TU Berlin

Faculty 2, Institute of Mathematics

Handed out by the first examiner:

Prof. Dr. Gitta Kutyniok

Second examiner:

Prof. Dr. Klaus-Robert Müller

Statutory Declaration (Eidesstattliche Erklärung)

I affirm in lieu of oath that this thesis was prepared independently and on my own.

Berlin,

Abstract

When the first theoretical results established compressed sensing around the year 2004, it was common to assume structure-preserving, i.e. roughly isometric, measurement methods in order to guarantee a unique reconstruction of data from its measurements. These results revealed that, using such measurements, sampling rates far below the known lower bounds from classical information theory suffice to give this guarantee.

By intuition it should be clear that these measurement methods, due to their structure-preserving property, can be used to compress data related to a learning problem like classification, such that solving the problem on the measurements will yield the solution of the original problem. This would allow for solving problems whose solution would otherwise require an infeasible amount of resources, such as computation time or expensive data acquisition. A first attempt towards this idea is given by compressed learning.

This thesis summarizes previous results in compressed learning, discusses the contained necessary as well as too restrictive assumptions, and discloses some limits of the approach. The arrangement is given as follows:

The first chapter introduces the two fields compressed learning arises from, namely compressed sensing and machine learning, in general, though with a focus on the areas concerning this thesis. In the second chapter an overview of the classification task as a general problem class will be given in more detail. Chapter three refines the previous results in compressed learning, including a first, simple construction of a measurement process. The too restrictive assumptions, as well as an outline of how to lessen them, will be discussed in the fourth chapter. Chapter five shows in toy experiments the performance of compressed learning for some measurement methods. Finally, in the sixth chapter, the limits of application of compressed learning will be discussed.

Zusammenfassung (Summary)

At the beginnings of compressed sensing around the year 2004, a structure-preserving, approximately isometric, linear measurement method for sparse data was a common assumption in order to guarantee, with high probability, the unique reconstruction of the data from the corresponding measurements. It was shown that the number of measurements required for this can lie far below the lower bounds known from information theory.

Intuitively, these measurement methods lend themselves to reducing the dimension of data for learning problems on that data, such as classification, while preserving the solution. This would open up the possibility of solving problems on data that would otherwise, due to its (time-)costly acquisition or expensive processing, not be compatible with an acceptable consumption of resources. Compressed learning is the first attempt to realize this concept.

This thesis summarizes the results on compressed learning obtained so far, discusses the necessary as well as the too restrictive assumptions contained therein, and points out limits of the approach. The outline is as follows:

The first chapter gives a general introduction, though targeted to what is needed, to the two originating fields, compressed sensing and machine learning. Chapter two examines the considered problem of classification in more detail. Chapter three presents the first results in compressed learning, including a simple construction of a measurement method. The too restrictive assumptions and how to relax them are discussed in chapter four. Chapter five compares, through experiments, the behavior of compressed learning with different measurement methods. Finally, chapter six discusses the limits of applicability of compressed learning.

Contents

1 Introduction
  1.1 Notation
  1.2 Introduction to Compressed Sensing
  1.3 Introduction to Machine Learning

2 Classification
  2.1 Prelude
  2.2 Linear Classifiers
    2.2.1 Linear Maximum Margin Classifiers
    2.2.2 Kernelized Linear Classifiers
  2.3 Overlapping Class Distributions
  2.4 Support Vector Machines

3 Introduction to Compressed Learning
  3.1 The Conceptual Idea
  3.2 Links of the Chain
  3.3 Applying the Chain
  3.4 A DP-Matrix via JLP
    3.4.1 DP-Scaling Property from JLP
    3.4.2 DP-Angle-Preserving Property from JLP
    3.4.3 A Gaussian sampled DP-Matrix

4 Discussion
  4.1 Inessentiality of the DP-scaling property
  4.2 Relaxation using Coherent and Redundant Dictionaries

5 Experiments

6 Limitations of Compressed Learning
  6.1 The finite Data Dimension
  6.2 Kernel-SVMs vs. Compressed Learning

7 Conclusion

1 Introduction

1.1 Notation

Although the mathematical notation in this thesis is conventional, it might facilitate reading fluency to list some definitions right at the beginning.

R^n : the n-dimensional vector space of real numbers

R^+_0 = {x ∈ R | x ≥ 0} : the non-negative real numbers

⟨·,·⟩_F : F × F → R, (x,y) ↦ ⟨x,y⟩_F : the canonical scalar product over the Hilbert space F

‖·‖_F : F → R^+_0, x ↦ √⟨x,x⟩_F : the norm induced by the Hilbert space F

ℓ_{F1→F2}(X) = max_{x∈X} ‖x‖_{F2}/‖x‖_{F1} : the distortion of the set X ⊆ F1 ∪ F2 when embedding its elements from F1 into F2

B^n_R(x) = {y ∈ R^n | ‖y−x‖_2 ≤ R} : the Euclidean ball in R^n with radius R centered around x ∈ R^n

|x| : the absolute value of x, if x ∈ R, or the cardinality of x, if x is a set

‖x‖_0 = |{i ∈ {1,...,n} | x_i ≠ 0}| : the number of entries of x different from zero

Σ_k = {x ∈ R^n | ‖x‖_0 ≤ k} : the subset of at most k-sparse elements in R^n

X : a set that contains the data of interest, called data space or feature space

Y : a set that contains a discrete (usually finite) amount of arbitrary elements, called the classes or labels

E_{x∼D} f(x) : the expectation of f(x), where x has distribution D

1_A : X → {0,1}, x ↦ 1 if x ∈ A, 0 else : the indicator function on A ⊆ X

sign : R → {−1,1}, x ↦ 1 if x ≥ 0, −1 else

x_+ = max(x, 0) : the positive part of x ∈ R

The following introductions to compressed sensing and machine learning are inspired by and held in the manner of the references [DDEK12] and [Bis07], which contain extensive introductions to their corresponding fields of science.

1.2 Introduction to Compressed Sensing

The movement of signal processing from the analog domain to the digital domain in the last decades afforded new sensing systems covering higher fidelity and resolution. Kotelnikov, Nyquist, Shannon, and Whittaker delivered the theoretical foundation for sampling continuous-time band-limited signals, promoting this trend [Kot33, Nyq28, Sha49, Whi15]. Their results demonstrate that signals, like images and videos, can be exactly recovered from a set of uniformly spaced samples taken at the Nyquist rate of twice the highest frequency present in the signal of interest. Obviously, a rapidly growing amount of measurement data accompanies the rising complexity of sensing systems, so in emerging applications infeasibly many samples have to be taken for data reconstruction with respect to resource costs. This could be driven, for example, by physical limitations inhibiting the construction of devices sampling at the required rate [Wal99]. Compressed sensing (CS) aims to find a compressed representation of the high-dimensional data under the constraint of minimal distortion of the signal. Finding a basis or frame (also called dictionary) of the signal space, in which the data of interest exhibits a sparse or compressible representation, proved to be a promising ansatz. A frame is an over-complete vector space generating system, in which a vector offers more than one representation. A signal is called sparse or compressible if it can be represented with very few coefficients - compared to the signal dimension - while losing exactly no or, respectively, nearly no signal information. Representing data by these coefficients is called sparse approximation, which forms the foundation of transform coding schemes that exploit signal sparsity and compressibility, including the JPEG, MPEG, and MP3 standards.

The idea of CS dodges the concepts that underlie the Nyquist-Shannon sampling theorem, as it does not focus on sampling in high dimension with compression afterwards, but instead senses the data in compressed form directly, so potentially much better sampling results might arise. CS grew out of the work of Candès, Romberg, and Tao and of Donoho, who showed that a finite-dimensional signal having a sparse or compressible representation can be recovered from a small set of linear, non-adaptive measurements [Can06, Bar07, CR06, CRT06a, CRT06b, CT06, Don06]. In order to delimit CS from classical sampling, one may look at two respects. First, rather than sampling the signal at specific points in time, CS systems typically acquire measurements in the form of inner products between the signal and more general test functions. Second, signal recovery from compressive measurements is dealt with by highly non-linear methods instead of sinc interpolation as in the Nyquist-Shannon framework. Primarily, the mathematical theory of CS focused on measuring finite-dimensional vectors in R^n. Recently, new results by Hansen and Adcock lifted this theoretical fundament to infinite dimensions for the first time in a consistent way and, in advance, disclosed the pitfalls earlier attempts suffered from [HA11]. In contrast, sampling theory considers by nature infinite-length, continuous-time signals.

Define the standard finite-dimensional CS model by a given signal x ∈ R^n and a measurement system that acquires m linear measurements of x, i.e. this can be described by a sensing matrix A ∈ R^{m×n} with the measurement y = Ax. Note that in the standard CS framework non-adaptive measurements are assumed. This can be captured in terms of the sensing matrix A being independent of previously acquired measurements. In this model notation the main questions of CS can be translated as follows: how should one design A such that it preserves the information in the signal x, and how can x be reconstructed from the measurements y? In the range of this thesis, i.e. establishing CS in machine learning, one focuses on the first question, as pattern recognition (on the measurements) is the major task, rather than data reconstruction.

To address the first question, observe the following. Denote by Σ_k = {x ∈ R^n | ‖x‖_0 ≤ k} the set of all at most k-sparse vectors in R^n, where ‖x‖_0 := |{x_i | x_i ≠ 0}|. Note that ‖·‖_0 is not a norm at all, but it is the limit of the quasi-norms ‖·‖_p^p as p → 0, which justifies the notation. For A it should hold that Ax ≠ Ax′ for every x ≠ x′ ∈ Σ_k in order to preserve information between k-sparse signals. In other words, since (x − x′) ∈ Σ_{2k}, one would like to have Σ_{2k} and ker(A) disjoint apart from the zero vector. Intuitively, one defines the spark of a given matrix A as the smallest number of columns of A that are linearly dependent. It then holds that

\[ \operatorname{spark}(A) > 2k \iff \forall x \ne x' \in \Sigma_k : Ax \ne Ax' \]

Due to the fact that spark(A) ∈ {2,...,m+1}, obviously m ≥ 2k is a necessary condition for the required information preservation. In the case of only compressible data, which could be due to natural non-sparseness, a noisy measurement process or corrupted data, one furthermore has to ensure that an environment around the k-sparse signals, Σ_k + B_ε(0), is disjoint with ker(A). This can be guaranteed by weak isometry conditions on A, which were first introduced by Candès and Tao.

Definition 1 (Restricted Isometry Property, [CT05]). A satisfies the restricted isometry property of order k with a constant δ_k ∈ [0,1], in shorthand notation (k, δ_k)-RIP, if it acts as a near-isometry on Σ_k, that is,

\[ \forall x \in \Sigma_k : (1-\delta_k)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta_k)\|x\|_2^2 \qquad (1.1) \]

If A satisfies RIP of order 2k, it will therefore preserve distances between any pair of k-sparse vectors. This property plays a greater role in the remainder of this thesis. Constructing such a sensing matrix - in addition independently of the concrete data - is non-trivial and will be discussed in more depth in Chapters 4 and 5. The restricted isometry property is a rather strong condition on A; concretely, the property is not necessary for successful recovery. However, with respect to an intuitive stability measure of a pair (A, Δ), where Δ denotes an arbitrary recovery algorithm, the lower bound of RIP becomes necessary.

Definition 2. The pair (A, Δ) is C-stable with C > 0 if for any x ∈ Σ_k and e ∈ R^m

\[ \|\Delta(Ax + e) - x\|_2 \le C\|e\|_2 \]

holds.

Here e represents the noise due to an inexact measurement process. One can show that this stability measure implies a lower bound as it is demanded by RIP.

Theorem 1 (Theorem 3.1, [Dav10]). If the pair (A, Δ) is C-stable, then

\[ \frac{1}{C}\|x\|_2 \le \|Ax\|_2 \]

for all x ∈ Σ_{2k}.

That is, it is necessary that A satisfies the lower inequality of (2k, 1 − 1/C²)-RIP.

Note that one can give a lower bound on the number of measurements taken such that A can satisfy RIP. Concretely, if A satisfies (2k, δ_{2k})-RIP with δ_{2k} ≤ 1/2, then necessarily m ≥ C k log(n/k) for some constant C > 0. This lower bound still does not capture the dependence on the distortion degree, for example given by δ_k in terms of the RIP definition. Another property involving this dependence was investigated by Johnson and Lindenstrauss and is founded on the following lemma.

Lemma 1 (Johnson, Lindenstrauss, [JL84]). Let 0 < ε < 1, X = {x_1,...,x_p} ⊂ R^n and m ≳ (1/ε²) log(p). Then there exists a Lipschitz function f : R^n → R^m such that

\[ (1-\varepsilon)\|u-v\|_2^2 \le \|f(u)-f(v)\|_2^2 \le (1+\varepsilon)\|u-v\|_2^2 \]

holds for all u, v ∈ X.
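To make Lemma 1 concrete, the following small numerical sketch (not part of the thesis; the choice of a Gaussian random projection as the Lipschitz map f, the constant 8 in the embedding dimension, and all other parameter values are illustrative assumptions) draws a point cloud and measures the pairwise distance distortion of a random projection.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, eps = 1000, 50, 0.3                     # ambient dimension, number of points, target distortion
m = int(np.ceil(8 * np.log(p) / eps**2))      # embedding dimension of order (1/eps^2) log(p)

X = rng.standard_normal((p, n))               # finite point cloud x_1, ..., x_p in R^n
A = rng.standard_normal((m, n)) / np.sqrt(m)  # linear map f(x) = Ax with N(0, 1/m) entries

# ratios ||A(u - v)||^2 / ||u - v||^2 over all pairs u != v
diffs = X[:, None, :] - X[None, :, :]
iu = np.triu_indices(p, k=1)
orig = np.sum(diffs[iu] ** 2, axis=1)
proj = np.sum((diffs[iu] @ A.T) ** 2, axis=1)
ratios = proj / orig

print(f"m = {m}, range of ||f(u)-f(v)||^2 / ||u-v||^2: "
      f"[{ratios.min():.3f}, {ratios.max():.3f}]  (target: [{1-eps:.2f}, {1+eps:.2f}])")
```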

Especially, a function f fulfilling this condition is regarded as satisfying the Johnson-Lindenstrauss property, also referred to as JLP. In the context of machine learning the latter property seems much more suitable, as there one naturally considers finite point clouds. Since one can deduce each of the two properties from the other (with adjusted constants), one sees how strongly they are related. With respect to the construction of sensing matrices with these specific properties, the coherence of a matrix, a basis or a frame should be investigated. When talking about the coherence of a basis or frame, assume the matrix A to be the matrix formed column-wise by the frame or basis vectors.

Definition 3. The coherence µ(A) of a matrix A ∈ R^{n×d} is the largest absolute inner product between any two normalized columns of A:

\[ \mu(A) = \max_{1\le i<j\le d} \frac{|\langle a_i, a_j\rangle_2|}{\|a_i\|_2\|a_j\|_2} \]

It holds that µ(A) ∈ [√((d−n)/(n(d−1))), 1]. For small µ(A), A is called incoherent. Incoherence is related to the above properties, as it gives an upper bound on the distortion constant of RIP:

\[ \delta_k(A) \le (k-1)\mu(A) \]

Note that there exist more relations between the listed definitions. Some of them, like the concrete connection between RIP and JLP, will be discussed later on where they become relevant to this thesis.
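Definition 3 and the lower end of the interval for µ(A) can be evaluated directly; a minimal sketch (my own illustration, with arbitrary matrix sizes):

```python
import numpy as np

def coherence(A: np.ndarray) -> float:
    """Largest absolute inner product between distinct normalized columns of A."""
    G = A / np.linalg.norm(A, axis=0)          # normalize columns
    gram = np.abs(G.T @ G)                     # |<a_i, a_j>| / (||a_i|| ||a_j||)
    np.fill_diagonal(gram, 0.0)                # ignore i = j
    return gram.max()

rng = np.random.default_rng(0)
n, d = 32, 128
A = rng.standard_normal((n, d))

lower_bound = np.sqrt((d - n) / (n * (d - 1)))   # lower end of the interval for mu(A)
print(f"mu(A) = {coherence(A):.3f}, lower bound = {lower_bound:.3f}")
```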

1.3 Introduction to Machine Learning

The problem of finding patterns in arbitrary observable data has been of interest ever since humanity started progressing in the modern scientific disciplines. The formulation of the empirical laws of planetary motion from astronomical observations in the early 17th century, for example, enabled large developments in classical mechanics. Machine learning is the research area concerned with the automatic detection of patterns in data.

Such problems can be subdivided into classes according to the further information given and the desired abilities of the solution, which are:

Supervised learning problems

Unsupervised learning problems

Semi-supervised learning problems

Supervised learning problems are about learning a function on all possible trajectories of the data, given a (finite) subset of observed data, called training set, together with the additional information of their target values, i.e. the values the function should produce when evaluating data. If the set of target values is finite, it is called a classification problem. If, in contrast, the target values are continuous, it is called a regression problem. One can interpret the target values as a specific shape of a property that each data instance holds. For example the data could be 'handwritten ciphers' represented as a 16 × 16 matrix of colour values, where the corresponding property is the cipher identity. Problems on data that possesses no target values are denoted as unsupervised learning problems. In particular, one does not try to match a function to some (unknown or non-existing) target values. Instead the affiliated types of problems are more of a data preprocessing kind, or concern inference about the data-generating process. In some cases one has only partial knowledge about target values. This could be due to the expense of manual acquisition, for example when e-mails should be tagged as spam or normal mail. Maybe one has target values only for a subset of all possible classes, because one has not enough information about, or no training instances of, some classes. This happens in computer security problems like intrusion detection, where one monitors the integrity of a (computer) system. There, one generally observes only the normal behavior of the system until an attack occurs. Such an attack could either be known in its pattern, when another similar attack was learned before, or no comparable attack exists so far. In the latter case one still desires to detect the attack even though its pattern, and therefore its classification, is unknown. This class of outlier- or change-point-detection problems is regarded as semi-supervised learning problems.

In the remaining part of this thesis only supervised learning problems will be discussed. Given a specific data set, further research areas of machine learning that enable automatic detection deal with data preprocessing, decision theory and model selection. Data preprocessing is needed when the target value is not deducible in a generic way from the raw form of the data, as might be the case when the data instances are not comparable, the dimensionality of the data is too large to handle, or the information for the decision is hidden. For a text categorization task, as an example, the characteristics of a text document that assign it to the correct category are hidden in language-specific text modules or words, and therefore not automatically, and not even in general manually, accessible. Especially here non-comparable instances can occur, when document instances are given in different languages. Such data formats require feature extraction, i.e. vectors containing the necessary characteristic information of the data.

Subsequently, decision theory is about optimizing the category assignment. This is in general not trivial, since categories do not need to be disjoint. In such a case it is obvious that any arbitrary decision method fails to assign correct categories over the whole data set. This calls for a criterion to compare different decision methods, namely minimization of a loss function. A typical loss function could look as follows:

\[ \sum_{k,j \in C} \int_{R_j} L_{k,j}\, p(x,k)\, dx, \]

where C is the set of possible class assignments, R_j is the data subset where the decision method suggests class j, L_{k,j} is a penalty term raising the loss whenever class j is assigned when actually class k was the correct one, and p is the joint distribution of the data and classes.
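For a discrete data space the integral over R_j becomes a sum over the points where the decision rule picks class j, and the expected loss of a decision rule can be evaluated directly. The following sketch is my own illustration with made-up numbers, not from the thesis; it uses an asymmetric penalty matrix in the spirit of the medical example discussed next.

```python
import numpy as np

classes = [0, 1]                               # C
xs = np.arange(5)                              # discrete data values

p_joint = np.array([[0.10, 0.15, 0.20, 0.05, 0.02],    # p(x, k=0)
                    [0.02, 0.03, 0.08, 0.15, 0.20]])   # p(x, k=1)

L = np.array([[0.0, 1.0],                      # L[k, j]: penalty for deciding j when k is true
              [5.0, 0.0]])                     # missing class 1 is penalized more heavily

def expected_loss(decision):
    """decision[x] = assigned class j for data value x; returns the summed loss above."""
    return sum(L[k, decision[x]] * p_joint[k, x] for k in classes for x in xs)

rule_a = np.array([0, 0, 0, 1, 1])             # two candidate decision rules
rule_b = np.array([0, 0, 1, 1, 1])
print(expected_loss(rule_a), expected_loss(rule_b))
```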

Assume in a medical problem the data x to be some X-ray measurement of a patient, and the classes to be {cancer, no cancer}. Then L_{cancer, no cancer}, i.e. the penalty when the patient has cancer but the diagnosis is no cancer, should be much larger than L_{no cancer, cancer}, as the former error might result in a premature death of the patient, while the latter error will only produce some costs for further testing. When L_{k,j} = 1 − δ_{k,j}, penalizing each misclassification on the same scale, this loss is also known as generalization loss. A function mapping data to classes is called a classifier. It is just one possibility to approximate the minimizing classifier via estimating the joint or marginal distribution of the data followed by pointwise minimization of the integrand in the loss function. Other approaches estimate a classifier globally via direct minimization of a loss function, when there exist efficient optimization methods to solve it. The Support Vector Machine (SVM) classifier, which will be of major interest in the remaining thesis, falls into that category. Most of the methods for estimating a classifier involve some free parameters. The last two examples both possess penalty terms as free parameters. These parameters allow for sophisticated adjustment of the method to individual data sets. The tuning of such free parameters is called model selection. A simple, but very commonly used tuning concept is known as cross-validation. Cross-validation contains the following steps (a small code sketch follows the list):

fix a finite set of possible parameter values, commonly a logarithmically or equidistantly placed grid

disjointly segment the training set into several subsets

assess beforehand a loss function

for each parameter vector of the grid and for several distinct splits of the segments into a training set and a validation set, learn a classifier on the training set and estimate the loss on the validation set

choose the parameter vector that shows the best average performance (minimal loss)
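These steps are implemented, for instance, by grid-search cross-validation in scikit-learn. The following sketch is illustrative only; the library choice, the synthetic data set, and the parameter grid are my assumptions, not part of the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# logarithmically placed grid of candidate values for the free parameters
param_grid = {"C": np.logspace(-2, 2, 5), "gamma": np.logspace(-3, 1, 5)}

# 5-fold cross-validation: disjoint segmentation, repeated training/validation splits,
# and selection of the parameter vector with the best average validation performance
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean validation accuracy:", round(search.best_score_, 3))
```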

Especially from the last few introduced machine learning definitions one can realize a strong connection of this research field with statistics and probability theory. Another main pillar of machine learning is functional analysis. This is due to the fact that most of the data representations, i.e. the extracted features, are embeddings into Hilbert spaces via positive semi-definite kernels. As mentioned before, the focus in this thesis is on the classification task. Therefore the following section will reintroduce classification in a more detailed and formal way.

2 Classification

2.1 Prelude

Assume one observes a training set S = {(x_1, l_1), ..., (x_M, l_M)} ⊆ X × Y, where X is the data space (also feature space) of all possible trajectories. Each element of the data space holds a property value according to some distribution over a finite set Y, the classes or labels. Recall the example of handwritten ciphers from the introduction. Then X = 'handwritten ciphers' and Y = 'cipher identities', or formally X = {A = (a_{ij}) ∈ R^{16×16} | a_{ij} ∈ [0, 255]³} and Y = {0, ..., 9}.

A classifier y : X → Y is a function assigning a label to each data instance. The goal is to formulate a 'good' classifier when solely having the knowledge about the observations S. The measure of goodness has to be specified manually depending on the application's priorities, as already mentioned when introducing decision theory in the former chapter. Analysing the space of all classifiers is hard, and so one can reduce the problem complexity by considering only a more easily handled subset of it.

2.2 Linear Classifiers

Let X be a subset of a Hilbert space. A linear classifier is a classifier that dissects the data space with a finite number of hyperplanes into equally labeled regions. One can split up such a classifier into several (binary) classifiers corresponding to each of the hyperplanes, and so, without loss of generality, consider only labels of the form Y = {−1, +1}. Then the linear classifier is of the form

\[ y(x) := y_{w,b}(x) = \langle w, x\rangle_2 + b \]

Note that not y itself maps into Y, but sign(y) does. For convenience one identifies y with sign(y) when factually classifying. The decision hyperplane

\[ h = \{x \in X \mid y(x) = 0\} = w^{\perp} - b\,\frac{w}{\|w\|_2^2} \]

is also identified with the classifier y itself.

2.2.1 Linear Maximum Margin Classifiers

Delimit the set of considered linear classifiers first to the set H of linear classifiers that strictly classify the data set S correctly.

Definition 4 (Margin). The margin of a classifier h is given by the distance d(h, S) of S to h, i.e.

\[ d(h, S) = \min_{(x_i, l_i)\in S} \frac{|y(x_i)|}{\|w\|_2} \]

The maximum margin classifier is obviously given by h^* = \operatorname*{argmax}_{h \in H} d(h, S).

Classifiers exhibiting this property often prove to perform nearly as well as the best linear classifier with respect to the generalization loss. In order to form an intuition about that statement, recall the generalization loss

\[ R_D(y) = \mathbb{E}_{(x,y)\sim D}\, 1_{y(x)\ne y} \]

where D is the unknown distribution over X × Y.

One can now approximate D via a generic density estimation P_σ, built by placing a Gaussian density p_i ∼ N(c_i, σI) on each observation and then grouping up the equally labeled ones via P_σ(x | y = k) = (1/#{l_i = k}) Σ_{l_i = k} p_i(x). Given the maximum likelihood prior distribution over the marginals p(y), one can then estimate the joint distribution as P_σ(x, y) = p(x|y)p(y). Since P_σ approximates D, also R_{P_σ}(y) approximates R_D(y), and the following theorem holds:

Theorem 2 (Vapnik (1982)). Let h^* ∈ H. Then the following statements are equivalent:

\[ \forall h \in H \setminus \{h^*\}\ \exists s > 0 \text{ s.t. } \forall \sigma < s : R_{P_\sigma}(h^*) < R_{P_\sigma}(h) \]

h^* has maximal margin

This shows that for an open environment around zero in the parameter space the maximum margin classifier performs best. It does not follow, though, that it is the best classifier in H in general, since the correct distribution does not necessarily lie within this area.

Since all x ∈ S are correctly classified, the following is valid:

\[ \frac{|y(x_i)|}{\|w\|_2} = \frac{l_i\, y(x_i)}{\|w\|_2} = \frac{l_i\left(\langle w, x_i\rangle_2 + b\right)}{\|w\|_2} \]

Using this, one can reformulate the task of finding the maximum margin classifier as

\[ \operatorname*{argmax}_{w,b}\left( \frac{1}{\|w\|_2} \min_{(x_i, l_i)\in S} \left[ l_i\left(\langle w, x_i\rangle_2 + b\right) \right] \right) \]

The distance of a data point x to h is invariant under equal rescaling of w and b. One can use this degree of freedom to fix, without loss of generality,

\[ l_i\left(\langle w, x_i\rangle_2 + b\right) = 1 \]

for the closest point in S to h. It follows that

\[ l_i\left(\langle w, x_i\rangle_2 + b\right) \ge 1, \quad i = 1,\dots,M. \]

By that, the formulation of finding the maximum margin classifier simplifies to

\[ \operatorname*{argmax}_{w,b}\ \frac{1}{\|w\|_2} = \operatorname*{argmin}_{w,b}\ \frac{1}{2}\|w\|_2^2, \quad \text{subject to } l_i\left(\langle w, x_i\rangle_2 + b\right) \ge 1,\ i = 1,\dots,M \]

So far, X being part of a Hilbert space and a linear classifier were assumed. These assumptions are very restrictive, since in most real-world data applications the data is either not represented within a Hilbert space or the differently labeled regions are not linearly separable. Making use of positive semi-definite kernels one can almost drop these assumptions.

2.2.2 Kernelized Linear Classifiers

When given an intuitive similarity measure κ between elements of an arbitrary data space X, the following theorem allows an embedding of the data into a Hilbert space, reducing the problem to the restrictive case discussed above.

Theorem 3 (Moore-Aronszajn (1950)). Given any set X with a positive semi-definite mapping κ : X × X → C, then

there exists exactly one Hilbert space F of functions on X with κ as its reproducing kernel, and

\[ \kappa(s, t) = \langle \Phi(s), \Phi(t)\rangle_F \]

where Φ : X → F, t ↦ κ(·, t) is the canonical embedding.

This obviously overcomes the problem of the data not being represented in a Hilbert space by embedding elements x ∈ X as Φ(x) ∈ F. That one furthermore can overcome the problem of non-linear separability can be illustrated by an example:

When embedding with a quadratic polynomial kernel κ(x, y) = ⟨x, y⟩², the non-linear classification problem of Figure 1 becomes the linearly separable problem shown in Figure 2.

So, the kernelized linear classifier is a modified version of the linear classifier making use of the embedding Φ:

\[ y(x) := y_{w,b}(x) = \langle w, \Phi(x)\rangle_2 + b \]

Figure 1: A non-linear classification problem

Figure 2: The embedded linear classification problem
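A minimal numerical version of the example illustrated in Figures 1 and 2 (my own sketch; the data set and the use of scikit-learn are illustrative choices): for the quadratic kernel κ(x, y) = ⟨x, y⟩² an explicit embedding is Φ(x) = (x_1², √2 x_1 x_2, x_2²), and data that is not linearly separable in R² becomes linearly separable after the embedding.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

def phi(X):
    """Explicit feature map of the quadratic kernel <x, y>^2 for 2-dimensional data."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

linear_raw = LinearSVC(dual=False).fit(X, y)        # linear classifier in the original space
linear_emb = LinearSVC(dual=False).fit(phi(X), y)   # linear classifier after the embedding

print("accuracy in the original space :", round(linear_raw.score(X, y), 3))
print("accuracy after the embedding   :", round(linear_emb.score(phi(X), y), 3))
```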

2.3 Overlapping Class Distributions

Another implicit restriction is the assumption of exact linear separability. Commonly the data incorporates noise, or the different classes are not chosen optimally, so that their preimages are not disjoint.

From another perspective, one might prefer to use a less complex model with regard to better generalization when there is not a sufficient amount of data to estimate a more complex model adequately. With that in mind one can loosen the explicit optimization constraints of

\[ \operatorname*{argmin}_{w,b}\ \frac{1}{2}\|w\|_2^2, \quad \text{subject to } l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right) \ge 1,\ i = 1,\dots,M \]

to the primal optimization problem:

\[ \operatorname*{argmin}_{w,b,\xi}\ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{M}\xi_i, \quad \text{subject to} \qquad (2.1) \]
\[ l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right) \ge 1 - \xi_i, \quad i = 1,\dots,M \]
\[ \xi_i \ge 0, \quad i = 1,\dots,M \]

The idea is to allow violation of the explicit constraint on x_i, capturing the amplitude of violation in the slack variable ξ_i. Then, depending on the size of C > 0, the constraint violation 'penalizes' the objective function in the sense that it is raised, antagonizing the minimization. Due to this penalizing interpretation one usually calls it a penalty term. C is a regularizing trade-off parameter between model complexity and training error. For C → ∞ the penalization grows, permitting less training error and forcing the model to be more complex, which results in convergence to the former problem formulation without slack variables. For C → 0 the penalization becomes negligible, allowing for more training error and a less complex model.

For different values of ξ_i one can infer:

ξ_i = 0 ⇔ x_i lies outside the margin boundary on the correct side

0 < ξ_i ≤ 1 ⇔ x_i lies within the margin boundary on the correct side

ξ_i > 1 ⇔ x_i is misclassified

Note that Σ_{i=1}^{M} ξ_i is an upper bound on the number of misclassified training data points, since for a misclassified x_i the corresponding ξ_i > 1. Furthermore note that from the problem formulation (2.1) one can solve for ξ_i:

\[ l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right) \ge 1 - \xi_i \iff \xi_i \ge 1 - l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right) \]

Therefore, with ξ_i ≥ 0, the following must hold:

\[ \xi_i \ge \max\left\{1 - l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right),\, 0\right\} = \left[1 - l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right)\right]_+ \]

Since one wants to minimize with respect to ξ_i, one should choose it to be [1 − l_i(⟨w, Φ(x_i)⟩_2 + b)]_+.

Define the following terms:

Definition 5. The functions

\[ H_D(w, b) = \mathbb{E}_{(x,y)\sim D}\left[1 - y\left(\langle w, \Phi(x)\rangle_2 + b\right)\right]_+ \qquad (2.2) \]

and

\[ H_S(w, b) = \sum_{i=1}^{M}\left[1 - l_i\left(\langle w, \Phi(x_i)\rangle_2 + b\right)\right]_+ \qquad (2.3) \]

are called the hinge loss and the empirical hinge loss of the classifier with respect to the data distribution D and the data set S, respectively. Subsequently one can define

\[ L_D(w, b) = \frac{1}{2C}\|w\|_2^2 + H_D(w, b) \qquad (2.4) \]

and

\[ L_S(w, b) = \frac{1}{2C}\|w\|_2^2 + H_S(w, b) \qquad (2.5) \]

as the regularization loss and the empirical regularization loss of the classifier.

Now, with the above deduced interpretation of the ξ_i, it is obvious that solving problem (2.1) is equivalent to solving

\[ \operatorname*{argmin}_{w,b}\ L_S(w, b) \qquad (2.6) \]

This formulation, though not needed yet, will be used later on.
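A direct numerical transcription of (2.3) and (2.5) for a linear classifier, i.e. with the identity feature map Φ(x) = x, may make the definitions concrete (my own sketch; the data and the chosen w, b, C are arbitrary).

```python
import numpy as np

def empirical_hinge_loss(w, b, X, labels):
    """H_S(w, b) = sum_i [1 - l_i (<w, x_i> + b)]_+ for labels l_i in {-1, +1}."""
    margins = labels * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0).sum()

def empirical_regularization_loss(w, b, X, labels, C):
    """L_S(w, b) = ||w||_2^2 / (2 C) + H_S(w, b), as in (2.5)."""
    return np.dot(w, w) / (2.0 * C) + empirical_hinge_loss(w, b, X, labels)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
labels = np.sign(X[:, 0] + 0.1 * rng.standard_normal(50))   # noisy linear labels in {-1, +1}
w, b, C = np.array([1.0, 0.0, 0.0]), 0.0, 1.0

print(empirical_hinge_loss(w, b, X, labels))
print(empirical_regularization_loss(w, b, X, labels, C))
```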

2.4 Support Vector Machines

A Support Vector Machine (SVM) is the summarized optimization concept for the above discussed maximum margin formulation, extension and relaxation (2.1), using Lagrange multipliers. The corresponding Lagrange function is

\[ L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{M}\xi_i + \sum_{i=1}^{M}\alpha_i\left(1 - l_i y(x_i) - \xi_i\right) - \sum_{i=1}^{M}\mu_i\xi_i \qquad (2.7) \]

subject to α_i ≥ 0 and µ_i ≥ 0. One can now optimize with respect to the primary variables w, b, ξ:

\[ \frac{\partial L(w, b, \xi, \alpha, \mu)}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i=1}^{M}\alpha_i l_i \Phi(x_i) \qquad (2.8) \]
\[ \frac{\partial L(w, b, \xi, \alpha, \mu)}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{M}\alpha_i l_i = 0 \qquad (2.9) \]
\[ \frac{\partial L(w, b, \xi, \alpha, \mu)}{\partial \xi_i} = 0 \ \Rightarrow\ \alpha_i = C - \mu_i \qquad (2.10) \]

Putting (2.8) and (2.10) into (2.7) eliminates the primary variables and, under consideration of (2.9), the dual form emerges:

\[ L(\alpha) = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{M}\alpha_i\alpha_j l_i l_j \kappa(x_i, x_j), \quad \text{subject to} \qquad (2.11) \]
\[ 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{M}\alpha_i l_i = 0, \]

where the upper bound on the α_i follows from (2.10) and the fact that µ_i ≥ 0. For this optimization problem strong duality holds, such that the solutions of the primal and dual problem are equal. Recall the implications on the ξ_i for different locations of x_i. One can analogously give implications for different values of α_i:

α_i = 0 ⇒ µ_i = C > 0 ⇒ ξ_i = 0 ⇒ x_i lies outside the margin boundary on the correct side

0 < α_i < C: then α_i < C ⇒ µ_i > 0 ⇒ ξ_i = 0, and 0 < α_i ⇒ 1 − l_i y(x_i) − ξ_i = 0 ⇒ l_i y(x_i) = 1 ⇒ x_i lies on the margin boundary on the correct side

α_i = C ⇒ µ_i = 0 ⇒ ξ_i ≥ 0 ⇒ x_i lies within or on the wrong side of the margin boundary

The different implications follow from (2.10) and the Karush-Kuhn-Tucker conditions on the solution, i.e. α_i(1 − l_i y(x_i) − ξ_i) = 0 and µ_i ξ_i = 0, for all i ∈ {1,...,M}. From the dual form one can see that for α_i = 0 the corresponding data point x_i is not relevant for the prediction. When C is chosen reasonably, most of the data points should lie outside the margin boundary on the correct side. Since exactly then α_i = 0, the SVM-classifier makes sparse use of the data points.
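This sparsity can be observed directly on the dual coefficients of a fitted kernel SVM. The sketch below uses scikit-learn on synthetic data; the data set and parameter values are my own illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# Only points with alpha_i > 0 are kept as support vectors; all others are
# irrelevant for the prediction, as noted above.
print("training points :", X.shape[0])
print("support vectors :", clf.support_.size)
print("points at the box constraint alpha_i = C:",
      int((abs(clf.dual_coef_) >= clf.C - 1e-12).sum()))
```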

3 Introduction to Compressed Learning

Compressed learning is in general about solving a machine learning problem while preliminarily compressing the data as a preprocessing step, where the form of compression guarantees a preservation of the problem structure. This permits one to solve the original problem when solely considering the compressed version of it. The benefits of this are, due to the dimension reduction, a raised computational performance and an expansion of the considerable problem space to settings where no direct access to the data is possible or feasible, but only linear measurements of the data. Which properties of the compression are sufficient to guarantee the problem preservation for classification tasks, and how sparsity of the data influences the quality, will be discussed within the remainder of the chapter.
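The general recipe of this chapter can be sketched in a few lines: compress every data point with one fixed random sensing matrix A and train a linear SVM on the measurements z = Ax instead of on x. The code below is my own illustration, not the experiments of Chapter 5; the synthetic sparse data, the sparsity level, and all parameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, m, M, k = 1000, 60, 400, 10                 # data dim, measurement dim, samples, sparsity

# k-sparse data points whose label depends linearly on the (hidden) support
X = np.zeros((M, n))
support = rng.choice(n, size=k, replace=False)
X[:, support] = rng.standard_normal((M, k))
w_true = np.zeros(n); w_true[support] = rng.standard_normal(k)
y = np.sign(X @ w_true)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # one fixed Gaussian sensing matrix
Z = X @ A.T                                    # compressed measurements z_i = A x_i

Xtr, Xte, Ztr, Zte, ytr, yte = train_test_split(X, Z, y, random_state=0)

acc_data = LinearSVC(dual=False).fit(Xtr, ytr).score(Xte, yte)   # learn in the data domain
acc_meas = LinearSVC(dual=False).fit(Ztr, ytr).score(Zte, yte)   # learn in the measurement domain

print(f"accuracy on raw data (n={n}): {acc_data:.3f}")
print(f"accuracy on measurements (m={m}): {acc_meas:.3f}")
```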

3.1 The Conceptual Idea

Assume X to be a bounded subset of an n-dimensional space, so without loss of generality X ⊆ R^n. Note that, although the choice of X is quite arbitrary, later on X will usually stand for a bounded ball centered at zero, B^n_R(0), or the bounded ball restricted to k-sparse vectors, B^n_R(0) ∩ Σ_k. Denote by R = max_{x∈X} ‖x‖_2 the radius and by ℓ_{2→1}(X) = max_{x∈X} ‖x‖_1/‖x‖_2 the distortion of the data space. Define the measurement domain as AX = {z = Ax + e | x ∈ X, ‖e‖_2 ≤ σR} for the sensing matrix A ∈ R^{m×n}. The error term e describes a noise contamination due to an inaccurate measurement process. Respectively, let AS = {(z_1, l_1), ..., (z_M, l_M)} be the measurement representation of the training set.

Two properties of the sensing matrix, call them scaling property and angle-preserving property for the moment, and a theorem that connects the regularization loss of the SVM-classifier with the minimizer of the regularization loss, are needed to relate an SVM-classifier learned in the measurement domain to the true maximum margin classifier in the data space with respect to the hinge loss. Figure 3 illustrates the relation chain, where v_{AS} is the SVM-classifier in AX, w_S and Aw_S are the SVM-classifier in X and its projection to AX, v^*, w^* are the minimizers of the regularization loss in AX and X, and w_0 is the maximal margin classifier in X. Each arrow in Figure 3 corresponds to an argument that bounds the regularization loss of the classifier at the arrow root in terms of the regularization loss of the classifier at the arrow tip.

Figure 3: Classifier Relations

3.2 Links of the Chain

First consider the theorem mentioned in the last paragraph, which gives in some sense a guarantee for a classifier with small empirical regularization loss to have nearly optimal true regularization loss.

Theorem 4 (Theorem 1, [SSSs08]). For every linear threshold classifier w with ‖w‖_2^2 ≤ 2C, with probability at least 1 − δ over S the following holds:

\[ L_D(w) - L_D(w^*) \le 2\left[L_S(w) - L_S(w^*)\right]_+ + O\!\left(\frac{R^2 C \log(\tfrac{1}{\delta})}{M}\right) \qquad (3.1) \]

In order that the theorem applies to the SVM-classifiers, one has to show that ‖w_S‖_2^2, ‖v_{AS}‖_2^2 ≤ 2C.

Proof. Since w_S is the SVM-classifier, by (2.8), it can be written as

\[ w_S = \sum_{i=1}^{M}\alpha_i l_i x_i \]

and from (2.11) with C ← C/M the support vectors are bounded by α_i ≤ C/M. Then

\[ \frac{1}{2}\|w_S\|_2^2 \le \frac{1}{2}\|w_S\|_2^2 + \frac{C}{M}\sum_{i=1}^{M}\xi_i = \sum_{i=1}^{M}\alpha_i - \frac{1}{2}\|w_S\|_2^2 \le C - \frac{1}{2}\|w_S\|_2^2, \]

where the equality holds due to α_i and w_S being optimal with respect to the primal and the dual problem formulations (2.1) and (2.11), whose values are equal due to strong duality. Therefore

\[ \|w_S\|_2^2 \le C. \]

The proof for ‖v_{AS}‖_2^2 ≤ C is conducted completely analogously.

Recall that the radius of X was R. Therefore, by Theorem 4, and using the fact that [L_S(w_S) − L_S(w^*)]_+ = 0 holds, as w_S is the minimizer of the empirical regularization loss, with probability at least 1 − δ

\[ L_D(w_S) \le L_D(w^*) + O\!\left(\frac{R^2 C \log(\tfrac{1}{\delta})}{M}\right) \qquad (3.2) \]

In order to bound the radius of AX one has to regulate the 'scaling behavior' of A, which is captured within the DP-scaling property:

\[ \forall x \in X : \|Ax\|_2 \le L\|x\|_2 \qquad (3.3) \]

for some constant L > 1 with high probability. Then for z = Ax + e ∈ AX,

\[ \|z\|_2 \le \|Ax\|_2 + \|e\|_2 \le L\|x\|_2 + \sigma R \le (L+\sigma)R, \]

such that the radius of AX is given by (L+σ)R, and with analogous justification as above, with probability at least 1 − δ

\[ L_D(v_{AS}) \le L_D(v^*) + O\!\left(\frac{(L+\sigma)^2 R^2 C \log(\tfrac{1}{\delta})}{M}\right) \qquad (3.4) \]

The last unclear link connects the regularization loss of w_S with its projection to the measurement domain. This connection, roughly speaking, captures the ability to transform S en bloc (and thereby the problem structure) to the measurement domain. For this purpose define the DP-angle-preserving property, which suffices to guarantee this ability:

\[ \forall i, j \in \{1,\dots,M\} : |\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \le \varepsilon R^2 \qquad (3.5) \]

with probability exceeding 1 − δ for some constants ε, δ ≥ 0. Consider the following lemma:

Lemma 2 (Lemma 10.2, [CJK12]). Let

\[ w_1 = \sum_{i=1}^{M} s_i l_i x_i, \quad w_2 = \sum_{i=1}^{M} t_i l_i x_i \quad \text{and} \quad v_1 = \sum_{i=1}^{M} s_i l_i z_i, \quad v_2 = \sum_{i=1}^{M} t_i l_i z_i \]

be linear combinations of S and AS respectively, where s_1,...,s_M, t_1,...,t_M ≥ 0 such that Σ_{i=1}^{M} s_i ≤ D_1 and Σ_{i=1}^{M} t_i ≤ D_2. Under the assumption of (3.5),

\[ |\langle w_1, w_2\rangle_2 - \langle v_1, v_2\rangle_2| \le D_1 D_2 R^2 \varepsilon \]

holds with probability exceeding 1 − δ.

Proof.

\[ |\langle w_1, w_2\rangle_2 - \langle v_1, v_2\rangle_2| = \Big|\sum_{i,j=1}^{M} s_i t_j l_i l_j\left(\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2\right)\Big| \le \sum_{i,j=1}^{M} s_i t_j\,|\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \overset{(3.5)}{\le} \varepsilon R^2 \Big(\sum_{i=1}^{M} s_i\Big)\Big(\sum_{j=1}^{M} t_j\Big) \le D_1 D_2 R^2 \varepsilon \]

Using Lemma 2 one can now connect the regularization losses of w_S and its projection to the measurement domain.

Lemma 3 (Lemma 10.3, [CJK12]). Let w_S = Σ_{i=1}^{M} α_i l_i x_i be the SVM-classifier with respect to S. Then Aw_S = Σ_{i=1}^{M} α_i l_i z_i. Under the assumption of (3.5), but for (M+1) data points,

\[ L_D(Aw_S) \le L_D(w_S) + \frac{3CR^2\varepsilon}{2} \qquad (3.6) \]

holds with probability exceeding 1 − δ.

Proof. As Lemma 2 applies with w_1 = w_2 = w_S and v_1 = v_2 = Aw_S and, by definition of the support vectors, Σ_{i=1}^{M} α_i ≤ C =: D_1, D_2,

\[ \|Aw_S\|_2^2 \le \|w_S\|_2^2 + C^2 R^2\varepsilon \iff \frac{1}{2C}\|Aw_S\|_2^2 \le \frac{1}{2C}\|w_S\|_2^2 + \frac{CR^2\varepsilon}{2} \]

Let (x_{M+1}, l_{M+1}) be a new sample drawn i.i.d. from D. Again let w_1 = w_S and v_1 = Aw_S with D_1 = C, and w_2 = l_{M+1}x_{M+1}, v_2 = l_{M+1}Ax_{M+1} with D_2 = 1. With Lemma 2,

\[ l_{M+1}\langle w_S, x_{M+1}\rangle_2 \le l_{M+1}\langle Aw_S, Ax_{M+1}\rangle_2 + CR^2\varepsilon \]
\[ \iff 1 - l_{M+1}\langle Aw_S, Ax_{M+1}\rangle_2 \le 1 - l_{M+1}\langle w_S, x_{M+1}\rangle_2 + CR^2\varepsilon \]
\[ \Rightarrow 1 - l_{M+1}\langle Aw_S, Ax_{M+1}\rangle_2 \le \left[1 - l_{M+1}\langle w_S, x_{M+1}\rangle_2\right]_+ + CR^2\varepsilon \]
\[ \Rightarrow \left[1 - l_{M+1}\langle Aw_S, Ax_{M+1}\rangle_2\right]_+ \le \left[1 - l_{M+1}\langle w_S, x_{M+1}\rangle_2\right]_+ + CR^2\varepsilon \]

Therefore

\[ \mathbb{E}_{(x_{M+1}, l_{M+1})\sim D}\left[1 - l_{M+1}\langle Aw_S, Ax_{M+1}\rangle_2\right]_+ \le \mathbb{E}_{(x_{M+1}, l_{M+1})\sim D}\left[1 - l_{M+1}\langle w_S, x_{M+1}\rangle_2\right]_+ + CR^2\varepsilon, \]

which, due to independence of the data points, is the same as

\[ H_D(Aw_S) \le H_D(w_S) + CR^2\varepsilon \]

Summing up both inequalities gives the statement of the lemma.

The two properties (3.3) and (3.5) were summarized by Calderbank and Jafarpour [CJK12] as follows:

Definition 6 ((M, L, ε, δ)-Distance Preserving Property). Let L > 0, let 0 < ε, δ < 1 be small, and let M be the number of training examples. Then A is distance preserving (DP) if with probability 1 − δ the following holds:

\[ \forall x \in X : \|Ax\|_2 \le L\|x\|_2 \qquad (DP\text{-scaling property (3.3)}) \]
\[ \forall i \in \{1,\dots,M\} : (1-\varepsilon)\|x_i\|_2^2 \le \|z_i\|_2^2 \le (1+\varepsilon)\|x_i\|_2^2 \]
\[ \forall i \ne j \in \{1,\dots,M\} : |\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \le \varepsilon R^2 \qquad (DP\text{-angle-preserving property (3.5)}) \]
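For a concrete A, training set, and noise level the three conditions of Definition 6 can be checked empirically (on the training points only); a minimal sketch with arbitrary parameter choices of my own:

```python
import numpy as np

def dp_violations(A, X, E, eps, L):
    """Count violations of the three DP conditions for data X (rows x_i),
    measurements z_i = A x_i + e_i, and radius R = max_i ||x_i||_2."""
    Z = X @ A.T + E
    R = np.linalg.norm(X, axis=1).max()

    norms_x = np.linalg.norm(X, axis=1)
    norms_z = np.linalg.norm(Z, axis=1)
    scaling = np.sum(np.linalg.norm(X @ A.T, axis=1) > L * norms_x)       # ||A x_i|| <= L ||x_i||
    norm_pres = np.sum((norms_z**2 < (1 - eps) * norms_x**2) |
                       (norms_z**2 > (1 + eps) * norms_x**2))             # (1 +- eps) norm preservation
    G_x, G_z = X @ X.T, Z @ Z.T
    off = ~np.eye(len(X), dtype=bool)
    angle = np.sum(np.abs(G_x - G_z)[off] > eps * R**2)                   # |<x_i,x_j> - <z_i,z_j>| <= eps R^2

    return scaling, norm_pres, angle

rng = np.random.default_rng(0)
n, m, M, sigma = 200, 80, 50, 0.01
X = rng.standard_normal((M, n))
A = rng.standard_normal((m, n)) / np.sqrt(m)
R = np.linalg.norm(X, axis=1).max()
E = sigma * R * rng.standard_normal((M, m)) / np.sqrt(m)                  # noise with ||e_i||_2 roughly sigma R

print(dp_violations(A, X, E, eps=0.5, L=1.5))
```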

3.3 Applying the Chain

One can now come up with a theorem, that aggregates all the single links.

Theorem 5 (Theorem 10.3, [CJK12]). Let A be (M+1, L, ε, δ_1)-DP, δ_2 > 0, δ := δ_1 + 2δ_2, and choose a reasonable C that minimizes the difference between the two hinge losses of interest. Then with probability exceeding 1 − δ:

\[ H_D(v_{AS}) \le H_D(w_0) + O\!\left( R\|w_0\|_2 \sqrt{\varepsilon + \frac{(L+\sigma)^2\log(\tfrac{1}{\delta_2})}{M}} \right). \qquad (3.7) \]

Proof. First observe that

\[ H_D(v_{AS}) \le \frac{1}{2C}\|v_{AS}\|_2^2 + H_D(v_{AS}) = L_D(v_{AS}). \]

Using (3.4), it holds that

\[ L_D(v_{AS}) \le L_D(v^*) \]

up to an error of the order O((L+σ)²R²C log(1/δ_2)/M). Due to v^* minimizing the regularization loss, obviously

\[ L_D(v^*) \le L_D(Aw_S). \]

By Lemma 3,

\[ L_D(Aw_S) \le L_D(w_S), \]

up to an error not exceeding a magnitude of O(3CR²ε/2). Now, with (3.2), it holds that

\[ L_D(w_S) \le L_D(w^*) \]

up to an error of the order O(2R²C log(1/δ_2)/M). Finally, again by the fact that w^* minimizes the regularization loss,

\[ L_D(w^*) \le L_D(w_0) = H_D(w_0) + \frac{1}{2C}\|w_0\|_2^2. \]

Altogether, it holds that

\[ H_D(v_{AS}) \le H_D(w_0) + O\!\left( \frac{1}{2C}\|w_0\|_2^2 + \frac{3CR^2\varepsilon}{2} + \frac{(1+(L+\sigma)^2)\,R^2 C\,\log(\tfrac{1}{\delta_2})}{M} \right). \]

Now estimate the C that minimizes the latter part of the expression, enforcing a stronger relation between H_D(v_{AS}) and H_D(w_0). Call

\[ G = \frac{3\varepsilon}{2} + \frac{(1+(L+\sigma)^2)\log(\tfrac{1}{\delta_2})}{M}. \]

Then the latter part is given by \frac{1}{2C}\|w_0\|_2^2 + CR^2 G. Setting its derivative with respect to C to zero,

\[ -\frac{\|w_0\|_2^2}{2C^2} + R^2 G = 0 \iff C = \frac{\|w_0\|_2}{R\sqrt{2G}}, \]

and applying this C gives

\[ \frac{1}{2C}\|w_0\|_2^2 + CR^2 G = \frac{1}{\sqrt{2}}\|w_0\|_2 R\sqrt{G} + \frac{1}{\sqrt{2}}\|w_0\|_2 R\sqrt{G} = \sqrt{2}\,\|w_0\|_2 R\sqrt{G} \in O\!\left( \|w_0\|_2 R\sqrt{\varepsilon + \frac{(L+\sigma)^2\log(\tfrac{1}{\delta_2})}{M}} \right) \]

3.4 A DP-Matrix via JLP

There already exists plenty of literature about similar matrix properties. In order to make use of the rich analysis of these matrices, it is desirable to reduce the needed DP-property to their properties. For that, first define the well-studied Johnson-Lindenstrauss property.

Definition 7 ((ε, ρ)-Johnson-Lindenstrauss Property). A satisfies the Johnson-Lindenstrauss property ((ε, ρ)-JLP) if, with probability exceeding 1 − ρ,

\[ (1-\varepsilon)\|x_1 - x_2\|_2^2 \le \|A(x_1 - x_2)\|_2^2 \le (1+\varepsilon)\|x_1 - x_2\|_2^2 \qquad (3.8) \]

holds for two arbitrary x_1, x_2 ∈ X.

One can deduce the DP-property for a measurement matrix A from JLP by relating each of the two DP-property parts to JLP and, in advance, enforcing A to satisfy both JLP conditions simultaneously.

3.4.1 DP-Scaling Property from JLP

A matrix satisfying JLP will also possess the DP-scaling property with adjusted parameters. This can be deduced using a relation from JLP to RIP and subsequently an energy bound on matrices exhibiting RIP.

Lemma 4 ([CJK12]). Let A satisfy (ε_2, ρ_2)-JLP and k ∈ N. Then

\[ \forall x \in X : \|Ax\|_2 \le L\|x\|_2 \]

where L = \sqrt{1+5\varepsilon_2}\left(1 + \frac{\ell_{2\to1}(X)}{\sqrt{k}}\right), with probability exceeding 1 - \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}.

Proof. A theorem from [BDDW08] states that, for k ∈ N, if A satisfies (ε_2, ρ_2)-JLP, it will satisfy (k, 5ε_2)-RIP with probability exceeding 1 - \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}. Proposition 3.5 from [NT09] gives an upper energy bound, i.e. a possibility to delimit the Lipschitz constant of A, if A satisfies RIP: if A satisfies (k, ε)-RIP, then for every x ∈ X

\[ \|Ax\|_2 \le \sqrt{1+\varepsilon}\left(\|x\|_2 + \frac{1}{\sqrt{k}}\|x\|_1\right). \]

Applying the proposition to the A at hand,

\[ \|Ax\|_2 \le \sqrt{1+5\varepsilon_2}\left(\|x\|_2 + \frac{1}{\sqrt{k}}\|x\|_1\right) = \sqrt{1+5\varepsilon_2}\left(1 + \frac{1}{\sqrt{k}}\frac{\|x\|_1}{\|x\|_2}\right)\|x\|_2 \le \sqrt{1+5\varepsilon_2}\left(1 + \frac{\ell_{2\to1}(X)}{\sqrt{k}}\right)\|x\|_2 \]

holds for every x ∈ X.

3.4.2 DP-Angle-Preserving Property from JLP

For an A that satisfies JLP, the DP-angle-preserving property can be derived.

Lemma 5 ([CJK12]). Consider the set X = {x_1,...,x_{M+1}} ∪ {0} of arbitrary feature vectors. Let A satisfy JLP with distortion parameter ε_1 uniformly over all pairs of X with probability exceeding 1 − δ. Then

\[ \forall i \in \{1,\dots,M+1\} : (1-\varepsilon)\|x_i\|_2^2 \le \|Ax_i\|_2^2 \le (1+\varepsilon)\|x_i\|_2^2 \]
\[ \forall i \ne j \in \{1,\dots,M+1\} : |\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \le \varepsilon R^2 \]

where ε = 3ε_1 + 2σ + σ².

Proof. By assumption,

\[ \forall i \in \{1,\dots,M+1\} : (1-\varepsilon_1)\|x_i\|_2^2 \le \|Ax_i\|_2^2 \le (1+\varepsilon_1)\|x_i\|_2^2 \quad \text{and} \]
\[ \forall i \ne j \in \{1,\dots,M+1\} : (1-\varepsilon_1)\|x_i - x_j\|_2^2 \le \|A(x_i - x_j)\|_2^2 \le (1+\varepsilon_1)\|x_i - x_j\|_2^2 \]

with probability exceeding 1 − δ. From the inequalities

\[ \text{(a)}\quad \|A(x_i - x_j)\|_2^2 \le (1+\varepsilon_1)\|x_i - x_j\|_2^2 = (1+\varepsilon_1)\left[\|x_i\|_2^2 + \|x_j\|_2^2 - 2\langle x_i, x_j\rangle_2\right] \]
\[ \text{(b)}\quad \|A(x_i - x_j)\|_2^2 = \|Ax_i\|_2^2 + \|Ax_j\|_2^2 - 2\langle Ax_i, Ax_j\rangle_2 \ge (1-\varepsilon_1)\left[\|x_i\|_2^2 + \|x_j\|_2^2\right] - 2\langle Ax_i, Ax_j\rangle_2 \]

one can infer

\[ (1+\varepsilon_1)\langle x_i, x_j\rangle_2 \le \langle Ax_i, Ax_j\rangle_2 + 2R^2\varepsilon_1 \ \Rightarrow\ \langle x_i, x_j\rangle_2 \le \langle Ax_i, Ax_j\rangle_2 + 3R^2\varepsilon_1. \]

Analogously one can show that

\[ \langle Ax_i, Ax_j\rangle_2 \le \langle x_i, x_j\rangle_2 + 3R^2\varepsilon_1. \]

Combining the two inequalities gives

\[ |\langle x_i, x_j\rangle_2 - \langle Ax_i, Ax_j\rangle_2| \le 3R^2\varepsilon_1. \]

Bearing in mind the noisiness of the projection, it follows that

\[ |\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| = |\langle x_i, x_j\rangle_2 - \langle Ax_i + e_i, Ax_j + e_j\rangle_2| \le |\langle x_i, x_j\rangle_2 - \langle Ax_i, Ax_j\rangle_2| + |\langle e_i, Ax_j\rangle_2| + |\langle Ax_i, e_j\rangle_2| + |\langle e_i, e_j\rangle_2| \]
\[ \le 3R^2\varepsilon_1 + \|e_i\|_2\|Ax_j\|_2 + \|Ax_i\|_2\|e_j\|_2 + \|e_i\|_2\|e_j\|_2 \le 3R^2\varepsilon_1 + 2R^2\sigma + R^2\sigma^2 = R^2(3\varepsilon_1 + 2\sigma + \sigma^2) \]

Lemma 4 and Lemma 5 together obviously relate JLP to DP, as they guarantee both parts of DP.

Lemma 6 (Lemma 10.4, [CJK12]). Let A satisfy (ε_1, ρ_1)-JLP and (ε_2, ρ_2)-JLP and k ∈ N. Then A is (M+1, L, ε, δ)-DP, where

\[ L = \sqrt{1+5\varepsilon_2}\left(1 + \frac{\ell_{2\to1}(X)}{\sqrt{k}}\right), \qquad \varepsilon = 3\varepsilon_1 + 2\sigma + \sigma^2, \qquad \delta = \rho_1\binom{M+2}{2} + \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}. \]

Proof. Up to an error probability of \rho_1\binom{M+2}{2}, A will satisfy JLP with distortion parameter ε_1 uniformly over all \binom{M+2}{2} pairs of the set {x_1,...,x_{M+1}} ∪ {0}. The statement then simply emerges by applying Lemma 4 and Lemma 5 simultaneously.

Now, for an A that satisfies JLP, by Lemma 6 it also satisfies DP, such that learning in the measurement domain becomes possible.

Theorem 6 (Theorem 10.5, [CJK12]). Let k ∈ N and let A satisfy (ε_1, ρ_1)-JLP and (ε_2, ρ_2)-JLP with

\[ \rho_1 \le \frac{1}{6}\binom{M+2}{2}^{-1} \quad\text{and}\quad \rho_2 \le \frac{1}{6}\left[\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right]^{-1}. \]

Then with probability at least 1 - 3\left[\rho_1\binom{M+2}{2} + \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right],

\[ H_D(v_{AS}) \le H_D(w_0) + O\!\left( R\|w_0\|_2\sqrt{\varepsilon_1 + \sigma + (1+\varepsilon_2)\left(1 + \frac{\ell_{2\to1}(X)}{\sqrt{k}}\right)^{2}\frac{\log\left[\rho_1\binom{M+2}{2} + \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right]^{-1}}{M}} \right). \qquad (3.9) \]

Proof. Apply Lemma 6 to Theorem 5.

3.4.3 A Gaussian sampled DP-Matrix

Several classes of matrices possess the Johnson-Lindenstrauss property. One simple example is given by a randomly i.i.d. sampled Gaussian matrix.

Proposition 1 (Theorem 2.1, [DG03]). Let A ∈ R^{m×n} with entries sampled i.i.d. from a N(0, 1/m) Gaussian distribution and ε ≥ 0. Then A satisfies

\[ \left(\varepsilon,\; 2\exp\!\left\{-\frac{m}{2}\left(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}\right)\right\}\right)\text{-}JLP \]

in R^n.
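Proposition 1 can be turned into a concrete embedding dimension: requiring the per-pair failure probability times the number of pairs to stay below η (a union bound, in the spirit of the proof of Lemma 6) gives a sufficient m. The computation below is my own illustration with arbitrary numbers, not a result from the thesis.

```python
from math import comb, log, ceil

def gaussian_jl_dim(p, eps, eta):
    """Smallest m such that 2*exp(-(m/2)*(eps^2/2 - eps^3/3)) * C(p,2) <= eta,
    i.e. all C(p,2) pairwise distances are (1 +- eps)-preserved with prob. >= 1 - eta."""
    rate = eps**2 / 2 - eps**3 / 3
    return ceil(2 * log(2 * comb(p, 2) / eta) / rate)

for p in (100, 10_000, 1_000_000):
    print(p, gaussian_jl_dim(p, eps=0.2, eta=0.01))
```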

It is now straightforward to state the preservation of the learning problem when taking measurements with a Gaussian sampled matrix.

Corollary 1 (Corollary 10.2, [CJK12]). Let X = B^n_R(0) be the ball of radius R in R^n. Let A ∈ R^{m×n} be drawn from a N(0, 1/m) Gaussian distribution. Then there exist universal constants κ_1 and κ_2 such that for every k ∈ N, if

\[ m \ge \kappa_1\left(\frac{\log M}{\varepsilon_1^2} + \frac{k\log\frac{n}{k}}{\varepsilon_2^2}\right), \]

then with probability at least 1 - 6\left[\exp\{-\kappa_2 m\varepsilon_1^2\} + \exp\{-\kappa_2 m\varepsilon_2^2\}\right],

\[ H_D(v_{AS}) \le H_D(w_0) + O\!\left( R\|w_0\|_2\sqrt{\varepsilon_1 + \sigma + \frac{(1+\varepsilon_2)\, n\, m\,\varepsilon_1^2}{kM}} \right). \]

Proof. A satisfies (ε_1, ρ_1)-JLP and (ε_2, ρ_2)-JLP with

\[ \rho_1 = 2\exp\!\left\{-\frac{m}{2}\left(\frac{\varepsilon_1^2}{2} - \frac{\varepsilon_1^3}{3}\right)\right\} \quad\text{and}\quad \rho_2 = 2\exp\!\left\{-\frac{m}{2}\left(\frac{\varepsilon_2^2}{2} - \frac{\varepsilon_2^3}{3}\right)\right\}. \]

As in Theorem 6, the statement holds with probability exceeding

\[ 1 - 3\left[\rho_1\binom{M+2}{2} + \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right] = 1 - 6\left[\exp\!\left\{\log\binom{M+2}{2} - \frac{m}{2}\left(\frac{\varepsilon_1^2}{2} - \frac{\varepsilon_1^3}{3}\right)\right\} + \exp\!\left\{\log\!\left[\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right] - \frac{m}{2}\left(\frac{\varepsilon_2^2}{2} - \frac{\varepsilon_2^3}{3}\right)\right\}\right] \]

Observe that

\[ \log\binom{M+2}{2} = O(\log M) \quad\text{and}\quad \log\!\left[\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right] = O\!\left(k\log\frac{n}{k}\right). \]

So, for appropriate κ_1 and κ_2 and mε_1^2 ≥ κ_1 log M and mε_2^2 ≥ κ_1 k log(n/k), i.e. such that log\binom{M+2}{2} ∈ O(mε_1^2) and log[\binom{n}{k}(6/\sqrt{\varepsilon_2})^{k}] ∈ O(mε_2^2),

\[ 1 - 3\left[\rho_1\binom{M+2}{2} + \rho_2\binom{n}{k}\left(\frac{6}{\sqrt{\varepsilon_2}}\right)^{k}\right] \approx 1 - 6\left[\exp\{-\kappa_2 m\varepsilon_1^2\} + \exp\{-\kappa_2 m\varepsilon_2^2\}\right]. \]

The hinge loss bound emerges simply by plugging in the variables, noting that ℓ_{2→1}(X) = √n.

When considering sparse data this result can be further improved.

Corollary 2 (Corollary 10.3, [CJK12]). Let X = B^n_R(0) ∩ Σ_k be the set of k-sparse vectors within the ball of radius R in R^n. Let the entries of A ∈ R^{m×n} be drawn i.i.d. from a N(0, 1/m) Gaussian distribution. Then there exist universal constants κ_1 and κ_2 such that for every k ∈ N, if

\[ m \ge \kappa_1\left(\frac{\log M}{\varepsilon_1^2} + \frac{k\log\frac{n}{k}}{\varepsilon_2^2}\right), \]

then with probability at least 1 - 6\left[\exp\{-\kappa_2 m\varepsilon_1^2\} + \exp\{-\kappa_2 m\varepsilon_2^2\}\right],

\[ H_D(v_{AS}) \le H_D(w_0) + O\!\left( R\|w_0\|_2\sqrt{\varepsilon_1 + \sigma + \frac{(1+\varepsilon_2)\, m\,\varepsilon_1^2}{M}} \right). \]

Proof. The proof is analogous to the former, except that here ℓ_{2→1}(X) = √k holds.

4 Discussion

4.1 Inessentiality of the DP-scaling property

Enforcing the DP-scaling property (recall Definition 6), i.e. prohibiting rescaling of the feature space, is already by intuition not reasonable, as the actual learning problem is invariant under that operation. For example, the optimal decision hyperplane remains the same when uniformly rescaling the feature space, apart from an adjustment of the bias. The learnability is thereby not affected.

In case this is not clear to the reader at this point, one has to emphasize that, in fact, the generalization loss defines a measure much more adequate to describe the learning aptitude of a classifier than the regularization loss (recall 2.4). The generalization loss is given by

\[ R_D(f) = \mathbb{E}_{(x,y)\sim D}\, 1_{\operatorname{sign}(f(x))\ne y}, \]

which one can interpret as the probability of misclassification of the classifier f.

Proposition 2. If the minimal generalization loss is the adequate measure for the learnability of a linear classification problem, then the proposed scaling invariance holds.

Proof. Consider the set H of affine linear mappings, i.e. if f ∈ H, then there exist w, b such that f(x) = ⟨w, x⟩_2 + b. Assume a rescaling A : x ↦ αx for α > 0. Then for every f ∈ H one can assign a unique g ∈ H such that f has the same generalization loss in X as g in the rescaled space AX. This one-to-one and onto mapping obviously guarantees conservation of the minimal generalization loss. Define therefore g(x) = ⟨w, x⟩_2 + αb and F(x, y) = (αx, y).

\[ \int_{X\times Y} 1_{\operatorname{sign}(f(x))\ne y}\, dD(x, y) = \int_{F(X\times Y)} 1_{\operatorname{sign}(f(x))\ne y}(F^{-1}(z, y))\, dD\circ F^{-1}(z, y) \]
\[ = \int_{\alpha X\times Y} 1_{F(\{\operatorname{sign}(f(x))\ne y\})}(z, y)\, d\mathbb{P}(X, Y)^{-1}\circ F^{-1}(z, y) = \int_{\alpha X\times Y} 1_{\operatorname{sign}(f(\frac{z}{\alpha}))\ne y}(z, y)\, d\mathbb{P}(F(X, Y))^{-1}(z, y) \]
\[ = \int_{\alpha X\times Y} 1_{\operatorname{sign}(\langle w, \frac{z}{\alpha}\rangle_2 + b)\ne y}(z, y)\, d\mathbb{P}(\alpha X, Y)^{-1}(z, y) = \int_{AX\times Y} 1_{\operatorname{sign}(\frac{1}{\alpha}[\langle w, z\rangle_2 + \alpha b])\ne y}(z, y)\, dAD(z, y) \]
\[ = \int_{AX\times Y} 1_{\operatorname{sign}(\frac{1}{\alpha}g(z))\ne y}(z, y)\, dAD(z, y) = \int_{AX\times Y} 1_{\operatorname{sign}(g(z))\ne y}(z, y)\, dAD(z, y) \]

However, the hinge loss scales along with this operation, as it accounts for the absolute distance to the margin. Now, when relating the hinge loss in the measurement domain and the data domain, as was done in Theorem 5, one has to take the scaling factor into consideration.

The simple reason for using the regularization loss instead of the generalization loss to estimate good classifiers is that the former gives rise to a solvable, convex problem, whose solution can be derived efficiently, contrary to the latter. When choosing an adequate classifier rule, which assigns to each observation set $X = \{x_1, \ldots, x_M\}$ of i.i.d. samples from the data distribution $\mathcal{D}$ a classifier $f_X$, one can show convergence of these classifiers to the optimal classifier (with respect to the generalization loss) as $M$ increases.

In detail, define the Bayes risk as the lowest achievable risk with respect to the generalization loss,
$$R_{\mathcal{D}} = \inf\{R_{\mathcal{D}}(f) \mid f : \mathcal{X} \to \mathbb{R} \text{ measurable}\}.$$

A classifier rule is called universally consistent if
$$\lim_{M\to\infty} R_{\mathcal{D}}(f_X) = R_{\mathcal{D}}$$
holds in probability.

The SVM classifier rule discussed here exhibits this consistency (Proposition 3.3, [Ste05]), such that with growing data set size the SVM classifier will approach the best achievable classifier, even though it is obtained by minimizing the empirical regularization loss. Note that one actually has to ensure a universality property of the reproducing kernel of the feature space, which can be neglected in the finite-dimensional case.

The intuition about scaling invariance of the learnability can now be founded mathematically on the scaling invariance of the generalization loss.

One promising attempt could now be to simply drop the DP-scaling property and relax the DP-angle-preserving property, such that the pairwise distances within the data set are conserved up to a uniform rescaling, and to show the relation between the measurement-domain SVM classifier and the optimal linear classifier in the data domain in terms of the generalization loss instead of the hinge loss. This goes beyond the scope of this thesis, but something with regard to the preliminary considerations can still be achieved. For that, recapitulate the ideas of Chapter 3.

One can omit the DP-scaling property, i.e. the Lipschitz constant of the measurement matrix $A$, by constructing the matrix in such a way that $\|A\|_2 \leq 1 =: L$ holds in general. This can be done making use of two theorems. The following lemma results as a special case of a theorem which states the to date best bound on the embedding dimension $m$ for the construction of RIP-satisfying matrices arising from randomly sub-sampling rows of unitary matrices that are maximally incoherent (with respect to the standard basis).


Lemma 7 (Theorem 3.3, [RV08]). For any $\delta \in (0, 1)$ and $k, n > 2$, let $\Omega \subset \{1, \ldots, n\}$ be a random subset of cardinality
$$m \geq C\,\frac{1}{\delta^2}\,k\log(n)\,\log\!\left(C\,\frac{1}{\delta^2}\,k\log(n)\right)\log(k)^2.$$
Then $\Phi = P_\Omega U$ satisfies $(k, \delta)$-RIP with probability exceeding $1 - 5\exp\!\big(-c\,\tfrac{1}{\delta^2}\big)$, where
$$P_\Omega = \begin{pmatrix} e_{\omega_1}^\top \\ \vdots \\ e_{\omega_m}^\top \end{pmatrix}$$
selects the rows of $U$ indexed by $\Omega$.

Note that this theory can be widely generalized to bounded orthonormal systems (see Chapter 4 in [Rau10]). One can choose $U$ to be the discrete Fourier or wavelet transform, as these are maximally incoherent. The next theorem deduces a JL property from the restricted isometry property.

Theorem 7 (Theorem 3.1, [KW10]). Fix $\eta > 0$ and $\varepsilon > 0$, and consider a finite set $E \subset \mathbb{R}^n$ with $|E| = p$. Set $k \geq 40\log\big(\frac{4p}{\eta}\big)$, and suppose that $\Phi$ satisfies $(k, \delta)$-RIP with $\delta \leq \frac{\varepsilon}{4}$. Let $\xi \sim \mathcal{U}(\{-1, 1\}^n)$ be a Rademacher sequence and $D_\xi$ the diagonal matrix with $\xi$ on the main diagonal. Then with probability exceeding $1 - \eta$,
$$(1-\varepsilon)\|x\|_2^2 \leq \|\Phi D_\xi x\|_2^2 \leq (1+\varepsilon)\|x\|_2^2$$
uniformly for all $x \in E$.

Together these statements give the proposition:

Proposition 3. Fix $0 < \eta, \varepsilon < 1$ and let $U$ be a maximally incoherent unitary matrix. Let
$$m \gtrsim \frac{1}{\varepsilon^2}\,\log\!\left(\frac{1}{\eta}\binom{M+2}{2}\right)\log\!\left(\log\!\left(\frac{1}{\eta}\binom{M+2}{2}\right)\right)^{\!3}\log(n),$$
let $\Omega \subset \{1, \ldots, n\}$ with $|\Omega| = m$ be chosen uniformly at random, and let $D_\xi$ be the diagonal matrix with $\xi$ on the main diagonal, where $\xi \sim \mathcal{U}(\{-1, 1\}^n)$. Define $A = P_\Omega U D_\xi$. Then $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP on the feature space $\mathcal{X} = B_R^n(0)$, where $\delta_1 = \eta + 5\exp\!\big(-c\frac{16}{\varepsilon^2}\big)$ and $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$.

Proof. As $D_\xi$, and therefore $UD_\xi$, is unitary almost surely, it holds that
$$\|A\|_2 = \|P_\Omega U D_\xi\|_2 = \|P_\Omega\|_2.$$
Now, $\|P_\Omega x\|_2 = \big(\sum_{\omega\in\Omega} |x_\omega|^2\big)^{1/2} \leq \big(\sum_{i=1}^n |x_i|^2\big)^{1/2} = \|x\|_2$, such that
$$\|P_\Omega\|_2 = \max_{\|x\|_2 = 1} \|P_\Omega x\|_2 \leq \max_{\|x\|_2 = 1} \|x\|_2 = 1 =: L.$$


So $\|Ax\|_2 \leq \|A\|_2\|x\|_2 \leq L\|x\|_2$, which gives the DP-scaling property.

Now, by Lemma 7, $P_\Omega U$ satisfies $\big(k, \frac{\varepsilon}{4}\big)$-RIP with $k \gtrsim \log\big(\frac{1}{\eta}\binom{M+2}{2}\big)$, up to an error probability below $5\exp\!\big(-c\frac{16}{\varepsilon^2}\big)$.

For the set $X = \{x_1, \ldots, x_{M+1}\} \cup \{0\}$ of the training examples and zero, consider the set of differences $E = \{x - x' \mid x, x' \in X\}$. Then $|E| \leq \binom{M+2}{2}$.

By Theorem 7, with probability exceeding $1 - \big(\eta + 5\exp\!\big(-c\frac{16}{\varepsilon^2}\big)\big)$,
$$\forall x \in E: \quad (1-\varepsilon)\|x\|_2^2 \leq \|P_\Omega U D_\xi x\|_2^2 \leq (1+\varepsilon)\|x\|_2^2.$$
Due to Lemma 5, it therefore holds for all $i, j \in \{1, \ldots, M+1\}$ that
$$|\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \leq \tilde{\varepsilon}R^2,$$
where $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$.
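The construction of Proposition 3 is straightforward to instantiate. The following sketch uses the orthonormal DCT as a stand-in for a maximally incoherent unitary transform (an assumption of this illustration, not a choice made in the thesis), builds $A = P_\Omega U D_\xi$ and empirically inspects the distortion of the pairwise differences:

    import numpy as np
    from itertools import combinations
    from scipy.fft import dct

    def sample_measurement_matrix(n, m, rng):
        """A = P_Omega U D_xi: sub-sampled orthonormal transform with randomized column signs."""
        U = dct(np.eye(n), norm="ortho", axis=0)      # orthonormal DCT-II matrix as the incoherent U
        xi = rng.choice([-1.0, 1.0], size=n)          # Rademacher sequence xi for D_xi
        omega = rng.choice(n, size=m, replace=False)  # uniformly chosen row subset Omega
        return U[omega, :] * xi                       # row selection P_Omega and column sign flips D_xi

    rng = np.random.default_rng(1)
    n, m, M = 512, 128, 50
    A = sample_measurement_matrix(n, m, rng)

    # empirical distortion on the pairwise differences E = {x - x' | x, x' in X u {0}}
    X = np.vstack([rng.normal(size=(M, n)), np.zeros((1, n))])
    ratios = np.array([np.linalg.norm(A @ (u - v))**2 / np.linalg.norm(u - v)**2
                       for u, v in combinations(X, 2)])
    # the relative spread of the ratios reflects the distortion; note that Lemma 7 is usually
    # stated for a rescaled matrix, so the raw ratios here concentrate around m/n rather than 1
    print(ratios.min(), ratios.mean(), ratios.max())

In this sketch the DCT plays the role of $U$; any real, maximally incoherent unitary transform, such as the real Fourier or wavelet transform used later in the experiments, could be substituted.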

Note that Proposition 3 does not assume any structure, i.e. sparsity, of the data. Within the Gaussian sampled construction the only spot exploiting sparsity was a better approximation of the Lipschitz constant of the DP-scaling property, which can now be circumvented entirely. Assume now a sparse feature space $\mathcal{X} = B_R^n(0) \cap \Sigma_k$. One can then further improve the idea of Proposition 3, as one does not need to deduce the JL property for the data set from the restricted isometry property if the measurement matrix already satisfies $(2k, \varepsilon)$-RIP.

Proposition 4. Fix $0 < \varepsilon < 1$ and let $U$ be a maximally incoherent unitary matrix. Let
$$m \gtrsim \frac{1}{\varepsilon^2}\,2k\log(2k)^3\log(n),$$
and let $\Omega \subset \{1, \ldots, n\}$ with $|\Omega| = m$ be chosen uniformly at random. Define $A = P_\Omega U$. Then $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP on the feature space $\mathcal{X} = B_R^n(0) \cap \Sigma_k$ for every $M \in \mathbb{N}$, where $\delta_1 = 5\exp\!\big(-c\frac{1}{\varepsilon^2}\big)$ and $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$.

Proof. As in the previous proof, by construction $\|Ax\|_2 \leq L\|x\|_2$ for $L = 1$. Now $P_\Omega U$ satisfies $(2k, \varepsilon)$-RIP, up to an error probability below $5\exp\!\big(-c\frac{1}{\varepsilon^2}\big)$. Again define the sets $X = \{x_1, \ldots, x_{M+1}\} \cup \{0\}$ and $E = \{x - x' \mid x, x' \in X\}$. Then for all $x \in E$: $x \in \Sigma_{2k}$. So, by applying the restricted isometry property, it holds for all $x \in E$ that
$$(1-\varepsilon)\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1+\varepsilon)\|x\|_2^2.$$
Due to Lemma 5, it therefore holds for all $i, j \in \{1, \ldots, M+1\}$ that
$$|\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \leq \tilde{\varepsilon}R^2,$$
where $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$.


In order to compare this construction with the former one in Corollaries 1 and 2, one needs to apply Theorem 5 to the DP parameters gained this way. For simplicity, restrict the choice of $U$ such that $U(\mathbb{R}) \subseteq \mathbb{R}$, as given for example by the discrete wavelet transform or the real Fourier transform. This guarantees the measurement data to be real, such that the SVM theory applies.

Corollary 3. Consider the feature space $\mathcal{X} = B_R^n(0)$. Fix $0 < \eta, \varepsilon < 1$ and let $U$ be a maximally incoherent unitary matrix with $U(\mathbb{R}) \subseteq \mathbb{R}$. Let
$$m \gtrsim \frac{1}{\varepsilon^2}\,\log\!\left(\frac{M}{\eta}\right)\log\!\left(\log\!\left(\frac{M}{\eta}\right)\right)^{\!3}\log(n),$$
let $\Omega \subset \{1, \ldots, n\}$ with $|\Omega| = m$ be chosen uniformly at random, and let $D_\xi$ be the diagonal matrix with $\xi$ on the main diagonal, where $\xi \sim \mathcal{U}(\{-1, 1\}^n)$. Define $A = P_\Omega U D_\xi$. Then with probability exceeding $1 - \delta$,
$$H_{\mathcal{D}}(v_{AS}) \leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\varepsilon + \sigma + \frac{(1+\sigma)^2\log(\frac{1}{\eta})}{M}}\right),$$
where $\delta = 3\eta + 15\exp\!\big(-c\frac{16}{\varepsilon^2}\big)$.

Proof. From Proposition 3, $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP with $\delta_1 = \eta + 5\exp\!\big(-c\frac{16}{\varepsilon^2}\big)$ and $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$. Again note that $\log\big[\binom{M+2}{2}\big] = O(\log M)$. Setting $\delta_2 = \delta_1$ and applying Theorem 5 with these parameters gives the bound
\begin{align*}
H_{\mathcal{D}}(v_{AS}) &\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\tilde{\varepsilon} + \sigma + \frac{(1+\sigma)^2\log(\frac{1}{\delta_2})}{M}}\right) \\
&\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\varepsilon + \sigma + \frac{(1+\sigma)^2\log(\frac{1}{\eta})}{M}}\right).
\end{align*}

Similarly for the sparse case:

Corollary 4. Consider the feature space $\mathcal{X} = B_R^n(0) \cap \Sigma_k$. Fix $0 < \varepsilon < 1$ and let $U$ be a maximally incoherent unitary matrix with $U(\mathbb{R}) \subseteq \mathbb{R}$. Let
$$m \gtrsim \frac{1}{\varepsilon^2}\,k\log(k)^3\log(n),$$
and let $\Omega \subset \{1, \ldots, n\}$ with $|\Omega| = m$ be chosen uniformly at random. Define $A = P_\Omega U$. Then with probability exceeding $1 - \delta$,
$$H_{\mathcal{D}}(v_{AS}) \leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\varepsilon + \sigma + \frac{(1+\sigma)^2}{M\varepsilon^2}}\right),$$
where $\delta = 15\exp\!\big(-c\frac{1}{\varepsilon^2}\big)$.


Proof. From Proposition 4, $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP with $\delta_1 = 5\exp\!\big(-c\frac{1}{\varepsilon^2}\big)$ and $\tilde{\varepsilon} = 3\varepsilon + 2\sigma + \sigma^2$. Setting $\delta_2 = \delta_1$ and applying Theorem 5 with these parameters gives the bound
\begin{align*}
H_{\mathcal{D}}(v_{AS}) &\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\tilde{\varepsilon} + \sigma + \frac{(1+\sigma)^2\log(\frac{1}{\delta_2})}{M}}\right) \\
&\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\varepsilon + \sigma + \frac{(1+\sigma)^2}{M\varepsilon^2}}\right).
\end{align*}

4.2 Relaxation using Coherent and Redundant Dictionaries

A directly accessible sparse data representation with respect to the standard basis is somewhat restrictive. One might intuitively assume that the existence of at least one sparse representation (with respect to an arbitrary generating system) will benefit the data compression. For a Gaussian sampled measurement matrix, sparse representations with respect to any orthonormal basis are implicitly captured: assume the (non-sparse) data $x$ has a sparse representation $u$ with respect to an orthonormal basis given by the columns of the matrix $\Phi$, i.e. $\Phi$ is an orthogonal matrix and $x = \Phi u$. Then $Ax = A\Phi u$, and the entries of $A$ as well as those of $A\Phi$ are i.i.d. drawn from a $\mathcal{N}(0, \frac{1}{m})$ Gaussian distribution. So, from the measurement matrix generating perspective, one samples $A$ as likely as $A\Phi$. However, this argument does not hold for the sub-sampled incoherent unitary matrix construction and, in addition, would still suffer from excessive model restriction.

A more adequate model is given by redundant dictionary representations. A redundant dictionary is a generating system in which more than one representation of each element may exist. For example, the union of the standard basis and the columns of the real Fourier transform forms a redundant dictionary. The ambition of such redundant dictionaries is to gain a sparse representation of the data at all, or a sparser one, compared to a basis.

Let $D \in \mathbb{R}^{n\times d}$ be a matrix whose $d$ columns are the redundant dictionary elements. Let $x = Du$ be the observed data with its sparse coefficient representation $u$ with respect to $D$. One would now want to guarantee some isometry property on such data with a $k$-sparse coefficient representation. In [CEN10], a generalization of RIP with respect to a dictionary $D$ is introduced.

Definition 8. Let $\Sigma_k$ be the union of all subspaces spanned by all subsets of $k$ columns of $D$. The measurement matrix $A$ obeys the restricted isometry property adapted to $D$ (abbreviated $D$-RIP) with constant $\delta_k$ if
$$(1-\delta_k)\|v\|_2^2 \leq \|Av\|_2^2 \leq (1+\delta_k)\|v\|_2^2$$
holds for all $v \in \Sigma_k$.


For the dictionary given by the standard basis, $D$-RIP reduces to standard RIP.

If a measurement matrix $A$ satisfies the JLP with specific parameters, it will also satisfy $D$-RIP with high probability. Concretely:

Theorem 8. Let $0 < \delta < 1$, let $D \in \mathbb{R}^{n\times d}$ be a dictionary with RIP constant $\delta_k(D)$, and let $\eta$ be the failure probability of $A$ not fulfilling the JLP for $p \geq \big(1 + \frac{12}{\delta}\big)^k$ points simultaneously with distortion parameter $\frac{\delta}{3}$. Then $A$ will satisfy $D$-RIP with $\delta_k \leq \delta_k(D) + \delta(1 + \delta_k(D))$ with probability exceeding $1 - \big(\frac{ed}{k}\big)^k\eta$.

The above statement can be deduced from [RSV08] (Lemma 2.1 and Theorem 2.2), with small deviations in the corresponding proofs. Although the statement here is essentially equivalent to Theorem 2.2 in [RSV08], it is much easier to understand and handle in the local terminology. The condition is fulfilled by any matrix satisfying standard RIP by Theorem 7, and therefore by the sub-sampled real Fourier transform with randomized column signs. Note that, in order to guarantee a small $\delta_k(D)$, the coherence $\mu(D)$ of $D$ must be kept low, i.e. the dictionary elements should be as equiangular as possible (up to $D$ being an equiangular tight frame), since the coherence gives an upper bound on the RIP constant of $D$ by
$$\delta_k(D) \leq (k-1)\mu(D).$$
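As a small numerical illustration of this coherence criterion (a sketch only; the DCT is used here as a stand-in for the real Fourier transform appearing below), one can compute $\mu(D)$ for the union of the standard basis and an orthonormal transform and evaluate the induced bound on $\delta_{2k}(D)$:

    import numpy as np
    from scipy.fft import dct

    def coherence(D):
        """Largest absolute inner product between distinct normalized dictionary columns."""
        Dn = D / np.linalg.norm(D, axis=0)   # normalize the columns
        G = np.abs(Dn.T @ Dn)                # absolute Gram matrix
        np.fill_diagonal(G, 0.0)             # ignore the trivial diagonal entries
        return G.max()

    n, k = 256, 8
    D = np.hstack([np.eye(n), dct(np.eye(n), norm="ortho", axis=0)])  # standard basis union DCT
    mu = coherence(D)
    print(mu, (2 * k - 1) * mu)   # coherence and the resulting bound on delta_{2k}(D)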

Proposition 5. Consider the feature space $\mathcal{X} = B_R^n(0) \cap \Sigma_k$ of vectors with a $k$-sparse dictionary representation. Let $0 < \delta, \eta < 1$, let $D \in \mathbb{R}^{n\times 2n}$ be the union of the real Fourier transform and the standard basis as the dictionary, and let $A$ be the sub-sampled real Fourier transform with randomized column signs. Choose
$$m \gtrsim \frac{144}{\delta^2}\,\log\!\left(\frac{1}{\eta}\Big(1 + \frac{12}{\delta}\Big)^{2k}\right)\log\!\left(\log\!\left(\frac{1}{\eta}\Big(1 + \frac{12}{\delta}\Big)^{2k}\right)\right)^{\!3}\log(n).$$
Then $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP for every $M \in \mathbb{N}$, where $\delta_1 = 5\exp\!\big(-c\frac{144}{\delta^2}\big) + \big(\frac{en}{k}\big)^{2k}\eta$ and $\tilde{\varepsilon} = 3\delta_{2k} + 2\sigma + \sigma^2$ with $\delta_{2k} \leq \frac{2k-1}{\sqrt{m}} + \delta\big(1 + \frac{2k-1}{\sqrt{m}}\big)$.

Proof. The union of the real Fourier transform and the standard basis as the dictionary $D$ has coherence $\mu(D) = \frac{1}{\sqrt{m}}$. So $\delta_{2k}(D) \leq \frac{2k-1}{\sqrt{m}}$. Note that by construction $\|Ax\|_2 \leq L\|x\|_2$ for $L = 1$. Up to an error probability below $5\exp\!\big(-c\frac{144}{\delta^2}\big)$, $A$ satisfies $\big(k, \frac{\delta}{12}\big)$-RIP by Lemma 7 with $k \geq \log\!\big(\big(1 + \frac{12}{\delta}\big)^{2k}\big)$. By Theorem 7, when randomizing the column signs, $A$ will therefore fulfill the JLP for $p \geq \big(1 + \frac{12}{\delta}\big)^{2k}$ points simultaneously with distortion parameter $\frac{\delta}{3}$, up to an error probability $\eta$. By Theorem 8, $A$ satisfies $D$-RIP with probability exceeding $1 - \big(\frac{en}{k}\big)^{2k}\eta$ and
$$\delta_{2k} \leq \delta_{2k}(D) + \delta(1 + \delta_{2k}(D)) \leq \frac{2k-1}{\sqrt{m}} + \delta\Big(1 + \frac{2k-1}{\sqrt{m}}\Big).$$


Define the sets $X = \{x_1, \ldots, x_{M+1}\} \cup \{0\}$ and $E = \{x - x' \mid x, x' \in X\}$. Then for all $x \in E$: $x \in \Sigma_{2k}$. So, by applying $D$-RIP, it holds for all $x \in E$ that
$$(1-\delta_{2k})\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1+\delta_{2k})\|x\|_2^2.$$
Due to Lemma 5, it therefore holds for all $i, j \in \{1, \ldots, M+1\}$ that
$$|\langle x_i, x_j\rangle_2 - \langle z_i, z_j\rangle_2| \leq \tilde{\varepsilon}R^2,$$
where $\tilde{\varepsilon} = 3\delta_{2k} + 2\sigma + \sigma^2$.

The specific choice of the dictionary in Proposition 5, consisting of the union of the real Fourier transform and the standard basis, is quite arbitrary, and the result would also hold for other dictionaries with low coherence. One can now apply this proposition to Theorem 5.

Corollary 5. Consider the feature space $\mathcal{X} = B_R^n(0) \cap \Sigma_k$. Fix $0 < \eta, \delta < 1$ such that
$$m \gtrsim \frac{1}{\delta^2}\,\log\!\left(\frac{1}{\eta}\Big(1 + \frac{12}{\delta}\Big)^{2k}\right)\log\!\left(\log\!\left(\frac{1}{\eta}\Big(1 + \frac{12}{\delta}\Big)^{2k}\right)\right)^{\!3}\log(n).$$
Let $U$ be the real Fourier transform, let $\Omega \subset \{1, \ldots, n\}$ with $|\Omega| = m$ be chosen uniformly at random, and let $D_\xi$ be the diagonal matrix with $\xi$ on the main diagonal, where $\xi \sim \mathcal{U}(\{-1, 1\}^n)$. Define $A = P_\Omega U D_\xi$. Then with probability exceeding $1 - \gamma$,
$$H_{\mathcal{D}}(v_{AS}) \leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\frac{2k-1}{\sqrt{m}} + \delta\Big(1 + \frac{2k-1}{\sqrt{m}}\Big) + \sigma + \frac{(1+\sigma)^2\big[\log(\frac{1}{\eta}) - 2k\log(\frac{n}{k})\big]}{M}}\right),$$
where $\gamma = 15\exp\!\big(-c\frac{144}{\delta^2}\big) + 3\big(\frac{en}{k}\big)^{2k}\eta$.

Proof. From Proposition 5, $A$ is $(M+1, 1, \tilde{\varepsilon}, \delta_1)$-DP with $\delta_1 = 5\exp\!\big(-c\frac{144}{\delta^2}\big) + \big(\frac{en}{k}\big)^{2k}\eta$ and $\tilde{\varepsilon} = 3\frac{2k-1}{\sqrt{m}} + 3\delta\big(1 + \frac{2k-1}{\sqrt{m}}\big) + 2\sigma + \sigma^2$. Setting $\delta_2 = \delta_1$ and applying Theorem 5 with these parameters gives the bound
\begin{align*}
H_{\mathcal{D}}(v_{AS}) &\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\frac{2k-1}{\sqrt{m}} + \delta\Big(1 + \frac{2k-1}{\sqrt{m}}\Big) + \sigma + \frac{(1+\sigma)^2\log(\frac{1}{\delta_2})}{M}}\right) \\
&\leq H_{\mathcal{D}}(w_0) + O\!\left(R\|w_0\|_2\sqrt{\frac{2k-1}{\sqrt{m}} + \delta\Big(1 + \frac{2k-1}{\sqrt{m}}\Big) + \sigma + \frac{(1+\sigma)^2\big[\log(\frac{1}{\eta}) - 2k\log(\frac{n}{k})\big]}{M}}\right).
\end{align*}


5 Experiments

To start with, consider the simplest case of toy data which possesses a sparse structure. The setup is based on an experiment summarized in ([CJK12], Figure 10.1), apart from one fundamental difference. In essence, they randomly draw $M$ independent data points distributed uniformly over the $k$-dimensional unit sphere, where $k$ is the sparsity level, and embed all data points into the data domain $\mathbb{R}^n$ by inserting them into a shared, randomly chosen support and applying an orthogonal transformation afterwards. Furthermore, they draw a random direction $w_0$ and assign to each data point $x$ the label $\operatorname{sign}(\langle x, w_0\rangle_2)$. Obviously the classification problem is linearly solvable. Their results hold under the restrictive auxiliary condition that all data points were generated on the same $k$-sparse support, which reduces the problem complexity drastically towards the sparsity level $k$. This problem model therefore fails to capture the general assumptions of compressed learning. This can be realized when observing their classification performance in the non-compressed data domain (nearly 0% generalization error) for a very small amount ($M = 100$) of 5-sparse data in $n = 1024$ dimensions. Note that, even if all training data points captured disjoint dimensions, not even half of all dimensions could be touched by the data, such that there would be no chance to classify test data points in the other half better than at random. Therefore, if the full model complexity were the desired one, the chance of fitting the model would be very low, leading to significantly worse performance.

Without the auxiliary condition, one would require a drastic increase of the amount of data. As this becomes computationally infeasible, a trade-off will be made here. To justify the following idea, consider the typical real-world application of text classification using the intuitive bag-of-words kernel, which accounts, for every word common to two compared strings, for the product of its multiplicities. The feature vectors possess the assumed sparsity, since it is very unlikely that a text will contain every possible word. Furthermore, one observes that some words are related to a subset of the classes (e.g. topics). This means that a class unrelated to such a word, when only accounting for occurrences, will not show a negative characteristic in that feature dimension but an absence. Abstractly viewed, it is therefore a quite realistic assumption that for each class there exists a specific subset of feature dimensions on which its corresponding data will have support more likely than on the remaining dimensions.

One can simulate that behavior by choosing the support of a data point according to a class-dependent distribution. This can be carried out by the following procedure. First, assign all dimensions uniquely to the classes in equal shares. For each data point, choose a part of its support completely at random, determine the label as above, and then choose the rest of the support from the class-specific dimensions. This procedure fixes, for the data's support, a ratio of occupied class-related features to the remaining features. For two classes, a ratio of 0.5 corresponds to the completely random and most general model, while a ratio of 1 corresponds to the unrealistic case of a strict separation of class supports. A sketch of this generation procedure is given below.
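The following sketch shows one plausible reading of this generation procedure (the exact sampling details are not fully specified in the text, so the parameter names and the way the ratio is realized are assumptions of this illustration):

    import numpy as np

    def generate_toy_data(M, n, k, ratio, rng):
        """k-sparse points in R^n whose support is biased towards class-specific dimensions."""
        w0 = rng.normal(size=n)                                        # random labeling direction
        class_dims = {-1: np.arange(n // 2), 1: np.arange(n // 2, n)}  # dimensions assigned per class
        # uniformly drawn support part, sized so that the expected class-related share equals `ratio`
        k_rand = int(round(2 * (1 - ratio) * k))
        X = np.zeros((M, n))
        y = np.empty(M)
        for i in range(M):
            rand_support = rng.choice(n, size=k_rand, replace=False)
            X[i, rand_support] = rng.normal(size=k_rand)
            label = int(np.sign(X[i] @ w0)) or 1                       # label from the random part
            rest = np.setdiff1d(class_dims[label], rand_support)
            class_support = rng.choice(rest, size=k - k_rand, replace=False)
            X[i, class_support] = rng.normal(size=k - k_rand)          # rest of the support from class dims
            y[i] = label
        return X, y

    X, y = generate_toy_data(M=2000, n=511, k=32, ratio=2 / 3, rng=np.random.default_rng(0))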


The first experiment demonstrates the influence of that ratio with the setup $n = 511$, $m = 200$, $k = 32$, $M = 2000$. Figure 4 shows the average generalization loss when varying the ratio parameter. The experiment was applied to the original data set, to the data set measured by the sub-sampled real Fourier transform, by the sub-sampled wavelet transform, by the Gaussian sampled measurement matrix and, as a reference, by a `bad' measurement matrix. The `bad' measurement matrix is generated as follows: sample a random direction, project it onto the true decision hyperplane and sample at random $m$ vectors close to the projected direction. This matrix should perform badly, since it measures essentially non-informative directions. The evaluation consists of a 5-fold cross-validation, with respect to the generalization loss, over the regularization parameter $C$ and a subsequent selection of the minimizing $C$. The result of the first experiment suggests taking the ratio $\frac{2}{3}$ as a base for further experiments, as this is the point at which the non-compressed data can be learned almost perfectly. A sketch of the `bad' reference matrix follows below.

Figure 4: The generalization error plotted against a varying ratio for $M = 2000$ data points with sparsity $k = 32$ in $n = 511$ dimensions, in the uncompressed form and when taking $m = 200$ measurements with different methods.
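For completeness, a minimal sketch of the `bad' reference matrix described above (the closeness parameter and the row normalization are assumptions of this illustration, as the thesis does not specify them):

    import numpy as np

    def bad_measurement_matrix(w0, m, rng, closeness=0.05):
        """m rows clustered around one random direction projected onto the hyperplane w0^T x = 0."""
        n = w0.shape[0]
        d = rng.normal(size=n)
        d -= (d @ w0) / (w0 @ w0) * w0                   # project the direction onto the hyperplane
        d /= np.linalg.norm(d)
        rows = d + closeness * rng.normal(size=(m, n))   # m slightly perturbed copies of the direction
        return rows / np.linalg.norm(rows, axis=1, keepdims=True)

    # all rows are nearly parallel and almost orthogonal to w0, so the measurements
    # carry close to no information about the label sign(<x, w0>)
    A_bad = bad_measurement_matrix(w0=np.random.default_rng(0).normal(size=511), m=200,
                                   rng=np.random.default_rng(1))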

The second experiment shows the learning behavior for a growing amount of training data. From Figure 5 one can infer that at some point, when further growing the amount of data, no method will exceed a certain quality threshold, each converging to some method-specific optimum. Moreover, one can observe that the quality optimum for the sub-sampled incoherent unitary matrices is higher than the quality optimum of a Gaussian sampled measurement matrix.


Figure 5: The generalization error plotted against a varying data amount with sparsity $k = 32$ in $n = 511$ dimensions and a ratio of $\frac{2}{3}$, in the uncompressed form and when taking $m = 200$ measurements with different methods.

Figure 6: The generalization error plotted against a varying embedding dimension for $M = 2000$ data points with sparsity $k = 32$ in $n = 511$ dimensions and a ratio of $\frac{2}{3}$, in the uncompressed form and when taking measurements with different methods.

Based on the point in the second experiment at which the performance of the methods of interest converges ($M \sim 2000$), the next experiment analyzes the performance for several embedding dimensions.


The results displayed in Figure 6 again indicate that the sub-sampled incoherent unitary matrices are superior to a Gaussian sampled measurement matrix. As the sub-sampled incoherent unitary matrices converge to true unitary transformations as $m$ approaches $n$, one can always guarantee an exact reconstruction of the learning problem for $m$ large enough.

The last parameter of interest, the sparsity level $k$, comes along with an unexpected behavior in the local experiments. A growing sparsity level should result in an increase of the generalization error. However, the opposite is the case when holding a ratio $> 0.5$ of class-specific support, as shown in Figure 7.

Figure 7: The generalization error plotted against a varying sparsity level for $M = 2000$ data points in $n = 511$ dimensions and a ratio of $\frac{2}{3}$, in the uncompressed form and when taking $m = 200$ measurements with different methods.

Assuming that holding a specific ratio value $> 0.5$ has a latent effect on the learnability, one can take a look at the totally random case, i.e. ratio $= 0.5$, displayed in Figure 8. Here still no increase of the generalization error, more precisely no influence on the generalization error at all, can be observed. When holding a static number of class-specific dimensions, the expected result can be produced, as one can observe in Figure 9. But this is in some sense a hack, since a static number of class-specific dimensions for a growing sparsity level comes along with a decreasing ratio.


Figure 8: The generalization error plotted against a varying sparsity level for $M = 2000$ data points in $n = 511$ dimensions and a ratio of 0.5, in the uncompressed form and when taking $m = 200$ measurements with different methods.

Figure 9: The generalization error plotted against a varying sparsity level for $M = 2000$ data points in $n = 511$ dimensions and a ratio of 0.5, in the uncompressed form and when taking $m = 200$ measurements with different methods.


6 Limitations of Compressed Learning

6.1 The finite Data Dimension

So far the focus has always been on a finite data dimension $n$. It is very common in SVM theory to allow an infinite data dimension since, as long as we only observe a finite amount of data, the optimal solution lies within the span of the data, implicitly embedding the optimization problem in a finite-dimensional context. There exist recent results in compressed sensing generalizing the model to infinite dimensions, in which the use of RIP was dropped in the extension of the theory. That this is possible can be accepted by noting that RIP was only a sufficient but not a necessary condition for data compression. Since, so far in the compressed learning theory, the JLP or RIP is needed for another reason than mere compression, namely structure preservation, this is not a possible way of extending the local theory. The following theorem gives an intuition why, within the compressed sensing generalization, one actually has to abandon RIP.

Theorem 9 (Theorem 2, [BD09]). Let $\mathcal{A}$ be the union of $L$ subspaces of dimension no more than $k$. In order for a linear map $\Phi : \mathcal{A} \to \mathbb{R}^m$ to exist such that it has a Lipschitz constant $K_F$ and such that its inverse map $\Phi_{\mathcal{A}}^{-1} : \Phi(\mathcal{A}) \to \mathcal{A}$ has a Lipschitz constant $K_I$, it is necessary that
$$m \geq \frac{\log(L)}{\log\!\left(\frac{4K_FK_I}{\Delta(\mathcal{A})}\right)}.$$

Here $\Delta(\mathcal{A})$ measures the distance between the subspaces. When considering $\mathcal{A} = \Sigma_k$, it follows that $\Delta(\mathcal{A}) \geq \sqrt{2k}$, i.e. a constant with respect to the data dimension $n$, and $L = \binom{n}{k}$. A matrix $A$ satisfying RIP implies by definition the existence of Lipschitz constants for itself and its inverse. So, by Theorem 9, the existence of $A$ is coupled with a dependence of the embedding dimension on the data dimension, $m \gtrsim \log\binom{n}{k} \gtrsim \log(n)$. If one now extended the data dimension to infinity in this way, one would require the embedding dimension to be infinite as well. As a more general but similar operator property one can consider the restricted amplification property (see Definition 7 in [BCDH08]), which requires some kind of clustering structure of the relevant coefficients, lowering the number of considered $k$-dimensional subspaces such that no dependence on the data dimension $n$ remains. This clustering structure is given, for instance, in real-world data which is sparse in the wavelet domain, as the coefficients there have a tree structure by nature. When solely allowing subspaces that are finite sub-trees such that every node of the sub-tree, if it has a parent node in the original tree, implies the parent node to be part of the sub-tree as well, the number of such subspaces does not exceed $\frac{(2e)^k}{k+1}$ (see Proposition 1 in [BCDH08]). The structure condition also seems reasonable, i.e. not very restrictive, from the machine learning point of view.


6.2 Kernel-SVMs vs. Compressed Learning

A fundamental assumption in compressed learning is that measurements will be taken from a data domain with vector space structure and, moreover, that a linear hyperplane can solve the classification problem. When having no direct access to the data except for linear measurements, as such applications exist from the compressed sensing point of view, this is no drawback at all. From the machine learning point of view, on the other hand, many problems of interest concern data domains that are not linearly separable or are not even represented in a vector space, having by nature a structured representation like graphs or strings instead.

In those cases one commonly applies an appropriate kernel to transform the data onto a vector space structure in which, in addition, the classification problem can be solved by a linear hyperplane. Of course one could now proceed with compressed learning in the usual way, assuming that the kernel-induced feature map can be derived. Note that this might not be (trivially) given.

However, this scenario might give rise to a computational pitfall. Consider the string classification problem of separating spam mails from mails of interest with the intuitive bag-of-words kernel, which accounts, for every word common to the two compared strings, for the product of its multiplicities. The induced feature vectors are therefore simply given by the word counts of a string, where each entry in the vector corresponds to exactly one of all possible words.

Such vectors are in general very sparse. Call $k$ the average sparsity with respect to all $n$ globally occurring words in the data set of $M$ text documents. As usual, call $m$ the embedding dimension of a measurement matrix for the compressed learning approach. Now, in the learning phase, the direct computation of a kernel value between two documents can be performed in $O(k)$, such that all kernel values are determined in $O(kM^2)$. On the other hand, even using the most efficient way to derive the embedded inner products $\langle A\Phi(x), A\Phi(y)\rangle_2$, namely by computing $B = A^\top A$ once and then exploiting the sparsity via
$$\langle A\Phi(x), A\Phi(y)\rangle_2 = \sum_{\Phi(x)_i \neq 0}\;\sum_{\Phi(y)_j \neq 0} B_{ij}\,\Phi(x)_i\,\Phi(y)_j,$$
one ends up with a computational complexity of $O(k^2M^2)$. That is, the embedded version takes more computation time while additionally yielding less accuracy. The same holds for the classification phase. Moreover, there exist cases, for example when comparing graphs, where a feature map is not directly accessible and the kernel values can be approximated in time linear in the graph size.
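For illustration, a minimal sketch of this embedded inner-product computation (with hypothetical toy word counts; the double sum over the non-zero indices is exactly the expression above):

    import numpy as np

    def embedded_kernel(x, y, B):
        """<A phi(x), A phi(y)> via the precomputed B = A^T A, touching only non-zero entries."""
        I, J = np.nonzero(x)[0], np.nonzero(y)[0]        # supports of the two bag-of-words vectors
        return float(x[I] @ B[np.ix_(I, J)] @ y[J])      # double sum over O(k^2) index pairs

    rng = np.random.default_rng(0)
    n, m = 1000, 100
    A = rng.normal(size=(m, n)) / np.sqrt(m)
    B = A.T @ A                                          # computed once for the whole data set

    x, y = np.zeros(n), np.zeros(n)
    x[rng.choice(n, 20, replace=False)] = rng.integers(1, 5, size=20)   # sparse word counts
    y[rng.choice(n, 20, replace=False)] = rng.integers(1, 5, size=20)
    print(embedded_kernel(x, y, B), (A @ x) @ (A @ y))   # both evaluate the same inner product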


7 Conclusion

Compressed learning exploits properties of specifically random sampled measurement methods that preserve, with high probability, the structure of data sets and thereby the learning problems imposed on them. Therefore, such problems can equivalently be solved on a comparably small number of measurements of the original data. The distance-preserving property introduced by Calderbank and Jafarpour, which is based on the restricted isometry property and the Johnson-Lindenstrauss property well known from compressed sensing, is sufficient to prove a conservation of classification problems solved by SVMs. Since these properties profit from sparsity in the data, it is straightforward to achieve better conservation results under sparsity.

Experiments in relatively small dimensions compared to real-world applications suggest that data without further structure apart from sparsity can only be learned with an unrealistically large number of samples, coming along with an infeasible computational effort. An adequate level of structure in the data should therefore be presumed. Moreover, the experiments indicate that a sub-sampled incoherent unitary matrix with randomized column signs matches the actually necessary structure-preserving properties better than a Gaussian sampled measurement matrix.

The theory of SVMs holds for infinite data dimensions. Unfortunately, analogously to the case of compressed sensing, compressed learning cannot be raised to the general case of infinite data dimensions when basing results on the restricted isometry property or the Johnson-Lindenstrauss property. Since these properties are merely sufficient but not necessary conditions for problem conservation, future work should try to formulate the essential properties that render problem conservation possible, and adapt more general structure-preserving measurement methods to them. The current state of compressed learning only captures the case of approximately linearly separable data. The, from the machine learning perspective, natural extension to non-linear models with kernel methods cannot be treated so far. Instead, one would be coerced to estimate the feature map induced by the kernel and apply it to the data, which can be subject to a computational pitfall.

When, for a concrete application, the natural way to access data is via taking linear measurements, as is the case when generating images from physical measurements, compressed learning provides the theoretical background to handle such data directly. Although compressed learning is far from fully developed, applications which match its prerequisites can strongly benefit from using its results.


References

[Bar07] R. G. Baraniuk. Compressive sensing. Signal Processing Magazine, IEEE, 24(4):118-121, July 2007.

[BCDH08] Richard G. Baraniuk, Volkan Cevher, Marco F. Duarte, and Chinmay Hegde. Model-based compressive sensing. CoRR, abs/0808.3572, 2008.

[BD09] Thomas Blumensath and Michael Davies. Compressed sensing signal models - to infinity and beyond? In Laurent Fesquet and Bruno Torrésani, editors, SAMPTA'09, International Conference on Sampling Theory and Applications, Special Session on compressed sensing, Marseille, France, 2009.

[BDDW08] Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin. A simple proof of the restricted isometry property for random matrices. Constructive Approximation, 28:253-263, 2008.

[Bis07] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer New York, 1st ed. 2006, corr. 2nd printing edition, October 2007.

[Can06] Emmanuel J. Candès. Compressive sampling. Proceedings of the International Congress of Mathematicians, 2006.

[CEN10] Emmanuel J. Candès, Yonina C. Eldar, and Deanna Needell. Compressed sensing with coherent and redundant dictionaries. CoRR, abs/1005.2613, 2010.

[CJK12] Robert Calderbank, Sina Jafarpour, and Jeremy Kent. Finding needles in compressed haystacks. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, chapter 10. Cambridge University Press, May 2012.

[CR06] Emmanuel J. Candès and Justin K. Romberg. Quantitative robust uncertainty principles and optimally sparse decompositions. Foundations of Computational Mathematics, 6(2):227-254, 2006.

[CRT06a] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489-509, 2006.

[CRT06b] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, August 2006.


[CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Trans. Inform. Theory, 51(12):4203-4215, 2005.

[CT06] Emmanuel J. Candès and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406-5425, 2006.

[Dav10] M. Davenport. Random observations on random observations: Sparse signal acquisition and processing. PhD thesis, Rice University, August 2010.

[DDEK12] M. A. Davenport, M. F. Duarte, Y. C. Eldar, and G. Kutyniok. Introduction to compressed sensing. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications, chapter 1. Cambridge University Press, May 2012.

[DG03] Sanjoy Dasgupta and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60-65, 2003.

[Don06] David L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52:1289-1306, 2006.

[HA11] Anders C. Hansen and Ben Adcock. Generalized sampling and infinite-dimensional compressed sensing. Magnetic Resonance Imaging, pages 1-47, 2011.

[JL84] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189-206. American Mathematical Society, 1984.

[Kot33] V. Kotelnikov. On the carrying capacity of the ether and wire in telecommunications. Izd. Red. Upr. Svyazi RKKA, 35:181-194, 1933.

[KW10] Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. CoRR, abs/1009.0744, 2010.

[NT09] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.

[Nyq28] H. Nyquist. Certain topics in telegraph transmission theory. Trans. AIEE, 47:617-644, 1928.


[Rau10] H. Rauhut. Compressive sensing and structured random matrices. In M. Fornasier, editor, Theoretical Foundations and Numerical Methods for Sparse Recovery, volume 9 of Radon Series Comp. Appl. Math., pages 1-92. deGruyter, 2010. Corrected version (April 4, 2011): non-commutative Khintchine inequality for Rademacher chaos (Theorem 6.22) corrected, proof of Theorem 9.3 corrected, proof of Lemma 8.2 corrected.

[RSV08] Holger Rauhut, K. Schnass, and Pierre Vandergheynst. Compressed sensing and redundant dictionaries. IEEE Trans. Inform. Theory, 54(5):2210-2219, 2008.

[RV08] M. Rudelson and R. Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025-1045, 2008.

[Sha49] C. Shannon. Communication in the presence of noise. Proc. Institute of Radio Engineers, 37(1):10-21, 1949.

[SSSs08] K. Sridharan, N. Srebro, and S. Shalev-Shwartz. Fast rates for regularized objectives. In Neural Information Processing Systems, 2008.

[Ste05] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. Information Theory, IEEE Transactions on, 51(1):128-142, 2005.

[Wal99] R. H. Walden. Analog-to-digital converter survey and analysis. IEEE Journal on Selected Areas in Communications, 17(4):539-550, April 1999.

[Whi15] E. Whittaker. On the functions which are represented by the expansions of the interpolation theory. Proc. Royal Soc. Edinburgh, 35:181-194, 1915.
