
Scribe: Eric Goldlust April 9, 2008

Bound on the Loss of the Widrow-Hoff Algorithm

Last time, we were analyzing the performance of the Widrow-Hoff algorithm, specified as follows:

1: w_1 = 0
2: for t = 1 to T do
3:   get example x_t ∈ R^n
4:   predict response ŷ_t = w_t · x_t
5:   observe response y_t ∈ R
6:   set w_{t+1} = w_t − η (w_t · x_t − y_t) x_t
7: end for
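In Python, this loop can be sketched as follows (NumPy is assumed; the function name and the choice of learning rate η are illustrative, not part of the notes):

```python
import numpy as np

def widrow_hoff(X, y, eta=0.1):
    """Run the Widrow-Hoff updates over examples (x_t, y_t) in sequence.

    Returns the cumulative square loss L_WH and the final weight vector.
    """
    T, n = X.shape
    w = np.zeros(n)                 # step 1: w_1 = 0
    loss = 0.0
    for t in range(T):
        x_t = X[t]                  # step 3: get example x_t in R^n
        y_hat = w @ x_t             # step 4: predict y_hat_t = w_t . x_t
        loss += (y_hat - y[t]) ** 2 # accumulate the square loss
        w = w - eta * (w @ x_t - y[t]) * x_t  # step 6: Widrow-Hoff update
    return loss, w
```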

We defined the loss under Widrow-Hoff to be L_WH = ∑_{t=1}^T (ŷ_t − y_t)², and the loss under a fixed vector u to be L_u = ∑_{t=1}^T (u · x_t − y_t)², and partially proved the following theorem about L_WH, which does not rely on any distributional assumptions about the data.

Theorem. Assume that for every t, we know that ||x_t||_2 ≤ 1. Then:

L_WH ≤ min_u [ L_u / (1 − η) + ||u||_2² / η ].

In the previous lecture, we reduced the proof of this theorem to the proof of the following

lemma, which we now prove:

Lemma. Let ||x_t||_2 ≤ 1. Define the potential function Φ_t = ||w_t − u||_2². Define the time-t signed error of Widrow-Hoff by ℓ_t = w_t · x_t − y_t and the signed error from a fixed u by g_t = u · x_t − y_t. Then, for every t and u:

Φ_{t+1} − Φ_t ≤ −η ℓ_t² + (η / (1 − η)) g_t².

Proof. Let Δ_t = −η ℓ_t x_t, so that the update rule reads w_{t+1} = w_t + Δ_t. Then:

Φ_{t+1} − Φ_t = ||w_{t+1} − u||_2² − ||w_t − u||_2²
             = ||(w_t − u) + Δ_t||_2² − ||w_t − u||_2²
             = ||Δ_t||_2² + 2 (w_t − u) · Δ_t
             = η² ℓ_t² ||x_t||_2² − 2η ℓ_t x_t · (w_t − u).

Since ||x_t||_2² ≤ 1 and x_t · (w_t − u) = ℓ_t − g_t, this gives

Φ_{t+1} − Φ_t ≤ η² ℓ_t² − 2η ℓ_t² + 2η ℓ_t g_t
             = η² ℓ_t² − 2η ℓ_t² + 2 (√(η(1 − η)) ℓ_t) (√(η/(1 − η)) g_t).

We next use the real algebraic inequality¹ ab ≤ (a² + b²)/2 for the case where a = √(η(1 − η)) ℓ_t and b = √(η/(1 − η)) g_t. This gives

Φ_{t+1} − Φ_t ≤ η² ℓ_t² − 2η ℓ_t² + η(1 − η) ℓ_t² + (η / (1 − η)) g_t²
             = [η² − 2η + η(1 − η)] ℓ_t² + (η / (1 − η)) g_t²
             = −η ℓ_t² + (η / (1 − η)) g_t²,

which completes the proof of the lemma, and in turn the theorem.
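As a quick numerical sanity check, the per-step inequality of the lemma can be confirmed on random data with ||x_t||_2 ≤ 1 (the values of η, u, and the data below are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
eta = 0.3
u = rng.normal(size=4)                    # fixed comparison vector
w = rng.normal(size=4)                    # current Widrow-Hoff weights
violations = 0
for _ in range(1000):
    x = rng.normal(size=4)
    x = x / max(np.linalg.norm(x), 1.0)   # enforce ||x_t||_2 <= 1
    y = rng.normal()
    ell = w @ x - y                       # signed error of Widrow-Hoff
    g = u @ x - y                         # signed error of the comparator u
    phi = np.sum((w - u) ** 2)            # potential Phi_t before the update
    w = w - eta * ell * x                 # Widrow-Hoff update
    phi_next = np.sum((w - u) ** 2)       # potential Phi_{t+1}
    # Lemma: Phi_{t+1} - Phi_t <= -eta * ell^2 + eta/(1-eta) * g^2
    if phi_next - phi > -eta * ell**2 + eta / (1 - eta) * g**2 + 1e-9:
        violations += 1
print(violations)  # the lemma holds pointwise, so this prints 0
```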

Generalization: Varying the Loss Function and the Norm

When we originally derived the Widrow-Hoff update rule, we tried to find a value of w_{t+1} that minimized a linear combination of the loss of w_{t+1} on (x_t, y_t) and the norm ||w_{t+1} − w_t||_2². Specifically, we wanted to minimize η (w_{t+1} · x_t − y_t)² + ||w_{t+1} − w_t||_2². We can try to generalize this objective function by replacing either the loss term or the norm term with a more general function.

General loss function, L_2 distance

If our objective function is given by

η L(w_{t+1}, x_t, y_t) + ||w_{t+1} − w_t||_2²,

then we get the Gradient Descent (GD) update rule:

w_{t+1} = w_t − η ∇_w L(w_{t+1}, x_t, y_t) ≈ w_t − η ∇_w L(w_t, x_t, y_t).
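A minimal sketch of this rule, assuming the caller supplies the gradient of the loss (the function names here are illustrative):

```python
import numpy as np

def gd_step(w, x, y, grad_loss, eta=0.1):
    """One online gradient-descent update: w_{t+1} = w_t - eta * grad L(w_t, x_t, y_t).

    grad_loss(w, x, y) must return the gradient of the loss with respect to w.
    """
    return w - eta * grad_loss(w, x, y)

def square_loss_grad(w, x, y):
    # Gradient of the square loss (w . x - y)^2 with respect to w.
    return 2 * (w @ x - y) * x
```

With the square loss, gd_step recovers the Widrow-Hoff update up to the constant factor 2, which can be absorbed into η.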

Square loss, Relative Entropy distance

If our objective function is given by

η (w_{t+1} · x_t − y_t)² + RE(w_{t+1} || w_t),

then we find the following update rule, which is specified component-wise:

w_{t+1,i} = w_{t,i} exp{−η (w_{t+1} · x_t − y_t) x_{t,i}} / Z_t ≈ w_{t,i} exp{−η (w_t · x_t − y_t) x_{t,i}} / Z_t,

where the Z_t are normalization factors. Note the parallel with the original Widrow-Hoff rule: in that case, we added Δ_t to w_t. In this case, we multiply component-wise with the exponentiation of the components of Δ_t and then normalize.
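A component-wise sketch of this multiplicative update, keeping w on the probability simplex and using the approximate form with w_t in the exponent (function name and η are illustrative):

```python
import numpy as np

def eg_square_loss_step(w, x, y, eta=0.1):
    """Multiplicative update for square loss: scale each component of w_t by
    exp(-eta * (w_t . x - y) * x_i), then renormalize so the weights sum to 1."""
    v = w * np.exp(-eta * (w @ x - y) * x)
    return v / v.sum()   # v.sum() plays the role of the normalization factor Z_t
```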

¹ Follows from a² − 2ab + b² = (a − b)² ≥ 0.

General loss function, Relative Entropy distance

If our objective function is given by

η L(w_{t+1}, x_t, y_t) + RE(w_{t+1} || w_t),

then we get the Exponentiated Gradient (EG) update rule:

w_{t+1,i} = w_{t,i} exp{−η ∇_i L(w_{t+1}, x_t, y_t)} / Z_t ≈ w_{t,i} exp{−η ∇_i L(w_t, x_t, y_t)} / Z_t.

There is a performance bound for the EG update rule that looks similar to the one we proved for WH. In order to make an apples-to-apples comparison, we first rewrite the original WH bound as follows:

∀t: ||x_t||_2 ≤ 1  ⟹  L_WH ≤ min_{u: ||u||_2 = 1} [a L_u + b]  for some a, b.

For EG with square loss, there is the following similar-looking bound:

∀t: ||x_t||_∞ ≤ 1  ⟹  L_EG ≤ min_{u: ||u||_1 = 1} [a L_u + b ln N]  for some a, b.

We can now add an entry to our recurring dichotomy between additive and multiplicative algorithms:

additive updates                      multiplicative updates
L_2 / L_2                             L_∞ / L_1
Support Vector Machines               AdaBoost
Perceptron                            Winnow
Gradient Descent / Widrow-Hoff        Exponentiated Gradient

Using Online Algorithms in a Batch Setting

We have analyzed both the batch setting and the online setting. In some sense, the results in the online setting are stronger because they do not rely on statistical assumptions about the data (for example, that the examples are i.i.d.). We now analyze the use of these online algorithms on batch data, where these statistical assumptions do hold. The result will be simple and fast algorithms with generalization bounds that come for free from the analysis we did in the online setting.

The Batch Setting

Given S = (x_1, y_1), ..., (x_m, y_m), assume that for every i, (x_i, y_i) ∼ D. Let there be a new test point (x, y) that is also distributed according to D. We want to find a linear predictor v with a low expected loss (also called risk or true risk), defined by:

R_v = E_{(x,y)∼D} [(v · x − y)²].

When we say that we want the expected loss to be low, we mean this in relation to the best possible value for any u, i.e. it should be low relative to min_u R_u.

One reasonable way to accomplish this is to run Widrow-Hoff on the dataset, treating it as though the examples were arriving online. This would yield a sequence of weight vectors w_1 = 0, w_2, ..., w_m. We could then return the final hypothesis v = w_m. Unfortunately, if we return w_m, the analysis turns out to be too difficult. We can, however, analyze the performance if we return v = (1/m) ∑_{t=1}^m w_t. In this case, we can prove the following theorem:

Theorem.

E_S [R_v] ≤ min_u [ R_u / (1 − η) + ||u||_2² / (η m) ],

where the second term tends to 0 as m → ∞. Here, the expectation is over the random training set S. We proceed in steps, starting with three lemmas.

Lemma (1).

(v · x − y)² ≤ (1/m) ∑_{t=1}^m (w_t · x − y)².

Proof.

(v · x − y)² = (((1/m) ∑_{t=1}^m w_t) · x − y)²
            = ((1/m) ∑_{t=1}^m (w_t · x − y))²
            ≤ (1/m) ∑_{t=1}^m (w_t · x − y)²  (by convexity of f(z) = z²).
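Since Lemma 1 is just Jensen's inequality for f(z) = z², it is easy to confirm numerically (the random w_t, x, and y below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(10, 5))        # weight vectors w_1, ..., w_m (m=10, n=5)
x = rng.normal(size=5)
y = rng.normal()
v = W.mean(axis=0)                  # averaged predictor v = (1/m) sum_t w_t
lhs = (v @ x - y) ** 2              # loss of the average
rhs = np.mean((W @ x - y) ** 2)     # average of the individual losses
# Lemma 1: loss of the average never exceeds the average of the losses.
```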

Lemma (2).

E[(u · x_t − y_t)²] = E[(u · x − y)²].

Proof. This statement is true because (x_t, y_t) and (x, y) are identically distributed.

Lemma (3).

E[(w_t · x_t − y_t)²] = E[(w_t · x − y)²].

Proof. w_t is chosen before (x_t, y_t) is observed, so (x_t, y_t) and (x, y) are identically distributed given w_t. Note that this is not true, for example, given w_{t+1}.

We are now ready to prove the theorem.


Proof. For any fixed u, we have:

E_S [R_v] = E[(v · x − y)²]
         ≤ E[(1/m) ∑_{t=1}^m (w_t · x − y)²]                              (by Lemma 1)
         = (1/m) ∑_{t=1}^m E[(w_t · x − y)²]
         = (1/m) ∑_{t=1}^m E[(w_t · x_t − y_t)²]                          (by Lemma 3)
         = (1/m) E[∑_{t=1}^m (w_t · x_t − y_t)²]
         = (1/m) E[L_WH]
         ≤ (1/m) E[∑_{t=1}^m (u · x_t − y_t)² / (1 − η) + ||u||_2² / η]   (previously shown in the WH analysis)
         = (1/m) [∑_{t=1}^m E[(u · x_t − y_t)²] / (1 − η) + ||u||_2² / η]
         = (1/m) [∑_{t=1}^m E[(u · x − y)²] / (1 − η) + ||u||_2² / η]     (by Lemma 2)
         = R_u / (1 − η) + ||u||_2² / (η m).

Since this holds for any u, the theorem follows.
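The online-to-batch conversion analyzed here, running Widrow-Hoff over the sample and returning the averaged iterate, can be sketched as follows (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def wh_online_to_batch(X, y, eta=0.1):
    """Run Widrow-Hoff over the m training examples and return the averaged
    iterate v = (1/m) * sum_t w_t, whose expected risk the theorem bounds."""
    m, n = X.shape
    w = np.zeros(n)                             # w_1 = 0
    w_sum = np.zeros(n)
    for t in range(m):
        w_sum += w                              # accumulate w_t before updating
        w = w - eta * (w @ X[t] - y[t]) * X[t]  # Widrow-Hoff update
    return w_sum / m
```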

Probability Modeling

So far, we have analyzed situations where we deal with (x, y) pairs, where y could be real or categorical. We now consider the situation where we receive only x and our goal is to model its distribution. Let x ∼ P. The goal is to estimate P. This task is called Probability Modeling or Density Estimation.

One example of where this can be useful is in speech recognition. In order to decide if a speaker has just said "I sat on a chair" or "I fat on a chair", a system could use prior estimates of the relative likelihood of these two phrases in English in order to decide that "I sat on a chair" was more likely.

We might also want to perform density estimation for classification problems. If we wanted to estimate human gender based on height, we could build separate density estimates (possibly just by estimating Gaussian parameters) for men and women, and then use Bayes' rule to decide which gender was more likely under that probability model given the height of the test person. This is called the generative approach.
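A toy sketch of this generative classifier for the height example (all the class parameters and priors below are made up purely for illustration):

```python
import math

def gaussian_pdf(x, mu, sigma):
    # Density of a univariate Gaussian with mean mu and standard deviation sigma.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify_height(h, params, priors):
    """Generative approach: fit a Gaussian per class, then pick the class with
    the larger posterior P(class | h), proportional to P(h | class) * P(class)
    by Bayes' rule."""
    scores = {c: gaussian_pdf(h, mu, sigma) * priors[c]
              for c, (mu, sigma) in params.items()}
    return max(scores, key=scores.get)

# Made-up per-class parameters: (mean, std) of height in cm, and uniform priors.
params = {"male": (178.0, 7.0), "female": (165.0, 6.5)}
priors = {"male": 0.5, "female": 0.5}
```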

By contrast, the discriminative approach makes no attempt to model the distribution of the data, but rather just tries to find an effective classification rule. In this case, we would just estimate a threshold height above which to classify a test person as male.

An advantage of the discriminative approach is that it is direct and simple and does not require assumptions about the distribution of the data. The generative approach, however, has the advantage that expert knowledge of the distribution can sometimes lead to higher performance with less training data.

One example where the generative approach is effective is in the detection of fraudulent calling-card phone calls. One could build, for every customer, a probability distribution over phone calls (frequency, duration, location, etc.), and then build similar models for fraudsters. Given a test phone call, one could use the probabilities under these two distributions to estimate the probability that the call is fraudulent. In this case, the generative approach can work well.
