
Q-Learning and Dynamic Treatment Regimes

S.A. Murphy Univ. of Michigan IMS/Bernoulli: July, 2004

Outline
Dynamic Treatment Regimes
Optimal Q-functions and Q-learning
The Problem & Goal
Finite Sample Bounds
Outline of Proof
Shortcomings and Open Problems

Dynamic Treatment Regimes


---- Multi-stage decision problems: repeated decisions are made over time on each patient.

---- Used in the management of Addictions, Mental Illnesses, HIV infection and Cancer

k Decisions
Observations made prior to the t-th decision: $O_t$. Action at the t-th decision: $A_t$.
Write $\bar{O}_t = (O_1, \ldots, O_t)$ and $\bar{A}_t = (A_1, \ldots, A_t)$.

Primary Outcome: $Y$, observed at the end of the trajectory.

A dynamic treatment regime is a vector of decision rules, one per decision:
$$d = (d_1, \ldots, d_k), \qquad d_t(\bar{o}_t, \bar{a}_{t-1}) \in \{\text{possible actions at decision } t\}.$$

If the regime is implemented then $A_t = d_t(\bar{O}_t, \bar{A}_{t-1})$ for $t = 1, \ldots, k$.

Goal: Estimate the decision rules that maximize the mean outcome $E_d[Y]$.

Data: Data set of n finite horizon trajectories, each with randomized actions;
$p_t(a_t \mid \bar{o}_t, \bar{a}_{t-1})$ are the randomization probabilities.

Optimal Q-functions and Q-learning:

Definition:
$E_d$ denotes expectation when the actions are chosen according to the regime $d$.

Q-functions:
The Q-functions for the optimal regime are given recursively by
$$Q_k^*(\bar{o}_k, \bar{a}_k) = E\big[\,Y \mid \bar{O}_k = \bar{o}_k,\ \bar{A}_k = \bar{a}_k\,\big]$$
and, for $t = k-1, \ldots, 1$,
$$Q_t^*(\bar{o}_t, \bar{a}_t) = E\Big[\max_{a_{t+1}} Q_{t+1}^*(\bar{O}_{t+1}, \bar{a}_t, a_{t+1}) \,\Big|\, \bar{O}_t = \bar{o}_t,\ \bar{A}_t = \bar{a}_t\Big].$$

Q-functions:
The optimal regime is given by
$$d_t^*(\bar{o}_t, \bar{a}_{t-1}) = \arg\max_{a_t} Q_t^*(\bar{o}_t, \bar{a}_{t-1}, a_t), \qquad t = 1, \ldots, k.$$
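For example, with k = 2 decisions the recursion and the resulting optimal rules read as follows (a worked special case of the displays above, nothing beyond them):

$$Q_2^*(o_1, a_1, o_2, a_2) = E\big[\,Y \mid O_1 = o_1,\ A_1 = a_1,\ O_2 = o_2,\ A_2 = a_2\,\big],$$
$$Q_1^*(o_1, a_1) = E\Big[\max_{a_2} Q_2^*(o_1, a_1, O_2, a_2) \,\Big|\, O_1 = o_1,\ A_1 = a_1\Big],$$
$$d_2^*(o_1, a_1, o_2) = \arg\max_{a_2} Q_2^*(o_1, a_1, o_2, a_2), \qquad d_1^*(o_1) = \arg\max_{a_1} Q_1^*(o_1, a_1).$$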

Q-learning:
Given a model $Q_k(\bar{o}_k, \bar{a}_k; \theta_k)$ for the Q-functions, minimize
$$E_n\Big[\big(Y - Q_k(\bar{O}_k, \bar{A}_k; \theta_k)\big)^2\Big]$$
over $\theta_k$, where $E_n$ denotes the average over the n trajectories.

Set $\hat{Q}_k = Q_k(\,\cdot\,; \hat{\theta}_k)$.

Q-learning:
For each $t = k-1, \ldots, 1$, minimize
$$E_n\Big[\Big(\max_{a_{t+1}} \hat{Q}_{t+1}(\bar{O}_{t+1}, \bar{A}_t, a_{t+1}) - Q_t(\bar{O}_t, \bar{A}_t; \theta_t)\Big)^2\Big]$$
over $\theta_t$, and set $\hat{Q}_t = Q_t(\,\cdot\,; \hat{\theta}_t)$, and so on.

Q-Learning:
The estimated regime is given by
$$\hat{d}_t(\bar{o}_t, \bar{a}_{t-1}) = \arg\max_{a_t} \hat{Q}_t(\bar{o}_t, \bar{a}_{t-1}, a_t), \qquad t = 1, \ldots, k.$$
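To make the two regression steps concrete, here is a minimal sketch of this Q-learning procedure for k = 2 stages with binary actions and linear working models fit by ordinary least squares. The simulated data-generating model, the feature choices, and all names (features1, features2, d1_hat, d2_hat, ...) are illustrative assumptions, not taken from the slides or the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                         # n trajectories, k = 2 decisions

# Toy simulated trajectories with randomized binary actions (illustrative assumption).
O1 = rng.normal(size=n)                          # observation before decision 1
A1 = rng.integers(0, 2, size=n)                  # action 1, randomized with p = 1/2
O2 = 0.5 * O1 + (A1 - 0.5) + rng.normal(size=n)  # observation before decision 2
A2 = rng.integers(0, 2, size=n)                  # action 2, randomized with p = 1/2
Y  = O2 + (A2 - 0.5) * O2 + rng.normal(size=n)   # primary outcome

def lstsq_fit(X, y):
    """Least-squares coefficients for a linear working model."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def features2(O1, A1, O2, A2):
    """Stage-2 working model: linear in the history, action 2, and an interaction."""
    return np.column_stack([np.ones_like(O2), O1, A1, O2, A2, A2 * O2])

def features1(O1, A1):
    """Stage-1 working model: linear in O1, A1, and their interaction."""
    return np.column_stack([np.ones_like(O1), O1, A1, A1 * O1])

# Stage k = 2: regress Y on the stage-2 working model.
theta2 = lstsq_fit(features2(O1, A1, O2, A2), Y)

# Pseudo-outcome for stage 1: plug-in max over a2 of the fitted Q2.
Q2_hat = lambda a2: features2(O1, A1, O2, np.full(n, a2)) @ theta2
V2 = np.maximum(Q2_hat(0), Q2_hat(1))

# Stage 1: regress the pseudo-outcome on the stage-1 working model.
theta1 = lstsq_fit(features1(O1, A1), V2)

# Estimated regime: argmax of the fitted Q-functions at each stage.
def d1_hat(o1):
    q = [features1(np.atleast_1d(o1), np.full(1, a)) @ theta1 for a in (0, 1)]
    return int(np.argmax(q))

def d2_hat(o1, a1, o2):
    q = [features2(np.atleast_1d(o1), np.full(1, a1),
                   np.atleast_1d(o2), np.full(1, a)) @ theta2 for a in (0, 1)]
    return int(np.argmax(q))

print("estimated stage-1 action at o1 = 1.0:", d1_hat(1.0))
print("estimated stage-2 action at (o1, a1, o2) = (1.0, 1, 0.5):", d2_hat(1.0, 1, 0.5))
```

The linear feature vectors here play the role of the models $Q_t(\,\cdot\,;\theta_t)$ above; any other working model (e.g., trees or kernels) could be substituted in the two regression steps.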

The Problem & Goal:


Most learning (e.g. estimation) methods utilize a model for all or parts of the multivariate distribution of the data.

The model implicitly constrains the class of possible decision rules in the dynamic treatment regime; call this constrained class of decision rules the approximation class.

The data vector has many components (it is high dimensional), thus the model is likely incorrect; view the model and the constrained class of decision rules as approximation classes.

Goal: Given a learning method and approximation classes, assess the ability of the learning method to produce the best decision rules in the class. Ideally, construct an upper bound for
$$\max_{d} E_d[Y] - E_{\hat{d}}[Y],$$
where $\hat{d}$ is the estimator of the regime and $E_{\hat{d}}$ denotes expectation when the actions are chosen according to the rule $\hat{d}$.

Goal: Given a learning method, a model for the Q-functions, and an approximation class of decision rules, construct a finite sample upper bound for
$$\max_{d} E_d[Y] - E_{\hat{d}}[Y].$$
This upper bound should be composed of quantities that are minimized in the learning method.
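Schematically (only the shape is indicated here; the exact constants and norms are in the paper), this quantity is bounded by an approximation error term plus an estimation error term, the two pieces treated on the following slides:
$$\max_{d} E_d[Y] - E_{\hat{d}}[Y] \;\le\; \underbrace{\text{approximation error of the classes}}_{\text{does not shrink with } n} \;+\; \underbrace{\text{estimation error}}_{\text{shrinks as } n \text{ grows}}.$$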

The learning method is Q-learning.

Finite Sample Bounds:


Primary Assumptions:

(1)

for L>1.

(2) Number of possible actions is finite.

Definition:
E, without a subscript, denotes expectation when the actions are randomized.

Results:
Approximation Error:

The minimum is over

with

Define

The estimation error involves the complexity of this space.

Estimation Error:
For any $\delta > 0$, with probability at least $1 - \delta$,

for n satisfying

If

is finite then n needs only to satisfy

that is,

Outline of Proof:
The Q-functions for a regime $d$ are given by

Proof Outline
(1)

Proof Outline
(2)

It turns out that also

Proof Outline

(3)

Shortcomings and Open Problems

Recall Estimation Error:
For any $\delta > 0$, with probability at least $1 - \delta$,

for n satisfying

Open Problems

Is there a learning method that can learn the best decision rule in an approximation class given a data set of n finite horizon trajectories?

Sieve estimators or regularized estimators?

Dealing with high dimensional X: feature extraction, feature selection.

This seminar can be found at:


http://www.stat.lsa.umich.edu/~samurphy/seminars/ims_bernoulli_0704.ppt

The paper can be found at:

http://www.stat.lsa.umich.edu/~samurphy/papers/Qlearning.pdf

samurphy@umich.edu

Recall Proof Outline


(2)

It turns out that also

Recall Proof Outline


(1)
