
Correlated Q Learning Replication

Anthony Singhavong
asinghavong@gatech.edu

1 Introduction
In 'Correlated Q-Learning', Amy Greenwald and Keith Hall introduce four variants of correlated-Q
(CE-Q) learning algorithms and show how they are applied to different Markov games. In particular,
the researchers present the challenge of learning equilibrium policies in games
with multiple agents. The overall objective of this paper is to replicate one particular Markov game
detailed in Figure 3 of the original paper. Concretely, there are four graphs, one per variant,
that demonstrate the different non-deterministic Q-learning algorithms applied to a zero-sum game.

Figure 1: Original graphs from Greenwald and Hall illustrating convergence behavior for the four
variants discussed in [1], as applied to the Soccer environment detailed in Section 1.2.

1.1 Multi-Agent Q Learning & Markov Games

A stochastic game generalizes a Markov Decision Process (MDP). Such a game can be
represented by a tuple Γ = (N, S, A, P, R), where N is a finite set of players,
S is a finite set of states, and A represents the pure actions available to the i-th player at state s_i. Finally, P
represents the transition probabilities and R represents the rewards at state s under action a. An MDP
is essentially a one-player Markov game whose values satisfy the Bellman equation [2], where Q(s, a) is a discounted
sum of the immediate reward obtained at state s for taking action a plus the expected value of the successor state. In multi-agent Markov games,
player i's Q-values are defined over states and joint action vectors rather than state-action pairs.
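
To make this distinction concrete, the sketch below (our own illustration, not code from [1]) contrasts the shape of a single-agent Q-table with a two-player table indexed by the joint action vector. The state count of 112 assumes the 2x4 grid of Section 1.2 with the two players on distinct cells and either player holding the ball; it is only an example size.

    import numpy as np

    # Hypothetical sizes: 8 cells x 7 remaining cells x 2 possible ball holders = 112 states,
    # and 5 actions (N, E, W, S, Stick) per player.
    NUM_STATES, NUM_ACTIONS = 112, 5

    # Single-agent (MDP) Q-learning: one value per (state, own action) pair.
    q_single = np.zeros((NUM_STATES, NUM_ACTIONS))

    # Multi-agent Markov game: player i's value depends on the joint action
    # vector (a_i, a_-i), so the table gains one axis per additional player.
    q_joint = np.zeros((NUM_STATES, NUM_ACTIONS, NUM_ACTIONS))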

1.2 Soccer Environment

The Soccer environment is a 2x4 grid in which players obtain rewards for the actions they take.
Each player can move in any direction, subject to some constraints. The action space is North, East,
West, South, and Stick (where Stick keeps the player in its current position). The players move
until one of them reaches a goal or scores an own goal (see Figure 2 for more details).
A player reaching its goal is rewarded 100 points while the other player receives -100 points;
the rewards are reversed if a player scores an own goal.
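
Below is a minimal sketch of the action set and the zero-sum reward rule described above. The names (ACTIONS, SoccerState, reward_pair) and the row/column encoding are our own, and the collision and ball-exchange rules from [1] are omitted for brevity.

    from dataclasses import dataclass

    ROWS, COLS = 2, 4                              # the 2x4 grid
    ACTIONS = {"N": (-1, 0), "S": (1, 0),          # (row, col) offsets; Stick stays put
               "E": (0, 1), "W": (0, -1), "Stick": (0, 0)}

    @dataclass(frozen=True)
    class SoccerState:
        pos_a: tuple        # (row, col) of Player A
        pos_b: tuple        # (row, col) of Player B
        a_has_ball: bool    # True if Player A currently holds the ball

    def reward_pair(scorer_is_a: bool) -> tuple:
        # Zero-sum payoff: the scoring side receives +100 and the other side -100;
        # an own goal simply flips which side counts as the scorer.
        return (100, -100) if scorer_is_a else (-100, 100)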

2 Experiments & Assumptions


Figure 2: The Soccer environment detailed in Section 1.2, further illustrated with the allowed directions
of traversal. The westernmost squares are reserved for Player A's goals, while Player B's goals are in the
easternmost squares. The A and B mark the players; in this particular scenario B has the ball.

All experiments were run without an ε (exploration) parameter and over 10 × 10^5 iterations. In every replication
case, both players started in the same placement as in Figure 2 above, and Player B always started
with the ball. Because no epsilon policy was implemented, agents acted randomly without a fixed seed to
better simulate a stochastic environment.

2.1 Q Learning

Greenwald and Hall reinforce that Q-learning is not an adequate strategy for demonstrating equilibrium
in this particular zero-sum game. They argue that in Markov games, agents using Q-learning only
maximize their own rewards, which leads to unreliable convergence patterns. Below is the update
equation detailed in the paper, as translated into our implementation.

Q_i(s, a) = (1 − γ) R_i(s, a) + γ Σ_{s'} P(s' | s, a) V_i(s')

Figure 3: On the left, the original Q-learning graph compared to our implementation on the right.
Our experiments were performed over 10 × 10^5 iterations, as the paper suggests. See the
implementation section (2.1.1) for more analysis. The data points represent Player A's error values
at each iteration.

2.1.1 Implementation
In the paper, there was no mention of an exact strategy for hyperparameters. The only given input
was γ = 0.90, while α was left to the implementer. We did not implement an
epsilon-greedy approach but found decent results with both players performing random actions. The
most dramatic change occurred when decaying α aggressively, and we ultimately set the initial α to 0.99.
Because there was no exact definition of the decay schedule, we implemented a simple multiplicative decay,
α ← max(0.001, α × 0.999995). The 0.001 floor and 0.999995 factor were determined by trial and error:
anything larger than 0.999995 decayed too slowly, and anything smaller altered the graph's
trend by shifting the dip toward earlier iterations. As observed in the original paper, the Q-learner
does not converge, but its error does decrease due to the aforementioned learning-rate decay schedule.
This observation supports the original findings even though the graphs differ from one another. One
possible explanation for the discrepancy is the lack of an exact α. Because trial and error
was used, a better and more precise parameter search could have improved the replication
results. Alternatively, an ε-greedy policy could have been deployed to further diversify the off-policy
agents during gameplay.
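
For concreteness, here is a sketch of the tabular update and the decay schedule described above. Variable names and the integer state/action encoding are our own, and the error measure shown is one plausible choice rather than necessarily the one used in [1].

    import numpy as np

    GAMMA = 0.90
    ALPHA_DECAY, ALPHA_FLOOR = 0.999995, 0.001

    def q_step(Q, s, a, r, s_next, alpha):
        # Sample-based version of the update above: V(s') is approximated by
        # max_a' Q(s', a') for an ordinary (self-interested) Q-learner.
        target = (1 - GAMMA) * r + GAMMA * np.max(Q[s_next])
        old = Q[s, a]
        Q[s, a] = (1 - alpha) * old + alpha * target
        err = abs(Q[s, a] - old)                          # change in the tracked Q-value
        alpha = max(ALPHA_FLOOR, alpha * ALPHA_DECAY)     # decay schedule from 2.1.1
        return alpha, err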

2.2 Friend Q

Friend-Q implements an update strategy similar to Q-learning but takes into account the actions of the
other agent. In his 2001 paper, Littman introduced this technique to demonstrate coordination
equilibria between agents: each player assumes the other is cooperating, so the state value is taken as the
maximum Q-value over the joint action space. Unlike Q-learning, Friend-Q does in fact converge, since both agents are
effectively acting collaboratively.
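
A minimal sketch of that value computation, assuming player i's table is indexed by (state, own action, opponent action) as in Section 1.1 (our own code, not taken from [1] or [2]):

    import numpy as np

    def friend_value(Q_i, s):
        # Friend-Q: the other player is assumed to cooperate, so the state value
        # is the maximum of Q_i over the whole joint action space at state s.
        return np.max(Q_i[s])

This V_i(s') simply replaces the max over the player's own actions in the Q-learning target shown earlier.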

2.2.1 Implementation

Similar to the implementation details in 2.1.1, γ = 0.90 was used and no assumptions about α were given.
The final α was set to the same value as for Q-learning, but the changes caused by increasing
or decreasing α were not nearly as dramatic as for the standard Q-learner. This is likely because both
agents are assisting each other to maximize rewards, which tends to skew all results toward convergence
regardless of which α is used. To verify this, we ran the same experiment with different
values of α, which produced similar (if slightly shifted) results. The α values tested
were (0.2, 0.75, 0.9, 0.99). The main difference between the two graphs can be observed in
the curve as the line approaches the x-axis: the original shows a slight curve in values right
before convergence, while the replication attempt is simply a straight drop to convergence. This is
again likely due to not knowing the exact learning rate that would exhibit the same behavior.

Figure 4: On the left, the original Friend-Q graph compared to our implementation on the right. Our
experiments were performed over 10 × 10^5 iterations, as the paper suggests. See the implementation
section (2.2.1) for more analysis. The data points represent Player A's error values at each iteration.

2.3 Foe Q & CE Q

Unlike the previous varieties of Q-learning algorithms, both Foe-Q and correlated-equilibrium Q-learning
(CE-Q) can be solved via linear programming (LP). Though they display different characteristics, they
can be modeled similarly and achieve similar convergence patterns. In Foe-Q, the two agents act
adversarially toward one another (the opposite of Friend-Q described in 2.2). The researchers support the
idea that agents optimize with respect to one another while ultimately taking their own payoff into account.
In particular, a probability distribution over the joint action space is used to identify the
best next action. We did not implement CE-Q due to time constraints, but we discuss its theoretical
implications for this problem in the concluding section.
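
To illustrate the LP step used for Foe-Q, the sketch below computes one player's maxmin (minimax) mixed strategy from a single state's joint-action Q slice. It uses scipy.optimize.linprog, and the formulation, maximize v subject to Σ_a π(a) Q[a, o] ≥ v for every opponent action o, is the standard minimax linear program; this is our own sketch rather than the authors' code.

    import numpy as np
    from scipy.optimize import linprog

    def maxmin_strategy(Qs):
        """Qs[a_self, a_opponent]: one state's joint-action payoffs for the maximizing player.
        Returns (value v, mixed strategy pi) solving max_pi min_o sum_a pi[a] * Qs[a, o]."""
        n = Qs.shape[0]
        # Decision variables x = [pi_1 ... pi_n, v]; linprog minimizes, so minimize -v.
        c = np.zeros(n + 1)
        c[-1] = -1.0
        # For each opponent action o: v - sum_a pi[a] * Qs[a, o] <= 0.
        A_ub = np.hstack([-Qs.T, np.ones((n, 1))])
        b_ub = np.zeros(n)
        # Probabilities sum to 1; v is unbounded, probabilities are nonnegative.
        A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
        b_eq = np.array([1.0])
        bounds = [(0, None)] * n + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1], res.x[:n]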

2.3.1 Implementation
We modeled Foe-Q through the minimax formulation used for problems like Rock, Paper, Scissors.
Namely, the update requires well-defined constraints over a probability distribution in order to
maximize the worst-case value. In the update step of this algorithm, we model a linear program that is
solved (minimized by the solver) at each optimization iteration. The same γ = 0.90 from the paper was used and α
was explored as before. Similarly to Friend-Q, we found evidence that altering α and the decay
schedule did not dramatically change the convergence behavior. We confirmed this by testing the same
α values (though we omitted 0.9 and 0.75) in order to cover two extremes while reducing total run
time. Both the original implementation and our experiment show similar convergence patterns, but
ours seems to converge more slowly. Due to the heavy computation and time it takes to run an
LP model at every step, we were not able to test more α values to see whether the results would have improved.
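
As a quick sanity check of the kind this section relies on, the maxmin routine sketched in Section 2.3 can be applied to the Rock, Paper, Scissors payoff matrix (the ±1 payoffs here are the usual convention for that game, not data from [1]); it should return a value near zero and a roughly uniform strategy.

    import numpy as np

    rps = np.array([[ 0, -1,  1],
                    [ 1,  0, -1],
                    [-1,  1,  0]], dtype=float)
    value, strategy = maxmin_strategy(rps)            # from the sketch in Section 2.3
    print(round(value, 3), np.round(strategy, 3))     # expect ~0.0 and ~[0.333, 0.333, 0.333]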

Figure 5: On the left, the original Foe-Q graph compared to our implementation on the right. Our
experiments were performed over 10 × 10^5 iterations, as the paper suggests. See the implementation
section (2.3.1) for more analysis. The data points represent Player A's error values at each iteration.

3 Conclusions
Linear programs handle such convex optimization problems very well, which is why the Foe-Q algorithm learns a
minimax equilibrium policy in this Soccer environment (as shown in Figure 5). Unfortunately, we ran
out of time to properly implement a CE-Q variant, which would hopefully have yielded the same results as
the Foe-Q variant. Because both are solvable via linear programming, their convergence patterns
are more robust than those of the Q-learning strategies that preceded them (standard Q-learning and Friend-Q).
The most difficult challenge faced during the implementation was understanding the linear
programming principles involved. By the time we understood the basics well enough to implement
Foe-Q, there was not much time left for CE-Q. As the original researchers suggest, the Foe-Q and
CE-Q graphs, seen side by side, should be nearly exact replicas because the two algorithms learn similar
Q-values in this game. Given more time, we would have liked to attempt a vectorized (matrix-based)
implementation of both Foe-Q and CE-Q to speed up computation for rapid hyperparameter experimentation.
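
For completeness, below is a hedged sketch of the linear program a utilitarian CE-Q update would solve at each state: maximize the players' summed expected value subject to the standard correlated-equilibrium rationality constraints. This is our own formulation of the textbook CE program, not an implementation from [1].

    import numpy as np
    from scipy.optimize import linprog

    def ce_distribution(Q1, Q2):
        """Q1, Q2: (n, n) arrays with Q_i[a1, a2] for one state. Returns a joint
        distribution sigma over action pairs maximizing the summed expected value."""
        n = Q1.shape[0]
        c = -(Q1 + Q2).flatten()                      # linprog minimizes, so negate
        A_ub, b_ub = [], []
        # Player 1 rationality: following recommendation a1 beats any deviation a1p.
        for a1 in range(n):
            for a1p in range(n):
                if a1p == a1:
                    continue
                row = np.zeros((n, n))
                row[a1, :] = Q1[a1p, :] - Q1[a1, :]   # <= 0  <=>  E[Q1 | a1] >= E[Q1 | a1p]
                A_ub.append(row.flatten()); b_ub.append(0.0)
        # Player 2 rationality (player 2 picks the column).
        for a2 in range(n):
            for a2p in range(n):
                if a2p == a2:
                    continue
                row = np.zeros((n, n))
                row[:, a2] = Q2[:, a2p] - Q2[:, a2]
                A_ub.append(row.flatten()); b_ub.append(0.0)
        res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
                      A_eq=np.ones((1, n * n)), b_eq=[1.0],
                      bounds=[(0, None)] * (n * n))
        return res.x.reshape(n, n)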

References
[1] Amy Greenwald and Keith Hall. (2003) Correlated Q-Learning. ICML 2003. https://pdfs.semanticscholar.org/c8c7/d184a34035f91045a9cb3e75ca0064fcc221.pdf?_ga=2.201585113.973142509.1532319722-483509346.1532319722
[2] Michael Littman. (2001) Friend-or-Foe Q-learning in General-Sum Games. https://www.researchgate.net/profile/Michael_Littman2/publication/2933305_Friend-or-Foe_Q-learning_in_General-Sum_Games/links/54b66cb80cf24eb34f6d19dc/Friend-or-Foe-Q-learning-in-General-Sum-Games.pdf?origin=publication_detail
[3] Wikipedia contributors. (July 2018) Q-learning. Wikipedia. https://en.wikipedia.org/w/index.php?title=Q-learning&oldid=848343136
[4] Pedro Valdez. (March 2018) ml_soccer. GitHub repository. https://github.com/pdvelez/ml_soccer/blob/master/soccer.py
