# Regret Minimization in Games with Incomplete Information (CFR)

[View code on Github](https://github.com/labmlai/annotated_deep_learning_paper_implementations/tree/master/labml_nn/cfr/__init__.py)

The paper Regret Minimization in Games with Incomplete Information introduces counterfactual regret and shows how minimizing counterfactual regret through self-play can be used to reach a Nash equilibrium. The algorithm is called Counterfactual Regret Minimization (CFR).

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games introduces Monte Carlo Counterfactual Regret Minimization (MCCFR), where we sample from the game tree and estimate the regrets.

We tried to keep our Python implementation easy to understand, like a tutorial. We run it on a very simple imperfect-information game called Kuhn poker.


Introduction

We implement Monte Carlo Counterfactual Regret Minimization (MCCFR) with chance sampling (CS). On each iteration it explores part of the game tree by trying all player actions, but samples chance events. Chance events are things like dealing cards; each is sampled only once per iteration. Then, for each action, it calculates the regret of following the current strategy instead of taking that action, and updates the strategy based on these regrets for the next iteration using regret matching. Finally, it computes the average of the strategies over all iterations, which is very close to the Nash equilibrium if we run enough iterations.

We will first introduce the mathematical notation and theory.

Player

A player is denoted by $i \in N$, where $N$ is the set of players.

History

History $h \in H$ is a sequence of actions, including chance events, and $H$ is the set of all histories.

$Z \subseteq H$ is the set of terminal histories (game over).

Action

Action $a$, where $A(h) = \{a : (h, a) \in H\}$ is the set of actions available after a non-terminal history $h \in H$.

Information Set $I_i$

Information set $I_i \in \mathcal{I}_i$ for player $i$ is similar to a history $h \in H$ but only contains the actions visible to player $i$. That is, the history $h$ will contain actions/events such as cards dealt to the opposing player, while $I_i$ will not have them.

$\mathcal{I}_i$ is known as the information partition of player $i$.

An information set $I$ is the set of all histories $h \in I$ that belong to it; i.e. all those histories look the same in the eyes of the player.
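For instance, in Kuhn poker, player 2 cannot see player 1's card, so histories that differ only in that card fall into the same information set. A minimal sketch (the tuple representation and helper function are hypothetical, not part of the implementation below):

```python
# Hypothetical history representation for Kuhn poker:
# (player 1's card, player 2's card, betting actions...).
# Player 2 sees only her own card and the public actions.

def p2_info_set_key(history):
    """Project a history onto what player 2 can observe."""
    p1_card, p2_card, *bets = history
    return (p2_card, tuple(bets))

h1 = ('K', 'Q', 'bet')  # player 1 holds K and bets
h2 = ('J', 'Q', 'bet')  # player 1 holds J and bets

# Both histories look identical to player 2,
# so they belong to the same information set
assert p2_info_set_key(h1) == p2_info_set_key(h2)
```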

Strategy

The strategy of player $i$, $\sigma_i \in \Sigma_i$, is a distribution over actions $A(I_i)$, where $\Sigma_i$ is the set of all strategies for player $i$. The strategy on the $t$-th iteration is denoted by $\sigma_i^t$.

A strategy is defined as the probability of taking an action $a$ at a given information set $I$:

$\sigma_i(I)(a)$

$\sigma$ is the strategy profile, which consists of the strategies of all players: $\sigma_1, \sigma_2, \ldots$

$\sigma_{-i}$ is the set of strategies of all players except $\sigma_i$.

Probability of History

$\pi^\sigma(h)$ is the probability of reaching the history $h$ with strategy profile $\sigma$. $\pi^\sigma_{-i}(h)$ is the probability of reaching $h$ without player $i$'s contribution; i.e. as if player $i$ took the actions to follow $h$ with a probability of $1$.

$\pi^\sigma_i(h)$ is the probability of reaching $h$ with only player $i$'s contribution. That is, $\pi^\sigma(h) = \pi^\sigma_i(h) \pi^\sigma_{-i}(h)$.

The probability of reaching an information set $I$ is $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$.

Utility (Payoff)

The terminal utility is the utility (or payoff) of a player $i$ for a terminal history $h$:

$u_i(h)$ where $h \in Z$

$u_i(\sigma)$ is the expected utility (payoff) for player $i$ with strategy profile $\sigma$:

$$u_i(\sigma) = \sum_{h \in Z} u_i(h) \pi^\sigma(h)$$
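As a quick sanity check of this definition, we can compute the expected utility for matching pennies (a toy zero-sum game, not part of this implementation), where $\pi^\sigma(z)$ is just the product of the two players' action probabilities:

```python
from itertools import product

# u_1(σ) = Σ_{z∈Z} u_1(z) π^σ(z) on matching pennies:
# player 1 wins +1 if the coins match, loses 1 otherwise.
def u1(z):
    a1, a2 = z
    return 1.0 if a1 == a2 else -1.0

def expected_utility(sigma1, sigma2):
    # π^σ(z) factors into each player's probability of their own action
    return sum(u1(z) * sigma1[z[0]] * sigma2[z[1]]
               for z in product('HT', 'HT'))

uniform = {'H': 0.5, 'T': 0.5}
# Uniform strategies give expected utility 0
assert abs(expected_utility(uniform, uniform)) < 1e-9
# Always playing H against a biased opponent gains 0.9 - 0.1 = 0.8
assert abs(expected_utility({'H': 1.0, 'T': 0.0}, {'H': 0.9, 'T': 0.1}) - 0.8) < 1e-9
```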

Nash Equilibrium

Nash equilibrium is a state where none of the players can increase their expected utility (or payoff) by changing their strategy alone.

For two players, Nash equilibrium is a strategy profile where

$$u_1(\sigma) \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \quad \text{and} \quad u_2(\sigma) \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$$

ϵ-Nash equilibrium is,

$$u_1(\sigma) + \epsilon \ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \quad \text{and} \quad u_2(\sigma) + \epsilon \ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')$$

Regret Minimization

Regret is the utility (or payoff) that the player gave up by not following the optimal strategy or taking the best action.

The average overall regret for player $i$ is the average regret of not following the optimal strategy over all $T$ iterations:

$$R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right)$$

where $\sigma^t$ is the strategy profile of all players in iteration $t$, and

$(\sigma_i^*, \sigma_{-i}^t)$

is the strategy profile $\sigma^t$ with player $i$'s strategy replaced with $\sigma_i^*$.

The average strategy is the average of the strategies followed in each round, for all $I \in \mathcal{I}$, $a \in A(I)$:

$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
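A small worked example of this reach-weighted average (the numbers are made up for illustration):

```python
# Two iterations' strategies at one information set I, each weighted by
# player i's own reach probability π_i^{σ^t}(I) in that iteration.
reach = [0.5, 1.0]                        # π_i^{σ^t}(I) for t = 1, 2
strategies = [{'bet': 0.2, 'pass': 0.8},  # σ^1(I)
              {'bet': 0.8, 'pass': 0.2}]  # σ^2(I)

total_reach = sum(reach)
avg_strategy = {
    a: sum(p * s[a] for p, s in zip(reach, strategies)) / total_reach
    for a in ('bet', 'pass')
}

# (0.5 * 0.2 + 1.0 * 0.8) / 1.5 = 0.6 for 'bet'
assert abs(avg_strategy['bet'] - 0.6) < 1e-9
assert abs(avg_strategy['pass'] - 0.4) < 1e-9
```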

Recall that $R_i^T$ is the mean regret of not playing the optimal strategy.

If $R_i^T < \epsilon$ for both players, then $\bar{\sigma}^T$ is a $2\epsilon$-Nash equilibrium:

$$R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right) = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T u_i(\sigma_i^*, \sigma_{-i}^t) - \frac{1}{T} \sum_{t=1}^T u_i(\sigma^t) < \epsilon$$

Since $u_1 = -u_2$ (it is a zero-sum game), when we add $R_1^T$ and $R_2^T$ the second terms cancel out:

$$2\epsilon > \frac{1}{T} \max_{\sigma_1^* \in \Sigma_1} \sum_{t=1}^T u_1(\sigma_1^*, \sigma_{-1}^t) + \frac{1}{T} \max_{\sigma_2^* \in \Sigma_2} \sum_{t=1}^T u_2(\sigma_2^*, \sigma_{-2}^t)$$

The average of utilities over a set of strategies is equal to the utility of the average strategy.

$$\frac{1}{T} \sum_{t=1}^T u_i(\sigma^t) = u_i(\bar{\sigma}^T)$$

Therefore,

$$2\epsilon > \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T) + \max_{\sigma_2^* \in \Sigma_2} u_2(\sigma_2^*, \bar{\sigma}_{-2}^T)$$

From the definition of $\max$,

$$\max_{\sigma_2^* \in \Sigma_2} u_2(\sigma_2^*, \bar{\sigma}_{-2}^T) \ge u_2(\bar{\sigma}^T) = -u_1(\bar{\sigma}^T)$$

Then,

$$2\epsilon > \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T) - u_1(\bar{\sigma}^T)$$

$$u_1(\bar{\sigma}^T) + 2\epsilon > \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T)$$

This is a $2\epsilon$-Nash equilibrium. A similar proof works for games with more than two players.

So we need to minimize $R_i^T$ to get close to a Nash equilibrium.

Counterfactual regret

The counterfactual value $v_i(\sigma, I)$ is the expected utility for player $i$ if player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$):

$$v_i(\sigma, I) = \sum_{z \in Z_I} \pi_{-i}^\sigma(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

where $Z_I$ is the set of terminal histories reachable from $I$, and $z[I]$ is the prefix of $z$ up to $I$. $\pi^\sigma(z[I], z)$ is the probability of reaching $z$ from $z[I]$.

Immediate counterfactual regret is,

$$R_{i,\mathrm{imm}}^T(I) = \max_{a \in A(I)} R_{i,\mathrm{imm}}^T(I, a)$$

where

$$R_{i,\mathrm{imm}}^T(I, a) = \frac{1}{T} \sum_{t=1}^T \left( v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \right)$$

and $\sigma|_{I \to a}$ is the strategy profile $\sigma$ with the modification of always taking action $a$ at information set $I$.

The paper proves that (Theorem 3),

$$R_i^T \le \sum_{I \in \mathcal{I}} R_{i,\mathrm{imm}}^{T,+}(I) \quad \text{where} \quad R_{i,\mathrm{imm}}^{T,+}(I) = \max\left(R_{i,\mathrm{imm}}^T(I), 0\right)$$

Regret Matching

The strategy is calculated using regret matching.

The regret for each information set and action pair $R_i^T(I, a)$ is maintained:

$$r_i^t(I, a) = v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I)$$

$$R_i^T(I, a) = \frac{1}{T} \sum_{t=1}^T r_i^t(I, a)$$

and the strategy is calculated with regret matching:

$$\sigma_i^{T+1}(I)(a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

where $R_i^{T,+}(I, a) = \max\left(R_i^T(I, a), 0\right)$
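Regret matching is simple to implement on its own; here is a standalone sketch of the rule above (the same logic appears in `calculate_strategy` below):

```python
def regret_matching(regret):
    """Compute σ^{T+1}(I) from cumulative regrets R^T(I, a)."""
    # R^{T,+}(I, a) = max(R^T(I, a), 0)
    positive = {a: max(r, 0.0) for a, r in regret.items()}
    total = sum(positive.values())
    if total > 0:
        # play each action proportionally to its positive regret
        return {a: r / total for a, r in positive.items()}
    # no positive regret: fall back to the uniform strategy
    return {a: 1.0 / len(regret) for a in regret}

assert regret_matching({'bet': 3.0, 'pass': 1.0}) == {'bet': 0.75, 'pass': 0.25}
assert regret_matching({'bet': -2.0, 'pass': -1.0}) == {'bet': 0.5, 'pass': 0.5}
```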

The paper Regret Minimization in Games with Incomplete Information proves that if the strategy is selected according to the equation above, $R_i^T$ shrinks proportionally to $\frac{1}{\sqrt{T}}$, and therefore reaches an $\epsilon$-Nash equilibrium.

Monte Carlo CFR (MCCFR)

Computing $r_i^t(I, a)$ requires expanding the full game tree on each iteration.

The paper Monte Carlo Sampling for Regret Minimization in Extensive Games shows we can sample from the game tree and estimate the regrets.

$\mathcal{Q} = \{Q_1, \ldots, Q_r\}$ is a set of subsets of $Z$ ($Q_j \subseteq Z$), where we look at only a single block $Q_j$ in an iteration. The union of all subsets spans $Z$ ($Q_1 \cup \ldots \cup Q_r = Z$). $q_j$ is the probability of picking block $Q_j$.

$q(z)$ is the probability of picking $z$ in the current iteration; i.e. $q(z) = \sum_{j : z \in Q_j} q_j$, the sum of $q_j$ over all blocks containing $z$.

Then we get the sampled counterfactual value for block $j$:

$$\tilde{v}(\sigma, I | j) = \sum_{z \in Q_j} \frac{1}{q(z)} \pi_{-i}^\sigma(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

The paper shows that

$$\mathbb{E}_{j \sim q_j} \left[ \tilde{v}(\sigma, I | j) \right] = v_i(\sigma, I)$$

with a simple proof.
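We can verify the unbiasedness claim numerically on a toy partition, where $w(z)$ stands in for $\pi_{-i}^\sigma(z[I]) \pi^\sigma(z[I], z) u_i(z)$ (all numbers are made up):

```python
# Toy terminal values w(z) and a partition of Z into blocks Q_j
w = {'z1': 2.0, 'z2': -1.0, 'z3': 0.5, 'z4': 4.0}
blocks = [['z1', 'z2'], ['z3'], ['z4']]
q = [0.5, 0.25, 0.25]  # q_j, the probability of sampling each block

# With a partition, q(z) = q_j for the unique block containing z,
# so the block-j estimate is Σ_{z∈Q_j} w(z) / q(z)
estimates = [sum(w[z] / q[j] for z in block) for j, block in enumerate(blocks)]

# E_{j~q_j}[estimate] equals the full (unsampled) sum over Z
expectation = sum(qj * est for qj, est in zip(q, estimates))
assert abs(expectation - sum(w.values())) < 1e-12
```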

Therefore we can sample a part of the game tree and calculate the regrets. We calculate an estimate of the regrets:

$$\tilde{r}_i^t(I, a) = \tilde{v}_i(\sigma^t|_{I \to a}, I) - \tilde{v}_i(\sigma^t, I)$$

and use that to update $R_i^T(I, a)$ and calculate the strategy $\sigma_i^{T+1}(I)(a)$ on each iteration. Finally, we calculate the overall average strategy $\bar{\sigma}_i^T(I)(a)$.

Here is a Kuhn Poker implementation to try CFR on.

Let's dive into the code!

```python
from typing import NewType, Dict, List, Callable, cast

from labml import monit, tracker, logger, experiment
from labml.configs import BaseConfigs, option
```

A player $i \in N$, where $N$ is the set of players.

```python
Player = NewType('Player', int)
```

Action $a$, where $A(h) = \{a : (h, a) \in H\}$ and $h \in H$ is a non-terminal history.

```python
Action = NewType('Action', str)
```

History

History $h \in H$ is a sequence of actions, including chance events, and $H$ is the set of all histories.

This class should be extended with game-specific logic.

```python
class History:
```

Whether it's a terminal history, i.e. the game is over; $h \in Z$.

```python
    def is_terminal(self):
        raise NotImplementedError()
```

Utility of player $i$ for a terminal history; $u_i(h)$ where $h \in Z$.

```python
    def terminal_utility(self, i: Player) -> float:
        raise NotImplementedError()
```

Get the current player, denoted by $P(h)$, where $P$ is known as the player function.

If $P(h) = c$, the current event is a chance event $c$; something like dealing cards, or opening common cards in poker.

```python
    def player(self) -> Player:
        raise NotImplementedError()
```

Whether the next step is a chance step, such as dealing a new card; $P(h) = c$.

```python
    def is_chance(self) -> bool:
        raise NotImplementedError()
```

Sample a chance event when $P(h) = c$.

```python
    def sample_chance(self) -> Action:
        raise NotImplementedError()
```

Add an action to the history.

```python
    def __add__(self, action: Action):
        raise NotImplementedError()
```

Get the information set key for the current player.

```python
    def info_set_key(self) -> str:
        raise NotImplementedError()
```

Create a new information set for the current player.

```python
    def new_info_set(self) -> 'InfoSet':
        raise NotImplementedError()
```

Human-readable representation.

```python
    def __repr__(self):
        raise NotImplementedError()
```

Information Set $I_i$

```python
class InfoSet:
```

Unique key identifying the information set.

```python
    key: str
```

$\sigma_i$, the strategy of player $i$.

```python
    strategy: Dict[Action, float]
```

The total regret of not taking each action $a \in A(I_i)$:

$$r_i^t(I, a) = v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I)$$

$$R_i^T(I, a) = \frac{1}{T} \sum_{t=1}^T r_i^t(I, a)$$

We maintain $T R_i^T(I, a)$ instead of $R_i^T(I, a)$ since the $\frac{1}{T}$ term cancels out anyway when computing the strategy $\sigma_i^{T+1}(I)(a)$.

```python
    regret: Dict[Action, float]
```

We maintain the cumulative strategy $\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)$ to compute the overall average strategy:

$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

```python
    cumulative_strategy: Dict[Action, float]
```

Initialize.

```python
    def __init__(self, key: str):
        self.key = key
        self.regret = {a: 0 for a in self.actions()}
        self.cumulative_strategy = {a: 0 for a in self.actions()}
        self.calculate_strategy()
```

Actions $A(I_i)$.

```python
    def actions(self) -> List[Action]:
        raise NotImplementedError()
```

Load the information set from a saved dictionary.

```python
    @staticmethod
    def from_dict(data: Dict[str, any]) -> 'InfoSet':
        raise NotImplementedError()
```

Save the information set to a dictionary.

```python
    def to_dict(self):
        return {
            'key': self.key,
            'regret': self.regret,
            'average_strategy': self.cumulative_strategy,
        }
```

Load data from a saved dictionary.

```python
    def load_dict(self, data: Dict[str, any]):
        self.regret = data['regret']
        self.cumulative_strategy = data['average_strategy']
        self.calculate_strategy()
```

Calculate strategy

Calculate the current strategy using regret matching:

$$\sigma_i^{T+1}(I)(a) = \begin{cases} \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

where $R_i^{T,+}(I, a) = \max\left(R_i^T(I, a), 0\right)$

```python
    def calculate_strategy(self):
```

$R_i^{T,+}(I, a) = \max\left(R_i^T(I, a), 0\right)$

```python
        regret = {a: max(r, 0) for a, r in self.regret.items()}
```

$\sum_{a' \in A(I)} R_i^{T,+}(I, a')$

```python
        regret_sum = sum(regret.values())
```

If $\sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0$:

```python
        if regret_sum > 0:
```

$$\sigma_i^{T+1}(I)(a) = \frac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}$$

```python
            self.strategy = {a: r / regret_sum for a, r in regret.items()}
```

Otherwise,

```python
        else:
```

$|A(I)|$

```python
            count = len(list(a for a in self.regret))
```

$$\sigma_i^{T+1}(I)(a) = \frac{1}{|A(I)|}$$

```python
            self.strategy = {a: 1 / count for a, r in regret.items()}
```

Get average strategy

$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

```python
    def get_average_strategy(self):
```

$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)$

```python
        cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}
```

$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) = \sum_{a \in A(I)} \sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)$$

```python
        strategy_sum = sum(cum_strategy.values())
```

If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$:

```python
        if strategy_sum > 0:
```

$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

```python
            return {a: s / strategy_sum for a, s in cum_strategy.items()}
```

Otherwise,

```python
        else:
```

$|A(I)|$

```python
            count = len(list(a for a in cum_strategy))
```

$$\bar{\sigma}_i^T(I)(a) = \frac{1}{|A(I)|}$$

```python
            return {a: 1 / count for a, r in cum_strategy.items()}
```

Human-readable representation.

```python
    def __repr__(self):
        raise NotImplementedError()
```

Counterfactual Regret Minimization (CFR) Algorithm

We do chance sampling (CS), where all the chance events (nodes) are sampled and all other events (nodes) are explored.

We can ignore the term $q(z)$ since it's the same for all terminal histories when doing chance sampling, and it cancels out when calculating the strategy (it is common to the numerator and denominator).

```python
class CFR:
```

$\mathcal{I}$, the set of all information sets.

```python
    info_sets: Dict[str, InfoSet]
```

  • create_new_history creates a new empty history
  • epochs is the number of iterations to train on, $T$
  • n_players is the number of players

```python
    def __init__(self, *,
                 create_new_history: Callable[[], History],
                 epochs: int,
                 n_players: int = 2):
```

```python
        self.n_players = n_players
        self.epochs = epochs
        self.create_new_history = create_new_history
```

A dictionary for $\mathcal{I}$, the set of all information sets.

```python
        self.info_sets = {}
```

Tracker for analytics.

```python
        self.tracker = InfoSetTracker()
```

Returns the information set $I$ of the current player for a given history $h$.

```python
    def _get_info_set(self, h: History):
        info_set_key = h.info_set_key()
        if info_set_key not in self.info_sets:
            self.info_sets[info_set_key] = h.new_info_set()
        return self.info_sets[info_set_key]
```

Walk Tree

This function walks the game tree.

  • h is the current history $h$
  • i is the player $i$ whose regrets we are computing
  • pi_i is $\pi_i^{\sigma^t}(h)$
  • pi_neg_i is $\pi_{-i}^{\sigma^t}(h)$

It returns the expected utility for the history $h$:

$$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$$

where $Z_h$ is the set of terminal histories with prefix $h$.

While walking the tree, it updates the total regrets $R_i^T(I, a)$.

```python
    def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:
```

If it's a terminal history $h \in Z$, return the terminal utility $u_i(h)$.

```python
        if h.is_terminal():
            return h.terminal_utility(i)
```

If it's a chance event $P(h) = c$, sample an action and go to the next step.

```python
        elif h.is_chance():
            a = h.sample_chance()
            return self.walk_tree(h + a, i, pi_i, pi_neg_i)
```

Get the current player's information set for $h$.

```python
        I = self._get_info_set(h)
```

To store $\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$:

```python
        v = 0
```

To store $\sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) u_i(z)$ for each action $a \in A(h)$:

```python
        va = {}
```

Iterate through all actions.

```python
        for a in I.actions():
```

If the current player is $i$:

$$\pi_i^{\sigma^t}(h + a) = \pi_i^{\sigma^t}(h) \sigma_i^t(I)(a), \qquad \pi_{-i}^{\sigma^t}(h + a) = \pi_{-i}^{\sigma^t}(h)$$

```python
            if i == h.player():
                va[a] = self.walk_tree(h + a, i, pi_i * I.strategy[a], pi_neg_i)
```

Otherwise:

$$\pi_i^{\sigma^t}(h + a) = \pi_i^{\sigma^t}(h), \qquad \pi_{-i}^{\sigma^t}(h + a) = \pi_{-i}^{\sigma^t}(h) \sigma^t(I)(a)$$

```python
            else:
                va[a] = self.walk_tree(h + a, i, pi_i, pi_neg_i * I.strategy[a])
```

$$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z) = \sum_{a \in A(I)} \left[ \sigma^t(I)(a) \sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) u_i(z) \right]$$

```python
            v = v + I.strategy[a] * va[a]
```

If the current player is $i$, update the cumulative strategies and total regrets.

```python
        if h.player() == i:
```

Update cumulative strategies:

$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \sigma^t(I)(a) = \sum_{t=1}^T \left[ \sum_{h \in I} \pi_i^{\sigma^t}(h) \sigma^t(I)(a) \right]$$

```python
            for a in I.actions():
                I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
```

$$r_i^t(I, a) = v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) = \pi_{-i}^{\sigma^t}(h) \left( \sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) u_i(z) - \sum_{z \in Z_h} \pi^{\sigma^t}(h, z) u_i(z) \right)$$

$$T R_i^T(I, a) = \sum_{t=1}^T r_i^t(I, a)$$

```python
            for a in I.actions():
                I.regret[a] += pi_neg_i * (va[a] - v)
```

Update the strategy $\sigma^t(I)(a)$.

```python
            I.calculate_strategy()
```

Return the expected utility for player $i$: $\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$

```python
        return v
```

Iteratively update $\sigma^t(I)(a)$

This updates the strategies for $T$ iterations.

```python
    def iterate(self):
```

Loop for epochs iterations.

```python
        for t in monit.iterate('Train', self.epochs):
```

Walk the tree and update regrets for each player.

```python
            for i in range(self.n_players):
                self.walk_tree(self.create_new_history(), cast(Player, i), 1, 1)
```

Track data for analytics.

```python
            tracker.add_global_step()
            self.tracker(self.info_sets)
            tracker.save()
```

Print the information sets.

```python
        logger.inspect(self.info_sets)
```

Information set tracker

This is a small helper class to track data from the information sets.

```python
class InfoSetTracker:
```

Set tracking indicators.

```python
    def __init__(self):
        tracker.set_histogram(f'strategy.*')
        tracker.set_histogram(f'average_strategy.*')
        tracker.set_histogram(f'regret.*')
```

Track the data from all information sets.

```python
    def __call__(self, info_sets: Dict[str, InfoSet]):
        for I in info_sets.values():
            avg_strategy = I.get_average_strategy()
            for a in I.actions():
                tracker.add({
                    f'strategy.{I.key}.{a}': I.strategy[a],
                    f'average_strategy.{I.key}.{a}': avg_strategy[a],
                    f'regret.{I.key}.{a}': I.regret[a],
                })
```

Configurable CFR module

```python
class CFRConfigs(BaseConfigs):
    create_new_history: Callable[[], History]
    epochs: int = 1_00_000
    cfr: CFR = 'simple_cfr'
```

Initialize the CFR algorithm.

```python
@option(CFRConfigs.cfr)
def simple_cfr(c: CFRConfigs):
    return CFR(create_new_history=c.create_new_history,
               epochs=c.epochs)
```
