[View code on Github](https://github.com/labmlai/annotated_deep_learning_paper_implementations/tree/master/labml_nn/cfr/__init__.py)
The paper Regret Minimization in Games with Incomplete Information introduces counterfactual regret and how minimizing counterfactual regret through self-play can be used to reach a Nash equilibrium. The algorithm is called Counterfactual Regret Minimization (CFR).
The paper Monte Carlo Sampling for Regret Minimization in Extensive Games introduces Monte Carlo Counterfactual Regret Minimization (MCCFR), where we sample from the game tree and estimate the regrets.
We tried to keep our Python implementation easy-to-understand like a tutorial. We run it on a very simple imperfect information game called Kuhn poker.
Twitter thread
We implement Monte Carlo Counterfactual Regret Minimization (MCCFR) with chance sampling (CS). On each iteration it explores part of the game tree by trying all player actions, while sampling chance events. Chance events are things like dealing cards; they are sampled once per iteration. It then calculates, for each action, the regret of following the current strategy instead of taking that action, and updates the strategy based on these regrets for the next iteration, using regret matching. Finally, it computes the average of the strategies over all iterations, which is very close to the Nash equilibrium if we ran enough iterations.
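To make the flow concrete before the math, here is a rough outline of one training run (informal pseudocode written as comments; the concrete implementation is `CFR.iterate` and `CFR.walk_tree` further down this page):

```python
# for t in range(epochs):
#     for i in players:
#         walk the game tree from the empty history:
#             - at a chance node, sample a single outcome and recurse
#             - at a player node, recurse into every action
#             - at player i's information sets, accumulate the counterfactual regret
#               of each action, accumulate the reach-weighted current strategy, and
#               refresh the strategy from the positive regrets (regret matching)
# the average of the accumulated strategies approaches a Nash equilibrium
```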
We will first introduce the mathematical notation and theory.
A player is denoted by $i \in N$, where $N$ is the set of players.
History $h \in H$ is a sequence of actions including chance events, and $H$ is the set of all histories.
$Z \subseteq H$ is the set of terminal histories (game over).
Action $a$, $A(h) = \{a: (h, a) \in H\}$ where $h \in H$ is a non-terminal history.
Information set $I_i \in \mathcal{I}_i$ for player $i$ is similar to a history $h \in H$ but only contains the actions visible to player $i$. That is, the history $h$ will contain actions/events such as cards dealt to the opposing player, while $I_i$ will not have them.
$\mathcal{I}_i$ is known as the information partition of player $i$.
$h \in I$ is the set of all histories that belong to a given information set; i.e. all those histories look the same to the player.
Strategy of player $i$, $\sigma_i \in \Sigma_i$, is a distribution over actions $A(I_i)$, where $\Sigma_i$ is the set of all strategies for player $i$. The strategy on the $t$-th iteration is denoted by $\sigma_i^t$.
A strategy is defined as the probability of taking an action $a$ at a given information set $I$,
$$\sigma_i(I)(a)$$
$\sigma$ is the strategy profile, which consists of the strategies of all players $\sigma_1, \sigma_2, \ldots$
$\sigma_{-i}$ is the strategies of all players except $\sigma_i$.
$\pi^\sigma(h)$ is the probability of reaching the history $h$ with the strategy profile $\sigma$. $\pi^\sigma_{-i}(h)$ is the probability of reaching $h$ without player $i$'s contribution; i.e. player $i$ took the actions to follow $h$ with a probability of $1$.
$\pi^\sigma_i(h)$ is the probability of reaching $h$ with only player $i$'s contribution. That is,
$$\pi^\sigma(h) = \pi^\sigma_i(h) \, \pi^\sigma_{-i}(h)$$
The probability of reaching an information set $I$ is
$$\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$$
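As a small illustration (a made-up, Kuhn-poker-like history, not from the paper): suppose $h$ consists of a chance event that deals the cards with probability $p_c$, followed by player $1$ betting with probability $\sigma_1(I_1)(\text{bet})$. Since chance's contribution is counted as part of the other players,

$$\pi^\sigma(h) = p_c \, \sigma_1(I_1)(\text{bet}), \qquad \pi^\sigma_1(h) = \sigma_1(I_1)(\text{bet}), \qquad \pi^\sigma_{-1}(h) = p_c$$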
The terminal utility is the utility (or payoff) of a player $i$ for a terminal history $h$:
$$u_i(h) \quad \text{where } h \in Z$$
$u_i(\sigma)$ is the expected utility (payoff) for player $i$ with the strategy profile $\sigma$:
$$u_i(\sigma) = \sum_{h \in Z} u_i(h) \, \pi^\sigma(h)$$
Nash equilibrium is a state where none of the players can increase their expected utility (or payoff) by changing their strategy alone.
For two players, Nash equilibrium is a strategy profile where
$$\begin{aligned}
u_1(\sigma) &\ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \\
u_2(\sigma) &\ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')
\end{aligned}$$
An $\epsilon$-Nash equilibrium is a strategy profile where
$$\begin{aligned}
u_1(\sigma) + \epsilon &\ge \max_{\sigma_1' \in \Sigma_1} u_1(\sigma_1', \sigma_2) \\
u_2(\sigma) + \epsilon &\ge \max_{\sigma_2' \in \Sigma_2} u_2(\sigma_1, \sigma_2')
\end{aligned}$$
Regret is the utility (or payoff) that the player didn't get because she didn't follow the optimal strategy or take the best action.
Average overall regret for player $i$ is the average regret of not following the optimal strategy in all $T$ rounds of iterations:
$$R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T \Big( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \Big)$$
where $\sigma^t$ is the strategy profile of all players in iteration $t$, and
$$(\sigma_i^*, \sigma_{-i}^t)$$
is the strategy profile $\sigma^t$ with player $i$'s strategy replaced with $\sigma_i^*$.
The average strategy is the average of strategies followed in each round, for all $I \in \mathcal{I}$, $a \in A(I)$:
$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
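A quick made-up example of the weighting: if over $T = 2$ iterations player $i$ reaches $I$ with probabilities $\pi_i^{\sigma^1}(I) = 1.0$ and $\pi_i^{\sigma^2}(I) = 0.5$, and plays $a$ with probabilities $\sigma^1(I)(a) = 0.2$ and $\sigma^2(I)(a) = 0.8$, then

$$\bar{\sigma}_i^2(I)(a) = \frac{1.0 \times 0.2 + 0.5 \times 0.8}{1.0 + 0.5} = \frac{0.6}{1.5} = 0.4$$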
That is the mean regret of not playing with the optimal strategy.
If $R_i^T < \epsilon$ for all players, then $\bar{\sigma}_i^T(I)(a)$ is a $2\epsilon$-Nash equilibrium.
$$\begin{aligned}
R_i^T &= \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T \Big( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \Big) \\
&= \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^T u_i(\sigma_i^*, \sigma_{-i}^t) - \frac{1}{T} \sum_{t=1}^T u_i(\sigma^t) < \epsilon
\end{aligned}$$
Since $u_1 = -u_2$ because it's a zero-sum game, we can add $R_1^T$ and $R_2^T$ and the second term will cancel out:
$$2\epsilon > \frac{1}{T} \max_{\sigma_1^* \in \Sigma_1} \sum_{t=1}^T u_1(\sigma_1^*, \sigma_{-1}^t) + \frac{1}{T} \max_{\sigma_2^* \in \Sigma_2} \sum_{t=1}^T u_2(\sigma_2^*, \sigma_{-2}^t)$$
The average of utilities over a set of strategies is equal to the utility of the average strategy.
$$\frac{1}{T} \sum_{t=1}^T u_i(\sigma^t) = u_i(\bar{\sigma}^T)$$
Therefore,
$$2\epsilon > \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T) + \max_{\sigma_2^* \in \Sigma_2} u_2(\sigma_2^*, \bar{\sigma}_{-2}^T)$$
From the definition of $\max$,
$$\max_{\sigma_2^* \in \Sigma_2} u_2(\sigma_2^*, \bar{\sigma}_{-2}^T) \ge u_2(\bar{\sigma}^T) = -u_1(\bar{\sigma}^T)$$
Then,
$$\begin{aligned}
2\epsilon &> \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T) - u_1(\bar{\sigma}^T) \\
u_1(\bar{\sigma}^T) + 2\epsilon &> \max_{\sigma_1^* \in \Sigma_1} u_1(\sigma_1^*, \bar{\sigma}_{-1}^T)
\end{aligned}$$
This is a $2\epsilon$-Nash equilibrium. The proof extends similarly to games with more than two players.
So we need to minimize $R_i^T$ to get close to a Nash equilibrium.
The counterfactual value $v_i(\sigma, I)$ is the expected utility for player $i$ if player $i$ tried to reach $I$ (i.e. took the actions leading to $I$ with a probability of $1$):
$$v_i(\sigma, I) = \sum_{z \in Z_I} \pi_{-i}^{\sigma}(z[I]) \, \pi^{\sigma}(z[I], z) \, u_i(z)$$
where $Z_I$ is the set of terminal histories reachable from $I$, and $z[I]$ is the prefix of $z$ up to $I$. $\pi^\sigma(z[I], z)$ is the probability of reaching $z$ from $z[I]$.
The immediate counterfactual regret is
$$R_{i,\mathrm{imm}}^T(I) = \max_{a \in A(I)} R_{i,\mathrm{imm}}^T(I, a)$$
where
$$R_{i,\mathrm{imm}}^T(I, a) = \frac{1}{T} \sum_{t=1}^T \Big( v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \Big)$$
and $\sigma|_{I \to a}$ is the strategy profile $\sigma$ with the modification of always taking action $a$ at information set $I$.
The paper proves that (Theorem 3),
$$R_i^T \le \sum_{I \in \mathcal{I}} R_{i,\mathrm{imm}}^{T,+}(I) \quad \text{where} \quad R_{i,\mathrm{imm}}^{T,+}(I) = \max \big( R_{i,\mathrm{imm}}^T(I), 0 \big)$$
The strategy is calculated using regret matching.
The regret for each information set and action pair $R_i^T(I, a)$ is maintained,
$$\begin{aligned}
r_i^t(I, a) &= v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \\
R_i^T(I, a) &= \frac{1}{T} \sum_{t=1}^T r_i^t(I, a)
\end{aligned}$$
and the strategy is calculated with regret matching,
$$\sigma_i^{T+1}(I)(a) =
\begin{cases}
\dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\[1ex]
\dfrac{1}{|A(I)|}, & \text{otherwise}
\end{cases}$$
where $R_i^{T,+}(I, a) = \max \big( R_i^T(I, a), 0 \big)$.
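For instance, with made-up regrets $R_i^T(I, \text{bet}) = 3$ and $R_i^T(I, \text{pass}) = -1$, we get $R_i^{T,+}(I, \text{bet}) = 3$ and $R_i^{T,+}(I, \text{pass}) = 0$, so

$$\sigma_i^{T+1}(I)(\text{bet}) = \frac{3}{3 + 0} = 1, \qquad \sigma_i^{T+1}(I)(\text{pass}) = 0$$

If all the regrets were non-positive, the strategy would fall back to uniform, $\frac{1}{2}$ for each action.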
The paper Regret Minimization in Games with Incomplete Information proves that if the strategy is selected according to the equation above, $R_i^T$ shrinks on the order of $\frac{1}{\sqrt{T}}$, and therefore we reach an $\epsilon$-Nash equilibrium.
Computing $r_i^t(I, a)$ requires expanding the full game tree on each iteration.
The paper Monte Carlo Sampling for Regret Minimization in Extensive Games shows we can sample from the game tree and estimate the regrets.
$\mathcal{Q} = \{Q_1, \ldots, Q_r\}$ is a set of subsets of $Z$ ($Q_j \subseteq Z$), where we look at only a single block $Q_j$ in an iteration. The union of all subsets spans $Z$ ($Q_1 \cup \ldots \cup Q_r = Z$). $q_j$ is the probability of picking block $Q_j$.
$q(z)$ is the probability of picking $z$ in the current iteration; i.e. $q(z) = \sum_{j: z \in Q_j} q_j$, the sum of $q_j$ over all blocks $Q_j$ that contain $z$.
Then we get the sampled counterfactual value for block $j$,
$$\tilde{v}_i(\sigma, I | j) = \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)} \, \pi_{-i}^{\sigma}(z[I]) \, \pi^{\sigma}(z[I], z) \, u_i(z)$$
The paper shows that
$$\mathbb{E}_{j \sim q_j} \big[ \tilde{v}_i(\sigma, I | j) \big] = v_i(\sigma, I)$$
with a simple proof.
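Roughly, the argument (as sketched here, not quoted from the paper) is a change in the order of summation, using $\sum_{j: z \in Q_j} q_j = q(z)$:

$$\mathbb{E}_{j \sim q_j} \big[ \tilde{v}_i(\sigma, I | j) \big]
= \sum_j q_j \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)} \, \pi_{-i}^{\sigma}(z[I]) \, \pi^{\sigma}(z[I], z) \, u_i(z)
= \sum_{z \in Z_I} \frac{\sum_{j: z \in Q_j} q_j}{q(z)} \, \pi_{-i}^{\sigma}(z[I]) \, \pi^{\sigma}(z[I], z) \, u_i(z)
= v_i(\sigma, I)$$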
Therefore we can sample a part of the game tree and calculate the regrets. We calculate an estimate of the regrets,
$$\tilde{r}_i^t(I, a) = \tilde{v}_i(\sigma^t|_{I \to a}, I) - \tilde{v}_i(\sigma^t, I)$$
and use that to update $R_i^T(I, a)$ and calculate the strategy $\sigma_i^{T+1}(I)(a)$ on each iteration. Finally, we calculate the overall average strategy $\bar{\sigma}_i^T(I)(a)$.
Here is a Kuhn Poker implementation to try CFR on.
Let's dive into the code!
```python
from typing import NewType, Dict, List, Callable, cast

from labml import monit, tracker, logger, experiment
from labml.configs import BaseConfigs, option
```
A player $i \in N$, where $N$ is the set of players.
```python
Player = NewType('Player', int)
```
Action $a$, $A(h) = \{a: (h, a) \in H\}$ where $h \in H$ is a non-terminal history.
```python
Action = NewType('Action', str)
```
History $h \in H$ is a sequence of actions including chance events, and $H$ is the set of all histories.
This class should be extended with game-specific logic.
```python
class History:
```
Whether it's a terminal history; i.e. game over. $h \in Z$
```python
    def is_terminal(self):
        raise NotImplementedError()
```
Utility of player $i$ for a terminal history, $u_i(h)$ where $h \in Z$.
```python
    def terminal_utility(self, i: Player) -> float:
        raise NotImplementedError()
```
Get the current player, denoted by $P(h)$, where $P$ is known as the player function.
If $P(h) = c$, the current event is a chance event $c$, like dealing cards or opening community cards in poker.
```python
    def player(self) -> Player:
        raise NotImplementedError()
```
Whether the next step is a chance step; something like dealing a new card. $P(h) = c$
```python
    def is_chance(self) -> bool:
        raise NotImplementedError()
```
Sample a chance event when $P(h) = c$.
```python
    def sample_chance(self) -> Action:
        raise NotImplementedError()
```
Add an action to the history.
```python
    def __add__(self, action: Action):
        raise NotImplementedError()
```
Get the information set key for the current player.
```python
    def info_set_key(self) -> str:
        raise NotImplementedError
```
Create a new information set for the current player
```python
    def new_info_set(self) -> 'InfoSet':
        raise NotImplementedError()
```
Human readable representation
```python
    def __repr__(self):
        raise NotImplementedError()
```
```python
class InfoSet:
```
Unique key identifying the information set
```python
    key: str
```
$\sigma_i$, the strategy of player $i$.
```python
    strategy: Dict[Action, float]
```
Total regret of not taking each action $a \in A(I_i)$,
$$\begin{aligned}
r_i^t(I, a) &= v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \\
R_i^T(I, a) &= \frac{1}{T} \sum_{t=1}^T r_i^t(I, a)
\end{aligned}$$
We maintain $T R_i^T(I, a)$ instead of $R_i^T(I, a)$ since the $\frac{1}{T}$ term cancels out anyway when computing the strategy $\sigma_i^{T+1}(I)(a)$.
```python
    regret: Dict[Action, float]
```
We maintain the cumulative strategy,
$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)$$
to compute the overall average strategy
$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
```python
    cumulative_strategy: Dict[Action, float]
```
Initialize
```python
    def __init__(self, key: str):
        self.key = key
        self.regret = {a: 0 for a in self.actions()}
        self.cumulative_strategy = {a: 0 for a in self.actions()}
        self.calculate_strategy()
```
Actions $A(I_i)$
```python
    def actions(self) -> List[Action]:
        raise NotImplementedError()
```
Load information set from a saved dictionary
```python
    @staticmethod
    def from_dict(data: Dict[str, any]) -> 'InfoSet':
        raise NotImplementedError()
```
Save the information set to a dictionary
```python
    def to_dict(self):
        return {
            'key': self.key,
            'regret': self.regret,
            'average_strategy': self.cumulative_strategy,
        }
```
Load data from a saved dictionary
```python
    def load_dict(self, data: Dict[str, any]):
        self.regret = data['regret']
        self.cumulative_strategy = data['average_strategy']
        self.calculate_strategy()
```
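As a hedged illustration (these helper functions and the file name are made up, not part of the library), the `to_dict` / `from_dict` / `load_dict` hooks could be used to checkpoint and restore the learned information sets:

```python
import json
from typing import Dict


def save_info_sets(info_sets: Dict[str, 'InfoSet'], path: str = 'info_sets.json'):
    # Serialize every information set through its `to_dict` hook
    with open(path, 'w') as f:
        json.dump({key: info_set.to_dict() for key, info_set in info_sets.items()}, f)


def load_info_sets(info_set_class, path: str = 'info_sets.json') -> Dict[str, 'InfoSet']:
    # Rebuild the information sets using a game-specific subclass that implements `from_dict`
    with open(path) as f:
        return {key: info_set_class.from_dict(data) for key, data in json.load(f).items()}
```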
Calculate current strategy using regret matching.
$$\sigma_i^{T+1}(I)(a) =
\begin{cases}
\dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0 \\[1ex]
\dfrac{1}{|A(I)|}, & \text{otherwise}
\end{cases}$$
where $R_i^{T,+}(I, a) = \max \big( R_i^T(I, a), 0 \big)$.
```python
    def calculate_strategy(self):
```
$R_i^{T,+}(I, a) = \max \big( R_i^T(I, a), 0 \big)$
```python
        regret = {a: max(r, 0) for a, r in self.regret.items()}
```
$\sum_{a' \in A(I)} R_i^{T,+}(I, a')$
```python
        regret_sum = sum(regret.values())
```
If $\sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0$,
```python
        if regret_sum > 0:
```
$$\sigma_i^{T+1}(I)(a) = \frac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}$$
```python
            self.strategy = {a: r / regret_sum for a, r in regret.items()}
```
Otherwise,
```python
        else:
```
$|A(I)|$
```python
            count = len(list(a for a in self.regret))
```
$$\sigma_i^{T+1}(I)(a) = \frac{1}{|A(I)|}$$
```python
            self.strategy = {a: 1 / count for a, r in regret.items()}
```
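A quick standalone check of the same computation on made-up regrets (the same numbers as the worked example earlier; this snippet is illustration only, not part of the class):

```python
regrets = {'bet': 3.0, 'pass': -1.0}
positive = {a: max(r, 0) for a, r in regrets.items()}   # clip negative regrets to zero
total = sum(positive.values())
strategy = ({a: r / total for a, r in positive.items()} if total > 0
            else {a: 1 / len(positive) for a in positive})
assert strategy == {'bet': 1.0, 'pass': 0.0}
```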
$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
```python
    def get_average_strategy(self):
```
$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)$
```python
        cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}
```
$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) = \sum_{a \in A(I)} \sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)$$
```python
        strategy_sum = sum(cum_strategy.values())
```
If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$,
```python
        if strategy_sum > 0:
```
$$\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a)}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
```python
            return {a: s / strategy_sum for a, s in cum_strategy.items()}
```
Otherwise,
```python
        else:
```
$|A(I)|$
```python
            count = len(list(a for a in cum_strategy))
```
$$\bar{\sigma}_i^T(I)(a) = \frac{1}{|A(I)|}$$
```python
            return {a: 1 / count for a, r in cum_strategy.items()}
```
Human readable representation
```python
    def __repr__(self):
        raise NotImplementedError()
```
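To make the interface concrete, here is a hedged sketch of how a trivially simple game could subclass `History` and `InfoSet`. This toy game (chance flips a hidden coin, player 0 guesses it, player 1 never acts) is entirely made up for illustration and not part of the library; the real, non-trivial example is the Kuhn Poker implementation mentioned earlier. It assumes the base classes on this page are importable from `labml_nn.cfr`.

```python
# A hypothetical, minimal pair of subclasses just to show the interface shape.
import random
from typing import List, cast

from labml_nn.cfr import Action, History, InfoSet, Player


class GuessInfoSet(InfoSet):
    # Player 0 always chooses between guessing heads or tails
    def actions(self) -> List[Action]:
        return [cast(Action, 'H'), cast(Action, 'T')]

    def __repr__(self):
        return f'{self.key}: {self.strategy}'


class GuessHistory(History):
    def __init__(self, history: str = ''):
        # The history is a string: first character is the coin, second is the guess
        self.history = history

    def is_terminal(self):
        # Game over once the coin has been flipped and the guess has been made
        return len(self.history) == 2

    def terminal_utility(self, i: Player) -> float:
        # +1 to player 0 for a correct guess, -1 otherwise; zero-sum with player 1
        u0 = 1. if self.history[0] == self.history[1] else -1.
        return u0 if i == 0 else -u0

    def is_chance(self) -> bool:
        # The very first event (the coin flip) is the only chance event
        return len(self.history) == 0

    def sample_chance(self) -> Action:
        return cast(Action, random.choice(['H', 'T']))

    def player(self) -> Player:
        # Player 0 is the only player who makes a decision
        return cast(Player, 0)

    def __add__(self, action: Action):
        return GuessHistory(self.history + action)

    def info_set_key(self) -> str:
        # Player 0 cannot see the coin, so all of her decision points look identical
        return 'guess'

    def new_info_set(self) -> InfoSet:
        return GuessInfoSet(self.info_set_key())

    def __repr__(self):
        return self.history
```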
We do chance sampling (CS), where all the chance events (nodes) are sampled and all other events (nodes) are explored.
Since we are doing chance sampling, the term $q(z)$ is the same for all the terminal histories reached in an iteration, so we can ignore it: it cancels out when calculating the strategy (it is common to the numerator and the denominator).
```python
class CFR:
```
$\mathcal{I}$, the set of all information sets.
```python
    info_sets: Dict[str, InfoSet]
```
- `create_new_history` creates a new empty history
- `epochs` is the number of iterations to train on, $T$
- `n_players` is the number of players

```python
    def __init__(self, *,
                 create_new_history: Callable[[], History],
                 epochs: int,
                 n_players: int = 2):
        self.n_players = n_players
        self.epochs = epochs
        self.create_new_history = create_new_history
```
A dictionary for $\mathcal{I}$, the set of all information sets.
```python
        self.info_sets = {}
```
Tracker for analytics
```python
        self.tracker = InfoSetTracker()
```
Returns the information set $I$ of the current player for a given history $h$.
```python
    def _get_info_set(self, h: History):
        info_set_key = h.info_set_key()
        if info_set_key not in self.info_sets:
            self.info_sets[info_set_key] = h.new_info_set()
        return self.info_sets[info_set_key]
```
This function walks the game tree.
- `h` is the current history $h$
- `i` is the player $i$ that we are computing regrets of
- `pi_i` is $\pi_i^{\sigma^t}(h)$
- `pi_neg_i` is $\pi_{-i}^{\sigma^t}(h)$

It returns the expected utility for the history $h$,
$$\sum_{z \in Z_h} \pi^\sigma(h, z) \, u_i(z)$$
where $Z_h$ is the set of terminal histories with prefix $h$.
While walking the tree, it updates the total regrets $R_i^T(I, a)$.
```python
    def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:
```
If it's a terminal history $h \in Z$, return the terminal utility $u_i(h)$.
```python
        if h.is_terminal():
            return h.terminal_utility(i)
```
If it's a chance event $P(h) = c$, sample a chance action and continue to the next step.
```python
        elif h.is_chance():
            a = h.sample_chance()
            return self.walk_tree(h + a, i, pi_i, pi_neg_i)
```
Get current player's information set for h
```python
        I = self._get_info_set(h)
```
To store $\sum_{z \in Z_h} \pi^\sigma(h, z) \, u_i(z)$
```python
        v = 0
```
To store $\sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) \, u_i(z)$ for each action $a \in A(h)$
```python
        va = {}
```
Iterate through all actions
```python
        for a in I.actions():
```
If the current player is i,
```python
            if i == h.player():
```
$$\begin{aligned}
\pi_i^{\sigma^t}(h + a) &= \pi_i^{\sigma^t}(h) \, \sigma_i^t(I)(a) \\
\pi_{-i}^{\sigma^t}(h + a) &= \pi_{-i}^{\sigma^t}(h)
\end{aligned}$$
```python
                va[a] = self.walk_tree(h + a, i, pi_i * I.strategy[a], pi_neg_i)
```
Otherwise,
```python
            else:
```
$$\begin{aligned}
\pi_i^{\sigma^t}(h + a) &= \pi_i^{\sigma^t}(h) \\
\pi_{-i}^{\sigma^t}(h + a) &= \pi_{-i}^{\sigma^t}(h) \, \sigma^t(I)(a)
\end{aligned}$$
```python
                va[a] = self.walk_tree(h + a, i, pi_i, pi_neg_i * I.strategy[a])
```
$$\sum_{z \in Z_h} \pi^\sigma(h, z) \, u_i(z) = \sum_{a \in A(I)} \Big[ \sigma^t(I)(a) \sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) \, u_i(z) \Big]$$
```python
            v = v + I.strategy[a] * va[a]
```
If the current player is i, update the cumulative strategies and total regrets
```python
        if h.player() == i:
```
Update cumulative strategies,
$$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \, \sigma^t(I)(a) = \sum_{t=1}^T \Big[ \sum_{h \in I} \pi_i^{\sigma^t}(h) \, \sigma^t(I)(a) \Big]$$
```python
            for a in I.actions():
                I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
```
$$\begin{aligned}
r_i^t(I, a) &= v_i(\sigma^t|_{I \to a}, I) - v_i(\sigma^t, I) \\
&= \pi_{-i}^{\sigma^t}(h) \Big( \sum_{z \in Z_h} \pi^{\sigma^t|_{I \to a}}(h, z) \, u_i(z) - \sum_{z \in Z_h} \pi^{\sigma^t}(h, z) \, u_i(z) \Big) \\
T R_i^T(I, a) &= \sum_{t=1}^T r_i^t(I, a)
\end{aligned}$$
```python
            for a in I.actions():
                I.regret[a] += pi_neg_i * (va[a] - v)
```
Update the strategy $\sigma^t(I)(a)$.
```python
            I.calculate_strategy()
```
Return the expected utility for player $i$,
$$\sum_{z \in Z_h} \pi^\sigma(h, z) \, u_i(z)$$
```python
        return v
```
This updates the strategies for T iterations.
```python
    def iterate(self):
```
Loop for `epochs` iterations.
```python
        for t in monit.iterate('Train', self.epochs):
```
Walk tree and update regrets for each player
```python
            for i in range(self.n_players):
                self.walk_tree(self.create_new_history(), cast(Player, i), 1, 1)
```
Track data for analytics
```python
            tracker.add_global_step()
            self.tracker(self.info_sets)
            tracker.save()
```
Print the information sets
```python
        logger.inspect(self.info_sets)
```
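A hedged usage sketch (not from the library): with a concrete `History` subclass such as the toy `GuessHistory` above, or the Kuhn Poker one, training is just constructing `CFR` and calling `iterate`. The epoch count here is arbitrary, and this would normally run inside a labml experiment (see the configs below).

```python
# Hypothetical usage of the CFR class with the toy game sketched earlier
cfr = CFR(create_new_history=GuessHistory,  # the class itself works as a zero-argument factory
          epochs=1_000)
cfr.iterate()

# Inspect the learned average strategies
for key, info_set in cfr.info_sets.items():
    print(key, info_set.get_average_strategy())
```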
This is a small helper class to track data from information sets
```python
class InfoSetTracker:
```
Set tracking indicators
```python
    def __init__(self):
        tracker.set_histogram(f'strategy.*')
        tracker.set_histogram(f'average_strategy.*')
        tracker.set_histogram(f'regret.*')
```
Track the data from all information sets
```python
    def __call__(self, info_sets: Dict[str, InfoSet]):
        for I in info_sets.values():
            avg_strategy = I.get_average_strategy()
            for a in I.actions():
                tracker.add({
                    f'strategy.{I.key}.{a}': I.strategy[a],
                    f'average_strategy.{I.key}.{a}': avg_strategy[a],
                    f'regret.{I.key}.{a}': I.regret[a],
                })
```
```python
class CFRConfigs(BaseConfigs):
    create_new_history: Callable[[], History]
    epochs: int = 1_00_000
    cfr: CFR = 'simple_cfr'
```
Initialize CFR algorithm
```python
@option(CFRConfigs.cfr)
def simple_cfr(c: CFRConfigs):
    return CFR(create_new_history=c.create_new_history,
               epochs=c.epochs)
```
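Finally, a hedged sketch of how a game-specific experiment might wire `CFRConfigs` into labml's experiment API. The config class and the `@option` below are made up (they reuse the toy `GuessHistory` from the earlier sketch); the actual experiment that goes with this code is the Kuhn Poker one.

```python
from labml import experiment
from labml.configs import option


class GuessGameConfigs(CFRConfigs):
    # Hypothetical game-specific configurations
    pass


@option(GuessGameConfigs.create_new_history)
def guess_history(c: GuessGameConfigs):
    # Use the toy GuessHistory class (sketched earlier) as the history factory
    return GuessHistory


def main():
    # Register an experiment run, calculate the configs, and train
    experiment.create(name='guess_game_cfr')
    conf = GuessGameConfigs()
    experiment.configs(conf)
    with experiment.start():
        conf.cfr.iterate()


if __name__ == '__main__':
    main()
```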