Poker-AI.org

Poker AI and Botting Discussion Forum
PostPosted: Tue Apr 23, 2013 11:24 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
I just read the part of Michael Johanson's MS thesis where he gives a detailed example of CFR, and now I am doubting my baseline understanding of the algorithm.

The way I have currently implemented vanilla CFRM is to track the cumulative positive CFR for every information set, then at each iteration update the strategy to be proportional to that cumulative sum. By contrast, from the thesis description, it seems like what I should do is track the total CFR (positive or negative) and then update the strategy to be proportional to only the positive part. However, when I implement it that way, it does not converge at all.
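
To make the two variants concrete, here is a minimal sketch of what I mean (illustrative names, not the actual code from my repo):
Code:
def strategy_from_regret(cum_regret):
    """Regret matching: play proportionally to the positive part of regret."""
    pos = [max(r, 0.0) for r in cum_regret]
    total = sum(pos)
    if total > 0:
        return [p / total for p in pos]
    return [1.0 / len(cum_regret)] * len(cum_regret)  # uniform fallback

# Variant A (what I currently do): clip before accumulating.
#   cum_regret[a] += max(immediate_regret[a], 0.0)
# Variant B (my reading of the thesis): accumulate the signed regret and
# clip only inside strategy_from_regret.
#   cum_regret[a] += immediate_regret[a]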

There is some mention in the thesis about the "average strategy" converging. Is the idea that you do something like fictitious play, where you need to average together all the policies up to time T in order to get the true optimal strategy? In that case, you really need to keep both your current strategy and your average strategy, which of course you can update incrementally. That's the only way I can see this approach working.

Can someone give me some clarity here?


PostPosted: Wed Apr 24, 2013 12:36 am 
Senior Member

Joined: Mon Mar 11, 2013 10:24 pm
Posts: 216
The current strategy can be derived from the regret values, but you store the cumulative strategy in each node too - that is where the end result comes from.


PostPosted: Wed Apr 24, 2013 6:05 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
Okay, then I am really confused. I just implemented it that way and it is not nearly as effective or stable. I must be doing something wrong.

Here is my full python repo: https://github.com/tansey/pycfr

If you go to the main directory and run
Code:
python tests/test_cfr.py
it should run the canonical implementation on first half-street Kuhn and then Leduc. HSK works fine, since it's such a simple problem. However, when I run on Leduc, it converges slowly and then actually starts to bounce back up once it gets around a 0-exploitability payoff for player 2 (when we know the value of Leduc is around -0.08 for player 1).

If you change the object created in test_cfr.py from type
Code:
ProperCounterfactualRegretMinimizer
to just
Code:
CounterfactualRegretMinimizer
(you can keep all other code exactly the same), it uses the approach I originally implemented, based on the proportion of cumulative positive regret. It converges much faster and doesn't bounce around at all.

Did I do something wrong here? If anybody has implemented vanilla CFR for Leduc, I'd be interested in seeing a log of their agents' exploitability every 10 iterations for the first 2000 iterations, since that's about the timeframe I'm looking at right now.


PostPosted: Wed Apr 24, 2013 6:43 pm 
Veteran Member

Joined: Wed Mar 20, 2013 1:43 am
Posts: 267
The strategy for the next iteration is computed from the regret values. You have to keep track of both positive and negative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often; hands with mixed strategies oscillate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesn't change much, especially after lots of iterations. Intuitively this should make sense: if you have a million iterations, another iteration can't change the average that much.
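
As a minimal sketch of that bookkeeping (illustrative names; how the strategy sum should be weighted is discussed further down the thread):
Code:
class InfosetNode:
    def __init__(self, n_actions):
        self.cum_regret = [0.0] * n_actions    # signed, may go negative
        self.cum_strategy = [0.0] * n_actions  # running sum of strategies

    def current_strategy(self):
        # Regret matching: only positive regret enters the computation.
        pos = [max(r, 0.0) for r in self.cum_regret]
        total = sum(pos)
        if total > 0:
            return [p / total for p in pos]
        return [1.0 / len(pos)] * len(pos)  # uniform when no positive regret

    def average_strategy(self):
        # This, not the current strategy, converges to equilibrium.
        total = sum(self.cum_strategy)
        if total > 0:
            return [c / total for c in self.cum_strategy]
        return [1.0 / len(self.cum_strategy)] * len(self.cum_strategy)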


PostPosted: Wed Apr 24, 2013 6:46 pm 
Site Admin

Joined: Sun Feb 24, 2013 9:39 pm
Posts: 642
Amax posted some nice code on the old forum at http://poker-ai.org/archive/www.pokerai ... 335#p40335


PostPosted: Wed Apr 24, 2013 7:00 pm 
Junior Member

Joined: Mon Apr 22, 2013 11:46 am
Posts: 34
HontoNiBaka wrote:
The strategy for the next iteration is computed from the regret values. You have to keep track of both positive and negative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often; hands with mixed strategies oscillate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesn't change much, especially after lots of iterations. Intuitively this should make sense: if you have a million iterations, another iteration can't change the average that much.


This. But you normalize with the accumulated sum of reach probabilities. If you have strategy 0/0/1 in one iteration where you had a 2% probability of reaching the information set, and strategy 0.5/0.5/0 in another iteration where you had a 4% probability, the average strategy is 0.333/0.333/0.333, for example.
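
A quick check of that arithmetic in Python:
Code:
w1, s1 = 0.02, [0.0, 0.0, 1.0]  # 2% reach, strategy 0/0/1
w2, s2 = 0.04, [0.5, 0.5, 0.0]  # 4% reach, strategy 0.5/0.5/0
avg = [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(s1, s2)]
print(avg)  # [0.333..., 0.333..., 0.333...]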

Also, hijacking for a follow-up question which I can't seem to prove:

does the average EV of the CFRM iterations converge to the EV of the average strategies played against each other?


PostPosted: Wed Apr 24, 2013 7:05 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
HontoNiBaka wrote:
The strategy for the next iteration is computed from the regret values. You have to keep track of both positive and negative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often; hands with mixed strategies oscillate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesn't change much, especially after lots of iterations. Intuitively this should make sense: if you have a million iterations, another iteration can't change the average that much.


Right, I am pretty sure that's what I'm doing. I track counterfactual regret and maintain a "current strategy profile" and an "average strategy profile", where the latter is the average of the current profiles over all previous iterations.

I agree that it shouldn't change much after millions of iterations, but right now I am working on the order of thousands and printing out the exploitability every 10 iterations. At that scale the results can still change quite a bit.


PostPosted: Wed Apr 24, 2013 7:08 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
spears wrote:
Amax posted some nice code on the old forum at http://poker-ai.org/archive/www.pokerai ... 335#p40335


Thanks Spears. I've seen and downloaded that code before. It only covers the various MC-CFR variants though, not vanilla CFR, right? I suppose I could intuit most of the algorithm by looking closely at external sampling, but it would still not be the exact algorithm.


PostPosted: Wed Apr 24, 2013 7:10 pm 
Veteran Member

Joined: Wed Mar 20, 2013 1:43 am
Posts: 267
Public chance sampling is effectively vanilla CFR for Kuhn and the river solver.


PostPosted: Wed Apr 24, 2013 7:12 pm 
Junior Member

Joined: Mon Apr 22, 2013 11:46 am
Posts: 34
Glanced at your code and found this:

Code:
1.0 / (self.iterations + 1) * (self.iterations * prev_cfr + immediate_cfr)

This is usually a very bad idea due to rounding errors. Just add them all up and divide by the number of iterations when you need the value.
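
Something like this instead, as a sketch (placeholder values):
Code:
# Error-prone: rescaling the running average every iteration compounds
# floating-point error:
#   avg = 1.0 / (t + 1) * (t * avg + value)

# Better: keep the raw sum and divide only when the value is needed.
values = [0.3, -0.1, 0.7]  # placeholder per-iteration regret values
total = 0.0
for value in values:
    total += value
mean = total / len(values)
print(mean)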


PostPosted: Wed Apr 24, 2013 7:13 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
Quote:
you normalize with the accumulated sum of reach probabilities. If you have strategy 0/0/1 in one iteration where you had a 2% probability of reaching the information set, and strategy 0.5/0.5/0 in another iteration where you had a 4% probability, the average strategy is 0.333/0.333/0.333, for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?


PostPosted: Wed Apr 24, 2013 7:15 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
HontoNiBaka wrote:
Public chance sampling is effectively vanilla CFR for Kuhn and the river solver.


Right. Both my original implementation and what I think is the proper CFRM implementation work well on Kuhn. It's on Leduc that I'm seeing big differences in convergence rates.


PostPosted: Wed Apr 24, 2013 7:18 pm 
Junior Member

Joined: Mon Apr 22, 2013 11:46 am
Posts: 34
longshot wrote:
Quote:
you normalize with the accumulated sum of reach probabilities. If you have strategy 0/0/1 in one iteration where you had a 2% probability of reaching the information set, and strategy 0.5/0.5/0 in another iteration where you had a 4% probability, the average strategy is 0.333/0.333/0.333, for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?


No. There you weight by the opponent's probability of reaching the information set.
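
In other words, a sketch of one vanilla CFR node update (illustrative names; p_opp is the opponent's reach, which should also fold in chance's reach):
Code:
def update_node(node, p_self, p_opp, strategy, action_evs, strategy_ev):
    """One CFR update at an infoset, from the updating player's view."""
    for a in range(len(strategy)):
        # Regret update: weighted by the OPPONENT's (counterfactual) reach.
        node.cum_regret[a] += p_opp * (action_evs[a] - strategy_ev)
        # Average-strategy update: weighted by your OWN reach.
        node.cum_strategy[a] += p_self * strategy[a]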


PostPosted: Wed Apr 24, 2013 7:31 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
somehomelessguy wrote:
longshot wrote:
Quote:
you normalize with the accumulated sum of reach probabilities. If you have strategy 0/0/1 in one iteration where you had a 2% probability of reaching the information set, and strategy 0.5/0.5/0 in another iteration where you had a 4% probability, the average strategy is 0.333/0.333/0.333, for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?


No. There you weight by the opponent's probability of reaching the information set.


So you're saying you need to store the cumulative reach probabilities at each information set? Then the update equation is:

Code:
total_reach = prev_reach + cur_reach
strategy = 1.0 / total_reach * (prev_reach * prev_strategy + cur_reach * cur_strategy)


That doesn't seem to appear anywhere in the original NIPS paper.


PostPosted: Wed Apr 24, 2013 7:40 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
somehomelessguy wrote:
Glanced at your code and found this:

Code:
1.0 / (self.iterations + 1) * (self.iterations * prev_cfr + immediate_cfr)

This is usually a very bad idea due to rounding errors. Just add them all up and divide by the number of iterations when you need the value.


Strangely enough, if I change it to:

Code:
self.counterfactual_regret[root.player][infoset][i] += immediate_cfr


It seems to have a substantially detrimental effect, and I'm not sure why right now. That may be an indication that there is a bug somewhere in my code. Though looking at everything, it doesn't really seem like it should make a difference: we only care about the relative CFR, and dividing by the iteration count scales every action's regret at an infoset by the same factor, so whether I store the average or the sum shouldn't affect the result of the algorithm.
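
For instance, the 1/T scaling cancels inside regret matching:
Code:
def match(reg):
    pos = [max(r, 0.0) for r in reg]
    total = sum(pos)
    return [p / total for p in pos]

regret_sum = [3.0, -1.0, 6.0]
regret_avg = [r / 10 for r in regret_sum]  # same regrets averaged over T=10
print(match(regret_sum))  # [0.333..., 0.0, 0.666...]
print(match(regret_avg))  # identical, so only rounding should differ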


Last edited by longshot on Wed Apr 24, 2013 7:42 pm, edited 1 time in total.

PostPosted: Wed Apr 24, 2013 7:40 pm 
Junior Member

Joined: Mon Apr 22, 2013 11:46 am
Posts: 34
longshot wrote:
So you're saying you need to store the cumulative reach probabilities at each information set? Then the update equation is:

Code:
total_reach = prev_reach + cur_reach
strategy = 1.0 / total_reach * (prev_reach * prev_strategy + cur_reach * cur_strategy)


That doesn't seem to appear anywhere in the original NIPS paper.


Stop that. You'll get rounding errors.

Keep track of the sum of reach*prob for each action.
The average strategy is the action's sum / the sum of reaches.
Hint: you don't need to keep track of the sum of reaches separately, as you can add up the actions' reach*prob sums to get it.
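
As a sketch (illustrative names):
Code:
def accumulate_average(cum_strategy, reach, strategy):
    """Per iteration: add this player's reach * action probability."""
    for a, p in enumerate(strategy):
        cum_strategy[a] += reach * p

def average_strategy(cum_strategy):
    # The normalizer is the sum of reaches: each iteration adds
    # reach * (probabilities summing to 1), so the entries already sum to it.
    total = sum(cum_strategy)
    if total > 0:
        return [c / total for c in cum_strategy]
    return [1.0 / len(cum_strategy)] * len(cum_strategy)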


PostPosted: Wed Apr 24, 2013 8:44 pm 
Junior Member

Joined: Thu Apr 11, 2013 10:13 pm
Posts: 22
somehomelessguy wrote:
Stop that. You'll get rounding errors.

Keep track of the sum of reach*prob for each action.
The average strategy is the action's sum / the sum of reaches.
Hint: you don't need to keep track of the sum of reaches separately, as you can add up the actions' reach*prob sums to get it.


Ah, I found the average strategy formula in the NIPS 2009 paper and now what you're saying makes much more sense! Initial results look like it's fixed.

Thanks SHG!


P.S. I still haven't changed/fixed all the potential rounding errors. Right now I'm assuming they're not a huge issue until I get really close to the NE.

