Poker-AI.org
http://poker-ai.org/phpbb/

Simple vanilla CFRM question
http://poker-ai.org/phpbb/viewtopic.php?f=24&t=2463
Page 1 of 1

Author:  longshot [ Tue Apr 23, 2013 11:24 pm ]
Post subject:  Simple vanilla CFRM question

I just read the part of Michael Johanson's MS thesis where he gives a detailed example of CFR, and now I am doubting my baseline level of understanding for the algorithm.

The way I have currently implemented vanilla CFRM is to track the cumulative positive CFR for every information set. Then at each iteration, I update the strategy to be proportional to the cumulative sum. By contrast, from the MS thesis description, it seems like what I should do is track the total CFR (positive or negative) and then update the strategy to be proportional to the positive results. However, when I try implementing it using that strategy, it does not converge at all.

There is some mention in the thesis about the "average strategy" converging. Is the idea that you do something like fictitious play, where you need to average together all the policies at time T in order to get the true optimal strategy? In that case, you really need to keep both your current strategy and your average strategy-- which of course you can update incrementally. That's the only way I can see this approach working.

Can someone give me some clarity here?

Author:  proud2bBot [ Wed Apr 24, 2013 12:36 am ]
Post subject:  Re: Simple vanilla CFRM question

the current strategy can be derived from the regret values, but you store the cumulative strategies in each node too - this is where the end result comes from.

Author:  longshot [ Wed Apr 24, 2013 6:05 pm ]
Post subject:  Re: Simple vanilla CFRM question

Okay, then I am really confused. I just implemented it that way and it is not nearly as effective or stable. I must be doing something wrong.

Here is my full python repo: https://github.com/tansey/pycfr

If you go to the main directory and run
Code:
python tests/test_cfr.py
, it should run with the canonical implementation on first half-street kuhn and then leduc. HSK works fine, since it's such a simple problem. However, when I run on leduc, it converges slowly and then actually starts to bounce back up once it gets around a 0-exploitability payoff for player 2 (when we know they value of leduc is around -0.08 for player 1).

If you run change the object created in test_cfr.py from creating an object of type
Code:
ProperCounterfactualRegretMinimizer
to just
Code:
CounterfactualRegretMinimizer
(you can keep all other code exactly the same), it uses the approach I originally implemented, based on the proportion of cumulative positive regret. It converges much faster and doesn't bounce around at all.

Did I do something wrong here? If anybody's implemented vanilla CFR for leduc, I'd be interested in taking a look at an output of the exploitability of their agents every 10 iterations for the first 2000 iterations, since that's about the timeframe I'm looking at right now.

Author:  HontoNiBaka [ Wed Apr 24, 2013 6:43 pm ]
Post subject:  Re: Simple vanilla CFRM question

The strategy for the next iteration is computed with the regret values. You have to keep track of positive and begative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often, hands with mixed strategies oscilate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesnt change much, especially after lots of iterations. Intuitivelly this should make sense, as if you have a million of itterations, another itteration cant change the average that much.

Author:  spears [ Wed Apr 24, 2013 6:46 pm ]
Post subject:  Re: Simple vanilla CFRM question

Amax posted some nice code on the old forum at http://poker-ai.org/archive/www.pokerai ... 335#p40335

Author:  somehomelessguy [ Wed Apr 24, 2013 7:00 pm ]
Post subject:  Re: Simple vanilla CFRM question

HontoNiBaka wrote:
The strategy for the next iteration is computed with the regret values. You have to keep track of positive and begative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often, hands with mixed strategies oscilate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesnt change much, especially after lots of iterations. Intuitivelly this should make sense, as if you have a million of itterations, another itteration cant change the average that much.


this. but you normalize with the acummulated sum of reach probabilties. if you have strategy 0/0/1 one iteration where you had 2% prob to reach, and strategy 0.5/0.5/0 in another iteration where you had 4% prob to reach the information set, the average strategy is 0.333/0.333/0.333 for example.

also, hijacking for follow up question which i can't seem to prove:

does the average EV of the CFRM-iterations converge to the EV between the average strategies?

Author:  longshot [ Wed Apr 24, 2013 7:05 pm ]
Post subject:  Re: Simple vanilla CFRM question

HontoNiBaka wrote:
The strategy for the next iteration is computed with the regret values. You have to keep track of positive and begative values, but you only use the positive ones in the actual strategy computation.

This strategy will change quite often, hands with mixed strategies oscilate.

The average strategy is the sum of all strategies normalized by the number of iterations. This average strategy doesnt change much, especially after lots of iterations. Intuitivelly this should make sense, as if you have a million of itterations, another itteration cant change the average that much.


Right. I am pretty sure that's what I'm doing. I track counterfactual regret and maintain a "current strategy profile" and "average strategy profile", where the latter is the average of all the previous instances of the current profiles.

I agree that it shouldn't change much after millions of iterations, but I am working right now on the order of thousands and printing out the exploitability every 10 iterations. The results can thus change quite a bit there.

Author:  longshot [ Wed Apr 24, 2013 7:08 pm ]
Post subject:  Re: Simple vanilla CFRM question

spears wrote:
Amax posted some nice code on the old forum at http://poker-ai.org/archive/www.pokerai ... 335#p40335


Thanks Spears. I've seen and downloaded that code before. It only covers the various MC-CFR versions though, not vanilla CFR right? I suppose I could intuit most of the algorithm by looking closely at external sampling, but it would still not be the exact algorithm.

Author:  HontoNiBaka [ Wed Apr 24, 2013 7:10 pm ]
Post subject:  Re: Simple vanilla CFRM question

public chance sampling is vanilla for kuhn and the riversolver

Author:  somehomelessguy [ Wed Apr 24, 2013 7:12 pm ]
Post subject:  Re: Simple vanilla CFRM question

glanced at your code and found this:

1.0 / (self.iterations + 1) * (self.iterations * prev_cfr + immediate_cfr)

this is usually a very bad idea due to rounding errors. just add them all up and divide with the iterations when you need the value.

Author:  longshot [ Wed Apr 24, 2013 7:13 pm ]
Post subject:  Re: Simple vanilla CFRM question

Quote:
you normalize with the acummulated sum of reach probabilties. if you have strategy 0/0/1 one iteration where you had 2% prob to reach, and strategy 0.5/0.5/0 in another iteration where you had 4% prob to reach the information set, the average strategy is 0.333/0.333/0.333 for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?

Author:  longshot [ Wed Apr 24, 2013 7:15 pm ]
Post subject:  Re: Simple vanilla CFRM question

HontoNiBaka wrote:
public chance sampling is vanilla for kuhn and the riversolver


Right. Both my original implementation and what I think is the proper CFRM implementation work well on kuhn. It's leduc that I'm seeing big differences in w.r.t. convergence rates.

Author:  somehomelessguy [ Wed Apr 24, 2013 7:18 pm ]
Post subject:  Re: Simple vanilla CFRM question

longshot wrote:
Quote:
you normalize with the acummulated sum of reach probabilties. if you have strategy 0/0/1 one iteration where you had 2% prob to reach, and strategy 0.5/0.5/0 in another iteration where you had 4% prob to reach the information set, the average strategy is 0.333/0.333/0.333 for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?


no. there you weigh with the opponents probability of reaching the information set.

Author:  longshot [ Wed Apr 24, 2013 7:31 pm ]
Post subject:  Re: Simple vanilla CFRM question

somehomelessguy wrote:
longshot wrote:
Quote:
you normalize with the acummulated sum of reach probabilties. if you have strategy 0/0/1 one iteration where you had 2% prob to reach, and strategy 0.5/0.5/0 in another iteration where you had 4% prob to reach the information set, the average strategy is 0.333/0.333/0.333 for example.


Wouldn't that be taken into account by the CFR weighting? I thought CFR was basically:
Code:
reachprob * (action_ev - strategy_ev)


So wouldn't weighting the averages be double-counting things?


no. there you weigh with the opponents probability of reaching the information set.


So you're saying you need to store the cumulative reach probabilities at each information set? Then the update equation is:

Code:
total_reach = prev_reach + cur_reach
strategy = 1.0 / total_reach * (prev_reach * prev_strategy + cur_reach * cur_strategy)


That doesn't seem to appear anywhere in the original NIPS paper.

Author:  longshot [ Wed Apr 24, 2013 7:40 pm ]
Post subject:  Re: Simple vanilla CFRM question

somehomelessguy wrote:
glanced at your code and found this:

Code:
1.0 / (self.iterations + 1) * (self.iterations * prev_cfr + immediate_cfr)


this is usually a very bad idea due to rounding errors. just add them all up and divide with the iterations when you need the value.


Strangely enough, if I change it to:

Code:
self.counterfactual_regret[root.player][infoset][i] += immediate_cfr


It seems to have a substantially detrimental effect. I'm not sure why right now. That may be an indication that there is a bug somewhere in my code. Though looking at everything, it doesn't really seem like that should make a difference. We're only interested in the relative CFR, so whether I store the average or the sum, it shouldn't effect the result of the algorithm.

Author:  somehomelessguy [ Wed Apr 24, 2013 7:40 pm ]
Post subject:  Re: Simple vanilla CFRM question

longshot wrote:
So you're saying you need to store the cumulative reach probabilities at each information set? Then the update equation is:

Code:
total_reach = prev_reach + cur_reach
strategy = 1.0 / total_reach * (prev_reach * prev_strategy + cur_reach * cur_strategy)


That doesn't seem to appear anywhere in the original NIPS paper.


stop that. you'll get rounding errors.

keep track of sums of reach*prob for each action.
average strategy is the action's sum / sum of reaches
hint: you don't need to keep track of the sum of reaches as you can add up the actions reach*prob to get it.

Author:  longshot [ Wed Apr 24, 2013 8:44 pm ]
Post subject:  Re: Simple vanilla CFRM question

somehomelessguy wrote:
stop that. you'll get rounding errors.

keep track of sums of reach*prob for each action.
average strategy is the action's sum / sum of reaches
hint: you don't need to keep track of the sum of reaches as you can add up the actions reach*prob to get it.


Ah, I found the average strategy formula in the NIPS 2009 paper and now what you're saying makes much more sense! Initial results look like it's fixed.

Thanks SHG!


P.S. Still haven't changed/fixed all the potential rounding errors. Right now I'm just assuming it's not a huge issue until I get really close to the NE.

Page 1 of 1 All times are UTC
Powered by phpBB® Forum Software © phpBB Group
http://www.phpbb.com/