Okay, then I am really confused. I just implemented it that way and it is not nearly as effective or stable. I must be doing something wrong.
Here is my full Python repo:
https://github.com/tansey/pycfr
If you go to the main directory and run
Code:
python tests/test_cfr.py
, it should run the canonical implementation first on half-street Kuhn (HSK) and then on Leduc. HSK works fine, since it's such a simple problem. However, when I run it on Leduc, it converges slowly and then actually starts to bounce back up once it gets to around a 0-exploitability payoff for player 2 (when we know the value of Leduc is around -0.08 for player 1).
If you change the object created in test_cfr.py from type
Code:
ProperCounterfactualRegretMinimizer
to just
Code:
CounterfactualRegretMinimizer
(you can keep all other code exactly the same), it uses the approach I originally implemented, based on the proportion of cumulative positive regret. It converges much faster and doesn't bounce around at all.
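To be concrete about what I mean by "the proportion of cumulative positive regret", here is a minimal standalone sketch. It is not the actual code from the repo, and the function name `regret_matching` is made up for illustration; it also assumes a uniform fallback when no action has positive regret.

```python
def regret_matching(cumulative_regret):
    """Return a strategy proportional to each action's positive cumulative regret."""
    # Negative regrets are clipped to zero before normalizing.
    positive = [max(r, 0.0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # Assumption: with no positive regret, play uniformly at random.
    n = len(cumulative_regret)
    return [1.0 / n] * n


print(regret_matching([2.0, -1.0, 6.0]))  # [0.25, 0.0, 0.75]
```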
Did I do something wrong here? If anybody's implemented vanilla CFR for Leduc, I'd be interested in seeing the exploitability of their agents every 10 iterations for the first 2000 iterations, since that's about the timeframe I'm looking at right now.
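In case it helps with comparing numbers, this is roughly the logging I have in mind. It is only a sketch: `run_iteration` and `exploitability` are hypothetical callables standing in for whatever your own CFR implementation exposes, and the dummy versions below exist just to make the example run.

```python
def log_exploitability(run_iteration, exploitability, iterations=2000, every=10):
    """Run CFR iterations and record (iteration, exploitability) every `every` steps."""
    rows = []
    for t in range(1, iterations + 1):
        run_iteration()  # hypothetical: one full CFR iteration
        if t % every == 0:
            rows.append((t, exploitability()))  # hypothetical: current exploitability
    return rows


# Dummy stand-ins (not real CFR) so the sketch is runnable on its own.
state = {"t": 0}

def dummy_step():
    state["t"] += 1

def dummy_exploitability():
    return 1.0 / state["t"]

rows = log_exploitability(dummy_step, dummy_exploitability)
print(len(rows))  # 200
```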