Suppose we use CFR to calculate the regrets for each decision and then recompute the strategy from those regrets. After billions of iterations, why doesn't the final strategy we used converge to a Nash equilibrium? It's said instead that the average of the strategies does.
The question is based on the following comment by Michael Johanson, a researcher from the University of Alberta with published papers on this:
"...It then recomputes its strategy so that it takes actions with probabilities proportional to their positive regret... It repeats this process for billions of games. So you have this long sequence of strategies that it was using on each game.
Counter-intuitively, that sequence of strategies does not necessarily converge to anything useful (although it sometimes does so in practice, now, with the new CFR+ algorithm we describe in the Science paper)."
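To make the setup concrete, here is a minimal sketch of how I understand the loop described in the quote. It is plain regret matching in self-play on rock-paper-scissors, not full CFR on a game tree, and it is not taken from the paper. The names `PAYOFF`, `regret_matching` and `self_play`, the expected-value (unsampled) updates, and the nonzero starting regret used to break symmetry are all my own assumptions for illustration. It returns both the last strategy used and the average over all iterations so the two can be compared.

```python
# Row player's payoff in rock-paper-scissors (rows/columns: rock, paper, scissors).
# The column player receives the negative of this (zero-sum). Illustrative game only.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def regret_matching(regrets):
    """Play each action with probability proportional to its positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No positive regret anywhere: fall back to the uniform strategy.
    return [1.0 / len(regrets)] * len(regrets)


def self_play(iterations=20000):
    """Return (last strategies, average strategies) for both players."""
    # Assumption: a small initial regret for the row player, purely to break
    # symmetry. With all-zero regrets both players would sit at uniform forever.
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    current = [regret_matching(r) for r in regrets]
    for _ in range(iterations):
        current = [regret_matching(r) for r in regrets]
        row, col = current
        # Expected value of each pure action against the opponent's current strategy.
        values = [
            [sum(PAYOFF[a][b] * col[b] for b in range(3)) for a in range(3)],
            [-sum(PAYOFF[a][b] * row[a] for a in range(3)) for b in range(3)],
        ]
        for p in range(2):
            expected = sum(current[p][a] * values[p][a] for a in range(3))
            for a in range(3):
                # Regret: how much better action a would have done than the mix played.
                regrets[p][a] += values[p][a] - expected
                strategy_sums[p][a] += current[p][a]
    average = [[s / iterations for s in sums] for sums in strategy_sums]
    return current, average


if __name__ == "__main__":
    last, average = self_play()
    print("last iterate:", last)
    print("average     :", average)
```

In this sketch the averaged strategies end up near the uniform equilibrium (1/3, 1/3, 1/3), which is what the regret bound promises, while nothing in the update seems to force the last iterate to settle there.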
And what is it about the CFR+ algorithm that allows the final strategy to be used?
Cheers!