I'm working on a Python implementation of most of the published CFR algorithms (I know it won't scale, but it's intended for research on toy problems). I've implemented and tested vanilla CFR on both half-street Kuhn and Leduc poker, comparing the resulting strategies against the open_cfr results (1B iterations); they're close enough (< 0.0001 per probability per infoset) that I'm pretty sure it's correct.
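For reference, the comparison is just the max absolute difference over every action probability at every infoset, something like this (the dict-of-lists strategy layout is illustrative, not my actual strategy objects):

Code:
def max_strategy_diff(strat_a, strat_b):
    # strat_a, strat_b: dict mapping infoset -> list of action probabilities
    return max(abs(p - q)
               for infoset in strat_a
               for p, q in zip(strat_a[infoset], strat_b[infoset]))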
Since I used a public tree to implement vanilla CFR, I decided to do public chance sampling (PCS) next. It seems like it should be simple: instead of a global iteration count, you keep a visit count for each node, use that count in the counterfactual regret update, and replace the board-node enumeration with a single random child choice.
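To illustrate the change with a standalone toy (this is not my tree code): vanilla CFR enumerates every chance outcome and averages their values, while PCS draws one outcome uniformly, which is an unbiased one-sample estimate of that same average.

Code:
import random

def chance_value_vanilla(child_values):
    # Enumerate every chance outcome and average (uniform chance event).
    return sum(child_values) / float(len(child_values))

def chance_value_pcs(child_values):
    # Sample one outcome uniformly: unbiased estimate of the average above.
    return random.choice(child_values)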
The code still works for HS Kuhn, which is a good sanity check since there are no board cards. However, when I run it on Leduc, it gets stuck around exploitability values of (0.35, 0.10) for players 1 and 2.
The Monte Carlo CFR papers talk about sampling the chance event in proportion to its likelihood, but since we're enumerating all holecards, it seems to me that uniform random sampling is okay. I've also tried doing a full Bayesian update of the probabilities of the cards remaining in the deck before sampling, and I seem to get the same results.
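To be concrete, the likelihood-weighted variant I tried boils down to something like this (card names and weights are made up for illustration, not from my actual code):

Code:
import random

def sample_weighted(cards, weights):
    # Draw one card with probability proportional to its weight.
    total = float(sum(weights))
    r, acc = random.uniform(0, total), 0.0
    for card, w in zip(cards, weights):
        acc += w
        if r <= acc:
            return card
    return cards[-1]  # guard against floating-point rounding

# e.g. down-weight a board rank that the enumerated holecard
# distributions say is more likely to be blocked:
board = sample_weighted(['J', 'Q', 'K'], [0.3, 0.3, 0.4])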
Am I missing something here? Is there some other adjustment I need to make to the algorithm?
Code for my PCS implementation is below:
Code:
import random

class PublicChanceSamplingCFR(CounterfactualRegretMinimizer):
    def __init__(self, rules):
        CounterfactualRegretMinimizer.__init__(self, rules)
        self.init_helper(self.tree.root)

    def init_helper(self, node):
        # Give every node its own visit counter, replacing the global
        # iteration count used by vanilla CFR.
        node.visits = 0
        try:
            for child in node.children:
                self.init_helper(child)
        except AttributeError:
            return  # leaf node: nothing to recurse into

    def cfr_helper(self, root, reachprobs):
        root.visits += 1
        return CounterfactualRegretMinimizer.cfr_helper(self, root, reachprobs)

    def cfr_boardcard_node(self, root, reachprobs):
        # Sample a single board card uniformly instead of enumerating
        # all of root.children as vanilla CFR does.
        bc = random.choice(root.children)
        return self.cfr_helper(bc, reachprobs)

    def cfr_regret_update(self, root, action_payoffs, ev):
        for action, subpayoff in enumerate(action_payoffs):
            if subpayoff is None:
                continue
            for hc, winnings in subpayoff[root.player].iteritems():
                # Clamp negative immediate regret to zero.
                immediate_regret = max(winnings - ev[hc], 0)
                infoset = self.rules.infoset_format(root.player, hc, root.board, root.bet_history)
                prev_regret = self.regret[root.player][infoset][action]
                # Running average of immediate regrets, keyed by this node's
                # visit count (vanilla uses the global iteration count here).
                self.regret[root.player][infoset][action] = 1.0 / (root.visits + 1) * (root.visits * prev_regret + immediate_regret)
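For comparison, here's the running-average form that both my vanilla and PCS updates share, pulled out as a standalone function (illustrative name; vanilla plugs in the global iteration count for count, PCS plugs in the node's visit count):

Code:
def running_average_update(prev_regret, immediate_regret, count):
    # Same arithmetic as the update above:
    # (count * prev + immediate) / (count + 1)
    return (count * prev_regret + immediate_regret) / float(count + 1)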