I'm working on my PCS (Public Chance Sampling) implementation, and it doesn't appear to be converging to the expected values. I suspect I'm either sampling incorrectly or updating my regret values incorrectly.
Here's my code:
https://github.com/tansey/pycfr/blob/master/pokercfr.py (PCS is at the bottom of the file)
I'm using a public tree implementation and calculating both players' values simultaneously. As far as I can tell, you still need to enumerate the actions and hole cards the same way, calculate the terminal payoffs the same way, and update the regret the same way (i.e., just add the counterfactual regret to the accumulated total). So I just subclass my CounterfactualRegretMinimizer and change how it handles board nodes:
class PublicChanceSamplingCFR(CounterfactualRegretMinimizer):
    def __init__(self, rules):
        CounterfactualRegretMinimizer.__init__(self, rules)

    def cfr(self):
        # Sample all board cards to be used
        self.board = random.sample(self.rules.deck, sum([x.boardcards for x in self.rules.roundinfo]))
        # Set the top card of the deck
        self.top_card = 0
        # Call the standard CFR algorithm
        self.cfr_helper(self.tree.root, [{(): 1} for _ in range(self.rules.players)])

    def cfr_boardcard_node(self, root, reachprobs):
        # Number of community cards dealt this round
        num_dealt = len(root.children[0].board) - len(root.board)
        # Possible combinations of community cards
        prevlen = len(reachprobs[0].keys()[0])
        possible_deals = float(choose(len(root.deck) - prevlen,root.todeal))
        # Find the child that matches the sampled board card(s)
        for bc in root.children:
            if self.boardmatch(num_dealt, bc):
                # Deal the card(s)
                self.top_card += num_dealt
                # Update the probabilities for each HC to be 1/N for N possible boardcard deals
                next_reachprobs = [{ hc: reachprobs[player][hc] / len(root.children) for hc in bc.holecards[player] } for player in range(self.rules.players)]
                # Perform normal CFR
                results = self.cfr_helper(bc, next_reachprobs)
                # Put the cards back in the deck (so we use the same sampled cards for every trajectory)
                self.top_card -= num_dealt
                # Return the payoffs
                return results
        raise Exception('Sampling from impossible board card')

    def boardmatch(self, num_dealt, node):
        # Checks if this node is a match for the sampled board card(s)
        for next_card in range(self.top_card, self.top_card + num_dealt):
            if self.board[next_card] not in node.board:
                return False
        return True
    def cfr_strategy_update(self, root, reachprobs):
        # Update the strategies and regrets for each infoset
        for hc in root.holecards[root.player]:
            infoset = self.rules.infoset_format(root.player, hc, root.board, root.bet_history)
            # Get the current CFR
            prev_cfr = self.counterfactual_regret[root.player][infoset]
            # Get the total positive CFR
            sumpos_cfr = float(sum([max(0,x) for x in prev_cfr]))
            if sumpos_cfr == 0:
                # Default strategy is equal probability
                probs = self.equal_probs(root)
            else:
                # Use the strategy that's proportional to accumulated positive CFR
                probs = [max(0,x) / sumpos_cfr for x in prev_cfr]
            # Use the updated strategy as our current strategy
            self.current_profile.strategies[root.player].policy[infoset] = probs
            # Update the weighted policy probabilities (used to recover the average strategy)
            for i in range(3):
                self.action_reachprobs[root.player][infoset][i] += reachprobs[root.player][hc] * probs[i]
            if sum(self.action_reachprobs[root.player][infoset]) == 0:
                # Default strategy is equal weight
                self.profile.strategies[root.player].policy[infoset] = self.equal_probs(root)
            else:
                # Recover the weighted average strategy
                self.profile.strategies[root.player].policy[infoset] = [self.action_reachprobs[root.player][infoset][i] / sum(self.action_reachprobs[root.player][infoset]) for i in range(3)]
        # Return and use the current CFR strategy
        return self.current_profile.strategies[root.player]
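
For anyone who doesn't want to read the whole class, the two normalization steps in cfr_strategy_update boil down to roughly the following standalone sketch. The function names regret_matching and average_strategy are made up for illustration and are not part of pycfr; the sketch takes plain lists rather than the infoset dictionaries used above.

```python
def regret_matching(cumulative_regret):
    """Return a strategy proportional to the positive cumulative regret."""
    positive = [max(0.0, r) for r in cumulative_regret]
    total = sum(positive)
    if total == 0:
        # No positive regret yet: fall back to the uniform strategy
        return [1.0 / len(cumulative_regret)] * len(cumulative_regret)
    return [p / total for p in positive]


def average_strategy(action_weights):
    """Normalize the accumulated reach-weighted action probabilities."""
    total = sum(action_weights)
    if total == 0:
        # Infoset never reached with positive weight: use equal weights
        return [1.0 / len(action_weights)] * len(action_weights)
    return [w / total for w in action_weights]


# Hypothetical regrets for three actions (e.g. fold, call, raise)
print(regret_matching([2.0, -1.0, 6.0]))
print(average_strategy([0.0, 0.0, 0.0]))
```

The first function gives the current strategy used on the next iteration; the second gives the average strategy, which is the one that should be evaluated for exploitability.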
Does anything stick out as obviously wrong to anyone?
My approach seems very similar to what I see in the open_cfr code, but after 100k iterations mine is still stuck at Exploitability0=-0.04 vs. -0.06 for open_cfr. Given that PCS should have lower variance, I think I must be doing something wrong here. Unfortunately, the C# code on the old forum isn't much help, since it doesn't cover flop games.
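
One piece that can be checked in isolation from pycfr is how the 1/N reach-probability factor in cfr_boardcard_node behaves once only a single board is visited per iteration. The sketch below is a toy with made-up child values and hypothetical function names. It compares full enumeration over N children against the exact expectation of a single uniformly sampled child, with and without the 1/N factor. Whether this accounts for the gap above is not established; it only shows how the two weightings differ.

```python
from fractions import Fraction


def full_enumeration(child_values):
    """Visit every child, weighting each by its 1/N chance probability."""
    n = len(child_values)
    return sum(Fraction(v, n) for v in child_values)


def expected_sampled_value(child_values, keep_chance_factor):
    """Exact expectation of the estimate when one child is sampled uniformly.

    keep_chance_factor=True mimics scaling the sampled child's value by 1/N.
    """
    n = len(child_values)
    scale = Fraction(1, n) if keep_chance_factor else Fraction(1)
    # Each child is drawn with probability 1/N; average the resulting estimates
    return sum(Fraction(1, n) * scale * v for v in child_values)


values = [3, -1, 4, 6]  # made-up per-board values, one per child
print(full_enumeration(values))
print(expected_sampled_value(values, keep_chance_factor=True))
print(expected_sampled_value(values, keep_chance_factor=False))
```

In this toy, the sampled estimate only matches full enumeration in expectation when the 1/N factor is dropped, because the sampling step already accounts for the chance probability. With the factor kept, values below a sampled board node end up scaled down relative to payoffs reached before it.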
Edit: Updated the code with comments.