Hi guys,
I've been working on RNR/DBR (Restricted Nash Response / Data Biased Response) for a while now, but some of my results don't make complete sense, so I would like to ask for your advice. The discussion is about heads-up no-limit (HUNL) Hold'em.
I have an equilibrium strategy EQ and a skewed strategy SKEW. Using my DBR implementation I obtained SKEW_BB_DBR by taking the BB strategy of SKEW and optimizing against it with external sampling. SKEW_BB_DBR beats SKEW by more than EQ does, which makes sense.
I did a similar test by taking the BB strategy of EQ and optimizing against it to get EQ_BB_DBR. My idea was that since EQ is an equilibrium, a best response to it is also an equilibrium strategy, so EQ vs EQ_BB_DBR should be break-even. However, I have failed to get this result: strangely, EQ beats EQ_BB_DBR by a small but significant margin.
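To spell out that reasoning (my notation, and assuming EQ is an exact equilibrium of the zero-sum game): u_{SB}( BR(\sigma^{EQ}_{BB}), \sigma^{EQ}_{BB} ) = \max_{\sigma_{SB}} u_{SB}( \sigma_{SB}, \sigma^{EQ}_{BB} ) = v^* = u_{SB}( \sigma^{EQ}_{SB}, \sigma^{EQ}_{BB} ), i.e. a best response to EQ's BB strategy should earn exactly as much against it as EQ's own SB strategy does, so the match should come out break-even up to sampling noise and up to how exact my equilibrium really is.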
Taking all this into account, I started to suspect that my regret and average-strategy updates might be wrong. The DBR solving runs along the lines of the following pseudo-code:
Code:
WalkTree( position p, history h ):
    # Handle non-player nodes
    if player( h ) == chance:
        sample action a according to \sigma_{chance}( h )
        return WalkTree( p, h + a )
    elif h == terminal:
        return utility

    # Compute the current CFR strategy from the regrets
    strategy s( h ) = regretMatching( h )

    # Handle the player which is optimized against (DBR biasing)
    if player( h ) == player_with_data:
        rho = data_precision( h )
        if U(0,1) < rho:
            s( h ) = data( h )

    # "Normal" ES CFR
    if player( h ) == p:
        average_value = 0
        for action a in possible_actions( h ):
            values( a ) = WalkTree( p, h + a )
            average_value += s( h )( a ) * values( a )
            regret( h )( a ) += values( a ) - average_value
        return average_value
    elif player( h ) != p:
        for action a in possible_actions( h ):
            average_strategy( h )( a ) += s( h )( a )
        sample action a according to s( h )
        return WalkTree( p, h + a )
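For comparison, here is how I would write the plain external-sampling traversal without the DBR biasing, as a Python-style sketch. The game interface and all identifiers are just placeholders for my own data structures, not code from any particular library:
Code:
# Reference external-sampling traversal (no DBR biasing), Python-style sketch.
# Assumed (placeholder) game interface:
#   game.player(h)        -> CHANCE, TERMINAL, 0 or 1
#   game.sample_chance(h) -> a chance action sampled from sigma_chance(h)
#   game.utility(h, p)    -> terminal utility for traverser p
#   game.actions(h)       -> list of legal actions at h
#   game.infoset(h, p)    -> information-set key for the acting player
import random
from collections import defaultdict

CHANCE, TERMINAL = "chance", "terminal"
regret = defaultdict(lambda: defaultdict(float))        # infoset -> action -> cumulative regret
avg_strategy = defaultdict(lambda: defaultdict(float))  # infoset -> action -> cumulative probability

def regret_matching(I, actions):
    # Current strategy proportional to positive regrets, uniform if none are positive.
    positive = {a: max(regret[I][a], 0.0) for a in actions}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in actions}
    return {a: 1.0 / len(actions) for a in actions}

def walk_tree(game, p, h):
    """Return the sampled value of history h for traversing player p."""
    player = game.player(h)
    if player == TERMINAL:
        return game.utility(h, p)
    if player == CHANCE:
        return walk_tree(game, p, h + [game.sample_chance(h)])

    actions = game.actions(h)
    I = game.infoset(h, player)
    s = regret_matching(I, actions)

    if player == p:
        # Traverser: explore every action, then update regrets with the full node value.
        values = {a: walk_tree(game, p, h + [a]) for a in actions}
        node_value = sum(s[a] * values[a] for a in actions)
        for a in actions:
            regret[I][a] += values[a] - node_value
        return node_value

    # Opponent: accumulate the average strategy and sample a single action.
    for a in actions:
        avg_strategy[I][a] += s[a]
    a = random.choices(actions, weights=[s[a] for a in actions])[0]
    return walk_tree(game, p, h + [a])
The one place where I am not sure this matches my DBR walker above is the regret update: here the regrets are only touched after the full node value has been accumulated.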
Could you please have a look at it and let me know if you spot anything wrong?
Cheers,