Poker-AI.org

Poker AI and Botting Discussion Forum
 Post subject: Cleaning up CFRM results
PostPosted: Fri Mar 15, 2013 3:06 pm 
Senior Member

Joined: Mon Mar 11, 2013 10:24 pm
Posts: 216
When running CFRM (I'm using ASS, where I guess the problem is bigger than in vanilla CFRM, but it exists there too), we obtain the e-Nash solution by looking at the average strategy (derived from the cumulative strategy stored in the nodes). In practice, I find nodes where, after 1B iterations, the decision is clear, yet the strategy still plays e.g. 99.1% Action X and 0.9% Action Y, because early in the learning process Y looked good, and only as convergence progressed did X turn out to be better. I wonder if we can somehow clean up these results, either during the run or after it. The latter case is easy: given a situation like that (e.g., AA in my 25bb game is raised as the first action 99.2% and called 0.8%), we check that the dominant action is above a certain threshold (e.g. 99%) and that all other actions have large negative cumulative regret, and then purge the rest. This makes our strategy a bit more "readable" and probably also a bit better, but we miss a lot of spots just below the threshold (e.g. 98.9%, even though the regrets of all other actions are hugely negative).
My idea is that it would be better to weight regrets/strategies added in more recent iterations higher (obviously not just a single one). In our example there are 1B iterations, and almost all the calls come from the earliest ones, where we had no clue how we would play later decisions. If we could find a function that continuously weights more recent results slightly higher, our strategy should converge faster; however, we cannot overdo this, or we would probably make mistakes when the algorithm switches back and forth between two equilibrium points.
The question I have is: has anyone implemented such an approach, and which function/heuristic are you using? Or does anyone have thoughts on the general approach, i.e., do you think it will or won't work? It's not only the 0.x% difference that makes me interested in this optimization, but also speeding up the algorithm itself: in the AA situation described above, the Call node is always evaluated too, as long as its probability is above 0.1%, so finding out faster that calling is worse than raising would let us cut the complete call branch from further inspection.
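A minimal sketch of the weighting idea described above, assuming a simple linear weight w(t) = t on iteration t's contribution to the cumulative strategy (the function names and the specific weighting scheme are illustrative, not something established in this thread):

```python
def update_cumulative_strategy(cum_strategy, current_strategy, t, reach_prob):
    """Accumulate the current strategy profile with a recency weight.

    w(t) = t weights later iterations linearly higher; w(t) = 1 recovers
    the vanilla uniform average. The reach probability weighting is the
    usual CFR-style contribution factor.
    """
    w = float(t)
    for a, p in enumerate(current_strategy):
        cum_strategy[a] += w * reach_prob * p
    return cum_strategy

def average_strategy(cum_strategy):
    """Normalize the cumulative strategy into a probability distribution."""
    total = sum(cum_strategy)
    n = len(cum_strategy)
    if total <= 0.0:
        return [1.0 / n] * n  # no data yet: fall back to uniform
    return [s / total for s in cum_strategy]
```

With this weighting, 1B early "call" iterations fade from the average much faster than under uniform averaging, which is exactly the cleanup effect asked about; the risk, as noted, is oscillation between equilibria if the weighting is too aggressive.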


PostPosted: Fri Mar 15, 2013 4:06 pm 
New Member

Joined: Sat Mar 09, 2013 4:56 pm
Posts: 5
I've used this simple post-processing algorithm:

Code:
if (regret[action] < 0 && strategy[action] < threshold)
    strategy[action] = 0;
// repeat for every action, then renormalize so the strategy sums to 1


It usually increases theoretical exploitability but performs better in practice.


PostPosted: Fri Mar 15, 2013 4:16 pm 
Senior Member

Joined: Mon Mar 11, 2013 10:24 pm
Posts: 216
Yes, I'm using something similar (but with an additional threshold on the regret, to cope with mixed strategies where the regret fluctuates around 0), but I am still looking for a way to do it on the fly while learning.


PostPosted: Fri Mar 15, 2013 7:47 pm 
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
I think it was Carnegie Mellon that showed that in some abstractions a certain degree of purification decreased exploitability in the real game. Conversely, for the U of A's CFRM-BR strategy, which already had a very low real-game exploitability, purification increased it.

I've tried some purification during training with chance sampling, which didn't seem to work. However, purifying the cumulative regret when doing ASS might work: not actually storing the purified values, just using them when deciding which node to sample. Seems reasonable, no?

As far as weighted updates in training go, Slumbot 2012 did that, and it seemed to work OK for him. I've tried it, but not thoroughly enough to show any kind of empirically meaningful results. Which raises the question: why use cumulative regret at all?


PostPosted: Fri Mar 15, 2013 7:59 pm 
Senior Member

Joined: Mon Mar 11, 2013 10:24 pm
Posts: 216
Nasher, just to be clear: I'm not using the cumulative regret directly, but Average Strategy Sampling, which samples based on the relative cumulative strategy for an action. Your idea of not changing the regrets/cumulative strategies but changing the sampling sounds nice; I will try that.


PostPosted: Fri Mar 15, 2013 8:05 pm 
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
Wait, you don't have cum in your ASS? :D <-- This is a proud day for poker AI, because it's completely legitimate.

I just mean in ASS, when summing up the cumulative regret and normalizing it, you could perform some purification (i.e. only sample the node if the normalized CR is above some threshold).
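A sketch of this sample-time purification, under the assumption that each action is sampled independently with a probability floored at some epsilon (roughly in the spirit of Average Strategy Sampling; the parameter values and function name here are illustrative):

```python
import random

def sample_actions(cum_values, epsilon=0.05, threshold=0.02, rng=random.random):
    """Decide which actions to traverse this iteration.

    Normalize the accumulated values, treat any action whose normalized
    share falls below `threshold` as zero (the purification step, applied
    only at sampling time and never stored), then sample each action
    independently with probability at least `epsilon` so that purified
    branches are still occasionally explored.
    """
    total = sum(cum_values)
    sampled = []
    for a, s in enumerate(cum_values):
        p = s / total if total > 0 else 1.0
        if p < threshold:
            p = 0.0  # purify: tiny residual mass no longer forces a traversal
        if rng() < max(epsilon, p):
            sampled.append(a)
    return sampled
```

The point of doing it here rather than in the stored averages is that the cumulative strategy stays untouched, so the theoretical average is unchanged; only the traversal frequency of near-dead branches drops.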


PostPosted: Fri Mar 15, 2013 8:39 pm 
Senior Member

Joined: Mon Mar 11, 2013 10:24 pm
Posts: 216
Nasher wrote:
Wait, you don't have cum in your ASS? :D <-- This is a proud day for poker AI, because it's completely legitimate.


hahaha, pure gold. I have to admit: like the fossilman, I do! And yes, that's what I was thinking after your reply. I'll try it out. Thanks for the idea and for making me laugh :)


PostPosted: Fri Mar 15, 2013 9:48 pm 
Junior Member

Joined: Sun Mar 10, 2013 11:38 pm
Posts: 34
What info do you feed into ASS,HOLE cards? I'm sorry, but I can't take this thread seriously. :lol:

_________________
www.bespokebots.com


PostPosted: Fri Mar 15, 2013 11:39 pm 
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
proud2bBot wrote:
hahaha, pure gold. I have to admit: like the fossilman, I do! And yes, that's what I was thinking after your reply. I'll try it out. Thanks for the idea and for making me laugh :)


You can purify the CS before passing it to the e-greedy formula. The formula will still select nodes below your threshold epsilon% of the time, but by temporarily purifying and re-accumulating the CS, the probability of taking the OTHER nodes increases, because your cs_sum is smaller.

This makes me wonder whether anybody has experimented with other purification schemes, like squaring and re-normalizing, or something along those lines.
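For comparison, here is a sketch of the two schemes mentioned in this post: hard threshold purification (zero out small entries, renormalize) versus squaring-and-renormalizing (a smooth push of probability mass toward dominant actions). Function names and parameter values are mine, purely illustrative:

```python
def purify_threshold(cs, threshold=0.01):
    """Zero entries whose normalized share is below `threshold`, renormalize."""
    total = sum(cs)
    if total <= 0.0:
        return list(cs)
    probs = [s / total for s in cs]
    kept = [p if p >= threshold else 0.0 for p in probs]
    ktotal = sum(kept)
    if ktotal == 0.0:
        return probs  # everything fell below threshold: keep the original
    return [p / ktotal for p in kept]

def purify_square(cs):
    """Square each entry and renormalize: sharpens the distribution smoothly,
    with no hard cutoff, so near-threshold actions lose mass gradually."""
    sq = [s * s for s in cs]
    total = sum(sq)
    if total <= 0.0:
        return list(cs)
    return [s / total for s in sq]
```

Squaring avoids the cliff at the threshold (the 98.9% vs 99.1% problem from the opening post), at the cost of never fully zeroing an action, so it cannot by itself prune a branch from traversal.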

