Poker-AI.org

Re: Cleaning up CFRM results

2013-03-15T23:39:23+00:00

proud2bBot wrote:

hahaha, pure gold. I have to admit: like the fossilman I do! And yes, thats what I was thinking after your reply. I'll try it out. Thanks for the idea and making me laugh

You can purify the CS before passing it to the e-greedy formula. The formula will still select nodes below your threshold by epsilon %, but by temporarily purifying and re-accumulating CS, the probability of taking the OTHER nodes will increase because your cs_sum will be smaller.

This makes me wonder if anybody has experimented with other purification schemes? Like, maybe squaring and re-normalizing, or something like that.

Statistics: Posted by cantina — Fri Mar 15, 2013 11:39 pm

Re: Cleaning up CFRM results

2013-03-15T21:48:52+00:00

What info do you feed into ASS,HOLE cards? I'm sorry, but I can't take this thread seriously.

Statistics: Posted by birchy — Fri Mar 15, 2013 9:48 pm

Re: Cleaning up CFRM results

2013-03-15T20:39:38+00:00

Nasher wrote:

Wait, you don't have cum in your ASS? <-- This is a proud day for poker AI, because it's completely legitimate.

hahaha, pure gold. I have to admit: like the fossilman I do! And yes, thats what I was thinking after your reply. I'll try it out. Thanks for the idea and making me laugh

Statistics: Posted by proud2bBot — Fri Mar 15, 2013 8:39 pm

Re: Cleaning up CFRM results

2013-03-15T20:05:34+00:00

Wait, you don't have cum in your ASS? <-- This is a proud day for poker AI, because it's completely legitimate.

I just mean in ASS, when summing up the cumulative regret and normalizing it, you could perform some purification (i.e. only sample the node if the normalized CR is above some threshold).

Statistics: Posted by cantina — Fri Mar 15, 2013 8:05 pm

Re: Cleaning up CFRM results

2013-03-15T19:59:38+00:00

Nasher, just to be clear, I'm not using cum. regret directly, but Average Strategy Sampling, which samples based on the relative cumulative strategy for an action. Your idea to not change the regrets/cumulative strategies, but change the sampling seems nice, I will try that.

Statistics: Posted by proud2bBot — Fri Mar 15, 2013 7:59 pm

Re: Cleaning up CFRM results

2013-03-15T19:47:44+00:00

I think is was Carnagie Mellon that showed in some abstractions a certain degree of purification decreased exploitability in the real game. Conversely, for the U of A's CFRM-BR strategy, which already had a very low real-game exploitability, it increased it.

I've tried some purification in training with chance sampling, which didn't seem to work. However, maybe by purifying the cumulative regret when doing ASS it might work. Not actually storing the purification, just when deciding which node to sample. Seems reasonable, yes?

As far as weighted updates in training, slumbot 2012 did that -- it seemed to work Ok for him. I've tried it, but not thoroughly enough to show any kind of empirically meaningful results. Such a thing bodes the question: why use cumulative regret at all?

Statistics: Posted by cantina — Fri Mar 15, 2013 7:47 pm

Re: Cleaning up CFRM results

2013-03-15T16:16:33+00:00

Yes, I'm using something similar (but have an additional threshold for regret in order to cope with mixed strategies where the regret fluctuates between 0), but I am still thinking of a possibility to do it on the run while learning.

Statistics: Posted by proud2bBot — Fri Mar 15, 2013 4:16 pm

Re: Cleaning up CFRM results

2013-03-15T16:06:51+00:00

I've used this simple post-processing algorithm:

Code:

if (regret[action] < 0 && strategy[action] < threshold)
    strategy[action] = 0;

It usually increases theoretical exploitability but performs better in practice.

Statistics: Posted by amax — Fri Mar 15, 2013 4:06 pm

Cleaning up CFRM results

2013-03-15T15:06:30+00:00

When running CFRM (I'm using ASS, where the problem is bigger than in regular CFRM I guess, but there its still existent), we obtain the e-Nash solution by looking at the average strategy (derived from the cumulative strategy stored in the nodes). In practice, I find nodes where after 1B iterations, its a clear decision, but the strategy does for example 99.1% Action X and 0.9% Action Y, because at the beginning of the learning process, Y seems to be good but as the convergence progresses, it showed that X is better. I wonder if we can somehow clean up these results, either during the process or after it. The latter case is easy: given we have a situation like that (e.g., AA in my 25bb game is raised as first action 99.2% and called 0.8%), we make sure the dominant action is over a certain threshold (e.g. 99%) and all other actions have a large negative cumulative regret. However, this makes our strategy a bit more "readable" and probably also a bit better, but we miss a lot of spots close to the threshold (e.g. 98.9%, while the regrets for all others are hugely negative).
My idea is that it would be better to value regrets/strategies higher, that are added in a more recent iteration (obv. not a single one). For instance, in our example, there are 1B iterations and almost all calls come from the first ones where we didn't have a clue how we play on later decisions. If we could find a function that values more recent results slightly higher (continuously), our strategy should converge faster; however we cannot overdo this as otherwise we would probably make a mistake when switching back and forth between two equilibrium points.
The question I have is does someone implement such an approach and which function/heuristic are you using? Or has anyone thoughts on the general approach, i.e., do you think it will work/won't work? It's not only the 0.x% difference that makes me interested in this optimization, but also speeding up the algorithm itself: if we have the AA situation described above, the Call node is always evaluated too, as long its % is larger than 0,1%, so finding out faster that calling is worse than raising enables us to cut the complete call-branch from further inspection.

Statistics: Posted by proud2bBot — Fri Mar 15, 2013 3:06 pm