Reposting my post from another thread:
The test was not run in the best way: only the top 3 of 30 players finished in the money (ITM), so the humans were encouraged to gamble rather than play their real A-game as they would at a cash table. Still, 45bb is truly amazing.
I would like to try to reproduce the DeepStack algorithm. The bulk of the cost is reproducing the 10M-sample training set of random situations solved by CFR, which is used to train the network with 3500 hidden units. They ran 6144 CPUs for 11 days; I've estimated that would cost around 50k euro. With a 5k investment I can produce at best 1M samples, so I was thinking of either starting with a shallower poker game, like HUSNG, where I could produce 3-5M samples, or doing some data expansion by taking multiple examples from the same solved game, but that doesn't fit DeepStack's re-solving mechanism.
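To make the scale of the problem concrete, here's a minimal sketch of what that data-generation loop looks like (all names are mine, not the paper's; solve_cfr is a placeholder for whatever CFR solver one plugs in):

```python
import random
import numpy as np

NUM_HANDS = 1326  # private hand combos in hold'em
DECK = [r + s for r in "23456789TJQKA" for s in "cdhs"]

def random_situation():
    """Sample a random pot size, random ranges for both players, and a flop."""
    pot = random.uniform(2.0, 200.0)              # in big blinds
    r1 = np.random.dirichlet(np.ones(NUM_HANDS))  # our range
    r2 = np.random.dirichlet(np.ones(NUM_HANDS))  # opponent's range
    board = random.sample(DECK, 3)
    return pot, r1, r2, board

def make_dataset(n_samples):
    data = []
    for _ in range(n_samples):
        pot, r1, r2, board = random_situation()
        # solve_cfr is a stand-in for any CFR solver; it should return
        # the per-hand counterfactual values at the solved subgame root.
        cfv1, cfv2 = solve_cfr(pot, r1, r2, board, iterations=1000)
        data.append((pot, r1, r2, board, cfv1, cfv2))
    return data
```

The cost lives almost entirely in the solve_cfr call, which is why the sample count is the budget constraint.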
pulser wrote:
Excited to finally see them going for the much more scalable approach of online solving combined with trained models for look-ahead, instead of sticking to precomputed strategies.
One thing confuses me though, they say that they ignore the opponent's actual action when doing the recalc. Does that mean they ignore the opponent's bet size as well, and then just map it to one of the "2 or 3 bet/raise actions" post-recalc? Why not consider the actual size as an optional path?
Also for their own bets, I don't see any mention of bet sizing, which leads me to believe they used the same ½P, P, 2P, All-in sizings they used for training the networks(?) Again it sounds like they're leaving an unnecessary amount of chips on the table.
Either way, impressive results!
The value function (the previously trained neural network) returns an approximation of the counterfactual utility of every possible hand the opponent could hold, taking as input only the pot size and DeepStack's range. So during the simulation the algorithm doesn't consider the previous actions or the bet sizes, only the pot size.
The abstraction is implicit and continuous in the network that produces the value function; they don't map anything in an explicit way.
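As a rough sketch of that interface (the single hidden layer of 3500 units follows the description above; the actual architecture and input encoding in the paper may differ, so treat this as an assumption):

```python
import torch
import torch.nn as nn

NUM_HANDS = 1326  # private hand combos in hold'em

class ValueNet(nn.Module):
    """Maps (pot size, our range) -> counterfactual value per opponent hand."""
    def __init__(self, hidden=3500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1 + NUM_HANDS, hidden),  # pot size + our range vector
            nn.ReLU(),
            nn.Linear(hidden, NUM_HANDS),      # one CFV per opponent hand
        )

    def forward(self, pot, our_range):
        # pot: (batch,), our_range: (batch, NUM_HANDS)
        x = torch.cat([pot.unsqueeze(-1), our_range], dim=-1)
        return self.net(x)

# Usage: cfvs = ValueNet()(torch.tensor([40.0]), torch.rand(1, NUM_HANDS))
```

Training is then a plain regression of this output onto the CFR-solved counterfactual values from the generated dataset (a Huber loss, if I remember the paper right).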
In other words, the value function is a way to assign a value to being in a given state of the game. DeepStack's exploitability goes to zero (i.e., its strategy converges to a Nash equilibrium) as the approximation error of the network goes to zero. Zero error is not achievable, but judging from the test, the error is small enough that humans can't exploit it.
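To pin down what "exploitability goes to zero" means, here is the standard definition for a two-player zero-sum game (standard game-theory notation, not the paper's exact statement):

```latex
% Exploitability of a strategy profile \sigma = (\sigma_1, \sigma_2):
% the total amount a best responder gains against each half of it.
\mathrm{expl}(\sigma) \;=\;
  \max_{\sigma_1'} u_1(\sigma_1', \sigma_2)
  \;+\;
  \max_{\sigma_2'} u_2(\sigma_1, \sigma_2')
```

This quantity is zero exactly at a Nash equilibrium, and DeepStack's theoretical result bounds it in terms of the value network's approximation error, which is why driving that error down is the whole game.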