spears wrote:
I was assuming and hoping it produces a mixed strategy. I believe a pure best response strategy would be highly exploitable.
Not sure, but I think you can just train an additional net as an explicit policy net. So you have one net that tells you the value of states, and one that tells you the probabilities ("percentages") with which to take each action.
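Very rough sketch of what I mean (assuming some fixed-size state encoding and PyTorch, both of which are just my assumptions here, and made-up sizes):

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- the real state encoding / action count depend on the game.
STATE_DIM = 128   # assumed size of the encoded game state
NUM_ACTIONS = 3   # e.g. fold / call / raise in a Limit game

# Value net: state -> scalar estimate of the state's value.
value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

# Policy net: state -> probability ("percentage") for each action.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),
    nn.Softmax(dim=-1),
)
```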
old: I would probably still use the action that leads to the maximum-value state by default, and only use the policy when the state values are close to each other (that's during actual play; in training/self-play you always use the policy, except for epsilon-greedy exploration stuff)... edit: stupid me: you make the opponent indifferent to calling/folding, not yourself, doh. new: just trust your nets
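So action selection, in the "just trust your nets" version and continuing the sketch above, would look roughly like this (again just my sketch, nothing authoritative):

```python
import random
import torch

def select_action(state, training: bool, epsilon: float = 0.05):
    """Sample an action from the policy net; explore uniformly with prob. epsilon in self-play."""
    if training and random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)       # epsilon-greedy exploration during self-play
    with torch.no_grad():
        probs = policy_net(state)                   # mixed strategy over actions
    return torch.distributions.Categorical(probs=probs).sample().item()
```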
But I am not sure whether training the policy net would require you to evaluate all possible actions (more expensive to train; no problem for HU Limit, but more of a problem for my case, 6-max NL...). But we can use the state-value estimation network for that.
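What I mean by using the value net for that, roughly (next_state(state, a) is a hypothetical helper that applies an action to the current state, made up for this sketch):

```python
def action_values(state, legal_actions):
    """Score each legal action by the value net's estimate of the resulting state."""
    values = {}
    with torch.no_grad():
        for a in legal_actions:
            # next_state() is a hypothetical helper, not something that exists yet.
            values[a] = value_net(next_state(state, a)).item()  # one-step lookahead
    return values
```

This needs one forward pass per legal action, which is why it matters more for the larger action space of 6-max NL than for HU Limit.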