Quote:
I wonder how you would set up a differentiable loss function so that SGD-based optimizers could work? The original CFR algorithm seems non-differentiable.
I'm not sure I entirely follow you, but I imagine it could work like this. In every iteration:
1. Sample from the game tree, calculate regrets for each information set, and derive a strategy from the regrets via regret matching. For example, we might get "with AA preflop, raise 100% of the time".
2. Do one backpropagation training step on the neural network(s), using that strategy as the target. Say the network currently predicts "with AA preflop, raise 90% of the time". We say "no, 90% is incorrect, the correct answer is 100%" and update in that direction, so after the step it might output 91%, depending on the learning rate. (A code sketch follows this list.)
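Here's a minimal sketch of what one such iteration could look like, assuming a PyTorch policy network that maps an information-set encoding to action logits. All names here (`policy_net`, `train_step`, etc.) are hypothetical, not an existing API:

```python
import torch
import torch.nn.functional as F

def regret_matching(regrets: torch.Tensor) -> torch.Tensor:
    """Standard CFR step: turn accumulated regrets into a strategy."""
    positive = torch.clamp(regrets, min=0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    # No action has positive regret: fall back to the uniform strategy.
    return torch.full_like(regrets, 1.0 / regrets.numel())

def train_step(policy_net, optimizer, infoset_features, sampled_regrets):
    # Step 1: target strategy from the sampled regrets,
    # e.g. "with AA preflop, raise 100% of the time".
    target = regret_matching(sampled_regrets)
    # Step 2: one backprop step pulling the network toward that target.
    logits = policy_net(infoset_features.unsqueeze(0))   # batch of one
    loss = F.cross_entropy(logits, target.unsqueeze(0))  # soft targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The cross-entropy against the soft regret-matching target is the differentiable loss the quoted question asks about: the target itself is computed outside the graph, so nothing in CFR needs to be differentiated through.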
An important note: the effective learning rate for each update would be proportional to the counterfactual reach probability, mirroring how regrets are weighted in CFR. That's the main idea behind this neural network approach. Intuitively, it could plausibly work well, because the same weighting works for CFR.
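Continuing the sketch above, one way to implement that weighting (assuming the tree sampler provides the counterfactual reach probability as `cf_reach_prob`) is to scale the per-sample loss rather than touch the optimizer's learning rate; the effect on the gradient is the same:

```python
def weighted_train_step(policy_net, optimizer, infoset_features,
                        sampled_regrets, cf_reach_prob: float):
    target = regret_matching(sampled_regrets)
    logits = policy_net(infoset_features.unsqueeze(0))
    # Scaling the loss by the counterfactual reach probability scales
    # the gradient, i.e. it acts as a per-sample learning rate.
    loss = cf_reach_prob * F.cross_entropy(logits, target.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```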
Another note: the strategy targets from step one are obviously noisy, because they're based on just one sample. If the sampled regrets are unbiased estimates (as in MCCFR), the regrets themselves average out correctly, but regret matching is a nonlinear function of the regrets, so the resulting strategies need not be unbiased on average.
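A quick toy check of that point, with made-up regret numbers (nothing from the actual algorithm): the two samples below average to the true regrets, yet the averaged strategies don't match the true strategy.

```python
import torch

def regret_matching(r: torch.Tensor) -> torch.Tensor:
    p = torch.clamp(r, min=0.0)
    return p / p.sum() if p.sum() > 0 else torch.full_like(r, 1.0 / r.numel())

true_regrets = torch.tensor([1.0, -1.0])
# Two equally likely samples whose mean equals the true regrets.
samples = [torch.tensor([2.0, -2.0]), torch.tensor([0.0, 0.0])]

print(regret_matching(true_regrets))                  # tensor([1., 0.])
avg_strategy = sum(regret_matching(s) for s in samples) / 2
print(avg_strategy)                                   # tensor([0.7500, 0.2500])
```

So the strategy targets need not be unbiased even when the regret estimates are; whether that bias matters in practice, I don't know.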