Post subject: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sat Sep 22, 2012 10:16 pm
arm99 (New member)

Robust Decision Making and Opponent Exploitation in Simplified Poker

Author: Arminas Kemesius
Year: 2012

Abstract:

The game of Kuhn poker was used to investigate the process of decision making in competitive scenarios. This relatively simple game shares many features with the more common and far more complex poker games. Game-theoretic methods were employed to develop a range of unbeatable strategies, the Nash equilibria of the game. With some modification, these same methods were then able to produce strategies aimed at exploiting the weaknesses in an opponent's decision making. The algorithm was then further modified to increase competitiveness against unforeseen opponents. Finally, an adaptive strategy algorithm was developed, capable of exploiting unknown opponents by modelling their strategy during game play and adapting accordingly.
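For reference, the equilibria of Kuhn poker form the classic one-parameter family (Kuhn, 1950). A minimal sketch in Python (not code from the thesis), with cards numbered 0=J, 1=Q, 2=K:

Code:
def kuhn_equilibrium(alpha):
    """Kuhn's one-parameter equilibrium family, alpha in [0, 1/3].
    Returns per-card action probabilities for both players; the game
    value to player 1 is -1/18 for every alpha."""
    assert 0.0 <= alpha <= 1.0 / 3.0
    p1 = {  # card: (P(bet as first action), P(call after check-bet))
        0: (alpha, 0.0),             # J: bluff sometimes, never call
        1: (0.0, alpha + 1.0 / 3.0), # Q: always check, sometimes call
        2: (3.0 * alpha, 1.0),       # K: bet 3*alpha, always call
    }
    p2 = {  # card: (P(bet after a check), P(call facing a bet))
        0: (1.0 / 3.0, 0.0),         # J: bluff 1/3, never call
        1: (0.0, 1.0 / 3.0),         # Q: never bet, call 1/3
        2: (1.0, 1.0),               # K: always bet, always call
    }
    return p1, p2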

This is my dissertation for the MSc in Control Systems at The University of Sheffield.


Attachments:
A.Kemesius - Robust Decision Making and Opponent Exploitation in Simplified Poker - 2012.pdf [888.53 KB]
Downloaded 118 times
Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 12:27 am
c2008 (PokerAI fellow)

So, if I read this right, they used BE to determine which ND-RNR strategy is the most profitable versus a given opponent?


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 1:12 am
arm99 (New member)

c2008 wrote:
So, if I read this right, they used BE to determine which ND-RNR strategy is the most profitable versus a given opponent?


Seven ND-RNR strategies were computed offline, each for a specific type of opponent.
When online (playing actual games), Bayesian inference was used to compute the probability of the opponent being each of the 7 known types.
The ND-RNR response for the maximum-probability opponent type was then chosen.
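
Schematically, the online selection step is just this (illustrative names, not code from the thesis):

Code:
def pick_response(posterior, responses):
    """posterior: opponent type -> current probability;
    responses: opponent type -> precomputed ND-RNR strategy."""
    most_likely = max(posterior, key=posterior.get)
    return responses[most_likely]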

I hope this answers the question, as I'm not sure what you meant by 'BE'.


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 1:30 am
c2008 (PokerAI fellow)

Bayesian Erections...


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 1:34 am
c2008 (PokerAI fellow)

I might have missed it, but they didn't compare the performance to the standard Importance Sampling?


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 1:55 am
arm99 (New member)

c2008 wrote:
Bayesian Erections...


The probabilities were updated by Recursive Bayesian Estimation: the histories ending in a showdown were used to update the opponent-type probabilities after every game played.
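
A minimal sketch of that update (illustrative names, not code from the thesis; assumes strategies are stored as infoset -> [P(pass), P(bet)], betting histories are strings over 'p'/'b', and the showdown reveals the opponent's card):

Code:
def update_posterior(prior, types, opp_card, history, opp_seat):
    """One recursive Bayes step: posterior ~ prior * likelihood of the
    opponent's observed moves under each candidate type."""
    posterior = {}
    for t, strategy in types.items():
        likelihood = 1.0
        for i, move in enumerate(history):
            if i % 2 == opp_seat:                 # opponent's decisions only
                infoset = str(opp_card) + history[:i]
                probs = strategy[infoset]         # [P(pass), P(bet)]
                likelihood *= probs[0 if move == "p" else 1]
        posterior[t] = prior[t] * likelihood
    norm = sum(posterior.values())
    # if norm == 0, the observation rules out every modelled type
    return {t: v / norm for t, v in posterior.items()} if norm else prior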

c2008 wrote:
I might have missed it, but they didn't compare the performance to the standard Importance Sampling?


For the adaptive play, which I assume you are referring to, the hands were dealt out at random, and the actual as well as the theoretical (EV) profits were computed to indicate the profitability over a series of games (Fig. 34). Correct opponent identification led to an increase in profitability.

The performance of the strategies did not require estimation: Kuhn poker is small enough to compute it directly through straightforward EV calculations.
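
For concreteness, this is roughly what that direct calculation looks like (a sketch, using the strategy layout from the equilibrium snippet in the first post):

Code:
from itertools import permutations

def ev_p1(p1, p2):
    """Exact EV to player 1, averaged over the 6 equally likely deals.
    p1[card] = (P(bet first), P(call after check-bet));
    p2[card] = (P(bet after check), P(call facing a bet))."""
    total = 0.0
    for c1, c2 in permutations(range(3), 2):
        bet1, call1 = p1[c1]
        bet2, call2 = p2[c2]
        sd = 1 if c1 > c2 else -1                  # showdown sign for P1
        ev  = bet1 * ((1 - call2) * 1 + call2 * 2 * sd)    # bet: fold or call
        ev += (1 - bet1) * (1 - bet2) * sd                 # check, check
        ev += (1 - bet1) * bet2 * ((1 - call1) * -1 + call1 * 2 * sd)  # check, bet
        total += ev / 6.0
    return total

# sanity check against the equilibrium sketch in the first post:
# ev_p1(*kuhn_equilibrium(1/3))  ->  -1/18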


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sun Sep 23, 2012 6:30 am
c2008 (PokerAI fellow)

I can't help but get the feeling you're a part of "they," arm99?


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Mon Sep 24, 2012 10:05 pm
arm99 (New member)

c2008 wrote:
I can't help but get the feeling you're a part of "they," arm99?



This is my dissertation for the MSc in Control Systems at The University of Sheffield :)


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Tue Sep 25, 2012 1:37 am
Junior member (Favourite Bot: Polaris Orange)

Hi arm99,

I've just had a quick look at your thesis, but it doesn't look like what you're doing is Restricted Nash Response at all - it actually looks like just a (non-robust) best response calculation using CFR.

In CFR, if you let both players adapt, the average strategy for the players converges to a Nash equilibrium. Instead, if you load a strategy for one player and hold that strategy static (i.e., you don't update it with CFR) and use CFR to update the other player, then that player's strategy (current and average) converges to a best response. From Figure 20 and the text of your thesis, it looks like that's what you're doing, and the text below Figure 20 notes that it takes only a small number of iterations (5) to compute an exploitive strategy. But that strategy isn't necessarily any more robust than any other best response computed via any other algorithm: it can be just as brittle. This isn't RNR at all; it's just another way to do best response.
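For anyone following along, here is a compact sketch of that procedure (vanilla CFR in the style of Neller & Lanctot's CFR tutorial, not code from the thesis). Passing frozen=(player, strategy) holds that player static, and CFR then drives the other player to a best response; with frozen=None it is plain CFR and converges to equilibrium. Infosets are keyed by card plus history (e.g. "0", "2pb"), and strategies are [P(pass), P(bet)]:

Code:
from itertools import permutations

TERMINAL = ("pp", "bp", "bb", "pbp", "pbb")    # p = pass/fold, b = bet/call

class Node:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def strategy(self, weight):
        # regret matching: play actions in proportion to positive regret
        s = [max(r, 0.0) for r in self.regret_sum]
        norm = sum(s)
        s = [x / norm for x in s] if norm > 0 else [0.5, 0.5]
        self.strategy_sum[0] += weight * s[0]
        self.strategy_sum[1] += weight * s[1]
        return s

    def average_strategy(self):
        norm = sum(self.strategy_sum)
        return [x / norm for x in self.strategy_sum] if norm > 0 else [0.5, 0.5]

def cfr(nodes, cards, h, reach, frozen=None):
    """Returns utility to the player to act; reach = (P0 reach, P1 reach)."""
    player = len(h) % 2
    if h in TERMINAL:
        higher = cards[player] > cards[1 - player]
        if h.endswith("p"):
            return (1 if higher else -1) if h == "pp" else 1   # showdown/fold
        return 2 if higher else -2                             # bet was called

    infoset = str(cards[player]) + h
    if frozen is not None and frozen[0] == player:
        strat, node = frozen[1][infoset], None   # static player: no updates
    else:
        node = nodes.setdefault(infoset, Node())
        strat = node.strategy(reach[player])

    util, node_util = [0.0, 0.0], 0.0
    for a, move in enumerate("pb"):
        nxt = list(reach)
        nxt[player] *= strat[a]
        util[a] = -cfr(nodes, cards, h + move, tuple(nxt), frozen)
        node_util += strat[a] * util[a]

    if node is not None:    # counterfactual regret, opponent-reach weighted
        for a in (0, 1):
            node.regret_sum[a] += reach[1 - player] * (util[a] - node_util)
    return node_util

def train(iterations, frozen=None):
    nodes = {}
    for _ in range(iterations):
        for cards in permutations((0, 1, 2), 2):    # all 6 deals, uniform
            cfr(nodes, cards, "", (1.0, 1.0), frozen)
    return {k: n.average_strategy() for k, n in nodes.items()}

# e.g. train(10000, frozen=(1, some_p1_model)) gives P0's best response
# to the (hypothetical) fixed model some_p1_model.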

RNR is more complicated, and takes at least as long as CFR: either more iterations, or the iterations are slower. For restricted Nash response, you have an exploiting player and a restricted player. The exploiting player has one strategy and updates it via CFR, just as in the normal CFR algorithm. The restricted player has two strategies: one that is static (not updated at all) and initialized to be the opponent's strategy (or a model of it), and the other is not static and is updated via CFR, just like in the normal CFR algorithm. You also have a probability for the restricted player at the start of the game, p, that determines the probability that the restricted player plays according to the static strategy. With probability (1-p), they play according to the adaptive strategy. Different values of p allow you to generate any strategy for the exploiting player from a Nash equilibrium (when opponent's p=0) to a best response (when opponent's p=1) to something in between. And RNR doesn't just find *a* strategy in between these goals: by varying p, it finds the *best possible* set of tradeoffs between them. If you run RNR until it's "converged enough", it won't be possible for there to be a strategy that is both less exploitable and wins more against the target opponent.

There are two ways to implement RNR with the p parameter. One way is just to sample a number s on each iteration. If (s < p), then update only the exploiting player with respect to the static opponent, as in your Figure 20's right hand side. If (s > p), then update both players just as in CFR, as in your Figure 20's left hand side. If p=0.5, then the iterations should take as long as the average between the two updates, since you're doing half of the slow left hand side (updating two players) and half of the faster right hand side (updating one player). The other approach, which is what we do, is to update with respect to both opponent trees on each iteration, and just weight the regret updates to the exploiting player by (p) and (1-p). This means that each iteration is slower, since you do three tree traversals on each iteration. Using P1 to refer to the exploiting player and P2-static and P2-adaptive to refer to the two trees for the restricted player, you need to update (P1-vs-P2-static, P1-vs-P2-adaptive, P2-adaptive-vs-P1). So, more work per iteration than CFR. And it will take about as many iterations as CFR for P1 to converge.
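
As a sketch of the first (sampling) approach, reusing cfr and Node from the snippet above, with player 1 as the restricted player, opp_model as the static model of the opponent, and p as the restriction probability:

Code:
import random
from itertools import permutations

def rnr(opp_model, p, iterations, seed=0):
    """Sampled RNR: returns the exploiting player's (P0's) average strategy."""
    rng = random.Random(seed)
    nodes = {}   # exploiter's nodes plus the restricted player's adaptive
                 # nodes; the static model opp_model is never updated
    for _ in range(iterations):
        # with prob p: update only the exploiter vs the static opponent;
        # otherwise update both players exactly as in plain CFR
        frozen = (1, opp_model) if rng.random() < p else None
        for cards in permutations((0, 1, 2), 2):
            cfr(nodes, cards, "", (1.0, 1.0), frozen)
    # P0 acts at even-length histories, so its keys (card + history)
    # have odd length; keep only those
    return {k: n.average_strategy() for k, n in nodes.items()
            if len(k) % 2 == 1}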

Am I wrong about what you've done? From Figure 20 it sure looks like you're just computing best responses via CFR and not doing RNR at all. If you are doing RNR, then what values of p did you use in your experiments?

One of the benefits of RNR is that it should completely avoid the problems you created ND-RNR to avoid. Since the adaptive part of the opponent can force the exploiting player down any line of play that is reachable given its strategy, there really shouldn't be any parts of the exploiting player's strategy left un-tuned. If there is any reachable flaw in the exploiting player's strategy, then the opponent's adaptive strategy will be able to gain value by exploring that line of play, and the exploiting player will be forced to update their strategy at those decisions. This is just like how, in CFR, both players probe each other's flaws in order to reach an equilibrium. If RNR is run properly, then I don't believe ND-RNR will be necessary at all. The strategy might still have un-updated decisions, but only in the non-reachable parts of its tree. For example, if its strategy never calls as its first action, then its decisions following that first call may or may not be updated. But it doesn't matter, because those later decisions aren't reachable by any opponent. Note that this only holds if you're running RNR (or, for that matter, CFR) long enough to let the players find each other's flaws.

For more on these themes, and for an algorithm that's less useful in theory but much stronger in practice if you only get observations of the opponent instead of their whole strategy, have a look at our Data Biased Response paper from AISTATS 2009.


Post subject: Re: Robust Decision Making and Opponent Exploitation in Simplified Poker
PostPosted: Sat Sep 29, 2012 1:01 pm
c2008 (PokerAI fellow)

arm99 wrote:
For the adaptive play, which I assume you are referring to, the hands were dealt out at random, and the actual as well as the theoretical (EV) profits were computed to indicate the profitability over a series of games (Fig. 34). Correct opponent identification led to an increase in profitability.

The performance of the strategies did not require estimation: Kuhn poker is small enough to compute it directly through straightforward EV calculations.


I'm wondering, however, which is the better/faster/stronger classifier: your Bayes approach or Importance Sampling? Maybe MBJ can speculate...

