Poker-AI.org

Poker AI and Botting Discussion Forum

All times are UTC




Post new topic Reply to topic  [ 42 posts ]  Go to page Previous  1, 2, 3  Next
Author Message
PostPosted: Fri Sep 06, 2013 9:51 am 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
I don't know. If I'm just reading 8 bytes at offset x of pointer p, would it pull everything referenced by that pointer into cache? If so, maybe I could find some way to split the array up and marshal it to different locations with different reference pointers.

Maybe I should try splitting up the arrays without marshaling first. I could define an array for each texture bucket post-flop, then just merge them back together at loading/saving time.


PostPosted: Fri Sep 06, 2013 9:58 am 
Offline
Regular Member

Joined: Sun Mar 03, 2013 11:55 am
Posts: 64
Requesting something from an array does not move the entire array into the cache; it still just grabs a section of memory. I'm not totally sure how that section is decided, but the hardware doesn't look at what it's getting hold of: it literally just grabs a block, so that memory may be totally unrelated.

You should be using arrays of structs in .NET. An array of objects is just an array of object references, so the objects themselves won't be local to the array. An array of structs stores the structs in a contiguous section of memory, so it makes much better use of the CPU caches.

You just somehow need to work out which objects are frequently accessed together, and try to keep them together. So, assuming you're storing nodes of a tree, try to keep adjacent nodes as close together as possible. A side note: trees are a killer for cache locality, but there's not much you can do about that.
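For anyone following along in Java (which later posts in this thread use), the closest equivalent of an array of structs is a structure-of-arrays layout with index-based children. This is only a hedged sketch; all the names are made up for illustration, not taken from anyone's actual code:

```java
// Structure-of-arrays tree: node i's fields live at index i of each
// primitive array, and children are referred to by index, not by object
// reference. Sibling data is therefore adjacent in memory.
final class FlatTree {
    final double[] regret;     // per-node value, contiguous in memory
    final int[] firstChild;    // index of a node's first child (children are consecutive)
    final int[] childCount;    // 0 for leaves

    FlatTree(int nodes) {
        regret = new double[nodes];
        firstChild = new int[nodes];
        childCount = new int[nodes];
    }

    // Recursive walk: sibling accesses to regret[] are sequential,
    // so one cache-line fetch can serve several nodes.
    double sumSubtree(int node) {
        double s = regret[node];
        for (int c = 0; c < childCount[node]; c++) {
            s += sumSubtree(firstChild[node] + c);
        }
        return s;
    }
}
```

The point is that `regret[i]` for sibling nodes lands in adjacent memory, where an array of node objects would scatter the same data across the heap.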


PostPosted: Fri Sep 06, 2013 10:05 am 
Offline
Site Admin
User avatar

Joined: Sun Feb 24, 2013 9:39 pm
Posts: 642
I asked
Quote:
Doesn't the marshalling itself require the array to be moved into the cache?

I meant to ask
Quote:
Doesn't the marshalling itself require the portion of the array containing the target value to be moved into the cache?


PostPosted: Fri Sep 06, 2013 10:09 am 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
OneDayItllWork wrote:
Requesting something from an array does not move the entire array into the cache; it still just grabs a section of memory. I'm not totally sure how that section is decided, but the hardware doesn't look at what it's getting hold of: it literally just grabs a block, so that memory may be totally unrelated.

Hmm... Would .NET be doing the tossing? I'm convinced something is, because, like I said, the smaller the data arrays, the faster things get. Maybe when the tree is traversed and the node class is referenced it's bringing the entire thing into cache?

OneDayItllWork wrote:
You should be using arrays of structs in .NET. An array of objects is just an array of object references, so the objects themselves won't be local to the array. An array of structs stores the structs in a contiguous section of memory, so it makes much better use of the CPU caches.

I'm not sure what you're suggesting here. Things look like this:

Code:
Class GameTreeNode

  ' Strategy/regret data for this node.
  Dim data(,,,,) As Double
  ' Child references live elsewhere on the heap, not inline with this object.
  Dim children() As GameTreeNode

  Sub Train(ByVal some_stuff As Object)
    ' Recurse into every child.
    For i As Integer = 0 To children.Length - 1
      children(i).Train(some_stuff)
    Next
  End Sub

End Class


PostPosted: Fri Sep 06, 2013 10:14 am 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
Maybe I should keep my data outside of the game tree nodes? If it's the node class's references that are causing the tossing, keeping the data marshaled and storing just a single reference pointer within the node class would reduce that.


PostPosted: Fri Sep 06, 2013 10:58 am 
Offline
Regular Member

Joined: Sun Mar 03, 2013 11:55 am
Posts: 64
Nasher wrote:
PolarBear wrote:
I concur with OneDay on this - it's likely that the memory bandwidth of your machine is just not enough given the number of threads running and that each thread constantly accesses different areas of memory.


Like I mentioned above, it gets faster when the data arrays have smaller bounds. Given what OneDay said about the caching, this makes me think that it's tossing around the entirety of those arrays when there is work being done on just a few elements therein. I'm hoping marshaling the specific elements I need to work with will bypass that tossing.


edit: I thought my previous post didn't post, so I've written some of the same things again.

An array is always stored in one contiguous section of memory, so when accessing an element of an array, the array itself is not read into the cache. .NET has a pointer to the start of the array, it has an index, and it knows the size of the elements, so it can jump straight to the section of memory it needs to read. It will not read the whole array, and as most CPU caches are < 10 MB, it couldn't read anything near the whole array into the cache if the array is large.

How the hardware decides what to pull into the CPU cache, other than what was requested, I'm not totally sure. In practice it fetches fixed-size cache lines (typically 64 bytes on current hardware), and the prefetcher may pull in adjacent lines on top of that. It doesn't look at what that block contains, or whether it's in any way related; it's just one block of memory. The logic being that if you're looking at something in that area of memory, you'll probably want something else in that area as well. This is especially true when working with objects, as all of an object's value-type fields are stored in a single block of memory. It also means multiple addresses fall under a single entry in the CPU cache, so the cache lookup table doesn't need to be as big.

This brings me on to my next point. An array of objects is an array of object references, which means the objects themselves have no locality to the array; you'll just be jumping all over the place. You need to use an array of structs, so that all the structs are stored adjacently in a single block of memory.

As said, assuming you're working with some kind of tree that you iterate through, you want adjacent nodes as close to each other as possible in the array. Due to the exponentially increasing size of a tree with each level, trees are horrible structures for cache locality, but there's not much you can do about that.

So to sum it up, don't bother with marshalling, just use an array of structs.
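The difference described above is easy to see in a micro-benchmark. This sketch (in Java, since later posters use it; the sizes are arbitrary choices of mine) sums the same array twice, once sequentially and once in large hops, so the strided pass touches a new cache line on almost every access:

```java
// Sequential vs. strided traversal of a large array. Both passes add
// exactly the same elements; only the access order differs.
final class StrideDemo {
    // Sums every element of a, visiting it in stride-sized hops.
    static double walk(double[] a, int stride) {
        double s = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < a.length; i += stride)
                s += a[i];
        return s;
    }

    public static void main(String[] args) {
        double[] a = new double[1 << 22];   // 32 MB of doubles, larger than most caches
        java.util.Arrays.fill(a, 1.0);
        long t0 = System.nanoTime();
        double seq = walk(a, 1);            // sequential: prefetcher-friendly
        long t1 = System.nanoTime();
        double str = walk(a, 4096);         // 32 KB hops: new cache line nearly every access
        long t2 = System.nanoTime();
        System.out.printf("sequential %d ms, strided %d ms, sums match: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, seq == str);
    }
}
```

Actual timings vary by machine, so the program prints them rather than asserting anything about them; the two sums are always equal.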


Last edited by OneDayItllWork on Fri Sep 06, 2013 11:14 am, edited 3 times in total.

PostPosted: Fri Sep 06, 2013 11:02 am 
Offline
Regular Member

Joined: Sun Mar 03, 2013 11:55 am
Posts: 64
Nasher wrote:
Maybe I should keep my data outside of the game tree nodes? If it's the node class's references that are causing the tossing, keeping the data marshaled and storing just a single reference pointer within the node class would reduce that.

Keep as little data as you possibly can in the struct: just enough to let you iterate the tree. Store all the nodes of the tree in an array to enforce some kind of locality, and in your array of structs store an object reference to the data you actually need when doing something with a node.


PostPosted: Fri Sep 06, 2013 11:14 am 
Offline
Junior Member

Joined: Thu May 02, 2013 2:25 pm
Posts: 30
Nasher wrote:
Hmm... Would .NET be doing the tossing? I'm convinced something is, because, like I said, the smaller the data arrays, the faster things get.


Without going into the specifics of your algorithm, the above sounds like normal behaviour: you just get more cache hits and fewer cache misses (which require a fetch from memory) when your data structure is smaller. Also, I wouldn't worry about the "tossing": as far as I understand it, the cost of bringing a whole block of memory to the CPU (and thus into the on-CPU cache) is the same as the cost of bringing a single value, so it's not like you're wasting time because of the CPU's caching efforts.


PostPosted: Sat Sep 07, 2013 1:16 am 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
Moving all the data into a single, external array actually ended up slowing things down. Each thread had the same efficiency, but they just took longer to traverse the tree. :)

I don't doubt what you're telling me about the cache hits, I'm just not sure what to do about it.


Last edited by cantina on Sat Sep 07, 2013 1:27 am, edited 1 time in total.

PostPosted: Sat Sep 07, 2013 1:24 am 
Offline
Junior Member

Joined: Thu Mar 07, 2013 1:14 am
Posts: 12
Calculate more stuff on the fly and rely less on large LUTs.


PostPosted: Sat Sep 07, 2013 6:24 am 
Offline
Regular Member

Joined: Sun Mar 03, 2013 11:55 am
Posts: 64
Nasher wrote:
Moving all the data into a single, external array actually ended up slowing things down. Each thread had the same efficiency, but they just took longer to traverse the tree. :)

I don't doubt what you're telling me about the cache hits, I'm just not sure what to do about it.

Run a profiler and see what's taking the time.


PostPosted: Sat Sep 07, 2013 8:16 am 
Offline
Site Admin
User avatar

Joined: Sun Feb 24, 2013 9:39 pm
Posts: 642
You have to arrange the data in memory so that data used close together in time is placed close together in memory. Even in a single-threaded application that usually isn't possible all the time, so you design it so that all the big jumps in memory location happen at the same time: a large jump in memory location costs the same as a medium-sized one, and small jumps cost almost nothing. For example, if you are navigating a tree and doing calculations only within a node and with adjacent nodes, you should keep all data for a particular node together, and data for adjacent nodes together. Multicore multithreading makes this even more difficult to get right, because different threads will be working on very different parts of memory.
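One concrete way to act on this advice, sketched in Java for the later posters: for a complete k-ary tree numbered in level order, the k children of node i sit contiguously and no child pointers are needed at all. Real game trees aren't complete, so this only applies per street or with padding; the class and method names are illustrative:

```java
// Implicit level-order layout of a complete k-ary tree in one array:
// node 0 is the root, and the children of node i occupy indices
// k*i+1 .. k*i+k. Siblings are therefore adjacent in memory.
final class ImplicitTree {
    static int firstChild(int i, int k) { return k * i + 1; }
    static int parent(int i, int k)     { return (i - 1) / k; }
}
```

With this layout, visiting all children of a node is a single sequential scan, which is exactly the "adjacent nodes close together" property described above.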


PostPosted: Fri Nov 08, 2013 9:56 pm 
Offline
Regular Member
User avatar

Joined: Sat May 25, 2013 7:36 am
Posts: 73
Sick shit. After reading this thread I wondered whether the cache or the computational power was the bottleneck in my setup. So I reduced the number of threads in my process from 32 to 16 to 8 and finally down to 1 (on 16 cores), and processor utilization dropped from 97% to 5%. However, the iterations per second went up from roughly 16,000 initially to almost 40,000.
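A minimal harness for reproducing this kind of measurement might look like the following (a Java sketch; the workload is just a placeholder for one real iteration, and all names are made up):

```java
import java.util.concurrent.atomic.AtomicLong;

// Runs the same dummy workload under a given thread count for a fixed
// time window and reports iterations per second, so different thread
// counts can be compared directly.
final class ScalingTest {
    static long iterationsPerSecond(int threads, long millis) {
        AtomicLong total = new AtomicLong();
        long deadline = System.currentTimeMillis() + millis;
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                long local = 0;
                double x = 1.0;
                while (System.currentTimeMillis() < deadline) {
                    x = Math.sqrt(x) + 1.0;   // stand-in for real per-iteration work
                    local++;
                }
                total.addAndGet(local);       // one contended write per thread, at the end
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.get() * 1000 / millis;
    }

    public static void main(String[] args) {
        for (int n : new int[]{1, 2, 4, 8, 16})
            System.out.println(n + " threads: " + iterationsPerSecond(n, 500) + " it/s");
    }
}
```

If throughput falls as threads rise, as in the post above, the bottleneck is contention or memory bandwidth rather than compute.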

impressed! very helpful. thanks a lot!


PostPosted: Sat Nov 09, 2013 7:52 am 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
That's a very odd response to thread reduction. I'd say there's a problem with how you're doing multithreading.


PostPosted: Sun Nov 10, 2013 12:07 am 
Offline
Regular Member
User avatar

Joined: Sat May 25, 2013 7:36 am
Posts: 73
Nasher wrote:
That's a very odd response to thread reduction. I'd say there's a problem with how you're doing multithreading.


True :( Every synchronization mechanism I apply just stabs me in the back. I'll try disabling them all; let's see how it turns out :)

[Edit] Wow, that's odd. Even worse... well, that's going to be a boring day :(


Attachments:
File comment: Contention analysis report w/o sync mechanisms
Picture2.jpg
Picture2.jpg [ 45.21 KiB | Viewed 15965 times ]
File comment: Contention analysis report w/ sync mechanisms
Picture1.jpg
Picture1.jpg [ 31.7 KiB | Viewed 15968 times ]
PostPosted: Mon Nov 11, 2013 4:34 pm 
Offline
Veteran Member

Joined: Thu Feb 28, 2013 2:39 am
Posts: 437
I give each tree traversal its own thread and let them all collide.
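A hedged sketch of that colliding-threads approach in Java (this is not the poster's actual code): plain unsynchronized writes into a shared regret array, with lost updates accepted as noise, plus a per-thread alternative for when they aren't acceptable:

```java
// Shared regrets with no locks: two threads hitting the same node at
// once can lose an update, which is tolerated here as sampling noise.
final class SharedRegrets {
    final double[] regret;
    SharedRegrets(int nodes) { regret = new double[nodes]; }

    // Unsynchronized read-modify-write: fast, but racy by design.
    void add(int node, double r) { regret[node] += r; }

    // Lock-free alternative: each thread accumulates into its own array
    // and the copies are merged with no contention afterwards.
    static double[] merge(double[][] perThread) {
        double[] out = new double[perThread[0].length];
        for (double[] t : perThread)
            for (int i = 0; i < out.length; i++) out[i] += t[i];
        return out;
    }
}
```

The per-thread variant trades memory (one array per thread) for deterministic results; which is worth it depends on how sensitive the algorithm is to lost updates.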


PostPosted: Thu Nov 14, 2013 4:26 pm 
Offline
Junior Member

Joined: Thu Nov 14, 2013 2:56 pm
Posts: 12
Nose wrote:
Sick shit. After reading this thread I wondered whether the cache or the computational power was the bottleneck in my setup. So I reduced the number of threads in my process from 32 to 16 to 8 and finally down to 1 (on 16 cores), and processor utilization dropped from 97% to 5%. However, the iterations per second went up from roughly 16,000 initially to almost 40,000.

Not sure what language you're using, but are any of your threads calling standard library functions while they're executing? E.g. allocating RAM, sqrt(), pow(), etc.

I had similar problems in the past, whereby 4 threads were doing less work than 1, and it turned out to be very poor thread control in the M$ standard libraries... Switching to the Intel compiler (which evidently has much better lock-free implementations of the standard library functions) solved the problem.

Juk :)
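One well-known concrete instance of this in Java (which several posters here use): `java.util.Random` is thread-safe via a CAS loop on shared state, so many threads hammering a single instance serialize on it, while `ThreadLocalRandom` keeps per-thread state and avoids the sharing entirely. A sketch with made-up names:

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Contended vs. contention-free random number generation.
final class RandomContention {
    static final Random SHARED = new Random(42);   // one instance, CAS-contended under threads

    static long sampleShared(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += SHARED.nextInt(52);   // e.g. pick a card
        return s;
    }

    static long sampleLocal(int n) {
        ThreadLocalRandom r = ThreadLocalRandom.current();     // per-thread state, no sharing
        long s = 0;
        for (int i = 0; i < n; i++) s += r.nextInt(52);
        return s;
    }
}
```

If every worker thread calls `sampleLocal` instead of `sampleShared`, the hidden serialization point disappears, which is the same class of fix Juk describes for the allocator and math functions.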


PostPosted: Fri Nov 15, 2013 4:02 am 
Offline
Regular Member
User avatar

Joined: Sat May 25, 2013 7:36 am
Posts: 73
Thanks for your help. I got it running, but I can't really put my finger on what the actual problem was. I rearranged my code until it ran smoothly, and I also allowed threads to collide; maybe that was the issue.


PostPosted: Mon Dec 02, 2013 3:08 pm 
Offline
Junior Member

Joined: Mon Dec 02, 2013 3:02 pm
Posts: 19
I'm having a very similar problem, but with Java. I've got 8 cores, but 4 threads is definitely quicker than anything over 4 threads, and it's been bugging me for a week. It could be that half my cores are 'virtual' cores of some kind; this kind of programming isn't exactly my strong point up till now. Each thread churns through X number of CS game trees (which means threads can collide).

Not that I'm going to win any speed prizes with my implementation, but it'd be nice if it was a /bit/ faster :D


PostPosted: Mon Dec 02, 2013 5:16 pm 
Offline
Junior Member

Joined: Sat Nov 02, 2013 2:21 pm
Posts: 26
fraction wrote:
Each thread churns through X number of CS game trees (which means threads can collide).
What do you mean here? Does each thread operate on a different part of your game tree, or do they all perform parallel full chance-sampled tree traversals? And if so, do you synchronize anything?

I'm also using Java, and I'm also trying to figure out the best multithreading options for my CS CFRM implementation. I would like to know whether it could screw up the algorithm if the regrets and average strategies are not updated atomically.
If the algorithm still converges when some values are occasionally not updated correctly, it would probably be best to let each thread do independent chance-sampled tree walks without synchronization. Then we might still need the "volatile" keyword, but I'm not sure about that. Can anyone explain to me what exactly the "volatile" keyword does? My understanding is that it prevents caching, so that every thread always reads and writes the actual values directly from and to RAM.
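To the `volatile` question: in Java, `volatile` does not literally bypass the CPU cache (hardware cache coherence already handles that). What it forbids is the compiler/JIT caching the value in a register or reordering accesses around it, and it establishes a happens-before edge between a write and subsequent reads, so other threads are guaranteed to see the new value. It does not make compound updates like `regret[i] += r` atomic. A small sketch (all names made up):

```java
// Visibility demo: a worker spins until a volatile flag flips. Without
// `volatile` on `stop`, the JIT is allowed to hoist the read out of the
// loop and the worker might never observe the change.
final class VolatileDemo {
    static volatile boolean stop = false;
    static long spins = 0;   // plain field: safe to read here only after join()

    static long runFor(long millis) {
        Thread worker = new Thread(() -> { while (!stop) spins++; });
        worker.start();
        try {
            Thread.sleep(millis);
            stop = true;       // the volatile write is promptly visible to the worker
            worker.join();     // join() gives a happens-before edge for reading spins
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return spins;
    }
}
```

Note that `spins++` would still be a racy read-modify-write if two threads did it, volatile or not; for atomic compound updates you'd need the `java.util.concurrent.atomic` classes or locks.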


Powered by phpBB® Forum Software © phpBB Group