
GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables #6663

Open · LouisJenkinsCS opened this issue Jul 12, 2017 · 75 comments

@LouisJenkinsCS
Member

LouisJenkinsCS commented Jul 12, 2017

Global Atomic Object

(PR available: #6717 )...

Details

Below, I go over some details of the current implementation.

Pointer Compression

This is the common case. It is not a novel concept, either in general or to Chapel,
but it suffices whenever the number of locales fits within 16 bits. It takes advantage
of the fact that current architectures and operating systems only use 48 bits of the
virtual address space, meaning we are free to embed the locale id in the upper 16 bits.
For up to 2^16 locales, each operation maps onto a single 64-bit atomic operation,
making these atomics extremely fast.
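A minimal sketch of this encoding in Chapel (hypothetical helper names; it assumes the 48-bit virtual addresses and 16-bit locale ids described above, not the actual module code):

// Hypothetical sketch of 16-bit pointer compression.
const addrMask : uint(64) = 0x0000FFFFFFFFFFFF; // low 48 bits hold the address

inline proc compress(localeId : uint(64), addr : uint(64)) : uint(64) {
  // The locale id occupies the otherwise-unused upper 16 bits.
  return (localeId << 48) | (addr & addrMask);
}

inline proc decompress(compressed : uint(64)) : (uint(64), uint(64)) {
  return (compressed >> 48, compressed & addrMask);
}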

Descriptor Tables

This could really be a new design pattern; the idea, while novel in its application,
is simple in theory. We maintain one table of objects per locale and use the pair of
locale id and table index as a descriptor (a pseudo-pointer). When we want to perform
an atomic operation with a class instance, we first hash the instance (its 64-bit address)
and perform an insertIfNotPresent operation on the target locale, then embed
the locale id in the upper 32 bits and the index into that locale's descriptor table in the
lower 32 bits. Compression is the most expensive part, as it involves doing the above
every time (although see the Improvements section for a huge improvement), but decompression
is cheap in that it directly indexes into the descriptor table.
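A rough sketch of the descriptor encoding (hypothetical helper names; the actual table insertion via insertIfNotPresent is elided):

// Hypothetical sketch: a descriptor packs (locale id, table index) into 64 bits.
inline proc encodeDescriptor(localeId : uint(64), tableIdx : uint(64)) : uint(64) {
  return (localeId << 32) | (tableIdx & (0xFFFFFFFF : uint(64)));
}

inline proc decodeDescriptor(descr : uint(64)) : (uint(64), uint(64)) {
  // Decompression is just a split followed by a direct table lookup on the owning locale.
  return (descr >> 32, descr & (0xFFFFFFFF : uint(64)));
}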

Performance

I compare the performance of GlobalAtomicObject (using descriptor tables) against 'Sync Var',
the almighty sync variable, which currently gets a handicap by just acquiring the lock,
performing some arbitrary cheap operation, then immediately releasing it.
This is the 'baseline' which we must beat or else the implementation is useless.
The next is 'Atomic' on a 64-bit value, which is the 'ceiling' and shows how
much performance we can get from normal pointer compression versus keeping tables like this.
There are two benchmarks: 'Single', which is a single atomic operation, and 'LLSC', which is
a pseudo Load-Linked Store-Conditional operation. LLSC looks like this:

var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));

There is no retrying, just a one-and-done atomic operation (which I count as one).
'Global Atomic' uses atomic operations similar to those of 'Atomic', although
a few things need to be considered. For 'Single', it performs a write on data in
local memory (a read is not interesting, since it is just a lookup in a locale's table),
which shows the ideal case. 'LLSC', on the other hand, can operate on remote data, meaning
that during compression of the data we may need to jump to other locales each time.
It shows a more average case, but not the worst case, in which both operands passed to
compareExchangeStrong are remote. Currently, 'LLSC' for GlobalAtomicObject performs
three atomic operations: the read, which is a simple lookup, plus compression of both
variables passed to compareExchangeStrong, of which one is local and the other remote.

It should be noted again that with pointer compression and a smaller number of nodes,
performance is excellent. However, when one has more locales than pointer compression allows,
performance is still manageable and better than a sync variable. It should also be emphasized
that the runtime of GlobalAtomicObject with pointer compression is equivalent to that of Atomic.

I present two graphs: one with respect to runtime, and another in terms of raw operations
per second.

Time

[image: benchmark runtime comparison]

This shows the runtime of the benchmark. As can be seen, raw 64-bit atomic operations vastly
beat everything else, although the atomic LL/SC lags enough to demonstrate
the power of global atomics: a single global atomic, while orders of magnitude slower than a
single 64-bit atomic, is only 2x slower than the 64-bit LL/SC.

Imagine if we had instead attempted to implement some MCAS operation to update both locations using
descriptors. At best, without obstruction, we would succeed within 4 LL/SC operations
(2 to install descriptors, 2 to update), which a single Global Atomic operation would already beat.
Of course, in reality, the operation would require multiple retries, in the hundreds to
thousands under real workloads that place contention on it. This suggests that, in terms of
global atomics, this is currently the best option, as a global atomic is guaranteed to succeed in one operation.

Operations Per Second

[image: operations-per-second comparison]

This shows raw operations per second. Notice that I removed 64-bit atomics entirely, as they
dwarfed everything else to the point that the other differences wouldn't be noticeable.
As can be seen, however, both a single global atomic operation, and even the LL/SC combo
operation with a potential remote reference (meaning an added on statement), beat a normal
sync variable. In every case, GlobalAtomicObject wins outright, but the performance penalties from
remote references are very apparent and should be avoided at all costs.

Usage

Sample usage of GlobalAtomicObject can be seen below (and yes, it does work).

class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// This is safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, this could cause undefined behavior
// if another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...

GlobalAtomicObject(C) can easily become atomic C. If the language were to be
modified, it wouldn't be too surprising to see the above become the below (which
of course lacks the excessive comments), which is the ideal that @mppf
has insisted on:

class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}

Here, delete could call some hook which also removes the object from the table if it has not yet been
reclaimed. I believe this can become a reality with little effort.
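Conceptually, the hook could amount to something like the following sketch, expressed with the current API (a hypothetical wrapper; the real version would be invoked by the compiler):

// Hypothetical deletion hook: drop the descriptor-table entry (if any) before
// reclaiming the object, so a recycled address cannot alias a stale entry.
proc deleteWithHook(obj : C, x : GlobalAtomicObject(C)) {
  x._delete(obj);  // remove the object's node from the descriptor table
  delete obj;      // then reclaim the object itself
}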

Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

Pointer Compression + Descriptor Tables

It may be possible to combine this with normal pointer compression for the locales whose ids fit
within 16 bits. Combining both means better performance on average, even
when there are more than 2^16 locales: any locale with an id
less than 2^16 gets the full power of 64-bit atomics, while the ones greater than or equal
to 2^16 use the much-slower-but-still-faster-than-sync-variables descriptors.

See the next idea (and the sketch that follows it) for how to differentiate between the encodings.

Zone Compression

A theoretical model where we use a more dynamic type of compression: the lowest 32 bits hold
the low bits of the virtual address, and the upper 32 bits are divided
into zone bits and locale bits. A 'zone' is a 64-bit base offset identified by a
zone id, which is allocated when a new unique upper-32-bit prefix is found. This can be
optimized by caching the zone mappings of other locales (i.e., the first time a unique upper 32
bits is found, check whether we have it cached; if not, jump to the owning locale,
add it there if it is not already present, and update our cache).

The benefit here is that we get to keep using 64-bit pointer compression, and the number
of unique zones that can be tracked depends on how many bits are needed for
the locale id.

"What happens when we run out of zones?" - We use the descriptor
tables again. We can just search the unique zones for the target locale first, if the
zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor
Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential
maximum segments for descriptor tables, but it allows a much more dynamic compression
strategy.
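A minimal sketch of the tag-bit differentiation (a hypothetical layout; the zone-table management itself is elided, and it assumes the payload can spare its low 2 bits):

// Hypothetical 2-bit tag stored in the low bits of every 64-bit encoded value.
param TAG_PTR_COMPRESSION  : uint(64) = 0;
param TAG_ZONE_COMPRESSION : uint(64) = 1;
param TAG_DESCRIPTOR_TABLE : uint(64) = 2;

inline proc tagOf(encoded : uint(64)) : uint(64) {
  return encoded & 0x3;                  // low 2 bits select the decoding strategy
}

inline proc payloadOf(encoded : uint(64)) : uint(64) {
  return encoded & ~(0x3 : uint(64));    // remaining bits are strategy-specific
}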

Caching Descriptors into Objects

This is a change that isn't theoretical, but it requires changes to the language itself.
The majority of the overhead of descriptor tables is hashing the object, which requires jumping
to the owning locale, but if the descriptor were cached in the object then repeated accesses would
be just about as fast as atomics (a sketch follows below). With caching, the benefit of recycling memory
to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).
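As a sketch of what that fast path might look like (hypothetical field and helper names; registerInDescriptorTable stands in for the hash-and-insert on the owning locale):

// Hypothetical: the object carries its cached descriptor, 0 meaning "not yet registered".
class Cached {
  var cachedDescr : atomic uint(64);
}

proc descriptorFor(obj : Cached) : uint(64) {
  var descr = obj.cachedDescr.read();
  if descr == 0 {
    // Slow path, taken only until the descriptor is cached:
    // (hypothetical) hash/insert on the owning locale.
    descr = registerInDescriptorTable(obj);
    obj.cachedDescr.write(descr);
  }
  return descr;
}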

Concurrent-Safe Implementation

The table utilizes a simplistic skip list implementation on top of a segmented memory pool.
The memory pool supports indexing via an easy translation from a 32-bit
index into a segment and an index into that segment's data (see the sketch below), and it uses
a simple bitmap (BigInteger) to track free slots. The memory pool allows wait-free reads (hence
indexing is fine) but requires synchronization to recycle/reuse memory. The skip list used to manage
memory also requires synchronization, and a lock-free/wait-free implementation would require
proper memory management such as hazard pointers, which in turn requires thread-local or task-local
storage. Hence, synchronization is needed, and we use a simple local sync variable.
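For illustration, the index-to-slot translation could look like this (a sketch assuming a fixed power-of-two segment size, not the module's exact layout):

// Hypothetical: translate a 32-bit descriptor index into (segment, offset).
param segmentSize = 1024;   // slots per segment (assumed to be a power of two)

inline proc slotFor(idx : uint(32)) : (int, int) {
  const segment = (idx / segmentSize) : int;
  const offset  = (idx % segmentSize) : int;
  return (segment, offset);
}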

If someone managed to make everything concurrent, it could improve performance further
by 5 - 10x at least.

Edit:

Removed Introduction to avoid further derailment of discussion.

@LouisJenkinsCS LouisJenkinsCS changed the title Global Atomic Object Global Atomic Object - First Solution to allow Atomic Operations on 128-bit wide pointers Jul 12, 2017
@LouisJenkinsCS LouisJenkinsCS changed the title Global Atomic Object - First Solution to allow Atomic Operations on 128-bit wide pointers Global Atomic Object - First Scalable Solution to allow Atomic Operations on 128-bit wide pointers Jul 12, 2017
@bradcray
Member

In any budding language, especially one that boasts to be a productive
concurrent/parallel language, atomics on all types are a necessity.

This statement strikes me as being a naive oversimplification. Is an atomic distributed array a necessity for any language that supports distributed arrays? I don't think so, and am not even sure what that would mean.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 13, 2017

Is an atomic distributed array a necessity for any language that supports distributed arrays?

The emphasis would be on productivity. "What can I do if X were atomic compared to if it were not". If X is a distributed array, what do we gain? Each element in a distributed array can be updated atomically as is, so perhaps not. The distributed array could be made atomic by allowing concurrent access while resizing, which is a huge plus and would further enable productivity, sure. Would I say that this is a necessity to allowing the user to be more productive? Arguably. However, that's not the case here.

Since X is a class instance, the amount of productivity gained by making it atomic is limited to the stretch of the imagination. You can:

  1. Implement more complex algorithms and data structures, such as scalable wait-free algorithms
  2. Convert any local algorithm into a distributed algorithm
  3. Use it as a universal construct that allows us to make any algorithm distributed (RCU as in the case of atomic distributed arrays, STM, garbage collection, what-have-you)

Is the above a necessity? Absolutely. However, unlike atomic distributed arrays, I have presented an actual solution which works and scales.

Edit:

For more clarity: atomics for class instances are a necessity, but let me expand on the universal construct bit for a moment. If you have any type of data, you can make it a field in a class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, let's call it data, can be a field in some wrapper class instance. If you wish to atomically update it, imagine the following:

var globalData : atomic Obj;

// Read...
var current = globalData.read();

// Copy...
var modified = clone(current);
modified.field = data;

// Update...
globalData.compareAndSwap(current, modified);

You have potential RCU. Of course the next actual difficulty would be memory reclamation, but the point is that it would have been the same whether the algorithm was distributed or not.

Furthermore, if it sounded like I was bashing Chapel, I'm not; I wish to see it improve, and as such I am offering suggestions for its problems.

@bradcray
Member

I think you're misinterpreting my comment. I don't have any objections to supporting atomic classes in Chapel, and have long believed that we should do so. My objection was with the opening blanket statement which I read as "any new language that has type X must necessarily have atomic type X." And I simply don't believe that to be true (and used "distributed array" as my example X). As a result of my reaction to the statement (+ time constraints), I rolled my eyes and stopped reading there rather than being intrigued and continuing on. Then eventually I came back to comment on my reaction.

To me, a more believable, less absolutist, and less objectionable statement would be something like "Any budding language, especially one that strives to be a productive concurrent/parallel language, should consider for each supported type T whether there would also be value in supporting an atomic variant of T." This statement seems far harder to argue with to me.

@LouisJenkinsCS
Member Author

Understood; it makes sense that it was a broad generalization, but I didn't expect it to have the kind of effect that would cause people to stop reading at the first sentence. I revised it so that it doesn't generalize to any type, but to a specific type of data.

@bradcray
Member

Thanks!

To make sure I'm understanding the following:

let me expand on the universal construct bit for a moment. If you have any type of data, you can make it a field in a class instance. That's nothing special, but what it allows is: any type of data, whether an int, a string, a record or some distributed array, let's call it data, can be a field in some wrapper class instance.

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

@LouisJenkinsCS
Member Author

Are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well because given any type T we can create a class C_T with a field of that type and now by declaring an atomic C_T, we effectively have an atomic T"?

Technically, it can, yes. The semantics would change (it requires RCU or similar updates) and performance may drop significantly, but it is possible to emulate atomic T with atomic C_T with effort (even if it is a lot). The point I was trying to make was that it is necessary in that it makes such things possible (even if they aren't optimal), but that was before you informed me that the issue was with my generalization (which I may have just repeated).

@LouisJenkinsCS LouisJenkinsCS changed the title Global Atomic Object - First Scalable Solution to allow Atomic Operations on 128-bit wide pointers GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables Jul 17, 2017
@mppf
Member

mppf commented Jul 17, 2017

@gbtitus and @ronawho are planning to look at this proposal. (and me too, but I've already looked at it enough to be comfortable).

@LouisJenkinsCS
Member Author

@mppf

I had another idea for how this could be implemented in the language itself. What if you had one global table (not coupled to a GlobalAtomicObject, but initialized first thing at runtime) that was managed across all nodes? Management of the objects (adding, propagating changes across nodes, removing them when the object is disposed of with delete, etc.) is up to you (and I'm sure you can think up much better solutions), but this allows us to use the whole 64 bits. Embedding the 64-bit descriptor number (index into the table) in the object is still a valid optimization here, and we allow 2^64 potential objects. The compiler can properly instrument read, write, etc.

Boom, you've got your global descriptor table. The low level details would be beyond me though.

@mppf
Member

mppf commented Jul 18, 2017

@LouisJenkinsCS - I'm not sure what to interpret from your other idea other than that we could come up with a completely different solution from what you created, but I already knew that. But I'm probably missing something. What would be the advantage of having one global descriptor table?

@gbtitus
Member

gbtitus commented Jul 18, 2017

@LouisJenkinsCS: For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context. Basically the need is for AMOs on global pointers which conceptually are larger than 64 bits (and we don't have 128-bit network AMOs), and the specific operations we need are just Load, Store, Swap, and Compare-and-Swap? Is that correct?

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 18, 2017

@mppf

Allow me to contrast having one single global descriptor table to having one coupled to GlobalAtomicObject (or each atomic Obj).

  1. It makes caching the descriptor possible in O(1) space, as it can be added directly to the object's header the first time it is used in any atomic operation (via checking if its cached descriptor is set). If you have a unique table coupled to each atomic Obj, then the cached descriptor is specific to each instance of atomic Obj.
  2. There can be one table to rule them all, in that the table could hold 2^64 arbitrary wide_ptr_t data and store any kind of object. Currently, you'd need one descriptor table per atomic object, and even if you had one table per runtime type, a lot of space would be wasted. This can be done with just one extra level of indirection.
  3. atomic Obj can become just an arbitrary descriptor index.

Edit:

To back up the claim that this is more than enough: as you stated, an object must take up at least 32 bytes, so we're talking about 590.3 exabytes worth of object data (2^64 objects * 32 bytes = 2^69 bytes) before space becomes an issue.

@LouisJenkinsCS
Member Author

@gbtitus Yes, originally I had stated that in the removed introduction, and yes we only need those operations.

@mppf
Member

mppf commented Jul 18, 2017

@LouisJenkinsCS - I think using a single global table is interesting as an improvement. Instead of asking us to solve all the problems - can you describe what operations should happen when to support such a thing. What would you need the compiler to do? What does the runtime do? I would expect as we integrate more with the compiler/runtime you'll need help to do so.

However, I think we should view it as an improvement/optimization and try to get a 1st version merged. I.e., I think we should discuss the global version later and in a different issue. For the purposes of this issue, I think it would be nice to simply say why it is harder (what are the challenges?).

@gbtitus
Member

gbtitus commented Jul 18, 2017

I feel like I must be missing something obvious. One solution would seem to be to implement those AMOs by simply doing an on-stmt to the node where the AMO needs to be done and then doing it as a 128-bit processor AMO (either directly if there is CPU support, or indirectly using the x86 128-bit compare/exchange instruction CMPXCHG16B). In the current implementation this would take 2 network round trips per AMO, all things considered. What are the advantages of the proposed solutions over doing that?

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 18, 2017

@gbtitus

The issue would be atomically updating them. What happens when another atomic operation begins while one is ongoing? (I misread what you meant by doing an on-statement; yes, that would work if there is hardware support.) In terms of the current implementation, what it does is give us an immutable pseudo-reference to a wide pointer that we know won't change and that can be represented by a normal 64-bit integer. I'm not sure how CMPXCHG16B is implemented; can you fill me in on the specifics? Does it do a double LL/SC that causes many operations to retry if they fail, or does it atomically operate on whole 128-bit data words? Regardless, not all hardware has support for such operations, and this would be a software implementation to fall back on.

@LouisJenkinsCS
Member Author

@mppf

With the current version, we can do the following...

  1. Implement atomic C as new GlobalAtomicObject(C).
  2. Add primitives such as __primitive("get cached descriptor", this, obj) and __primitive("set cached descriptor", this, obj), which can potentially map the current GlobalAtomicObject to the object's descriptor.

Currently, that's all that would be needed to get a nice efficient implementation going.

@gbtitus
Member

gbtitus commented Jul 18, 2017

CMPXCHG16B is a 16-byte atomic compare-exchange. The LOCK prefix is needed in multicore situations. It's true that this operation is not necessarily present on all architectures -- it's not implemented on early AMD x86_64 and there may well be other processor architectures that don't have it at all. On those one would need to do something else.

However, I'd argue that the "something else" should not do any network ops. In other words, no matter what actual mechanism we use for the pointer updates, it should only involve local operations, because network communication is very costly. I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

That still leaves a lot of maneuvering room for implementation, though. For one thing, we might be able to go a bit further with straight pointer compression. If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.
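For illustration, a sketch of this variant (assuming 16-byte-aligned allocations, so the 4 low address bits are implicitly zero and can be given to the node id):

// Hypothetical: with 16-byte alignment, shift out the 4 always-zero low bits
// and reuse them as extra node bits (20 node bits instead of 16).
param alignBits = 4;
param addrBits  = 48 - alignBits;
param alignedAddrMask : uint(64) = (1 << addrBits) - 1;

inline proc compressAligned(nodeId : uint(64), addr : uint(64)) : uint(64) {
  return (nodeId << addrBits) | ((addr >> alignBits) & alignedAddrMask);
}

inline proc decompressAligned(c : uint(64)) : (uint(64), uint(64)) {
  return (c >> addrBits, (c & alignedAddrMask) << alignBits);
}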

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars indexed by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. (I did say it was a blunt tool!) Speaking of which, in your comparison with sync var protection at the top, where (which node) were the sync vars involved, and was there one per global pointer, or just one for everything, or something else? Would it be possible for you to post your scaling comparison code(s)?

@bradcray
Member

For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context.

I also got confused by trying to read the evolving document -- started last week, couldn't find my place this week.

@LouisJenkinsCS
Member Author

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation change its value mid-read?

However, I'd argue that the "something else" should not do any network ops.

Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.

I think this rules out a Descriptor Table-based solution, because in the general case that has to do remote references.

Not necessarily. Again, by caching the descriptor, the only remote operation occurs the first time the object is used; that remote operation merely occurs once. The other operation is using network atomics to set the 64-bit atomic field, which is also unavoidable. A network atomic operation would beat using an on statement any day (or at least it does in Chapel; I don't know how well optimized you can make it in the runtime).

If the pointers in question are aligned to more than a single-byte boundary then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.

Would it be possible for you to post your scaling comparison code(s)?

Benchmark.

@gbtitus
Member

gbtitus commented Jul 18, 2017

I'm going to respond to the parts of this separately, since I suspect some will be dealt with and disappear rapidly while others lead on to further conversation ...

CMPXCHG16B is a 16-byte atomic compare-exchange.

Is there a similar atomic operation to perform reads, writes, and exchanges for 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic CMPXCHG16B operation change its value mid-read?

No, it's an octoword compare-exchange, that is, it operates on a single aligned sequence of 16 bytes, not on two separate 8-byte quadwords.

@mppf
Member

mppf commented Jul 18, 2017

@gbtitus - I think it'd be reasonable to do an on with a CMPXCHG16B if CPU support is available. I view the approach @LouisJenkinsCS is trying here as the fall-back, in case 128-bit processor atomics are not available. Additionally, I can't predict whether a 64-bit compare-and-swap plus the extra work of creating/destroying an instance for the atomic object would be faster for a given application.

A separate concern is that some algorithms might want to use a generation number in the other 8 bytes, so CMPXCHG16B could protect against the ABA problem.

@LouisJenkinsCS - I'm pretty sure CMPXCHG16B only updates contiguous 16 bytes.

@gbtitus
Member

gbtitus commented Jul 19, 2017

Here's one more little bit of possibly useful information. I've been estimating that remotely initiated processor atomics probably can't perform much better than 1/4 the speed of network atomics. It turns out this is really easy to test. On a Cray XC with comm=ugni, we can toggle the ra-atomics test between doing network AMOs and doing remotely initiated processor AMOs just by whether or not we have a hugepage module loaded. In order to use network AMOs we have to register memory with the NIC and in order to do that we have to put the heap on hugepages. No hugepages, no network AMOs. And when we do use remotely initiated processor AMOs it's through a fairly lightweight runtime-internal mechanism, not a Chapel on-stmt. So this is a pretty good analogy to the situations we've been discussing. (In this case the AMO is a XOR rather than a CMPXCHG, but I doubt that matters.) It turns out that network AMOs are 30x faster than remotely initiated processor AMOs for this test, not just 4x or 5x faster. Now granted my 4x was a floor and I haven't dug into why the difference is so much greater than I expected (I have more ideas than time) but I thought I should mention this result.

@LouisJenkinsCS
Member Author

Interesting... while it still doesn't leave me with much room for doubt that a remote AMO is better for a contention-heavy CAS-retry loop, it does give me hope for descriptor tables. I wonder, though: does that 30x increase or decrease as you add more nodes?

@gbtitus
Member

gbtitus commented Jul 20, 2017

Yes, definitely any CAS-retry spin loop should be done on the node where the target datum resides rather than over the network. What's tricky is that if you have an appropriate network AMO and don't need to spin then you should definitely use the network AMO, because remote execution is so much slower. Thus knowing whether contention will occur is really important.

The 30x increases as I add nodes: roughly it's 19x on 2 nodes, 30x on 4 nodes, 34x on 8, and 37x on 16 in a set of runs I just did. I think this just reflects that as the node count rises, a greater proportion of network AMOs have to traverse the network. ra-atomics has a block-distributed target array that spans all the nodes, and parallel tasks on all cores and nodes do atomic XOR updates to randomly selected elements of that array as fast as they can. On XC all these updates are done via network AMOs through the NIC. If the update happens to hit an array element local to the node then there is no network traversal, but we nevertheless do have to use a NIC AMO instead of a processor one. This is because the Aries AMO cache is not coherent with the processor cache(s) and thus, for a given memory location that is an AMO target, all AMOs to that location must be either done with the processor (even if the location is remote) or done with the NIC (even if the location is local).

@LouisJenkinsCS
Member Author

@mppf @gbtitus

On such short notice, I'm afraid I can't do too much right now, and I'll have to devote the rest of my time to catching up on my project (I'll be falling behind at this rate, to be honest). The chart below shows three things: 'GDTWrites', which is a simple write using the descriptor tables (no cache, meaning the worst case when using local data); 'GDTReads', which is a simple read using the descriptor tables (no cache); and '128bit-Read', which is an emulated read using the only 128-bit hardware instruction available, cmpxchg16b. It is emulated in as efficient a manner as possible:

#include <stdint.h>

/* 128-bit value as two 64-bit halves (x86-64 is little-endian: lo, then hi). */
typedef struct { uint64_t lo, hi; } uint128_t;

static inline void read128bit(void *srcvp, void *dstvp) {
  uint128_t *src = srcvp;
  uint128_t with_val = *src;
  uint128_t cmp_val;
  uint128_t *cmp = &cmp_val;
  uint128_t *with = &with_val;
  char result;

  /* lock cmpxchg16b leaves the current 16-byte value in RDX:RAX (i.e. in cmp)
     whether or not the compare succeeds, which makes it usable as an atomic read. */
  __asm__ __volatile__ ("lock; cmpxchg16b (%6);"
                        "setz %7; "
                        : "=a" (cmp->lo),
                        "=d" (cmp->hi)
                        : "0" (cmp->lo),
                        "1" (cmp->hi),
                        "b" (with->lo),
                        "c" (with->hi),
                        "r" (src),
                        "m" (result)
                        : "cc", "memory");

  *(uint128_t *)dstvp = cmp_val;
}

It is as optimized as you can get: regardless of whether the CAS fails, it returns what was read at that location in cmp, which is why I return that. It is still a wait-free operation in that it performs only one CAS, and it is the only way to safely read all 128 bits atomically.

The results of the benchmark are surprising. The descriptor table won outright, by far, and as I mentioned before, it scaled. Remember that this benchmark is in terms of raw time. Also, the benchmark consisted of many rapid operations in a 'for-each-task-on-each-locale' kind of way.

[image: GDTWrites, GDTReads, and 128bit-Read benchmark results]

However, @mppf, I promised to deliver more benchmarks, and I'll try to do so at the earliest time possible, potentially over the weekend; if not, then hopefully sometime next week. Again, I need to catch up on my project first and foremost.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 20, 2017

My thoughts on why the Descriptor Table won (so far) are the following...

  1. Assuming an equal balance of work from each node, since compression/decompression takes place on the node that owns the object, the computational workload is also balanced. This is why the hardware approach drops in performance: all nodes must jump to a single node. Also, in this benchmark the work that is added is created locally, so there's no need to jump to another node. Unfortunately my explanation isn't anywhere near as in-depth as @gbtitus's, but hopefully this can suffice.
  2. Reads from the table can occur with a simple lookup; there is no need to grab locks to add to the table, just a pure lookup. This can be made true for writes as well if it's expanded on.

I'll add more if I think of anything else to add here.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 25, 2017

Okay, I am going to be doing this incrementally, as that is the easiest way. I also restricted it to 32 nodes instead of 64 to save time. What I'm going to do is show results for all operations: read, write, and compareExchange (with and without local on statements). Again, this is important in that both GlobalAtomicObject and CMPXCHG16B handle a write atomic operation differently from a read atomic operation.

[image: atomic read comparison of GlobalAtomicObject, PtrCompression, and CMPXCHG16B]

Revisited read again, this time also comparing to pointer compression. GlobalAtomicObject is right in the middle and scales well enough.

GlobalAtomicObject shows how much time it takes to read a descriptor from a locale's table (likely remote). PtrCompression shows how long it takes for a normal network AMO + Pointer Decompression. CMPXCHG16B shows how long it takes to jump to the owning locale and perform an atomic read operation.

@LouisJenkinsCS
Member Author

This one shows atomic 'Write' operations, which vary a lot more.

[image: atomic write comparison of GlobalAtomicObject, PtrCompression, and CMPXCHG16B]

As can be seen, the overhead of writing for descriptor tables is massive when the descriptor is not cached, but it's more of a constant. It also shows that CMPXCHG16B is much slower than anything else. As before, GlobalAtomicObject sits between PtrCompression and CMPXCHG16B.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jul 25, 2017

I would like to note one crucial flaw in the descriptor table, one that can't be solved by caching alone as it's unavoidable. The flaw is that once the table begins filling up, adding elements to it becomes significantly slower. I'm assuming the issue is with my skip list implementation (in my defense, it was my first time ever implementing one, and it was based on the 'Skip List Cookbook'). In reality, once the descriptor is cached on first use there's no need to have the skip list, but in the current implementation performance is so abysmal it couldn't even finish at 1 node within 5 minutes (it starts very fast, then slows to a crawl).

Huge oversight, but still plenty of potential here. I'll try to compose a benchmark reusing memory for the time being.

Edit: I'm wondering if a potential bottleneck could be BigInteger.scan0() since I use that heavily to find the next free bit in the memory pool...

@LouisJenkinsCS
Member Author

Okay, after a very long hiatus, I believe I can finally make some more progress on this again.

One change I'm going to make is stripping out the skip list, as searching it tends to be the limiting factor and becomes irrelevant later once caching is implemented. So instead, I'm going to regress to the original design of having to 'register' each object in the table, which then returns a tuple of both the object and the index as the descriptor. So...

extern type wide_ptr_t;
type descriptor_t = (wide_ptr_t, int(64));

proc register(obj:objType) : descriptor_t { }

In the original document for the idea, here, the performance was phenomenal, but that was because it did not have to worry about hashing objects to find their descriptors. This can still become a reality if caching performs well enough. I'd like to explore this when I have the time.

Note: The final version likely will not require the user to 'register' anything and will just require proper tracking of the descriptor by the runtime. As I'm short on time, this Chapel-side version is the best I can do at the moment.

@LouisJenkinsCS
Member Author

As well, I plan to privatize the tables across all nodes in the cluster and use my RCU (trademark aside) abstraction to allow concurrent updates. I'll find a decent way to propagate such updates, but overall it will favor reusing memory over rapid creation and discarding of new memory to be registered.

@LouisJenkinsCS
Member Author

Okay, I think I have a way in which I can handle managing the descriptor table. There will be only one descriptor table, wherein each cell is capable of holding any wide_ptr_t, or rather any 128-bit value. In fact, this can be generalized to allow any size, even variable-sized, data depending on the way I implement it.

First and foremost, I'll need to clarify that the RCU and the DistVector implementation (which I had shown to @mppf in an email with @e-kayrakli) likely must come first to allow for concurrent reads while resizing the tables. The DistVector would serve as, essentially, a parallel-safe resizing distributed array, which Chapel currently lacks. The theory behind it is that we create a static region of memory and then reuse that memory in a new instance (with the added bonus of not needing to copy data over to the new data structure, just the pointer to the data the original instance used), and then use RCU to install that new instance. In other words, we add more indirection, which proves the adage correct that "All problems in computer science can be solved by another level of indirection".

Once we have DistVector, I can use it as the backbone for everything. I'm going to use a global counter to keep track of the descriptors, although unfortunately this counter needs to be placed on a single node. Luckily this scales well enough, as my data structures have shown. When registering an object, we can simply increment the counter (if we cannot recycle a descriptor) and create a descriptor based on that. If we use this approach, with privatization and resizing handled by DistVector, that handles the majority of the problem.
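The counter portion of this could look like the following sketch (the free list for recycling, discussed below, is elided; names are hypothetical):

// Hypothetical: a single global counter handing out fresh descriptor indices;
// in the full design this is only consulted when no recycled descriptor is available.
var nextDescriptor : atomic int;

proc freshDescriptor() : int {
  // fetchAdd returns the previous value, so indices start at 0 and never repeat.
  return nextDescriptor.fetchAdd(1);
}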

Problem: How do we recycle descriptors? After the data is no longer used, what happens to the slot that is no longer in use?

Potential Solution: Maintaining a global resizable 'free-list' using DistVector could work, or we could make use of DistDeque as a FIFO/LIFO way of reusing descriptors. As in, descriptors that are no longer in use go onto it, and when registering we would first 'pop' or 'dequeue' one and utilize it directly in our table.

@gbtitus Any thoughts on this?

@LouisJenkinsCS
Member Author

Update:

I have implemented the DistVector (seen here), although it is suffering from poor performance due to lack of time. The DistVector properly incorporates the proposed RCU model into its base and as such allows concurrent reads during writes, although again the data structure is much slower (in the test at least; in the test we intermingle reads and writes, but writes are 1000000x slower than reads, so we can't reliably benchmark reads here...). I need to perform a lot of profiling, as I'm sure I'm suffering communication from something I thought I'd privatized, and I'm using resizable domains instead of a much more cache-friendly statically allocated buffer (IIRC resizable domains will allocate new space, copy everything over, and delete the old; this is a huge bottleneck and contrary to my design of 'reusing the old memory and linking it with the new'). I predict that this model will be a success, and can even scale, as mentioned before, given some extra time. Unfortunately, I no longer have that time, as I must catch back up on my studies.

I'll try to get back to this after I graduate (December 16th), so await some promising results!

@LouisJenkinsCS
Member Author

Fleshed out the idea more here: LouisJenkinsCS/Chapel-Atomic-Objects#1 (comment)

@LouisJenkinsCS
Member Author

Implemented the array w/ RCU and tested it on a single node (doing more testing in a distributed setting later); it seems promising. The array only allows dynamic allocation and indexing, but for the purposes of the descriptor table that is all I need.

[image: single-node RCU array benchmark]

@mppf
Member

mppf commented Jan 10, 2018

I can't tell what the units on the chart are. Is higher better? Can you explain why RCU array is between Sync Array and Array?

@LouisJenkinsCS
Member Author

Sure, but first let me add the results for 16 nodes...

[image: 16-node benchmark results]

I'll update this in a bit with a description (I actually started a blog recently, so I'll write it up there). Also, higher is better; it's in terms of operations per second.

@LouisJenkinsCS
Member Author

I have the early parts describing the RCU algorithm, but it isn't complete yet, keep that in mind. I'll continue updating it and leave another notification when I'm done.

https://louisjenkinscs.github.io/Distributed-Vector/

@mppf
Member

mppf commented Jan 10, 2018

@LouisJenkinsCS - I haven't finished reading that yet but it reminded me of a problem we're facing. We have associative domains and arrays, but if you add to the domain, it invalidates array references. That has the side effect that parallel operations on the array can't work. This comes up with, say, making a histogram in parallel when you don't know the keys beforehand. I.e., I'd like to be able to write:

  var D:domain(int);
  var Count:[D] int;
  forall key in input {
    D += key;
    Count[key] += 1;
  }

but in fact this program suffers from a race condition, because appending to the domain could cause the array to resize and then invalidate the reference to Count[key] used by another task in the forall.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jan 10, 2018

Interesting example, and looking at it, RCU would be able to solve the problem (although I'd hate to make it out to be a kind of panacea). If indexing counts as a read and appending as a write, it would work, albeit extremely slowly. Each iteration would be a full-blown write, so in the end you're better off doing this serially, or better yet doing some kind of map-reduce operation where each task builds its own map and you combine them all at the end. RCU is for situations where we have 99.9% reads.

Edit:

Actually, originally I planned to allow insertions into the data structure in a way that allowed writes to be relatively cheap (only synchronization on a single sync-var needed) so long as we allocated enough memory ahead of time. I removed it because it wouldn't suit my purposes and it slowed down the reads by 3x. Still, it is possible to achieve reasonable performance that would best coarse-grained synchronization, but it would still reduce overall performance.

@LouisJenkinsCS
Member Author

Speaking of mappings... I actually had an idea for a distributed hash table based on my own published work I did for the Go programming language. It can actually be perfectly remodeled in a way that fits the distributed aspect (a bit too perfectly, truth be told), and would allow not only concurrent updates, but would scale so well I'm sure it'd put the default sparse distributions to shame (specific data structures definitely would outperform a general data structure at the specific task they are designed for after all)

@LouisJenkinsCS
Member Author

@mppf

It is done (for now). I'm not sure if I'll go back and add more visual demonstrations (likely I will), but the textual parts are mostly done (likely with many typos).

@LouisJenkinsCS
Member Author

Also, one more update, after removing 'sync array' as it causes runtime crashes (coarse-grained synchronization won't work at all then, which is even more interesting), I got it to run with 32 nodes...

[Array]: nLocales=1, N=16777216, Op/Sec=8.69157e+06
[Array]: nLocales=2, N=16777216, Op/Sec=8.5807e+06
[Array]: nLocales=4, N=33554432, Op/Sec=1.77164e+07
[Array]: nLocales=8, N=33554432, Op/Sec=2.29545e+07
[Array]: nLocales=16, N=33554432, Op/Sec=2.38011e+07
[Array]: nLocales=32, N=33554432, Op/Sec=2.36358e+07
[DistVector]: nLocales=1, N=2097152, Op/Sec=1.96435e+06
[DistVector]: nLocales=2, N=4194304, Op/Sec=3.88863e+06
[DistVector]: nLocales=4, N=8388608, Op/Sec=7.67043e+06
[DistVector]: nLocales=8, N=16777216, Op/Sec=1.54833e+07
[DistVector]: nLocales=16, N=33554432, Op/Sec=3.15365e+07
[DistVector]: nLocales=32, N=67108864, Op/Sec=6.21236e+07

DistVector is the array, so it wins by quite a bit 🥇

@LouisJenkinsCS
Member Author

Also, I see what you meant when you said it can be applied to chpl-privatization.c; yeah, the whole idea of allocating in chunks and performing an RCU update that reuses the same memory (similar to my own array) should work marvelously. I'd do this myself, but I'm not 100% certain how to test and verify the results after I finish.

@mppf
Member

mppf commented Jan 11, 2018

For chpl-privatization.c - I think that having a RCU capability in our standard modules would be a big deal, and if you can get the RCU in by fixing leaks in chpl-privatization, that'd be a good intermediate point (for PRs). I think we can help (myself and Elliot, probably) with verifying any update to chpl-privatization.c.

@LouisJenkinsCS
Member Author

After making corrections to the benchmark, I see that there were extra communications I didn't account for before, but the good news is that the distributed vector is still scaling extremely well (it's consistently 1/3 the performance of the unsynchronized distributed Chapel array), which is extremely good... like, 300M Op/Sec vs 900M Op/Sec good!

Now, after seeing that QSBR-based RCU yields literally zero overhead, I realize that if I could implement that Chapel-side, it would likely bring the data structure ahead of (or at least neck-and-neck with) the normal Chapel array, while offering concurrent resizing and indexing.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jan 21, 2018

Update:

I finally got the synchronized array to finish in time by lowering the number of operations per task to 1024 (down from 1M); the results are interesting and exciting.

[image: random_indexing_1024 benchmark results]

Edit: Changed labels to be more accurate

@LouisJenkinsCS
Member Author

So I was thinking about the true applicability and generality of the descriptor table... why stop at wide pointers? In fact, it's funny, but the table could be used to allow atomic instances of structures as well. Essentially the descriptor acts as a layer of indirection similar to how a pointer does, so it would be possible to store a struct or record at each slot. In particular, for GlobalAtomicObject it would be storing a struct/record, a wide_ptr_t. It's interesting, actually... gone are wide pointers, replaced by descriptors.
