HABU - Highly Asynchronous Buffered Update Library


PE-Cray/habu


// // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // 
//  Copyright 2021 Hewlett Packard Enterprise
//
// MIT License
// Permission is hereby granted, free of charge, to any person obtaining a copy of this software 
// and associated documentation files (the "Software"), to deal in the Software without restriction,
// including without limitation the rights to use, copy, modify, merge, publish, distribute, 
// sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
// 
// The above copyright notice and this permission notice shall be included in all copies or 
// substantial portions of the Software.
// 
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT 
// NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, 
// DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
//  
// author:  Nathan Wichmann  (wichmann@hpe.com)
// // // // // // // // // // // // // // // // // // // // // // // // // // // // // // // 

The Heimdallr benchmark was created to test the performance
of small GLOBAL references and updates using different operations, 
including atomics, gets and puts via the shmem library.

Heimdallr is meant to approximate the reference patterns and operations
common in the workloads of customers that are important to HPE, even if
these reference patterns and operations are not common in HPC.  

Random reference patterns are generated by assuming a power-of-two 
sized local table and an arbitrary number of PEs.  An index array of 
random integers is constructed in such a manner as to allow the offset
into a table to be a simple bitwise AND and the PE to be a simple 
shift.  
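
As a minimal sketch of the decoding described above, assume each random
index lies in [0, npes * 2^t), where t is the log base 2 of the local table
size.  The names decode_index and target_t and the bit layout are
hypothetical and are not taken from the Heimdallr source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair: which PE to target and where in its local table. */
typedef struct {
    uint64_t pe;      /* target PE number */
    uint64_t offset;  /* element offset into that PE's local table */
} target_t;

/* Split a random index into (PE, offset).
 * Assumption: the local table holds 2^log2_table elements and the index
 * was generated in the range [0, npes * 2^log2_table). */
target_t decode_index(uint64_t index, unsigned log2_table, uint64_t npes)
{
    assert(log2_table < 64);
    target_t t;
    t.offset = index & ((UINT64_C(1) << log2_table) - 1); /* bitwise AND */
    t.pe     = index >> log2_table;                        /* simple shift */
    assert(t.pe < npes); /* holds only if the index was generated in range */
    return t;
}
```

Because only the table size must be a power of two, the number of PEs can
be arbitrary: with log2_table = 25 and 6 PEs, the index (5 << 25) | 123
decodes to PE 5, offset 123.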

Heimdallr also exercises and tests the habu library.
This library aggregates messages, transfers them to a remote PE,
and executes the update.  Each test is paired with a shmem 
version and the results are tested for correctness when possible.
If the correctness test determines the results match with 100%
accuracy a "PASSED" message is given.  If results match with 
an error rate of <1% then a "PASSED" is still given but the
error rate is printed out.  If an error rate of >1% is detected
then the test is assumed to have FAILED.  The 1% threshold is
somewhat arbitrary.  Some tests are expected to have an error rate,
but that error rate is expected to be low or very low.  On the other
hand, when real errors happen they often cause error rates >>1%.
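
The pass/fail rule above can be sketched as follows.  The function and
enum names are hypothetical, and the text does not say how an error rate
of exactly 1% is treated; this sketch assumes it counts as a failure.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    RESULT_PASSED,             /* results match with 100% accuracy */
    RESULT_PASSED_WITH_ERRORS, /* error rate below 1%; rate is reported */
    RESULT_FAILED              /* error rate of 1% or more (assumption) */
} result_t;

/* Classify a correctness check from the number of mismatching elements.
 * Integer arithmetic avoids floating-point rounding at the threshold. */
result_t classify(uint64_t mismatches, uint64_t checked)
{
    assert(checked > 0 && mismatches <= checked);
    if (mismatches == 0)
        return RESULT_PASSED;
    if (mismatches * 100 < checked)   /* mismatches / checked < 1% */
        return RESULT_PASSED_WITH_ERRORS;
    return RESULT_FAILED;
}
```

For example, 5 mismatches out of 1000 elements (0.5%) still passes with
the rate reported, while 50 out of 1000 (5%) fails.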

There are three groups of tests: Atomics, Gets, and Puts.

Atomics are only single word references.  Gets and puts 
can be single word or relatively small blocks.  Heimdallr sweeps
through various block sizes so one can examine the performance of
gets and puts when using those block sizes.  There are also tests to
examine the impact of where the data resides locally.

The first group of tests has only random atomic references.  The first 
test is an atomic increment that does not return any data to the processor
that is initiating the operation.  Other tests include fetching adds,
multiple different atomics in the same iteration, and a function
set up to test the recursive capabilities of habu.  
While atomic adds, both fetching and non-fetching, are important by themselves, 
these atomics also represent many other atomics, such as bitwise, 
min, max, and more.

The second group focuses on gets, testing both blocking gets and 
non-blocking gets.  The non-blocking gets also test 3 different 
methods for allocating the destination arrays: symmetric heap, heap, 
and stack.  Testing these allocation methods is meant to inform
the users as well as the developers.

The third group focuses on puts, testing both blocking puts and 
non-blocking puts.  The non-blocking puts also test 3 different 
methods for allocating the source arrays: symmetric heap, heap, 
and stack.  Testing these allocation methods is meant to inform
the users as well as the developers.  There is an additional subgroup
that examines put-with-signal operations.  The first of this subgroup
is blocking, the second is non-blocking, while the third is how a
user might implement put-with-signal functionality if the shmem
implementation did not have a put-with-signal function.

This benchmark is meant to be run on any number of cores and nodes.  One
can choose to run with 1 shmem PE per core, or more than one core per PE
with OpenMP threading underneath.  This benchmark utilizes thread-hot
shmem and can be used to examine any performance differences between
shmem only and shmem with threading.

Heimdallr options:

  -t <value>
     Table size (as log base 2) per PE.  For example, if value is 26 the size 
     of the table on each PE will be 2^26*8 = 0.5 GiB.  Default is 25.
     
  -n <value>
     Nupdates (as log base 2) per PE.  Default is 23.
     
  -r <value>
     Number of Repeats completed within a timed test.  Default is 4.
     
  -T <name> 
    Name of single test to be executed.  If <name> = ALL then all of the tests
    will run.  If <name> matches the first strlen(name) characters of a test 
    name as printed out during an ALL run then that test will run.  Multiple
    -T options can be provided in a single run.  Default is ALL.
    
  -w <value> 
     Min Message size (in number of 8B words).  Default is 1.
     
  -W <value>
     Max Message size (in number of 8B words).  Default is 4096.
     
  -G <value>
     Message size growth factor (integer).  If <value> is 0 (zero) then the 
     message size will increase by a factor of the square root of 2 every
     iteration.  If <value> is greater than 1 then the message size will
     increase by that factor every iteration.  If <value> is negative, then
     the message size will increase by 32 bytes every iteration until the
     message size reaches abs(value), and after that it will grow as if
     <value> were 0 (zero).
     
  -h
    Display option information.
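
Two of the rules above, -T name matching and -G message size growth, are
easier to see in code.  This is a sketch under stated assumptions, not the
benchmark's own implementation: the function names are hypothetical, the
test name in the usage note is made up, the abs(value) limit is assumed
to be in bytes, a value of exactly 1 (which the text does not cover) is
assumed to behave like 0, and the rounding of the square-root-of-2 step
is a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* -T: a test runs if <name> is ALL, or if <name> matches the first
 * strlen(name) characters of the test's printed name. */
int test_selected(const char *name, const char *test_name)
{
    if (strcmp(name, "ALL") == 0)
        return 1;
    return strncmp(name, test_name, strlen(name)) == 0;
}

/* -G: next message size, in 8-byte words, given the current size. */
size_t next_msg_words(size_t words, long g)
{
    assert(words > 0);
    if (g > 1)
        return words * (size_t)g;            /* grow by integer factor */
    if (g < 0 && words * 8 < (size_t)(-g))   /* assumption: limit in bytes */
        return words + 4;                    /* 32 bytes = 4 words */
    /* Square-root-of-2 growth, rounded to nearest (assumption), always
     * advancing by at least one word so the sweep terminates. */
    size_t next = (size_t)((double)words * 1.4142135623730951 + 0.5);
    return next > words ? next : words + 1;
}
```

With these assumptions, -T atom would select a test whose printed name
begins with "atom", and -G -64 would step 1, 5, 9, ... words in 32-byte
increments before switching to square-root-of-2 growth.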
            

The total memory footprint per rank will be the table size 
plus r times the number of updates times 8 bytes.  There are 4 arrays that 
are the size of the number of updates, the index array as well as the sym heap,
heap, and stack versions of the local arrays.  The program simply replicates 
the full table and the space for the index and local arrays on each PE.
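
The footprint arithmetic can be sketched as below.  The function name is
hypothetical, and the number of update-sized arrays is left as a parameter;
the text names four such arrays, each assumed here to hold 8-byte elements.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate per-PE memory footprint in bytes.
 * log2_table and log2_updates correspond to the -t and -n options;
 * n_arrays is the number of arrays sized by the number of updates. */
uint64_t footprint_bytes(unsigned log2_table, unsigned log2_updates,
                         unsigned n_arrays)
{
    assert(log2_table < 60 && log2_updates < 58);
    uint64_t table  = (UINT64_C(1) << log2_table) * 8;  /* 8-byte words */
    uint64_t arrays = (uint64_t)n_arrays * (UINT64_C(1) << log2_updates) * 8;
    return table + arrays;
}
```

With the defaults (-t 25, -n 23) and four update-sized arrays, the table
takes 256 MiB and the arrays another 256 MiB, for 0.5 GiB per PE.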

The spirit of the benchmark is that, when running on all of the CPUs 
in a NUMA domain, the table consumes significantly more space than is 
available in the last level cache.

Finally, Heimdallr is meant to be a vehicle for discussion as much as a static
implementation.  Benchmarkers should feel free to discuss the results and 
various implementations with the evaluators rather than just report
the final numbers.  
