Quick summary

| Instruction | General theme | Optional special features |
| --- | --- | --- |
| `ldx` | `x[i] = memory[i]` | Load pair |
| `ldy` | `y[i] = memory[i]` | Load pair |
| `ldz`<br/>`ldzi` | `z[_][i] = memory[i]` | Load pair, interleaved Z |
| `stx` | `memory[i] = x[i]` | Store pair |
| `sty` | `memory[i] = y[i]` | Store pair |
| `stz`<br/>`stzi` | `memory[i] = z[_][i]` | Store pair, interleaved Z |

Instruction encoding

| Bit | Width | Meaning | Notes |
| ---: | ---: | --- | --- |
| 10 | 22 | A64 reserved instruction | Must be `0x201000 >> 10` |
| 5 | 5 | Instruction | 0 for `ldx`<br/>1 for `ldy`<br/>2 for `stx`<br/>3 for `sty`<br/>4 for `ldz`<br/>5 for `stz`<br/>6 for `ldzi`<br/>7 for `stzi` |
| 0 | 5 | 5-bit GPR index | See below for the meaning of the 64 bits in the GPR |
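
As a rough illustration of the table above, the 32-bit instruction word can be assembled from its three fields. This is a minimal sketch: the enum and the helper name `amx_ldst_encode` are hypothetical and do not come from the original sources.

```c
#include <stdint.h>

/* Opcode numbers taken from the "Instruction" row of the encoding table.
   The enumerator names are hypothetical. */
enum amx_ldst_op {
    AMX_OP_LDX = 0, AMX_OP_LDY = 1, AMX_OP_STX = 2, AMX_OP_STY = 3,
    AMX_OP_LDZ = 4, AMX_OP_STZ = 5, AMX_OP_LDZI = 6, AMX_OP_STZI = 7
};

/* Hypothetical helper: builds the A64 instruction word.
   Bits 10..31 hold 0x201000 >> 10, bits 5..9 the opcode,
   bits 0..4 the index of the GPR that carries the 64-bit operand. */
uint32_t amx_ldst_encode(enum amx_ldst_op op, uint32_t gpr) {
    return 0x00201000u | (((uint32_t)op & 0x1Fu) << 5) | (gpr & 0x1Fu);
}
```

For example, `amx_ldst_encode(AMX_OP_LDY, 3)` yields `0x00201023`, which an assembler could emit with a `.word` directive.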

Operand bitfields

For ldx / ldy:

| Bit | Width | Meaning |
| ---: | ---: | --- |
| 63 | 1 | Ignored |
| 62 | 1 | Load multiple registers (1) or single register (0) |
| 61 | 1 | On M1/M2: Ignored (loads are always to consecutive registers)<br/>On M3: Load to non-consecutive registers (1) or to consecutive registers (0) |
| 60 | 1 | On M1: Ignored ("multiple" always means two registers)<br/>On M2/M3: "Multiple" means four registers (1) or two registers (0) |
| 59 | 1 | Ignored |
| 56 | 3 | X / Y register index |
| 0 | 56 | Pointer |
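
A minimal sketch of packing an `ldx` / `ldy` operand from the fields above. The macro names and the helper `ldxy_operand` are assumptions made for illustration, not names from the original sources.

```c
#include <stdint.h>

/* Flag bits from the ldx / ldy table; the names are hypothetical. */
#define LDXY_MULTIPLE        (1ull << 62) /* load more than one register */
#define LDXY_NON_CONSECUTIVE (1ull << 61) /* honoured on M3, ignored on M1/M2 */
#define LDXY_FOUR            (1ull << 60) /* honoured on M2/M3, ignored on M1 */

/* Hypothetical helper: packs pointer, register index and flags into
   the 64-bit value that goes into the GPR named by the instruction. */
uint64_t ldxy_operand(uint64_t address, unsigned reg, uint64_t flags) {
    return (address & ((1ull << 56) - 1))   /* bits 0..55: pointer */
         | ((uint64_t)(reg & 7u) << 56)     /* bits 56..58: X / Y register index */
         | flags;                           /* bits 60..62: optional features */
}
```

Passing the address as a plain integer keeps the sketch independent of pointer width; real code would cast a pointer through `uintptr_t`.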

For stx / sty:

| Bit | Width | Meaning |
| ---: | ---: | --- |
| 63 | 1 | Ignored |
| 62 | 1 | Store pair of registers (1) or single register (0) |
| 59 | 3 | Ignored |
| 56 | 3 | X / Y register index |
| 0 | 56 | Pointer |

For ldz / stz:

| Bit | Width | Meaning |
| ---: | ---: | --- |
| 63 | 1 | Ignored |
| 62 | 1 | Load / store pair of registers (1) or single register (0) |
| 56 | 6 | Z row |
| 0 | 56 | Pointer |

For ldzi / stzi:

| Bit | Width | Meaning |
| ---: | ---: | --- |
| 62 | 2 | Ignored |
| 57 | 5 | Z row (high 5 bits thereof) |
| 56 | 1 | Right hand half (1) or left hand half (0) of Z register pair |
| 0 | 56 | Pointer |
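
The `ldzi` / `stzi` fields can be unpacked as in the sketch below. It assumes that the five row bits select an even/odd pair of Z rows, with the even row at twice the field value; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical decoded form of an ldzi / stzi operand. */
struct zi_operand {
    uint32_t even_row;   /* even Z row of the pair; the odd row is even_row + 1 */
    uint32_t right_half; /* 1 = right hand half, 0 = left hand half */
    uint64_t address;    /* 56-bit pointer */
};

struct zi_operand zi_decode(uint64_t operand) {
    struct zi_operand d;
    /* Bits 57..61 are the high 5 bits of the 6-bit Z row (assumed pairing). */
    d.even_row   = (uint32_t)((operand >> 57) & 0x1Fu) << 1;
    d.right_half = (uint32_t)((operand >> 56) & 1u);
    d.address    = operand & ((1ull << 56) - 1);
    return d;
}
```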

Description

Move 64 bytes of data between memory (does not have to be aligned) and an AMX register, or move 128 bytes of data between memory (must be aligned to 128 bytes) and an adjacent pair of AMX registers. On M2/M3, can also move 256 bytes of data from memory to four consecutive X or Y registers. On M3, can move 128 or 256 bytes of data from memory to non-consecutive X or Y registers: if bit 61 is set, 128 bytes are moved to registers n and (n+4)%8, or 256 bytes are moved to registers n, (n+2)%8, (n+4)%8, (n+6)%8.
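
To make the register arithmetic concrete, the following sketch lists which X or Y registers an `ldx` / `ldy` operand writes on each chip generation. The `amx_gen` enum and `ldxy_targets` are hypothetical names; the behaviour follows the paragraph above rather than any measured hardware result.

```c
#include <stdint.h>

/* Hypothetical chip generations, ordered so that >= comparisons work. */
enum amx_gen { AMX_GEN_M1 = 1, AMX_GEN_M2 = 2, AMX_GEN_M3 = 3 };

/* Fills out[] with the destination register indices and returns how many
   there are (1, 2 or 4). out[k] receives the 64 bytes at offset 64 * k. */
int ldxy_targets(uint64_t operand, enum amx_gen gen, uint32_t out[4]) {
    uint32_t n   = (uint32_t)(operand >> 56) & 7u;
    int multiple = (int)((operand >> 62) & 1);
    int four     = gen >= AMX_GEN_M2 && ((operand >> 60) & 1); /* ignored on M1 */
    int spread   = gen >= AMX_GEN_M3 && ((operand >> 61) & 1); /* ignored on M1/M2 */
    int count    = !multiple ? 1 : (four ? 4 : 2);
    /* Non-consecutive loads step by 4 for two registers, by 2 for four. */
    uint32_t stride = !spread ? 1u : (four ? 2u : 4u);
    for (int k = 0; k < count; ++k) {
        out[k] = (n + stride * (uint32_t)k) % 8u;
    }
    return count;
}
```

With bits 60, 61 and 62 set and register index 6, this gives registers 6, 0, 2, 4 on M3, registers 6, 7, 0, 1 on M2, and registers 6, 7 on M1.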

The ldzi and stzi instructions manipulate half of a pair of Z registers. Viewing the 64 bytes of memory and the 64 bytes of every Z register as vectors of i32 / u32 / f32, the mapping between memory and Z is:

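| Memory | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Z0 | 0 L | 2 L | 4 L | 6 L | 8 L | 10 L | 12 L | 14 L | 0 R | 2 R | 4 R | 6 R | 8 R | 10 R | 12 R | 14 R |
| Z1 | 1 L | 3 L | 5 L | 7 L | 9 L | 11 L | 13 L | 15 L | 1 R | 3 R | 5 R | 7 R | 9 R | 11 R | 13 R | 15 R |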

In other words, the even Z register contains the even lanes from memory, and the odd Z register contains the odd lanes from memory. By a happy coincidence, this matches up with the "interleaved pair" lane arrangements of mixed-width mac16 / fma16 / fms16 instructions, and with the "interleaved pair" lane arrangements of other instructions when in a (16, 16, 32) arrangement.
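
The interleaving can be sketched in C as follows. This is an illustration under two assumptions: that the "left hand half" is lanes 0 to 7 and the "right hand half" is lanes 8 to 15 of each Z register in the pair, and that the five row bits select the even row at twice their value. `z_row` and `ldzi_sketch` are hypothetical names, not the code in ldst.c.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical minimal Z file: 64 rows of sixteen 32-bit lanes. */
typedef struct { uint32_t u32[16]; } z_row;

/* Sketch of ldzi: de-interleave 16 lanes from memory into one half
   of an even/odd pair of Z rows. */
void ldzi_sketch(z_row z[64], uint64_t operand) {
    uint32_t even_row = (uint32_t)((operand >> 57) & 0x1Fu) << 1;
    uint32_t base = ((operand >> 56) & 1u) ? 8u : 0u; /* assumed: R = lanes 8..15 */
    const void* src = (const void*)(uintptr_t)(operand & ((1ull << 56) - 1));
    uint32_t mem[16];
    memcpy(mem, src, sizeof mem); /* copy the 64 bytes before scattering them */
    for (uint32_t i = 0; i < 8; ++i) {
        z[even_row].u32[base + i]     = mem[2 * i];     /* even memory lanes */
        z[even_row + 1].u32[base + i] = mem[2 * i + 1]; /* odd memory lanes */
    }
}
```

A matching `stzi` sketch would run the same loop with the assignments reversed and then copy `mem` out to the destination.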

Emulation code

See ldst.c.

A representative sample is:

```c
void emulate_AMX_LDX(amx_state* state, uint64_t operand) {
    ld_common(state->x, operand, 7); // mask 7: X register index is 3 bits
}

void ld_common(amx_reg* regs, uint64_t operand, uint32_t regmask) {
    uint32_t rn = (operand >> 56) & regmask;              // first destination register
    const uint8_t* src = (uint8_t*)((operand << 8) >> 8); // low 56 bits are the pointer
    memcpy(regs + rn, src, 64);
    if (operand & LDST_MULTIPLE) {
        uint32_t rs = 1; // register stride between consecutive 64-byte chunks
        if ((AMX_VER >= AMX_VER_M3) && (operand & LDST_NON_CONSECUTIVE) && (regmask <= 15)) {
            // Non-consecutive: n, n+2, n+4, n+6 for four registers; n, n+4 for two
            rs = (operand & LDST_MULTIPLE_MEANS_FOUR) ? 2 : 4;
        }
        memcpy(regs + ((rn + rs) & regmask), src + 64, 64);
        if ((AMX_VER >= AMX_VER_M2) && (operand & LDST_MULTIPLE_MEANS_FOUR) && (regmask <= 15)) {
            memcpy(regs + ((rn + rs*2) & regmask), src + 128, 64);
            memcpy(regs + ((rn + rs*3) & regmask), src + 192, 64);
        }
    }
}
```
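
For comparison, a store in the other direction might look like the sketch below. It is an assumption-laden mirror of the load sample, not a copy of ldst.c: the type `amx_reg_sketch`, the macro `ST_PAIR` and the function `st_common_sketch` are hypothetical, and the wrap-around of the second register index is assumed to match the load case.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 64-byte AMX register. */
typedef struct { uint8_t bytes[64]; } amx_reg_sketch;

#define ST_PAIR (1ull << 62) /* bit 62 of the stx / sty operand */

/* Sketch: store one register, or an adjacent pair, to memory.
   Bits 59..61 are ignored for stx / sty, so there are no four-register
   or non-consecutive variants to handle. */
void st_common_sketch(const amx_reg_sketch* regs, uint64_t operand, uint32_t regmask) {
    uint32_t rn = (uint32_t)(operand >> 56) & regmask;
    uint8_t* dst = (uint8_t*)(uintptr_t)((operand << 8) >> 8); /* low 56 bits */
    memcpy(dst, regs + rn, 64);
    if (operand & ST_PAIR) {
        /* Assumed: the second register index wraps, as it does for loads. */
        memcpy(dst + 64, regs + ((rn + 1) & regmask), 64);
    }
}
```

The sketch does not check the 128-byte alignment that the hardware requires for pair transfers.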