ANS-104: Bundled Data v2.0 - Binary Serialization

Status: Standard

Abstract

This document describes the data format and directions for reading and writing bundled binary data. Bundled data is a way of writing multiple independent data transactions (referred to as DataItems in this document) into one top level transaction. A DataItem shares many of the same properties as a normal data transaction, in that it has an owner, data, tags, target, signature, and id. It differs in that it has no ability to transfer tokens and no reward, as the top level transaction pays the reward for all bundled data.

Motivation

Bundling multiple data transactions into one transaction provides a number of benefits:

  • Allow delegation of payment for a DataItem to a third party, while maintaining the identity and signature of the person who created the DataItem, without them needing a wallet with funds

  • Allow multiple DataItems to be written as a group

  • Increase the throughput of logically independent data-writes to the Arweave network

Reference Implementation

There is a reference implementation in TypeScript for creating, signing, and verifying DataItems and for working with bundles.

Specification

1. Transaction Format

1.1 Transaction Tags

A bundle of DataItems MUST have the following two tags present:

  • Bundle-Format: a string describing the bundling format. The format for this standard is binary
  • Bundle-Version: a version string. The version referred to in this standard is 2.0.0

Version changes may occur in the future due to a change in the encoding algorithm.

1.2 Transaction Body Format

The transaction body is binary data laid out as follows, where N is the number of DataItems:

| Bytes | Purpose |
| --- | --- |
| 32 | Number of DataItems |
| N x 64 | Pairs of size and entry ID [size (32 bytes), entry ID (32 bytes)] |
| Remaining bytes | Binary encoded DataItems in bundle |
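As an illustration of this layout, the following is a minimal sketch that assembles a bundle body from already-encoded DataItems. The names `EncodedItem`, `longTo32Bytes`, and `buildBundleBody` are hypothetical, and little-endian byte order for the 32-byte integers is an assumption, since this section does not state a byte order.

```typescript
// Hypothetical container for an already-encoded DataItem.
interface EncodedItem {
  id: Uint8Array;    // 32-byte DataItem id
  bytes: Uint8Array; // binary-encoded DataItem
}

// Encode a non-negative integer into 32 bytes.
// ASSUMPTION: little-endian byte order (not stated in this section).
function longTo32Bytes(value: number): Uint8Array {
  const out = new Uint8Array(32);
  let v = value;
  for (let i = 0; i < 32 && v > 0; i++) {
    out[i] = v % 256;
    v = Math.floor(v / 256);
  }
  return out;
}

// Lay out: [count (32)] [N x (size (32), id (32))] [encoded DataItems].
function buildBundleBody(items: EncodedItem[]): Uint8Array {
  const headerSize = 32 + 64 * items.length;
  let total = headerSize;
  for (const it of items) total += it.bytes.length;

  const body = new Uint8Array(total);
  body.set(longTo32Bytes(items.length), 0);

  let headerOffset = 32;
  let dataOffset = headerSize;
  for (const it of items) {
    body.set(longTo32Bytes(it.bytes.length), headerOffset);
    body.set(it.id, headerOffset + 32);
    headerOffset += 64;
    body.set(it.bytes, dataOffset);
    dataOffset += it.bytes.length;
  }
  return body;
}
```

For two items of 3 bytes and 1 byte, the resulting body is 32 + 2 x 64 + 4 = 164 bytes long.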

1.3 DataItem Format

A DataItem is a binary encoded object that has similar properties to a transaction

| Field | Description | Encoding | Length (in bytes) | Optional |
| --- | --- | --- | --- | --- |
| signature type | Type of key format used for the signature | Binary | 2 | |
| signature | A signature produced by owner | Binary | Depends on signature type | |
| owner | The public key of the owner | Binary | 512 | |
| target | An address that this DataItem is being sent to | Binary | 32 (+ presence byte) | ✔️ |
| anchor | A value to prevent replay attacks | Binary | 32 (+ presence byte) | ✔️ |
| number of tags | Number of tags | Binary | 8 | |
| number of tag bytes | Number of bytes used for tags | Binary | 8 | |
| tags | An avro array of tag objects | Binary | Variable | |
| data | The data contents | Binary | Variable | |

All optional fields will have a leading byte which describes whether the field is present (1 for present, 0 for not present). Any other value for this byte makes the DataItem invalid.

A tag object is an Apache Avro encoded stream representing an object { name: string, value: string }. Prefixing the tag objects with their byte length means decoders may skip them if they wish.

The anchor and target fields in DataItem are optional. The anchor is an arbitrary value to allow bundling gateways to provide protection from replay attacks against them or their users.
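The field order and presence bytes described above can be sketched as a small parser. This is an illustration only: `parseDataItem`, `readLE`, and `readOptional32` are hypothetical names, the signature length is passed in by the caller because it depends on the signature type, and little-endian byte order for the integer fields is an assumption not stated in this section.

```typescript
interface ParsedDataItem {
  signatureType: number;
  signature: Uint8Array;
  owner: Uint8Array;
  target?: Uint8Array;
  anchor?: Uint8Array;
  tagCount: number;
  tagBytes: Uint8Array; // still Avro-encoded; parsing them is optional
  data: Uint8Array;
}

// Read an unsigned integer of `length` bytes.
// ASSUMPTION: little-endian. Values above 2^53 lose precision in a JS number.
function readLE(buf: Uint8Array, offset: number, length: number): number {
  let value = 0;
  for (let i = length - 1; i >= 0; i--) value = value * 256 + buf[offset + i];
  return value;
}

// Read an optional 32-byte field preceded by its presence byte.
function readOptional32(
  buf: Uint8Array,
  offset: number
): { value?: Uint8Array; next: number } {
  const flag = buf[offset];
  if (flag === 0) return { next: offset + 1 };
  if (flag === 1) {
    return { value: buf.slice(offset + 1, offset + 33), next: offset + 33 };
  }
  // Any other presence value makes the DataItem invalid.
  throw new Error("invalid presence byte: " + flag);
}

function parseDataItem(buf: Uint8Array, signatureLength: number): ParsedDataItem {
  let o = 0;
  const signatureType = readLE(buf, o, 2); o += 2;
  const signature = buf.slice(o, o + signatureLength); o += signatureLength;
  const owner = buf.slice(o, o + 512); o += 512;
  const target = readOptional32(buf, o); o = target.next;
  const anchor = readOptional32(buf, o); o = anchor.next;
  const tagCount = readLE(buf, o, 8); o += 8;
  const tagByteCount = readLE(buf, o, 8); o += 8;
  // The byte-length prefix lets a decoder skip the tags without parsing them.
  const tagBytes = buf.slice(o, o + tagByteCount); o += tagByteCount;
  return {
    signatureType, signature, owner,
    target: target.value, anchor: anchor.value,
    tagCount, tagBytes, data: buf.slice(o),
  };
}
```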

1.3.1 Tag format

Parsing the tags is optional, as they are prefixed by their byte length.

To conform with deployed bundles, the tag format is Apache Avro with the following schema:

{
  "type": "array",
  "items": {
    "type": "record",
    "name": "Tag",
    "fields": [
      { "name": "name", "type": "bytes" },
      { "name": "value", "type": "bytes" }
    ]
  }
}

Usually the name and value fields are UTF-8 encoded strings, in which case "string" may be specified as the field type rather than "bytes", and avro will automatically decode them.

To encode field and list sizes, avro uses a long datatype that is first zig-zag encoded, and then variable-length integer encoded, using existing encoding specifications. When encoding arrays, avro provides for a streaming approach that separates the content into blocks.

1.3.1.1 ZigZag coding

ZigZag is an integer format where the sign bit is stored in the least significant (1s) place, such that small negative numbers have no high bits set. In surrounding code, normal integers are almost always stored in two's-complement form instead, which can be converted as below.

Converting to ZigZag:

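// test the sign of the input, not of the shifted value, so large positive inputs are not inverted
zigzag = twos_complement << 1;
if (twos_complement < 0) zigzag = ~zigzag;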

Converting from ZigZag:

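// shift first with a logical (unsigned) shift, then invert if the sign bit was set
twos_complement = zigzag >>> 1;
if (zigzag & 1) twos_complement = ~twos_complement;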
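
The two conversions can also be written without branches. The sketch below is a 32-bit variant for illustration; the function names are hypothetical, and an Avro long is 64 bits wide, so this only covers values that fit in 32 bits (which is ample for tag counts and lengths).

```typescript
// Map a signed 32-bit integer to its unsigned ZigZag form:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
function toZigZag(n: number): number {
  // (n >> 31) is all ones for negative n and zero otherwise,
  // so the XOR inverts the shifted value only for negative inputs.
  return ((n << 1) ^ (n >> 31)) >>> 0;
}

// Inverse mapping: recover the signed 32-bit integer.
function fromZigZag(z: number): number {
  // -(z & 1) is all ones when the sign bit (1s place) is set.
  return (z >>> 1) ^ -(z & 1);
}

console.log(toZigZag(-1));  // 1
console.log(toZigZag(1));   // 2
console.log(fromZigZag(3)); // -2
```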

1.3.1.2 Variable-length integer coding

Variable-length integer (VInt) is a little-endian format that stores 7 bits of the value in each byte; the most significant bit (0x80) of each byte indicates whether another byte (of 7 bits greater significance) follows in the stream.

Converting to VInt:

// writes 'zigzag' to 'vint' buffer
offset = 0;
do {
  vint_byte = zigzag & 0x7f;
  zigzag >>= 7;
  if (zigzag)
    vint_byte |= 0x80;
  vint.writeUInt8(vint_byte, offset);
  offset += 1;
} while(zigzag);

Converting from VInt:

// constructs 'zigzag' from 'vint' buffer
zigzag = 0;
offset = 0;
do {
  vint_byte = vint.readUInt8(offset);
  zigzag |= (vint_byte & 0x7f) << (offset*7);
  vint_byte &= 0x80;
  offset += 1;
} while(vint_byte);
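
A self-contained TypeScript rendering of the same two loops is sketched below. The function names are hypothetical. It uses division and multiplication instead of bit shifts because JavaScript bitwise operators truncate to 32 bits, and it accepts only non-negative safe integers (a ZigZag-coded value is always non-negative).

```typescript
// Encode a non-negative integer as a VInt (7 bits per byte, low bits first).
function encodeVInt(value: number): Uint8Array {
  if (!Number.isSafeInteger(value) || value < 0) {
    throw new RangeError("value must be a non-negative safe integer");
  }
  const bytes: number[] = [];
  let rest = value;
  do {
    let b = rest % 128;          // low 7 bits
    rest = Math.floor(rest / 128);
    if (rest > 0) b |= 0x80;     // continuation bit: more bytes follow
    bytes.push(b);
  } while (rest > 0);
  return Uint8Array.from(bytes);
}

// Decode a VInt starting at `offset`; returns the value and the next offset.
function decodeVInt(buf: Uint8Array, offset = 0): { value: number; next: number } {
  let value = 0;
  let scale = 1;
  let b = 0;
  do {
    if (offset >= buf.length) throw new RangeError("truncated VInt");
    b = buf[offset++];
    value += (b & 0x7f) * scale;
    scale *= 128;
  } while (b & 0x80);
  return { value, next: offset };
}

console.log(Array.from(encodeVInt(300)));                     // [172, 2]
console.log(decodeVInt(new Uint8Array([0xac, 0x02])).value);  // 300
```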

1.3.1.3 Avro tag array format

Avro arrays may arrive split into more than one sequence of items. Each sequence is prefixed by its length, which may be negative, in which case a byte length is inserted between the length and the sequence content. This is used in schemas of larger data to provide for seeking. The end of the array is indicated by a sequence of length zero.

The complete tags format is a single avro array, consisting solely of blocks of the below format. The sequence is terminated by a block with a count of 0. The size field is only present if the count is negative, in which case its absolute value should be used.

| Field | Description | Encoding | Length | Optional |
| --- | --- | --- | --- | --- |
| count | Number of items in block | ZigZag VInt | Variable | |
| size | Number of bytes if count<0 | ZigZag VInt | Variable | ✔️ |
| block | Concatenated tag items | Binary | size | |

1.3.1.4 Avro tag item format

Each item of the avro array is a pair of avro strings or bytes objects, a name and a value, each prefixed by their length.

| Field | Description | Encoding | Length | Optional |
| --- | --- | --- | --- | --- |
| name_size | Number of bytes in name | ZigZag VInt | Variable | |
| name | Name of the tag | Binary | name_size | |
| value_size | Number of bytes in value | ZigZag VInt | Variable | |
| value | Value of the tag | Binary | value_size | |
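
Putting the block and item formats together, the following is a minimal sketch of a tag encoder and decoder. All function names are hypothetical, tag names and values are assumed to be UTF-8 strings, and the encoder writes every tag into a single block with a positive count, which is only one of the layouts the array format permits. The decoder also accepts blocks with a negative count followed by a byte size.

```typescript
interface Tag { name: string; value: string; }

// Write an Avro long: ZigZag coding, then VInt coding.
function writeLong(n: number, out: number[]): void {
  let z = n >= 0 ? n * 2 : -n * 2 - 1; // ZigZag without 32-bit truncation
  do {
    let b = z % 128;
    z = Math.floor(z / 128);
    if (z > 0) b |= 0x80;
    out.push(b);
  } while (z > 0);
}

// Read an Avro long and advance pos.offset past it.
function readLong(buf: Uint8Array, pos: { offset: number }): number {
  let z = 0, scale = 1, b = 0;
  do {
    if (pos.offset >= buf.length) throw new RangeError("truncated tag data");
    b = buf[pos.offset++];
    z += (b & 0x7f) * scale;
    scale *= 128;
  } while (b & 0x80);
  return z % 2 === 0 ? z / 2 : -(z + 1) / 2;
}

function encodeTags(tags: Tag[]): Uint8Array {
  const enc = new TextEncoder();
  const out: number[] = [];
  if (tags.length > 0) {
    writeLong(tags.length, out); // one block holding every tag
    for (const tag of tags) {
      for (const field of [tag.name, tag.value]) {
        const bytes = enc.encode(field);
        writeLong(bytes.length, out);
        for (let i = 0; i < bytes.length; i++) out.push(bytes[i]);
      }
    }
  }
  writeLong(0, out); // a block with a count of 0 terminates the array
  return Uint8Array.from(out);
}

function decodeTags(buf: Uint8Array): Tag[] {
  const dec = new TextDecoder();
  const pos = { offset: 0 };
  const tags: Tag[] = [];
  for (;;) {
    let count = readLong(buf, pos);
    if (count === 0) break;
    if (count < 0) {
      count = -count;
      readLong(buf, pos); // byte size of the block; not needed when parsing
    }
    for (let i = 0; i < count; i++) {
      const fields: string[] = [];
      for (let f = 0; f < 2; f++) {
        const len = readLong(buf, pos);
        fields.push(dec.decode(buf.subarray(pos.offset, pos.offset + len)));
        pos.offset += len;
      }
      tags.push({ name: fields[0], value: fields[1] });
    }
  }
  return tags;
}

console.log(Array.from(encodeTags([{ name: "a", value: "bc" }])));
// [2, 2, 97, 4, 98, 99, 0]
```

In the printed bytes, the leading 2 is the ZigZag form of the count 1, the following 2 and 4 are the ZigZag forms of the lengths 1 and 2, and the trailing 0 is the terminating block.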

2. DataItem signature and id

The signature and id for a DataItem are built in a manner similar to Arweave 2.0 transaction signing. It uses the Arweave 2.0 deep-hash algorithm. The 2.0 deep-hash algorithm operates on arbitrarily nested arrays of binary data, i.e. a recursive type of DeepHashChunk = Uint8Array | DeepHashChunk[].

There are reference implementations for the deep-hash algorithm in TypeScript and Erlang.

To generate a valid signature for a DataItem, the contents of the DataItem and static version tags are passed to the deep-hash algorithm to obtain a message. This message is signed by the owner of the DataItem to produce the signature. The id of the DataItem is the SHA-256 digest of this signature.

The exact structure and content passed into the deep-hash algorithm to obtain the message to sign is as follows:

[
  utf8Encoded("dataitem"),
  utf8Encoded("1"),
  owner,
  target,
  anchor,
  [
    ... [ tag.name, tag.value ],
    ... [ tag.name, tag.value ],
    ...
  ],
  data
]
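
The sketch below shows how the structure above could be fed to a deep-hash function and how an id follows from a signature. It is an illustration rather than a normative definition: the `deepHash` body reflects the editor's understanding of the Arweave 2.0 deep-hash algorithm (SHA-384 over "blob"/"list" tags and lengths) and should be checked against the reference implementations, the names `signatureMessage` and `dataItemId` are hypothetical, absent target and anchor fields are assumed to be passed as empty byte arrays, and Node.js `crypto` is assumed to be available.

```typescript
import { createHash } from "crypto";

type DeepHashChunk = Uint8Array | DeepHashChunk[];

const utf8 = (s: string): Uint8Array => new TextEncoder().encode(s);

// SHA-384 over the concatenation of the given parts.
function sha384(...parts: Uint8Array[]): Uint8Array {
  const h = createHash("sha384");
  for (const p of parts) h.update(p);
  return new Uint8Array(h.digest());
}

function deepHash(chunk: DeepHashChunk): Uint8Array {
  if (Array.isArray(chunk)) {
    // List: seed with H("list" + length), then fold in each child's deep hash.
    let acc = sha384(utf8("list"), utf8(String(chunk.length)));
    for (const child of chunk) acc = sha384(acc, deepHash(child));
    return acc;
  }
  // Blob: H( H("blob" + byteLength) || H(data) ).
  const tag = sha384(utf8("blob"), utf8(String(chunk.byteLength)));
  return sha384(tag, sha384(chunk));
}

interface SignableItem {
  owner: Uint8Array;
  target: Uint8Array; // ASSUMPTION: empty array when absent
  anchor: Uint8Array; // ASSUMPTION: empty array when absent
  tags: { name: Uint8Array; value: Uint8Array }[];
  data: Uint8Array;
}

// Message to be signed by the owner, following the structure listed above.
function signatureMessage(item: SignableItem): Uint8Array {
  return deepHash([
    utf8("dataitem"),
    utf8("1"),
    item.owner,
    item.target,
    item.anchor,
    item.tags.map((t) => [t.name, t.value]),
    item.data,
  ]);
}

// The id of a DataItem is the SHA-256 digest of its signature.
function dataItemId(signature: Uint8Array): Uint8Array {
  return new Uint8Array(createHash("sha256").update(signature).digest());
}
```

Signing the resulting message with the owner's key is outside the scope of this sketch, because it depends on the signature type.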

2.1 Verifying a DataItem

DataItem verification is key to maintaining consistency within the bundle standard. A DataItem is valid iff¹:

  • id matches the signature (via SHA-256 of the signature)
  • signature matches the owner's public key
  • tags are all valid
  • an anchor isn't more than 32 bytes

The tags of a DataItem are valid iff:

  • there are <= 128 tags
  • each key is <= 1024 bytes
  • each value is <= 3072 bytes
  • each tag contains only a key and a value
  • both the key and the value are non-empty strings
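
The tag rules above can be expressed as a small predicate. This is a sketch with a hypothetical function name; it assumes that "key" in the list corresponds to the `name` field of a tag object and that the byte limits apply to the UTF-8 encoding of each string.

```typescript
// Returns true when `tags` satisfies the tag rules listed above.
function tagsAreValid(tags: unknown): boolean {
  if (!Array.isArray(tags) || tags.length > 128) return false;
  const enc = new TextEncoder();
  return tags.every((tag: unknown) => {
    if (typeof tag !== "object" || tag === null) return false;
    // Only a name (key) and a value may be present.
    const keys = Object.keys(tag);
    if (keys.length !== 2) return false;
    if (keys.indexOf("name") < 0 || keys.indexOf("value") < 0) return false;
    const { name, value } = tag as Record<string, unknown>;
    if (typeof name !== "string" || typeof value !== "string") return false;
    // ASSUMPTION: limits are measured in UTF-8 bytes, not characters.
    const nameLen = enc.encode(name).length;
    const valueLen = enc.encode(value).length;
    return nameLen > 0 && nameLen <= 1024 && valueLen > 0 && valueLen <= 3072;
  });
}

console.log(tagsAreValid([{ name: "Content-Type", value: "text/plain" }])); // true
console.log(tagsAreValid([{ name: "", value: "x" }]));                      // false
```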

3. Writing a Bundle of DataItems

To write a bundle of DataItems, each DataItem should be constructed, signed, encoded, and placed in a transaction with the transaction body format and transaction tags specified in Section 1.

3.1 Nested bundle

Arweave Transactions and DataItems have analogous specifications for tagging and bearing of a binary payload. As such, the ANS-104 Bundle Transaction tagging and binary data format specification can be applied to the tags and binary data payload of a DataItem. Assembling a DataItem this way provides for the nesting of ANS-104 Bundles with one-to-many relationships between "parent" and "child" bundles and theoretically unbounded levels of nesting. Additionally, nested DataItem Bundles can be mixed heterogeneously with non-Bundle DataItems at any depth in the Bundle tree.

To construct an ANS-104 DataItem as a nested Bundle:

  • Add tags to the DataItem as described by the specification in section 1.1
  • Provide a binary payload for the DataItem matching the Bundle Transaction Body Format described in section 1.2, i.e. the Bundle header outlining the count, size, and IDs of the subsequent, nested DataItems, each of which should be verifiable using the method described in section 2.1.

Gateway GQL queries for DataItem headers should, upon request, contain a bundledIn field whose value indicates the parent-child relationship of the DataItem to its immediate parent. Any nested bundle should be traceable to a base layer Arweave Transaction by recursively following the bundledIn field up through the chain of parents.
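
As a sketch of the two ideas in this section, the code below checks whether a DataItem's tags mark it as a nested bundle and follows a chain of parent references to the base layer transaction. The function names are hypothetical, and the `bundledIn` map is a hypothetical in-memory stand-in for whatever a gateway returns; it is not a gateway API.

```typescript
interface Tag { name: string; value: string; }

// A DataItem is treated as a nested bundle when it carries both
// tags from section 1.1 with the values defined by this standard.
function isNestedBundle(tags: Tag[]): boolean {
  const has = (name: string, value: string): boolean =>
    tags.some((t) => t.name === name && t.value === value);
  return has("Bundle-Format", "binary") && has("Bundle-Version", "2.0.0");
}

// Follow child -> parent links until an id with no parent is reached.
// `bundledIn` maps a DataItem id to the id of its immediate parent.
function rootOf(id: string, bundledIn: Map<string, string>): string {
  const seen = new Set<string>();
  let current = id;
  for (;;) {
    const parent = bundledIn.get(current);
    if (parent === undefined) return current; // base layer transaction
    if (seen.has(current)) throw new Error("cycle in bundledIn chain");
    seen.add(current);
    current = parent;
  }
}

const parents = new Map<string, string>([
  ["item-c", "bundle-b"],
  ["bundle-b", "tx-a"],
]);
console.log(rootOf("item-c", parents)); // "tx-a"
```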

4. Reading a Bundle of DataItems

To read a bundle of DataItems, the bytes following the header can be partitioned using the sizes in each header pair: the offset of each DataItem is the sum of the sizes of the DataItems before it. Subsequently, each partition can be parsed into a DataItem object (a struct in languages such as Rust or Go, or a JSON object in TypeScript).

This allows for querying of a singleton or a bundle as a whole.
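
The partitioning step can be sketched as follows. The names `unbundle` and `readLE` are hypothetical, and little-endian byte order for the 32-byte integers is an assumption. Each returned entry holds the id from the header and a view of the corresponding DataItem bytes, which can then be parsed individually.

```typescript
interface BundleEntry {
  id: Uint8Array;    // 32-byte entry id from the header
  bytes: Uint8Array; // view of the encoded DataItem
}

// Read an unsigned integer of `length` bytes.
// ASSUMPTION: little-endian byte order.
function readLE(buf: Uint8Array, offset: number, length: number): number {
  let value = 0;
  for (let i = length - 1; i >= 0; i--) value = value * 256 + buf[offset + i];
  return value;
}

function unbundle(body: Uint8Array): BundleEntry[] {
  if (body.length < 32) throw new RangeError("bundle shorter than its header");
  const count = readLE(body, 0, 32);
  const headerSize = 32 + 64 * count;
  if (body.length < headerSize) throw new RangeError("truncated bundle header");

  const entries: BundleEntry[] = [];
  let dataOffset = headerSize; // the first DataItem starts right after the header
  for (let i = 0; i < count; i++) {
    const pair = 32 + 64 * i;
    const size = readLE(body, pair, 32);
    const id = body.slice(pair + 32, pair + 64);
    if (dataOffset + size > body.length) {
      throw new RangeError("DataItem " + i + " extends past the end of the bundle");
    }
    entries.push({ id, bytes: body.subarray(dataOffset, dataOffset + size) });
    dataOffset += size; // running sum of sizes gives the next offset
  }
  return entries;
}
```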

4.1 Indexing DataItems

This format allows for indexing of specific fields in O(N) time. Some form of caching or indexing could be performed by gateways to improve lookup times.

1 - if and only if