Commit e3f6dc1: Compression documentation (#2711)
ddavis-2015 authored Oct 3, 2024

Add documentation describing some compression/decompression internals and
makefile build procedures.

bug=#2710

File changed: tensorflow/lite/micro/docs/compression.md (313 additions)

# TFLM Compression Support

TFLM supports fixed-width compression of const-tensors using lookup tables.
Const-tensors are typically those containing trained weights or biases, but can
be any tensor whose values are fixed and unchanging within the model.

Const-tensors are compressed to fixed width bitstrings, and lookup tables are
added to the model schema for each tensor.

When accessing a compressed tensor, each kernel invokes a common decompression
method. Each fixed-width bit group in the tensor bitstring is used as an index
into the tensor lookup table. The results of the lookup table operations
are placed into a scratch buffer holding the tensor's decompressed data.

Decompression results in increased latency during inference.
There will also be an increase in the size of non-persistent arena memory, due to
the use of scratch buffers to temporarily hold the decompressed data.

# Supported Tensor Types

* FLOAT32, INT8, INT16, INT32, INT64, BOOL

# Supported Kernels

* FULLY_CONNECTED
* CONV_2D
* DEPTHWISE_CONV
* TRANSPOSE_CONV
* CONCATENATION
* ASSIGN_VARIABLE

Per-channel quantized tensor support is available for:
* CONV_2D
* DEPTHWISE_CONV
* TRANSPOSE_CONV
* FULLY_CONNECTED

# Supported Platforms

* X86
* XTENSA
  * P6_VISION, HIFI_MINI, HIFI3, HIFI4, HIFI5

# Model and Metadata Schema for Compression

Models that use compression will have a string key in their `Metadata` vector
corresponding to `COMPRESSION_METADATA`. The buffer indexed by such a `Metadata`
entry will contain the compression schema. The complete compression schema can
be found [here](https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/compression/metadata.fbs).

For each tensor which is compressed, the following schema element is created:
```
table LutTensor {
  // Look-Up-Table Tensor: a tensor representation where elements are
  // compressed into indices into a table of values. The indices are unsigned
  // integers, index_bitwidth-wide, in big-endian bit order, packed into the
  // buffer identified by the corresponding tflite.Tensor's buffer field. The
  // values are located in a newly-created buffer, encoded according to the
  // tflite.Tensor.type. Tensors with multiple channels have distinct values
  // tables for each channel, concatenated one after another in the buffer.
  // An element's LUT index must be looked up in the value table for its
  // channel.
  tensor:int;            // index of the corresponding tflite.Tensor
  value_buffer:uint;     // index of the buffer containing LUT values
  index_bitwidth:uint8;  // bit-width of LUT indexes
}
```

* `tensor`: the index of the tensor in the current subgraph. This tensor will
have had its buffer data replaced with a packed bitstring (see below),
representing fixed width indices into the `value table`.
* `value_buffer`: the index of a buffer added to the model. This buffer contains
the `value table` (see below) for the tensor, which is used to decompress the
tensor. The elements of the `value table` are of the same type (INT8, INT16, etc.)
as the original (uncompressed) tensor.
* `index_bitwidth`: the fixed width of each bit group (index) that represents an offset
into the `value table`. For per-channel quantized tensors, the index is an
offset into the `value table` for a specific channel.

## Tensor Bitstrings

Each compressed tensor has its buffer data replaced by a packed bitstring. The
bitstring consists of fixed-width bit groups (indices), each group representing
an offset into the `value table`. The bitstring is in big-endian byte order,
most significant bit first. The bitstring is padded at the end, to the next
byte boundary, with zero bits.

Example (bit width 3):
```
1110000110100000
--|--|--|--|---|
7 0 3 2 padding
```
This bitstring represents the indices 7, 0, 3, 2 as offsets into the `value table`.
Each offset is in units of the tensor's element type: if the tensor is INT8,
each offset selects one byte in the `value table`; if the tensor is FLOAT32,
each offset selects four bytes.

While the compressed tensor data buffer will shrink in size, the tensor shape
(dimensions) will remain the same as the uncompressed tensor.

The indices in the bitstring are in the same order as the tensor's original data.
Compression never reorders the tensor data, simplifying the decompression phase.
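The index unpacking described above can be sketched as follows. This is a
minimal illustration only; the `UnpackIndices` helper is hypothetical and not
part of TFLM:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack big-endian, MSB-first fixed-width indices from a packed bitstring.
// `bit_width` is the width of each index (1 to 7 bits); `count` is the number
// of indices encoded (trailing zero pad bits are ignored).
std::vector<uint32_t> UnpackIndices(const uint8_t* bitstring, size_t count,
                                    size_t bit_width) {
  std::vector<uint32_t> indices;
  indices.reserve(count);
  size_t bit_pos = 0;  // absolute bit position within the bitstring
  for (size_t i = 0; i < count; ++i) {
    uint32_t index = 0;
    for (size_t b = 0; b < bit_width; ++b, ++bit_pos) {
      const uint8_t byte = bitstring[bit_pos / 8];
      const uint8_t bit = (byte >> (7 - (bit_pos % 8))) & 1;  // MSB first
      index = (index << 1) | bit;
    }
    indices.push_back(index);
  }
  return indices;
}
```

With the example bitstring above (bytes `0xE1 0xA0`), a bit width of 3, and a
count of 4, this yields the indices 7, 0, 3, 2.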

## Value Tables

A `value table` contains the unique data values from an original (uncompressed)
tensor. For each compressed tensor, an additional buffer is added to the model,
and the `value table` resides as a contiguous sequence of data values within
that buffer. Each element in the `value table` is unique, and is of the same type
(INT16, FLOAT32, etc.) as the uncompressed tensor. The order of values within
the `value table` does not have to match the order in which they appeared in
the uncompressed tensor data.

Example (tensor type is INT16, value table size is 12 bytes):
```
tensor data: [2, 4, 4, 10, 1, 7, 99, 10, 2, 4]
value table: [99, 2, 10, 4, 1, 7]
```
A suitable tensor bitstring (bit width 3) for the example would be:
```
bitstring: 00101101101010010100001000101100
| | | | | | | | | | |
index: 1 3 3 2 4 5 0 2 1 3 padding
value: 2 4 4 10 1 7 99 10 2 4
```
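One way a `value table` and index stream could be derived from tensor data is
sketched below. The `BuildValueTable` helper is hypothetical and not the actual
TFLM compression tooling; it emits unique values in first-appearance order,
which (as noted above) is as valid as any other order:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Build a value table (unique values) and per-element indices for a tensor.
// Each returned index is an offset into `value_table`; unique values are
// stored in first-appearance order (any order is valid).
void BuildValueTable(const std::vector<int16_t>& data,
                     std::vector<int16_t>* value_table,
                     std::vector<uint32_t>* indices) {
  value_table->clear();
  indices->clear();
  for (int16_t v : data) {
    auto it = std::find(value_table->begin(), value_table->end(), v);
    if (it == value_table->end()) {
      value_table->push_back(v);  // first occurrence: append to the table
      it = value_table->end() - 1;
    }
    indices->push_back(static_cast<uint32_t>(it - value_table->begin()));
  }
}
```

For the example tensor data above, this produces the value table
`[2, 4, 10, 4, ...]` ordering shown by the asserts below; six unique values
still fit in a 3-bit index.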

### Per-channel Quantized Tensor Value Tables

For per-channel quantized tensors, a `value table` is present for each channel.
All of the `value tables` are concatenated together into a single contiguous
set of values. The number of elements in each `value table` is always identical,
with zero value padding added to the end of a `value table` as necessary.

Using the previous example tensor (above) with 2 channels:
```
tensor data: [2, 4, 4, 10, 1, 7, 99, 10, 2, 4]
channel: |______0_____| |______1______|
| | | |
value table: [1, 10, 2, 4, 0, 99, 10, 2, 7, 4]
|
|__padding
```
A suitable tensor bitstring (bit width 3) for the example would be:
```
bitstring: 01001101100100001100000101010000
| | | | | | | | | | |
index: 2 3 3 1 0 3 0 1 2 4 padding
value: 2 4 4 10 1 7 99 10 2 4
channel: 0 0 0 0 0 1 1 1 1 1
```

Note that in the above example, compressed tensor indices are specific to a `value table` channel.

Also note that channel 0 (zero) in the `value table` is padded with a single
zero value at the end.
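The per-channel layout above (one table per channel, zero-padded to a common
size, concatenated into one buffer) can be sketched as follows. The
`BuildPerChannelValueTable` helper is hypothetical and uses first-appearance
order within each channel:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Concatenate per-channel value tables into one buffer, zero-padding each
// channel's table to a common stride (the size of the largest table).
// `channel_size` is the number of tensor elements per channel.
std::vector<int16_t> BuildPerChannelValueTable(
    const std::vector<int16_t>& data, size_t channel_size, size_t* stride) {
  const size_t num_channels = data.size() / channel_size;
  std::vector<std::vector<int16_t>> tables(num_channels);
  *stride = 0;
  for (size_t c = 0; c < num_channels; ++c) {
    for (size_t i = 0; i < channel_size; ++i) {
      const int16_t v = data[c * channel_size + i];
      auto& t = tables[c];
      if (std::find(t.begin(), t.end(), v) == t.end()) t.push_back(v);
    }
    *stride = std::max(*stride, tables[c].size());
  }
  std::vector<int16_t> result;
  for (auto& t : tables) {
    t.resize(*stride, 0);  // zero padding at the end of the channel table
    result.insert(result.end(), t.begin(), t.end());
  }
  return result;
}
```

For the 2-channel example above, channel 0 has four unique values and channel 1
has five, so the common stride is 5 and channel 0 is padded with one zero.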

# The MicroInterpreter and Tensor Decompression

The model schema `Metadata` is first searched for the `COMPRESSION_METADATA` key.
If found, the associated buffer is decoded using the [compression schema](https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/compression/metadata.fbs). For each `LutTensor` in the compression schema,
a `LookupTableData` ([compression.h](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/compression.h))
structure is instantiated.

```cpp
struct LookupTableData {
  static constexpr size_t kMaxBitWidth = 7;
  static constexpr size_t kMaxValueTableChannelStride = 128;

  const void* value_table;             // Pointer into FlatBuffer Values.
  uint8_t value_table_channel_stride;  // elements per channel
  uint8_t compressed_bit_width : 3;    // 1 to 7 bits
  bool is_per_channel_quantized : 1;   // tensor is per-channel quantized
  bool use_alternate_axis : 1;         // shape default channel:
                                       // 0 = first, 1 = last
  uint8_t reserved : 3;
};
```
* `value_table`: Pointer to the buffer memory containing the `value table`.
Determined by resolving `LutTensor.value_buffer` to the corresponding buffer
in the model schema.
* `value_table_channel_stride`: The number of elements (not bytes) between
`value table` channels. Only valid for per-channel quantized tensors.
* `compressed_bit_width`: Number of bits for each `value table` index.
Determined from `LutTensor.index_bitwidth`.
* `is_per_channel_quantized`: Will be `true` for per-channel quantized
tensors. Determined by inspecting the size of the tensor quantization scale
vector in the model schema: if the vector size is greater than 1 (one), the
tensor is assumed to be per-channel quantized.
Default value is `false`.
* `use_alternate_axis`: Arrangement of tensor data vs. channel number.
See the quantized dimension section below for additional explanation.
Only valid for per-channel quantized tensors. Default value is `false`.
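Putting the fields together, the decompression loop can be sketched roughly as
follows. This is a simplified, hypothetical version for an INT8 tensor with the
field names flattened into parameters; it is not the TFLM implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of LUT decompression using LookupTableData-style fields:
// `bit_width` ~ compressed_bit_width, `channel_stride` ~
// value_table_channel_stride, `use_alternate_axis` as described above.
std::vector<int8_t> Decompress(const uint8_t* bitstring, size_t num_elements,
                               size_t bit_width, const int8_t* value_table,
                               size_t channel_stride, size_t num_channels,
                               bool use_alternate_axis) {
  std::vector<int8_t> out(num_elements);
  size_t bit_pos = 0;
  const size_t per_channel = num_elements / num_channels;
  for (size_t i = 0; i < num_elements; ++i) {
    // Read the next big-endian, MSB-first index from the bitstring.
    uint32_t index = 0;
    for (size_t b = 0; b < bit_width; ++b, ++bit_pos) {
      index = (index << 1) |
              ((bitstring[bit_pos / 8] >> (7 - (bit_pos % 8))) & 1);
    }
    // Select the value table slice for this element's channel.
    const size_t channel =
        use_alternate_axis ? (i % num_channels) : (i / per_channel);
    out[i] = value_table[channel * channel_stride + index];
  }
  return out;
}
```

Applied to the per-channel example above (bitstring bytes `0x4D 0x90 0xC1
0x50`, bit width 3, stride 5, 2 channels), this recovers the original tensor
data `[2, 4, 4, 10, 1, 7, 99, 10, 2, 4]`.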

## Quantized Dimension

Each per-channel quantized tensor has, as part of its quantization information
in the model schema, a `quantized_dimension` field. This field specifies the
dimension of the tensor shape along which the scale and zero-point values are
applied, sometimes referred to as the `quantization axis`. The
`quantization axis` determines how the tensor data is interpreted with respect
to channel number.

The tensor decompression methods use `LookupTableData.use_alternate_axis` to
determine the correct `value table` channel for each tensor element. When the
`quantized_dimension` field is 0 (zero), `use_alternate_axis` is `false`.
When the `quantized_dimension` field is 3 (three) (e.g. DEPTHWISE_CONV),
`use_alternate_axis` is `true`.

For a tensor with shape [4, 2, 2, 1] and `use_alternate_axis` equal to `false`,
the tensor data is assumed to be arranged as follows:
```
element number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
channel number: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3
```
For a tensor with shape [1, 2, 2, 4] and `use_alternate_axis` equal to `true`,
the tensor data is assumed to be arranged as follows:
```
element number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
channel number: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
```
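The two layouts above amount to a simple mapping from flat element number to
channel number, which could be sketched as follows (`ChannelOfElement` is a
hypothetical helper, not a TFLM function):

```cpp
#include <cstddef>

// Map a flat element index to its channel. With the default axis
// (use_alternate_axis == false), consecutive runs of elements share a
// channel; with the alternate axis, channels interleave element by element.
size_t ChannelOfElement(size_t element, size_t num_channels,
                        size_t num_elements, bool use_alternate_axis) {
  if (use_alternate_axis) {
    return element % num_channels;
  }
  return element / (num_elements / num_channels);
}
```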

## Decompressing a Tensor

Decompression support can easily be added to any kernel. Tensor data is
decompressed into the designated memory buffer, and is available for the
lifetime of that buffer.

Only the following methods are required to implement decompression within
kernel code:
* `MicroContext::AllocateDecompressionScratchBuffer` ([micro_context.h](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/micro_context.h)):
Allocates a scratch memory buffer within the `MicroInterpreter` to hold the
decompressed tensor data.
* `MicroContext::GetTensorCompressionData` ([micro_context.h](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/micro_context.h)):
Retrieves compressed tensor information (see [compression.h](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/compression.h)).
* `tflite::micro::GetTensorData` ([kernel_util.h](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/kernels/kernel_util.h)):
The four-argument version of this method will automatically decompress the
tensor data into the supplied scratch memory buffer.

Please see the [TRANSPOSE_CONV](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/kernels/transpose_conv.cc)
reference kernel code for an example of how tensor decompression is implemented.

# How to Compress a Model

Compression works best when the targeted tensors in the model have been binned.
Binning of the model tensors will result in a change in model accuracy, but will
also allow for better control of the compression ratio. For example, by binning
a tensor to just four values among the tensor elements, a fixed width of two
bits can be used for each element. This would result in nearly a four-fold
decrease in the size of an INT8 tensor.
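The size arithmetic behind that four-fold figure can be sketched as follows
(`CompressedSizeBytes` is a hypothetical helper for illustration only):

```cpp
#include <cstddef>

// Estimated compressed footprint of a tensor, in bytes: the packed bitstring
// (rounded up to a whole byte) plus the value table.
size_t CompressedSizeBytes(size_t num_elements, size_t bit_width,
                           size_t num_values, size_t element_size) {
  const size_t bitstring_bytes = (num_elements * bit_width + 7) / 8;
  const size_t value_table_bytes = num_values * element_size;
  return bitstring_bytes + value_table_bytes;
}
```

For example, an INT8 tensor of 1000 elements binned to four values packs into
250 bytes of bitstring plus a 4-byte value table, 254 bytes in place of the
original 1000.
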

Tensors to compress are specified with the `--tensors="#, #, ...#"` flag.
Per-channel quantized tensors using an alternate quantization axis (such as the
filter tensor supplied to DEPTHWISE_CONV) must use the `--alt_axis_tensors=` flag.

First, align your binned model:
```
bazel run --cache_test_results=no --test_output=all -s tensorflow/lite/micro/tools:tflite_flatbuffer_align -- binned_model.tflite binned_and_aligned.tflite
```
Next, compress the model, supplying as arguments the target tensors:
```
bazel run --cache_test_results=no --test_output=all -s tensorflow/lite/micro/compression:compress -- binned_and_aligned.tflite compressed.tflite --tensors="1, 2, 7, 10, 3, 5"
```
Finally, align the compressed model:
```
bazel run --cache_test_results=no --test_output=all -s tensorflow/lite/micro/tools:tflite_flatbuffer_align -- compressed.tflite compressed_and_aligned.tflite
```

# The Generic Benchmark Application

The Generic Benchmark Application can be used to see the size of the model, the
amount of arena memory used, and the size of the interpreter data structures
including those involved with tensor compression.

The benchmark also reports total inference time, as well as time taken for
tensor decompression. Timing data may be either wall-clock time or processor
cycle time. The type of timing data is dependent on the underlying platform
and/or simulator used. In some cases, no timing data is available.

The benchmark output includes a CRC32 of the output tensor(s) for comparison
within the same platform on which the benchmark is run.

For additional information on the Generic Benchmark Application, please refer to
this [document](https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/tools/benchmarking/README.md).

## How to Run the Generic Benchmark Application

The Generic Benchmark Application can only be built using `make`.

### Without Compression

HIFI3 example:
```
make -f ${TENSORFLOW_ROOT}tensorflow/lite/micro/tools/make/Makefile BUILD_TYPE=default run_tflm_benchmark -j$(nproc) GENERIC_BENCHMARK_MODEL_PATH=binned_and_aligned.tflite TARGET=xtensa TARGET_ARCH=hifi3 OPTIMIZED_KERNEL_DIR=xtensa XTENSA_CORE=HIFI_190304_swupgrade
```
The model path can be an absolute path, or relative to your local TFLM repository.

### With Compression

HIFI5 example:
```
make -f ${TENSORFLOW_ROOT}tensorflow/lite/micro/tools/make/Makefile BUILD_TYPE=default run_tflm_benchmark -j$(nproc) GENERIC_BENCHMARK_MODEL_PATH=compressed_and_aligned.tflite TARGET=xtensa TARGET_ARCH=hifi5 OPTIMIZED_KERNEL_DIR=xtensa XTENSA_CORE=PRD_H5_RDO_07_01_2022 USE_TFLM_COMPRESSION=1
```
The model path can be an absolute path, or relative to your local TFLM repository.
