以下文档概述了 TensorFlow Lite 8 位量化方案的规范。 旨在帮助硬件开发人员为量化 TensorFlow Lite 模型的推理提供硬件支持。
规范摘要
我们提供了一个规范,并且只有在遵循规范的情况下,我们才能对行为提供一些保证。 我们也理解不同的硬件可能具有导致实现略有偏差的偏好和限制,从而导致实现不完全相同。 虽然这在大多数情况下可能是可以接受的(我们将提供一套测试,据我们所知,这些测试包括从多个模型中收集的每个操作容差),但机器学习(在最常见的情况下为深度学习)的性质使得不可能提供任何硬性保证。
8 位量化使用以下公式来近似浮点值。
\[real\_value = (int8\_value - zero\_point) \times scale\]
每个轴(在 Conv 操作中也称为每个通道)或每个张量权重由 int8
范围内的二进制补码值表示 [-127, 127]
零点等于 0。 每个张量激活/输入由 int8
范围内的二进制补码值表示 [-128, 127]
,零点在范围内 [-128, 127]
。
对于特定操作,还有其他例外情况,这些情况将在下面记录。
有符号整数与无符号整数
TensorFlow Lite 量化将主要优先考虑针对 8 位的 int8
量化的工具和内核。 这是因为对称量化由零点等于 0 表示。 此外,许多后端对 int8xint8
累加进行了额外的优化。
每个轴与每个张量
逐张量量化意味着整个张量只有一个比例因子和/或零点。逐轴量化意味着在 `quantized_dimension` 中的每个切片都将有一个比例因子和/或 `zero_point`。量化维度指定了张量形状中与比例因子和零点相对应的维度。例如,一个张量 `t`,其维度为 `dims=[4, 3, 2, 1]`,量化参数为:`scale=[1.0, 2.0, 3.0]`,`zero_point=[1, 2, 3]`,`quantization_dimension=1` 将在 `t` 的第二个维度上进行量化。
t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
通常,`quantized_dimension` 是卷积权重的 `output_channel`,但在理论上它可以是对应于内核实现中每个点积的维度,从而在不影响性能的情况下实现更高的量化粒度。这将显著提高精度。
TFLite 正在为越来越多的操作提供逐轴支持。截至本文撰写之时,Conv2d 和 DepthwiseConv2d 已支持逐轴量化。
对称与非对称
激活是非对称的:它们的零点可以在有符号 `int8` 范围内 `[-128, 127]` 中的任何位置。许多激活本质上是非对称的,零点是一种相对廉价的方式,可以有效地获得额外的二进制位精度。由于激活只乘以常数权重,因此可以对常数零点值进行大量优化。
权重是对称的:强制零点等于 0。权重值乘以动态输入和激活值。这意味着权重的零点与激活值相乘会不可避免地带来运行时成本。通过强制零点为 0,我们可以避免这种成本。
数学解释:这类似于 arXiv:1712.05877 中的第 2.3 节,除了我们允许比例因子为逐轴的。这很容易推广,如下所示
\(A\) 是一个 \(m \times n\) 的量化激活矩阵。
\(B\) 是一个 \(n \times p\) 的量化权重矩阵。
考虑将 \(A\) 的第 \(j\) 行 \(a_j\) 乘以 \(B\) 的第 \(k\) 列 \(b_k\),两者长度均为 \(n\)。量化整数和零点值分别为 \(q_a\),\(z_a\) 和 \(q_b\),\(z_b\)。
\[a_j \cdot b_k = \sum_{i=0}^{n} a_{j}^{(i)} b_{k}^{(i)} = \sum_{i=0}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) = \sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)} - \sum_{i=0}^{n} q_{a}^{(i)} z_b - \sum_{i=0}^{n} q_{b}^{(i)} z_a + \sum_{i=0}^{n} z_a z_b\]
\(\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}\) 项是不可避免的,因为它执行了输入值和权重值的点积。
\(\sum_{i=0}^{n} q_{b}^{(i)} z_a\) 和 \(\sum_{i=0}^{n} z_a z_b\) 项由每个推理调用中保持不变的常数组成,因此可以预先计算。
\(\sum_{i=0}^{n} q_{a}^{(i)} z_b\) 项需要在每次推理中计算,因为激活在每次推理中都会发生变化。通过强制权重对称,我们可以消除该项的成本。
int8 量化算子规范
下面描述了 int8 tflite 内核的量化要求。
ADD
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
AVERAGE_POOL_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
CONCATENATION
Input ...:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
CONV_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 0)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-axis
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
DEPTHWISE_CONV_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 3)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-axis
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
FULLY_CONNECTED
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1 (Weight):
data_type : int8
range : [-127, 127]
granularity: per-axis (dim = 0)
restriction: zero_point = 0
Input 2 (Bias):
data_type : int32
range : [int32_min, int32_max]
granularity: per-tensor
restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
L2_NORMALIZATION
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 128.0, 0)
LOGISTIC
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 256.0, -128)
MAX_POOL_2D
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
MUL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
RESHAPE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
RESIZE_BILINEAR
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
SOFTMAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 256.0, -128)
SPACE_TO_DEPTH
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
TANH
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (1.0 / 128.0, 0)
PAD
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
GATHER
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
BATCH_TO_SPACE_ND
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
SPACE_TO_BATCH_ND
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
TRANSPOSE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
MEAN
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SUB
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SQUEEZE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
LOG_SOFTMAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: (scale, zero_point) = (16.0 / 256.0, 127)
MAXIMUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
ARG_MAX
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
MINIMUM
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
LESS
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
PADV2
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
GREATER
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
GREATER_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
LESS_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SLICE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
restriction: Input and outputs must all have same scale/zero_point
EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
NOT_EQUAL
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Input 1:
data_type : int8
range : [-128, 127]
granularity: per-tensor
SHAPE
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
QUANTIZE (Requantization)
Input 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor
Output 0:
data_type : int8
range : [-128, 127]
granularity: per-tensor