High-performance image generation using Stable Diffusion in KerasCV

Authors: fchollet, lukewood, divamgupta
Generate new images using KerasCV's StableDiffusion model.

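Overview

In this guide, we will show how to generate novel images based on a text prompt using the KerasCV implementation of stability.ai's text-to-image model, Stable Diffusion.

Stable Diffusion is a powerful, open-source text-to-image generation model. While there are multiple open-source implementations that let you easily create images from text prompts, KerasCV's offers a few distinct advantages. These include XLA compilation and mixed precision support, which together achieve state-of-the-art generation speed.

In this guide, we will explore KerasCV's Stable Diffusion implementation, show how to use these performance boosts, and measure the performance benefits they offer.

To get started, let's install a few dependencies and sort out some imports: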

pip install tensorflow keras_cv --upgrade --quiet

import time
import keras_cv
from tensorflow import keras
import matplotlib.pyplot as plt

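Introduction

Most tutorials first explain a topic and then show how to implement it. With text-to-image generation, it is easier to show than to tell.

Check out the power of keras_cv.models.StableDiffusion().

First, we construct a model: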

model = keras_cv.models.StableDiffusion(img_width=512, img_height=512)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE

Next, we give it a prompt:

images = model.text_to_image("photograph of an astronaut riding a horse", batch_size=3)


def plot_images(images):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        plt.imshow(images[i])
        plt.axis("off")


plot_images(images)

Downloading data from https://github.com/openai/CLIP/blob/main/clip/bpe_simple_vocab_16e6.txt.gz?raw=true
1356917/1356917 [==============================] - 0s 0us/step
Downloading data from https://hugging-face.cn/fchollet/stable-diffusion/resolve/main/kcv_encoder.h5
492466864/492466864 [==============================] - 9s 0us/step
Downloading data from https://hugging-face.cn/fchollet/stable-diffusion/resolve/main/kcv_diffusion_model.h5
3439090152/3439090152 [==============================] - 63s 0us/step
50/50 [==============================] - 126s 295ms/step
Downloading data from https://hugging-face.cn/fchollet/stable-diffusion/resolve/main/kcv_decoder.h5
198180272/198180272 [==============================] - 2s 0us/step

Pretty incredible!

But that's not all this model can do. Let's try a more complex prompt:

images = model.text_to_image(
    "cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
plot_images(images)

50/50 [==============================] - 15s 294ms/step

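The possibilities are literally endless (or at least extend to the boundaries of Stable Diffusion's latent manifold).

Wait, how does this even work?

Unlike what you might expect at this point, Stable Diffusion doesn't actually run on magic. It is a kind of "latent diffusion model". Let's dig into what that means.

You may be familiar with the idea of "super-resolution": it is possible to train a deep learning model to "denoise" an input image, and thereby turn it into a higher-resolution version. The model doesn't do this by magically recovering the information that is missing from the noisy, low-resolution input. Rather, it uses its training data distribution to hallucinate the visual details that would be most likely given the input. To learn more about super-resolution, you can check out the Keras.io tutorials on the topic.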

Super-resolution

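When you push this idea to the limit, you may start asking: what if we just run such a model on pure noise? The model would then "denoise the noise" and start hallucinating a brand new image. By repeating the process multiple times, you can turn a small patch of noise into an increasingly clear and high-resolution artificial picture.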

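This is the key idea of latent diffusion, proposed in the 2021 paper High-Resolution Image Synthesis with Latent Diffusion Models. To understand diffusion in depth, you can check the Keras.io tutorial Denoising Diffusion Implicit Models.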

Denoising diffusion
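
To make the iterative idea concrete, here is a deliberately tiny sketch in plain Python. It is not a diffusion model: the hypothetical `toy_denoise_step` function simply nudges a noise patch toward a fixed `target` list, which stands in for the details a trained network would hallucinate from its training distribution. The function names, the `strength` value, and the step counts are all assumptions made for illustration.

```python
import random


def toy_denoise_step(patch, target, strength=0.2):
    """Move every value a fraction of the way toward the stand-in target."""
    return [p + strength * (t - p) for p, t in zip(patch, target)]


def toy_generate(target, num_steps=50, seed=0):
    """Start from pure Gaussian noise and refine it num_steps times."""
    rng = random.Random(seed)
    patch = [rng.gauss(0.0, 1.0) for _ in target]
    for _ in range(num_steps):
        patch = toy_denoise_step(patch, target)
    return patch


def max_error(patch, target):
    """Largest remaining distance between the patch and the target."""
    return max(abs(p - t) for p, t in zip(patch, target))


target = [0.0, 0.25, 0.5, 0.75, 1.0]
few_steps = toy_generate(target, num_steps=5)
many_steps = toy_generate(target, num_steps=50)
print(f"error after 5 steps:  {max_error(few_steps, target):.6f}")
print(f"error after 50 steps: {max_error(many_steps, target):.6f}")
```

In this toy, every extra step removes a fixed fraction of the remaining noise, so a longer loop ends closer to the target. A real diffusion model replaces the fixed target with a learned prediction at every step.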

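Now, to go from latent diffusion to a text-to-image system, you still need to add one key feature: the ability to control the generated visual content via prompt keywords. This is done via "conditioning", a classic deep learning technique that consists of concatenating a vector representing a bit of text to the noise patch, then training the model on a dataset of {image: caption} pairs.

This gives rise to the Stable Diffusion architecture. Stable Diffusion consists of three parts:

A text encoder, which turns your prompt into a latent vector.
A diffusion model, which repeatedly "denoises" a 64x64 latent image patch.
A decoder, which turns the final 64x64 latent patch into a higher-resolution 512x512 image.

First, your text prompt is projected into a latent vector space by the text encoder, which is simply a pretrained, frozen language model. Then that prompt vector is concatenated to a randomly generated noise patch, which is repeatedly "denoised" by the diffusion model over a series of "steps" (the more steps you run, the clearer and nicer your image will be; the default value is 50 steps).

Finally, the 64x64 latent image is sent through the decoder to render it properly in high resolution.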

The Stable Diffusion architecture
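
The shapes flowing through the three components can be sketched with a little arithmetic. The helpers below are hypothetical (they are not part of KerasCV) and assume the usual Stable Diffusion layout: the decoder upsamples by a factor of 8 in each spatial dimension, and the latent patch has 4 channels. Neither number is stated in this guide, although the factor of 8 is consistent with the 64x64 and 512x512 sizes quoted above.

```python
# Assumptions (not stated in this guide): spatial factor 8, 4 latent channels.
DOWNSAMPLE_FACTOR = 8
LATENT_CHANNELS = 4


def latent_shape(img_height, img_width):
    """Shape of the latent patch that the diffusion model denoises."""
    if img_height % DOWNSAMPLE_FACTOR or img_width % DOWNSAMPLE_FACTOR:
        raise ValueError("image sides must be multiples of 8 in this sketch")
    return (
        img_height // DOWNSAMPLE_FACTOR,
        img_width // DOWNSAMPLE_FACTOR,
        LATENT_CHANNELS,
    )


def decoded_shape(latent):
    """Shape of the RGB image the decoder produces from a latent patch."""
    height, width, _ = latent
    return (height * DOWNSAMPLE_FACTOR, width * DOWNSAMPLE_FACTOR, 3)


latent = latent_shape(512, 512)
print("latent patch: ", latent)
print("decoded image:", decoded_shape(latent))
```

Under these assumptions the denoising loop works on 64 x 64 x 4 = 16,384 values instead of the 512 x 512 x 3 = 786,432 values of the final image, which is a large part of why running 50 steps in latent space is affordable.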

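All in all, it is a pretty simple system: the Keras implementation fits in four files that total fewer than 500 lines of code.

But this relatively simple system starts looking like magic once you train it on billions of pictures and their captions. As Feynman said about the universe: "It's not complicated, it's just a lot of it!"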

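Perks of KerasCV

With several implementations of Stable Diffusion publicly available, why should you use keras_cv.models.StableDiffusion?

Aside from the easy-to-use API, KerasCV's Stable Diffusion model comes with some powerful advantages, including:

Graph mode execution
XLA compilation through jit_compile=True
Support for mixed precision computation

When these are combined, the KerasCV Stable Diffusion model runs orders of magnitude faster than naive implementations. This section shows how to enable all of these features, and the resulting performance gain yielded from using them.

For comparison, we ran benchmarks measuring the runtime of the HuggingFace diffusers implementation of Stable Diffusion against the KerasCV implementation. Both implementations were tasked with generating 3 images with a step count of 50 for each image. The benchmarks were run on Tesla T4 and Tesla V100 GPUs.

All of our benchmarks are open source on GitHub, and may be re-run on Colab to reproduce the results. The results of the benchmarks are displayed in the table below: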

GPU	Model	Runtime (seconds)
Tesla T4	KerasCV (warm start)	28.97
Tesla T4	diffusers (warm start)	41.33
Tesla V100	KerasCV (warm start)	12.45
Tesla V100	diffusers (warm start)	12.72

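That is a 30% improvement in execution time on the Tesla T4! While the improvement is much smaller on the V100, we generally expect the results of the benchmark to consistently favor KerasCV across all NVIDIA GPUs.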
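
The percentages quoted here follow directly from the warm-start table. As a quick check, the snippet below recomputes them; `relative_improvement` is just an illustrative helper name.

```python
# Warm-start runtimes in seconds, copied from the table above.
warm_start = {
    "Tesla T4": {"KerasCV": 28.97, "diffusers": 41.33},
    "Tesla V100": {"KerasCV": 12.45, "diffusers": 12.72},
}


def relative_improvement(baseline, improved):
    """Fraction of the baseline runtime that is saved."""
    return (baseline - improved) / baseline


for gpu, runtimes in warm_start.items():
    saved = relative_improvement(runtimes["diffusers"], runtimes["KerasCV"])
    print(f"{gpu}: KerasCV takes {saved:.1%} less time")
```

This gives about 29.9% on the Tesla T4 and about 2.1% on the Tesla V100.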

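For the sake of completeness, both cold-start and warm-start generation times are reported. Cold-start execution time includes the one-time cost of model creation and compilation, and is therefore negligible in a production environment (where you would reuse the same model instance many times). Regardless, here are the cold-start numbers:

GPU	Model	Runtime (seconds)
Tesla T4	KerasCV (cold start)	83.47
Tesla T4	diffusers (cold start)	46.27
Tesla V100	KerasCV (cold start)	76.43
Tesla V100	diffusers (cold start)	13.90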

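While the runtime results you get from running this guide may vary, in our testing the KerasCV implementation of Stable Diffusion was significantly faster than its PyTorch counterpart in warm-start runs. This may be largely attributed to XLA compilation.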
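
One way to read the cold-start and warm-start tables together is to ask how many generation calls it takes before KerasCV's slower start is paid back. The sketch below assumes a simple cost model that this guide does not itself state: the first call costs the cold-start time and every later call costs the warm-start time. The helper names are hypothetical.

```python
# Runtimes in seconds from the two tables above: (cold start, warm start).
timings = {
    "Tesla T4": {"KerasCV": (83.47, 28.97), "diffusers": (46.27, 41.33)},
    "Tesla V100": {"KerasCV": (76.43, 12.45), "diffusers": (13.90, 12.72)},
}


def total_time(cold, warm, num_calls):
    """Assumed cost model: one cold call followed by warm calls."""
    return cold + (num_calls - 1) * warm


def break_even_calls(kerascv, diffusers, max_calls=10_000):
    """Smallest number of calls at which KerasCV's total time is lower."""
    for num_calls in range(1, max_calls + 1):
        if total_time(*kerascv, num_calls) < total_time(*diffusers, num_calls):
            return num_calls
    return None


for gpu, impls in timings.items():
    calls = break_even_calls(impls["KerasCV"], impls["diffusers"])
    print(f"{gpu}: KerasCV is ahead from call {calls} onward")
```

Under this model, KerasCV pulls ahead at the 5th call on the Tesla T4 and at the 233rd call on the Tesla V100, where each call generates a batch of 3 images. The one-time compilation cost is therefore recovered quickly on the T4 but only under sustained use on the V100.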

To get started, let's first benchmark our unoptimized model:

benchmark_result = []
start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Standard", end - start])
plot_images(images)

print(f"Standard model: {(end - start):.2f} seconds")
keras.backend.clear_session()  # Clear session to preserve memory.

50/50 [==============================] - 15s 294ms/step
Standard model: 15.02 seconds

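Mixed precision

"Mixed precision" consists of performing computation using float16 precision, while storing weights in the float32 format. This is done to take advantage of the fact that float16 operations are backed by significantly faster kernels than their float32 counterparts on modern NVIDIA GPUs.

Enabling mixed precision computation in Keras (and therefore for keras_cv.models.StableDiffusion) is as simple as calling: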

keras.mixed_precision.set_global_policy("mixed_float16")

That's all. Out of the box, it just works.

model = keras_cv.models.StableDiffusion()

print("Compute dtype:", model.diffusion_model.compute_dtype)
print(
    "Variable dtype:",
    model.diffusion_model.variable_dtype,
)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE
Compute dtype: float16
Variable dtype: float32

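As you can see, the model constructed above now uses mixed precision computation: it leverages the speed of float16 operations for computation while storing variables in float32 precision.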
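
Why keep the variables in float32? A small standard-library experiment illustrates the usual motivation for mixed precision in general (this is our explanation, not something this guide states): float16 has so few significand bits that a small change added to a value can be rounded away entirely. The helper names below are hypothetical, and `struct` is used only to round a Python float to each precision.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


weight = 1.0
update = 2.0 ** -12  # a small change applied to the stored value

print("float16:", to_float16(weight + update))
print("float32:", to_float32(weight + update))
```

In float16 the sum rounds back to 1.0, while float32 keeps the change (1.000244140625). Storing variables in float32 avoids losing such small differences, while the heavy arithmetic still runs in float16.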

# Warm up model to run graph tracing before benchmarking.
model.text_to_image("warming up the model", batch_size=3)

start = time.time()
images = model.text_to_image(
    "a cute magical flying dog, fantasy art, "
    "golden color, high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting, mystery, adventure",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["Mixed Precision", end - start])
plot_images(images)

print(f"Mixed precision model: {(end - start):.2f} seconds")
keras.backend.clear_session()

50/50 [==============================] - 24s 229ms/step
50/50 [==============================] - 11s 229ms/step
Mixed precision model: 11.87 seconds
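
The warm-up-then-time pattern used above can be factored into a small helper. This is only a sketch: `time_call` and `fake_generate` are hypothetical names, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time


def time_call(fn, *args, warmup_runs=1, **kwargs):
    """Run fn untimed warmup_runs times, then time one more call.

    Returns (result, elapsed_seconds) for the timed call.
    """
    for _ in range(warmup_runs):
        fn(*args, **kwargs)  # e.g. triggers graph tracing or compilation
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload so the sketch runs without a GPU or model weights.
def fake_generate(prompt, batch_size=1):
    time.sleep(0.01)
    return [f"{prompt} #{i}" for i in range(batch_size)]


images, seconds = time_call(fake_generate, "an avocado armchair", batch_size=3)
print(f"{len(images)} results in {seconds:.3f} seconds")
```

With the real model, the call would look like `time_call(model.text_to_image, prompt, batch_size=3)`, so that the one-time tracing cost stays out of the measurement.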

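XLA compilation

TensorFlow comes with XLA built in: the Accelerated Linear Algebra compiler. keras_cv.models.StableDiffusion supports a jit_compile argument out of the box. Setting this argument to True enables XLA compilation, resulting in a significant speed-up.

Let's use this below: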

# Set back to the default for benchmarking purposes.
keras.mixed_precision.set_global_policy("float32")

model = keras_cv.models.StableDiffusion(jit_compile=True)
# Before we benchmark the model, we run inference once to make sure the TensorFlow
# graph has already been traced.
images = model.text_to_image("An avocado armchair", batch_size=3)
plot_images(images)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE
50/50 [==============================] - 71s 233ms/step

Let's benchmark our XLA model:

start = time.time()
images = model.text_to_image(
    "A cute otter in a rainbow whirlpool holding shells, watercolor",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA", end - start])
plot_images(images)

print(f"With XLA: {(end - start):.2f} seconds")
keras.backend.clear_session()

50/50 [==============================] - 12s 233ms/step
With XLA: 11.84 seconds

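On an A100 GPU, we get about a 2x speedup. Fantastic! (The run logged above shows a smaller gain: 11.84 seconds, down from 15.02 seconds for the standard model.)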
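
`jit_compile` is also available as an argument of `tf.function`, which makes it possible to try XLA on a small function without loading Stable Diffusion. The sketch below assumes a TensorFlow build with XLA support; `scaled_sum` is a hypothetical toy function, and no speedup should be expected at this size.

```python
import tensorflow as tf


@tf.function(jit_compile=True)  # ask TensorFlow to compile this function with XLA
def scaled_sum(x, y):
    """Compute sum(2 * x + y) as a single XLA-compiled function."""
    return tf.reduce_sum(2.0 * x + y)


x = tf.ones((2, 2))
y = tf.ones((2, 2))
# The first call traces and compiles; later calls with the same input
# shapes and dtypes reuse the compiled program.
print(float(scaled_sum(x, y)))
```

The same principle explains the timings in this section: the first call to an XLA-compiled model is slow because it includes compilation, and only later calls show the steady-state speed.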

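Putting it all together

So, how do you assemble the world's most performant Stable Diffusion inference pipeline (as of September 2022)?

With these two lines of code: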

keras.mixed_precision.set_global_policy("mixed_float16")
model = keras_cv.models.StableDiffusion(jit_compile=True)

By using this model checkpoint, you acknowledge that its usage is subject to the terms of the CreativeML Open RAIL-M license at https://raw.githubusercontent.com/CompVis/stable-diffusion/main/LICENSE

And to use it...

# Let's make sure to warm up the model
images = model.text_to_image(
    "Teddy bears conducting machine learning research",
    batch_size=3,
)
plot_images(images)

50/50 [==============================] - 71s 144ms/step

Exactly how fast is it? Let's find out!

start = time.time()
images = model.text_to_image(
    "A mysterious dark stranger visits the great pyramids of egypt, "
    "high quality, highly detailed, elegant, sharp focus, "
    "concept art, character concepts, digital painting",
    batch_size=3,
)
end = time.time()
benchmark_result.append(["XLA + Mixed Precision", end - start])
plot_images(images)

print(f"XLA + mixed precision: {(end - start):.2f} seconds")

50/50 [==============================] - 7s 144ms/step
XLA + mixed precision: 7.51 seconds

Let's check out the results:

print("{:<22} {:<22}".format("Model", "Runtime"))
for result in benchmark_result:
    name, runtime = result
    print("{:<22} {:<22}".format(name, runtime))

Model                  Runtime               
Standard               15.015103816986084    
Mixed Precision        11.867290258407593    
XLA                    11.838508129119873    
XLA + Mixed Precision  7.507506370544434
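
To put these timings in relative terms, the snippet below divides the standard runtime by each of the others. The numbers are copied from the output above (the `logged_results` name is ours); on other hardware they will differ.

```python
# Runtimes in seconds, copied from the printed results above.
logged_results = [
    ["Standard", 15.015103816986084],
    ["Mixed Precision", 11.867290258407593],
    ["XLA", 11.838508129119873],
    ["XLA + Mixed Precision", 7.507506370544434],
]

baseline = logged_results[0][1]
speedups = [(name, baseline / runtime) for name, runtime in logged_results]

print("{:<22} {:<10}".format("Model", "Speedup"))
for name, speedup in speedups:
    print("{:<22} {:.2f}x".format(name, speedup))
```

In this run, each optimization on its own gives roughly a 1.27x speedup, and the two combined give about 2.00x.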

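On an A100 GPU, our fully optimized model took only four seconds to generate three novel images from a text prompt. (In the run logged above, it took 7.51 seconds.)

Conclusions

KerasCV offers a state-of-the-art implementation of Stable Diffusion, and through the use of XLA and mixed precision, it delivers the fastest Stable Diffusion pipeline available as of September 2022.

Normally, at the end of a keras.io tutorial we leave you with some future directions to continue learning. This time, we leave you with one idea:

Go run your own prompts through the model! It is an absolute blast!

If you have your own NVIDIA GPU or an M1 MacBook Pro, you can also run the model locally on your machine. (Note that when running on an M1 MacBook Pro, you should not enable mixed precision, as it is not yet well supported by Apple's Metal runtime.)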