Policies

Copyright 2023 The TF-Agents Authors.

Introduction

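In reinforcement learning terminology, a policy maps an observation from the environment to an action or a distribution over actions. In TF-Agents, observations from the environment are contained in a named tuple TimeStep('step_type', 'discount', 'reward', 'observation'), and policies map time steps to actions or distributions over actions. Most policies use timestep.observation, and some policies use timestep.step_type (for example, to reset the state at the beginning of an episode in stateful policies), but timestep.discount and timestep.reward are usually ignored.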
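
To make this mapping concrete, here is a minimal, library-free sketch. SimpleTimeStep and threshold_policy are hypothetical stand-ins that only mirror the four fields named above; they are not the TF-Agents classes.

```python
import collections

# Hypothetical stand-in mirroring the four TimeStep fields named above.
SimpleTimeStep = collections.namedtuple(
    'SimpleTimeStep', ['step_type', 'discount', 'reward', 'observation'])


def threshold_policy(time_step):
  """Maps a time step to an action using only the observation."""
  # step_type, discount and reward are ignored, as in most policies.
  return 1 if sum(time_step.observation) > 0 else 0


first_step = SimpleTimeStep(
    step_type=0, discount=1.0, reward=0.0, observation=[0.5, -0.2])
chosen_action = threshold_policy(first_step)
print(chosen_action)
```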

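Policies are related to other components in TF-Agents in the following way. Most policies have a neural network that computes actions and/or distributions over actions from time steps. Agents can contain one or more policies for different purposes, for example a main policy that is being trained for deployment and a noisy policy for data collection. Policies can be saved and restored, and can be used independently of the agent for data collection, evaluation, and so on.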

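Some policies are easier to write in TensorFlow (for example, those with a neural network), whereas others are easier to write in Python (for example, those that follow a script of actions). So in TF-Agents, we allow both Python and TensorFlow policies. Moreover, policies written in TensorFlow might have to be used in a Python environment, or vice versa; for example, a TensorFlow policy is used for training but later deployed in a production Python environment. To make this easier, we provide wrappers for converting between Python and TensorFlow policies.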

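Another interesting class of policies is policy wrappers, which modify a given policy in a certain way, for example by adding a particular type of noise, making a greedy or epsilon-greedy version of a stochastic policy, or randomly mixing multiple policies.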
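
As a rough illustration of the wrapper idea, the sketch below builds an epsilon-greedy version of a base policy in plain Python. The names make_epsilon_greedy and always_zero are hypothetical and exist only for this example; this is not how the TF-Agents wrappers are implemented.

```python
import random


def make_epsilon_greedy(greedy_action_fn, num_actions, epsilon, rng=None):
  """Wraps a greedy action function with epsilon-greedy exploration."""
  rng = rng or random.Random()

  def wrapped_action_fn(observation):
    # With probability epsilon, explore with a uniformly random action.
    if rng.random() < epsilon:
      return rng.randrange(num_actions)
    # Otherwise defer to the wrapped (greedy) policy.
    return greedy_action_fn(observation)

  return wrapped_action_fn


def always_zero(observation):
  """Hypothetical base policy that always picks action 0."""
  del observation
  return 0


explore_never = make_epsilon_greedy(always_zero, num_actions=3, epsilon=0.0)
explore_always = make_epsilon_greedy(
    always_zero, num_actions=3, epsilon=1.0, rng=random.Random(0))

print(explore_never(None))
print([explore_always(None) for _ in range(5)])
```

With epsilon set to 0 the wrapper behaves exactly like the base policy, and with epsilon set to 1 it ignores the base policy entirely.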

Setup

If you haven't installed tf-agents yet, run:

pip install tf-agents
pip install tf-keras

import os
# Keep using keras-2 (tf-keras) rather than keras-3 (keras).
os.environ['TF_USE_LEGACY_KERAS'] = '1'

import abc
import tensorflow as tf
import tensorflow_probability as tfp
import numpy as np

from tf_agents.specs import array_spec
from tf_agents.specs import tensor_spec
from tf_agents.networks import network

from tf_agents.policies import py_policy
from tf_agents.policies import random_py_policy
from tf_agents.policies import scripted_py_policy

from tf_agents.policies import tf_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import actor_policy
from tf_agents.policies import q_policy
from tf_agents.policies import greedy_policy

from tf_agents.trajectories import time_step as ts

Python Policies

The interface for Python policies is defined in policies/py_policy.PyPolicy. The main methods are:

class Base(object):

  @abc.abstractmethod
  def __init__(self, time_step_spec, action_spec, policy_state_spec=()):
    self._time_step_spec = time_step_spec
    self._action_spec = action_spec
    self._policy_state_spec = policy_state_spec

  @abc.abstractmethod
  def reset(self, policy_state=()):
    # return initial_policy_state.
    pass

  @abc.abstractmethod
  def action(self, time_step, policy_state=()):
    # return a PolicyStep(action, state, info) named tuple.
    pass

  @abc.abstractmethod
  def distribution(self, time_step, policy_state=()):
    # Not implemented in python, only for TF policies.
    pass

  @abc.abstractmethod
  def update(self, policy):
    # update self to be similar to the input `policy`.
    pass

  @property
  def time_step_spec(self):
    return self._time_step_spec

  @property
  def action_spec(self):
    return self._action_spec

  @property
  def policy_state_spec(self):
    return self._policy_state_spec

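The most important method is action(time_step), which maps a time_step containing an observation from the environment to a PolicyStep named tuple containing the following attributes: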

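action: The action to be applied to the environment.
state: The state of the policy (for example, RNN state) to be fed into the next call to action.
info: Optional side information such as action log probabilities.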

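The time_step_spec and action_spec are specifications for the input time step and the output action. Policies also have a reset function, which is typically used for resetting the state in stateful policies. The update(new_policy) function updates self towards new_policy.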
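
The way the state field threads through successive calls can be sketched without the library. SimplePolicyStep and AlternatingPolicy below are hypothetical toy definitions that only mimic the shape of the interface described above.

```python
import collections

# Hypothetical stand-in for the PolicyStep named tuple described above.
SimplePolicyStep = collections.namedtuple(
    'SimplePolicyStep', ['action', 'state', 'info'])


class AlternatingPolicy(object):
  """Toy stateful policy: alternates between actions 0 and 1."""

  def get_initial_state(self):
    # The state is just the number of actions taken so far.
    return 0

  def action(self, time_step, policy_state=0):
    del time_step  # This toy policy ignores the observation.
    chosen = policy_state % 2
    return SimplePolicyStep(action=chosen, state=policy_state + 1, info=())


toy_policy = AlternatingPolicy()
state = toy_policy.get_initial_state()
actions = []
for _ in range(4):
  step = toy_policy.action(None, state)
  actions.append(step.action)
  state = step.state  # Feed the returned state into the next call.
print(actions)
```

The caller, not the policy object, holds the state between calls, which is why the returned state has to be passed back in explicitly.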

Now, let us look at a couple of examples of Python policies.

Example 1: Random Python Policy

A simple example of a PyPolicy is the RandomPyPolicy, which generates random actions for the given discrete/continuous action_spec. The input time_step is ignored.

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,
    action_spec=action_spec)
time_step = None
action_step = my_random_py_policy.action(time_step)
print(action_step)
action_step = my_random_py_policy.action(time_step)
print(action_step)

PolicyStep(action=array([5, 3], dtype=int32), state=(), info=())
PolicyStep(action=array([-4,  3], dtype=int32), state=(), info=())
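
Conceptually, a random policy only needs the bounds and shape from the action spec. The following sketch, using the hypothetical helper sample_bounded_action and assuming that both bounds are inclusive, shows the idea with the standard library only; it is not the RandomPyPolicy implementation.

```python
import random


def sample_bounded_action(num_elements, minimum, maximum, discrete, rng):
  """Samples a flat action uniformly within [minimum, maximum]."""
  if discrete:
    # randint includes both endpoints, matching an inclusive integer bound.
    return [rng.randint(minimum, maximum) for _ in range(num_elements)]
  return [rng.uniform(minimum, maximum) for _ in range(num_elements)]


rng = random.Random(0)
discrete_action = sample_bounded_action(2, -10, 10, discrete=True, rng=rng)
continuous_action = sample_bounded_action(2, -1.0, 3.0, discrete=False, rng=rng)
print(discrete_action, continuous_action)
```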

Example 2: Scripted Python Policy

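A scripted policy plays back a script of actions represented as a list of (num_repeats, action) tuples. Every time the action function is called, it returns the next action from the list until the specified number of repeats is done, and then moves on to the next action in the list. To start executing from the beginning of the list again, reset the policy state; the example below does this by calling get_initial_state().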

action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)
action_script = [(1, np.array([5, 2], dtype=np.int32)),
                 (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeats` to 0 will skip this action.
                 (2, np.array([1, 2], dtype=np.int32)),
                 (1, np.array([3, 4], dtype=np.int32))]

my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(
    time_step_spec=None, action_spec=action_spec, action_script=action_script)

policy_state = my_scripted_py_policy.get_initial_state()
time_step = None
print('Executing scripted policy...')
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)
action_step = my_scripted_py_policy.action(time_step, action_step.state)
print(action_step)

print('Resetting my_scripted_py_policy...')
policy_state = my_scripted_py_policy.get_initial_state()
action_step = my_scripted_py_policy.action(time_step, policy_state)
print(action_step)

Executing scripted policy...
PolicyStep(action=array([5, 2], dtype=int32), state=[0, 1], info=())
PolicyStep(action=array([1, 2], dtype=int32), state=[2, 1], info=())
PolicyStep(action=array([1, 2], dtype=int32), state=[2, 2], info=())
Resetting my_scripted_py_policy...
PolicyStep(action=array([5, 2], dtype=int32), state=[0, 1], info=())
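
The state values printed above can be read as [index into the script, repeats already emitted at that index]. The sketch below is a plain-Python reimplementation of that bookkeeping, not the library code; the function scripted_action, the starting state [0, 0], and the choice to raise ValueError when the script runs out are assumptions of this example.

```python
def scripted_action(action_script, state):
  """Returns (action, new_state) for a script of (num_repeats, action) pairs.

  `state` is [index into the script, repeats already emitted at that index].
  """
  index, repeats_done = state
  # Advance past entries that are exhausted (this also skips num_repeats == 0).
  while index < len(action_script) and repeats_done >= action_script[index][0]:
    index += 1
    repeats_done = 0
  if index >= len(action_script):
    # Assumption of this sketch: running past the script is an error.
    raise ValueError('The action script has been exhausted.')
  return action_script[index][1], [index, repeats_done + 1]


script = [(1, 'a'), (0, 'b'), (2, 'c'), (1, 'd')]
state = [0, 0]
trace = []
for _ in range(4):
  action, state = scripted_action(script, state)
  trace.append((action, state))
print(trace)
```

The first three states in the trace are [0, 1], [2, 1] and [2, 2], the same sequence as in the output above, and the entry with num_repeats set to 0 is never emitted.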

TensorFlow Policies

TensorFlow policies follow the same interface as Python policies. Let us look at a few examples.

Example 1: Random TF Policy

A RandomTFPolicy can be used to generate random actions according to a given discrete/continuous action_spec. The input time_step is ignored.

action_spec = tensor_spec.BoundedTensorSpec(
    (2,), tf.float32, minimum=-1, maximum=3)
input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)

my_random_tf_policy = random_tf_policy.RandomTFPolicy(
    action_spec=action_spec, time_step_spec=time_step_spec)
observation = tf.ones(time_step_spec.observation.shape)
time_step = ts.restart(observation)
action_step = my_random_tf_policy.action(time_step)

print('Action:')
print(action_step.action)

Action:
tf.Tensor([9.8276138e-04 2.8761353e+00], shape=(2,), dtype=float32)

Example 2: Actor Policy

An Actor policy can be created using either a network that maps time_steps to actions or a network that maps time_steps to distributions over actions.

Using an action network

Let us define a network as follows:

class ActionNet(network.Network):

  def __init__(self, input_tensor_spec, output_tensor_spec):
    super(ActionNet, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name='ActionNet')
    self._output_tensor_spec = output_tensor_spec
    self._sub_layers = [
        tf.keras.layers.Dense(
            output_tensor_spec.shape.num_elements(), activation=tf.nn.tanh),
    ]

  def call(self, observations, step_type, network_state):
    del step_type

    output = tf.cast(observations, dtype=tf.float32)
    for layer in self._sub_layers:
      output = layer(output)
    actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())

    # Scale and shift actions to the correct range if necessary.
    return actions, network_state

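In TensorFlow, most network layers are designed for batch operations, so we expect the input time_steps to be batched, and the output of the network will be batched as well. The network is also responsible for producing actions in the correct range of the given action_spec. This is conventionally done using, for example, a tanh activation for the final layer to produce actions in [-1, 1], and then scaling and shifting this to the correct range as given by the input action_spec (for example, see tf_agents/agents/ddpg/networks.actor_network()).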
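
The scale-and-shift step can be written out for a single scalar. This is a minimal sketch with a hypothetical helper squash_to_range, assuming scalar bounds; it illustrates the arithmetic only and is not taken from the referenced network code.

```python
import math


def squash_to_range(raw_output, minimum, maximum):
  """Maps an unbounded value into [minimum, maximum] via tanh."""
  squashed = math.tanh(raw_output)        # In [-1, 1].
  mean = (maximum + minimum) / 2.0        # Centre of the target range.
  half_width = (maximum - minimum) / 2.0  # Half the width of the range.
  return mean + half_width * squashed


scaled = [squash_to_range(x, minimum=-1.0, maximum=3.0)
          for x in (-100.0, 0.0, 100.0)]
print(scaled)
```

Very negative inputs approach the minimum, very positive inputs approach the maximum, and an input of 0 lands at the centre of the range.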

Now we can create an Actor policy using the ActionNet network defined above.

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((3,),
                                            tf.float32,
                                            minimum=-1,
                                            maximum=1)

action_net = ActionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_net)

We can apply it to any batch of time steps that follow time_step_spec:

batch_size = 2
observations = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())

time_step = ts.restart(observations, batch_size)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)

distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor(
[[ 0.85880756 -0.74206954 -0.7772715 ]
 [ 0.85880756 -0.74206954 -0.7772715 ]], shape=(2, 3), dtype=float32)
Action distribution:
tfp.distributions.Deterministic("Deterministic", batch_shape=[2, 3], event_shape=[], dtype=float32)

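In the example above, we created the policy using an action network that produces an action tensor. In this case, policy.distribution(time_step) is a deterministic (delta) distribution around the output of policy.action(time_step). One way to produce a stochastic policy is to wrap the Actor policy in a policy wrapper that adds noise to the actions. Another way is to create the Actor policy using an action distribution network instead of an action network, as shown below.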

Using an action distribution network

class ActionDistributionNet(ActionNet):

  def call(self, observations, step_type, network_state):
    action_means, network_state = super(ActionDistributionNet, self).call(
        observations, step_type, network_state)

    action_std = tf.ones_like(action_means)
    return tfp.distributions.MultivariateNormalDiag(action_means, action_std), network_state


action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)

my_actor_policy = actor_policy.ActorPolicy(
    time_step_spec=time_step_spec,
    action_spec=action_spec,
    actor_network=action_distribution_net)

action_step = my_actor_policy.action(time_step)
print('Action:')
print(action_step.action)
distribution_step = my_actor_policy.distribution(time_step)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor(
[[-1.          1.          1.        ]
 [-0.7074561   0.1602813   0.34091526]], shape=(2, 3), dtype=float32)
Action distribution:
tfp.distributions.MultivariateNormalDiag("MultivariateNormalDiag", batch_shape=[2], event_shape=[3], dtype=float32)

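Note that in the example above, actions are clipped to the range of the given action spec, [-1, 1]. This is because the ActorPolicy constructor argument clip defaults to True. Setting clip=False returns the unclipped actions produced by the network.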
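
The effect of clipping can be sketched with the standard library: sample from independent Gaussians with unit standard deviation, then clamp each component to the bounds. The helper sample_clipped_action and the mean values are hypothetical and chosen only for illustration.

```python
import random


def sample_clipped_action(means, std, minimum, maximum, rng):
  """Samples from independent Gaussians, then clips to the action bounds."""
  raw = [rng.gauss(mean, std) for mean in means]
  clipped = [min(max(value, minimum), maximum) for value in raw]
  return raw, clipped


rng = random.Random(0)
raw_action, clipped_action = sample_clipped_action(
    means=[0.8, -0.7, -0.7], std=1.0, minimum=-1.0, maximum=1.0, rng=rng)
print(raw_action)
print(clipped_action)
```

Any sampled component outside the bounds is replaced by the nearest bound, which is why values of exactly -1 or 1 can appear in clipped output.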

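Stochastic policies can be converted to deterministic policies using, for example, a GreedyPolicy wrapper, which chooses stochastic_policy.distribution().mode() as its action and uses a deterministic/delta distribution around this greedy action as its distribution().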

Example 3: Q Policy

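A Q policy is used in agents like DQN and is based on a Q network that predicts a Q-value for each discrete action. For a given time step, the action distribution in the Q policy is a categorical distribution created using the Q-values as logits.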

input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)
time_step_spec = ts.time_step_spec(input_tensor_spec)
action_spec = tensor_spec.BoundedTensorSpec((),
                                            tf.int32,
                                            minimum=0,
                                            maximum=2)
num_actions = action_spec.maximum - action_spec.minimum + 1


class QNetwork(network.Network):

  def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):
    super(QNetwork, self).__init__(
        input_tensor_spec=input_tensor_spec,
        state_spec=(),
        name=name)
    self._sub_layers = [
        tf.keras.layers.Dense(num_actions),
    ]

  def call(self, inputs, step_type=None, network_state=()):
    del step_type
    inputs = tf.cast(inputs, tf.float32)
    for layer in self._sub_layers:
      inputs = layer(inputs)
    return inputs, network_state


batch_size = 2
observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())
time_steps = ts.restart(observation, batch_size=batch_size)

my_q_network = QNetwork(
    input_tensor_spec=input_tensor_spec,
    action_spec=action_spec)
my_q_policy = q_policy.QPolicy(
    time_step_spec, action_spec, q_network=my_q_network)
action_step = my_q_policy.action(time_steps)
distribution_step = my_q_policy.distribution(time_steps)

print('Action:')
print(action_step.action)

print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor([2 0], shape=(2,), dtype=int32)
Action distribution:
tfp.distributions.Categorical("Categorical", batch_shape=[2], event_shape=[], dtype=int32)
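
Using Q-values as logits means that action probabilities are obtained by a softmax over the Q-values. The sketch below works this out for a single time step with made-up Q-values; softmax here is a local helper written for this example.

```python
import math


def softmax(logits):
  """Converts logits into probabilities that sum to 1."""
  shifted = [value - max(logits) for value in logits]  # For stability.
  exps = [math.exp(value) for value in shifted]
  total = sum(exps)
  return [value / total for value in exps]


q_values = [0.1, 1.5, -0.3]  # Hypothetical Q-values for actions 0, 1, 2.
probabilities = softmax(q_values)
greedy_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(probabilities, greedy_action)
```

Sampling from this distribution favors actions with higher Q-values but can still return other actions, whereas the mode of the distribution is always the action with the largest Q-value.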

Policy Wrappers

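A policy wrapper can be used to wrap and modify a given policy, for example to add noise. Policy wrappers are a subclass of Policy (Python/TensorFlow) and can therefore be used just like any other policy.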

Example: Greedy Policy

A greedy wrapper can be used to wrap any TensorFlow policy that implements distribution(). GreedyPolicy.action() will return wrapped_policy.distribution().mode(), and GreedyPolicy.distribution() is a deterministic/delta distribution around GreedyPolicy.action().

my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)

action_step = my_greedy_policy.action(time_steps)
print('Action:')
print(action_step.action)

distribution_step = my_greedy_policy.distribution(time_steps)
print('Action distribution:')
print(distribution_step.action)

Action:
tf.Tensor([0 0], shape=(2,), dtype=int32)
Action distribution:
tfp.distributions.Deterministic("Deterministic", batch_shape=[2], event_shape=[], dtype=int32)
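
The wrapper pattern itself is small enough to sketch in plain Python. ToyStochasticPolicy and ToyGreedyWrapper are hypothetical classes that represent a distribution as a dict of action probabilities; they mirror the behavior described above but are not the GreedyPolicy implementation.

```python
class ToyStochasticPolicy(object):
  """Hypothetical policy whose distribution is a dict of action probabilities."""

  def distribution(self, time_step):
    del time_step  # Same distribution for every input in this toy example.
    return {0: 0.2, 1: 0.7, 2: 0.1}


class ToyGreedyWrapper(object):
  """Wraps a policy and always picks the mode of its distribution."""

  def __init__(self, wrapped_policy):
    self._wrapped_policy = wrapped_policy

  def action(self, time_step):
    probabilities = self._wrapped_policy.distribution(time_step)
    return max(probabilities, key=probabilities.get)

  def distribution(self, time_step):
    # A delta distribution: all mass on the greedy action.
    return {self.action(time_step): 1.0}


toy_greedy = ToyGreedyWrapper(ToyStochasticPolicy())
print(toy_greedy.action(None), toy_greedy.distribution(None))
```

Because the wrapper exposes the same action and distribution methods as the policy it wraps, it can be used anywhere the wrapped policy could be.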