Abstract
In this Colab we demonstrate how to use the various optimizers implemented in TensorFlow Probability.
Dependencies & Prerequisites
Imports
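The import cell is collapsed in the original notebook. A minimal sketch of the imports the code below relies on might look like the following; the exact contents of the original cell, and the CACHE_DIR location (inferred from the download output further down), are assumptions.

import contextlib
import functools
import os
import time
import urllib.request

import numpy as np
import pandas as pd
import scipy as sp
import scipy.special
import scipy.stats
from sklearn import preprocessing

import tensorflow as tf
import tensorflow_probability as tfp
import tf_keras  # used for the mean_squared_error loss in the L1 regression example

# Directory used to cache the downloaded prostate dataset (assumed; the
# download output below writes to /tmp/datasets).
CACHE_DIR = '/tmp/datasets'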
BFGS and L-BFGS Optimizers
Quasi-Newton methods are a class of popular first-order optimization algorithms. These methods use a positive definite approximation to the exact Hessian to find the search direction.
The Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) is a specific implementation of this general idea. It is applicable to medium-sized problems and is the method of choice when the gradient is continuous everywhere (for example, linear regression with an \(L_2\) penalty).
L-BFGS is a limited-memory version of BFGS that is useful for solving larger problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse. Instead of storing fully dense \(n \times n\) approximations of the Hessian, it only saves a few vectors of length \(n\) that represent these approximations implicitly.
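The number of stored length-\(n\) vectors is exposed in TFP through the num_correction_pairs argument of tfp.optimizer.lbfgs_minimize; the examples in this notebook simply use its default. A minimal sketch (not part of the original notebook, and assuming tf and tfp are imported as above):

# Cap the implicit Hessian approximation at 5 stored update pairs instead of
# the default. The tiny sum-of-squares objective here is only illustrative.
def sum_of_squares_with_grad(x):
  return tfp.math.value_and_gradient(lambda z: tf.reduce_sum(z ** 2), x)

low_memory_result = tfp.optimizer.lbfgs_minimize(
    sum_of_squares_with_grad,
    initial_position=tf.constant([1.0, -2.0, 3.0]),
    num_correction_pairs=5,
    tolerance=1e-8)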
Helper functions
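The helper-function cell is also collapsed in the original notebook. The examples below call two helpers, make_val_and_grad_fn and run; a plausible sketch of them is shown here (the timing and result-unpacking details are assumptions, not the notebook's exact code).

def make_val_and_grad_fn(value_fn):
  """Wrap `value_fn` so it returns both the objective value and its gradient."""
  @functools.wraps(value_fn)
  def val_and_grad(x):
    return tfp.math.value_and_gradient(value_fn, x)
  return val_and_grad

@contextlib.contextmanager
def timed_execution():
  """Print how long the enclosed block of code took to run."""
  t0 = time.time()
  yield
  dt = time.time() - t0
  print('Evaluation took: %f seconds' % dt)

def np_value(tensor):
  """Get the numpy value out of a (possibly nested) tuple of tensors."""
  if isinstance(tensor, tuple):
    return type(tensor)(*(np_value(t) for t in tensor))
  else:
    return tensor.numpy()

def run(optimizer):
  """Run an optimizer, time the second evaluation and return numpy results."""
  optimizer()  # Warmup: trace/compile the tf.function before timing.
  with timed_execution():
    result = optimizer()
  return np_value(result)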
L-BFGS on a simple quadratic function
# Fix numpy seed for reproducibility
np.random.seed(12345)
# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.
# The `make_val_and_grad_fn` decorator helps transform a function
# returning the objective value into one that returns both the value
# and the gradient. It works in both eager and graph mode.
dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)
# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)
# Finally, an optional argument called tolerance lets you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10
@tf.function
def quadratic_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_lbfgs)
# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.
print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Solving the same problem with BFGS
@tf.function
def quadratic_with_bfgs():
return tfp.optimizer.bfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_bfgs)
print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Linear regression with L1 penalty: prostate cancer data
An example from the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Note that this is an optimization problem with an L1 penalty.
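Concretely, the objective implemented below in regression_loss is the residual sum of squares plus an L1 term on the coefficients (a standard lasso-style loss; the notation here is ours, not the notebook's):

\[\text{loss}(\beta_0, \beta) = \sum_{i} \big(y_i - \beta_0 - x_i^\top \beta\big)^2 + \lambda \sum_{j} |\beta_j|,\]

where \(\beta_0\) is the intercept, \(\beta\) the vector of coefficients, and \(\lambda\) the regularization parameter.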
Getting the dataset
def cache_or_download_file(cache_dir, url_base, filename):
"""Read a cached file or download it."""
filepath = os.path.join(cache_dir, filename)
if tf.io.gfile.exists(filepath):
return filepath
if not tf.io.gfile.exists(cache_dir):
tf.io.gfile.makedirs(cache_dir)
url = url_base + filename
print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
urllib.request.urlretrieve(url, filepath)
return filepath
def get_prostate_dataset(cache_dir=CACHE_DIR):
"""Download the prostate dataset and read as Pandas dataframe."""
url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
return pd.read_csv(
cache_or_download_file(cache_dir, url_base, 'prostate.data'),
delim_whitespace=True, index_col=0)
prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.
Problem definition
np.random.seed(12345)
feature_names = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp',
'gleason', 'pgg45']
# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
scalar.fit_transform(
prostate_df[feature_names].astype('float64')))
# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T']
# Select features and labels
features = prostate_df_train[feature_names]
labels = prostate_df_train[['lpsa']]
# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)
dtype = feat.dtype
regularization = 0 # regularization parameter
dim = 8 # number of features
# We pick a random starting point for the search
start = np.random.randn(dim + 1)
def regression_loss(params):
"""Compute loss for linear regression model with L1 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The mean square error loss including L1 penalty.
"""
params = tf.squeeze(params)
intercept, beta = params[0], params[1:]
pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
mse_loss = tf.reduce_sum(
tf.cast(
tf_keras.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
total_loss = mse_loss + l1_penalty
return total_loss
Solving with L-BFGS
Fit using L-BFGS. Even though the L1 penalty introduces a discontinuity in the derivatives, in practice L-BFGS still works quite well.
@tf.function
def l1_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
make_val_and_grad_fn(regression_loss),
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta: Fitted [ 0.68626215 0.28193532 -0.17030254 0.10799274 0.33634988 -0.24888523 0.11992237 0.08689026]
Solving with Nelder Mead
The Nelder Mead method is one of the most popular derivative-free minimization methods. This optimizer doesn't use gradient information and makes no assumptions on the differentiability of the target function; it is therefore appropriate for non-smooth objective functions, for example optimization problems with an L1 penalty.
For an optimization problem in \(n\) dimensions it maintains a set of \(n+1\) candidate solutions that span a non-degenerate simplex. It successively modifies the simplex based on a set of moves (reflection, expansion, contraction and shrinkage), using the function values at each of the vertices.
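To make these moves concrete, here is the standard textbook form of the reflection step (shown for illustration only; it is not taken from the TFP implementation): the worst vertex \(x_{n+1}\) is replaced by a point mirrored through the centroid \(\bar{x}\) of the remaining \(n\) vertices,

\[x_r = \bar{x} + \alpha\,(\bar{x} - x_{n+1}), \qquad \alpha > 0 \ \text{(typically } \alpha = 1\text{)}.\]

Expansion pushes the candidate further out along the same direction, contraction pulls it back toward \(\bar{x}\), and shrinkage contracts the whole simplex toward the best vertex.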
# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)
@tf.function
def l1_regression_with_nelder_mead():
return tfp.optimizer.nelder_mead_minimize(
regression_loss,
initial_vertex=initial_vertex,
func_tolerance=1e-10,
position_tolerance=1e-10)
results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta: Fitted [ 0.68626266 0.28193456 -0.17030291 0.10799375 0.33635132 -0.24888703 0.11992244 0.08689023]
Logistic regression with L2 penalty
In this example, we create a synthetic data set for classification and use the L-BFGS optimizer to fit the parameters.
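The objective minimized below, negative_log_likelihood, is the usual logistic-regression negative log-likelihood with an L2 term on the coefficients (notation ours):

\[\text{loss}(\beta_0, \beta) = -\sum_{i} \Big[\, y_i \log \sigma(\beta_0 + x_i^\top \beta) + (1 - y_i) \log\big(1 - \sigma(\beta_0 + x_i^\top \beta)\big) \Big] + \lambda \sum_{j} \beta_j^2,\]

where \(\sigma\) is the sigmoid function and \(\lambda\) is the regularization parameter.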
np.random.seed(12345)
dim = 5 # The number of features
n_obs = 10000 # The number of observations
betas = np.random.randn(dim) # The true beta
intercept = np.random.randn() # The true intercept
features = np.random.randn(n_obs, dim) # The feature matrix
probs = sp.special.expit(
np.matmul(features, np.expand_dims(betas, -1)) + intercept)
labels = sp.stats.bernoulli.rvs(probs) # The true labels
regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)
@make_val_and_grad_fn
def negative_log_likelihood(params):
"""Negative log likelihood for logistic model with L2 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The negative log likelihood plus the penalty term.
"""
intercept, beta = params[0], params[1:]
logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
labels=lab, logits=logit))
l2_penalty = regularization * tf.reduce_sum(beta ** 2)
total_loss = log_likelihood + l2_penalty
return total_loss
start = np.random.randn(dim + 1)
@tf.function
def l2_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
negative_log_likelihood,
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
    Fitted [-0.18016612 0.53121578 -0.56420632 -0.5336374 2.00499675],
    Actual [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057]
Batching support
Both BFGS and L-BFGS support batched computation, for example to optimize a single function from many different starting points, or multiple parametric functions from a single point.
Single function, multiple starting points
The Himmelblau function is a standard optimization test case. The function is given by:
\[f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2\]
The function has four minima located at:
- (3, 2),
- (-2.805118, 3.131312),
- (-3.779310, -3.283186),
- (3.584428, -1.848126).
All of these minima may be reached from appropriate starting points.
# The function to minimize must take as input a tensor of shape [..., n]. In
# this case n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.
@make_val_and_grad_fn
def himmelblau(coord):
x, y = coord[..., 0], coord[..., 1]
return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2
starts = tf.constant([[1, 1],
[-2, 2],
[-1, -1],
[1, -2]], dtype='float64')
# The stopping_condition allows you to further specify when the search should stop.
# The default, tfp.optimizer.converged_all, will proceed until all points have
# either converged or failed. There is also a tfp.optimizer.converged_any to
# stop as soon as the first point converges, or all have failed.
@tf.function
def batch_multiple_starts():
return tfp.optimizer.lbfgs_minimize(
himmelblau, initial_position=starts,
stopping_condition=tfp.optimizer.converged_all,
tolerance=1e-8)
results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds
Converged: [ True True True True]
Minima: [[ 3. 2. ]
 [-2.80511809 3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]
Multiple functions
For demonstration purposes, in this example we simultaneously optimize a large number of randomly generated high-dimensional quadratic bowls.
np.random.seed(12345)
dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)
# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')
@tf.function
def batch_multiple_functions():
return tfp.optimizer.lbfgs_minimize(
quadratic, initial_position=start,
stopping_condition=tfp.optimizer.converged_all,
max_iterations=100,
tolerance=1e-8)
results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08