Abstract
In this Colab we demonstrate how to use the various optimizers implemented in TensorFlow Probability.
Dependencies & Prerequisites
Imports
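The import cell is collapsed in the original notebook. A minimal sketch of the imports the code below relies on might look like the following; the exact contents of the original cell, and the CACHE_DIR location (inferred from the download output further down), are assumptions.

import contextlib
import functools
import os
import time
import urllib.request

import numpy as np
import pandas as pd
import scipy as sp
import scipy.special
import scipy.stats
from sklearn import preprocessing

import tensorflow as tf
import tensorflow_probability as tfp
import tf_keras  # used for the mean_squared_error loss in the L1 regression example

# Directory used to cache the downloaded prostate dataset (assumed; the
# download output below writes to /tmp/datasets).
CACHE_DIR = '/tmp/datasets'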
BFGS and L-BFGS Optimizers
Quasi-Newton methods are a class of popular first-order optimization algorithms. These methods use a positive definite approximation to the exact Hessian to find the search direction.
The Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS) is a specific implementation of this general idea. It is applicable to medium-sized problems and is the method of choice when the gradient is continuous everywhere (for example, linear regression with an \(L_2\) penalty).
L-BFGS is a limited-memory version of BFGS that is useful for solving larger problems whose Hessian matrices cannot be computed at a reasonable cost or are not sparse. Instead of storing fully dense \(n \times n\) approximations of the Hessian, it only saves a few vectors of length \(n\) that represent these approximations implicitly.
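The number of stored length-\(n\) vectors is exposed in TFP through the num_correction_pairs argument of tfp.optimizer.lbfgs_minimize; the examples in this notebook simply use its default. A minimal sketch (not part of the original notebook, and assuming tf and tfp are imported as above):

# Cap the implicit Hessian approximation at 5 stored update pairs instead of
# the default. The tiny sum-of-squares objective here is only illustrative.
def sum_of_squares_with_grad(x):
  return tfp.math.value_and_gradient(lambda z: tf.reduce_sum(z ** 2), x)

low_memory_result = tfp.optimizer.lbfgs_minimize(
    sum_of_squares_with_grad,
    initial_position=tf.constant([1.0, -2.0, 3.0]),
    num_correction_pairs=5,
    tolerance=1e-8)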
Helper functions
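The helper-function cell is also collapsed in the original notebook. The examples below call two helpers, make_val_and_grad_fn and run; a plausible sketch of them is shown here (the timing and result-unpacking details are assumptions, not the notebook's exact code).

def make_val_and_grad_fn(value_fn):
  """Wrap `value_fn` so it returns both the objective value and its gradient."""
  @functools.wraps(value_fn)
  def val_and_grad(x):
    return tfp.math.value_and_gradient(value_fn, x)
  return val_and_grad

@contextlib.contextmanager
def timed_execution():
  """Print how long the enclosed block of code took to run."""
  t0 = time.time()
  yield
  dt = time.time() - t0
  print('Evaluation took: %f seconds' % dt)

def np_value(tensor):
  """Get the numpy value out of a (possibly nested) tuple of tensors."""
  if isinstance(tensor, tuple):
    return type(tensor)(*(np_value(t) for t in tensor))
  else:
    return tensor.numpy()

def run(optimizer):
  """Run an optimizer, time the second evaluation and return numpy results."""
  optimizer()  # Warmup: trace/compile the tf.function before timing.
  with timed_execution():
    result = optimizer()
  return np_value(result)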
L-BFGS on a simple quadratic function
# Fix numpy seed for reproducibility
np.random.seed(12345)
# The objective must be supplied as a function that takes a single
# (Tensor) argument and returns a tuple. The first component of the
# tuple is the value of the objective at the supplied point and the
# second value is the gradient at the supplied point. The value must
# be a scalar and the gradient must have the same shape as the
# supplied argument.
# The `make_val_and_grad_fn` decorator helps transform a function
# returning the objective value into one that returns both the value
# and the gradient. It works in both eager and graph mode.
dim = 10
minimum = np.ones([dim])
scales = np.exp(np.random.randn(dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(scales * (x - minimum) ** 2, axis=-1)
# The minimization routine also requires you to supply an initial
# starting point for the search. For this example we choose a random
# starting point.
start = np.random.randn(dim)
# Finally, an optional argument called tolerance lets you choose the
# stopping point of the search. The tolerance specifies the maximum
# (supremum) norm of the gradient vector at which the algorithm terminates.
# If you don't have a specific need for higher or lower accuracy, leaving
# this parameter unspecified (and hence using the default value of 1e-8)
# should be good enough.
tolerance = 1e-10
@tf.function
def quadratic_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_lbfgs)
# The optimization results contain multiple pieces of information. The most
# important fields are: 'converged' and 'position'.
# Converged is a boolean scalar tensor. As the name implies, it indicates
# whether the norm of the gradient at the final point was within tolerance.
# Position is the location of the minimum found. It is important to check
# that converged is True before using the value of the position.
print('L-BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.014586 seconds
L-BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Solving the same problem with BFGS
@tf.function
def quadratic_with_bfgs():
return tfp.optimizer.bfgs_minimize(
quadratic,
initial_position=tf.constant(start),
tolerance=tolerance)
results = run(quadratic_with_bfgs)
print('BFGS Results')
print('Converged:', results.converged)
print('Location of the minimum:', results.position)
print('Number of iterations:', results.num_iterations)
Evaluation took: 0.010468 seconds
BFGS Results
Converged: True
Location of the minimum: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Number of iterations: 10
Linear regression with L1 penalty: prostate cancer data
An example from the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Note that this is an optimization problem with an L1 penalty.
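Concretely, the objective implemented below in regression_loss is the residual sum of squares plus an L1 term on the coefficients (a standard lasso-style loss; the notation here is ours, not the notebook's):

\[\text{loss}(\beta_0, \beta) = \sum_{i} \big(y_i - \beta_0 - x_i^\top \beta\big)^2 + \lambda \sum_{j} |\beta_j|,\]

where \(\beta_0\) is the intercept, \(\beta\) the vector of coefficients, and \(\lambda\) the regularization parameter.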
Getting the dataset
def cache_or_download_file(cache_dir, url_base, filename):
"""Read a cached file or download it."""
filepath = os.path.join(cache_dir, filename)
if tf.io.gfile.exists(filepath):
return filepath
if not tf.io.gfile.exists(cache_dir):
tf.io.gfile.makedirs(cache_dir)
url = url_base + filename
print("Downloading {url} to {filepath}.".format(url=url, filepath=filepath))
urllib.request.urlretrieve(url, filepath)
return filepath
def get_prostate_dataset(cache_dir=CACHE_DIR):
"""Download the prostate dataset and read as Pandas dataframe."""
url_base = 'http://web.stanford.edu/~hastie/ElemStatLearn/datasets/'
return pd.read_csv(
cache_or_download_file(cache_dir, url_base, 'prostate.data'),
delim_whitespace=True, index_col=0)
prostate_df = get_prostate_dataset()
Downloading http://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data to /tmp/datasets/prostate.data.
Problem definition
np.random.seed(12345)
feature_names = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp',
'gleason', 'pgg45']
# Normalize features
scalar = preprocessing.StandardScaler()
prostate_df[feature_names] = pd.DataFrame(
scalar.fit_transform(
prostate_df[feature_names].astype('float64')))
# select training set
prostate_df_train = prostate_df[prostate_df.train == 'T']
# Select features and labels
features = prostate_df_train[feature_names]
labels = prostate_df_train[['lpsa']]
# Create tensors
feat = tf.constant(features.values, dtype=tf.float64)
lab = tf.constant(labels.values, dtype=tf.float64)
dtype = feat.dtype
regularization = 0 # regularization parameter
dim = 8 # number of features
# We pick a random starting point for the search
start = np.random.randn(dim + 1)
def regression_loss(params):
"""Compute loss for linear regression model with L1 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The mean square error loss including L1 penalty.
"""
params = tf.squeeze(params)
intercept, beta = params[0], params[1:]
pred = tf.matmul(feat, tf.expand_dims(beta, axis=-1)) + intercept
mse_loss = tf.reduce_sum(
tf.cast(
tf_keras.losses.mean_squared_error(y_true=lab, y_pred=pred), tf.float64))
l1_penalty = regularization * tf.reduce_sum(tf.abs(beta))
total_loss = mse_loss + l1_penalty
return total_loss
Solving with L-BFGS
Fit using L-BFGS. Even though the L1 penalty introduces a discontinuity in the derivatives, in practice L-BFGS still works quite well.
@tf.function
def l1_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
make_val_and_grad_fn(regression_loss),
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l1_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('L-BFGS Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.017987 seconds
L-BFGS Results
Converged: True
Intercept: Fitted (2.3879985744556484)
Beta: Fitted [ 0.68626215 0.28193532 -0.17030254 0.10799274 0.33634988 -0.24888523 0.11992237 0.08689026]
Solving with Nelder Mead
The Nelder Mead method is one of the most popular derivative-free minimization methods. This optimizer doesn't use gradient information and makes no assumptions on the differentiability of the target function; it is therefore appropriate for non-smooth objective functions, for example optimization problems with an L1 penalty.
For an optimization problem in \(n\) dimensions it maintains a set of \(n+1\) candidate solutions that span a non-degenerate simplex. It successively modifies the simplex based on a set of moves (reflection, expansion, contraction and shrinkage), using the function values at each of the vertices.
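To make these moves concrete, here is the standard textbook form of the reflection step (shown for illustration only; it is not taken from the TFP implementation): the worst vertex \(x_{n+1}\) is replaced by a point mirrored through the centroid \(\bar{x}\) of the remaining \(n\) vertices,

\[x_r = \bar{x} + \alpha\,(\bar{x} - x_{n+1}), \qquad \alpha > 0 \ \text{(typically } \alpha = 1\text{)}.\]

Expansion pushes the candidate further out along the same direction, contraction pulls it back toward \(\bar{x}\), and shrinkage contracts the whole simplex toward the best vertex.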
# Nelder mead expects an initial_vertex of shape [n + 1, 1].
initial_vertex = tf.expand_dims(tf.constant(start, dtype=dtype), axis=-1)
@tf.function
def l1_regression_with_nelder_mead():
return tfp.optimizer.nelder_mead_minimize(
regression_loss,
initial_vertex=initial_vertex,
func_tolerance=1e-10,
position_tolerance=1e-10)
results = run(l1_regression_with_nelder_mead)
minimum = results.position.reshape([-1])
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Nelder Mead Results')
print('Converged:', results.converged)
print('Intercept: Fitted ({})'.format(fitted_intercept))
print('Beta: Fitted {}'.format(fitted_beta))
Evaluation took: 0.325643 seconds
Nelder Mead Results
Converged: True
Intercept: Fitted (2.387998456121595)
Beta: Fitted [ 0.68626266 0.28193456 -0.17030291 0.10799375 0.33635132 -0.24888703 0.11992244 0.08689023]
Logistic regression with L2 penalty
In this example, we create a synthetic data set for classification and use the L-BFGS optimizer to fit the parameters.
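The objective minimized below, negative_log_likelihood, is the usual logistic-regression negative log-likelihood with an L2 term on the coefficients (notation ours):

\[\text{loss}(\beta_0, \beta) = -\sum_{i} \Big[\, y_i \log \sigma(\beta_0 + x_i^\top \beta) + (1 - y_i) \log\big(1 - \sigma(\beta_0 + x_i^\top \beta)\big) \Big] + \lambda \sum_{j} \beta_j^2,\]

where \(\sigma\) is the sigmoid function and \(\lambda\) is the regularization parameter.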
np.random.seed(12345)
dim = 5 # The number of features
n_obs = 10000 # The number of observations
betas = np.random.randn(dim) # The true beta
intercept = np.random.randn() # The true intercept
features = np.random.randn(n_obs, dim) # The feature matrix
probs = sp.special.expit(
np.matmul(features, np.expand_dims(betas, -1)) + intercept)
labels = sp.stats.bernoulli.rvs(probs) # The true labels
regularization = 0.8
feat = tf.constant(features)
lab = tf.constant(labels, dtype=feat.dtype)
@make_val_and_grad_fn
def negative_log_likelihood(params):
"""Negative log likelihood for logistic model with L2 penalty
Args:
params: A real tensor of shape [dim + 1]. The zeroth component
is the intercept term and the rest of the components are the
beta coefficients.
Returns:
The negative log likelihood plus the penalty term.
"""
intercept, beta = params[0], params[1:]
logit = tf.matmul(feat, tf.expand_dims(beta, -1)) + intercept
log_likelihood = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(
labels=lab, logits=logit))
l2_penalty = regularization * tf.reduce_sum(beta ** 2)
total_loss = log_likelihood + l2_penalty
return total_loss
start = np.random.randn(dim + 1)
@tf.function
def l2_regression_with_lbfgs():
return tfp.optimizer.lbfgs_minimize(
negative_log_likelihood,
initial_position=tf.constant(start),
tolerance=1e-8)
results = run(l2_regression_with_lbfgs)
minimum = results.position
fitted_intercept = minimum[0]
fitted_beta = minimum[1:]
print('Converged:', results.converged)
print('Intercept: Fitted ({}), Actual ({})'.format(fitted_intercept, intercept))
print('Beta:\n\tFitted {},\n\tActual {}'.format(fitted_beta, betas))
Evaluation took: 0.056751 seconds
Converged: True
Intercept: Fitted (1.4111415084244365), Actual (1.3934058329729904)
Beta:
    Fitted [-0.18016612 0.53121578 -0.56420632 -0.5336374 2.00499675],
    Actual [-0.20470766 0.47894334 -0.51943872 -0.5557303 1.96578057]
Batching support
Both BFGS and L-BFGS support batched computation, for example to optimize a single function from many different starting points, or multiple parametric functions from a single point.
Single function, multiple starting points
The Himmelblau function is a standard optimization test case. The function is given by:
\[f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2\]
The function has four minima located at:
- (3, 2),
- (-2.805118, 3.131312),
- (-3.779310, -3.283186),
- (3.584428, -1.848126).
All of these minima may be reached from appropriate starting points.
# The function to minimize must take as input a tensor of shape [..., n]. In
# this case n=2 is the size of the domain of the input and [...] are batching
# dimensions. The return value must be of shape [...], i.e. a batch of scalars
# with the objective value of the function evaluated at each input point.
@make_val_and_grad_fn
def himmelblau(coord):
x, y = coord[..., 0], coord[..., 1]
return (x * x + y - 11) ** 2 + (x + y * y - 7) ** 2
starts = tf.constant([[1, 1],
[-2, 2],
[-1, -1],
[1, -2]], dtype='float64')
# The stopping_condition allows you to further specify when the search should stop.
# The default, tfp.optimizer.converged_all, will proceed until all points have
# either converged or failed. There is also a tfp.optimizer.converged_any to
# stop as soon as the first point converges, or all have failed.
@tf.function
def batch_multiple_starts():
return tfp.optimizer.lbfgs_minimize(
himmelblau, initial_position=starts,
stopping_condition=tfp.optimizer.converged_all,
tolerance=1e-8)
results = run(batch_multiple_starts)
print('Converged:', results.converged)
print('Minima:', results.position)
Evaluation took: 0.019095 seconds
Converged: [ True True True True]
Minima: [[ 3. 2. ]
 [-2.80511809 3.13131252]
 [-3.77931025 -3.28318599]
 [ 3.58442834 -1.84812653]]
Multiple functions
For demonstration purposes, in this example we simultaneously optimize a large number of randomly generated high-dimensional quadratic bowls.
np.random.seed(12345)
dim = 100
batches = 500
minimum = np.random.randn(batches, dim)
scales = np.exp(np.random.randn(batches, dim))
@make_val_and_grad_fn
def quadratic(x):
return tf.reduce_sum(input_tensor=scales * (x - minimum)**2, axis=-1)
# Make all starting points (1, 1, ..., 1). Note not all starting points need
# to be the same.
start = tf.ones((batches, dim), dtype='float64')
@tf.function
def batch_multiple_functions():
return tfp.optimizer.lbfgs_minimize(
quadratic, initial_position=start,
stopping_condition=tfp.optimizer.converged_all,
max_iterations=100,
tolerance=1e-8)
results = run(batch_multiple_functions)
print('All converged:', np.all(results.converged))
print('Largest error:', np.max(results.position - minimum))
Evaluation took: 1.994132 seconds
All converged: True
Largest error: 4.4131473142527966e-08