doc/source/examples/derivatives/structured_prediction.rst
In this
example\ :math:\newcommand{\reals}{\mathbf{R}}\ :math:\newcommand{\ones}{\mathbf{1}},
we fit a regression model to structured data, using an LLCP. The
training dataset :math:\mathcal D contains :math:N input-output
pairs :math:(x, y), where :math:x \in \reals^{n}_{++} is an input
and :math:y \in \reals^{m}_{++} is an outputs. The entries of each
output :math:y are sorted in ascending order, meaning
:math:y_1 \leq y_2 \leq \cdots \leq y_m.
Our regression model :math:\phi : \reals^{n}_{++} \to \reals^{m}_{++}
takes as input a vector :math:x \in \reals^{n}_{++}, and solves an
LLCP to produce a prediction :math:\hat y \in \reals^{m}_{++}. In
particular, the solution of the LLCP is model's prediction. The model is
of the form
.. math::
\begin{equation} \begin{array}{lll} \phi(x) = & \mbox{argmin} & \ones^T (z/y + y / z) \ & \mbox{subject to} & y_i \leq y_{i+1}, \quad i=1, \ldots, m-1 \ && z_i = c_i x_1^{A_{i1}}x_2^{A_{i2}}\cdots x_n^{A_{in}}, \quad i = 1, \ldots, m. \end{array}\label{e-model} \end{equation}
Here, the minimization is over :math:y \in \reals^{m}_{++} and an
auxiliary variable :math:z \in \reals^{m}_{++}, :math:\phi(x) is the
optimal value of :math:y, and the parameters are
:math:c \in \reals^{m}_{++} and :math:A \in \reals^{m \times n}. The
ratios in the objective are meant elementwise, as is the inequality
:math:y \leq z, and :math:\ones denotes the vector of all ones.
Given a vector :math:x, this model finds a sorted vector
:math:\hat y whose entries are close to monomial functions of
:math:x (which are the entries of :math:z), as measured by the
fractional error.
The training loss :math:\mathcal{L}(\phi) of the model on the training
set is the mean squared loss
.. math::
\mathcal{L}(\phi) = \frac{1}{N}\sum_{(x, y) \in \mathcal D} |y - \phi(x)|_2^2.
We emphasize that :math:\mathcal{L}(\phi) depends on :math:c and
:math:A. In this example, we fit the parameters :math:c and
:math:A in the LLCP to minimize the training loss
:math:\mathcal{L}(\phi).
Fitting. We fit the parameters by an iterative projected gradient
descent method on :math:\mathcal L(\phi). In each iteration, we first
compute predictions :math:\phi(x) for each input in the training set;
this requires solving :math:N LLCPs. Next, we evaluate the training
loss :math:\mathcal L(\phi). To update the parameters, we compute the
gradient :math:\nabla \mathcal L(\phi) of the training loss with
respect to the parameters :math:c and :math:A. This requires
differentiating through the solution map of the LLCP. We can compute
this gradient efficiently, using the backward method in CVXPY (or
CVXPY Layers). Finally, we subtract a small multiple of the gradient
from the parameters. Care must be taken to ensure that :math:c is
strictly positive; this can be done by clamping the entries of :math:c
at some small threshold slightly above zero. We run this method for a
fixed number of iterations.
This example is described in the paper Differentiating through Log-Log Convex Programs <http://web.stanford.edu/~boyd/papers/pdf/diff_llcvx.pdf>__.
Shane Barratt formulated the idea of using an optimization layer to regress on sorted vectors.
Requirements. This example requires PyTorch and CvxpyLayers >= v0.1.3.
.. code:: ipython3
from cvxpylayers.torch import CvxpyLayer
import cvxpy as cp
import matplotlib.pyplot as plt
import numpy as np
import torch
torch.set_default_tensor_type(torch.DoubleTensor)
%matplotlib inline
Data generation
.. code:: ipython3
n = 20
m = 10
# Number of training input-output pairs
N = 100
# Number of validation pairs
N_val = 50
.. code:: ipython3
torch.random.manual_seed(243)
np.random.seed(243)
normal = torch.distributions.multivariate_normal.MultivariateNormal(torch.zeros(n), torch.eye(n))
lognormal = lambda batch: torch.exp(normal.sample(torch.tensor([batch])))
A_true = torch.randn((m, n)) / 10
c_true = np.abs(torch.randn(m))
.. code:: ipython3
def generate_data(num_points, seed):
torch.random.manual_seed(seed)
np.random.seed(seed)
latent = lognormal(num_points)
noise = lognormal(num_points)
inputs = noise + latent
input_cp = cp.Parameter(pos=True, shape=(n,))
prediction = cp.multiply(c_true.numpy(), cp.gmatmul(A_true.numpy(), input_cp))
y = cp.Variable(pos=True, shape=(m,))
objective_fn = cp.sum(prediction / y + y/prediction)
constraints = []
for i in range(m-1):
constraints += [y[i] <= y[i+1]]
problem = cp.Problem(cp.Minimize(objective_fn), constraints)
outputs = []
for i in range(num_points):
input_cp.value = inputs[i, :].numpy()
problem.solve(cp.SCS, gp=True)
outputs.append(y.value)
return inputs, torch.stack([torch.tensor(t) for t in outputs])
.. code:: ipython3
train_inputs, train_outputs = generate_data(N, 243)
plt.plot(train_outputs[0, :].numpy())
.. parsed-literal::
[<matplotlib.lines.Line2D at 0x12b367cd0>]
.. image:: structured_prediction_files/structured_prediction_6_1.png
.. code:: ipython3
val_inputs, val_outputs = generate_data(N_val, 0)
plt.plot(val_outputs[0, :].numpy())
.. parsed-literal::
[<matplotlib.lines.Line2D at 0x12da7e410>]
.. image:: structured_prediction_files/structured_prediction_7_1.png
Monomial fit to each component
------------------------------
We will initialize the parameters in our LLCP model by fitting monomials
to the training data, without enforcing the monotonicity constraint.
.. code:: ipython3
log_c = cp.Variable(shape=(m,1))
theta = cp.Variable(shape=(n, m))
inputs_np = train_inputs.numpy()
log_outputs_np = np.log(train_outputs.numpy()).T
log_inputs_np = np.log(inputs_np).T
offsets = cp.hstack([log_c]*N)
.. code:: ipython3
cp_preds = theta.T @ log_inputs_np + offsets
objective_fn = (1/N) * cp.sum_squares(cp_preds - log_outputs_np)
lstq_problem = cp.Problem(cp.Minimize(objective_fn))
.. code:: ipython3
lstq_problem.is_dcp()
.. parsed-literal::
True
.. code:: ipython3
lstq_problem.solve(verbose=True)
.. parsed-literal::
-----------------------------------------------------------------
OSQP v0.6.0 - Operator Splitting QP Solver
(c) Bartolomeo Stellato, Goran Banjac
University of Oxford - Stanford University 2019
-----------------------------------------------------------------
problem: variables n = 1210, constraints m = 1000
nnz(P) + nnz(A) = 23000
settings: linear system solver = qdldl,
eps_abs = 1.0e-05, eps_rel = 1.0e-05,
eps_prim_inf = 1.0e-04, eps_dual_inf = 1.0e-04,
rho = 1.00e-01 (adaptive),
sigma = 1.00e-06, alpha = 1.60, max_iter = 10000
check_termination: on (interval 25),
scaling: on, scaled_termination: off
warm start: on, polish: on, time_limit: off
iter objective pri res dua res rho time
1 0.0000e+00 3.30e+00 1.22e+04 1.00e-01 3.06e-03s
50 1.0014e-02 1.72e-07 1.64e-07 1.75e-03 7.37e-03s
plsh 1.0014e-02 1.56e-15 1.17e-14 -------- 9.68e-03s
status: solved
solution polish: successful
number of iterations: 50
optimal objective: 0.0100
run time: 9.68e-03s
optimal rho estimate: 8.77e-05
.. parsed-literal::
0.010014212812318733
.. code:: ipython3
c = torch.exp(torch.tensor(log_c.value)).squeeze()
lstsq_val_preds = []
for i in range(N_val):
inp = val_inputs[i, :].numpy()
pred = cp.multiply(c,cp.gmatmul(theta.T.value, inp))
lstsq_val_preds.append(pred.value)
Fitting
-------
.. code:: ipython3
A_param = cp.Parameter(shape=(m, n))
c_param = cp.Parameter(pos=True, shape=(m,))
x_slack = cp.Variable(pos=True, shape=(n,))
x_param = cp.Parameter(pos=True, shape=(n,))
y = cp.Variable(pos=True, shape=(m,))
prediction = cp.multiply(c_param, cp.gmatmul(A_param, x_slack))
objective_fn = cp.sum(prediction / y + y / prediction)
constraints = [x_slack == x_param]
for i in range(m-1):
constraints += [y[i] <= y[i+1]]
problem = cp.Problem(cp.Minimize(objective_fn), constraints)
problem.is_dgp(dpp=True)
.. parsed-literal::
True
.. code:: ipython3
A_param.value = np.random.randn(m, n)
x_param.value = np.abs(np.random.randn(n))
c_param.value = np.abs(np.random.randn(m))
layer = CvxpyLayer(problem, parameters=[A_param, c_param, x_param], variables=[y], gp=True)
.. code:: ipython3
torch.random.manual_seed(1)
A_tch = torch.tensor(theta.T.value)
A_tch.requires_grad_(True)
c_tch = torch.tensor(np.squeeze(np.exp(log_c.value)))
c_tch.requires_grad_(True)
train_losses = []
val_losses = []
lam1 = torch.tensor(1e-1)
lam2 = torch.tensor(1e-1)
opt = torch.optim.SGD([A_tch, c_tch], lr=5e-2)
for epoch in range(10):
preds = layer(A_tch, c_tch, train_inputs, solver_args={'acceleration_lookback': 0})[0]
loss = (preds - train_outputs).pow(2).sum(axis=1).mean(axis=0)
with torch.no_grad():
val_preds = layer(A_tch, c_tch, val_inputs, solver_args={'acceleration_lookback': 0})[0]
val_loss = (val_preds - val_outputs).pow(2).sum(axis=1).mean(axis=0)
print('(epoch {0}) train / val ({1:.4f} / {2:.4f}) '.format(epoch, loss, val_loss))
train_losses.append(loss.item())
val_losses.append(val_loss.item())
opt.zero_grad()
loss.backward()
opt.step()
with torch.no_grad():
c_tch = torch.max(c_tch, torch.tensor(1e-8))
.. parsed-literal::
(epoch 0) train / val (0.0018 / 0.0014)
(epoch 1) train / val (0.0017 / 0.0014)
(epoch 2) train / val (0.0017 / 0.0014)
(epoch 3) train / val (0.0017 / 0.0014)
(epoch 4) train / val (0.0017 / 0.0014)
(epoch 5) train / val (0.0017 / 0.0014)
(epoch 6) train / val (0.0016 / 0.0014)
(epoch 7) train / val (0.0016 / 0.0014)
(epoch 8) train / val (0.0016 / 0.0014)
(epoch 9) train / val (0.0016 / 0.0014)
.. code:: ipython3
with torch.no_grad():
train_preds_tch = layer(A_tch, c_tch, train_inputs)[0]
train_preds = [t.detach().numpy() for t in train_preds_tch]
.. code:: ipython3
with torch.no_grad():
val_preds_tch = layer(A_tch, c_tch, val_inputs)[0]
val_preds = [t.detach().numpy() for t in val_preds_tch]
.. code:: ipython3
fig = plt.figure()
i = 0
plt.plot(val_preds[i], label='LLCP', color='teal')
plt.plot(lstsq_val_preds[i], label='least squares', linestyle='--', color='gray')
plt.plot(val_outputs[i], label='true', linestyle='-.', color='orange')
w, h = 8, 3.5
plt.xlabel(r'$i$')
plt.ylabel(r'$y_i$')
plt.legend()
plt.show()
.. image:: structured_prediction_files/structured_prediction_20_0.png