Restricted Boltzmann Machines
- Energy-based model
- Restricted Boltzmann Machine
- Gibbs Sampling in RBMs
- Contrastive Divergence
- Implementation
- References
Energy-based model
Energy-based models associate a scalar energy with each configuration of the variables of interest; configurations with lower energy are more desirable. The probability distribution derived from an energy function can be written as
\[\Pr(x) = \frac{\exp (-E(x))}{Z}\,,\]where $Z = \sum_{x} \exp( -E(x))$ denotes the normalization factor or partition function.
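For intuition, when the state space is tiny we can compute the partition function by brute force. The following sketch (not part of the original post) enumerates all configurations of a toy linear energy $E(x) = -a^\top x$ over three binary variables; the parameter values are arbitrary.

import itertools
import torch

# Toy linear energy E(x) = -a^T x over binary x; 'a' is arbitrary.
a = torch.tensor([0.5, -1.0, 2.0])

def energy(x):
    return -torch.dot(a, x)

# Enumerate all 2^3 binary configurations to compute Z exactly.
configs = [torch.tensor(c) for c in itertools.product([0.0, 1.0], repeat=3)]
neg_energies = torch.stack([-energy(x) for x in configs])
log_Z = torch.logsumexp(neg_energies, dim=0)  # log of the partition function

# Pr(x) = exp(-E(x)) / Z; the probabilities sum to one.
probs = torch.exp(neg_energies - log_Z)
print(probs.sum())  # ~1.0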
Restricted Boltzmann Machine

The Restricted Boltzmann Machine (RBM) [1] admits an efficient training algorithm. To increase the expressive power of the model, we do not model the observed example $x$ alone; we also introduce some unobserved (hidden) variables. Consider an observed part $x$ and a hidden part $h$. We can then write:
\[\Pr(x) = \sum_h \frac{\exp (-E(x, h))}{Z} \,.\]In an RBM, the energy function is defined as
\[E(x, h) = -a^\top x - b^\top h - h^\top W x \,.\]Due to the bipartite structure of RBMs, visible and hidden units are conditionally independent given one another. Using this property, we can write the conditional probabilities as
\[\Pr(h | x) = \prod_i \Pr(h_i | x)\,, \qquad \Pr(x | h) = \prod_i \Pr(x_i | h)\,,\]where the probability of a specific hidden unit given $x$ is computed as
\[\Pr(h_i = 1 | x) = \frac{\exp(b_i + W_{i.}x)}{\exp(b_i + W_{i.}x) + 1}=\sigma (b_i+W_{i.}x)\,.\]Since $x$ and $h$ play the same role in the energy function, a similar derivation can be obtained
\[\Pr(x_i = 1 | h) = \frac{\exp(a_i + W_{.i}h)}{\exp(a_i + W_{.i}h) + 1}=\sigma (a_i+W_{.i}h)\,,\]where $\sigma$ denotes the sigmoid function.
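As a sanity check (not from the original post; the tensors below are random placeholders), the sigmoid formula can be verified against brute-force enumeration of $\Pr(h \mid x) \propto \exp(-E(x, h))$ on a tiny RBM:

import itertools
import torch

torch.manual_seed(0)
n_vis, n_hid = 3, 2
a, b = torch.randn(n_vis), torch.randn(n_hid)
W = torch.randn(n_hid, n_vis)
x = torch.tensor([1.0, 0.0, 1.0])

def neg_energy(x, h):
    return torch.dot(a, x) + torch.dot(b, h) + h @ W @ x

# Brute force: Pr(h | x) is proportional to exp(-E(x, h)), enumerated over all h.
hs = [torch.tensor(c) for c in itertools.product([0.0, 1.0], repeat=n_hid)]
weights = torch.stack([neg_energy(x, h) for h in hs]).exp()
probs = weights / weights.sum()
p_h0_brute = sum(p for p, h in zip(probs, hs) if h[0] == 1)

# Closed form: Pr(h_0 = 1 | x) = sigmoid(b_0 + W_{0.} x).
p_h0_sigmoid = torch.sigmoid(b[0] + W[0] @ x)
print(p_h0_brute, p_h0_sigmoid)  # the two values agree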
To cast the RBM as an energy-based model over the visible units alone, we define the free energy function as
\[F(x) = -\log \sum_h \exp (-E(x, h)) = -a^\top x - \sum_j \log \big(1 + \exp(b_j + W_{j.}x)\big)\,,\]so that $\Pr(x) = \exp(-F(x))/Z$.
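The closed form follows by summing out the hidden units analytically: because each $h_j \in \{0, 1\}$ appears in the energy only through the term $h_j(b_j + W_{j.}x)$, the sum over $h$ factorizes:
\[\begin{aligned}
F(x) &= -\log \sum_h \exp\big(a^\top x + b^\top h + h^\top W x\big)\\
&= -a^\top x - \log \prod_j \sum_{h_j \in \{0, 1\}} \exp\big(h_j (b_j + W_{j.}x)\big)\\
&= -a^\top x - \sum_j \log\big(1 + \exp(b_j + W_{j.}x)\big)\,.
\end{aligned}\]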
Applying maximum likelihood, we obtain the objective
\[\max \sum_i \log \frac{\exp (-F(x_i))}{\sum_{\tilde{x}} \exp(-F(\tilde{x}))}\,.\]The derivative of the negative log-likelihood is computed as
\[\frac{\partial (-\log p(x))}{\partial \theta} = \frac{\partial F(x)} {\partial \theta} - \sum_{\tilde{x}} p(\tilde{x}) \frac{\partial F(\tilde{x})}{\partial \theta}\,.\]Notice that the above gradient contains two terms, referred to as the positive and negative phases. The first term increases the probability of the training data (by reducing the corresponding free energy), while the second term decreases the probability of samples generated by the model. The second term is intractable because it requires an expectation over all possible configurations of the input $x$ under the model distribution.
To overcome this issue, we estimate the expectation using a fixed number of model samples. In particular, a set of samples $\mathcal{N}$ is drawn from the model, and the derivative of the negative log-likelihood is approximated as
\[\frac{\partial (-\log p(x))}{\partial \theta} \approx \frac{\partial F(x)} {\partial \theta} - \frac{1}{|\mathcal{N}|}\sum_{\tilde{x} \in \mathcal{N}} \frac{\partial F(\tilde{x})}{\partial \theta}\,.\]Ideally, the elements $\tilde{x}$ of $\mathcal{N}$ should be sampled according to the model distribution (as in a Monte Carlo estimate). The question now is how to sample these examples.
Gibbs Sampling in RBMs
Sampling from an RBM is useful for two reasons. First, it provides the samples needed to estimate the negative-phase term of the log-likelihood gradient. Second, it lets us inspect examples generated by the model (i.e., what the model has or has not captured about the data distribution). The set of variables $(x, h)$ can be sampled in two steps using a Gibbs chain: first sample $h$ given $x$, then sample $x$ given $h$. This computation is efficient because the $h_i$ are conditionally independent given $x$, and the $x_i$ are conditionally independent given $h$.
In practice, the Gibbs chain is started from a training example: as training progresses, the model distribution gets closer to the data distribution, so a chain initialized at the data is already close to convergence. Even so, waiting for the chain to converge makes training computationally expensive.
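As a minimal sketch, the alternating chain can be written as a hypothetical helper around the two conditional samplers defined in the Implementation section below:

def gibbs_chain(rbm, v0, k):
    # Run k full steps of alternating Gibbs sampling, starting the chain
    # at v0 (typically a training example), and return the final sample.
    v = v0
    for _ in range(k):
        h = rbm.visible_to_hidden(v)  # sample h ~ Pr(h | v)
        v = rbm.hidden_to_visible(h)  # sample v ~ Pr(v | h)
    return v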
Contrastive Divergence
Contrastive Divergence (CD) [2,3] uses a few tricks to speed up the sampling process:
- Replace the expectation by a point estimate.
- Obtain this point using Gibbs sampling, starting from a training example (i.e., from a distribution that is expected to be close to the model distribution, so that the chain is already close to having converged to its final distribution).
- Do not wait for the chain to converge: samples are obtained after only $k$ steps of Gibbs sampling. In practice, $k=1$ has been shown to work surprisingly well.
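In other words, CD-$k$ replaces the model expectation in the gradient above with a single sample $\tilde{x}_k$ obtained after $k$ Gibbs steps started at the training example $x$:
\[\frac{\partial (-\log p(x))}{\partial \theta} \approx \frac{\partial F(x)}{\partial \theta} - \frac{\partial F(\tilde{x}_k)}{\partial \theta}\,.\]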
Implementation
All the source code can be found here. A lightly annotated version of the PyTorch module is shown below; the constructor is a minimal reconstruction added to make the methods self-contained, with parameter shapes inferred from how they are used.



import torch
import torch.nn as nn
import torch.nn.functional as F


class RBM(nn.Module):
    r"""Restricted Boltzmann Machine.

    Note: this ``__init__`` is a minimal reconstruction (the original post
    links to the full source); shapes and initialization are assumptions
    chosen to match the methods below.
    """

    def __init__(self, n_vis, n_hid, k=1):
        super().__init__()
        self.v = nn.Parameter(torch.randn(1, n_vis) * 0.01)  # visible bias (a)
        self.h = nn.Parameter(torch.randn(1, n_hid) * 0.01)  # hidden bias (b)
        self.W = nn.Parameter(torch.randn(n_hid, n_vis) * 0.01)  # weights
        self.k = k  # number of Gibbs steps in CD-k

    def visible_to_hidden(self, v):
        r"""Conditionally sample a hidden variable given a visible variable.

        Args:
            v (Tensor): The visible variable.

        Returns:
            Tensor: The hidden variable.
        """
        # Pr(h_i = 1 | v) = sigmoid(b_i + W_{i.} v); then sample each unit.
        p = torch.sigmoid(F.linear(v, self.W, self.h))
        return p.bernoulli()

    def hidden_to_visible(self, h):
        r"""Conditionally sample a visible variable given a hidden variable.

        Args:
            h (Tensor): The hidden variable.

        Returns:
            Tensor: The visible variable.
        """
        # Pr(v_i = 1 | h) = sigmoid(a_i + W_{.i} h); then sample each unit.
        p = torch.sigmoid(F.linear(h, self.W.t(), self.v))
        return p.bernoulli()

    def free_energy(self, v):
        r"""Free energy function.

        .. math::
            \begin{align}
                F(x) &= -\log \sum_h \exp (-E(x, h)) \\
                &= -a^\top x - \sum_j \log (1 + \exp(W^{\top}_jx + b_j))\,.
            \end{align}

        Args:
            v (Tensor): The visible variable.

        Returns:
            FloatTensor: The free energy value.
        """
        # a^T x per example; squeeze to shape (batch,) so it aligns with h_term.
        v_term = torch.matmul(v, self.v.t()).squeeze(-1)
        w_x_h = F.linear(v, self.W, self.h)  # Wx + b
        h_term = torch.sum(F.softplus(w_x_h), dim=1)  # sum_j log(1 + exp(.))
        return torch.mean(-h_term - v_term)

    def forward(self, v):
        r"""Compute the real and generated examples.

        Args:
            v (Tensor): The visible variable.

        Returns:
            (Tensor, Tensor): The real and generated variables.
        """
        h = self.visible_to_hidden(v)
        v_gibb = v  # guard in case k == 0
        for _ in range(self.k):
            v_gibb = self.hidden_to_visible(h)
            h = self.visible_to_hidden(v_gibb)
        return v, v_gibb
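A quick usage sketch follows; the dimensions, learning rate, and random batch below are hypothetical placeholders, not values from the original post.

# Hypothetical shapes: 784 visible units (e.g. flattened 28x28 images),
# 128 hidden units, CD-1.
rbm = RBM(n_vis=784, n_hid=128, k=1)
optimizer = torch.optim.SGD(rbm.parameters(), lr=0.1)

v_data = torch.rand(64, 784).bernoulli()  # stand-in batch of binary data
v, v_gibb = rbm(v_data)  # k steps of Gibbs sampling started at the data

# CD loss: difference of free energies between data and model samples;
# its gradient matches the approximate gradient derived above.
loss = rbm.free_energy(v) - rbm.free_energy(v_gibb)
optimizer.zero_grad()
loss.backward()
optimizer.step()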
References
[1] Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA: MIT Press.
[2] Carreira-Perpiñán, M. A. and Hinton, G. E. (2005). On contrastive divergence learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS'05).
[3] Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800.