Interleaved Gibbs Diffusion for Constrained Generation

Google DeepMind
Figure: IGD illustration. Interleaved noising process: sequential noising of discrete tokens (\(D_s\)) and continuous vectors (\(C_s\)). Noising occurs one element at a time, keeping all other elements unchanged.

Abstract

We introduce Interleaved Gibbs Diffusion (IGD), a novel generative modeling framework for mixed continuous–discrete data, focusing on constrained generation problems. Prior works on discrete and continuous–discrete diffusion models assume a factorized denoising distribution for fast generation, which can hinder the modeling of strong dependencies between random variables that arise in constrained generation. IGD moves beyond this by interleaving continuous and discrete denoising algorithms via a discrete-time Gibbs-sampling-type Markov chain. IGD provides flexibility in the choice of denoisers, allows conditional generation via state-space doubling, and supports inference-time scaling via the ReDeNoise method. Empirical evaluations on three challenging tasks (solving 3-SAT, generating molecule structures, and generating layouts) demonstrate state-of-the-art performance. Notably, IGD achieves a 7% improvement on 3-SAT out of the box and state-of-the-art molecule generation results without relying on equivariant diffusion or domain-specific architectures. For each of these problems, we further explore a wide range of modeling and interleaving strategies along with hyperparameters.

Method Overview

We describe the Interleaved Gibbs Diffusion (IGD) framework for sampling from a target distribution \(\pi\) over \(\mathcal{S}_L = \mathcal{X}^{L_1} \times \prod_{i=L_1+1}^{L} \mathbb{R}^{d_i}\), i.e., sequences whose first \(L_1\) elements are discrete tokens from \(\mathcal{X}\) and whose remaining \(L - L_1\) elements are continuous vectors. In IGD, both the forward noising and the reverse denoising process operate on one element at a time.

Forward Noising Process

Given a sample \(s \sim \pi\), the forward process generates a trajectory \(s^{(0)}, s^{(1)}, \dots, s^{(T)}\) with \(s^{(0)} = s\). At each sequence time \(t\), a position \(i_t \in \{1,2,\dots,L\}\) is selected (via a round-robin interleaving pattern) for noising. The update rule depends on whether \(s_{i_t}\) is discrete or continuous:

For discrete tokens (\(i_t \leq L_1\)): \[ s^{(t+1)}_j = \begin{cases} z_t, & \text{if } j=i_t \text{ and } z_t \neq \phi,\\[1ex] s^{(t)}_j, & \text{otherwise}, \end{cases} \] where \(z_t \sim \Pi_t\) (a noise distribution over \(\mathcal{X} \cup \{\phi\}\)) and \(\phi \notin \mathcal{X}\).

For continuous vectors (\(L_1 < i_t \leq L\)): \[ s^{(t+1)}_{i_t} = \sqrt{1-\tilde{\beta}_{m_{i_t}^t}}\, s^{(t)}_{i_t} + \sqrt{\tilde{\beta}_{m_{i_t}^t}}\, \epsilon^{(t)}, \] where \(m_{i_t}^t\) is the number of visits to position \(i_t\) up to time \(t\), \([\tilde{\beta}_j]_{j=1}^{m}\) is a monotonically increasing noise schedule, and \(\epsilon^{(t)} \sim \mathcal{N}(0, I)\).
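
Putting the two cases together, one forward-noising step can be sketched in Python as below. This is a minimal illustration, not the paper's implementation: it assumes a round-robin interleaving pattern, a uniform noise distribution \(\Pi_t\) over \(\mathcal{X} \cup \{\phi\}\), 0-indexed positions, and 0-based indexing into the noise schedule; all names are illustrative.

```python
import numpy as np

def forward_noise_step(s, t, L, L1, vocab, beta_schedule, visit_counts,
                       phi="<phi>", rng=None):
    """One forward-noising step of IGD (illustrative sketch, not official code).

    s            : list of length L; entries 0..L1-1 are tokens, the rest np.ndarrays
    visit_counts : visit_counts[i] = number of times position i was noised so far
    beta_schedule: monotonically increasing noise levels, indexed by visit count
    """
    if rng is None:
        rng = np.random.default_rng()
    i_t = t % L                                   # round-robin interleaving (assumption)
    s_next = list(s)
    if i_t < L1:                                  # discrete token
        support = list(vocab) + [phi]             # assumption: Pi_t uniform over X ∪ {phi}
        z_t = support[rng.integers(len(support))]
        if z_t != phi:                            # phi means "leave the token unchanged"
            s_next[i_t] = z_t
    else:                                         # continuous vector
        beta = beta_schedule[visit_counts[i_t]]   # \tilde{beta}_{m_{i_t}^t}
        eps = rng.standard_normal(s[i_t].shape)   # epsilon^{(t)} ~ N(0, I)
        s_next[i_t] = np.sqrt(1.0 - beta) * s[i_t] + np.sqrt(beta) * eps
    visit_counts[i_t] += 1
    return s_next
```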

Reverse Denoising Process

Starting from a noised sample \(\hat{s}^{(T)} \sim P_T\), the reverse process generates a clean sample by sequentially denoising one element at a time. For each \(t\) (from \(T\) down to \(1\)), we set \(\hat{s}^{(t-1)}_{-i_t} = \hat{s}^{(t)}_{-i_t}\), i.e., all elements other than \(i_t\) are copied unchanged. Then, if \(\hat{s}^{(t)}_{i_t}\) is discrete, a learned discrete denoiser \(\texttt{DiscDen}(\hat{s}^{(t)}, i_t, t)\) is applied to sample \(\hat{s}^{(t-1)}_{i_t}\); if it is continuous, a continuous denoiser \(\texttt{ContDen}(\hat{s}^{(t)}, i_t, t)\) is used instead. Under ideal denoisers, the final output satisfies \(\hat{s}^{(0)} \sim \pi\).

ReDeNoise Algorithm

To further improve sample quality at inference, we propose the ReDeNoise algorithm. Given a sample \(\hat{s}^{(0)}\) from the reverse process, we repeat the following two steps \(N_R\) times:

  1. Noise \(\hat{s}^{(0)}\) for \(T_R\) rounds to obtain \(\hat{s}^{(T_R)}\).
  2. Denoise \(\hat{s}^{(T_R)}\) back to \(\hat{s}^{(0)}\).

Here, \(N_R\) and \(T_R\) are hyperparameters that control the number of repetitions and the extent of additional noise.
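
A minimal sketch of this refinement loop, assuming hypothetical forward_noise and reverse_denoise helpers that run the corresponding processes for a given number of rounds:

```python
def redenoise(s0_hat, forward_noise, reverse_denoise, N_R, T_R):
    """ReDeNoise refinement (sketch; forward_noise / reverse_denoise are assumed helpers).

    forward_noise(s, T_R)  : applies T_R rounds of the forward noising process to s
    reverse_denoise(s, T_R): runs the reverse denoising process from time T_R back to 0
    """
    for _ in range(N_R):                        # repeat N_R times
        s_TR = forward_noise(s0_hat, T_R)       # step 1: re-noise for T_R rounds
        s0_hat = reverse_denoise(s_TR, T_R)     # step 2: denoise back to a refined sample
    return s0_hat
```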

Conditional Generation

For conditional generation, a binary mask indicates which elements are held fixed throughout both the forward and reverse processes. This state-space doubling strategy allows the model to generate the remaining coordinates conditioned on the fixed ones; a minimal sketch of the masking behavior is given below.
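
The observable effect of the mask during generation can be sketched as follows. The helper and argument names are illustrative assumptions, and the sketch covers only the masking behavior, not how the conditioning flag is encoded into the model input.

```python
def masked_update(s_proposed, fixed_values, fixed_mask):
    """Apply one (forward or reverse) update while respecting the conditioning mask.

    fixed_mask[i] == True  -> element i is observed: keep its fixed value.
    fixed_mask[i] == False -> element i is generated: accept the proposed update.
    (Illustrative sketch only; names are not from the paper.)
    """
    return [fixed_values[i] if fixed_mask[i] else s_proposed[i]
            for i in range(len(s_proposed))]
```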

Training Algorithm

$$ \begin{array}{l} \textbf{Algorithm 2: Model Training} \\[1ex] \textbf{Input: } D,\; f_\theta,\; \text{FwdPrcs},\; \text{opt},\; T,\; \{i_t\}_{t=0}^{T-1},\; \{\Pi_t\},\; \{\beta_j\} \\[1ex] \textbf{Output: } \text{Trained } \theta \\[1ex] \textbf{For each training iteration do:} \\[1ex] \quad 1.\; s^{(0)} \sim D \\[1ex] \quad 2.\; \text{Sample } t \in \{0,1,\dots,T-1\} \\[1ex] \quad 3.\; \textbf{if } s^{(0)}_{i_t} \text{ is discrete then} \\[1ex] \quad\quad (s^{(t)}, z_t, s^{(t+1)}) = \text{FwdPrcs}(s^{(0)}, t, \{i_\tau\}_{\tau=0}^{t}, \{\Pi_\tau\}_{\tau=0}^{t}) \\[1ex] \quad\quad L = \text{BCE}\Bigl(f_\theta(s^{(t+1)}_{i_t}),\, z_t\Bigr) \\[1ex] \quad 4.\; \textbf{else} \\[1ex] \quad\quad (s^{(t)}, \epsilon) = \text{FwdPrcs}(s^{(0)}, t, \{i_\tau\}_{\tau=0}^{t}, \{\beta_j\}) \\[1ex] \quad\quad L = \bigl\| \epsilon - f_\theta(s^{(t)}_{i_t}) \bigr\|_2^2 \\[1ex] \quad 5.\; \theta \leftarrow \text{opt.update}(\theta, L) \end{array} $$
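
A PyTorch-style sketch of one training iteration in the spirit of Algorithm 2 is given below. The helper and model signatures are assumptions (FwdPrcs is split into two hypothetical helpers, and the model is assumed to take the full sequence together with the position and time); the BCE term of Algorithm 2 is written as a generic cross-entropy over the token vocabulary for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, s0, t, i_t, L1, fwd_discrete, fwd_continuous):
    """One IGD training iteration (sketch; helper and model signatures are assumptions).

    fwd_discrete(s0, t, i_t)   -> (s_tplus1, z_t): noised sequence and target token id
    fwd_continuous(s0, t, i_t) -> (s_t, eps)     : noised sequence and injected noise
    model(s, i_t, t)           -> token logits (discrete) or predicted noise (continuous)
    """
    if i_t < L1:                                    # discrete position
        s_tp1, z_t = fwd_discrete(s0, t, i_t)
        logits = model(s_tp1, i_t, t)               # distribution over X ∪ {phi}
        loss = F.cross_entropy(logits, z_t)         # stands in for the BCE term
    else:                                           # continuous position
        s_t, eps = fwd_continuous(s0, t, i_t)
        loss = F.mse_loss(model(s_t, i_t, t), eps)  # epsilon-prediction objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```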

Sampling Algorithm

$$ \begin{array}{l} \textbf{Algorithm 1: Interleaved Gibbs Diffusion} \\[1ex] \textbf{Input: } \hat{s}^{(T)} \sim P_T,\quad \mathtt{DiscDen},\quad \mathtt{ContDen},\quad \{i_t\}_{t=1}^{T} \\[1ex] \textbf{Output: } \hat{s}^{(0)} \sim \pi \\[1ex] \textbf{for } t = T,\, T-1,\, \dots,\, 1 \quad \textbf{do} \\[1ex] \quad \hat{s}^{(t-1)}_{-i_t} = \hat{s}^{(t)}_{-i_t} \\[1ex] \quad \textbf{if } \hat{s}^{(t)}_{i_t} \text{ is discrete then} \\[1ex] \quad\quad \hat{s}^{(t-1)}_{i_t} = \mathtt{DiscDen}\bigl(\hat{s}^{(t)},\, i_t,\, t\bigr) \\[1ex] \quad \textbf{else} \\[1ex] \quad\quad \hat{s}^{(t-1)}_{i_t} = \mathtt{ContDen}\bigl(\hat{s}^{(t)},\, i_t,\, t\bigr) \\[1ex] \quad \textbf{end if} \\[1ex] \textbf{end for} \end{array} $$
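
Algorithm 1 translates almost line for line into the following Python sketch; DiscDen and ContDen are assumed to be callables wrapping the trained denoisers, and positions encodes the interleaving pattern \(\{i_t\}\) with 0-indexed elements.

```python
def igd_sample(s_T, disc_den, cont_den, positions, T, L1):
    """Reverse (sampling) process of Algorithm 1 (sketch; 0-indexed positions).

    positions[t-1] is i_t, the element denoised at sequence time t;
    indices < L1 are discrete tokens, the rest are continuous vectors.
    """
    s = list(s_T)                          # \hat{s}^{(T)}
    for t in range(T, 0, -1):
        i_t = positions[t - 1]
        if i_t < L1:                       # discrete element
            s[i_t] = disc_den(s, i_t, t)   # sample \hat{s}^{(t-1)}_{i_t}
        else:                              # continuous element
            s[i_t] = cont_den(s, i_t, t)
    return s                               # \hat{s}^{(0)} ~ pi under ideal denoisers
```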

Molecule Generation: Generated Samples

Atom color legend: C (Carbon), H (Hydrogen), O (Oxygen), N (Nitrogen).

Molecule Generation Results

Performance on the QM9 benchmark. Metrics are reported in percentages.

Layout Generation: Generated Samples

Layout Generation Sample 1
Layout Generation Sample 2

Layout Generation Results (RICO - Category Conditioned)

Results on the RICO dataset for category conditioned layout generation. Lower FID is better; higher mIoU is better.

3-SAT Boolean Satisfiability: Generated Samples

3-SAT Sample 1
3-SAT Sample 2

3-SAT Results (n = 5, 7, and 9)

Accuracy of separate models trained for n = 5, 7, and 9. IGD results are highlighted.

BibTeX

@article{anil2025interleaved,
  title={Interleaved Gibbs Diffusion for Constrained Generation},
  author={Anil, Gautham Govind and Yadav, Sachin and Nagaraj, Dheeraj and Shanmugam, Karthikeyan and Jain, Prateek},
  journal={arXiv preprint arXiv:2502.13450},
  year={2025}
}