Fine-tuning against a modified noise distribution enables Stable Diffusion to easily generate very dark or very light images.

Denoising Diffusion Probabilistic Models are a relatively new form of generative neural network model – models which produce samples from a high-dimensional probability distribution learned from data. Other approaches to the same class of problem include Generative Adversarial Networks, Normalizing Flows, and various forms of autoregressive models that sample dimensions one at a time or in blocks. One of the major applications of this kind of modelling is image synthesis, and diffusion models have recently been very competitive on image quality, particularly at producing globally coherent composition across the image.

Stable Diffusion is a pre-trained, publicly available model that can use this technique to produce some stunning results. However, it has an interesting limitation that seems to have mostly gone unnoticed. If you try to ask it to generate images that should be particularly dark or light, it almost always generates images whose average value is relatively close to 0.5 (with an entirely black image being 0, and an entirely white image being 1). For example:

Top left: A dark alleyway in a rainstorm (0.301); Top right: Monochrome line-art logo on a white background (0.709); Bottom left: A snow-covered ski slope on a sunny day (0.641); Bottom right: A town square lit only by torchlight (0.452)
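The number in parentheses after each prompt is that image's average value. The exact computation isn't spelled out here, so the following is only a minimal sketch of one plausible way to get such a number, assuming a plain mean over all pixels and channels of an 8-bit image. The name `average_value` is made up for illustration.

```python
import numpy as np

def average_value(pixels: np.ndarray) -> float:
    """Mean pixel intensity rescaled to [0, 1] (0 = all black, 1 = all white).

    `pixels` is assumed to be an 8-bit image array, e.g. of shape (H, W, 3).
    """
    return float(pixels.astype(np.float64).mean() / 255.0)

# Sanity check on the two extremes.
black = np.zeros((512, 512, 3), dtype=np.uint8)
white = np.full((512, 512, 3), 255, dtype=np.uint8)
print(average_value(black), average_value(white))
```

For a file on disk, the array could be obtained with something like `np.asarray(Image.open(path).convert("RGB"))` using Pillow.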

For the most part these images are still plausible. But this soft constraint toward an average value of around 0.5 can lead to images being washed out, areas of bright fog that counteract other dark areas, high-frequency textures (in the logos) rather than empty areas, grey backgrounds rather than white or black, etc. While some of these can be corrected or adjusted by hand with post-processing, there’s also a larger potential limit here: the overall palette of a scene can correlate with other aspects of presentation and composition in a way that the diffusion model can’t explore as freely as might be possible with other approaches.

But why is it doing this? Am I just imagining the effect and these results are ‘correct’? Is it just a matter of the training data, something about the architecture, or something about diffusion models in general? (It was the last).

First, though, to make sure I wasn’t just imagining things, I tried fine-tuning Stable Diffusion against a single solid black image. Generally fine-tuning Stable Diffusion (SD) works pretty well – there’s a technique called Dreambooth to teach SD new, specific concepts like a particular person’s face or a particular cat, and a few dozen images and a few thousand gradient updates are enough for the model to learn what that particular subject looks like. Extend that to ten thousand steps and it can start to memorize specific images.

But when I fine-tuned against this single, solid black image, even after 3000 steps I was still getting results like this for “A solid black image”:

Using the prompt: “A solid black image”

So it seems like not only does SD not have the ability to produce overly dark or light images out of the box, but it also can’t even learn to do it.

Well, not without changing one thing about it.

To understand what’s going on, it helps to examine what exactly a diffusion model is learning to reverse. The usual way diffusion models are formulated is as the inverse of a particular forward stochastic process – repeated addition of small amounts of ‘independently and identically distributed’ (iid) Gaussian noise. That is to say, each pixel in the latent space receives its own random sample at each step. The diffusion model learns to take, say, an image after some number of these steps have been performed, and to figure out the direction to go in to follow that trajectory back to the original image. Given this model that can ‘step backwards towards a real image’, you start with pure noise and reverse the noising process to get a novel image.

The issue turns out to be that you don’t ever completely erase the original image during the forward process, so in turn the reverse model starting from pure noise doesn’t exactly get back to the complete true distribution of images. Instead, those things which noise destroys last are in turn most weakly altered by the reverse process – those things are instead inherited from the latent noise sample that is used to start the process.
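To put a rough number on "never completely erased", here is a minimal sketch using the standard closed form of the DDPM forward process, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The schedule is an assumption on my part: 1000 steps with "scaled linear" betas between 0.00085 and 0.012, which I believe is the commonly used Stable Diffusion training configuration. Treat the specific values as illustrative.

```python
import numpy as np

# Assumed schedule: 1000 steps, betas linear in sqrt-space ("scaled linear").
num_steps = 1000
betas = np.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_steps) ** 2
alphas_cumprod = np.cumprod(1.0 - betas)

# Closed form of the forward process:
#   x_t = signal[t] * x_0 + noise_scale[t] * eps
signal = np.sqrt(alphas_cumprod)
noise_scale = np.sqrt(1.0 - alphas_cumprod)

# How much of the original image is still mixed in at the very last step.
final_signal = float(signal[-1])
print(f"weight on the original image at the last step: {final_signal:.4f}")
```

Under these assumptions the weight on the original image is small at the final step but not zero, so the noisiest training examples still carry a faint trace of the image, whereas sampling starts from noise that contains no image at all.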

It might not be obvious at first glance, but if you look at the forward process and how it disrupts an image, the longer wavelength features take longer for the noise to destroy:

That’s why, for example, using the same latent noise seed but different prompts tends to give images that are related to each other at the level of overall composition, but not at the level of individual textures or small-scale patterns. The diffusion process doesn’t know how to change those long-wavelength features. And the longest wavelength feature is the average value of the image as a whole, which is also the feature that is least likely to vary between independent samples of the latent noise.

This problem gets worse the higher the dimensionality of the target object is, because the standard deviation of the mean of N independent noise samples scales like 1/sqrt(N). So if you’re generating a 4d vector this might not be much of a problem – you just need twice as many samples to get the lowest-frequency component as for the highest frequency component. But in Stable Diffusion at 512×512 resolution, the latent has four channels at 64×64, so you’re generating a 4 x 64^2 = 16384 dimensional object. So the longest wavelengths change about a factor of 100 slower than the shortest ones, meaning you’d need to be considering hundreds or thousands of steps of the process to capture that, when the default is around 50 (or for some sophisticated samplers, as low as 20).
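The scaling is easy to check numerically. The sketch below draws many sets of N iid unit Gaussians and measures how much the mean of each set varies; the measured spread should track 1/sqrt(N), giving a factor of 2 for N = 4 and a factor of 100 for N = 10,000. The sample counts and the helper name `std_of_mean` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_trials = 500  # number of independent noise draws per value of N

def std_of_mean(n: int) -> float:
    """Empirical standard deviation of the mean of n iid unit Gaussians."""
    samples = rng.standard_normal((num_trials, n))
    return float(samples.mean(axis=1).std())

for n in (1, 4, 100, 10_000):
    print(f"N = {n:6d}  measured = {std_of_mean(n):.4f}  1/sqrt(N) = {n ** -0.5:.4f}")
```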

Increasing the number of sampling steps does seem to help SD make more extreme images, but we can do a bit better and make a drop-in solution.

The trick has to do with the structure of the noise that we teach a diffusion model to reverse. Because we’re using iid samples, the average of the noise over the whole image barely varies from one draw to the next. But what if we instead use noise that is an iid sample per pixel plus a single random sample (one per channel) that is the same over the entire image?

In code terms, the current training loop uses noise that looks like: 

noise = torch.randn_like(latents)

But instead, I could use something like this:

noise = torch.randn_like(latents) + 0.1 * torch.randn(latents.shape[0], latents.shape[1], 1, 1, device=latents.device)

This would make it so that the model learns to change the zero-frequency component freely, because that component is now being randomized ~10 times faster than for the base distribution (the choice of 0.1 there worked well for me given my limited data and training time – if I made it too large it’d tend to dominate too much of the model’s existing behavior, but much smaller and I wouldn’t see an improvement).
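To see the effect on the zero-frequency component directly, here is a small sketch that wraps the one-liner in a function and compares how much the per-channel mean of the noise varies with and without the offset. The function name `offset_noise`, the `strength` parameter, and the batch shape are my own choices for illustration, not part of any library.

```python
import torch

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    """iid Gaussian noise plus one extra Gaussian draw per (sample, channel).

    The extra draw has shape (B, C, 1, 1), so broadcasting adds the same
    value to every spatial position of that channel.
    """
    batch, channels = latents.shape[:2]
    per_pixel = torch.randn_like(latents)
    per_channel = torch.randn(batch, channels, 1, 1,
                              device=latents.device, dtype=latents.dtype)
    return per_pixel + strength * per_channel

torch.manual_seed(0)
latents = torch.zeros(256, 4, 64, 64)  # arbitrary latent-shaped batch

# Mean over the spatial dimensions = zero-frequency component of each channel.
plain_means = torch.randn_like(latents).mean(dim=(2, 3))
offset_means = offset_noise(latents).mean(dim=(2, 3))
print("std of channel means, plain :", plain_means.std().item())
print("std of channel means, offset:", offset_means.std().item())
```

The spread of the channel means should come out several times larger with the offset term, while the per-pixel statistics stay close to those of ordinary Gaussian noise.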

Fine-tuning with noise like this for a thousand steps or so on just 40 hand-labeled images is enough to significantly change the behavior of Stable Diffusion, without making it get any worse at the things it could previously generate. Here are the results of the four prompts higher up in the article for comparison:
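For context, this is roughly where the changed line sits in a training step. It is only a sketch under several assumptions: a tiny convolution stands in for the real denoising network (which would also take the timestep and the text embedding), random tensors stand in for encoded training images, the objective is the usual noise-prediction loss, and the schedule is the "scaled linear" one I believe is common for Stable Diffusion. None of this is the actual fine-tuning script.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the denoising network; a real run would use the
# Stable Diffusion UNet conditioned on the timestep and the text prompt.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Assumed noise schedule: 1000 steps, "scaled linear" betas.
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

latents = torch.randn(8, 4, 64, 64)  # placeholder for encoded training images
timesteps = torch.randint(0, 1000, (latents.shape[0],))

# The only change from a standard training step is this line.
noise = torch.randn_like(latents) + 0.1 * torch.randn(
    latents.shape[0], latents.shape[1], 1, 1, device=latents.device
)

# Standard forward process: mix image and noise according to the timestep.
signal = alphas_cumprod[timesteps].sqrt().view(-1, 1, 1, 1)
noise_scale = (1.0 - alphas_cumprod[timesteps]).sqrt().view(-1, 1, 1, 1)
noisy_latents = signal * latents + noise_scale * noise

# Noise-prediction objective: the network is trained to recover the noise.
loss = F.mse_loss(model(noisy_latents), noise)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```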

Top left: A dark alleyway in a rainstorm (0.032); Top right: Monochrome line-art logo on a white background (0.974); Bottom left: A snow-covered ski slope on a sunny day (0.858); Bottom right: A town square lit only by torchlight (0.031)

A starry sky before and after offset noise.

Superheroes fighting plant monsters in a dark alley before and after offset noise.

There are a number of papers talking about changing the noise schedule of denoising diffusion models, as well as using different distributions than Gaussian for the noise, or even removing noise altogether and instead using other destructive operations like blurring or masking. However, most of the focus seems to be on accelerating the process of inference – being able to use fewer steps, basically. There doesn’t seem to be as much attention on how design decisions about the noise (or image-destroying operation) could constrain the types of images that can easily be synthesized. However, it’s quite relevant for the aesthetic and artistic uses of these models. 

For an individual artist who is digging a bit into customizing these models and doing their own fine-tuning, adjusting to use this offset noise for one project or another wouldn’t be so difficult. You could just use our checkpoint if you like, for that matter. But with fine-tuning on a small number of images like this, the results aren’t ever going to be quite as general or quite as good as large projects could achieve.

So I’d like to conclude this with a request to those involved in training these large models: please incorporate a little bit of offset noise like this into the training process the next time you do a big run. It should significantly increase the expressive range of the models, allowing much better results for things like logos, cut-out figures, naturally bright and dark scenes, scenes with strongly colored lighting, etc. It’s a very easy trick!
