[P] R2IR & R2ID: Resolution Invariant Image Resampler and Diffuser – Trained on 1:1 32×32 images, generalized to arbitrary aspect ratio and resolution, diffuses 4MP images at 4 steps per second.
This is a continuation of my ongoing project. The previous posts can be found here and here; the project was formerly known as S2ID, and SIID before that. A lot has changed since then, and R2IR and R2ID work very differently. You can read the previous stages, but it's not necessary. The GitHub repository is here for those who want to see the code.

Preface

Over the past couple of months, I've been somewhat disappointed by the pitfalls of classic diffusion models. So I've been working on my own architecture, originally named S2ID (Scale Invariant Image Diffuser) and now, more sensibly, renamed R2ID: Resolution Invariant Image Diffuser. R2ID aims to avoid these pitfalls. Namely:
The core concept of the model is unchanged: each pixel is a distinct point in the image whose coordinate and color we know. Each pixel is effectively a token, and we attend to other tokens (pixels) in the image to work out composition. Unlike LLM tokens, however, these tokens are fundamentally a bit different: they can be infinitely subdivided. A 1MP image upscaled 2x to 4MP doesn't contain 4x as much information; rather, the information is 4x as accurate. Consequently, a relative, not absolute, coordinate system is used (explained later).

R2ID itself has changed massively, chiefly by solving the biggest drawback of the previous iteration: speed. Before, it used attention over the entire image, which was extremely slow. Now, R2IR and R2ID are fast enough to actually be viable (and, I'd assume, competitive) at large resolutions.

The previous post got a lot of suggestions, but one in particular stuck with me: u/MoridinB suggested somehow moving the resolution invariance into the autoencoder. After a break and a lot of pondering, I figured that cross attention with my coordinate system (explained later) could actually work as this "autoencoder" of sorts. So it was made and named R2IR: Resolution Invariant Image Resampler. While it "kind of" performs the role of an autoencoder by decreasing the height and width, it fundamentally isn't one (explained later). Thus, a pair of models: R2ID for diffusion, and R2IR to make images smaller so that R2ID is faster.

Compared to the previous training time of 3.5h, both R2IR and R2ID now train in 2 hours total (about 43% faster), with roughly 3x less memory consumption, despite having over double the total parameter count. But it gets better. Both R2IR and R2ID were trained on 32×32 images turned into 4×4 latents: to sample into, diffuse in, and sample out of those 4×4 latents.
Yet in spite of this, both models have proven to:
Neither model ever saw a single augmented image. This means you can train on one resolution and aspect ratio, and the model comes pre-configured to be good enough at other resolutions and aspect ratios from the get-go, even wildly different ones. I've also come up with an explanation for why it can do this: the dual coordinate system (explained later). In this post I will:
Model Showcase

Let us begin with the model showcase. As before, it's important to note that the model was trained exclusively on 32×32 MNIST images, tensors of size [1, 32, 32]. Passed through R2IR, these become [64, 4, 4]: a 4×4 latent. All subsequent results are effectively testing how well R2IR and R2ID generalize. I used latents of different resolutions and aspect ratios, as well as various image resolutions and aspect ratios. Note that with the way R2IR works, the latent and image sizes are decoupled: you can diffuse at one resolution but resample (hence the name) into a different one. Resampling is not a simple upscale; it's a smart interpolation of sorts. All will be explained later.

Training for both models was aggressive: batch size of 100, EMA decay of 0.999, and a linear-warmup, cosine-decay schedule for the AdamW optimizer. The learning rate peaks at 1e-3 by the 600th step (end of the first epoch) and decays down to 1e-5 over 40 epochs, for a total of 24,000 optimizer steps.

Let's start with 4×4 latents and 32×32 images: the thing the model was trained on.

4×4 latent, resampled into 32×32 images

Strangely enough, the results are… bad. This is because a 4×4 latent is way too small to diffuse in. So let's bump it up to an 8×8 latent.

8×8 latent, resampled into 32×32 images

Much better. But hold up: this latent resolution was never trained. As in, at all. Neither R2ID, which diffused in the latent space, nor R2IR, which was trained to make these latents in the first place, ever saw an 8×8 latent. Only 4×4 latents. What does this mean? It means you can train on one resolution and not worry about inference at another. Intuition suggests that larger latents give better quality, because, as stated earlier, more pixels means more accurate information. Now, how about we stress test R2IR, the resampler. Let's still diffuse on 8×8 latents, but this time, sample into a different resolution.
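As a side note, that learning-rate schedule is easy to reproduce. Here's a rough sketch; the constants come from the numbers above, but the helper name is mine and this is not the repository's actual code:

```python
import math

# Sketch of the schedule described above: linear warmup to 1e-3 over the
# first 600 steps, then cosine decay down to 1e-5 over the rest of the
# 24,000-step run.
PEAK_LR, FINAL_LR = 1e-3, 1e-5
WARMUP_STEPS, TOTAL_STEPS = 600, 24_000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS  # linear warmup
    # cosine decay from PEAK_LR to FINAL_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

Divided by the optimizer's base LR, this kind of function plugs straight into something like `torch.optim.lr_scheduler.LambdaLR`.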
Let's do 10×10 pixels for the extreme.

8×8 latent, resampled into 10×10 images

It still works. If you compare the images, you'll see they're identical in structure, because they come from the same latent. They're just pixelated, which is expected when you only have 10 pixels to work with. Let's look at a 16×16 resample.

8×8 latent, resampled into 16×16 images

As expected, better yet. Same underlying images as before, just pixelated differently. So R2IR can clearly resample a latent into a resolution lower than trained, and it works as expected. But what about higher? Let's resample into 64×64, to see if we can use higher resolutions for the same latent.

8×8 latent, resampled into 64×64 images

Yet again, just like before, it works. No surprise here. With the way R2IR works (explained later), this is not equivalent to a simple upscale. From what you've seen so far, it may seem like R2IR just upscales some fundamental latent image into different resolutions, but that's not what's happening. For each pixel of the output image, R2IR selectively chooses which parts of the latent to attend to. It's an adaptive, dynamic process. In fact, this entire time R2IR was already working overtime: it was never trained to decode 8×8 latents, only 4×4 ones, and it's shown it can resample an 8×8 latent into resolutions it was never trained on either, as R2IR was only ever trained to resample back into 32×32.

Let's really stress test it. Diffuse on an 8×8 latent, but resample into a different aspect ratio. Shouldn't really work, right?

8×8 latent, resampled into 27×18 images (3:2 aspect ratio)

Nope, it still works. It's important to note that with the way the dual coordinate system works (explained later), most of the coordinates R2IR sees here were never in the training data.
And this isn't some interpolation between known coordinates; no, the two coordinate systems are actively sending conflicting signals. Yet it works.

We've already seen that R2ID can diffuse on latent sizes it wasn't trained on, but let's make sure it actually works. Let's diffuse on a non-square latent, like 4×10, then resample back to a square image and check for deformities. After all, the 4×4 latent could barely make digits, and now we're adding a bunch of coordinates to the sides, so we're not really solving the bottleneck here, and then we're asking it to resample back into a square from a non-square latent.

4×10 latent, resampled into 32×32 images

But no. Yet again, it works. We see residual deformities, because we still only have 4 pixels of height, yet that extra width has proven useful enough to _still_ fix some deformities. The resulting images are legible.

Okay, let's really stress test it. Let's diffuse on a 4×10 latent, which is short but wide, then resample it into a skinny and tall image, like a 16:9 aspect ratio. This is silly and pointless, but still.

4×10 latent, resampled into 32×18 images

And yet, it still works. We see deformities, but the images are still surprisingly cleaner than the original 4×4. Let's also diffuse on a 10×4 latent, which is closer to the 16:9 ratio, to see if having the aspect ratios not conflict helps.

10×4 latent, resampled into 32×18 images

Surprisingly, this doesn't seem to have much of an effect, if any. It seems that one or both of the models simply don't care how much you stretch or squeeze the image. And as said before, with the way the dual coordinate system works, both R2IR and R2ID see conflicting coordinates, yet it still works.

For completeness, here is the t-scrape loss. It's annoying to measure all permutations, so this is the t-scrape loss for an 8×8 latent, since those have shown good quality.
This graph shows the MSE loss between the predicted epsilon noise and the actual epsilon noise (Gaussian, mean 0, stdev 1) used at each timestep (alpha bar, a value between 0 and 1 that represents the SNR of the image).

T-scrape loss, absurdly good compared to the previous state

Compared to the previous post, this is a _lot_ smoother and completely mogs the old t-scrape losses across the board, literally 5-10x better pretty much everywhere. Now, let's take a look at the actual architecture.

Dual Coordinate Positioning System

In the previous post I didn't really explain this part well, but it's the one thing that makes everything work in the first place, for both R2IR and R2ID, so it's integral to understand. In short, it's a system that gives two coordinates to each pixel: where it is with respect to the image's edges (relative), and where it _actually_ is if you drew it on a screen (absolute; not truly absolute, it's still relative).

The first system is simple: make the edges ±0.5 and see how far along the pixel is. For the second system, we take the image, whatever its aspect ratio, and inscribe it centered inside a square. The ±0.5 values are given to the square, not the image's own edges, and we get the coordinate by seeing how far along the square the pixel is. Thus we have 2 values for X and 2 values for Y, one "relative" and one "absolute". We need the first system so the model knows about image bounds, and the second so the model doesn't tie composition to the image edges. Use the first without the second, and the model stretches and squeezes the image when you change the inference aspect ratio. Use the second without the first, and the model crops the image instead. We then pass these 4 values through a Fourier series over powers of 2.
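To make the dual system concrete, here's how such a coordinate grid could be computed. This is my own illustrative code, not the repository's; the function name and exact conventions are assumptions:

```python
import torch

def dual_coords(h: int, w: int) -> torch.Tensor:
    """Per-pixel (x_rel, y_rel, x_abs, y_abs) for an h x w image."""
    # relative system: the image's own edges are at ±0.5 on each axis
    y_rel = (torch.arange(h) + 0.5) / h - 0.5   # [h] pixel centers
    x_rel = (torch.arange(w) + 0.5) / w - 0.5   # [w]
    # absolute system: center the image inside a square whose edges are at
    # ±0.5; the longer side spans the full range, the shorter side spans less
    s = max(h, w)
    y_abs, x_abs = y_rel * (h / s), x_rel * (w / s)
    yy_r, xx_r = torch.meshgrid(y_rel, x_rel, indexing="ij")
    yy_a, xx_a = torch.meshgrid(y_abs, x_abs, indexing="ij")
    return torch.stack([xx_r, yy_r, xx_a, yy_a], dim=-1)  # [h, w, 4]
```

For a square image the two systems coincide; for a wide image, the relative Y still spans ±0.5 while the absolute Y is squeezed toward the center, which is exactly the conflicting-signals behavior described above.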
This is so the models can distinguish near pixels from far pixels. Classic RoPE in LLMs, with its ever more atomic tokens, needs to distinguish further and further away. Here we have a relative system, so we need ever-increasing frequencies instead, to distinguish adjacent pixels at higher and higher resolutions. In _this_ example, I used 10 positive frequencies and 6 negative frequencies, so 16 total, x2 for X/Y, x2 for relative/absolute, x2 for sine/cosine: a total of 128 positioning channels.

The keen viewer may have sensed something off with the high frequencies, and they should: 10 frequencies over powers of 2 is way too many. 2^10=1024, which means the model needs 1024 pixels along a side before the final frequency stops looking like noise. So how is the model not just memorizing the values instead of generalizing? Because coordinate jitter is applied _before_ the Fourier series. Whatever resolution R2IR or R2ID is working at, during training we add Gaussian noise with a stdev of half a pixel's width to each raw X/Y coordinate. So during training, the pixels the models see aren't a rigid grid but more like random samples from a continuous field, and when a model later works at a higher resolution, it's already seen those coordinates and already knows what color belongs there: a mix of the two adjacent pixels, as if they were Gaussian fields.

To those aware, this sounds awfully similar to Gaussian splats, because in a sense it is. In the future, I plan to make RIGSIG: Resolution Invariant Gaussian Splat Image Generator, a model that will work on Gaussian splats directly rather than indirectly like here. Now, _why_ does this system work? Why does it generalize to new resolutions, and more interestingly, aspect ratios?
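Here's a sketch of the Fourier features with jitter as described. Again illustrative: the helper name, the exact exponent layout, and the per-axis jitter details are my assumptions; the counts match the text (10 positive + 6 negative frequencies, 128 channels total, highest frequency 2^10 = 1024):

```python
import torch

N_POS, N_NEG = 10, 6  # frequency counts from the text

def fourier_features(coords: torch.Tensor, h: int, w: int, training: bool) -> torch.Tensor:
    """coords: [..., 4] dual coordinates (x_rel, y_rel, x_abs, y_abs);
    returns [..., 128] positional features."""
    if training:
        # coordinate jitter: gaussian noise with stdev of half a pixel's
        # width, applied to the raw coordinates BEFORE the fourier series;
        # each of the 4 coordinate channels has its own pixel width
        s = max(h, w)
        pix = torch.tensor([1 / w, 1 / h, 1 / s, 1 / s], dtype=coords.dtype)
        coords = coords + torch.randn_like(coords) * (0.5 * pix)
    # exponents 1..10 and -1..-6: 16 frequencies, highest 2^10 = 1024
    exps = torch.cat([torch.arange(1, N_POS + 1), -torch.arange(1, N_NEG + 1)])
    freqs = (2.0 ** exps).to(coords.dtype)                   # [16]
    angles = 2 * torch.pi * coords[..., None] * freqs        # [..., 4, 16]
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # [..., 4, 32]
    return feats.flatten(-2)                                 # [..., 128]
```

Channel math checks out: 16 frequencies x 4 coordinates x 2 (sin/cos) = 128.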
Aside from jittering doing some heavy lifting around the edge pixels (making them seem further out than they actually are, as if the image were different), the main reason is that the center coordinates barely change. When you change the aspect ratio, the pixels that change most are around the edges, not the center, and that's convenient, since it's pretty much never the case that your subject just gets cropped out. Subjects are centered; change the aspect ratio, and the middle stays largely the same while the edges change more.

128 channels may sound like a lot, but it really isn't, especially considering the parameter count. Take R2IR for a moment. In the current configuration it has about 3.3M parameters, which can actually be cut down by about 4x (explained later). It expands the color channels from 1 to 64, because I assumed an 8x height and width reduction. For real RGB images at large sizes, we'd want a 16x reduction in height and width, giving 768 channels instead. As for positional frequencies, we can go nuts: 16 positive and 16 negative. The negative frequencies are frankly largely useless: ever-longer wavelengths that quickly become indistinguishable from a constant given the relative nature of our coordinates (though it is interesting whether they could serve as an absolute system). So we can redistribute them into the positive frequencies, into something like 22 positive and 10 negative (even that is overkill). What size image would need the final frequency before it's indistinguishable from noise? What is the resolution limit of the model? 2^22=4,194,304. We would need 4,194,304 _latent_ pixels to just _start_ using the final frequency. With the assumed 16x compression via R2IR, that becomes over 64 million image pixels along one dimension. And we only need 256 channels for this.
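A quick sanity check of that arithmetic, as throwaway code (values from the paragraph above):

```python
# Channel count for the proposed configuration: 22 positive + 10 negative
# frequencies, x2 for X/Y, x2 for relative/absolute, x2 for sine/cosine.
n_pos, n_neg = 22, 10
pos_channels = (n_pos + n_neg) * 2 * 2 * 2
print(pos_channels)  # 256

# Resolution limit: latent pixels needed along one dimension before the
# final 2^22 frequency stops looking like noise, and the corresponding
# image size under a 16x R2IR reduction.
latent_side = 2 ** n_pos        # 4,194,304
image_side = latent_side * 16   # 67,108,864 -> over 64 million pixels
print(image_side)
```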
768 color channels plus 256 positioning channels means the model never goes beyond 1024 channels per token, which by modern LLM-inflated standards is laughably tiny. Now that I say it, I'm willing to bet that R2ID and the coordinate system could be used for more than images, say audio, or something of the sort, where these absurd lengths become very practical. The coordinate jitter approach means that even though those channels are indistinguishable from noise, the model still learns enough about them to generalize to higher resolutions.

R2ID

From the narrative perspective, it makes sense to look at R2ID first, since it's the actual diffusion model. It's also difficult to see the use of R2IR unless you understand R2ID and its pain point. The concept has largely remained unchanged:
However, 2 major developments:
I started developing R2IR when I was still on the cloud attention idea, and it helped a lot back then. But then I started using linear attention in R2IR, and everything became blazing fast, and I questioned if R2IR was even necessary in the first place. Turns out, yes, it still is, in fact, maybe even more so than before. R2IR makes sense as a natural extension once you figure out the drawbacks of R2ID:
So, let's make R2IR.

R2IR

We now know the drawbacks of R2ID, and we know what we need from R2IR: somehow convert height and width into extra channels. Two months ago, when I made the previous post, one comment stuck out to me: u/MoridinB proposed that instead of a resolution invariant diffuser, I make a resolution invariant autoencoder. Even back then, I had felt the pain of the training time, and the concept sounded amazing in theory, but I had no idea how to do it in practice. Looking into existing architectures, I couldn't find what I was looking for. The most obvious alternative was to just diffuse in a Fourier basis, for example, but that's not quite it in my opinion. I assumed there must be some clean solution I just hadn't come to yet.

The most obvious answer to the conundrum (less height and width, more channels) is to use an existing VAE or AE. But there's a massive problem: they use non-1×1 convolution kernels. 1×1 convolution kernels are fine; they're just an image-shaped linear layer and don't mix pixels together. But that's not what CNN-based autoencoders do. They have 3×3 convolutions in the simplest of configurations, which instantly stops them from being resolution invariant and makes them pixel-density dependent. Training on various resolutions, keeping multiple kernels for different resolutions, or reusing one kernel and dynamically scaling it all sound more like hacks than a clean, correct implementation. Over this time, I had tried:
I had genuinely, effectively given up, until a thought struck me: why not use cross attention? Cross attention selectively passes information from one tensor to another. We typically use it to pass information from text tokens to the image, for text conditioning. But what if I made an empty latent, populated it with coordinates, and used cross attention to move information from the image into the latent? What if, for decoding, each pixel selectively integrated information from the latent? The queries Q know only their coordinate, while the keys K and values V know the coordinate and color. Thus the _only_ way for information to pass through is positionally: a kind of smooth view of the image, queried at whatever coordinate you're interested in.

So I made it: R2IR. The dumb approach of full attention, quadratic scaling and all, and yet it still worked. Early R2IR could compress and expand back out. Now, as said before, I made it before switching to linear attention, and the switch was triggered by the fatal flaw of early R2IR: it requires _even more_ computation than R2ID. Say we want to encode and decode a 1024×1024 image; how many attention calls is that? For encoding with an 8x reduction in height and width, that's 128×128 = 16,384 latent pixels, hence 16,384 attention calls, each over the image's 1,048,576 pixels. Yikes. For the decoder, it's 1,048,576 calls over a sequence length of 16,384. At the time, I was experimenting with cloud point attention, splitting the pixels into random groups and attending only within each group as a speedup. Similarly, I used only random fractions of the pixels for the KV, but it was still incredibly slow, and I hit OOM on 64×64 images unless I used a batch size of 10 and fractions like 1/4.
And then I stumbled upon linear attention, and it literally fixed everything. Blazing speed, memory, everything. The reconstructions got even better, because fractions were no longer needed and you could attend over everything. Cloud mechanics became obsolete too. Training R2ID with R2IR versus without is night and day: epochs go from roughly 10 minutes to about 40 seconds, batch sizes can be set to 100, and on top of that we reap the rewards of the resampling tricks.

So how does this actually work? It's simple. Q holds only the coordinates, and KV holds the coordinates and color. For encoding, Q is the latent and KV comes from the actual image. For decoding, Q is the image and KV is the latent. The coordinate system is the same one as before. Now, one pass of linear attention is risky, even multi-head: it acts as a kind of averaging, and with just one pass of attention we risk blurring details, which is exactly what happened. So instead, make it a transformer block with residual addition, just like the "encoder" and "decoder" blocks in R2ID, minus the AdaLN time conditioning, which isn't needed here. Let's have 4 blocks, just in case: the first pass does general colors, the final passes refine details. The final stage compresses back down to the color space via a 1×1 convolution, whether for the latent or the actual image.

Does it work? Yes; in fact, it works _too_ well. Take a look at the attached images and see if you can spot what's wrong. They're all at 1024×1024 resolution, resampled up from a 100×100 latent.

100×100 latent, resampled into a 1024×1024 image

100×100 latent, resampled into a 1024×1024 image

100×100 latent, resampled into a 1024×1024 image

That's right: R2IR has memorized the pixelation of the original images. The raw MNIST images are all 28×28. I trained on 32×32, but that's still the same amount of information as 28×28.
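Before moving on, here's how I'd sketch the coordinates-only-Q / coordinates-plus-color-KV cross attention in its linearized form. This is my own illustrative single-head reduction, not the actual R2IR code; the module and parameter names are made up, and the elu(x)+1 feature map is just one common linear-attention kernel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCrossAttention(nn.Module):
    """Illustrative sketch: queries are built from target coordinates only,
    keys/values from source coordinates plus color, and the softmax is
    replaced by positive feature maps so cost is linear, not quadratic,
    in sequence length."""

    def __init__(self, pos_dim: int, color_dim: int, dim: int = 128):
        super().__init__()
        self.q = nn.Linear(pos_dim, dim)              # sees coordinates only
        self.k = nn.Linear(pos_dim + color_dim, dim)  # sees coords + color
        self.v = nn.Linear(pos_dim + color_dim, dim)

    def forward(self, q_pos, kv_pos, kv_color):
        # q_pos: [B, Nq, pos_dim]; kv_pos: [B, Nk, pos_dim]; kv_color: [B, Nk, color_dim]
        kv_in = torch.cat([kv_pos, kv_color], dim=-1)
        q = F.elu(self.q(q_pos)) + 1   # positive feature map replacing softmax
        k = F.elu(self.k(kv_in)) + 1
        v = self.v(kv_in)
        # accumulate K^T V once, then read it out per query: O(Nq + Nk)
        kv = torch.einsum("bnd,bne->bde", k, v)                            # [B, dim, dim]
        z = 1.0 / (q @ k.sum(dim=1, keepdim=True).transpose(1, 2) + 1e-6)  # [B, Nq, 1]
        return (q @ kv) * z            # [B, Nq, dim], one row per target pixel
```

For encoding, `q_pos` would be the latent grid's coordinates and `kv_pos`/`kv_color` the image's; for decoding, the roles swap.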
Having 4 blocks instead of 1 is what let R2IR memorize the pixelation you see at small resolutions; had I used 1 block, it would have been a nice smooth transition instead. It's safe to say the model knows what it's doing and can certainly capture fine details.

Also, just for fun, let's look at what the latent space looks like. This is a fixed set of images, encoded via R2IR and then rendered directly. This works because the latent-space colors are still literal colors, bound between -1 and 1 just like the color space (which is re-shaped so [0, 1] re-maps to [-1, 1]). Normalization was shown to improve the loss, and it makes visualization easier too. Each column's 64 rows are one image's 64 separate latent channels.

32×32 images compressed to 4×4 latents

There's this very interesting, and equally inexplicable, pattern. I genuinely have no idea why it loves this clean left/right separation. Honestly, no idea; any guesses would be welcome. We can also compress the same 32×32 images into a bigger latent, and see why the model is so robust to resolution changes.

32×32 images compressed to 14×14 latents

This time, the 32×32 image is compressed to a 14×14 latent, meaning that whereas the 4×4 latent had no information doubling ([1, 32, 32] -> [64, 4, 4]), we now have over 3x the same information repeated, and not in the cleanest of ways, since we don't have more pixels on the input end. And yet the latents are _identical_; they just gain some extra details that weren't there before.

All together, the full model

Altogether, the model is absolutely nuts, and I really mean it. It is worlds apart from the previous iteration.
To really drive the point home: in the previous iteration, diffusing on a single 1024×1024 image took literally a minute per prediction. Now? R2ID diffuses on a 256×256 latent (equivalent to a 2048×2048 image, 4MP) at 4.2 steps per second, using just 1.6GiB at fp32. This is worlds apart, considering I haven't really put much effort into optimizing it either. I made a dummy model with a 16x reduction in height and width and trained it on 3-channel MNIST images. R2IR and R2ID hence had 1024 channels: 256 for positioning, 768 for colors. The model _still worked_, and what was wilder was just how lightweight it was. R2IR had 27M parameters, which is nothing compared to the SDXL VAE, while the 8-encoder-block, 8-decoder-block configuration of R2ID had a total of about 270M parameters, also absolutely nothing by modern standards. I feel it's safe to say that R2IR and R2ID can _truly_ be expanded to big resolutions with competitive speed and quality. The prior concerns (speed, memory, ability to capture details) seem solved to me, and now all that's left is to go bigger.

Future development and closing thoughts

As mentioned just above, the future goal is to expand to actual images. I mean real images at actual resolutions, not dummy datasets. I'm open to suggestions. I think something at 512px would be good, with R2IR doing the 16x reduction, making R2IR and R2ID work with 1024 channels for positioning and color. The number 1024 is nice and round; the 16x height and width reduction is aggressive but fits cleanly with the expansion from 3 to 768 color channels. I've also briefly mentioned RIGSIG. It's a dummy repo for now, but I will eventually™ get to it once R2IR and R2ID are finished.
As a starting step, I think it would make sense to train a model to learn to move Gaussian splats around, step by step. Ideally, though, I'd make the splats 3D, so you could sample at genuinely different aspect ratios, not just various re-shapes. I don't yet know how to do that with the coordinate system I've got; that's for later. Related to RIGSIG, I think it may be possible to feed R2ID some bogus coordinates for nonexistent points, for example pixels with coordinates corresponding to many aspect ratios. That way, you diffuse once across all these different aspect ratios, then sample once and pick whichever result you want. Although I'm concerned that this would be a bit messy. Another option is to use the negative frequencies as an actual absolute system: outpainting _is_ adding more information, for example, so that would be nice. Although I'm not really sure how to cleanly tie it all in.

In any case, with that being said, thank you for reading. I'm open to critique, suggestions, and questions. The code is still a bit messy, but with LLMs it should be simple to understand and run by yourself. I'll get around to making it cleaner soon™ once I've finished with the interesting stuff. As always, kind regards.

submitted by /u/Tripel_Meow