platers a day ago

I'm struggling to understand where the gains are coming from. What is the intuition for why DiT training was so inefficient?

  • joshred a day ago

    This is the high-level explanation of the simplest diffusion architecture. The model trains by taking an image and iteratively adding noise to it until only noise remains. Then they take that sequence of noisier and noisier images and reverse it: starting from pure noise, the model predicts the noise to remove at each step until it reaches the final step, which should be the original image (the training input).

    That process means they may require a hundred or more training iterations on a single image. I haven't digested the paper, but it sounds like they are proposing something conceptually similar to skip layers (but significantly more involved).
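    For reference, here's a minimal sketch of a standard DDPM-style training step along those lines (PyTorch; `model`, the noise schedule, and the shapes are illustrative placeholders, not anything taken from the paper):

        import torch

        # Toy noise schedule for a DDPM-style setup (values are illustrative).
        T = 1000
        betas = torch.linspace(1e-4, 0.02, T)
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

        def training_step(model, x0):
            # x0: clean images, shape (batch, channels, height, width)
            b = x0.shape[0]
            t = torch.randint(0, T, (b,))                          # random timestep per image
            noise = torch.randn_like(x0)                           # Gaussian noise to add
            a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # noised image at step t
            pred = model(x_t, t)                                   # network predicts the added noise
            return torch.nn.functional.mse_loss(pred, noise)       # epsilon-prediction loss

    In practice a single random timestep is drawn per image per training step, but each image gets revisited at many different noise levels over the course of training.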

earthnail a day ago

Wow, Ommer’s students never fail to impress. 37x faster for a generic architecture, i.e. no domain-specific tricks. Insane.

arjvik a day ago

Isn't this just Mixture-of-Depths but for DiTs?

If so, what are the DiT specific changes that needed to be made?

  • yorwba a day ago

    Mixture-of-Depths trains the model to choose different numbers of layers for different tokens to reduce inference compute. This method is more like stochastic depth / layer dropout, where whether the intermediate layers are skipped for a token is random and independent of the token's value, and it's only used as a training optimization. As far as I can tell, during inference all tokens are always processed by all layers.
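    A rough sketch of that kind of per-token random layer skipping (PyTorch; the class name, drop rate, and masking shortcut are illustrative assumptions, not the paper's actual routing):

        import torch
        import torch.nn as nn

        class StochasticDepthStack(nn.Module):
            """Intermediate layers are randomly skipped per token during training,
            independently of the token values; at inference every token is
            processed by every layer."""
            def __init__(self, blocks, drop_rate=0.5):
                super().__init__()
                self.blocks = nn.ModuleList(blocks)
                self.drop_rate = drop_rate

            def forward(self, x):
                # x: (batch, tokens, dim)
                for i, block in enumerate(self.blocks):
                    is_intermediate = 0 < i < len(self.blocks) - 1
                    if self.training and is_intermediate:
                        # Random per-token mask, independent of token content.
                        keep = torch.rand(x.shape[:2], device=x.device) > self.drop_rate
                        # Skipped tokens pass through unchanged (identity path).
                        # Masking instead of gathering keeps the sketch short, so this
                        # version does not actually save compute; a real implementation
                        # would process only the kept tokens.
                        x = torch.where(keep.unsqueeze(-1), block(x), x)
                    else:
                        x = block(x)
                return x

        # Usage: six standard transformer encoder layers as the blocks.
        blocks = [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                  for _ in range(6)]
        stack = StochasticDepthStack(blocks, drop_rate=0.5)
        out = stack(torch.randn(2, 16, 64))    # (batch=2, tokens=16, dim=64)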

lucidrains a day ago

very nice, will have to try it out! this is the same research group from which Robin Rombach (of stable diffusion fame) originated