2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -456,6 +456,8 @@
title: AutoencoderKLQwenImage
- local: api/models/autoencoder_kl_wan
title: AutoencoderKLWan
- local: api/models/autoencoder_rae
title: AutoencoderRAE
- local: api/models/consistency_decoder_vae
title: ConsistencyDecoderVAE
- local: api/models/autoencoder_oobleck
59 changes: 59 additions & 0 deletions docs/source/en/api/models/autoencoder_rae.md
@@ -0,0 +1,59 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AutoencoderRAE

`AutoencoderRAE` is a representation autoencoder that combines a frozen vision encoder (DINOv2, SigLIP2, or MAE) with a ViT-MAE-style decoder.

Paper: [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).

The model follows the standard diffusers autoencoder API:
- `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
- `decode(...)` returns a `DecoderOutput` with a `sample` tensor.
> **Review comment (lines +15 to +21):** Cc: @stevhliu. Could you leave suggestions on the docs?


## Usage
> **Review comment:** @kashif does this need updating?


```python
import torch
from diffusers import AutoencoderRAE

# Load a converted model from the Hub
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Encode and decode
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    latents = model.encode(x).latent
    recon = model.decode(latents).sample
```

`encoder_type` supports `"dinov2"`, `"siglip2"`, and `"mae"`. The encoder is built from config
(with random weights) during `__init__`; use `from_pretrained` to load a converted checkpoint
that includes both encoder and decoder weights.

For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).
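For illustration, here is a minimal normalization sketch in plain `torch`. The statistics below are placeholders; in practice they come from the converted checkpoint's config (per-channel `latents_mean` / `latents_std`).

```python
import torch

# Placeholder latents and statistics; real values come from the model config
# of a converted checkpoint, not from these zero/one tensors.
latents = torch.randn(1, 32, 16, 16)
latents_mean = torch.zeros(1, 32, 1, 1)
latents_std = torch.ones(1, 32, 1, 1)

# Normalize latents before diffusion training/sampling ...
normalized = (latents - latents_mean) / latents_std
# ... and invert the transform before decoding.
restored = normalized * latents_std + latents_mean
```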
> **Review comment:** We should provide an example for this.


See `examples/research_projects/autoencoder_rae/train_autoencoder_rae.py` for a stage-1 style training script (reconstruction and optional encoder-feature losses are computed in the training loop, following diffusers training conventions).

> **Review comment:** What does stage-2 have? Generation?
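As a hedged sketch of those losses in plain `torch` (the tensors and the encoder-loss weight here are stand-ins, not the script's actual variables):

```python
import torch
import torch.nn.functional as F

# Stand-in tensors; the script uses real decoder outputs and features from the
# frozen representation encoder.
recon = torch.randn(2, 3, 256, 256)    # decoder output
target = torch.randn(2, 3, 256, 256)   # input pixels

# Pixel reconstruction loss (l1 here; l2 is the alternative).
rec_loss = F.l1_loss(recon, target)

# Optional encoder feature consistency: compare frozen-encoder features of the
# reconstruction against those of the original image.
feats_recon = torch.randn(2, 197, 768)   # stand-in for encoder(recon)
feats_target = torch.randn(2, 197, 768)  # stand-in for encoder(target)
enc_loss = F.mse_loss(feats_recon, feats_target)

loss = rec_loss + 0.1 * enc_loss  # illustrative encoder-loss weight
```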

## AutoencoderRAE class

[[autodoc]] AutoencoderRAE
- encode
- decode
- all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
45 changes: 45 additions & 0 deletions examples/research_projects/autoencoder_rae/README.md
@@ -0,0 +1,45 @@
# Training AutoencoderRAE

This example trains the decoder of `AutoencoderRAE` (stage-1 style), while keeping the representation encoder frozen.

It follows the same high-level training recipe as the official RAE stage-1 setup:
- frozen encoder
- train decoder
- pixel reconstruction loss
- optional encoder feature consistency loss
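The frozen-encoder / trainable-decoder split above can be sketched with placeholder modules (the script operates on the real `AutoencoderRAE` submodules instead):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the RAE encoder and decoder.
encoder = nn.Linear(8, 4)
decoder = nn.Linear(4, 8)

# Freeze the representation encoder ...
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# ... and optimize only the decoder.
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
```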

## Quickstart

```bash
accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/autoencoder-rae \
  --resolution 256 \
  --encoder_cls dinov2 \
  --encoder_input_size 224 \
  --patch_size 16 \
  --image_size 256 \
  --decoder_hidden_size 1152 \
  --decoder_num_hidden_layers 28 \
  --decoder_num_attention_heads 16 \
  --decoder_intermediate_size 4096 \
  --train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 10 \
  --report_to wandb \
  --reconstruction_loss_type l1 \
  --use_encoder_loss \
  --encoder_loss_weight 0.1
```

> **Review comment:** Maybe also include the pretrained encoder path in the example command?

Note: stage-1 reconstruction loss assumes matching target/output spatial size, so `--resolution` must equal `--image_size`.

Dataset format is expected to be `ImageFolder`-compatible:

```text
train_data_dir/
  class_a/
    img_0001.jpg
  class_b/
    img_0002.jpg
```
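A quick standard-library sketch of building and checking that layout (empty placeholder files stand in for real images):

```python
import pathlib
import tempfile

# Build a tiny ImageFolder-style tree in a temporary directory.
root = pathlib.Path(tempfile.mkdtemp()) / "train_data_dir"
for cls, name in [("class_a", "img_0001.jpg"), ("class_b", "img_0002.jpg")]:
    (root / cls).mkdir(parents=True)
    (root / cls / name).touch()

# ImageFolder-style loaders derive the class labels from subdirectory names.
classes = sorted(p.name for p in root.iterdir() if p.is_dir())
print(classes)  # ['class_a', 'class_b']
```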