2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -456,6 +456,8 @@
title: AutoencoderKLQwenImage
- local: api/models/autoencoder_kl_wan
title: AutoencoderKLWan
- local: api/models/autoencoder_rae
title: AutoencoderRAE
- local: api/models/consistency_decoder_vae
title: ConsistencyDecoderVAE
- local: api/models/autoencoder_oobleck
59 changes: 59 additions & 0 deletions docs/source/en/api/models/autoencoder_rae.md
@@ -0,0 +1,59 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AutoencoderRAE

`AutoencoderRAE` is a representation autoencoder that combines a frozen vision encoder (DINOv2, SigLIP2, or MAE) with a ViT-MAE-style decoder.

Paper: [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).

The model follows the standard diffusers autoencoder API:
- `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
- `decode(...)` returns a `DecoderOutput` with a `sample` tensor.
> **Review comment (lines +15 to +21):** Cc: @stevhliu. Could you leave suggestions on the docs?


## Usage
> **Review comment:** @kashif does this need updating?


```python
import torch
from diffusers import AutoencoderRAE

# Load a converted model from the Hub
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Encode and decode
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    latents = model.encode(x).latent
    recon = model.decode(latents).sample
```

`encoder_type` supports `"dinov2"`, `"siglip2"`, and `"mae"`. The encoder is built from config
(with random weights) during `__init__`; use `from_pretrained` to load a converted checkpoint
that includes both encoder and decoder weights.

For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).
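For illustration, here is a minimal normalization sketch in plain `torch`. The statistics below are placeholders; in practice they come from the converted checkpoint's config (per-channel `latents_mean` / `latents_std`).

```python
import torch

# Placeholder latents and statistics; real values come from the model config
# of a converted checkpoint, not from these zero/one tensors.
latents = torch.randn(1, 32, 16, 16)
latents_mean = torch.zeros(1, 32, 1, 1)
latents_std = torch.ones(1, 32, 1, 1)

# Normalize latents before diffusion training/sampling ...
normalized = (latents - latents_mean) / latents_std
# ... and invert the transform before decoding.
restored = normalized * latents_std + latents_mean
```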
> **Review comment:** We should provide an example for this.


See `examples/research_projects/autoencoder_rae/train_autoencoder_rae.py` for a stage-1 style training script (reconstruction and optional encoder-feature losses are computed in the training loop, following diffusers training conventions).

> **Review comment:** What does stage-2 have? Generation?
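As a hedged sketch of those losses in plain `torch` (the tensors and the encoder-loss weight here are stand-ins, not the script's actual variables):

```python
import torch
import torch.nn.functional as F

# Stand-in tensors; the script uses real decoder outputs and features from the
# frozen representation encoder.
recon = torch.randn(2, 3, 256, 256)    # decoder output
target = torch.randn(2, 3, 256, 256)   # input pixels

# Pixel reconstruction loss (l1 here; l2 is the alternative).
rec_loss = F.l1_loss(recon, target)

# Optional encoder feature consistency: compare frozen-encoder features of the
# reconstruction against those of the original image.
feats_recon = torch.randn(2, 197, 768)   # stand-in for encoder(recon)
feats_target = torch.randn(2, 197, 768)  # stand-in for encoder(target)
enc_loss = F.mse_loss(feats_recon, feats_target)

loss = rec_loss + 0.1 * enc_loss  # illustrative encoder-loss weight
```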

## AutoencoderRAE class

[[autodoc]] AutoencoderRAE
- encode
- decode
- all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
45 changes: 45 additions & 0 deletions examples/research_projects/autoencoder_rae/README.md
@@ -0,0 +1,45 @@
# Training AutoencoderRAE

This example trains the decoder of `AutoencoderRAE` (stage-1 style), while keeping the representation encoder frozen.

It follows the same high-level training recipe as the official RAE stage-1 setup:
- frozen encoder
- train decoder
- pixel reconstruction loss
- optional encoder feature consistency loss
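The frozen-encoder / trainable-decoder split above can be sketched with placeholder modules (the script operates on the real `AutoencoderRAE` submodules instead):

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the RAE encoder and decoder.
encoder = nn.Linear(8, 4)
decoder = nn.Linear(4, 8)

# Freeze the representation encoder ...
for p in encoder.parameters():
    p.requires_grad_(False)
encoder.eval()

# ... and optimize only the decoder.
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)
```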

## Quickstart

```bash
accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/autoencoder-rae \
  --resolution 256 \
  --encoder_cls dinov2 \
  --encoder_input_size 224 \
  --patch_size 16 \
  --image_size 256 \
  --decoder_hidden_size 1152 \
  --decoder_num_hidden_layers 28 \
  --decoder_num_attention_heads 16 \
  --decoder_intermediate_size 4096 \
  --train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 10 \
  --report_to wandb \
  --reconstruction_loss_type l1 \
  --use_encoder_loss \
  --encoder_loss_weight 0.1
```

> **Review comment:** Maybe also include the pretrained encoder path in the example command?

Note: stage-1 reconstruction loss assumes matching target/output spatial size, so `--resolution` must equal `--image_size`.

Dataset format is expected to be `ImageFolder`-compatible:

```text
train_data_dir/
  class_a/
    img_0001.jpg
  class_b/
    img_0002.jpg
```
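A quick standard-library sketch of building and checking that layout (empty placeholder files stand in for real images):

```python
import pathlib
import tempfile

# Build a tiny ImageFolder-style tree in a temporary directory.
root = pathlib.Path(tempfile.mkdtemp()) / "train_data_dir"
for cls, name in [("class_a", "img_0001.jpg"), ("class_b", "img_0002.jpg")]:
    (root / cls).mkdir(parents=True)
    (root / cls / name).touch()

# ImageFolder-style loaders derive the class labels from subdirectory names.
classes = sorted(p.name for p in root.iterdir() if p.is_dir())
print(classes)  # ['class_a', 'class_b']
```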