feat: implement rae autoencoder. #13046
@@ -0,0 +1,59 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AutoencoderRAE

`AutoencoderRAE` is a representation autoencoder that combines a frozen vision encoder (DINOv2, SigLIP2, or MAE) with a ViT-MAE-style decoder.

Paper: [Diffusion Transformers with Representation Autoencoders](https://huggingface.co/papers/2510.11690).

The model follows the standard diffusers autoencoder API:
- `encode(...)` returns an `EncoderOutput` with a `latent` tensor.
- `decode(...)` returns a `DecoderOutput` with a `sample` tensor.

## Usage
> **Member:** @kashif does this need updating?
```python
import torch
from diffusers import AutoencoderRAE

# Load a converted model from the Hub
model = AutoencoderRAE.from_pretrained(
    "nyu-visionx/RAE-dinov2-wReg-base-ViTXL-n08"
).to("cuda").eval()

# Encode and decode
x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    latents = model.encode(x).latent
    recon = model.decode(latents).sample
```
`encoder_type` supports `"dinov2"`, `"siglip2"`, and `"mae"`. The encoder is built from config
(with random weights) during `__init__`; use `from_pretrained` to load a converted checkpoint
that includes both encoder and decoder weights.
For latent normalization, use `latents_mean` and `latents_std` (matching other diffusers autoencoders).
> **Member:** We should provide an example for this.
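A minimal sketch of what that normalization looks like in practice, assuming per-channel `latents_mean`/`latents_std` values (the numbers below are made up for illustration; the real values come from the model config):

```python
import torch

# Hypothetical per-channel statistics; in practice read them from the
# model config (e.g. model.config.latents_mean / latents_std).
latents_mean = torch.tensor([0.5, -0.1, 0.2, 0.0])
latents_std = torch.tensor([1.2, 0.9, 1.1, 1.0])

def normalize(latents, mean, std):
    # Broadcast per-channel statistics over a (B, C, H, W) latent tensor.
    mean = mean.view(1, -1, 1, 1)
    std = std.view(1, -1, 1, 1)
    return (latents - mean) / std

def denormalize(latents, mean, std):
    # Inverse transform, applied before decoding.
    mean = mean.view(1, -1, 1, 1)
    std = std.view(1, -1, 1, 1)
    return latents * std + mean

x = torch.randn(2, 4, 16, 16)
z = normalize(x, latents_mean, latents_std)
x_back = denormalize(z, latents_mean, latents_std)
assert torch.allclose(x, x_back, atol=1e-5)
```

Normalizing latents to roughly zero mean and unit variance is the usual convention before training a diffusion model on them, and denormalizing restores the scale the decoder expects.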
See `examples/research_projects/autoencoder_rae/train_autoencoder_rae.py` for a stage-1 style training script
(reconstruction and optional encoder-feature losses are computed in the training loop, following diffusers training conventions).

> **Member:** What does stage-2 have? Generation?
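A rough sketch of the losses that loop computes; the function below is illustrative, not the script's actual code, and the argument names and exact feature-loss form are assumptions:

```python
import torch
import torch.nn.functional as F

def stage1_losses(pixels, recon, target_feats, recon_feats, encoder_loss_weight=0.1):
    # Pixel reconstruction loss (`--reconstruction_loss_type` selects the form).
    rec_loss = F.l1_loss(recon, pixels)
    # Optional encoder-feature consistency: frozen-encoder features of the
    # reconstruction should match those of the original image.
    feat_loss = F.mse_loss(recon_feats, target_feats)
    return rec_loss + encoder_loss_weight * feat_loss

# Toy tensors standing in for a batch of images and frozen-encoder features.
pixels = torch.rand(2, 3, 64, 64)
recon = pixels + 0.05 * torch.randn_like(pixels)
feats = torch.randn(2, 196, 768)
loss = stage1_losses(pixels, recon, feats, feats + 0.01 * torch.randn_like(feats))
```

Only the decoder receives gradients from this objective; the encoder stays frozen and merely produces targets for the feature term.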
## AutoencoderRAE class

[[autodoc]] AutoencoderRAE
- encode
- decode
- all

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
@@ -0,0 +1,45 @@
# Training AutoencoderRAE

This example trains the decoder of `AutoencoderRAE` (stage-1 style), while keeping the representation encoder frozen.

It follows the same high-level training recipe as the official RAE stage-1 setup:
- frozen encoder
- train decoder
- pixel reconstruction loss
- optional encoder feature consistency loss

## Quickstart
```bash
accelerate launch examples/research_projects/autoencoder_rae/train_autoencoder_rae.py \
  --train_data_dir /path/to/imagenet_like_folder \
  --output_dir /tmp/autoencoder-rae \
  --resolution 256 \
  --encoder_cls dinov2 \
  --encoder_input_size 224 \
  --patch_size 16 \
  --image_size 256 \
  --decoder_hidden_size 1152 \
  --decoder_num_hidden_layers 28 \
  --decoder_num_attention_heads 16 \
  --decoder_intermediate_size 4096 \
  --train_batch_size 8 \
  --learning_rate 1e-4 \
  --num_train_epochs 10 \
  --report_to wandb \
  --reconstruction_loss_type l1 \
  --use_encoder_loss \
  --encoder_loss_weight 0.1
```

> **Member:** Maybe also include the pretrained encoder path in the example command?
Note: the stage-1 reconstruction loss assumes the target and output have the same spatial size, so `--resolution` must equal `--image_size`.
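A back-of-envelope check of why the sizes must agree, assuming standard ViT patchification (the numbers match the quickstart command above):

```python
# The ViT-MAE-style decoder predicts patch_size * patch_size * channels pixel
# values per patch token, so its output is exactly image_size x image_size.
image_size, patch_size, channels = 256, 16, 3
num_patches = (image_size // patch_size) ** 2           # 16 * 16 patch tokens
pixels_per_patch = patch_size * patch_size * channels   # pixel values per token
# The decoder output covers channels * image_size**2 values; the target crop
# (--resolution) must therefore have the same spatial size.
assert num_patches * pixels_per_patch == channels * image_size ** 2
```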
Dataset format is expected to be `ImageFolder`-compatible:
```text
train_data_dir/
  class_a/
    img_0001.jpg
  class_b/
    img_0002.jpg
```
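As an illustration of how `ImageFolder`-style loaders interpret this layout, here is a hypothetical scanner (`scan_imagefolder` is not the script's actual loader) that maps subdirectories to class indices:

```python
from pathlib import Path
import tempfile

def scan_imagefolder(root):
    # ImageFolder convention: one subdirectory per class, labels assigned
    # by sorted directory name.
    root = Path(root)
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (str(path), class_to_idx[cls])
        for cls in classes
        for path in sorted((root / cls).glob("*.jpg"))
    ]
    return samples, class_to_idx

# Build the toy layout from the text above and scan it.
tmp = Path(tempfile.mkdtemp())
(tmp / "class_a").mkdir(); (tmp / "class_a" / "img_0001.jpg").touch()
(tmp / "class_b").mkdir(); (tmp / "class_b" / "img_0002.jpg").touch()
samples, mapping = scan_imagefolder(tmp)
# samples: [("<tmp>/class_a/img_0001.jpg", 0), ("<tmp>/class_b/img_0002.jpg", 1)]
```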
> Cc: @stevhliu. Could you leave suggestions on the docs?