LayoutDETR:
Detection Transformer Is a Good Multimodal Layout Designer

ECCV 2024


Ning Yu      Chia-Chih Chen      Zeyuan Chen      Rui Meng     
Gang Wu      Paul Josel      Juan Carlos Niebles      Caiming Xiong      Ran Xu
Salesforce Research

Abstract


Graphic layout designs play an essential role in visual communication. Yet handcrafting layouts is skill-demanding, time-consuming, and non-scalable for batch production. Generative models have emerged to make design automation scalable, but it remains non-trivial to produce designs that comply with designers' multimodal requirements, i.e., constrained by background images and driven by foreground content. We propose LayoutDETR, which inherits the high quality and realism of generative modeling while reformulating content-aware layout requirements as a detection problem: we learn to detect, in a background image, reasonable locations, scales, and spatial relations for the multimodal foreground elements of a layout. Our solution sets a new state of the art for layout generation on public benchmarks and on our newly curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins.
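The detection formulation above predicts one bounding box per foreground element (e.g., logo, headline, button) over the background image. As a minimal illustrative sketch, not the paper's implementation, the snippet below assumes the DETR-style convention of normalized center/size boxes; the function name `decode_boxes` and the example layout are hypothetical:

```python
# Hypothetical sketch: a DETR-style layout head outputs, per foreground
# element, a normalized box (cx, cy, w, h) with all values in [0, 1].
# Decoding against the background image size yields a pixel-space layout.

def decode_boxes(norm_boxes, img_w, img_h):
    """Convert normalized (cx, cy, w, h) boxes to pixel (x0, y0, x1, y1)."""
    out = []
    for cx, cy, w, h in norm_boxes:
        bw, bh = w * img_w, h * img_h          # box size in pixels
        x0 = cx * img_w - bw / 2               # top-left corner from center
        y0 = cy * img_h - bh / 2
        out.append((round(x0), round(y0), round(x0 + bw), round(y0 + bh)))
    return out

# e.g., a headline centered in the upper region of a 600x400 banner:
layout = decode_boxes([(0.5, 0.2, 0.8, 0.15)], img_w=600, img_h=400)
```

The normalized parameterization keeps box regression independent of image resolution, which is the standard choice in DETR-style detectors.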

Results

Qualitative results



Quantitative results



Video


Materials




Paper



Poster

Code

Citation

@inproceedings{yu2024layoutdetr,
  title={LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer},
  author={Yu, Ning and Chen, Chia-Chih and Chen, Zeyuan and Meng, Rui and Wu, Gang and Josel, Paul and Niebles, Juan Carlos and Xiong, Caiming and Xu, Ran},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}

Acknowledgement


We thank Shu Zhang, Silvio Savarese, Abigail Kutruff, Brian Brechbuhl, Elham Etemad, and Amrutha Krishnan from Salesforce for constructive advice.

Related work


A taxonomy of related work and our method

K. Kikuchi, E. Simo-Serra, M. Otani, K. Yamaguchi. Constrained graphic layout generation via latent optimization. MM 2021.
Comment: A layout GAN baseline method that is used as our GAN backbone.
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko. End-to-end object detection with transformers. ECCV 2020.
Comment: An object detection architecture that is used for our background image understanding and bounding box generation.
M. Zhou, C. Xu, Y. Ma, T. Ge, Y. Jiang, W. Xu. Composition-aware Graphic Layout GAN for Visual-textual Presentation Designs. IJCAI 2022.
Comment: A GAN- and DETR-based baseline method for graphic layout design. Their CGL dataset is used in our experimental benchmarking.
Y. Cao, Y. Ma, M. Zhou, C. Liu, H. Xie, T. Ge, Y. Jiang. Geometry Aligned Variational Transformer for Image-conditioned Layout Generation. MM 2022.
Comment: A VAE- and DETR-based baseline method for graphic layout design.
Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, A. Kovashka. Automatic understanding of image and video advertisements. CVPR 2017.
Comment: An ad banner dataset that is used to benchmark our experiments.
G. Li, G. Baechler, M. Tragut, Y. Li. Learning to denoise raw mobile UI layouts for improving datasets at scale. CHI 2022.
Comment: A mobile application UI dataset that is used to benchmark our experiments.