Llama Can Now Generate Images: HKU and ByteDance Release Open-Source Autoregressive Text-to-Image Model

Media & Embodied AI · Published: Jul 03, 2024 · Priya Sharma · ~6 min read

Author

Priya Sharma · Enterprise AI & Governance Editor

Regulation, enterprise adoption, and what teams should verify before they deploy.

I see a significant shift for generative AI governance now that researchers from the University of Hong Kong and ByteDance have challenged the dominance of diffusion models. They have open-sourced LlamaGen, an autoregressive image generation model built on the Llama architecture that reportedly outperforms mainstream diffusion models such as LDM and DiT on ImageNet. The burden of proof now falls on enterprises to verify whether the efficiency gains justify the compliance risks of a less mature generation method.

I read the technical claims with interest. Diffusion models have dominated the landscape for some time, and this approach marks a return to plain autoregressive architectures. The GitHub repository has already garnered nearly 900 stars, which indicates strong industry curiosity about whether a Llama-style model can generate images effectively from discrete image tokens. My desk is tracking how quickly legal teams will assess the liability of these open-source alternatives against established proprietary tools.

△ LlamaGen image generation examples: the first row shows class-conditioned generation, and the second row shows text-to-image generation.

I followed the release details to understand how autoregressive models achieve this parity with diffusion techniques. The authors argue that their method proves autoregressive architectures can still deliver highly competitive performance when properly engineered. I believe enterprises should verify the specific benchmarks used before assuming these results translate directly to production-grade content safety standards.

The Shift to Autoregressive Image Generation

I see a clear pivot in how the open-source community approaches image synthesis. For years, open-source autoregressive models were stuck at an FID of roughly 15 (lower is better), the score VQ-GAN achieved on ImageNet back in 2020. Closed systems had already moved well past that: ViT-VQGAN pushed the metric to approximately 3.0 by 2021, and DALL-E 1 and Parti proved the potential of autoregressive text-to-image generation long before this release.

Because those earlier breakthroughs remained closed-source, the HKU and ByteDance research team decided to build an open-source base autoregressive image model from scratch. Their goal was not just replication, but architectural transparency. They identified three pillars for success in modern generation: robust image compressors (tokenizers), scalable generation models, and high-quality training data.

To achieve this, they retained the CNN architecture used by VQ-GAN to discretize continuous images into tokens. However, I noted their refined understanding of the Image Tokenizer compared to 2020 standards:

An excellent tokenizer requires a larger codebook size and lower codebook vector dimension. Meanwhile, better image reconstruction necessitates a greater number of tokens.

△ VQ-GAN architecture (not part of this project).
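To make the tokenizer trade-off concrete, here is a minimal sketch of the two ideas involved: the token count grows as the downsampling ratio shrinks, and each encoder output vector is mapped to the index of its nearest codebook entry. The toy codebook, the downsampling ratios, and the function names are illustrative assumptions, not values or code taken from the LlamaGen repository.

```python
import math


def tokens_per_image(image_size, downsample):
    """Number of discrete tokens for a square image on a downsampled grid."""
    side = image_size // downsample
    return side * side


def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def bits_per_image(num_tokens, codebook_size):
    """Upper bound on information carried by the token grid, in bits."""
    return num_tokens * math.log2(codebook_size)


# Toy codebook: 4 entries of dimension 2 (illustrative only; a real
# codebook would hold thousands of low-dimensional vectors).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(quantize([0.9, 0.2], codebook))   # index of the nearest entry
print(tokens_per_image(256, 16))        # coarser grid, fewer tokens
print(tokens_per_image(256, 8))         # finer grid, more tokens
```

Halving the assumed downsampling ratio quadruples the number of tokens, which is why better reconstruction comes at the cost of longer sequences for the generation model.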

Architecturally, LlamaGen is rooted in the Llama language model. It incorporates Pre-Normalization via RMSNorm, SwiGLU activations, and RoPE embeddings. While techniques like AdaLN could theoretically boost performance, I observed that the authors deliberately kept the structure identical to the Llama language model for consistency.
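As a rough illustration of two of the components named above, the following sketch implements RMSNorm and a SwiGLU feed-forward layer inside a pre-normalization residual block, using plain Python lists. The weight matrices and function names are hypothetical, and RoPE and attention are left out for brevity; this is not the authors' implementation.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of x, then scale per channel."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize first, then add the sublayer output to x."""
    normed = rms_norm(x, norm_weight)
    out = swiglu_ffn(normed, w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, out)]


# Tiny example with made-up weights (model width 2, hidden width 2).
x = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(rms_norm(x, [1.0, 1.0]))
print(pre_norm_ffn_block(x, [1.0, 1.0], identity, identity, identity))
```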

For both class-conditional and text-to-image tasks, they implemented a straightforward mechanism: class or text embeddings serve as the starting tokens, and the image tokens that follow are generated through next-token prediction. Because this is the same decoding loop a language model uses, the model can plug directly into existing LLM serving frameworks such as vLLM. The authors report that this integration delivered a 326%-414% speedup for LlamaGen at inference time.
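The conditioning scheme described above can be sketched as an ordinary sampling loop: start from the condition, then append one sampled image token at a time. In this sketch `toy_logits` is a hypothetical stand-in for the transformer, and the condition is represented as a token id sharing the image vocabulary purely to keep the toy small; a real model would use class or text embeddings and a much larger codebook.

```python
import math
import random


def sample_from_logits(logits, rng, temperature=1.0):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def generate_image_tokens(condition_tokens, next_token_logits,
                          num_image_tokens, seed=0):
    """Autoregressively extend the condition prefix with image tokens."""
    rng = random.Random(seed)
    sequence = list(condition_tokens)
    for _ in range(num_image_tokens):
        logits = next_token_logits(sequence)
        sequence.append(sample_from_logits(logits, rng))
    # Only the generated part would be passed to the tokenizer's decoder.
    return sequence[len(condition_tokens):]


def toy_logits(sequence, vocab_size=8):
    """Hypothetical model: favors (last token + 1) modulo the vocabulary."""
    favored = (sequence[-1] + 1) % vocab_size
    return [5.0 if i == favored else 0.0 for i in range(vocab_size)]


image_tokens = generate_image_tokens([3], toy_logits, num_image_tokens=16)
print(len(image_tokens), image_tokens)
```

Since the loop is identical in shape to text decoding, serving optimizations built for language models can in principle be reused without changes to the sampling logic.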

I think the speed gains via vLLM are significant for enterprise inference costs. My sense is that captioning training data with a vision-language model such as LLaVA, as the authors do in their fine-tuning stage, can introduce bias into what the model learns to treat as aesthetic. Enterprises should verify whether the "high-aesthetic" internal data meets their own compliance standards.

Data Curation and Training Stages

The training regimen is split into two distinct phases, each with specific data constraints. In the first stage, the model was trained on a 50-million-pair subset of LAION-COCO at a resolution of 256×256. The original dataset contained 600 million image-text pairs, and the authors filtered it on five criteria: valid image URLs, aesthetic scores, watermark detection, CLIP text-image similarity, and minimum image size.
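For teams that want to reproduce or audit this kind of filtering, a minimal sketch of the five criteria might look like the following. Every threshold and field name here is a hypothetical placeholder, since the article does not give the cutoffs the authors used; the point is that recording a rejection reason per sample makes the filter auditable.

```python
from dataclasses import dataclass


@dataclass
class ImageTextPair:
    url_is_valid: bool
    aesthetic_score: float
    watermark_probability: float
    clip_similarity: float
    width: int
    height: int


# Hypothetical thresholds, chosen only for illustration.
MIN_AESTHETIC_SCORE = 5.0
MAX_WATERMARK_PROBABILITY = 0.5
MIN_CLIP_SIMILARITY = 0.3
MIN_IMAGE_SIDE = 256


def rejection_reasons(pair):
    """Return the list of criteria a sample fails (empty means keep)."""
    reasons = []
    if not pair.url_is_valid:
        reasons.append("invalid_url")
    if pair.aesthetic_score < MIN_AESTHETIC_SCORE:
        reasons.append("low_aesthetic")
    if pair.watermark_probability > MAX_WATERMARK_PROBABILITY:
        reasons.append("watermark")
    if pair.clip_similarity < MIN_CLIP_SIMILARITY:
        reasons.append("low_clip_similarity")
    if min(pair.width, pair.height) < MIN_IMAGE_SIDE:
        reasons.append("too_small")
    return reasons


def keep(pair):
    """A sample is kept only if it passes every criterion."""
    return not rejection_reasons(pair)


good = ImageTextPair(True, 6.1, 0.05, 0.34, 512, 384)
bad = ImageTextPair(True, 4.2, 0.80, 0.34, 512, 200)
print(keep(good), rejection_reasons(bad))
```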

The second stage involved fine-tuning on 10 million internal high-aesthetic-quality images, increasing the resolution to 512×512. Crucially, the text descriptions for these curated images were not manually written; they were generated by LLaVA. This reliance on synthetic captions for fine-tuning data is a detail I flagged as critical for governance teams to understand when assessing provenance.

This two-stage process highlights the burden of proof on data quality. By moving from a massive, filtered public dataset to a smaller, internally curated set with AI-generated captions, the authors prioritized aesthetic alignment over raw volume. I believe enterprises adopting this model should audit how LLaVA’s biases might influence the final image outputs.

Performance Comparable to Diffusion Models

I read the performance data closely, and what stands out is that the retrained Image Tokenizer beats previous tokenizers on ImageNet and COCO, including VQGAN, ViT-VQGAN, and MaskGIT. This matters because it indicates that discrete representations can match continuous ones, such as the SD VAE used in diffusion architectures. I see this as a critical step: image quantization no longer appears to be the bottleneck for reconstruction quality.

In actual generation tests on the ImageNet dataset, LlamaGen showed strong competitiveness across FID, IS, Precision, and Recall metrics. Notably, the LlamaGen-3B model outperforms popular diffusion models like LDM and DiT. I follow this release with interest because it suggests even basic autoregressive architectures can serve as foundations for advanced image generation systems.

Additionally, compared to previous autoregressive models, LlamaGen surpasses earlier models across various parameter scales. The authors attribute this success to the improved Image Tokenizer and the better scalability of the Llama architecture.
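Since FID is the headline metric in these comparisons, a simplified sketch of what it measures may help: the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images, where lower means closer. The version below assumes diagonal covariances so that it runs with the standard library alone; the real metric uses Inception network features and full covariance matrices, so this is an illustration of the formula rather than a substitute for a benchmark tool.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and population variance of feature vectors."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n
                 for d in range(dims)]
    return means, variances


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Frechet distance between Gaussians with diagonal covariances.

    With diagonal covariances, Tr(A + B - 2 * sqrt(A B)) reduces to
    the sum over dimensions of (sqrt(a) - sqrt(b)) ** 2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                   for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Made-up 2-D "features" standing in for real and generated images.
real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
generated = [[0.5, 1.0], [2.5, 3.5], [4.5, 5.0]]

real_stats = mean_and_variance(real)
generated_stats = mean_and_variance(generated)
print(frechet_distance_diag(*real_stats, *generated_stats))
```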

In text-to-image generation, the first stage of training gives the model image-text alignment capabilities, but visual quality needs improvement. The second stage significantly enhanced this quality. I read that this stems from using high-quality aesthetic images and increasing resolution from 256×256 to 512×512. Higher resolutions clearly lead to better visual effects in this context.

When provided with longer text inputs, LlamaGen can also generate images that combine both image-text alignment and high visual quality.

However, the authors acknowledge that, measured against the development trajectory of diffusion models, LlamaGen is currently only at a stage equivalent to Stable Diffusion v1. Future directions follow the path diffusion took afterwards: higher resolution and more aspect ratios (as in SDXL), greater controllability (as in ControlNet), and video generation (as in Sora). From the perspective of multimodal large models, autoregressive models have already been shown to handle understanding and generation as separate tasks. The next step is joint training within a single model.

I think enterprises should verify whether discrete tokenization actually reduces inference costs compared to diffusion in their own workloads. My sense is that the SD v1 equivalence means this model is not yet ready for high-stakes commercial production. I also see governance risks in open-sourcing models that mimic proprietary capabilities without comparable controls. Going forward, teams should watch how joint training affects output safety and alignment over time.

The project is currently open-sourced and supports online demos. Interested parties are encouraged to try it out.

Online Demo: https://huggingface.co/spaces/FoundationVision/LlamaGen
Paper: https://arxiv.org/abs/2406.06525
Project Homepage: https://peizesun.github.io/llamagen/
GitHub: https://github.com/FoundationVision/LlamaGen
Hugging Face: https://huggingface.co/FoundationVision/LlamaGen