Unprecedented Photorealism: Imagen is a text-to-image diffusion model that generates images with a high degree of photorealism.
Deep Language Understanding: The model relies on large transformer language models to understand the input text and on diffusion models to generate high-fidelity images from it.
State-of-the-art Performance: Imagen achieves a state-of-the-art FID score of 7.27 on the COCO dataset (lower is better) without ever training on COCO, and human raters prefer it over other models in side-by-side comparisons of sample quality and image-text alignment.
DrawBench Benchmark: To assess text-to-image models in greater depth, the team introduced DrawBench, a comprehensive and challenging benchmark.
Image Generation: Imagen can generate photorealistic images from input text, enabling a wide range of applications in visual content creation.
Benchmarking: DrawBench can be used to evaluate and compare different text-to-image models.
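The FID score cited above compares the statistics of feature vectors extracted from real and generated images: a Gaussian is fitted to each set and the Fréchet distance between the two Gaussians is reported, so lower values mean the generated distribution is closer to the real one. The sketch below is a minimal illustration of that distance, not the evaluation code used for Imagen. In practice the features usually come from a pretrained Inception network; here random vectors stand in for them, and the function name and test data are assumptions made for the example.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim) with feature_dim >= 2.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Tr((sigma_r sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negatives caused by round-off.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)


# Hypothetical stand-in features (real FID uses image embeddings).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(2000, 8))     # "real image" features
close = rng.normal(0.0, 1.0, size=(2000, 8))    # same distribution
shifted = rng.normal(0.5, 1.0, size=(2000, 8))  # shifted mean

fd_close = frechet_distance(real, close)
fd_shifted = frechet_distance(real, shifted)
print(f"same distribution: {fd_close:.3f}, shifted: {fd_shifted:.3f}")
```

The shifted set should score noticeably worse than the matching one, which mirrors how a lower FID indicates generated images whose feature statistics sit closer to those of real images.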
Limitations and Societal Impact
Ethical Challenges: The potential misuse of text-to-image models raises serious ethical concerns. To mitigate these risks, the team decided not to release the code or a public demo.
Data Requirements: The reliance on large, mostly uncurated, web-scraped datasets can lead to the propagation of social stereotypes and harmful associations.
Bias: Imagen inherits the social biases and limitations of large language models, and there is a risk that it has encoded harmful stereotypes and representations.
Mode Dropping: Imagen may drop modes of the data distribution, that is, fail to generate some kinds of content present in the training data, which can compound the social consequences of dataset bias.