DALL-E - Architecture



DALL-E is an AI model that generates images from the textual description given by the user. It is part of the GPT (Generative Pre-trained Transformer) family and uses a transformer-based architecture to create visual content.

DALL-E mainly depends on the following technologies −

  • Natural Language Processing (NLP) − It helps the model understand the meaning of the text description given by the user.
  • Large Language Model (LLM) − It encodes text and images into representations that capture their semantic information. OpenAI developed its own model for this purpose, called CLIP, which is a core part of DALL-E.
  • Diffusion Model − This is the component that actually generates the images.

Contrastive Language-Image Pretraining (CLIP)

CLIP is a model developed by OpenAI that plays a central role in DALL-E. It is trained on a large dataset of images with associated captions to bridge the gap between textual descriptions and images. As the "contrastive" in its name suggests, the model compares the given text prompt with the captions of existing images in the dataset to check whether the input matches any image caption. Every image-caption pair is assigned a similarity score, and the pair with the highest similarity score is picked. To perform this task, the model relies on two components −

  • Text Encoder − It converts the user's text prompt into a text embedding, a vector of numerical values that DALL-E can work with.
  • Image Encoder − Similar to the text encoder, this component converts images into image embeddings.

The model then compares the text and image embeddings and measures how closely their semantic information matches, using a metric called cosine similarity. The representation below should help you understand this better −

DALL-E CLIP Architecture
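
As a rough illustration of this matching step, the sketch below scores a text embedding against a set of image embeddings using cosine similarity and picks the best-matching pair. The embeddings here are random stand-ins for what CLIP's text and image encoders would actually produce; only the similarity computation mirrors the process described above.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: in CLIP these would come from the text
# encoder and the image encoder; here they are random stand-ins.
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=512)          # embedding of the user's prompt
image_embeddings = {                           # embeddings of captioned images
    "a cat on a sofa": rng.normal(size=512),
    "a dog in a park": rng.normal(size=512),
    "a bowl of fruit": rng.normal(size=512),
}

# Score every image-caption pair and pick the one with the highest similarity.
scores = {caption: cosine_similarity(text_embedding, emb)
          for caption, emb in image_embeddings.items()}
best_caption = max(scores, key=scores.get)
print(scores)
print("Best match:", best_caption)
```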

Working of DALL-E

DALL-E processes the input text and transforms it, step by step, into the representation needed to generate an image.

DALL-E Working

The workflow of the model is described below −

  • Once the textual description for an image is provided, it is passed to CLIP's text encoder. The meaning of the prompt is understood using NLP, and it is converted into a high-dimensional vector representation that captures its semantic meaning. This vector representation is called the text embedding.
  • Next, the text embedding is passed to the prior, a generative model that samples from a probability distribution to produce a corresponding image embedding.
  • In the final step, the image embedding produced by the prior is passed through a diffusion decoder, which generates the final image, as sketched after this list.
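
The sketch below strings the three stages together in the same order as the workflow above. All three functions are hypothetical stand-ins (the real text encoder, prior, and diffusion decoder are large neural networks); the sketch only shows how data flows from prompt to text embedding, to image embedding, to final image.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(prompt: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: prompt -> text embedding."""
    return rng.normal(size=512)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: text embedding -> image embedding.
    The real prior samples from a learned distribution; adding noise
    here only mimics that sampling step."""
    return text_embedding + rng.normal(scale=0.1, size=text_embedding.shape)

def diffusion_decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: image embedding -> pixels.
    The real decoder iteratively denoises random noise conditioned on
    the image embedding; here we simply return a random RGB image."""
    return rng.random(size=(256, 256, 3))

# The three stages chained together, mirroring the workflow above.
prompt = "an armchair in the shape of an avocado"
text_emb = text_encoder(prompt)        # step 1: text -> text embedding
image_emb = prior(text_emb)            # step 2: text embedding -> image embedding
image = diffusion_decoder(image_emb)   # step 3: image embedding -> image
print(image.shape)                     # (256, 256, 3)
```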