French AI startup Mistral has showcased its first multimodal model, Pixtral 12B, which is capable of processing both images and text.
Pixtral 12B has 12 billion parameters and is built on Nemo 12B, Mistral's existing text-only model; it is designed for tasks such as captioning images, identifying objects, and answering questions about image content.
At roughly 24GB in size, Pixtral 12B is available for free under the Apache 2.0 license, which allows anyone to use, modify, and commercialize it, subject only to the license's minimal conditions such as attribution.
Developers can download the model from GitHub and Hugging Face, although functional web demos are not yet available.
Mistral’s head of developer relations has indicated that Pixtral 12B will soon be integrated into the company’s chatbot, Le Chat, and its API platform, La Plateforme.
Multimodal models like Pixtral 12B represent a new frontier in generative AI, building on the advancements made by tools such as OpenAI’s GPT-4 and Anthropic’s Claude.
However, there are concerns about the data sources used for training these models.
Like other AI companies, Mistral likely trained Pixtral 12B on large amounts of publicly available web data—a practice that has drawn lawsuits from copyright holders who dispute the “fair use” argument often cited by tech firms.
The release of Pixtral 12B comes after Mistral secured $645 million in funding, which has raised its valuation to $6 billion.
With support from backers like Microsoft, Mistral is positioning itself as a European counterpart to OpenAI.