Gemma 4, the newest iteration of multimodal models from Google DeepMind, has been released on Hugging Face, giving developers and researchers a versatile and powerful tool. The family supports text, image, and audio inputs while maintaining high performance and accessibility.
What’s New with Gemma 4?
Building on the foundations laid by its predecessors, Gemma 4 enhances its capabilities with improved architecture and functionality. It accepts image, text, and audio inputs and generates text responses from them. The text decoder is derived from the original Gemma model, now equipped to handle long context windows. Notably, the image encoder has been upgraded to accommodate variable aspect ratios and a configurable number of image tokens, letting users trade off speed, memory usage, and output quality.
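To see why the image-token budget matters, here is a rough, illustrative sketch of how patchifying and pooling an image determines how many vision tokens the text decoder must attend over. The patch size and pooling scheme below are hypothetical placeholders, not Gemma 4's actual values.

```python
# Illustrative only: patch size and pooling factor are assumptions,
# not Gemma 4's real configuration.

def image_token_count(width: int, height: int, patch: int = 14, pool: int = 4) -> int:
    """Number of image tokens after patchifying and pooling.

    Each patch x patch pixel block becomes one vision token; pooling then
    merges pool x pool blocks of tokens into one, shrinking the sequence
    the text decoder has to process.
    """
    tokens_w = -(-width // patch)    # ceil division
    tokens_h = -(-height // patch)
    pooled_w = -(-tokens_w // pool)
    pooled_h = -(-tokens_h // pool)
    return pooled_w * pooled_h

# A smaller token budget (coarser pooling) is faster and lighter on memory,
# but preserves less spatial detail:
print(image_token_count(896, 896, pool=2))  # finer grid, more tokens
print(image_token_count(896, 896, pool=4))  # coarser grid, fewer tokens
```

A configurable budget like this is what lets the same encoder serve both quick, low-detail queries and fine-grained tasks such as OCR.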
Architecture and Performance
Gemma 4 combines architectural elements from earlier versions and other open models, deliberately omitting complex features that could hinder performance or portability. This design choice keeps the model compatible with a wide range of libraries and devices, making it suitable for long-context and agentic applications. The 31B dense model achieves an impressive estimated LMArena score of 1452, while the 26B mixture-of-experts model reaches 1441 with only 4B active parameters.
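The appeal of the mixture-of-experts variant is that per-token generation cost scales roughly with *active* parameters, not total parameters. The back-of-the-envelope calculation below uses the common "about 2 FLOPs per parameter per token" rule of thumb; the parameter counts come from the article, but the figures are approximations, not measurements.

```python
# Rough decode-cost estimate: ~2 FLOPs per active parameter per generated
# token. This is a standard approximation, not a benchmarked number.

def decode_gflops_per_token(active_params_billions: float) -> float:
    """Approximate forward-pass GFLOPs per generated token."""
    return 2.0 * active_params_billions

dense = decode_gflops_per_token(31)  # 31B dense: every parameter is active
moe = decode_gflops_per_token(4)     # 26B MoE: only ~4B active per token
print(f"dense ~= {dense:.0f} GFLOPs/token, MoE ~= {moe:.0f} GFLOPs/token")
```

By this estimate the MoE model generates each token at a fraction of the dense model's compute cost while scoring nearly as well, which is the tradeoff the article's LMArena numbers highlight.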
Multimodal Capabilities
Gemma 4 excels in multimodal tasks, demonstrating proficiency in areas such as optical character recognition (OCR), speech-to-text conversion, and object detection. The model can also perform function calling, reasoning, and code generation. For instance, it can detect GUI elements and respond in JSON format, showcasing its ability to understand and process visual information effectively.
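A JSON detection response like the one described above is straightforward to consume downstream. The schema in this sketch (labels plus `box_2d` coordinates normalized to a 0-1000 grid) is an assumption for illustration; Gemma 4's actual output format may differ.

```python
import json

# Hypothetical example of a JSON-formatted GUI-detection response;
# the field names and 0-1000 coordinate convention are assumptions.
raw = """[
  {"label": "search_button", "box_2d": [120, 640, 180, 760]},
  {"label": "text_field",    "box_2d": [115, 80,  185, 620]}
]"""

def to_pixels(box, width, height, scale=1000):
    """Convert a [y0, x0, y1, x1] box normalized to a 0-scale grid
    into absolute pixel coordinates (x0, y0, x1, y1)."""
    y0, x0, y1, x1 = box
    return (x0 * width // scale, y0 * height // scale,
            x1 * width // scale, y1 * height // scale)

for det in json.loads(raw):
    print(det["label"], to_pixels(det["box_2d"], width=1280, height=720))
```

Structured output like this is what makes capabilities such as GUI-element detection usable by agents, which can click or type at the returned coordinates rather than re-parsing free-form text.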
Furthermore, smaller variants of Gemma 4 can process videos with audio, while larger models can analyze videos without audio. This flexibility allows for a wide range of applications, from analyzing live performances to generating HTML code based on visual prompts.
With its robust capabilities and open-access model, Gemma 4 is poised to become a significant asset in the toolkit of AI developers and researchers, inviting exploration and innovation across various fields.
This article was produced by NeonPulse.today using human and AI-assisted editorial processes, based on publicly available information. Content may be edited for clarity and style.