We're bringing vision-language capabilities to the NetMind Model Library with two new additions: Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct.
Qwen3-VL is the most capable vision-language model in the Qwen series to date. It is optimised for long-context inputs such as books and long videos, offers improved spatial understanding for both 2D and 3D grounding, and expands OCR support to 32 languages. Available in scalable Dense and MoE architectures, the family is designed for flexible deployment from edge to cloud.
Together, they bring powerful multimodal AI to edge devices and single-GPU workstations, from document intelligence to GUI automation to visual reasoning.
Qwen3-VL-4B-Instruct
Qwen3-VL-8B-Instruct
Both models integrate seamlessly into the NetMind API and can be called with just a few lines of code, as shown in the sketch below.
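As a rough illustration, here is what a vision request might look like against an OpenAI-compatible chat completions endpoint. The base URL, environment variable, image URL, and exact model identifier below are assumptions for the sketch; check the NetMind API documentation for the authoritative values.

```python
# Minimal sketch: calling Qwen3-VL-8B-Instruct through an OpenAI-compatible
# chat completions endpoint. The base_url, env var, image URL, and model
# identifier are assumptions; consult the NetMind API docs for real values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.netmind.ai/inference-api/openai/v1",  # assumed endpoint
    api_key=os.environ["NETMIND_API_KEY"],                      # assumed env var
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                # Image input plus a text instruction in one message
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/invoice.png"}},
                {"type": "text",
                 "text": "Extract the invoice number, date, and total amount."},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```

The same request shape works for the 4B model by swapping the model identifier, which is useful when prototyping on a workstation before scaling up.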
Whether you're creating GUI automation agents that navigate interfaces, developing educational tools for STEM problem-solving with diagrams, or deploying multimodal chatbots that understand both text and images, these models provide the vision-language capabilities to match your requirements.
With the addition of Qwen3-VL-4B-Instruct and Qwen3-VL-8B-Instruct, the NetMind Model Library now offers more comprehensive multimodal AI capabilities, from cost-effective edge deployment to advanced visual reasoning, from document intelligence to autonomous agent operations.
If you build something with these models, we want to see it. Join the discussion in our Reddit community.