Meta Releases Llama 4: Open-Weight AI With a 10 Million Token Context Window
Meta launched Llama 4, its largest open-weight model family, featuring a 10 million token context window in the Scout variant and multimodal capability across the lineup. It's the most significant open-weight AI release of 2025.
Meta released the Llama 4 model family in April 2025: its most capable open-weight release to date, carrying the largest context window available from any major AI provider.
The Llama 4 Family
Meta released three models simultaneously:
Llama 4 Scout: 17 billion active parameters (16 experts, roughly 109 billion total), mixture-of-experts architecture, 10 million token context window. The context window is the headline number: ten times larger than Gemini's 1 million token maximum, and roughly 80 times larger than GPT-4o's 128,000-token limit. Scout is designed for tasks requiring extremely long context: entire codebases, multi-year document archives, large-scale data analysis. It runs on a single H100 GPU (with int4 quantisation), making it deployable on standard enterprise hardware.
Llama 4 Maverick: 17 billion active parameters (128 experts, roughly 400 billion total), 1 million token context, optimised for instruction-following and conversational tasks. Meta claims Maverick matches or exceeds GPT-4o and Gemini 2.0 Flash on standard benchmarks while being open-weight. Independent benchmark verification is ongoing.
Llama 4 Behemoth: Not yet released publicly. Meta describes it as a 288 billion active-parameter model (roughly two trillion total) used to train Scout and Maverick through knowledge distillation. Behemoth is Meta's frontier research model; commercial availability has not been announced.
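Knowledge distillation trains the smaller models to match the larger teacher's output distribution rather than just the training labels. Meta has not published its exact loss function, so the sketch below shows the classic soft-target formulation (KL divergence on temperature-softened logits), not Meta's recipe; all logits here are made-up toy values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)   # student's predictions
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.0, 1.5, 0.2]])
print(distill_loss(student, teacher))  # positive while distributions differ
```

The student minimises this loss alongside the ordinary cross-entropy on labels; the soft targets carry information about which wrong answers the teacher considers nearly right.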
All Llama 4 models are natively multimodal — they process text and images in the same call, without requiring separate vision components.
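All three models share the mixture-of-experts design, in which a learned router sends each token to a small subset of expert networks, so only a fraction of total parameters is active per token. Meta has not published its router internals; the following is a toy top-1 routing sketch in NumPy, with made-up dimensions, to illustrate the select-then-combine mechanism only.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=1):
    """Route each token to its top-k experts and combine their outputs.

    x: (tokens, dim) activations; gate_w: (dim, n_experts) router weights;
    experts: list of (dim, dim) weight matrices, one per expert.
    """
    logits = x @ gate_w                              # router score per expert
    top = np.argsort(-logits, axis=1)[:, :top_k]     # chosen expert indices
    sel = np.take_along_axis(logits, top, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True)) # softmax over chosen only
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for i, (idx_row, w_row) in enumerate(zip(top, w)):
        for idx, weight in zip(idx_row, w_row):
            out[i] += weight * (x[i] @ experts[idx]) # only chosen experts run
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gate = rng.standard_normal((8, 16))                  # 16 experts, as in Scout
experts = [rng.standard_normal((8, 8)) for _ in range(16)]
y = moe_forward(x, gate, experts)
print(y.shape)  # (4, 8)
```

In a real transformer the experts are full feed-forward blocks inside each layer; the point of the pattern is that compute per token scales with active parameters (17 billion) rather than total parameters.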
Why Open-Weight Matters
Llama 4 is open-weight: Meta publishes the model weights for anyone to download and run, though the Llama license carries usage restrictions, which is why "open-weight" is more precise than "open-source". This is the fundamental difference from OpenAI, Anthropic, and Google, which offer their frontier models through API access only.
Open-weight deployment means:
Data stays on your infrastructure. Queries processed by a locally deployed Llama 4 model never leave your environment. For enterprises with data residency requirements, regulated data, or confidentiality concerns about sending queries to external APIs, this is the enabling capability.
No per-token cost at inference. Once the model is deployed on owned hardware, inference cost is hardware cost only — no API fees per call. For extremely high-volume workloads, the economics of local deployment can be substantially lower than API pricing.
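To make "hardware cost only" concrete, here is a back-of-envelope break-even sketch. Every figure in it (GPU rental price, sustained throughput, API rate) is a hypothetical placeholder, not a quoted price; the point is the shape of the comparison, not the numbers.

```python
# Break-even sketch: self-hosted GPU vs per-token API pricing.
# All values below are illustrative assumptions, not real quotes.
gpu_cost_per_hour = 3.50      # assumed hourly H100 rental, USD
tokens_per_second = 2_000     # assumed sustained inference throughput
api_price_per_mtok = 1.00     # assumed API price per million tokens

tokens_per_hour = tokens_per_second * 3600
self_hosted_per_mtok = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"self-hosted: ${self_hosted_per_mtok:.3f} per million tokens")
print("local cheaper:", self_hosted_per_mtok < api_price_per_mtok)
```

The comparison only favours self-hosting when the GPU stays busy; at low utilisation the fixed hourly cost is spread over few tokens and the API wins.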
Full control over the model. Organisations can fine-tune Llama 4 on proprietary data, modify the system prompt and safety behaviour, and deploy at whatever scale their infrastructure supports.
The tradeoff: operating a frontier-class model requires GPU infrastructure and engineering capability that many organisations do not have. Llama 4 Scout's single-H100 efficiency reduces this barrier significantly compared to previous large open-weight models.
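A quick footprint estimate shows why Scout's single-H100 claim is plausible. The 109 billion total-parameter figure is Meta's reported size for Scout; the bytes-per-parameter values are the standard storage costs for each precision, and runtime overheads (KV cache, activations) are ignored for simplicity.

```python
# Rough VRAM estimate for Llama 4 Scout's weights at common precisions.
# 109B total parameters is Meta's reported figure; overheads ignored.
total_params = 109e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gb = total_params * nbytes / 1e9
    print(f"{fmt}: ~{gb:.0f} GB of weights (H100 has 80 GB)")
```

Only the quantised int4 weights (~55 GB) fit in a single H100's 80 GB; at fp16 the model needs multiple GPUs, which is consistent with Meta positioning single-GPU Scout as a quantised deployment.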
The 10 Million Token Context in Practice
Scout's 10 million token context is the most practically significant capability in the release. To understand what this enables: 10 million tokens is approximately 7.5 million words — the equivalent of roughly 50 full-length novels, or the complete source code of a large enterprise application, or several years of company Slack messages.
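The conversion above can be reproduced with common rule-of-thumb ratios: roughly 0.75 English words per token, with an assumed average novel length of 150,000 words. Both ratios are heuristics, not exact tokenizer behaviour.

```python
# Rule-of-thumb conversions for a 10 million token context window.
# ~0.75 words per token is a common English-text heuristic;
# 150,000 words is an assumed full-length-novel average.
CONTEXT_TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.75
AVG_NOVEL_WORDS = 150_000

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
novels = words / AVG_NOVEL_WORDS
print(f"~{words / 1e6:.1f} million words, roughly {novels:.0f} novels")
```

Actual token counts vary with the tokenizer and the text: code and non-English prose typically consume more tokens per word than English prose.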
Applications that become possible with this context length:
- Processing an entire legal document archive in a single query
- Analysing a company's complete customer support history without sampling
- Reviewing a large codebase for security vulnerabilities in one pass
- Summarising all research published in a field over a multi-year period
Whether these use cases work well in practice — whether the model maintains accuracy and attention across the full 10 million token context — is being tested in production deployments now. Context window claims and production performance at the extremes of that window frequently diverge. Independent evaluation results will be available within weeks.
Competitive Implications
Llama 4's release compresses the capability advantage that proprietary model providers have used to justify API pricing. If Maverick genuinely matches GPT-4o on standard tasks and Scout's context performance holds up in production, enterprises have a credible open-weight alternative for the first time at frontier capability levels.
For OpenAI, Anthropic, and Google, the response is predictable: accelerate capability improvements, reduce pricing on tiers where open-weight competition is most direct, and emphasise the safety, reliability, and support advantages of managed API services over self-hosted deployment.
For enterprises, the Llama 4 release creates a genuine evaluation question: for workloads where data residency, cost at scale, or customisation matter, the open-weight route is now worth calculating seriously for the first time.