Meta Releases Llama 4: Open-Weight AI With a 10 Million Token Context Window
Meta launched Llama 4, its largest open-weight model family, featuring a 10 million token context window in the Scout variant and multimodal capability across the lineup. It's the most significant open-weight AI release of 2025.
Meta released the Llama 4 model family in April 2025: its most capable open-weight release to date, carrying the largest context window available from any major AI provider.
The Llama 4 Family
Meta released three models simultaneously:
Llama 4 Scout: 17 billion active parameters (16 experts, roughly 109 billion total), mixture-of-experts architecture, 10 million token context window. The context window is the headline number: ten times larger than Gemini's 1 million token maximum, and roughly 80 times larger than GPT-4o's 128,000-token limit. Scout is designed for tasks requiring extremely long context: entire codebases, multi-year document archives, large-scale data analysis. It runs on a single H100 GPU (with int4 quantisation), making it deployable on standard enterprise hardware.
Llama 4 Maverick: 17 billion active parameters (128 experts, roughly 400 billion total), 1 million token context, optimised for instruction-following and conversational tasks. Meta claims Maverick matches or exceeds GPT-4o and Gemini 2.0 Flash on standard benchmarks while being open-weight. Independent benchmark verification is ongoing.
Llama 4 Behemoth: Not yet released publicly. Meta describes it as a 288 billion active-parameter model (roughly two trillion total) used to train Scout and Maverick through knowledge distillation. Behemoth is Meta's frontier research model; commercial availability has not been announced.
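Knowledge distillation trains the smaller models to match the larger teacher's output distribution rather than just the training labels. Meta has not published its exact loss function, so the sketch below shows the classic soft-target formulation (KL divergence on temperature-softened logits), not Meta's recipe; all logits here are made-up toy values.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)   # teacher's soft targets
    q = softmax(student_logits, T)   # student's predictions
    return (T * T) * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.5]])
student = np.array([[3.0, 1.5, 0.2]])
print(distill_loss(student, teacher))  # positive while distributions differ
```

The student minimises this loss alongside the ordinary cross-entropy on labels; the soft targets carry information about which wrong answers the teacher considers nearly right.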
All Llama 4 models are natively multimodal — they process text and images in the same call, without requiring separate vision components.
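All three models share the mixture-of-experts design, in which a learned router sends each token to a small subset of expert networks, so only a fraction of total parameters is active per token. Meta has not published its router internals; the following is a toy top-1 routing sketch in NumPy, with made-up dimensions, to illustrate the select-then-combine mechanism only.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=1):
    """Route each token to its top-k experts and combine their outputs.

    x: (tokens, dim) activations; gate_w: (dim, n_experts) router weights;
    experts: list of (dim, dim) weight matrices, one per expert.
    """
    logits = x @ gate_w                              # router score per expert
    top = np.argsort(-logits, axis=1)[:, :top_k]     # chosen expert indices
    sel = np.take_along_axis(logits, top, axis=1)
    w = np.exp(sel - sel.max(axis=1, keepdims=True)) # softmax over chosen only
    w /= w.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for i, (idx_row, w_row) in enumerate(zip(top, w)):
        for idx, weight in zip(idx_row, w_row):
            out[i] += weight * (x[i] @ experts[idx]) # only chosen experts run
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
gate = rng.standard_normal((8, 16))                  # 16 experts, as in Scout
experts = [rng.standard_normal((8, 8)) for _ in range(16)]
y = moe_forward(x, gate, experts)
print(y.shape)  # (4, 8)
```

In a real transformer the experts are full feed-forward blocks inside each layer; the point of the pattern is that compute per token scales with active parameters (17 billion) rather than total parameters.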
Why Open-Weight Matters
Llama 4 is open-weight: Meta publishes the model weights for anyone to download and run, though the Llama license carries usage restrictions, which is why "open-weight" is more precise than "open-source". This is the fundamental difference from OpenAI, Anthropic, and Google, which offer their frontier models through API access only.
Open-weight deployment means:
Data stays on your infrastructure. Queries processed by a locally deployed Llama 4 model never leave your environment. For enterprises with data residency requirements, regulated data, or confidentiality concerns about sending queries to external APIs, this is the enabling capability.
No per-token cost at inference. Once the model is deployed on owned hardware, inference cost is hardware cost only — no API fees per call. For extremely high-volume workloads, the economics of local deployment can be substantially lower than API pricing.
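To make "hardware cost only" concrete, here is a back-of-envelope break-even sketch. Every figure in it (GPU rental price, sustained throughput, API rate) is a hypothetical placeholder, not a quoted price; the point is the shape of the comparison, not the numbers.

```python
# Break-even sketch: self-hosted GPU vs per-token API pricing.
# All values below are illustrative assumptions, not real quotes.
gpu_cost_per_hour = 3.50      # assumed hourly H100 rental, USD
tokens_per_second = 2_000     # assumed sustained inference throughput
api_price_per_mtok = 1.00     # assumed API price per million tokens

tokens_per_hour = tokens_per_second * 3600
self_hosted_per_mtok = gpu_cost_per_hour / (tokens_per_hour / 1e6)
print(f"self-hosted: ${self_hosted_per_mtok:.3f} per million tokens")
print("local cheaper:", self_hosted_per_mtok < api_price_per_mtok)
```

The comparison only favours self-hosting when the GPU stays busy; at low utilisation the fixed hourly cost is spread over few tokens and the API wins.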
Full control over the model. Organisations can fine-tune Llama 4 on proprietary data, modify the system prompt and safety behaviour, and deploy at whatever scale their infrastructure supports.
The tradeoff: operating a frontier-class model requires GPU infrastructure and engineering capability that many organisations do not have. Llama 4 Scout's single-H100 efficiency reduces this barrier significantly compared to previous large open-weight models.
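A quick footprint estimate shows why Scout's single-H100 claim is plausible. The 109 billion total-parameter figure is Meta's reported size for Scout; the bytes-per-parameter values are the standard storage costs for each precision, and runtime overheads (KV cache, activations) are ignored for simplicity.

```python
# Rough VRAM estimate for Llama 4 Scout's weights at common precisions.
# 109B total parameters is Meta's reported figure; overheads ignored.
total_params = 109e9
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    gb = total_params * nbytes / 1e9
    print(f"{fmt}: ~{gb:.0f} GB of weights (H100 has 80 GB)")
```

Only the quantised int4 weights (~55 GB) fit in a single H100's 80 GB; at fp16 the model needs multiple GPUs, which is consistent with Meta positioning single-GPU Scout as a quantised deployment.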
The 10 Million Token Context in Practice
Scout's 10 million token context is the most practically significant capability in the release. To understand what this enables: 10 million tokens is approximately 7.5 million words — the equivalent of roughly 50 full-length novels, or the complete source code of a large enterprise application, or several years of company Slack messages.
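The conversion above can be reproduced with common rule-of-thumb ratios: roughly 0.75 English words per token, with an assumed average novel length of 150,000 words. Both ratios are heuristics, not exact tokenizer behaviour.

```python
# Rule-of-thumb conversions for a 10 million token context window.
# ~0.75 words per token is a common English-text heuristic;
# 150,000 words is an assumed full-length-novel average.
CONTEXT_TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.75
AVG_NOVEL_WORDS = 150_000

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
novels = words / AVG_NOVEL_WORDS
print(f"~{words / 1e6:.1f} million words, roughly {novels:.0f} novels")
```

Actual token counts vary with the tokenizer and the text: code and non-English prose typically consume more tokens per word than English prose.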
Applications that become possible with this context length:
- Processing an entire legal document archive in a single query
- Analysing a company's complete customer support history without sampling
- Reviewing a large codebase for security vulnerabilities in one pass
- Summarising all research published in a field over a multi-year period
Whether these use cases work well in practice — whether the model maintains accuracy and attention across the full 10 million token context — is being tested in production deployments now. Context window claims and production performance at the extremes of that window frequently diverge. Independent evaluation results will be available within weeks.
Competitive Implications
Llama 4's release compresses the capability advantage that proprietary model providers have used to justify API pricing. If Maverick genuinely matches GPT-4o on standard tasks and Scout's context performance holds up in production, enterprises have a credible open-weight alternative for the first time at frontier capability levels.
For OpenAI, Anthropic, and Google, the response is predictable: accelerate capability improvements, reduce pricing on tiers where open-weight competition is most direct, and emphasise the safety, reliability, and support advantages of managed API services over self-hosted deployment.
For enterprises, the Llama 4 release creates a genuine evaluation question: for workloads where data residency, cost at scale, or customisation matter, the open-weight route is now worth calculating seriously for the first time.