The Open Model Landscape

What 'open' actually means

The word "open" in AI is doing a lot of heavy lifting, and most of it is marketing. Before evaluating specific models, you need to understand the spectrum of openness, because it directly affects what you can legally deploy in your enterprise.

Fully open source (Apache 2.0, MIT). You can use the model for any purpose, modify it, distribute it, and build commercial products with it. No usage restrictions, no reporting requirements, no revenue thresholds. The main obligation is to keep the licence text and attribution notices when you redistribute. Gemma 4 (Apache 2.0) and Mistral's Apache-licensed models fall here. This is the cleanest option for enterprise deployment.

Community licence with restrictions (Llama licence, earlier Qwen licences). Meta's Llama models use a custom licence that permits commercial use but imposes conditions: if your products have more than 700 million monthly active users, you need a separate licence from Meta. Very few organisations will ever reach that threshold, so for most enterprises the real cost lies elsewhere: a custom licence means your legal team needs to review and approve it, which can add weeks to deployment timelines.

Open weights, restricted use. Some models release weights but restrict how they may be used. Typical examples are research-only or non-commercial terms, and use-based restrictions that prohibit deployment in certain industries or for certain applications. Always read the full licence, not just the headline.

Gated access. Some "open" models require you to accept terms and request access through Hugging Face or the provider's portal before downloading. This is an administrative speed bump, not a legal restriction, but it matters for automated deployment pipelines.

For enterprise deployment, Apache 2.0 is the gold standard. Your legal team signs off once, and every team in the organisation can deploy without additional review. Any other licence creates friction proportional to the number of teams that want to use the model.
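The licence categories above can be turned into a simple pre-deployment gate. The following is a minimal sketch, not a compliance tool: the `PRE_APPROVED` set, the `ModelCandidate` type, the licence identifiers, and the `gated` flags are all hypothetical and would come from your own model inventory and legal policy.

```python
from dataclasses import dataclass

# Hypothetical policy: licence identifiers that legal has already signed off on.
PRE_APPROVED = {"apache-2.0", "mit"}

@dataclass(frozen=True)
class ModelCandidate:
    name: str
    licence: str          # identifier as recorded in your model inventory
    gated: bool = False   # True if download requires accepting terms first

def deployment_checks(model: ModelCandidate) -> list:
    """Return the outstanding steps before a team may deploy this model."""
    steps = []
    if model.licence.lower() not in PRE_APPROVED:
        steps.append(f"legal review of '{model.licence}' licence")
    if model.gated:
        steps.append("accept provider terms and provision an access token for CI")
    return steps

# Illustrative entries only; record the real licence and gating status yourself.
candidates = [
    ModelCandidate("example-apache-model", "apache-2.0"),
    ModelCandidate("example-community-model", "custom-community", gated=True),
]
for m in candidates:
    print(m.name, "->", deployment_checks(m) or "cleared")
```

The point of the design is that a pre-approved licence returns an empty list, so the common case costs teams nothing, while anything else produces an explicit to-do list.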

Your legal team has approved Apache 2.0 models for unrestricted internal use. A team wants to deploy Llama 3.1 8B instead because it scores higher on a benchmark they care about. What is the practical implication?

Gemma 4: the Apache 2.0 frontrunner

Google's Gemma 4 family is the most significant development for enterprise edge AI, primarily because of its licence. Apache 2.0 with no usage restrictions means any organisation can deploy it for any purpose without legal review beyond confirming the licence.

The Gemma 4 lineup relevant to edge deployment:

Gemma 4 E2B (2B effective parameters, MoE architecture)

Total parameters: higher due to MoE, but only ~2B active per token
Quantised size: 1.5-2GB at INT4 (Q4_K_M)
Runs in: browser tabs via WebGPU, phones, Raspberry Pi-class devices
Quality: remarkably capable for summarisation, extraction, classification, simple Q&A
Limitation: struggles with complex multi-step reasoning, long-form generation quality drops

Gemma 4 E4B (4B effective parameters, MoE architecture)

Quantised size: 3-4GB at INT4
Runs in: browsers (needs 6GB+ GPU VRAM), laptops, phones with 6GB+ RAM
Quality: significant step up from E2B on reasoning tasks, handles most enterprise tasks well
Sweet spot: the best quality-to-size ratio for client-side deployment on modern hardware

Gemma 4 12B

Quantised size: 7-8GB at INT4
Runs in: laptops with discrete GPUs, desktops, on-premises servers
Quality: strong across all standard enterprise tasks

Gemma 4 27B

Quantised size: 15-17GB at INT4
Runs in: workstations with 24GB+ GPU VRAM, on-premises servers
Quality: competitive with significantly larger proprietary models on most benchmarks
The go-to choice for on-premises deployment where you have GPU budget
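To make the lineup easier to apply, here is a rough sketch that picks the largest variant for a given memory budget. It uses the upper end of each INT4 size range quoted above. The `headroom` factor of 1.4 (room for the KV cache, activations, and runtime overhead) is an assumption chosen to roughly match the hardware guidance in this section, not a measured value.

```python
# Upper bound of each INT4 size range quoted above, in GB.
GEMMA4_INT4_GB = {
    "E2B": 2.0,
    "E4B": 4.0,
    "12B": 8.0,
    "27B": 17.0,
}

def largest_model_that_fits(available_gb, headroom=1.4):
    """Pick the largest listed model whose weights, scaled by a headroom
    factor, fit in the available memory. Returns None if nothing fits."""
    fitting = [
        (size_gb, name)
        for name, size_gb in GEMMA4_INT4_GB.items()
        if size_gb * headroom <= available_gb
    ]
    return max(fitting)[1] if fitting else None

for budget in (3, 6, 12, 24):
    print(f"{budget} GB ->", largest_model_that_fits(budget))
```

Treat the result as a starting point for testing on real hardware: context length and batch size change the memory needed well beyond the weights themselves.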

The E2B and E4B models use a Mixture-of-Experts (MoE) architecture, which is why their effective parameter count is lower than their total parameter count. MoE models activate only a subset of their parameters for each token, which reduces the compute needed per token during inference while maintaining quality that punches above the active parameter count. The full set of weights still has to be stored and loaded, so the memory footprint tracks the total parameter count rather than the active one.
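A back-of-the-envelope calculation helps separate the costs that scale with active parameters from those that scale with total parameters. This sketch rests on stated assumptions: the 4B-total, 2B-active model is hypothetical (the text does not give total counts), 4.5 bits per weight is an approximation for 4-bit quantisation including scale metadata, and 2 FLOPs per active parameter per token is the usual rough estimate for a forward pass.

```python
def weight_memory_gb(total_params_b, bits_per_weight):
    """Memory for the weights alone. Every expert is stored, active or not,
    so this depends on the TOTAL parameter count (in billions)."""
    return total_params_b * bits_per_weight / 8  # billions * bytes each = GB

def gflops_per_token(active_params_b):
    """Rough forward-pass cost: about 2 FLOPs per ACTIVE parameter per token."""
    return 2 * active_params_b

# Hypothetical MoE model: 4B total parameters, 2B active per token.
moe_mem = weight_memory_gb(total_params_b=4.0, bits_per_weight=4.5)
moe_cost = gflops_per_token(active_params_b=2.0)
dense_cost = gflops_per_token(active_params_b=4.0)  # dense model, same total size

print(f"weights: {moe_mem:.2f} GB")
print(f"per-token compute vs dense model of equal size: {moe_cost / dense_cost:.0%}")
```

Under these assumptions the MoE model needs the storage of a 4B model but roughly the per-token compute of a 2B model.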

The rest of the field

Meta Llama 3.x family

Llama 3.1 and 3.2 remain widely used, with models at 1B, 3B, 8B, and 70B parameter sizes. The 8B model is a strong general-purpose choice for on-premises deployment.

Key considerations for enterprise:

Custom Llama licence, not Apache 2.0. Requires legal review.
700M MAU threshold for commercial use without separate agreement.
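Conditions on using outputs to train other models: Llama 3 prohibited it, while Llama 3.1 and later permit it subject to naming and attribution requirements. Check the licence version that applies to the model you deploy.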
Very large community and ecosystem -- more fine-tuned variants available than any other family.
Llama 3.2 1B and 3B are specifically designed for edge deployment on mobile and IoT devices.

Mistral family

Mistral offers models under both Apache 2.0 and proprietary licences. For edge AI:

Mistral 7B (Apache 2.0): mature, well-tested, but increasingly outperformed by newer models.
Devstral (built on Mistral Small): strong coding model, available under Apache 2.0.
Mixtral 8x7B (Apache 2.0): MoE architecture, excellent quality, but at ~25GB quantised it requires server-class hardware.

Microsoft Phi-4 family

Microsoft's small model line is specifically designed for edge deployment:

Phi-4 Mini (3.8B): MIT licence. Strong reasoning for its size, particularly good at maths and structured tasks.
Phi-4 Multimodal: handles text and images, useful for document understanding workflows.
Phi models tend to excel at structured, analytical tasks but can feel less natural for conversational use cases.

Alibaba Qwen 3 family

Qwen 3 offers models from 0.6B to 235B under Apache 2.0:

Qwen 3 4B: competitive with Gemma 4 E4B, strong multilingual support (particularly CJK languages).
Qwen 3 8B: excellent general-purpose model for on-premises deployment.
Qwen 3 32B: dense architecture (the MoE variants in the family are 30B-A3B and 235B-A22B), strong reasoning, competitive with much larger models.
Apache 2.0 licence makes it enterprise-friendly.
Particularly strong choice if your organisation works extensively with Chinese, Japanese, or Korean text.

Your enterprise operates globally with significant business in Japan and South Korea. You need a small model for on-device deployment that handles multilingual document summarisation. Which model family should you evaluate first?

Benchmarks that matter vs benchmarks that don't

The AI community has a benchmarking problem. Most public benchmarks measure capabilities that do not map directly to enterprise workloads.

Benchmarks that are less useful for enterprise evaluation:

MMLU (Massive Multitask Language Understanding): Tests broad academic knowledge across 57 subjects. Tells you if a model knows undergraduate-level facts. Does not tell you if it can summarise your internal documents accurately.
HumanEval/MBPP: Measures code generation on isolated algorithmic problems. Does not tell you if the model can work with your specific codebase, frameworks, and patterns.
HellaSwag: Tests commonsense reasoning through sentence completion. Fun, but irrelevant to whether the model can extract key dates from a contract.

Benchmarks that are more useful:

MT-Bench: Multi-turn conversation quality. Closer to how enterprise users actually interact with AI.
RULER/NIAH (Needle in a Haystack): Tests long-context retrieval accuracy. Matters for RAG and document analysis workloads.
IFEval (Instruction Following Evaluation): Tests whether the model follows specific formatting and constraint instructions. Critical for enterprise workflows where output format matters.
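To give a sense of what instruction-following checks look like in practice, here is a minimal sketch in the spirit of IFEval. It is not the IFEval implementation. The check functions, the required keys, and the sample outputs are hypothetical stand-ins for constraints your own workflows impose.

```python
import json

def check_word_limit(output, max_words):
    """Pass if the output stays within a word budget."""
    return len(output.split()) <= max_words

def check_json_keys(output, required_keys):
    """Pass if the output is a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)

def compliance_rate(outputs, checks):
    """Fraction of outputs that satisfy every check."""
    if not outputs:
        raise ValueError("need at least one output to score")
    passed = sum(all(check(o) for check in checks) for o in outputs)
    return passed / len(outputs)

# Hypothetical model outputs for a contract-extraction prompt.
outputs = [
    '{"party": "Acme Ltd", "effective_date": "2024-03-01"}',
    "The effective date is 1 March 2024.",
]
checks = [
    lambda o: check_json_keys(o, {"party", "effective_date"}),
    lambda o: check_word_limit(o, 50),
]
print(compliance_rate(outputs, checks))  # 0.5
```

Because these checks are deterministic, they can run on every candidate model without human graders.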

The only benchmark that truly matters: your own evaluation.

Create a test set of 50-100 representative tasks from your actual workloads. Run every candidate model against this set. Score outputs on:

Accuracy: Is the information correct?
Completeness: Did it address all parts of the request?
Format compliance: Did it follow output format instructions?
Hallucination rate: Did it invent information not in the source material?
Language quality: Is the output well-written and professional?

A model that scores 5 points lower on MMLU but 15 points higher on your internal evaluation set is the better model for your organisation. Full stop.
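The rubric above can be sketched as a small scoring harness. This is one possible shape, under assumptions: each task output is rated from 0 to 1 on every criterion (by a human reviewer or an automated check), the criterion names are shorthand for the five listed above, and the weights and sample ratings are hypothetical.

```python
from statistics import mean

# Shorthand names for the five scoring criteria listed above.
CRITERIA = ["accuracy", "completeness", "format", "no_hallucination", "language"]

def task_score(rating, weights):
    """Weighted average of one task's per-criterion ratings (each 0 to 1)."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(rating[c] * weights[c] for c in CRITERIA) / total_weight

def model_score(ratings, weights=None):
    """Mean task score across the whole test set for one model."""
    weights = weights or {c: 1.0 for c in CRITERIA}
    return mean(task_score(r, weights) for r in ratings)

# Hypothetical ratings for two tasks from one candidate model.
ratings = [
    {"accuracy": 1.0, "completeness": 1.0, "format": 1.0,
     "no_hallucination": 1.0, "language": 0.5},
    {"accuracy": 0.5, "completeness": 1.0, "format": 0.0,
     "no_hallucination": 1.0, "language": 1.0},
]

# Example policy choice: count hallucination twice as heavily as the rest.
weights = {c: 1.0 for c in CRITERIA}
weights["no_hallucination"] = 2.0

print(f"unweighted: {model_score(ratings):.2f}")
print(f"weighted:   {model_score(ratings, weights):.2f}")
```

Run the same test set and weights against every candidate so the scores are directly comparable.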

You are evaluating Gemma 4 27B vs Llama 3.1 70B for an on-premises document summarisation deployment. Llama 3.1 70B scores higher on MMLU. Gemma 4 27B requires one GPU; Llama 3.1 70B requires four. How should you decide?

Why Apache 2.0 matters for enterprise deployment

We have covered the licence landscape, but this point deserves its own emphasis because it has outsized practical impact on enterprise deployment velocity.

When you standardise on Apache 2.0 models, something operationally powerful happens: legal review becomes a one-time event.

Your legal team reviews the Apache 2.0 licence once. They confirm it permits commercial use, modification, distribution, and embedding in proprietary products, with no usage restrictions beyond retaining licence and attribution notices. They issue a blanket approval. From that point forward, any team in the organisation can deploy any Apache 2.0 model -- Gemma 4, Qwen 3, Mistral 7B, any Apache 2.0 fine-tune -- without going back to legal.

Compare this with a portfolio approach where different teams use Llama (custom licence), Gemma (Apache 2.0), Mistral Large (proprietary licence), and various fine-tunes with unknown licence provenance. Each model requires its own legal review. Each fine-tune's training data provenance needs verification. Each deployment needs licence compliance monitoring.

At a 500-person engineering organisation, this overhead is real. One enterprise we worked with estimated that legal review of AI model licences was consuming 80+ hours of in-house counsel time per quarter, because different teams kept selecting models with different licensing terms.

The Apache 2.0 standardisation decision is not about which model is 2% better on a benchmark. It is about reducing licence compliance overhead to near zero so that teams can move at the speed of engineering, not the speed of legal review.

This is a major reason why the Gemma 4 and Qwen 3 families are particularly attractive for enterprise edge AI. Both offer competitive models across the size spectrum under Apache 2.0. With Gemma 4, for example, you can use E2B in the browser, E4B on phones, 12B on laptops, and 27B on-premises, all under a single approved licence.

Module 2 -- Final Assessment

What is the key difference between Meta's Llama licence and Apache 2.0 that matters most for enterprise deployment?

Why do Gemma 4's E2B and E4B models have lower effective parameter counts than their total parameter counts?

Your internal evaluation shows Model A scores 72% on your task set with 1 GPU required, and Model B scores 76% with 4 GPUs required. Model B scores 8 points higher on MMLU. What is the most defensible deployment decision?

An enterprise standardises on Apache 2.0 models exclusively. What operational benefit does this provide beyond legal risk reduction?