For UAE and broader GCC AI workloads operating on Arabic data, the language model layer reached an inflection point in late 2025. Open-weight Arabic-capable models — Falcon-H1 Arabic from TII and Jais 2 from G42's Inception — moved from research curiosity to production quality. By May 2026 they're the default choice for any sovereign deployment processing Arabic content.
This post is the side-by-side comparison we use during the discovery sprint of Sovereign AI engagements. It's written for the architect or AI engineer choosing the model layer for an Arabic-heavy workload.
Why this question matters more in 2026
Three trends converged.
The CBUAE Sovereign Financial Cloud (launched February 2026) made data residency a hard constraint for in-scope workloads. Calling Claude or GPT for Arabic banking conversations now triggers compliance friction even where it didn't before.
The Dubai agentic AI mandate added documentation requirements that favour models you can fully audit — open-weight, deployed on your infrastructure, with traceable provenance.
The Arabic-LLM quality bar crossed the production threshold. Earlier Arabic-capable models (Jais 1, AceGPT) had quality gaps that pushed sophisticated workloads back to GPT-4 with translation layers. Falcon-H1 Arabic and Jais 2 closed enough of that gap to be the right answer for most cases.
The benchmark picture
We benchmark both models against client-specific corpora during discovery. The pattern across 2026 engagements:
| Task type | Falcon-H1 Arabic 70B | Jais 2 70B | Claude Sonnet 4.6 |
|---|---|---|---|
| MSA Q&A on retrieved context | Strong | Strong | Strong |
| Gulf dialect customer support | Strongest | Strong | Adequate |
| Egyptian/Levantine dialect | Strong | Adequate | Adequate |
| Saudi-context factual recall | Strong | Strongest | Adequate |
| Code generation (English) | Adequate | Adequate | Strongest |
| Code generation (Arabic context) | Strong | Strong | Adequate |
| Math reasoning | Adequate | Adequate | Strongest |
| Structured output (JSON) | Strong | Strong | Strongest |
| English-only tasks | Adequate | Adequate | Strongest |
| Arabic-English code-switching | Strong | Strong | Adequate |
For Arabic-content-heavy workloads, the Arabic models match or beat Claude Sonnet on most relevant tasks. The places they don't match — code, math, English-only — are usually addressable by hybrid routing.
The hybrid pattern
The architecture we ship by default for GCC clients:
- Detect language at request entry (lightweight classifier — ~50ms latency).
- Arabic input → route to Falcon-H1 Arabic or Jais 2 on sovereign infrastructure.
- English input → route to Claude/GPT through Vercel AI Gateway (subject to data classification — for regulated data, English routes to a sovereign English model like Llama 3.3 instead).
- Code-switched input → route to whichever Arabic LLM benchmarked stronger for the use case.
Routing decisions happen at the application layer. From the user's perspective, it's one interface; from the architecture's perspective, the right model handles each request.
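The routing steps above can be sketched in a few lines. The endpoint names, the script-ratio thresholds, and the regex heuristic are illustrative placeholders; a production deployment would swap in a trained language classifier and your actual model identifiers:

```python
import re

# Hypothetical model identifiers -- substitute your real deployments.
SOVEREIGN_ARABIC = "falcon-h1-arabic-70b"   # or "jais-2-70b", per your benchmarks
GATEWAY_ENGLISH = "claude-sonnet"           # via gateway, non-regulated data only
SOVEREIGN_ENGLISH = "llama-3.3-70b"         # sovereign fallback for regulated English

# Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")

def detect_language(text: str) -> str:
    """Cheap script-ratio detector; a real classifier replaces this in production."""
    arabic = len(ARABIC_CHARS.findall(text))
    total = sum(1 for c in text if c.isalpha())
    if total == 0:
        return "en"
    ratio = arabic / total
    if ratio > 0.8:
        return "ar"
    if ratio > 0.2:
        return "mixed"   # Arabic-English code-switching
    return "en"

def route(text: str, regulated: bool) -> str:
    """Pick a model endpoint per the routing rules described above."""
    lang = detect_language(text)
    if lang in ("ar", "mixed"):
        return SOVEREIGN_ARABIC
    return SOVEREIGN_ENGLISH if regulated else GATEWAY_ENGLISH
```

The `regulated` flag stands in for whatever data-classification signal your entry layer already computes; the key property is that regulated traffic never leaves sovereign infrastructure regardless of language.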
The cost picture
Steady-state for a typical UAE banking deployment (~20–50 concurrent users on Arabic queries, ~3M tokens/day of Arabic content):
| Configuration | Capex | Monthly opex |
|---|---|---|
| Cloud-only (Claude via Bedrock) | AED 0 | AED 35,000–80,000 |
| Sovereign Foundations tier (Falcon-H1 70B on H100) | AED 280,000 | AED 12,000–18,000 |
| Sovereign Sovereign tier (multi-GPU H100/H200) | AED 380,000 | AED 18,000–28,000 |
Cloud-only is cheaper at very low volumes. Sovereign overtakes on economics past roughly 1.5–2M tokens/day, and for in-scope regulated workloads it is the only compliant option at any volume.
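To make the crossover concrete, here is a back-of-envelope model built from the table's midpoints. The 36-month amortisation window and the linear cloud-pricing assumption are ours, not vendor figures:

```python
# Midpoint cloud opex from the table (AED 35k-80k -> ~57.5k) over the
# ~90M tokens/month implied by 3M tokens/day.
CLOUD_AED_PER_MTOK = 57_500 / 90

def monthly_cloud(tokens_per_day: float) -> float:
    """Cloud cost scales roughly linearly with token volume."""
    return tokens_per_day * 30 / 1e6 * CLOUD_AED_PER_MTOK

def monthly_sovereign(capex: float, opex_mid: float, months: int = 36) -> float:
    """Sovereign cost is mostly flat up to the GPUs' capacity."""
    return capex / months + opex_mid

def breakeven_tokens_per_day(capex: float, opex_mid: float) -> float:
    """Daily volume at which cloud opex matches amortised sovereign cost."""
    return monthly_sovereign(capex, opex_mid) / CLOUD_AED_PER_MTOK * 1e6 / 30

# Foundations tier (AED 280k capex, ~AED 15k/month opex midpoint): the
# crossover lands near ~1.2M tokens/day under these assumptions, in the
# same ballpark as the 1.5-2M figure cited above.
```

The exact crossover moves with your amortisation period and negotiated cloud rates, which is why we recompute it per engagement rather than quoting a fixed number.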
For our Hisabi.ai operations, the hybrid pattern (Sovereign Falcon-H1 for Arabic, Claude via Gateway for English non-regulated tasks) hits the sweet spot of compliance + cost + quality.
When to pick Falcon-H1 Arabic
Pick Falcon-H1 Arabic if:
- Your primary workload is dialectal Arabic (Gulf, Egyptian, Levantine, Maghrebi)
- Latency budget is tight (slightly faster than Jais 2 at the same parameter count)
- You want the more permissive licensing posture (Falcon ships under Apache 2.0, with fewer commercial restrictions)
- You're deploying on UAE infrastructure and want a UAE-developed model (it originated at TII)
When to pick Jais 2
Pick Jais 2 if:
- Saudi-context factual recall matters specifically (its training corpus is weighted toward Saudi sources)
- You want G42 ecosystem integration (Inception, Stargate UAE infrastructure)
- You're operating in a Saudi-regulatory context and prefer the KSA association
- Your benchmarks on your specific corpus favour it (always test)
The benchmark you should actually run
Generic benchmarks give you the rough order of magnitude; your specific workload is what decides. Our standard discovery procedure:
- Curate 50–100 cases representative of your real traffic.
- Hand-label expected outputs (or rubric-based scoring criteria).
- Run all candidate models on the same inputs with consistent prompts.
- Score with the same scoring function across all models.
- Tally by task type to see where each model is strong.
Expect this to take 1–2 weeks for a single use case. The output is defensible model selection, not a guess from generic benchmarks.
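The five steps above reduce to a small harness. The model callables and the scoring function are placeholders you would wire to your own inference clients and rubric:

```python
from collections import defaultdict

def run_eval(cases, models, score):
    """Run every candidate model over the same labelled cases.

    cases:  list of dicts with 'input', 'expected', 'task_type'.
    models: dict of name -> callable(input) -> output (your inference clients).
    score:  callable(output, expected) -> float in [0, 1] (exact-match or rubric).
    Returns {task_type: {model_name: mean_score}} for the tally in step 5.
    """
    totals = defaultdict(lambda: defaultdict(list))
    for case in cases:
        for name, generate in models.items():
            s = score(generate(case["input"]), case["expected"])
            totals[case["task_type"]][name].append(s)
    # Average per task type so per-model strengths become visible.
    return {task: {m: sum(v) / len(v) for m, v in by_model.items()}
            for task, by_model in totals.items()}
```

Keeping the scoring function shared across models (step 4) is the detail that makes the tally defensible; per-model scoring tweaks quietly invalidate the comparison.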
What this changes about UAE/GCC AI agencies
Two strategic implications:
1. Translation layers are obsolete. The pattern of "translate Arabic to English, run English LLM, translate back" was an interim hack. With Falcon-H1 and Jais 2 production-ready, agencies still shipping translation-layer architectures are leaving quality and compliance on the table.
2. Sovereign + Arabic is the moat. Western AI agencies serving GCC clients have to either build Arabic-LLM expertise or partner for it — and most have done neither. Local agencies that run Arabic LLMs in production (us, plus a small number of others) hold a structural advantage on regulated GCC work.
Where Codenovai fits
Every Sovereign AI + RAG deployment we ship for Arabic-content clients runs on Falcon-H1 Arabic or Jais 2 — selected based on benchmark results against the client's specific corpus during discovery. We've shipped both in production. We have the H100 capacity, the inference tooling (vLLM and Ollama), and the eval harness to make either choice work.
For pure Arabic-content workloads, we recommend starting with the Foundations tier at AED 150,000 — a single-corpus deployment on hardware sized for your specific workload, with model selection part of the engagement, not pre-decided.
Book a scoping call — bring your sample corpus and we'll have benchmark results within 7 days.
