Desktop and Mobile Deployment

The native advantage

Browser deployment is powerful for reach, but native applications have advantages that matter for enterprise:

Full GPU access. Native applications use the GPU directly via Metal, CUDA, Vulkan, or DirectML. No browser sandbox. No shared memory constraints. A native app on an M3 MacBook Pro with 36GB unified memory can run models that no browser tab could handle.
Background processing. Browsers throttle background tabs. Native applications can run inference in the background while the user does other work.
System integration. Native apps can access the filesystem, clipboard, system notifications, and OS-level accessibility features. A desktop AI assistant that monitors your clipboard for text to summarise is possible natively but not in a browser.
Offline by default. Native applications with bundled models work offline without any special architecture. The model is just a file on disk.

The tradeoff is distribution. Browsers need no installation; native applications need packaging, signing, and deployment through MDM or app stores. For enterprise internal tools where you control the deployment pipeline, this is usually acceptable.

Your organisation wants an AI-powered writing assistant that helps employees draft emails and reports. It should work offline and integrate with the OS clipboard. What deployment target fits best?

macOS: Metal, llama.cpp, and MLX

macOS on Apple Silicon is the best platform for local AI inference, full stop. The unified memory architecture, Metal GPU framework, and neural engine make it the most capable edge AI platform available.

llama.cpp with Metal

llama.cpp supports Metal acceleration natively. On Apple Silicon, it uses the GPU via Metal for matrix operations while leveraging the unified memory architecture to avoid the CPU-GPU data transfer bottleneck that plagues discrete GPU systems.

# Build llama.cpp with Metal support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_METAL=1

# Run inference
./main -m gemma-4-27b-it-Q4_K_M.gguf \
  -p "Summarise this contract:" \
  --n-gpu-layers 99 \  # Offload all layers to GPU
  -n 512 \             # Max tokens to generate
  -t 8                 # CPU threads for non-GPU operations

Performance on Apple Silicon (Gemma 4 27B, Q4_K_M):

Chip	Memory	Tokens/sec (prompt)	Tokens/sec (generation)
M1 (8GB)	8 GB	Cannot fit 27B	Use 12B or smaller
M1 Pro (16GB)	16 GB	80-120	12-18
M2 Max (32GB)	32 GB	150-220	18-25
M3 Pro (18GB)	18 GB	100-150	15-22
M3 Max (36GB)	36 GB	200-280	22-30
M4 Max (48GB)	48 GB	250-350	28-38

MLX (Apple's ML framework)

MLX is Apple's machine learning framework designed specifically for Apple Silicon. It provides a NumPy-like API with lazy evaluation and unified memory optimisation.

# Install MLX
pip install mlx-lm

# Run inference with MLX
mlx_lm.generate \
  --model mlx-community/gemma-2-27b-it-4bit \
  --prompt "Summarise this document:" \
  --max-tokens 512

MLX often outperforms llama.cpp on Apple Silicon by 10-20% because it is specifically optimised for the Metal and Neural Engine pipeline. For macOS-specific deployments, MLX is the performance leader.

When to choose which:

llama.cpp: Cross-platform compatibility, GGUF format ecosystem, server mode with API
MLX: Maximum performance on Apple Silicon, Python-native API, training and fine-tuning support

Windows: CUDA, Vulkan, and DirectML

Windows deployment centres on GPU framework selection.

NVIDIA GPUs (CUDA): The fastest path. llama.cpp with CUDA, vLLM, and text-generation-inference all support CUDA natively. If your fleet has NVIDIA GPUs (even GTX 1060 or newer), CUDA is the answer.

# Build llama.cpp with CUDA on Windows
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release

# Run with GPU offloading
.\build\bin\Release\main.exe -m model.gguf --n-gpu-layers 99

AMD GPUs (Vulkan): llama.cpp supports Vulkan as a cross-vendor GPU backend. Performance is 60-80% of CUDA on equivalent hardware, but it works on AMD, Intel, and NVIDIA GPUs.

# Build llama.cpp with Vulkan
cmake -B build -DLLAMA_VULKAN=ON
cmake --build build --config Release

Intel GPUs (DirectML/SYCL): For Intel Arc or integrated graphics, DirectML provides hardware acceleration through the Windows ML runtime. Performance is lower than CUDA but adequate for small models.

Windows-specific considerations:

GPU drivers matter more than on macOS. Keep NVIDIA drivers updated for latest CUDA performance.
Windows Defender real-time scanning can slow model file loading. Exclude model directories from scanning.
WSL2 (Windows Subsystem for Linux) provides GPU passthrough, so Linux deployment instructions work inside WSL2.

Linux deployment

Linux is the most flexible platform. All inference engines support Linux natively, and you have full control over drivers, GPU scheduling, and system configuration.

# llama.cpp with CUDA on Linux
cmake -B build -DLLAMA_CUDA=ON
cmake --build build

# llama.cpp as a server (OpenAI-compatible API)
./build/bin/llama-server \
  -m gemma-4-27b-it-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99

The llama-server mode is particularly useful for enterprise deployment: it exposes an OpenAI-compatible REST API, so existing applications that use openai.ChatCompletion.create() can point to your local server with a one-line configuration change.

Your organisation has a mixed fleet: 60% of laptops have NVIDIA GPUs, 25% have AMD GPUs, and 15% have Intel integrated graphics only. What inference engine setup provides the broadest coverage?

iOS and Android: on-device inference

Mobile deployment has unique constraints: limited thermal headroom, battery conservation, and strict OS-level restrictions on background processing.

iOS deployment options:

CoreML: Apple's machine learning framework for iOS. Optimised for the Neural Engine and GPU on Apple chipsets. CoreML models are compiled to a device-specific format at install time.

Best performance for Apple hardware
Requires model conversion to CoreML format (using coremltools)
Supported model sizes: up to ~4B parameters quantised on devices with 6GB+ RAM (iPhone 15 Pro and later)
Integrates with Swift and SwiftUI natively

MLX on iOS: Apple is extending MLX to iOS, enabling the same framework used on macOS for mobile inference. This simplifies cross-device deployment within the Apple ecosystem.

MediaPipe LLM: Google's cross-platform inference library works on iOS with GPU acceleration via Metal. Supports Gemma models out of the box.

// MediaPipe LLM on iOS (Swift)
import MediaPipeLLM

let inference = LlmInference(options: LlmInference.Options(
    modelPath: Bundle.main.path(forResource: "gemma-2b-it-q4", ofType: "bin")!,
    maxTokens: 512,
    topK: 40,
    temperature: 0.3
))

let response = try await inference.generateResponse(
    inputText: "Summarise this meeting note:"
)

Android deployment options:

MediaPipe LLM: The primary option for Android. Uses GPU acceleration via OpenCL or the Android NNAPI (Neural Networks API). Broad device support.

// MediaPipe LLM on Android (Kotlin)
val options = LlmInference.LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-2b-it-q4.bin")
    .setMaxTokens(512)
    .setTemperature(0.3f)
    .build()

val llmInference = LlmInference.createFromOptions(context, options)
val result = llmInference.generateResponse("Summarise this report:")

llama.cpp via NDK: For maximum flexibility, llama.cpp can be compiled for Android using the NDK and integrated into apps via JNI. This gives access to the full GGUF model ecosystem.

TensorFlow Lite (TFLite): Google's lightweight inference engine for mobile. Supports quantised models with GPU and NNAPI delegates. More mature than MediaPipe LLM but requires TFLite model format.

Mobile-specific constraints:

Thermal throttling: Phones reduce CPU/GPU clock speeds when they get hot. Sustained inference causes throttling within 30-60 seconds, reducing performance by 30-50%. Design for burst usage (quick queries) rather than continuous generation.
Battery drain: GPU inference at full load consumes 3-6W on a modern phone. A 10-second inference run uses roughly the same energy as 2-3 minutes of web browsing. Acceptable for occasional use, problematic if the user is running dozens of queries per hour.
Background restrictions: iOS and Android aggressively suspend background applications. Model loading takes 2-5 seconds, so if the app is backgrounded and resumed, the user may experience a loading delay. Keep the model loaded in a foreground service (Android) or use background task extensions (iOS) to mitigate this.
RAM limits: Even flagships have 8-12GB of RAM, shared with the OS and all apps. Practical model size limit is 2-4GB at INT4, which means models up to ~4B parameters.

What to realistically expect from Gemma 4 E2B on a modern phone (2024+):

Device	Tokens/sec (generation)	Time-to-first-token	Max context
iPhone 15 Pro (8GB)	15-25	150-300ms	4K tokens
iPhone 16 Pro (8GB)	20-30	100-200ms	4K tokens
Samsung Galaxy S24 Ultra (12GB)	12-20	200-400ms	4K tokens
Pixel 9 Pro (12GB)	14-22	180-350ms	4K tokens

These numbers are for the E2B model (2B effective parameters, Q4). Adequate for quick summarisation, Q&A, and classification tasks. Not suitable for extended conversations or long document generation.

You are building a field maintenance app for technicians who need AI-assisted troubleshooting on-site. They use Android phones. Which inference approach is best?

Electron and Tauri: desktop application packaging

For enterprise distribution, you need to package the inference engine and model into a distributable desktop application. Two frameworks dominate this space.

Electron bundles Chromium and Node.js. Your application is a web app (HTML/CSS/JavaScript) with full Node.js capabilities for file system access, process management, and native addons.

Pros: mature ecosystem, large community, good documentation, easy to hire for
Cons: large binary size (~150-200MB for the Electron shell alone, before your model), high memory usage
llama.cpp integration: use node-llama-cpp (Node.js bindings) or spawn llama.cpp as a child process

Tauri bundles the system's native webview (WebKit on macOS, WebView2 on Windows, WebKitGTK on Linux) with a Rust backend. Much smaller binary size and memory footprint.

Pros: 5-10MB shell (vs 150-200MB for Electron), lower memory usage, Rust backend for performance-critical code
Cons: smaller community, requires Rust knowledge for backend, less mature plugin ecosystem
llama.cpp integration: Rust bindings via llama-cpp-rs, or spawn as a child process

The practical architecture:

Desktop App (Electron or Tauri)
├── Frontend (HTML/CSS/JS)
│   └── Chat UI with streaming output
├── Backend (Node.js or Rust)
│   ├── llama.cpp inference engine
│   ├── Model file management (download, cache, update)
│   └── Local API server (localhost:PORT)
└── Bundled model file (1.5-17GB)
    └── Or: first-run download with progress UI

The model file is either bundled with the application (increasing installer size by 1.5-17GB) or downloaded on first run. For enterprise deployment via MDM, bundling avoids network issues during installation. For self-service download, first-run download with caching is more practical.

Distribution via MDM (Mobile Device Management):

Enterprise IT teams deploy desktop applications through MDM tools (Jamf, Intune, SCCM). The application package includes:

The Electron/Tauri shell
The inference engine (llama.cpp binary or library)
Configuration files (model selection, enterprise settings)
Optionally, the model file itself

The model file can also be distributed separately via a network share, reducing the MDM package size.

✎

Module 7 -- Final Assessment

Why does Apple Silicon provide a particular advantage for local AI inference compared to systems with discrete GPUs?

A field technician's Android phone starts generating AI responses slowly after 30-60 seconds of continuous use. What is the most likely cause?

You need to deploy a desktop AI application to 2,000 enterprise Windows machines via Intune. The application uses Gemma 4 E4B (3GB model). What is the most practical distribution strategy?

Your organisation's laptop fleet is 60% NVIDIA GPU, 25% AMD GPU, 15% Intel integrated. What llama.cpp build strategy provides the best experience across all devices?