In the previous post, I showed you how easy it is to integrate a pre-defined, on-device language model into your app using MLX Swift.

Exploring MLX Swift: Adding On-Device Inference to your App
Learn how to integrate MLX Swift into your iOS and macOS apps for on-device AI inference. This guide shows how to add local language models using Apple’s MLX framework, enabling offline AI capabilities on Apple silicon.

We used models directly from ModelRegistry, which is pretty convenient. But you are not limited to the models defined in MLXLLM. You can go bigger and more custom, right up to running a 7B parameter model on your iPhone 16 Pro!


In this post, I will walk you through how to configure a model that is not already defined in ModelRegistry. We will explore how to bring in a Qwen 2.5 model as an example.

Beyond Pre-Defined Models

We relied on pre-built configurations like ModelRegistry.llama3_2_1B_4bit or ModelRegistry.llama3_2_3B_4bit to load and run on-device inference.
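
As a quick recap, this is roughly what loading one of those pre-defined configurations looked like (a minimal sketch based on the approach from the previous post; the small Llama model is just an example):

import MLXLLM
import MLXLMCommon

// Load a pre-defined configuration straight from ModelRegistry.
let container = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.llama3_2_1B_4bit
) { progress in
    debugPrint("Downloading model: \(Int(progress.fractionCompleted * 100))%")
}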

But when new models launch, you will most likely not find them in ModelRegistry right away. If you are anything like me and love to feed your shiny object syndrome, you can define your own configuration and load it manually. This approach lets you choose any model—maybe even one you trained yourself—to run locally!

Below is an excerpt from a Qwen 2 model configuration from MLXLLM. It defines layers, attention mechanisms, rotary embeddings, and more.

public struct Qwen2Configuration: Codable, Sendable {
    var hiddenSize: Int
    var hiddenLayers: Int
    var intermediateSize: Int
    var attentionHeads: Int
    var rmsNormEps: Float
    var vocabularySize: Int
    var kvHeads: Int
    var ropeTheta: Float = 1_000_000
    var ropeTraditional: Bool = false
    var ropeScaling: [String: StringOrNumber]? = nil
    var tieWordEmbeddings = false

    // Decoding logic...
}

// The model definition includes TransformerBlock, MLP, and Attention modules.
// All these pieces come together to form the Qwen2Model class, which conforms
// to the LLMModel protocol and can be plugged into MLX Swift’s inference pipeline.
public class Qwen2Model: Module, LLMModel, KVCacheDimensionProvider {
    public let vocabularySize: Int
    public let kvHeads: [Int]

    let model: Qwen2ModelInner
    let configuration: Qwen2Configuration

    @ModuleInfo(key: "lm_head") var lmHead: Linear

    // Initialization logic...
    // ...
}

This code snippet may look like a lot, but it is not something you have to write yourself, at least in the case of Qwen models. Think of it as a template that already ships with MLXLLM.

You can load and run the model locally just like you did with the smaller model examples.

Steps to Use a Custom Model

  • Create or Adapt a Configuration: Find (or create) a configuration that matches the architecture of your model.
  • Point to Your Model Weights and Tokenizer: Make sure you have the .safetensors files for Qwen 2.5 Coder on Hugging Face or in your app’s bundle. The tokenizer files should also be accessible. The Qwen 2.5 Coder model has already been converted by the MLX Community, so you can use it directly:
mlx-community/Qwen2.5-Coder-7B-Instruct-4bit · Hugging Face

Here is the code to create an extension on ModelRegistry. You can follow the naming pattern used by the existing configurations, where the model name and size are reflected in the variable name. For a Qwen 2.5 Coder 7B model, a suitable variable name might be:

extension MLXLLM.ModelRegistry {
    static public let qwen2_5Coder_7B_4bit = ModelConfiguration(
        id: "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
        overrideTokenizer: "PreTrainedTokenizer",
        defaultPrompt: "What is really the meaning of life?"
    )
}

When you see the code setting overrideTokenizer: "PreTrainedTokenizer" in a ModelConfiguration, it means that the model will rely on a tokenizer that has been pretrained and configured externally—typically from the model’s own Hugging Face Hub repository or other provided configuration files—instead of using a built-in or default tokenizer setup.

Rather than constructing a tokenizer from scratch or using a generic tokenizer, you are telling the code to load a tokenizer that matches the model’s architecture, vocabulary, special tokens, and text-processing rules. This ensures that the model and tokenizer align perfectly, since the tokenizer is responsible for converting raw input text into the token IDs that the model understands.

Then, you load the container:

let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.qwen2_5Coder_7B_4bit
) { progress in
    debugPrint("Downloading custom model: \(Int(progress.fractionCompleted * 100))%")
}

This will download weights, create the model, and prepare it for inference—exactly like in the previous blog post.
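
Since the tokenizer comes from the model’s own repository (as described above), a quick round trip is an easy way to confirm it loaded correctly. This is a small optional sketch, assuming the tokenizer exposes encode(text:) alongside the decode(tokens:) call used later:

let roundTrip = try await modelContainer.perform { context in
    // Encode a sample string to token IDs, then decode it back.
    let tokens = context.tokenizer.encode(text: "struct ContentView: View {")
    return context.tokenizer.decode(tokens: tokens)
}
debugPrint(roundTrip) // should print the original sample string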

Finally, generate the text:

let prompt = """
When asked to write code, respond with only the code itself. Do not use markdown code fences (```), file names, or explanatory text. Just output the raw code directly.

Example - if asked "Write a hello world SwiftUI view", respond exactly like this:
struct ContentView: View {
    var body: some View {
        Text("Hello World")
    }
}

Not like this:
```swift
struct ContentView: View {
    var body: some View {
        Text("Hello World")
    }
}
```

Write a beautiful SwiftUI view. Skip any preamble.
"""
let result = try await modelContainer.perform { [prompt] context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        let text = context.tokenizer.decode(tokens: tokens)
        // Update your UI with 'text' as it’s generated
        self.output = text
        return .more
    }
}

The prompt can be broken down into separate user and system prompts; that will be covered in a subsequent blog post.

That is it! You have loaded a custom, larger model and generated text on-device.

Oh wait.

The app has been killed by the operating system because it is using too much memory. Why?

Running a big language model—like a 7B parameter model—can easily consume several gigabytes of RAM, even after quantization to 4-bit weights. This can quickly push memory use beyond what iOS considers safe for a background or even a foreground app, triggering the system to terminate it in order to maintain overall device stability.
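
One thing that can help before (or alongside) requesting more memory is reining in MLX’s own caching. This is a small sketch, not a cure-all: MLX keeps a cache of Metal buffers for reuse, and capping it means freed buffers go back to the system instead of sitting in the cache. The sample apps in mlx-swift-examples set a similarly small limit.

import MLX

// Cap the Metal buffer cache (20 MB here) so intermediate allocations are
// released instead of being kept around for reuse.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)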

Increased Memory Limit 

Apple offers an Increased Memory Limit entitlement (com.apple.developer.kernel.increased-memory-limit) that you can add to your app to access more memory on supported devices.

Remember that even with this entitlement, the system still manages resources and will only allocate additional memory if it is available and won’t compromise overall device stability.

To enable the Increased Memory Limit entitlement in your app:

  • Select Your App’s Target: In the Project Navigator, click on your project file at the top. Then, select your app’s target from the “Targets” section.
  • Go to the Signing & Capabilities Tab: With your target selected, click on the Signing & Capabilities tab in the main editor area.
  • Add a New Capability: Click the + Capability button in the upper-left corner of the “Signing & Capabilities” section.
  • Search for “Increased Memory Limit”: Type “Increased Memory Limit” into the search bar that appears. If the entitlement is available, it will appear in the search results.
  • Enable the Increased Memory Limit Entitlement: Select the “Increased Memory Limit” capability to add it. This will automatically update your app’s entitlements file to include:

<key>com.apple.developer.kernel.increased-memory-limit</key>
<true/>
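
If you want to confirm the entitlement actually raised the ceiling on a given device, you can check the remaining allocatable memory at runtime. A hypothetical sketch, assuming os_proc_available_memory() (available since iOS 13) is reachable via import os:

import os

// Reports how much memory the app can still allocate before hitting its limit.
let headroomMB = os_proc_available_memory() / (1024 * 1024)
debugPrint("Remaining memory headroom: \(headroomMB) MB")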

Release Configuration

Apart from that, you can also test memory usage in the Release configuration. Debug builds carry extra overhead and fewer optimizations, so switching to a Release build can show noticeably better memory behavior.

And after setting up the entitlement, this powerful model should work on your latest iPhone!


Moving Forward

There are many models that you can play around with. As soon as the latest Hermes 3 dropped, I converted it to MLX and got it running, since it is based on the Llama 3.2 architecture, which MLX already supports:

rudrankriyam/Hermes-3-Llama-3.2-3B-4bit · Hugging Face

In the subsequent blog posts, we will explore working with the configurations and generation parameters!


If you have any questions or want to share what you’re building with MLX Swift, feel free to drop a comment below or reach out on Twitter @rudrankriyam or on Bluesky @rudrankriyam.bsky.social.

Happy MLXing!
