In the previous post, I showed you how easy it is to integrate a pre-defined, on-device language model into your app using MLX Swift.
We used models directly from ModelRegistry, which is pretty nice and convenient. But you are not limited to the models defined in MLXLLM. You can go even bigger and more custom, right up to running a 7B parameter model on your iPhone 16 Pro!
In this post, I will walk you through how to configure a model that is not already defined in ModelRegistry. We will explore how to bring in a Qwen 2.5 model as an example.
Beyond Pre-Defined Models
We relied on pre-built configurations like ModelRegistry.llama3_2_1B_4bit or ModelRegistry.llama3_2_3B_4bit to load and run on-device inference.
But as new models launch, you will most likely not find them in ModelRegistry right away. If you are anything like me and love to feed your shiny object syndrome, you can define your own configuration and load it manually. This approach lets you choose any model—maybe even one you trained yourself—to run locally!
Below is an excerpt from the Qwen 2 model configuration in MLXLLM. It defines layers, attention mechanisms, rotary embeddings, and more.
public struct Qwen2Configuration: Codable, Sendable {
    var hiddenSize: Int
    var hiddenLayers: Int
    var intermediateSize: Int
    var attentionHeads: Int
    var rmsNormEps: Float
    var vocabularySize: Int
    var kvHeads: Int
    var ropeTheta: Float = 1_000_000
    var ropeTraditional: Bool = false
    var ropeScaling: [String: StringOrNumber]? = nil
    var tieWordEmbeddings = false

    // Decoding logic...
}
// The model definition includes TransformerBlock, MLP, and Attention modules.
// All these pieces come together to form the Qwen2Model class, which conforms
// to the LLMModel protocol and can be plugged into MLX Swift’s inference pipeline.
public class Qwen2Model: Module, LLMModel, KVCacheDimensionProvider {
    public let vocabularySize: Int
    public let kvHeads: [Int]

    let model: Qwen2ModelInner
    let configuration: Qwen2Configuration

    @ModuleInfo(key: "lm_head") var lmHead: Linear

    // Initialization logic...
    // ...
}
This code snippet may look like a lot, but in reality, it is not something you have to deal with, at least in the case of Qwen models for now. Think of it as a template that is already present in MLXLLM.
You can load and run the model locally just like you did with the smaller model examples.
Steps to Use a Custom Model
- Create or Adapt a Configuration: Find (or create) a configuration that matches the architecture of your model.
- Point to Your Model Weights and Tokenizer: Make sure you have the .safetensors files for Qwen 2.5 Coder on Hugging Face or in your app’s bundle, and that the tokenizer files are accessible alongside them. The Qwen 2.5 Coder model has already been converted by the MLX Community, so you can use it directly. If you would rather ship the weights inside the app, see the sketch right after this list.
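ModelConfiguration can also point at a local directory instead of a Hugging Face repo id. Here is a minimal sketch of that route; it assumes the directory-based initializer in MLXLMCommon and a hypothetical folder named Qwen2.5-Coder-7B-Instruct-4bit copied into the app bundle as a folder reference, so verify the initializer against the MLXLMCommon version you are using.
import Foundation
import MLXLMCommon

// Hypothetical: load converted weights shipped inside the app bundle.
if let modelDirectory = Bundle.main.url(
    forResource: "Qwen2.5-Coder-7B-Instruct-4bit", withExtension: nil
) {
    let localConfiguration = ModelConfiguration(
        directory: modelDirectory,
        overrideTokenizer: "PreTrainedTokenizer"
    )
    // Pass localConfiguration to LLMModelFactory.shared.loadContainer as shown below.
}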
Here is the code to create an extension on ModelRegistry. You can follow the naming pattern used by the existing configurations, where the model name and size are reflected in the variable name. For a Qwen 2.5 Coder 7B model, a suitable variable name might be:
extension MLXLLM.ModelRegistry {
    static public let qwen2_5Coder_7B_4bit = ModelConfiguration(
        id: "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",
        overrideTokenizer: "PreTrainedTokenizer",
        defaultPrompt: "What is really the meaning of life?"
    )
}
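With the extension in place, the new configuration sits right next to the built-in ones. The customModelConfiguration constant used in the loading code below is simply a local name for it:
let customModelConfiguration = ModelRegistry.qwen2_5Coder_7B_4bit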
When you see the code setting overrideTokenizer: "PreTrainedTokenizer" in a ModelConfiguration, it means that the model will rely on a tokenizer that has been pretrained and configured externally—typically from the model’s own Hugging Face Hub repository or other provided configuration files—instead of using a built-in or default tokenizer setup.
Rather than constructing a tokenizer from scratch or using a generic tokenizer, you are telling the code to load a tokenizer that matches the model’s architecture, vocabulary, special tokens, and text-processing rules. This ensures that the model and tokenizer align perfectly, since the tokenizer is responsible for converting raw input text into the token IDs that the model understands.
Then, you load the container:
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: customModelConfiguration
) { progress in
    debugPrint("Downloading custom model: \(Int(progress.fractionCompleted * 100))%")
}
This will download weights, create the model, and prepare it for inference—exactly like in the previous blog post.
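To see the pretrained tokenizer from the configuration in action, a quick round trip is a handy sanity check. This is just a sketch; it assumes the tokenizer exposes encode(text:) and decode(tokens:), as in the swift-transformers Tokenizers package that MLX Swift relies on:
// Sanity check: text -> token IDs -> text should round-trip cleanly
// when the model and tokenizer match.
let roundTrip = try await modelContainer.perform { context in
    let tokenIDs = context.tokenizer.encode(text: "print(\"Hello, MLX\")")
    return context.tokenizer.decode(tokens: tokenIDs)
}
debugPrint(roundTrip)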
Finally, generate the text:
let prompt =
"""
When asked to write code, respond with only the code itself. Do not use markdown code fences (```), file names, or explanatory text. Just output the raw code directly.
Example - if asked "Write a hello world SwiftUI view", respond exactly like this:
struct ContentView: View {
    var body: some View {
        Text("Hello World")
    }
}

Not like this:

```swift
struct ContentView: View {
    var body: some View {
        Text("Hello World")
    }
}
```

Write a beautiful SwiftUI view. Skip any preamble.
"""
let result = try await modelContainer.perform { [prompt] context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        let text = context.tokenizer.decode(tokens: tokens)
        // Update your UI with 'text' as it’s generated
        self.output = text
        return .more
    }
}
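One detail about that callback: it does not run on the main actor, so if output drives a SwiftUI view, hop back to the main actor before assigning to it. A minimal sketch, assuming output is a published property on an observable view model, the callback then looks like this:
return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
    let text = context.tokenizer.decode(tokens: tokens)
    // Hop to the main actor before touching UI-facing state.
    Task { @MainActor in
        self.output = text
    }
    return .more
}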
The prompt can be broken down into a user prompt and a system prompt; that will be covered in a subsequent blog post.
That is it! You have loaded a custom, larger model and generated text on-device.
Oh wait.
The app has been killed by the operating system because it is using too much memory. Why?
Running a big language model—like a 7B parameter model—can easily consume several gigabytes of RAM, even after quantization to 4-bit weights. This can quickly push memory use beyond what iOS considers safe for a background or even a foreground app, triggering the system to terminate it in order to maintain overall device stability.
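A rough back-of-the-envelope calculation makes this concrete: 7 billion parameters at 4 bits each is about 3.5 GB for the weights alone, before the KV cache, activations, and the rest of your app are counted. That is already in the neighborhood of what iOS will normally let a single app allocate.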
Increased Memory Limit
Apple has an Increased Memory Limit entitlement (com.apple.developer.kernel.increased-memory-limit) that you can add to your app to access more memory on supported devices.
Remember that even with this entitlement, the system still manages resources and will only allocate additional memory if it is available and won’t compromise overall device stability.
To enable the Increased Memory Limit entitlement in your app:
- Select Your App’s Target: In the Project Navigator, click on your project file at the top. Then, select your app’s target from the “Targets” section.
- Go to the Signing & Capabilities Tab: With your target selected, click on the Signing & Capabilities tab in the main editor area.
- Add a New Capability: Click the + Capability button in the upper-left corner of the “Signing & Capabilities” section.
- Search for “Increased Memory Limit”: Type “Increased Memory Limit” into the search bar that appears. If the entitlement is available, it will appear in the search results.
- Enable the Increased Memory Limit Entitlement: Select the “Increased Memory Limit” capability to add it. This will automatically update your app’s entitlements file to include:
<key>com.apple.developer.kernel.increased-memory-limit</key>
<true/>
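To confirm the entitlement is actually giving you headroom, you can check the remaining memory budget at runtime. This is a small sketch using os_proc_available_memory (iOS 13 and later), which reports how much more memory the current process can allocate before hitting its limit; logging it before loading the model is a quick way to compare with and without the entitlement:
import os

// How much more memory this process may allocate before the system steps in.
let remainingBudget = os_proc_available_memory()
debugPrint("Available memory budget: \(remainingBudget / 1_048_576) MB")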
Release Configuration
Apart from that, you can also test memory usage in the Release configuration. Debug builds carry additional overhead and fewer optimizations, so switching to a Release build (Product > Scheme > Edit Scheme… > Run > Build Configuration > Release) might show better memory performance.
And after setting up the entitlement, this powerful model should work on your latest iPhone!
Moving Forward
There are many models that you can play around with. As soon as the latest Hermes 3 dropped, I converted it to MLX and got it running, since it is based on the Llama 3.2 architecture, something that MLX already supports.
In the subsequent blog posts, we will explore working with the configurations and generation parameters!
If you have any questions or want to share what you’re building with MLX Swift, feel free to drop a comment below or reach out on Twitter @rudrankriyam or on Bluesky @rudrankriyam.bsky.social.
Happy MLXing!