In the previous post, I walked you through configuring models that are already available on the MLX Community page on Hugging Face. We explored how to use predefined model configurations to integrate on-device inference into your iOS app.
When I saw Teknium announce the latest Hermes 3 3B model, I was excited, but I realised nobody had converted it to the MLX format yet, so I decided to do it myself.
As it is based on the Llama 3.2 architecture, I could leverage MLX’s existing support for Llama models to get it running with absolutely minimal effort.
Steps to Convert a Model to MLX
Here’s how you can convert a model from Hugging Face into the MLX format using the mlx-lm tools:
- Create a Project Folder: First, organize your workspace:
mkdir hermes3_conversion
cd hermes3_conversion
- Set Up a Python Environment: Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
- Install MLX Tools: Install the mlx-lm Python package:
pip install mlx mlx-lm
- Convert the Model: Use the MLX conversion tool to convert the model from Hugging Face:
mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q
• --hf-path: The Hugging Face model repository path.
• -q: Quantizes the model weights during conversion (4-bit by default), shrinking the download size and memory footprint enough to make the model feasible for mobile devices. You can tune the quantization settings, as shown after the sample output below.
When you run this command, you will see an output similar to:
[INFO] Loading
Fetching 8 files: 100%|███████████████████████████| 8/8 [00:00<00:00, 10.91it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.
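The log above also tells you the effective bits per weight. If the default 4-bit quantization is too aggressive (or not aggressive enough) for your use case, mlx_lm.convert also lets you pick the bit width and group size. Treat the exact flag names as an assumption about your mlx-lm version and double-check mlx_lm.convert --help if they differ:
mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q --q-bits 8 --q-group-size 64
Higher bit widths generally give better output quality at the cost of a larger download and more memory at runtime.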
- Upload to MLX Community (Optional)
If you want to be a good MLX club member, you can share the converted model with others:
mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q --upload-repo mlx-community/Hermes-3-Llama-3.2-3B-4bit
This uploads the converted model to the mlx-community organization on Hugging Face, making it easy for everyone to access.
Note: Make sure you have joined the community first!
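Before uploading (or instead of it), it is worth a quick sanity check on your Mac. By default, the converter writes the quantized model to a local mlx_model folder (see the --mlx-path option if yours differs), so something along these lines should produce a coherent reply; the flags assume a recent mlx-lm release:
mlx_lm.generate --model mlx_model --prompt "What are the implications of the Fermi Paradox?" --max-tokens 100
If the output already looks like gibberish here, fix the conversion before wiring the model into your app.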
Configuring the Converted Model in Your App
Once the model is converted, create a configuration for it in your project. Extend the ModelRegistry
to add support for Hermes 3:
extension MLXLLM.ModelRegistry {
    static public let hermes3Llama_3_2_3B_4bit = ModelConfiguration(
        id: "mlx-community/Hermes-3-Llama-3.2-3B-4bit",
        defaultPrompt: "What are the implications of the Fermi Paradox?"
    )
}
This configuration specifies:
• id: The model’s repository path on Hugging Face.
• defaultPrompt: A sample prompt for testing the model.
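If you would rather skip Hugging Face entirely, the versions of mlx-swift-examples I have used also offer a directory-based ModelConfiguration initializer, so you can point the configuration at a local copy of the converted folder instead of a repository ID. The bundled folder name below is hypothetical; adjust it to however you ship the weights:
// Hypothetical setup: assumes you copied the converted "mlx_model"
// folder into your app bundle’s resources.
extension MLXLLM.ModelRegistry {
    static public let hermes3Llama_3_2_3B_local = ModelConfiguration(
        directory: Bundle.main.resourceURL!.appendingPathComponent("mlx_model"),
        defaultPrompt: "What are the implications of the Fermi Paradox?"
    )
}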
Loading the Model in Your App
Now that the configuration is ready, load the model using LLMModelFactory:
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.hermes3Llama_3_2_3B_4bit
) { progress in
    debugPrint("Downloading Hermes 3 model: \(Int(progress.fractionCompleted * 100))%")
}
This downloads the weights, creates the model, and prepares it for inference. Once the model is loaded, you can use it to generate text. Here is an example:
let prompt = """
What are the implications of the Fermi Paradox?
"""

let result = try await modelContainer.perform { [prompt] context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        let text = context.tokenizer.decode(tokens: tokens)
        Task { @MainActor in
            self.output = text
        }
        return .more
    }
}
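The tokens closure is also where you decide how long generation runs and how often the UI refreshes. Here is a rough sketch that reuses the same perform call as above; maxTokens and displayEvery are arbitrary values picked for illustration, not tuned defaults:
let maxTokens = 512     // arbitrary cap on the response length
let displayEvery = 4    // refresh the UI every few tokens instead of every token

let cappedResult = try await modelContainer.perform { [prompt] context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        // Decode and publish only every few tokens to reduce main-thread work.
        if tokens.count % displayEvery == 0 {
            let text = context.tokenizer.decode(tokens: tokens)
            Task { @MainActor in
                self.output = text
            }
        }
        // Stop once the cap is reached; otherwise keep generating.
        return tokens.count >= maxTokens ? .stop : .more
    }
}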
Running a 3B-parameter model, even with 4-bit quantization, can push memory usage close to the iOS limit. Use Apple’s Increased Memory Limit entitlement (the com.apple.developer.kernel.increased-memory-limit key in your app’s .entitlements file) to allow your app to exceed the default memory cap. Refer to my previous post for detailed steps.
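Another memory lever comes from MLX itself: you can cap the buffer cache MLX keeps between evaluations before loading the model, as the LLMEval sample in mlx-swift-examples does. The 20 MB figure below is that sample’s starting point, not a tuned value:
import MLX

// Keep MLX's GPU buffer cache small so peak memory stays lower,
// at the cost of some re-allocation overhead during generation.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)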
Unsupported Architectures
If the model architecture is not supported, you will need to dig into MLX Swift’s source code and add the implementation yourself, which requires a deep understanding of that architecture in the first place.
I spent a whole night trying to get NVIDIA’s Hymba 1.5B running, but I am not sure MLX Swift supports the Mamba architecture, nor do I have the knowledge (yet) to add it myself. Even after hours of working with Cursor and watching the sunrise, the best I managed was a model that ran but produced gibberish, and I had no idea how to fix it.
Moving Forward
That’s it! You have now converted and configured a model, ready for on-device inference with MLX Swift. In future posts, we will explore optimizing inference with generation parameters.
If you have any questions or stories about working with MLX Swift, let me know on Twitter @rudrankriyam or Bluesky @rudrankriyam.bsky.social.
Happy MLXing!