In the previous post, I walked you through configuring different models that are already available on the MLX Community page on Hugging Face. We explored how to use pre-defined model configurations to integrate on-device inference into your iOS app.

Exploring MLX Swift: Configuring Different Models
Learn how to integrate custom large language models into iOS/macOS apps using MLX Swift. This guide shows how to configure and run models like Qwen 2.5 locally on Apple silicon, with tips for handling memory limits and entitlements for on-device AI inference.

When I saw Teknium announce the latest Hermes 3 3B model, I was excited, but I realised nobody had converted it to MLX format yet, so I decided to do it myself.

As it is based on the Llama 3.2 architecture, I could leverage MLX’s existing support for Llama models to get it running with absolutely minimal effort.

Steps to Convert Model to MLX

Here’s how you can convert a model from Hugging Face into MLX format using the mlx-lm tools:

  • Create a Project Folder: First, organize your workspace:
mkdir hermes3_conversion
cd hermes3_conversion
  • Set Up a Python Environment: Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
  • Install MLX Tools: Install the mlx and mlx-lm Python packages:
pip install mlx mlx-lm
  • Convert the Model: Use the MLX conversion tool to convert the model from Hugging Face:
mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q

• --hf-path: The Hugging Face model repository path.

• -q: Quantizes the model weights during conversion (4-bit by default), shrinking them enough to make the model feasible for mobile devices.

When you run this command, you will see an output similar to:

[INFO] Loading
Fetching 8 files: 100%|███████████████████████████| 8/8 [00:00<00:00, 10.91it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.
  • Upload to MLX Community (Optional)

If you want to be a good MLX club member, you can share the converted model with others:

mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q --upload-repo mlx-community/Hermes-3-Llama-3.2-3B-4bit

This uploads the converted model to a repository under the mlx-community organization on Hugging Face, making it easy for everyone to access.

mlx-community (MLX Community)
Org profile for MLX Community on Hugging Face, the AI community building the future.
Note: Make sure you have joined the community first!

Configuring the Converted Model in Your App

Once the model is converted, create a configuration for it in your project. Extend the ModelRegistry to add support for Hermes 3:

extension MLXLLM.ModelRegistry {
    static public let hermes3Llama_3_2_3B_4bit = ModelConfiguration(
        id: "mlx-community/Hermes-3-Llama-3.2-3B-4bit",
        defaultPrompt: "What are the implications of the Fermi Paradox?"
    )
}

This configuration specifies:

  • id: The model’s repository path on Hugging Face.
  • defaultPrompt: A sample prompt for testing the model.
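
As you convert more models over time, it can help to gather the custom configurations in one place, for example to drive a model picker in the UI. Here is a tiny sketch; the customModels name is my own illustrative choice, not something MLX Swift provides:

import MLXLLM
import MLXLMCommon

extension MLXLLM.ModelRegistry {
    // Hypothetical list of this app's custom configurations,
    // handy for showing a model picker in the UI.
    static public let customModels: [ModelConfiguration] = [
        ModelRegistry.hermes3Llama_3_2_3B_4bit
    ]
}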

Loading the Model in Your App

Now that the configuration is ready, load the model using LLMModelFactory:

let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.hermes3Llama_3_2_3B_4bit
) { progress in
    debugPrint("Downloading Hermes 3 model: \(Int(progress.fractionCompleted * 100))%")
}

This downloads the weights, creates the model, and prepares it for inference. Once the model is loaded, you can use it to generate text. Here is an example:

let prompt = """
What are the implications of the Fermi Paradox?
"""

let result = try await modelContainer.perform { [prompt] context in
    // Turn the raw prompt into tokenized model input
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        // The callback receives the tokens generated so far; decode them to text
        let text = context.tokenizer.decode(tokens: tokens)

        // Publish the partial result on the main actor so the UI can update
        Task { @MainActor in
            self.output = text
        }
        // Keep generating until the model finishes on its own
        return .more
    }
}
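
One thing the snippet glosses over: self.output is assumed to belong to some observable object that the UI renders, which isn't shown. Below is a minimal sketch of such a wrapper, with two choices of my own layered on top: an explicit GenerateParameters value instead of the defaults, and an arbitrary 240-token budget after which the callback returns .stop rather than .more to end generation early. The class name, the temperature, and the budget are placeholders, not anything MLX Swift prescribes.

import MLXLMCommon
import Observation

@Observable
final class HermesChat {
    var output = ""

    // modelContainer is the same container returned by LLMModelFactory above.
    func respond(to prompt: String, using modelContainer: ModelContainer) async throws {
        _ = try await modelContainer.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))

            // Explicit sampling parameters; tune the temperature to taste.
            let parameters = GenerateParameters(temperature: 0.6)

            return try MLXLMCommon.generate(input: input, parameters: parameters, context: context) { tokens in
                let text = context.tokenizer.decode(tokens: tokens)
                Task { @MainActor in
                    self.output = text
                }
                // Stop after an arbitrary 240-token budget, otherwise keep going.
                return tokens.count >= 240 ? .stop : .more
            }
        }
    }
}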

Running a 3B-parameter model, even with 4-bit quantization, can push memory usage close to the iOS limit. Use Apple’s Increased Memory Limit entitlement to let your app exceed the default memory cap; my previous post, Exploring MLX Swift: Configuring Different Models, covers the detailed steps.

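Beyond the entitlement, MLX itself exposes one knob worth knowing about: the GPU buffer cache. Capping it before loading the model keeps MLX from holding on to freed buffers. The 20 MB figure below mirrors what the mlx-swift-examples sample apps use, as far as I remember; treat it as a starting point rather than a rule.

import MLX

// Limit the GPU buffer cache (in bytes) before loading the model.
// A smaller cache trades a bit of speed for a lower peak memory footprint.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)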

Unsupported Architectures

If the model architecture is not supported by MLX Swift, you have to dive into its source code and add the support yourself, which requires a deep understanding of the architecture in the first place.

I wasted a whole night trying to get Nvidia's Hymba 1.5B running, but I am not sure MLX Swift supports the Mamba architecture it builds on, nor do I have the knowledge (yet) to add it myself.

Even after hours of working with Cursor and watching the sunrise, the best I got was a model that ran but produced gibberish, and I had no idea how to fix it.

Moving Forward

That’s it! You have now converted and configured a model, ready for on-device inference with MLX Swift. In future posts, we will explore optimizing inference with generation parameters.

If you have any questions or stories about working with MLX Swift, let me know on Twitter @rudrankriyam or Bluesky @rudrankriyam.bsky.social.

Happy MLXing!
