In the previous post, I walked you through configuring different models that are already available on the MLX Community page on Hugging Face. We explored how to use pre-defined model configurations to integrate on-device inference into your iOS app.

Exploring MLX Swift: Configuring Different Models
Learn how to integrate custom large language models into iOS/macOS apps using MLX Swift. This guide shows how to configure and run models like Qwen 2.5 locally on Apple silicon, with tips for handling memory limits and entitlements for on-device AI inference.

When I saw Teknium announce the latest Hermes 3 3B model, I was excited, but I realised nobody had converted it to MLX format yet, so I decided to do it myself.

As it is based on the Llama 3.2 architecture, I could leverage MLX’s existing support for Llama models to get it running with absolutely minimal effort.

Steps to Convert Model to MLX

Here’s how you can convert a model from Hugging Face into MLX format using the mlx-lm tools:

  • Create a Project Folder: First, organize your workspace:
mkdir hermes3_conversion
cd hermes3_conversion
  • Set Up a Python Environment: Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate
  • Install MLX Tools: Install the mlx and mlx-lm Python packages:
pip install mlx mlx-lm
  • Convert the Model: Use the MLX conversion tool to convert the model from Hugging Face:
mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q

• --hf-path: The Hugging Face model repository path.

• -q: Quantizes the model weights during conversion (4-bit by default), shrinking them enough to make the model feasible for mobile devices.

When you run this command, you will see an output similar to:

[INFO] Loading
Fetching 8 files: 100%|███████████████████████████| 8/8 [00:00<00:00, 10.91it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.
  • Upload to MLX Community (Optional)

If you want to be a good MLX club member, you can share the converted model with others:

mlx_lm.convert --hf-path NousResearch/Hermes-3-Llama-3.2-3B -q --upload-repo mlx-community/Hermes-3-Llama-3.2-3B-4bit

This uploads the converted model to a repository under the mlx-community organization on Hugging Face, making it easy for everyone to access.

mlx-community (MLX Community)
Org profile for MLX Community on Hugging Face, the AI community building the future.
Note: Make sure you have joined the community first!

Configuring the Converted Model in Your App

Once the model is converted, create a configuration for it in your project. Extend the ModelRegistry to add support for Hermes 3:

extension MLXLLM.ModelRegistry {
    static public let hermes3Llama_3_2_3B_4bit = ModelConfiguration(
        id: "mlx-community/Hermes-3-Llama-3.2-3B-4bit",
        defaultPrompt: "What are the implications of the Fermi Paradox?"
    )
}

This configuration specifies:

  • id: The model’s repository path on Hugging Face.
  • defaultPrompt: A sample prompt for testing the model.
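
As you convert more models over time, it can help to gather the custom configurations in one place, for example to drive a model picker in the UI. Here is a tiny sketch; the customModels name is my own illustrative choice, not something MLX Swift provides:

import MLXLLM
import MLXLMCommon

extension MLXLLM.ModelRegistry {
    // Hypothetical list of this app's custom configurations,
    // handy for showing a model picker in the UI.
    static public let customModels: [ModelConfiguration] = [
        ModelRegistry.hermes3Llama_3_2_3B_4bit
    ]
}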

Loading the Model in Your App

Now that the configuration is ready, load the model using LLMModelFactory:

let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.hermes3Llama_3_2_3B_4bit
) { progress in
    debugPrint("Downloading Hermes 3 model: \(Int(progress.fractionCompleted * 100))%")
}

This downloads the weights, creates the model, and prepares it for inference. Once the model is loaded, you can use it to generate text. Here is an example:

let prompt = """
What are the implications of the Fermi Paradox?
"""

let result = try await modelContainer.perform { [prompt] context in
    // Turn the raw prompt into tokenized model input
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        // The callback receives the tokens generated so far; decode them to text
        let text = context.tokenizer.decode(tokens: tokens)

        // Publish the partial result on the main actor so the UI can update
        Task { @MainActor in
            self.output = text
        }
        // Keep generating until the model finishes on its own
        return .more
    }
}
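
One thing the snippet glosses over: self.output is assumed to belong to some observable object that the UI renders, which isn't shown. Below is a minimal sketch of such a wrapper, with two choices of my own layered on top: an explicit GenerateParameters value instead of the defaults, and an arbitrary 240-token budget after which the callback returns .stop rather than .more to end generation early. The class name, the temperature, and the budget are placeholders, not anything MLX Swift prescribes.

import MLXLMCommon
import Observation

@Observable
final class HermesChat {
    var output = ""

    // modelContainer is the same container returned by LLMModelFactory above.
    func respond(to prompt: String, using modelContainer: ModelContainer) async throws {
        _ = try await modelContainer.perform { context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))

            // Explicit sampling parameters; tune the temperature to taste.
            let parameters = GenerateParameters(temperature: 0.6)

            return try MLXLMCommon.generate(input: input, parameters: parameters, context: context) { tokens in
                let text = context.tokenizer.decode(tokens: tokens)
                Task { @MainActor in
                    self.output = text
                }
                // Stop after an arbitrary 240-token budget, otherwise keep going.
                return tokens.count >= 240 ? .stop : .more
            }
        }
    }
}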

Running a 3B-parameter model, even with 4-bit quantization, can push memory usage close to the iOS limit. Use Apple’s Increased Memory Limit entitlement to let your app exceed the default memory cap; my previous post, Exploring MLX Swift: Configuring Different Models, covers the detailed steps.

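Beyond the entitlement, MLX itself exposes one knob worth knowing about: the GPU buffer cache. Capping it before loading the model keeps MLX from holding on to freed buffers. The 20 MB figure below mirrors what the mlx-swift-examples sample apps use, as far as I remember; treat it as a starting point rather than a rule.

import MLX

// Limit the GPU buffer cache (in bytes) before loading the model.
// A smaller cache trades a bit of speed for a lower peak memory footprint.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)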

Unsupported Architectures

If the model architecture is not supported by MLX Swift, you have to dive into its source code and add the support yourself, which requires a deep understanding of the architecture in the first place.

I wasted a whole night trying to get Nvidia's Hymba 1.5B running, but I am not sure MLX Swift supports the Mamba architecture it builds on, nor do I have the knowledge (yet) to add it myself.

Even after hours of working with Cursor and watching the sunrise, the best I got was a model that ran but produced gibberish, and I had no idea how to fix it.

Moving Forward

That’s it! You have now converted and configured a model, ready for on-device inference with MLX Swift. In future posts, we will explore optimizing inference with generation parameters.

If you have any questions or stories about working with MLX Swift, let me know on Twitter @rudrankriyam or Bluesky @rudrankriyam.bsky.social.

Happy MLXing!
