Exploring MLX Swift: Adding On-Device Inference to your App
I love the output of programming more than the process itself, and that applies to MLX Swift too. So, in this post, you will learn how to easily add on-device inference to your existing app.
What is MLX, Though?
MLX is a machine learning framework by Apple, designed specifically for Apple silicon. It lets you run language models locally on your devices. This means no round trips to a server and better privacy for your users. It also means you can use them on flights or trains where the internet is slow or unavailable.
MLX Swift expands MLX to the Swift language, so iOS developers like us do not have to spend time indenting Python code.
We will learn how to use the `MLXLLM` package, a library designed to provide a simple way to use pre-trained Large Language Models (LLMs) for on-device inference. By the end of this post, you will be able to generate responses locally in your app.
TL;DR: Getting Started
Here are the few steps involved in adding a model to your application:

- Add the MLX Swift Examples package: Add the `MLXLLM` package to your project. It provides all the model logic, utilities, and helper classes to download models and perform inference. I usually avoid dependencies, but this one definitely makes my life easier.
- Choose a model: Pick a pre-trained model to run on device. For this post, I will choose a lightweight one so it is easier to get started with.
- Load the model: Use an existing configured model from `MLXLLM` to load the weights and configuration.
- Create input: Create an input (prompt) for the model.
- Run inference: Feed this prompt into the model and generate the output.
Note: The code below does not follow iOS development best practices. The idea is to get started with MLX Swift in about 20 lines of code and see the output of on-device inference visually.
Adding the MLX Swift Examples Package
Add the MLX Swift Examples repo as a package dependency to your project:

- In Xcode, open your project and go to Project Settings -> Package Dependencies
- Press `+` and paste this URL: https://github.com/ml-explore/mlx-swift-examples/, then select Add Package
- Set the `Dependency Rule` to `Branch` and the branch to `main`
- Add `MLXLLM` to your desired target
Next, import the packages in the file where you want to use them:

```swift
import MLXLLM
import MLXLMCommon
```
Choosing a Model
The `MLXLLM` package provides constants for popular models, such as:

```swift
// A small model that works well on many devices
let modelConfiguration = ModelRegistry.llama3_2_1B_4bit
```

These models are hosted on the Hugging Face Hub, and `MLXLLM` knows where to download the weights and configuration for each of them. For the initial example, let us use `llama3_2_1B_4bit` so the model runs on most devices.
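If you later want a heavier model on capable hardware, you could choose at runtime based on available memory. A minimal sketch of that decision; the `llama3_2_3B_4bit` identifier and the helper function are assumptions for illustration (only the 1B constant appears in this post, so check `ModelRegistry` for the actual names):

```swift
import Foundation

// Hypothetical helper: choose a model identifier based on available RAM.
// The 3B identifier is an assumption; verify the constant names in ModelRegistry.
func suggestedModel(physicalMemory: UInt64) -> String {
    let eightGB: UInt64 = 8 * 1024 * 1024 * 1024
    return physicalMemory >= eightGB ? "llama3_2_3B_4bit" : "llama3_2_1B_4bit"
}

// Pick based on the current device:
let model = suggestedModel(physicalMemory: ProcessInfo.processInfo.physicalMemory)
```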
Loading the Model
Download the model weights and set up the model with the following:
```swift
let modelConfiguration = ModelRegistry.llama3_2_1B_4bit
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: modelConfiguration
) { progress in
    debugPrint("Downloading \(modelConfiguration.name): \(Int(progress.fractionCompleted * 100))%")
}
```
- `LLMModelFactory.shared.loadContainer` takes care of downloading the model files, creating the model with the correct architecture, and loading the weights from a `.safetensors` file. Note that `loadContainer` is an asynchronous method, so use `await`.
- The method also takes a closure that reports `progress`, which you can use to update the UI as needed during the download.
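Since the same progress string is useful both for `debugPrint` and for a SwiftUI label, you could pull it into a small helper. This is a sketch; the function name is mine, and `fractionCompleted` comes from Foundation's `Progress` type, which the download closure receives:

```swift
import Foundation

// Formats the download progress exactly as in the closure above,
// so the same text can also drive a SwiftUI Text view.
func progressLabel(modelName: String, fractionCompleted: Double) -> String {
    "Downloading \(modelName): \(Int(fractionCompleted * 100))%"
}

// For example, at 42% completion:
print(progressLabel(modelName: "llama3_2_1B_4bit", fractionCompleted: 0.42))
// Prints: Downloading llama3_2_1B_4bit: 42%
```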
Prepare Input
Convert the prompt into input the model understands, which is done using the `prepare` method on the processor in `ModelContext`:
```swift
let prompt = "What is really the meaning of life?"

let result = try await modelContainer.perform { [prompt] context in
    let input = try await context.processor.prepare(input: .init(prompt: prompt))
}
```
- `perform` on `ModelContainer` runs the block on an `actor`, giving you thread-safe access to the model and the tokenizer inside it.
- The code assigns the result to `input`, which is of type `LMInput`.
- The `prompt` is turned into a set of tokens by `context.processor.prepare(input:)`.
Run Inference
Run the inference and use the text that is produced:
```swift
return try MLXLMCommon.generate(
    input: input, parameters: .init(), context: context
) { tokens in
    let text = context.tokenizer.decode(tokens: tokens)
    Task { @MainActor in
        self.output = text
    }
    return .more
}
```
- We use the `context` that the `perform` block provides.
- The `generate` method runs the inference loop that produces text from the tokens.
- The closure is called with the tokens as they are produced, which allows the display of intermediate values.
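Returning `.more` from the callback tells the loop to keep going. A minimal sketch of a token-budget stop condition, using a stand-in enum for the library's actual return type (assumption: returning `.stop` from the callback halts generation, as in `MLXLMCommon`):

```swift
// Stand-in for the generate callback's return type:
// .more continues the loop, .stop halts it.
enum StreamDecision {
    case more, stop
}

// Stop once a maximum number of tokens has been produced.
func decision(tokenCount: Int, maxTokens: Int) -> StreamDecision {
    tokenCount >= maxTokens ? .stop : .more
}
```

Inside the real callback you would check `tokens.count` against your budget and return the library's equivalent values.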
Complete Code Example
Here is the code shown above combined into a single view with a `generate` function:
```swift
import SwiftUI
import MLXLLM
import MLXLMCommon

struct ContentView: View {
    @State private var output: String = ""

    var body: some View {
        VStack {
            Image(systemName: "globe")
                .imageScale(.large)
                .foregroundStyle(.tint)
            Text("Hello, world!")
            Text(output)
        }
        .padding()
        .task {
            do {
                try await generate()
            } catch {
                debugPrint(error)
            }
        }
    }
}

extension ContentView {
    private func generate() async throws {
        let modelConfiguration = ModelRegistry.llama3_2_1B_4bit
        let modelContainer = try await LLMModelFactory.shared.loadContainer(
            configuration: modelConfiguration
        ) { progress in
            debugPrint("Downloading \(modelConfiguration.name): \(Int(progress.fractionCompleted * 100))%")
        }

        let prompt = "What is really the meaning of life?"

        let _ = try await modelContainer.perform { [prompt] context in
            let input = try await context.processor.prepare(input: .init(prompt: prompt))
            return try MLXLMCommon.generate(
                input: input, parameters: .init(), context: context
            ) { tokens in
                let text = context.tokenizer.decode(tokens: tokens)
                Task { @MainActor in
                    self.output = text
                }
                return .more
            }
        }
    }
}
```
You can use this as boilerplate code to get started with MLX Swift in your app!
Moving Forward
This is a minimal example of using `MLXLLM`. You can now experiment with:

- Different model configurations: using other pre-defined models from `ModelRegistry`
- More sophisticated decoding/generation parameters using `GenerateParameters`

You are ready to start building applications with on-device language models. In my next post, I will show you how to configure a model that is not defined in `ModelRegistry` and start generating text - and you will be surprised at how few lines of code it takes!
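As a taste of what generation parameters typically control: the temperature divides the logits before softmax, so lower values sharpen the token distribution and higher values flatten it. A plain-Swift sketch of that standard math (not MLX code; whether `GenerateParameters` exposes exactly these knobs is something to check in the library):

```swift
import Foundation

// Temperature-scaled softmax over raw logits.
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxVal = scaled.max() ?? 0
    let exps = scaled.map { exp($0 - maxVal) }  // subtract max for numerical stability
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let logits = [2.0, 1.0, 0.0]
let sharp = softmax(logits, temperature: 0.5)  // peaked toward the top token
let flat = softmax(logits, temperature: 2.0)   // closer to uniform
```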
If you are working with MLX Swift, I would love to hear about your experiences. Drop a comment below or reach out on Twitter @rudrankriyam or Bluesky @rudrankriyam.bsky.social!
Happy MLXing!