In my previous post, we explored integrating large language models with MLX Swift.

Exploring MLX Swift: Adding On-Device Inference to your App
Learn how to integrate MLX Swift into your iOS and macOS apps for on-device AI inference. This guide shows how to add local language models using Apple’s MLX framework, enabling offline AI capabilities on Apple silicon.

Let's open our eyes to the visual side of things and explore how you can integrate on-device vision models into your app using MLX Swift. The PR for this feature was recently approved, and I had been tracking it for a month!


What Are Vision Models?

Vision Language Models (VLMs), a subset of multimodal machine learning models, can interpret both text and images. They handle tasks such as:

  • Describing images: Generating captions for pictures.
  • Answering visual questions: Responding to queries about an image.
  • Detecting objects: Identifying entities and their locations in images.
  • Performing OCR (Optical Character Recognition): Recognizing and extracting text from images.

TL;DR: Getting Started

Here is a quick roadmap:

  • Add the MLX Vision Examples Package: Include the prebuilt utilities for vision models via the MLXVLM and MLXLMCommon packages. I usually avoid dependencies, but these definitely make my life easier.
  • Select a Pre-Trained Vision Model: Use a model from MLX’s registry. We will go with Google's PaliGemma model.
  • Load the Model: Download and set up the model weights.
  • Prepare Input: Preprocess images for inference.
  • Run Inference: Generate results and display them.

Note: The code here does not follow best practices for iOS development. The idea is to get started with MLX Swift in 20 lines of code and see the output of visual on-device inference.

Adding the MLX Vision Examples Package

Start by adding the MLX Vision package to your project:

  • In Xcode, open your project and go to the Project Settings -> Package Dependencies
  • Press + and paste this URL: https://github.com/ml-explore/mlx-swift-examples/, then select Add Package
  • Set the Dependency Rule to Branch and choose the main branch
  • Add MLXVLM to your desired target

You can also add the MLXVLM product directly to your target if the package is already in your project for MLXLLM.
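
If you manage dependencies through a Package.swift manifest instead of Xcode's UI, the equivalent setup might look like the sketch below. The repository URL and product names are the ones used in this post; the package name, target, and platform versions are placeholders, so adapt them to your project and double-check the package's minimum requirements.

// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "VisionDemo", // placeholder package name
    platforms: [.iOS(.v16), .macOS(.v14)], // assumed minimums; verify against mlx-swift-examples
    dependencies: [
        // Same repository added through Xcode above, tracking the main branch
        .package(url: "https://github.com/ml-explore/mlx-swift-examples/", branch: "main")
    ],
    targets: [
        .executableTarget(
            name: "VisionDemo", // placeholder target name
            dependencies: [
                .product(name: "MLXVLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples")
            ]
        )
    ]
)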

Next, import the required modules:

import MLXVLM
import MLXLMCommon
import PhotosUI

Choosing a Model

There are two pre-defined models in the package as of writing this post, and I will go with the first one that I saw, PaliGemma-3B-Mix.

static public let paligemma3bMix448_8bit = ModelConfiguration(
    id: "mlx-community/paligemma-3b-mix-448-8bit",
    defaultPrompt: "Describe the image in English"
)

This setup points to a pre-trained model hosted on Hugging Face, making it easy to download and use directly in your app.
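
The registry is not a hard limit, either. You can point a configuration of your own at a different vision model from the mlx-community organization on Hugging Face using the same initializer. Here is a minimal sketch; the repository id is only an example, so verify that it exists and is supported by MLXVLM before relying on it:

// Example custom entry; swap in any supported mlx-community VLM repository
let customVLM = ModelConfiguration(
    id: "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    defaultPrompt: "Describe the image in English"
)

You can then pass this configuration to loadContainer in place of the registry entry used below.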

Loading the Model

Once the model is defined, load it with the following code:

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.paligemma3bMix448_8bit
) { progress in
    debugPrint("Downloading \(ModelRegistry.paligemma3bMix448_8bit.name): \(Int(progress.fractionCompleted * 100))%")
}

The VLMModelFactory handles:

  • Downloading model weights,
  • Preparing the model for on-device inference, and
  • Providing progress updates for UI feedback.
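
Because the closure reports a Foundation Progress value, you can also surface the download state in SwiftUI with a little glue. Here is one possible sketch, assuming a downloadProgress state property that you add to your view:

@State private var downloadProgress: Double = 0

// Inside .task or a view model
let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.paligemma3bMix448_8bit
) { progress in
    Task { @MainActor in
        // Drive a ProgressView(value:) while the weights download
        downloadProgress = progress.fractionCompleted
    }
}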

Preparing Input

Images need preprocessing before inference. The PhotosPicker component lets users select an image, which we later resize to the model's expected input size of 448x448 pixels:

PhotosPicker(selection: $imageSelection, matching: .images, photoLibrary: .shared()) {
    Text("Select Image")
}
.onChange(of: imageSelection) { _, newValue in
    Task {
        if let newValue, let data = try? await newValue.loadTransferable(type: Data.self), let uiImage = UIImage(data: data) {
            image = uiImage
            if let ciImage = CIImage(image: uiImage) {
                try await processImage(ciImage)
            }
        }
    }
}
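
The snippet above is iOS-flavored. On macOS, where you work with NSImage rather than UIImage, a small helper like the following sketch is one way to produce the CIImage that the model input expects (the function name is my own):

#if os(macOS)
import AppKit
import CoreImage

// Converts an NSImage into a CIImage via a bitmap representation
func makeCIImage(from nsImage: NSImage) -> CIImage? {
    guard let tiffData = nsImage.tiffRepresentation,
          let bitmap = NSBitmapImageRep(data: tiffData),
          let cgImage = bitmap.cgImage else { return nil }
    return CIImage(cgImage: cgImage)
}
#endif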

Running Inference

To generate predictions, use the following code:

var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])

input.processing.resize = .init(width: 448, height: 448)

let result = try await modelContainer.perform { [input] context in
    let input = try await context.processor.prepare(input: input)
    
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        Task { @MainActor in
            // The callback receives the cumulative token list, so replace rather than append
            self.result = context.tokenizer.decode(tokens: tokens)
        }
        return tokens.count >= 800 ? .stop : .more
    }
}

We resize the image, use the ModelContainer to preprocess and tokenize the input, and finally generate the text output by decoding tokens.
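
The .init() passed as parameters uses the library defaults. If you want to experiment with decoding, GenerateParameters is the place to do it. The exact fields may differ between versions of MLXLMCommon, so treat the ones below as assumptions and check the source you are building against:

// Assumed fields; verify the exact GenerateParameters API in your MLXLMCommon version
let parameters = GenerateParameters(
    temperature: 0.7, // higher values give more varied captions
    topP: 0.9         // nucleus sampling cutoff
)

// Then pass it in place of .init() in the generate call above:
// MLXLMCommon.generate(input: input, parameters: parameters, context: context) { ... }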

Complete Code Example

Here is how you can implement this in SwiftUI:

import SwiftUI
import MLXVLM
import MLXLMCommon
import PhotosUI

struct ContentView: View {
    @State private var image: UIImage?
    @State private var result: String = ""
    @State private var isLoading = false
    @State private var imageSelection: PhotosPickerItem?
    @State private var modelContainer: ModelContainer?

    var body: some View {
        VStack {
            if let image {
                Image(uiImage: image)
                    .resizable()
                    .scaledToFit()
                    .cornerRadius(12)
                    .padding()
                    .frame(height: 300)
            }

            PhotosPicker(selection: $imageSelection, matching: .images) {
                Text("Select Image")
            }
            .onChange(of: imageSelection) { _, newValue in
                Task {
                    if let newValue {
                        if let data = try? await newValue.loadTransferable(type: Data.self), let uiImage = UIImage(data: data) {
                            image = uiImage
                            isLoading = true
                            if let ciImage = CIImage(image: uiImage) {
                                try await processImage(ciImage)
                            }
                            isLoading = false
                        }
                    }
                }
            }

            if isLoading {
                ProgressView()
            } else {
                Text(result)
                    .padding()
            }
        }
        .task {
            do {
                modelContainer = try await VLMModelFactory.shared.loadContainer(configuration: ModelRegistry.paligemma3bMix448_8bit)
            } catch {
                debugPrint(error)
            }
        }
    }
}
extension ContentView {
    private func processImage(_ ciImage: CIImage) async throws {
        guard let container = modelContainer else { return }

        var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])
        input.processing.resize = .init(width: 448, height: 448)

        _ = try await container.perform { [input] context in
            let input = try await context.processor.prepare(input: input)

            return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
                Task { @MainActor in
                    self.result = context.tokenizer.decode(tokens: tokens)
                }

                return tokens.count >= 800 ? .stop : .more
            }
        }
    }
}

And here is how it recognizes the scene of me enjoying myself by the pool!

Moving Forward

With PaliGemma-3B-Mix, you have seen how to integrate vision models into your iOS app using MLX Swift. To take it further:

  • Experiment with Qwen2-VL-2B-Instruct for conversational tasks, since PaliGemma is a single-turn vision language model and is not meant for multi-turn use.
  • Tweak preprocessing and decoding parameters for optimal results.

If you have questions or want to share your experiments, reach out on Twitter @rudrankriyam!

Happy MLXing!
