In my previous post, we explored integrating large language models with MLX Swift.
Let's open our eyes to the visual side of things and explore how you can integrate on-device vision models into your app using MLX Swift. The PR for this feature was recently approved, and I had been tracking it for a month!
What Are Vision Models?
Vision Language Models (VLMs), a subset of multimodal machine learning models, can interpret both text and images. They handle tasks such as:
- Describing images: Generating captions for pictures.
- Answering visual questions: Responding to queries about an image.
- Detecting objects: Identifying entities and their locations in images.
- Performing OCR (Optical Character Recognition): Recognizing and extracting text from images.
TL;DR: Getting Started
Here is a quick roadmap:
- Add the MLX Vision Examples Package: Include the prebuilt utilities for vision models with the MLXVLM and MLXLMCommon packages. I usually avoid dependencies, but these definitely make my life easier.
- Select a Pre-Trained Vision Model: Use a model from MLX’s registry. We will go with Google's PaliGemma model.
- Load the Model: Download and set up the model weights.
- Prepare Input: Preprocess images for inference.
- Run Inference: Generate results and display them.
Note: The code below does not follow best practices for iOS development. The idea is to get started with MLX Swift in about 20 lines of code and see the output of on-device visual inference.
Adding the MLX Vision Examples Package
Start by adding the MLX Vision package to your project:
- In Xcode, open your project and go to Project Settings -> Package Dependencies
- Press + and paste this URL: https://github.com/ml-explore/mlx-swift-examples/ then select Add Package
- Set the Dependency Rule to Branch and choose the main branch
- Add MLXVLM to your desired target
You can also add MLXVLM directly if you already have MLXLLM in your project, since both live in the same package, as sketched below.
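If you manage dependencies through a Package.swift manifest rather than Xcode's UI, the declaration might look something like this minimal sketch. The product names MLXVLM and MLXLMCommon are assumed to match the library names in the examples repository, and "YourApp" is a placeholder; verify both against the repository's own Package.swift:
// swift-tools-version:5.9
// Minimal manifest sketch (assumed product names; adjust platforms to your deployment targets).
import PackageDescription

let package = Package(
    name: "YourApp",
    platforms: [.iOS(.v16), .macOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples", branch: "main")
    ],
    targets: [
        .target(
            name: "YourApp",
            dependencies: [
                .product(name: "MLXVLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples"),
            ]
        )
    ]
)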
Next, import the required modules:
import MLXVLM
import MLXLMCommon
import PhotosUI
Choosing a Model
There are two pre-defined models in the package at the time of writing this post, and I will go with the first one I saw, PaliGemma-3B-Mix.
static public let paligemma3bMix448_8bit = ModelConfiguration(
id: "mlx-community/paligemma-3b-mix-448-8bit",
defaultPrompt: "Describe the image in English"
)
This setup points to a pre-trained model hosted on Hugging Face, making it easy to download and use directly in your app.
Loading the Model
Once the model is defined, load it with the following code:
let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.paligemma3bMix448_8bit
) { progress in
    debugPrint("Downloading \(ModelRegistry.paligemma3bMix448_8bit.name): \(Int(progress.fractionCompleted * 100))%")
}
The VLMModelFactory handles:
- Downloading model weights,
- Preparing the model for on-device inference, and
- Providing progress updates for UI feedback (see the sketch below).
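If you want to surface that progress in SwiftUI rather than just log it, one option is to hold it in a piece of state and update it from the progress closure. A minimal sketch, where downloadProgress is a name I'm introducing for illustration and not part of the package:
@State private var downloadProgress: Double = 0

// In an async context, e.g. inside a .task modifier:
let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.paligemma3bMix448_8bit
) { progress in
    // Hop to the main actor before touching view state
    Task { @MainActor in
        downloadProgress = progress.fractionCompleted
    }
}

// And somewhere in the view body:
ProgressView("Downloading model", value: downloadProgress)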
Preparing Input
Images need preprocessing before inference. The PhotosPicker component enables users to select an image, which is resized to match the model’s expected dimensions of 448x448 pixels:
PhotosPicker(selection: $imageSelection, matching: .images, photoLibrary: .shared()) {
    Text("Select Image")
}
.onChange(of: imageSelection) { _, newValue in
    Task {
        if let newValue, let data = try? await newValue.loadTransferable(type: Data.self), let uiImage = UIImage(data: data) {
            image = uiImage
            if let ciImage = CIImage(image: uiImage) {
                try await processImage(ciImage)
            }
        }
    }
}
Running Inference
To generate predictions, use the following code:
var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])
input.processing.resize = .init(width: 448, height: 448)

let result = try await container.perform { [input] context in
    let input = try await context.processor.prepare(input: input)
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        // The callback receives all tokens generated so far, so decode them and replace the text
        Task { @MainActor in
            self.result = context.tokenizer.decode(tokens: tokens)
        }
        // Stop after 800 tokens
        return tokens.count >= 800 ? .stop : .more
    }
}
We resize the image, use the ModelContainer to preprocess and tokenize the input, and finally generate the text output by decoding tokens.
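If you want more control over decoding, you can pass something other than the default .init() for parameters. A small sketch, assuming GenerateParameters in MLXLMCommon exposes a temperature option; check the package source for the exact initializer and the full list of knobs:
// Assumption: GenerateParameters accepts a temperature argument.
// A lower temperature tends to produce more deterministic captions.
let parameters = GenerateParameters(temperature: 0.4)

// ...then pass it in place of .init():
// MLXLMCommon.generate(input: input, parameters: parameters, context: context) { ... }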
Complete Code Example
Here is how you can implement this in SwiftUI:
import SwiftUI
import MLXVLM
import MLXLMCommon
import PhotosUI

struct ContentView: View {
    @State private var image: UIImage?
    @State private var result: String = ""
    @State private var isLoading = false
    @State private var imageSelection: PhotosPickerItem?
    @State private var modelContainer: ModelContainer?

    var body: some View {
        VStack {
            if let image {
                Image(uiImage: image)
                    .resizable()
                    .scaledToFit()
                    .cornerRadius(12)
                    .padding()
                    .frame(height: 300)
            }

            PhotosPicker(selection: $imageSelection, matching: .images) {
                Text("Select Image")
            }
            .onChange(of: imageSelection) { _, newValue in
                Task {
                    if let newValue {
                        if let data = try? await newValue.loadTransferable(type: Data.self), let uiImage = UIImage(data: data) {
                            image = uiImage
                            isLoading = true
                            if let ciImage = CIImage(image: uiImage) {
                                try await processImage(ciImage)
                            }
                            isLoading = false
                        }
                    }
                }
            }

            if isLoading {
                ProgressView()
            } else {
                Text(result)
                    .padding()
            }
        }
        .task {
            // Load the model once when the view appears
            do {
                modelContainer = try await VLMModelFactory.shared.loadContainer(configuration: ModelRegistry.paligemma3bMix448_8bit)
            } catch {
                debugPrint(error)
            }
        }
    }
}

extension ContentView {
    private func processImage(_ ciImage: CIImage) async throws {
        guard let container = modelContainer else { return }

        var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])
        // Resize to the 448x448 input the model expects
        input.processing.resize = .init(width: 448, height: 448)

        let result = try await container.perform { [input] context in
            let input = try await context.processor.prepare(input: input)
            return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
                Task { @MainActor in
                    self.result = context.tokenizer.decode(tokens: tokens)
                }
                return tokens.count >= 800 ? .stop : .more
            }
        }
    }
}
And here is how it recognizes the scene of me enjoying myself by the pool!
Moving Forward
With PaliGemma-3B-Mix, you have seen how to integrate vision models into your iOS app using MLX Swift. To take it further:
- Experiment with Qwen-VL-2B-Instruct for conversational tasks, since PaliGemma is a single-turn vision language model and not meant for conversational use (see the sketch after this list).
- Tweak preprocessing and decoding parameters for optimal results.
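As a sketch of that first suggestion, swapping models mostly means pointing the factory at a different configuration. The registry property name below is an assumption on my part; check ModelRegistry in MLXVLM for the exact spelling of the Qwen2-VL entry:
// Hypothetical: load the Qwen2-VL configuration instead of PaliGemma.
// Verify the registry property name in MLXVLM before using it.
let qwenContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: ModelRegistry.qwen2VL2BInstruct4Bit
) { progress in
    debugPrint("Downloading Qwen2-VL: \(Int(progress.fractionCompleted * 100))%")
}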
If you have questions or want to share your experiments, reach out on Twitter @rudrankriyam!
Happy MLXing!