Exploring MLX Swift: Adding On-Device Vision Models to Your App
In my previous post, we explored integrating large language models with MLX Swift.
Exploring MLX Swift: Adding On-Device Inference to your App
Now let's add vision capabilities. MLX Swift recently added support for Vision Language Models through PR #187, bringing image and video understanding to on-device inference. This means your app can now describe photos, answer questions about videos, or extract text from images—all running locally on Apple Silicon.
What Are Vision Models?
Vision Language Models (VLMs), a subset of multimodal machine learning models, can interpret both text and images (or videos). They handle tasks such as:
- Describing images: Generating captions for pictures.
- Answering visual questions: Responding to queries about an image or video.
- Detecting objects: Identifying entities and their locations in images.
- Performing OCR (Optical Character Recognition): Recognizing and extracting text from images.
- Understanding videos: Analyzing video content and describing events or actions.
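To make these tasks concrete, here are the kinds of prompts you might send once the pipeline below is in place. The wording is mine, not anything required by the models:

```swift
// Example prompts for the task types above. Each would be paired with an
// image or video via the UserInput API introduced later in this post.
let taskPrompts: [String: String] = [
    "Captioning": "Describe the image in English.",
    "Visual question answering": "How many people are in this photo?",
    "Object detection": "List the objects you can see and where they are.",
    "OCR": "Extract all the text visible in this image.",
    "Video understanding": "What's happening in this video?",
]
```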
TL;DR: Getting Started
Here is a quick roadmap:
- Add the MLX Vision Examples Package: Include the prebuilt utilities for vision models via the `MLXVLM` and `MLXLMCommon` packages.
- Select a Pre-Trained Vision Model: Use a model from MLX's registry. We will go with the Qwen3-VL-4B-Instruct model.
- Load the Model: Download and set up the model weights.
- Prepare Input: Preprocess images or videos for inference.
- Run Inference: Generate results and display them.
Note: The code here does not follow best practices for iOS development. The idea is to get started with MLX Swift in as little code as possible and see the output of visual on-device inference.
Adding the MLX Vision Examples Package
Start by adding the MLX Vision package to your project:
- In Xcode, open your project and go to Project Settings -> Package Dependencies.
- Press `+`, paste `https://github.com/ml-explore/mlx-swift-examples/`, and select Add Package.
- Set the `Dependency Rule` to `Branch` and choose the `main` branch.
- Add `MLXVLM` to your desired target. (If your code lives in a Swift Package instead, see the manifest sketch below.)
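If your code lives in a Swift Package rather than an Xcode project, you can declare the same dependency in the manifest. Here is a minimal sketch; the package name, target name, and platform versions are placeholders to adapt, and only the repository URL and product names come from the steps above:

```swift
// swift-tools-version: 5.9
// Minimal manifest sketch. "MyVisionApp" and the platform versions are
// placeholders; adjust them to match your project and the package's requirements.
import PackageDescription

let package = Package(
    name: "MyVisionApp",
    platforms: [.iOS(.v16), .macOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples/", branch: "main")
    ],
    targets: [
        .target(
            name: "MyVisionApp",
            dependencies: [
                .product(name: "MLXVLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples"),
            ]
        )
    ]
)
```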
Next, import the required modules:
import MLXVLM
import MLXLMCommon
import PhotosUI
import AVKit

Choosing a Model
The VLMRegistry provides convenient access to pre-configured vision language models. For this tutorial, we'll use Qwen3-VL-4B-Instruct.
static public let qwen3VL4BInstruct4Bit = ModelConfiguration(
id: "lmstudio-community/Qwen3-VL-4B-Instruct-MLX-4bit",
defaultPrompt: "Describe the image in English"
)

This setup points to a pre-trained model hosted on Hugging Face, making it easy to download and use directly in your app. Qwen3-VL handles both images and videos.
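The registry is just a collection of `ModelConfiguration` values, so you can also point at another MLX-converted checkpoint on Hugging Face. A minimal sketch, where the repo id is only an example and may not fit every device's memory:

```swift
import MLXLMCommon

// Hypothetical custom configuration. The repo id and prompt are example
// values; swap in any MLX-converted VLM checkpoint you want to try.
let customVLM = ModelConfiguration(
    id: "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    defaultPrompt: "Describe the image in English"
)
```

A custom configuration loads through the same `VLMModelFactory.shared.loadContainer(configuration:)` call shown in the next section.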
Loading the Model
Once the model is defined, load it with the following code:
let modelContainer = try await VLMModelFactory.shared.loadContainer(
configuration: VLMRegistry.qwen3VL4BInstruct4Bit
) { progress in
debugPrint("Downloading \(VLMRegistry.qwen3VL4BInstruct4Bit.name): \(Int(progress.fractionCompleted * 100))%")
}

The `VLMModelFactory` handles:
- Downloading model weights
- Preparing the model for on-device inference
- Providing progress updates for UI feedback (a small sketch of wiring this into SwiftUI follows)
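As a rough sketch of the UI-feedback part, you could forward `progress.fractionCompleted` into observable state and bind a `ProgressView` to it. The `ModelLoader` type and its property names are my own, not part of MLX:

```swift
import SwiftUI
import MLXLMCommon
import MLXVLM

// Sketch: publish download progress so SwiftUI can render it.
@MainActor
final class ModelLoader: ObservableObject {
    @Published var downloadProgress = 0.0
    @Published var container: ModelContainer?

    func load() async throws {
        container = try await VLMModelFactory.shared.loadContainer(
            configuration: VLMRegistry.qwen3VL4BInstruct4Bit
        ) { progress in
            // The progress handler may run off the main thread, so hop back.
            Task { @MainActor in
                self.downloadProgress = progress.fractionCompleted
            }
        }
    }
}
```

In the view, `ProgressView(value: loader.downloadProgress)` then gives users something to watch while the weights download.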
Preparing Input
Images and videos need preprocessing before inference. The PhotosPicker component lets users select an image or video, and the model's processor resizes the media to the dimensions the model expects.
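If you hit memory pressure with very large photos, `UserInput` also exposes a processing hint that asks the processor to downscale before inference; this mirrors what the VLMEval example does, and the exact target size below is an assumption rather than a Qwen3-VL requirement:

```swift
import CoreImage
import MLXLMCommon

// Sketch: build an image input with an optional downscale hint.
// 448x448 is an example target, not a value required by the model.
func makeImageInput(from ciImage: CIImage, prompt: String) -> UserInput {
    var input = UserInput(prompt: prompt, images: [.ciImage(ciImage)])
    input.processing.resize = CGSize(width: 448, height: 448)
    return input
}
```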
For videos, you'll need a custom Transferable type to handle video file selection:
#if os(iOS) || os(visionOS)
struct TransferableVideo: Transferable {
let url: URL
static var transferRepresentation: some TransferRepresentation {
FileRepresentation(contentType: .movie) { movie in
SentTransferredFile(movie.url)
} importing: { received in
let sandboxURL = try SandboxFileTransfer.transferFileToTemp(from: received.file)
return .init(url: sandboxURL)
}
}
}
#endif
struct SandboxFileTransfer {
static func transferFileToTemp(from sourceURL: URL) throws -> URL {
let tempDir = FileManager.default.temporaryDirectory
let sandboxURL = tempDir.appendingPathComponent(sourceURL.lastPathComponent)
if FileManager.default.fileExists(atPath: sandboxURL.path()) {
try FileManager.default.removeItem(at: sandboxURL)
}
try FileManager.default.copyItem(at: sourceURL, to: sandboxURL)
return sandboxURL
}
}

Then use PhotosPicker to select media:
PhotosPicker(
selection: $mediaSelection,
matching: PHPickerFilter.any(of: [PHPickerFilter.images, PHPickerFilter.videos]),
photoLibrary: .shared()
) {
Text("Select Image or Video")
}
.onChange(of: mediaSelection) { _, newValue in
Task {
if let newValue {
// Try to load as video first
if let video = try? await newValue.loadTransferable(type: TransferableVideo.self) {
videoURL = video.url
image = nil
} else if let data = try? await newValue.loadTransferable(type: Data.self),
let uiImage = UIImage(data: data) {
image = uiImage
videoURL = nil
if let ciImage = CIImage(image: uiImage) {
try await processMedia(image: ciImage, videoURL: nil)
}
}
}
}
}

Running Inference
To generate predictions for images:
await MainActor.run { result = "" }
var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])
let result = try await container.perform { [input] context in
let input = try await context.processor.prepare(input: input)
return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
Task { @MainActor in
self.result += context.tokenizer.decode(tokens: tokens)
}
return tokens.count >= 800 ? .stop : .more
}
}

For videos, pass the video URL instead:
await MainActor.run { result = "" }
var input = UserInput(prompt: "What's happening in this video?", videos: [.url(videoURL)])
let result = try await container.perform { [input] context in
let input = try await context.processor.prepare(input: input)
return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
Task { @MainActor in
self.result += context.tokenizer.decode(tokens: tokens)
}
return tokens.count >= 800 ? .stop : .more
}
}

You can also combine both images and videos in a single input if needed. The model will automatically handle preprocessing and resizing for both media types.
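A minimal sketch of such a combined request, assuming `image` and `clipURL` come from the picker code above:

```swift
import CoreImage
import Foundation
import MLXLMCommon

// Sketch: one prompt over both a photo and a clip in a single UserInput.
func makeCombinedInput(image: CIImage, clipURL: URL) -> UserInput {
    UserInput(
        prompt: "How does the photo relate to what happens in the video?",
        images: [.ciImage(image)],
        videos: [.url(clipURL)]
    )
}
```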
Complete Code Example
Here is how you can implement this in SwiftUI with support for both images and videos:
import SwiftUI
import MLXVLM
import MLXLMCommon
import PhotosUI
import AVKit
struct ContentView: View {
@State private var image: UIImage?
@State private var videoURL: URL?
@State private var result: String = ""
@State private var isLoading = false
@State private var mediaSelection: PhotosPickerItem?
@State private var modelContainer: ModelContainer?
var body: some View {
VStack {
if let videoURL {
VideoPlayer(player: AVPlayer(url: videoURL))
.frame(height: 300)
.cornerRadius(12)
} else if let image {
Image(uiImage: image)
.resizable()
.scaledToFit()
.cornerRadius(12)
.padding()
.frame(height: 300)
}
PhotosPicker(
selection: $mediaSelection,
matching: PHPickerFilter.any(of: [PHPickerFilter.images, PHPickerFilter.videos])
) {
Text("Select Image or Video")
}
.onChange(of: mediaSelection) { _, newValue in
Task {
if let newValue {
if let video = try? await newValue.loadTransferable(type: TransferableVideo.self) {
videoURL = video.url
image = nil
isLoading = true
try await processMedia(image: nil, videoURL: video.url)
isLoading = false
} else if let data = try? await newValue.loadTransferable(type: Data.self),
let uiImage = UIImage(data: data) {
image = uiImage
videoURL = nil
isLoading = true
if let ciImage = CIImage(image: uiImage) {
try await processMedia(image: ciImage, videoURL: nil)
}
isLoading = false
}
}
}
}
if isLoading {
ProgressView()
} else {
Text(result)
.padding()
}
}
.task {
do {
modelContainer = try await VLMModelFactory.shared.loadContainer(configuration: VLMRegistry.qwen3VL4BInstruct4Bit)
} catch {
debugPrint(error)
}
}
}
}
extension ContentView {
private func processMedia(image: CIImage?, videoURL: URL?) async throws {
guard let container = modelContainer else { return }
await MainActor.run { result = "" }
let images: [UserInput.Image] = if let image { [.ciImage(image)] } else { [] }
let videos: [UserInput.Video] = if let videoURL { [.url(videoURL)] } else { [] }
let prompt = if videoURL != nil {
"What's happening in this video?"
} else {
"Describe the image in English"
}
var input = UserInput(prompt: prompt, images: images, videos: videos)
let result = try await container.perform { [input] context in
let input = try await context.processor.prepare(input: input)
return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
Task { @MainActor in
self.result += context.tokenizer.decode(tokens: tokens)
}
return tokens.count >= 800 ? .stop : .more
}
}
}
}
#if os(iOS) || os(visionOS)
struct TransferableVideo: Transferable {
let url: URL
static var transferRepresentation: some TransferRepresentation {
FileRepresentation(contentType: .movie) { movie in
SentTransferredFile(movie.url)
} importing: { received in
let sandboxURL = try SandboxFileTransfer.transferFileToTemp(from: received.file)
return .init(url: sandboxURL)
}
}
}
#endif
struct SandboxFileTransfer {
static func transferFileToTemp(from sourceURL: URL) throws -> URL {
let tempDir = FileManager.default.temporaryDirectory
let sandboxURL = tempDir.appendingPathComponent(sourceURL.lastPathComponent)
if FileManager.default.fileExists(atPath: sandboxURL.path()) {
try FileManager.default.removeItem(at: sandboxURL)
}
try FileManager.default.copyItem(at: sourceURL, to: sandboxURL)
return sandboxURL
}
}

Moving Forward
With Qwen3-VL-4B-Instruct, you have seen how to integrate vision models into your iOS app using MLX Swift. To take it further:
- Adjust preprocessing and decoding parameters to see what works best.
- Try other models from the `VLMRegistry`, like `SmolVLM2` for faster inference or `Gemma3` variants if you want to experiment.
- Check out the VLMEval example app for a complete implementation with memory management and video support (a minimal memory-limit sketch follows this list).
- Look at the MLXVLM documentation for advanced features and model-specific configurations.
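On the memory-management point, one knob the sample apps use is MLX's GPU cache limit, which you can cap before loading a model so inference doesn't hold on to more memory than it needs. A minimal sketch; the byte count is an illustrative value to tune for your device and model size:

```swift
import MLX

// Sketch: cap the GPU buffer cache before loading the model.
// 20 MB is only an example; adjust it for your device and model.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)
```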
If you have questions or want to share your experiments, reach out on Twitter @rudrankriyam!
Happy MLXing!