
Exploring MLX Swift: Adding On-Device Vision Models to Your App

In my previous post, Exploring MLX Swift: Adding On-Device Inference to your App, we explored integrating large language models with MLX Swift.

Now let's add vision capabilities. MLX Swift recently added support for Vision Language Models through PR #187, bringing image and video understanding to on-device inference. This means your app can now describe photos, answer questions about videos, or extract text from images—all running locally on Apple Silicon.


What Are Vision Models?

Vision Language Models (VLMs) are a subset of multimodal machine learning models that can interpret text alongside images or videos. They handle tasks such as:

  • Describing images: Generating captions for pictures.
  • Answering visual questions: Responding to queries about an image or video.
  • Detecting objects: Identifying entities and their locations in images.
  • Performing OCR (Optical Character Recognition): Recognizing and extracting text from images.
  • Understanding videos: Analyzing video content and describing events or actions.

TL;DR: Getting Started

Here is a quick roadmap:

  • Add the MLX Vision Examples Package: Include the prebuilt utilities for vision models via the MLXVLM and MLXLMCommon packages.
  • Select a Pre-Trained Vision Model: Use a model from MLX's registry. We will go with the Qwen3-VL-4B-Instruct model.
  • Load the Model: Download and set up the model weights.
  • Prepare Input: Preprocess images or videos for inference.
  • Run Inference: Generate results and display them.

Note: The code below does not follow best practices for iOS development. The idea is to get started with MLX Swift in about 20 lines of code and see the output of on-device visual inference.

Adding the MLX Vision Examples Package

Start by adding the MLX Vision package to your project:

  • In Xcode, open your project and go to the Project Settings -> Package Dependencies
  • Press + and paste this URL: https://github.com/ml-explore/mlx-swift-examples/, then select Add Package
  • Set the Dependency Rule to Branch and choose the main branch
  • Add the MLXVLM product to your desired target
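
If your project uses a Package.swift manifest instead of Xcode's UI, the equivalent declaration looks roughly like this. Treat it as a sketch: the target name and platform versions are assumptions, while the product names match what the examples repository exposes.

// swift-tools-version: 5.9
// Package.swift sketch — swap "YourApp" for your own target and adjust platforms as needed
import PackageDescription

let package = Package(
    name: "YourApp",
    platforms: [.iOS(.v17), .macOS(.v14)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift-examples/", branch: "main")
    ],
    targets: [
        .target(
            name: "YourApp",
            dependencies: [
                .product(name: "MLXVLM", package: "mlx-swift-examples"),
                .product(name: "MLXLMCommon", package: "mlx-swift-examples"),
            ]
        )
    ]
)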

Next, import the required modules:

import MLXVLM
import MLXLMCommon
import PhotosUI
import AVKit

Choosing a Model

The VLMRegistry provides convenient access to pre-configured vision language models. For this tutorial, we'll use Qwen3-VL-4B-Instruct, which is defined in the registry like this:

static public let qwen3VL4BInstruct4Bit = ModelConfiguration(
    id: "lmstudio-community/Qwen3-VL-4B-Instruct-MLX-4bit",
    defaultPrompt: "Describe the image in English"
)

This setup points to a pre-trained model hosted on Hugging Face, making it easy to download and use directly in your app. Qwen3-VL also handles both images and videos.
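
If the model you want is not in the VLMRegistry, you can also build a ModelConfiguration yourself and pass it to the loader in the next step. A minimal sketch — the repository id below is hypothetical, so substitute an MLX-converted vision model that actually exists on Hugging Face:

// Hypothetical repo id — replace with a real MLX-converted VLM from Hugging Face
let customConfiguration = ModelConfiguration(
    id: "your-org/Your-VLM-4bit",
    defaultPrompt: "Describe the image in English"
)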

Loading the Model

Once the model is defined, load it with the following code:

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: VLMRegistry.qwen3VL4BInstruct4Bit
) { progress in
    debugPrint("Downloading \(VLMRegistry.qwen3VL4BInstruct4Bit.name): \(Int(progress.fractionCompleted * 100))%")
}

The VLMModelFactory handles:

  • Downloading model weights,
  • Preparing the model for on-device inference, and
  • Providing progress updates for UI feedback.
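
Vision models are memory-hungry, so it can help to cap MLX's GPU buffer cache before loading; the official VLMEval example does something similar. The exact limit below is just a starting point, not a recommendation:

import MLX

// Keep the Metal buffer cache small so the model weights have headroom (tune the value for your device)
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)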

Preparing Input

Images and videos need preprocessing before inference. A PhotosPicker lets users select an image or video, and the model's processor automatically resizes the selected media to the dimensions the model expects.

For videos, you'll need a custom Transferable type to handle video file selection:

#if os(iOS) || os(visionOS)
struct TransferableVideo: Transferable {
    let url: URL
 
    static var transferRepresentation: some TransferRepresentation {
        FileRepresentation(contentType: .movie) { movie in
            SentTransferredFile(movie.url)
        } importing: { received in
            let sandboxURL = try SandboxFileTransfer.transferFileToTemp(from: received.file)
            return .init(url: sandboxURL)
        }
    }
}
#endif
 
struct SandboxFileTransfer {
    static func transferFileToTemp(from sourceURL: URL) throws -> URL {
        let tempDir = FileManager.default.temporaryDirectory
        let sandboxURL = tempDir.appendingPathComponent(sourceURL.lastPathComponent)
 
        if FileManager.default.fileExists(atPath: sandboxURL.path()) {
            try FileManager.default.removeItem(at: sandboxURL)
        }
 
        try FileManager.default.copyItem(at: sourceURL, to: sandboxURL)
        return sandboxURL
    }
}

Then use PhotosPicker to select media:

PhotosPicker(
    selection: $mediaSelection,
    matching: PHPickerFilter.any(of: [PHPickerFilter.images, PHPickerFilter.videos]),
    photoLibrary: .shared()
) {
    Text("Select Image or Video")
}
.onChange(of: mediaSelection) { _, newValue in
    Task {
        if let newValue {
            // Try to load as video first, then fall back to an image
            if let video = try? await newValue.loadTransferable(type: TransferableVideo.self) {
                videoURL = video.url
                image = nil
                try await processMedia(image: nil, videoURL: video.url)
            } else if let data = try? await newValue.loadTransferable(type: Data.self),
                      let uiImage = UIImage(data: data) {
                image = uiImage
                videoURL = nil
                if let ciImage = CIImage(image: uiImage) {
                    try await processMedia(image: ciImage, videoURL: nil)
                }
            }
        }
    }
}

Running Inference

To generate predictions for images:

await MainActor.run { result = "" }
 
var input = UserInput(prompt: "Describe the image in English", images: [.ciImage(ciImage)])
 
let result = try await container.perform { [input] context in
    let input = try await context.processor.prepare(input: input)
 
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        Task { @MainActor in
            self.result += context.tokenizer.decode(tokens: tokens)
        }
        return tokens.count >= 800 ? .stop : .more
    }
}
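
The input is declared as var so you can attach processing hints before the container.perform call. For example, you can ask the processor to downscale large photos first — a sketch, where the 448×448 target is an illustrative value borrowed from the sample apps rather than anything Qwen3-VL requires:

// Optional: hint a smaller working size for large photos (448×448 is illustrative)
input.processing.resize = CGSize(width: 448, height: 448)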

For videos, pass the video URL instead:

await MainActor.run { result = "" }
 
var input = UserInput(prompt: "What's happening in this video?", videos: [.url(videoURL)])
 
let result = try await container.perform { [input] context in
    let input = try await context.processor.prepare(input: input)
 
    return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
        Task { @MainActor in
            self.result += context.tokenizer.decode(tokens: tokens)
        }
        return tokens.count >= 800 ? .stop : .more
    }
}
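
The parameters: .init() argument in both calls uses the default sampling settings. If you want to experiment with decoding, GenerateParameters from MLXLMCommon accepts the usual knobs; the values below are only illustrative:

// Illustrative sampling values — tune them for your model and use case
let parameters = GenerateParameters(temperature: 0.7, topP: 0.9)

// Then pass them in place of .init():
// MLXLMCommon.generate(input: input, parameters: parameters, context: context) { tokens in ... }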

You can also combine both images and videos in a single input if needed. The model will automatically handle preprocessing and resizing for both media types.
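
For instance, a combined request could look like this — a sketch, since how well mixed image-and-video prompts work depends on the model:

// One request carrying both an image and a video (support varies by model)
let combinedInput = UserInput(
    prompt: "Compare the photo with what happens in the video.",
    images: [.ciImage(ciImage)],
    videos: [.url(videoURL)]
)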

Complete Code Example

Here is how you can implement this in SwiftUI with support for both images and videos:

import SwiftUI
import MLXVLM
import MLXLMCommon
import PhotosUI
import AVKit
 
struct ContentView: View {
    @State private var image: UIImage?
    @State private var videoURL: URL?
    @State private var result: String = ""
    @State private var isLoading = false
    @State private var mediaSelection: PhotosPickerItem?
    @State private var modelContainer: ModelContainer?
 
    var body: some View {
        VStack {
            if let videoURL {
                VideoPlayer(player: AVPlayer(url: videoURL))
                    .frame(height: 300)
                    .cornerRadius(12)
            } else if let image {
                Image(uiImage: image)
                    .resizable()
                    .scaledToFit()
                    .cornerRadius(12)
                    .padding()
                    .frame(height: 300)
            }
 
            PhotosPicker(
                selection: $mediaSelection,
                matching: PHPickerFilter.any(of: [PHPickerFilter.images, PHPickerFilter.videos])
            ) {
                Text("Select Image or Video")
            }
            .onChange(of: mediaSelection) { _, newValue in
                Task {
                    if let newValue {
                        if let video = try? await newValue.loadTransferable(type: TransferableVideo.self) {
                            videoURL = video.url
                            image = nil
                            isLoading = true
                            try await processMedia(image: nil, videoURL: video.url)
                            isLoading = false
                        } else if let data = try? await newValue.loadTransferable(type: Data.self),
                                  let uiImage = UIImage(data: data) {
                            image = uiImage
                            videoURL = nil
                            isLoading = true
                            if let ciImage = CIImage(image: uiImage) {
                                try await processMedia(image: ciImage, videoURL: nil)
                            }
                            isLoading = false
                        }
                    }
                }
            }
 
            if isLoading {
                ProgressView()
            } else {
                Text(result)
                    .padding()
            }
        }
        .task {
            do {
                modelContainer = try await VLMModelFactory.shared.loadContainer(configuration: VLMRegistry.qwen3VL4BInstruct4Bit)
            } catch {
                debugPrint(error)
            }
        }
    }
}
 
extension ContentView {
    private func processMedia(image: CIImage?, videoURL: URL?) async throws {
        guard let container = modelContainer else { return }
 
        await MainActor.run { result = "" }
 
        let images: [UserInput.Image] = if let image { [.ciImage(image)] } else { [] }
        let videos: [UserInput.Video] = if let videoURL { [.url(videoURL)] } else { [] }
 
        let prompt = if videoURL != nil {
            "What's happening in this video?"
        } else {
            "Describe the image in English"
        }
 
        var input = UserInput(prompt: prompt, images: images, videos: videos)
 
        let result = try await container.perform { [input] context in
            let input = try await context.processor.prepare(input: input)
 
            return try MLXLMCommon.generate(input: input, parameters: .init(), context: context) { tokens in
                Task { @MainActor in
                    self.result += context.tokenizer.decode(tokens: tokens)
                }
 
                return tokens.count >= 800 ? .stop : .more
            }
        }
    }
}
 
#if os(iOS) || os(visionOS)
struct TransferableVideo: Transferable {
    let url: URL
 
    static var transferRepresentation: some TransferRepresentation {
        FileRepresentation(contentType: .movie) { movie in
            SentTransferredFile(movie.url)
        } importing: { received in
            let sandboxURL = try SandboxFileTransfer.transferFileToTemp(from: received.file)
            return .init(url: sandboxURL)
        }
    }
}
#endif
 
struct SandboxFileTransfer {
    static func transferFileToTemp(from sourceURL: URL) throws -> URL {
        let tempDir = FileManager.default.temporaryDirectory
        let sandboxURL = tempDir.appendingPathComponent(sourceURL.lastPathComponent)
 
        if FileManager.default.fileExists(atPath: sandboxURL.path()) {
            try FileManager.default.removeItem(at: sandboxURL)
        }
 
        try FileManager.default.copyItem(at: sourceURL, to: sandboxURL)
        return sandboxURL
    }
}

Moving Forward

With Qwen3-VL-4B-Instruct, you have seen how to integrate vision models into your iOS app using MLX Swift. To take it further:

  • Adjust preprocessing and decoding parameters to see what works best.
  • Try other models from the VLMRegistry like SmolVLM2 for faster inference or Gemma3 variants if you want to experiment.
  • Check out the VLMEval example app for a complete implementation with memory management and video support.
  • Look at the MLXVLM documentation for advanced features and model-specific configurations.

If you have questions or want to share your experiments, reach out on Twitter @rudrankriyam!

Happy MLXing!
