
Exploring MLX Swift: Porting Qwen 3 VL 4B from Python to Swift

I had been looking forward to Qwen 3 VL 4B since the vision variant dropped, as I think it is the best vision model out there right now to run on the iPhone. The text-only Qwen 3 models were already running well enough on my iPhone, but the MLX Swift repository only had support up to Qwen 2 and Qwen 2.5 VL. Shiny object syndrome kicked in the moment I saw the release announcement from the Qwen team, and I decided to port the vision variant to Swift.

Thanks to Prince Canuma's implementation of the model in MLX Python in his mlx-vlm repository, half of the work was already done.

Porting the Model

I pulled the latest MLX Swift sources and peeked at the MLXLM package, which already contained most of the architecture definitions Swift would need. Earlier, I would have waited for somebody to open a PR adding support for the new model, but this time I was willing to try my own hand at it.

The problem: it was night-time, and I was (and still am) addicted to this new game, Ghost of Yotei. I had zero interest in writing the port myself, but I was curious to see how AI-driven coding would fare.

Over the night and into the morning, I used FactoryAI's Droid to port the model to Swift while I micromanaged each patch to keep the port aligned with the Python original. My name was on the line, and I could not afford to submit AI slop to the repository.

Note: This post is not sponsored by FactoryAI, nor am I affiliated with them.

I initially used GPT-5 Codex for the implementation, but it did not work out as expected. Then, I used Sonnet 4.5 to fix the issues one by one, as OpenAI's model was too stubborn to follow the instructions.

Finally, I used Haiku 4.5 to fix a few performance regressions where the port was using native Swift code instead of MLX-optimized operations for loops and matrix work.

Porting Guide

The major work was building confidence that the Swift port behaves exactly like the Python original. I leaned on the official porting guide for implementing new models in MLX Swift, which is an excellent read and provides useful context for the AI to follow.

Before writing any line of code, I had GPT-5 Codex go through the current implementation of Qwen 3 VL in the mlx-vlm repository and give me a summary of how the model works and how it is implemented in Python.

Then, I asked it to go through the Qwen 2 and Qwen 2.5 VL implementations in the MLX Swift examples repository to get a sense of how the existing MLX Swift packages structure that code.

Then, I asked it to map out a plan for porting the Python code to Swift, and I was surprised to see that it did a pretty good job of it.

This single plan file turned out to be the most important artifact of the porting process, and it let me play the game peacefully while the model was being ported by a different model, hah.

Here is the plan.md file that I used to guide the AI:

# Port Qwen3-VL into MLX Swift Examples
 
## IMPORTANT
Reference the python implementation for the language stack and vision tower in Developer/Samples/mlx-vlm/ for Qwen3-VL.
 
## Goals
1. Implement Qwen3-VL language stack (attention, mRope, deepstack, KV caches).
2. Implement Qwen3-VL vision tower and connectors.
3. Implement user input processor for images/videos + prompts.
4. Register the model/processor in `VLMModelFactory` and make it selectable.
5. Validate VLMEval app end-to-end with Qwen3-VL.
 
## Detailed Tasks
 
### 1. Port Language Stack (`Libraries/MLXVLM/Models/Qwen3VLLanguage.swift`)
- [ ] Rotary embedding (3-axis mRoPE) from python `language.py`
- [ ] Attention with RMSNorm, cache update, rotary application
- [ ] MLP (gate/down/up) and decoder layers
- [ ] Deepstack visual embeddings and `LanguageModel` wrapper
 
### 2. Port Vision Tower (`Libraries/MLXVLM/Models/Qwen3VLVision.swift`)
- [ ] Patch embedding (3D conv), positional embeddings, transformer blocks
- [ ] Patch merger and deepstack outputs
- [ ] Verify outputs feed correctly into language module
 
### 3. Combine Model (`Libraries/MLXVLM/Models/Qwen3VL.swift`)
- [ ] Wire vision + language components together
- [ ] Implement `prepare`, `callAsFunction`, LoRA hooks, sanitizer
- [ ] Ensure visual token insertion and merged embeddings work
 
### 4. Processor (`Libraries/MLXVLM/Models/Qwen3VLProcessor.swift`)
- [ ] Image/video preprocessing using `MediaProcessing`
- [ ] Chat template and padding replacement
- [ ] Position id calculations
- [ ] Mirror python processor behavior
 
### 5. Factory & Registry Updates
- [ ] Register model/processor in `VLMModelFactory`
- [ ] Add `VLMRegistry.qwen3VL4BInstruct4Bit` with default prompt/EOS tokens
- [ ] Make app functional with Qwen3-VL as selectable model
 
### 6. Validation & Testing
- [ ] `swift build --target MLXVLM` passes
- [ ] `swift build --target VLMEval` passes
- [ ] VLMEval with text-only generates output
- [ ] VLMEval with images works end-to-end

After a lot of back and forth over the night, I was able to get garbage output from the model, but it was a good start. There were a lot of issues with the port, as GPT-5 Codex took some shortcuts while porting, and I should have been more careful with the instructions. And it is not as if, with this particular model, Sonnet or Gemini Pro would have performed any differently.

I learned it was much better to go file by file instead of asking the AI to do the entire port in one go and then fixing the issues one by one. I would have saved myself hours of debugging if I had done it that way, because tracking down each issue that surfaced later was like finding a needle in a haystack.

I also did not have the patience to go through the code line by line and fix the issues, as I had a Yotei Six to kill. If you have played the game, you know what I am talking about.

Switching the AI Model

I decided to switch to Sonnet 4.5 to fix the issues one by one, and get some "you're absolutely right!" validation along the way, because I was disheartened by the garbage output I was getting from that first Qwen 3 VL port.

Here is the list of issues that ended up being resolved over the course of the morning:

## Known Challenges
- Weight key remapping needed: `"model.language_model"` → `"language_model.model"` (sketched below)
- lm_head mapping requires stripping `"model.lm_head"` prefix
- Visual mask dimension must preserve `[batch, seq]` shape for deepstack
- Text-only embeddings handling with proper `nil` passing
- Causal mask creation logic needs careful implementation
 
## Implementation Notes
- MRoPE uses 3D position IDs for temporal/height/width dimensions
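
The first two items boil down to key remapping in the model's weight sanitizer. Here is a simplified sketch of that remapping; the names and cases are illustrative, and the real sanitizer in the port handles more than this:

    import MLX

    // Simplified sketch of the weight key remapping described above. The real
    // sanitizer in the port covers more cases; this shows only the two renames.
    func sanitize(weights: [String: MLXArray]) -> [String: MLXArray] {
        var sanitized = [String: MLXArray]()
        for (key, value) in weights {
            if key.hasPrefix("model.language_model") {
                // "model.language_model..." -> "language_model.model..."
                let suffix = key.dropFirst("model.language_model".count)
                sanitized["language_model.model" + String(suffix)] = value
            } else if key.hasPrefix("model.lm_head") {
                // Strip the "model." prefix so the key becomes "lm_head..."
                sanitized[String(key.dropFirst("model.".count))] = value
            } else {
                sanitized[key] = value
            }
        }
        return sanitized
    }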

Also, one amazing thing I saw Sonnet 4.5 and GPT-5 Codex do was pull up the JSON configuration files for the model from the Hugging Face repository to check the expected values, and then log the values from the running model to cross-check for any mismatches.

The issue in the end turned out to be a new method specific to the Qwen 3 VL model, one not present in the Qwen 2 or Qwen 2.5 VL models in the MLX Swift examples repository: the 3D position ID calculation. After fixing it, the model generated its first haiku about Swift programming and my face lit up with joy and excitement, only to be followed by a sense of disbelief when it crashed the moment I tried an image.
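
For context on what that method computes: mRoPE tracks three position axes, one each for temporal, height, and width. A grossly simplified sketch of the text-only case, where all three axes share the same sequential positions, might look like the following; image and video tokens get separate grid coordinates per axis, and that logic is omitted here:

    import MLX

    // Grossly simplified illustration of 3-axis mRoPE position IDs for the
    // text-only case: temporal, height, and width all share the same
    // sequential positions. Image and video tokens get per-axis grid
    // coordinates, which this sketch deliberately leaves out.
    func textOnlyPositionIds(sequenceLength: Int) -> MLXArray {
        let positions = MLXArray((0 ..< sequenceLength).map { Int32($0) })  // [seq]
        // Stack to [3, seq], then add a batch axis to get [3, 1, seq]
        return stacked([positions, positions, positions], axis: 0)
            .expandedDimensions(axis: 1)
    }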

The image crash turned out to be an error in the visual embeddings dimension calculation, another thing that had been glossed over during the porting.

The vision model outputs deepstackOutputs at certain layer depths, and we need to inject them into the language model at exactly the right positions. The visualMask tells us where those positions are:

     private func applyDeepstack(
         hiddenStates: MLXArray,      // [batch, seq, hidden]
         visualMask: MLXArray,         // Should be [batch, seq] 
         visualEmbeds: MLXArray        // [num_visual_tokens, hidden]
     ) -> MLXArray {
         let indices = maskIndices(visualMask)
         guard !indices.isEmpty else { return hiddenStates }
 
         let indexArray = MLXArray(indices.map { UInt32($0) })
         var result = hiddenStates
         result[0..., indexArray, 0...] = result[0..., indexArray, 0...] + visualEmbeds
         //                  ↑ This crashes if indices are garbage
         return result
     }

The problem was in how I squeezed the mask dimensions. The naive approach was:

// WRONG - squeezes ALL dimensions of size 1
let visualMask = specialMask.squeezed().asType(.bool)
// If specialMask is [1, seq, 1], this becomes [seq] 
// We lose the batch dimension!

The fix was simple:

// CORRECT - squeeze only the last axis
// CRITICAL: Python does image_mask[..., 0] which keeps batch dim
// specialMask is [batch, seq, hidden], squeeze only last axis to get [batch, seq]
let visualMask = specialMask.squeezed(axis: -1).asType(.bool)

That is it. One axis parameter. Without it, maskIndices() treats the mask as 1D instead of [batch, seq], pulls garbage indices, and the advanced indexing write crashes.
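
maskIndices() itself is just a small helper in the port. I have not reproduced the actual implementation here, but a minimal sketch of what such a helper could look like, assuming it flattens the mask and collects the positions of the visual tokens, is:

    import MLX

    // Minimal sketch of a maskIndices-style helper (illustrative, not the
    // actual code from the port): flatten the [batch, seq] mask and collect
    // the flat positions of the visual tokens.
    private func maskIndices(_ mask: MLXArray) -> [Int] {
        // Convert to integers so the values can be read back on the CPU
        let flags = mask.flattened().asType(.int32).asArray(Int32.self)
        return flags.enumerated().compactMap { $0.element != 0 ? $0.offset : nil }
    }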

After the fixes, I got the model to run for video input as well, but I was not happy with the performance. It was not as fast as the Python original, and I was not sure why.

Performance Regression

I opened a PR to the MLX Swift examples repository for the port, but kept it in draft until I was sure the model was working as expected. Again, as I mentioned earlier, I could not afford to submit AI slop to the repository, especially when I had not written a single line of code myself and was busy drinking sake in the game.

I decided to use Haiku 4.5 to profile the port and see what was going on. It turned out that there were a few places where the port was using native Swift loops instead of MLX-optimized operations for the matrix work!

That would have been a disaster if I had not caught it in time, and I confirmed the fix against the existing Qwen 2 and Qwen 2.5 VL ports, which stick to the better-optimized MLX code paths.
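
To give a flavor of the kind of pattern that was flagged, here is an illustrative contrast, not the actual code from the port: an element-wise loop over Swift arrays pulls the work onto the CPU one value at a time, while the equivalent MLX expression stays a single lazy operation.

    import MLX

    // Illustrative contrast only, not the actual code from the port: the same
    // element-wise add written as a native Swift loop versus a single MLX op.
    func addEmbedsSlow(_ hidden: [Float], _ embeds: [Float]) -> [Float] {
        var result = [Float]()
        result.reserveCapacity(hidden.count)
        for i in 0 ..< hidden.count {
            // element-by-element work on the CPU
            result.append(hidden[i] + embeds[i])
        }
        return result
    }

    func addEmbedsFast(_ hidden: MLXArray, _ embeds: MLXArray) -> MLXArray {
        // one lazy MLX operation, evaluated with the rest of the graph
        hidden + embeds
    }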

After getting clear validation from this series of Claude models, I took the PR out of draft and got it merged into the MLX Swift examples repository! Then I went back to playing the game to help Clan Matsumae.

Moving forward

Porting a model implemented in Python to MLX Swift was a good learning experience for me, and I know I can do it again with other models in the future, albeit much faster and with less of the manual effort of micromanaging the AI.

My only regret is that I should have gone through the code line by line and fixed the issues myself, instead of relying on the AI to do it for me. I would have saved hours of debugging that way, while learning the inner workings of the model much better.

I neither got into the flow of the porting process nor truly enjoyed the game, so in the future I will prefer to work on one thing at a time.

This article itself is a reward so that I can go back to the game to finish off Saito and his sons!

If you end up following this guide (or hit a weird error that I should explore), reach out on Twitter @rudrankriyam or Bluesky @rudrankriyam.bsky.social to discuss the porting process, or the game!

Happy MLXing!
