Exploring MLX: Quantizing a Model with MLX for iOS Developers

When DeepSeek R1 dropped, I went to its repository and found more distilled models. I am working on my local chat app, Polarixy, so I wanted to try the Qwen 1.5B model. At that time, no available model was available in MLX format with quantization. So, I decided to do it myself.

DeepSeek's new R1 Distill Qwen 1.5B outperforms GPT-4o and Claude-3.5-Sonnet with 28.9% on AIME and 83.9% on MATH!!

Available on MLX, thanks to yours truly 🤗https://t.co/91jNC0bFVX
— Rudrank Riyam (@rudrankriyam) January 20, 2025

Yesterday night, I found another model I wanted to convert. This is a step-by-step guide on using a Hugging Face model with MLX and ensuring it is optimized for performance on Apple devices, especially iPhones.

Do not worry if Python is not your strong suit; I will keep it simple.

The model we will be converting today is mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0. Let’s dive in!

mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Prerequisites: Setting Up Your Environment

Before we get started, make sure you have Python installed. If you are not sure, open your terminal and type:

python --version

If Python is not installed, download it from python.org and follow the installation instructions.

Next, create a virtual environment to keep things tidy. A virtual environment ensures that the install dependencies do not affect other projects.

# Create a virtual environment
python -m venv mlx-env

# Activate the environment
source mlx-env/bin/activate

Once activated, your terminal prompt should indicate that you are in the mlx-env environment. Now, we are ready to install the required tools.

Installing MLX-LM

The mlx-lm package helps to use models in the MLX format. You can install it using eitherpip or conda. For this guide, I will use pip.

pip install mlx-lm

If you are using Conda, the command is:

conda install -c conda-forge mlx-lm

That is it for the setup!

Logging to Hugging Face

Before uploading the model to the Hugging Face Hub, ensure you are logged in with your Hugging Face account. If you are not logged in, use the following command:

pip install -U "huggingface_hub[cli]" && huggingface-cli login

This will prompt you to enter your Hugging Face token, found on the Hugging Face settings page.

Converting the Model to MLX Format

The mlx-lm package provides a simple command-line tool for model conversion. Here’s how you can do it:

mlx_lm.convert \
    --hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
    -q \
    --upload-repo rudrankriyam/deepseek-r1-redistill-qwen-1.5b

Breaking it down:

Model ID: Replace --hf-path with the ID of the model you want to convert (in our case, mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0).
Quantization: Adding the—q flag ensures the model is quantized to the default 4-bit format, which reduces its size for better performance on edge devices like iPhones.
Upload Repository: Use —-upload-repo to specify the target repository on Hugging Face to which the converted model will be uploaded. You can upload it directly to the mix-community for goodwill, but I prefer to keep it under my account because I am not good with naming conventions.

Once the command completes, the converted model will be saved in the mlx_model directory by default.

Here is what the terminal output will look like when downloading the model:

[INFO] Loading
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 181/181 [00:00<00:00, 1.32MB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.75k/6.75k [00:00<00:00, 30.1MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 768/768 [00:00<00:00, 4.83MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 485/485 [00:00<00:00, 1.97MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:03<00:00, 3.51MB/s]
model.safetensors:   3%|███▍                                                                                                                               | 94.4M/3.56G [00:03<01:57, 29.4MB/s]
model.safetensors:  34%|████████████████████████████████████████████▍                                                                                      | 1.21G/3.56G [00:43<01:35, 24.6MB/s]

After the quantization process is done, you will see the following output:

Fetching 6 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 66752.85it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.

When uploading, your terminal should look similar to this:

Repo created: https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b
Found 7 candidate files to upload
Recovering from metadata files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 6835.89it/s]



---------- 2025-01-25 13:48:00 (0:00:00) ----------
Files:   hashed 3/7 (8.2K/1.0G) | pre-uploaded: 0/0 (0.0/1.0G) (+7 unsure) | committed: 0/7 (0.0/1.0G) | ignored: 0
Workers: hashing: 4 | get upload mode: 2 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:03<00:00, 3.29MB/s]
model.safetensors:   1%|█▉                                                                                                                                 | 14.6M/1.00G [00:01<01:48, 9.07MB/s]
model.safetensors:  44%|██████████████████████████████████████████████████████████▋                                                                         | 444M/1.00G [00:29<00:25, 21.5MB/s]

After the upload process is complete, this should be the final response:

All files have been processed! Exiting worker.█████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 999M/1.00G [00:57<00:00, 17.6MB/s]
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.

---------- 2025-01-25 13:49:10 (0:01:10) ----------
Files:   hashed 7/7 (1.0G/1.0G) | pre-uploaded: 2/2 (1.0G/1.0G) | committed: 7/7 (1.0G/1.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
INFO:huggingface_hub._upload_large_folder:
---------- 2025-01-25 13:49:10 (0:01:10) ----------
Files:   hashed 7/7 (1.0G/1.0G) | pre-uploaded: 2/2 (1.0G/1.0G) | committed: 7/7 (1.0G/1.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
Upload successful, go to https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b for details.

And we have our model quantized and available to use with MLX!

rudrankriyam/deepseek-r1-redistill-qwen-1.5b · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Moving Forward

With this guide, you have learned how to quantize a Hugging Face model to the MLX format and upload it to the Hub.

Next, I recommend exploring how to fine-tune these models for specific use cases or integrate them directly into your iOS app. You can experiment with quantization levels to find the perfect setup for your needs.

If you encounter any issues or have some fun experiences to share, contact me at Twitter @rudrankriyam. I would love to hear about your projects and experiments!

x.com

X (formerly Twitter)

Happy MLXing!