When DeepSeek R1 dropped, I went to its repository and found more distilled models. I am working on my local chat app, Polarixy, so I wanted to try the Qwen 1.5B model. At that time, no quantized MLX version of it was available. So, I decided to do it myself.
Last night, I found another model I wanted to convert. This is a step-by-step guide on converting a Hugging Face model to MLX and ensuring it is optimized for performance on Apple devices, especially iPhones.
Do not worry if Python is not your strong suit; I will keep it simple.
The model we will be converting today is mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0. Let's dive in!
Prerequisites: Setting Up Your Environment
Before we get started, make sure you have Python installed. If you are not sure, open your terminal and type:
python --version
If Python is not installed, download it from python.org and follow the installation instructions.
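Note that on macOS, the interpreter is often installed as python3 rather than python, so if the first command fails, try:
python3 --version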
Next, create a virtual environment to keep things tidy. A virtual environment ensures that the installed dependencies do not affect other projects.
# Create a virtual environment
python -m venv mlx-env
# Activate the environment
source mlx-env/bin/activate
Once activated, your terminal prompt should indicate that you are in the mlx-env environment. Now, we are ready to install the required tools.
Installing MLX-LM
The mlx-lm package lets you convert and run models in the MLX format. You can install it using either pip or conda. For this guide, I will use pip.
pip install mlx-lm
If you are using Conda, the command is:
conda install -c conda-forge mlx-lm
That is it for the setup!
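To confirm the install worked, you can try importing the package with a quick one-liner:
python -c "import mlx_lm; print('mlx-lm is ready')"
If this prints without an ImportError, you are good to go.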
Logging In to Hugging Face
Before uploading the model to the Hugging Face Hub, ensure you are logged in with your Hugging Face account. If you are not logged in, use the following command:
pip install -U "huggingface_hub[cli]" && huggingface-cli login
This will prompt you to enter your Hugging Face token, which you can create on the Access Tokens page in your Hugging Face settings.
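To verify that the login worked, you can ask the CLI who you are:
huggingface-cli whoami
It should print your Hugging Face username.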
Converting the Model to MLX Format
The mlx-lm package provides a simple command-line tool for model conversion. Here's how you can do it:
mlx_lm.convert \
--hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
-q \
--upload-repo rudrankriyam/deepseek-r1-redistill-qwen-1.5b
Breaking it down:
- Model ID: Replace --hf-path with the ID of the model you want to convert (in our case, mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0).
- Quantization: The -q flag quantizes the model to the default 4-bit format, which reduces its size for better performance on edge devices like iPhones.
- Upload Repository: Use --upload-repo to specify the target repository on Hugging Face to which the converted model will be uploaded. You can upload it directly to the mlx-community organization for goodwill, but I prefer to keep it under my account because I am not good with naming conventions.
Once the command completes, the converted model will be saved in the mlx_model directory by default.
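If you prefer to drive the conversion from Python instead of the command line, mlx-lm exposes the same functionality as a function. Here is a minimal sketch; the keyword names mirror the CLI flags, but double-check them against your installed version:
# Minimal Python sketch of the same conversion; keywords mirror the CLI flags
from mlx_lm import convert

convert(
    "mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0",  # model to convert
    quantize=True,  # same as the -q flag (default 4-bit)
    upload_repo="rudrankriyam/deepseek-r1-redistill-qwen-1.5b",  # omit to keep it local
)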
Here is what the terminal output will look like when downloading the model:
[INFO] Loading
generation_config.json: 100%|██████████| 181/181 [00:00<00:00, 1.32MB/s]
tokenizer_config.json: 100%|██████████| 6.75k/6.75k [00:00<00:00, 30.1MB/s]
config.json: 100%|██████████| 768/768 [00:00<00:00, 4.83MB/s]
special_tokens_map.json: 100%|██████████| 485/485 [00:00<00:00, 1.97MB/s]
tokenizer.json: 100%|██████████| 11.4M/11.4M [00:03<00:00, 3.51MB/s]
model.safetensors:   3%|▎         | 94.4M/3.56G [00:03<01:57, 29.4MB/s]
model.safetensors:  34%|███▍      | 1.21G/3.56G [00:43<01:35, 24.6MB/s]
After the quantization process is done, you will see the following output. The average is slightly above 4 bits per weight because the quantizer also stores scale and bias values for each group of weights:
Fetching 6 files: 100%|██████████| 6/6 [00:00<00:00, 66752.85it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.
When uploading, your terminal should look similar to this:
Repo created: https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b
Found 7 candidate files to upload
Recovering from metadata files: 100%|██████████| 7/7 [00:00<00:00, 6835.89it/s]
---------- 2025-01-25 13:48:00 (0:00:00) ----------
Files: hashed 3/7 (8.2K/1.0G) | pre-uploaded: 0/0 (0.0/1.0G) (+7 unsure) | committed: 0/7 (0.0/1.0G) | ignored: 0
Workers: hashing: 4 | get upload mode: 2 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
tokenizer.json: 100%|██████████| 11.4M/11.4M [00:03<00:00, 3.29MB/s]
model.safetensors:   1%|▏         | 14.6M/1.00G [00:01<01:48, 9.07MB/s]
model.safetensors:  44%|████▍     | 444M/1.00G [00:29<00:25, 21.5MB/s]
After the upload process is complete, this should be the final response:
model.safetensors: 100%|██████████| 999M/1.00G [00:57<00:00, 17.6MB/s]
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
---------- 2025-01-25 13:49:10 (0:01:10) ----------
Files: hashed 7/7 (1.0G/1.0G) | pre-uploaded: 2/2 (1.0G/1.0G) | committed: 7/7 (1.0G/1.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---------------------------------------------------
Upload successful, go to https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b for details.
And we have our model quantized and available to use with MLX!
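To take it for a spin, you can load the freshly uploaded model straight from the Hub with mlx-lm's Python API. A minimal sketch using the repo we just created:
# Load the quantized model from the Hub and generate a short response
from mlx_lm import load, generate

model, tokenizer = load("rudrankriyam/deepseek-r1-redistill-qwen-1.5b")
response = generate(model, tokenizer, prompt="Hello, who are you?", verbose=True)
The first call downloads the weights to your local Hugging Face cache, so subsequent runs are much faster.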
Moving Forward
With this guide, you have learned how to convert a Hugging Face model to the MLX format, quantize it, and upload it to the Hub.
Next, I recommend exploring how to fine-tune these models for specific use cases or integrate them directly into your iOS app. You can experiment with quantization levels to find the perfect setup for your needs.
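For example, if 4-bit quality is not enough for your use case, mlx_lm.convert accepts options for the bit width and group size (run mlx_lm.convert --help to confirm the exact flag names in your version; the -8bit repo name below is just an illustration):
mlx_lm.convert \
--hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
-q --q-bits 8 --q-group-size 64 \
--upload-repo rudrankriyam/deepseek-r1-redistill-qwen-1.5b-8bit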
If you run into any issues or have something fun to share, reach out to me on Twitter @rudrankriyam. I would love to hear about your projects and experiments!
Happy MLXing!