Exploring MLX: Quantizing a Model with MLX for iOS Developers
When DeepSeek R1 dropped, I went through its repository and found the distilled models. I am working on my local chat app, Polarixy, so I wanted to try the Qwen 1.5B model. At the time, no quantized version of it was available in MLX format, so I decided to do it myself.
DeepSeek's new R1 Distill Qwen 1.5B outperforms GPT-4o and Claude-3.5-Sonnet with 28.9% on AIME and 83.9% on MATH!!
Available on MLX, thanks to yours truly 🤗 https://t.co/91jNC0bFVX
— Rudrank Riyam (@rudrankriyam) January 20, 2025
Last night, I found another model I wanted to convert. This is a step-by-step guide on using a Hugging Face model with MLX and ensuring it is optimized for performance on Apple devices, especially iPhones.
Do not worry if Python is not your strong suit; I will keep it simple.
The model we will be converting today is mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0. Let's dive in!
Prerequisites: Setting Up Your Environment
Before we get started, make sure you have Python installed. If you are not sure, open your terminal and type:
python --version

If Python is not installed, download it from python.org and follow the installation instructions.
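Note that on a fresh macOS install, the interpreter is often only exposed as python3, so if the command above is not found, try:

python3 --version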
Next, create a virtual environment to keep things tidy. A virtual environment ensures that the installed dependencies do not affect other projects.
# Create a virtual environment
python -m venv mlx-env
# Activate the environment
source mlx-env/bin/activate

Once activated, your terminal prompt should indicate that you are in the mlx-env environment. Now, we are ready to install the required tools.
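Before installing anything, a quick sanity check: ask the shell which interpreter it resolves. It should point inside the virtual environment:

# Should print a path ending in mlx-env/bin/python
which python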
Installing MLX-LM
The mlx-lm package provides the tooling to convert, quantize, and run models in the MLX format. You can install it using either pip or conda. For this guide, I will use pip.
pip install mlx-lm

If you are using Conda, the command is:
conda install -c conda-forge mlx-lm

That is it for the setup!
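To confirm the package installed correctly, you can print the help for its conversion entry point, which also previews the flags we will use next:

mlx_lm.convert --help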
Logging in to Hugging Face
Before uploading the model to the Hugging Face Hub, ensure you are logged in with your Hugging Face account. If you are not logged in, use the following command:
pip install -U "huggingface_hub[cli]" && huggingface-cli login

This will prompt you to enter your Hugging Face token, which you can find on the Hugging Face settings page.
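To verify that the login worked, ask the CLI who you are; it should print your Hugging Face username:

huggingface-cli whoami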
Converting the Model to MLX Format
The mlx-lm package provides a simple command-line tool for model conversion. Here's how you can do it:
mlx_lm.convert \
  --hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
  -q \
  --upload-repo rudrankriyam/deepseek-r1-redistill-qwen-1.5b

Breaking it down:
- Model ID: Replace --hf-path with the ID of the model you want to convert (in our case, mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0).
- Quantization: Adding the -q flag ensures the model is quantized to the default 4-bit format, which reduces its size for better performance on edge devices like iPhones.
- Upload Repository: Use --upload-repo to specify the target repository on Hugging Face to which the converted model will be uploaded. You can upload it directly to the mlx-community organization for goodwill, but I prefer to keep it under my account because I am not good with naming conventions.
Once the command completes, the converted model will be saved in the mlx_model directory by default.
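If you would rather convert locally first and upload later (or not at all), you can skip --upload-repo and point the output somewhere explicit. This sketch assumes the --mlx-path flag, which recent mlx-lm versions support; run mlx_lm.convert --help to confirm on yours:

# Convert and quantize locally, without uploading
mlx_lm.convert \
  --hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
  -q \
  --mlx-path ./deepseek-r1-redistill-qwen-1.5b-mlx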
Here is what the terminal output will look like when downloading the model:
[INFO] Loading
generation_config.json: 100% | 181/181 [00:00<00:00, 1.32MB/s]
tokenizer_config.json: 100% | 6.75k/6.75k [00:00<00:00, 30.1MB/s]
config.json: 100% | 768/768 [00:00<00:00, 4.83MB/s]
special_tokens_map.json: 100% | 485/485 [00:00<00:00, 1.97MB/s]
tokenizer.json: 100% | 11.4M/11.4M [00:03<00:00, 3.51MB/s]
model.safetensors: 3% | 94.4M/3.56G [00:03<01:57, 29.4MB/s]
model.safetensors: 34% | 1.21G/3.56G [00:43<01:35, 24.6MB/s]

After the quantization process is done, you will see the following output:
Fetching 6 files: 100% | 6/6 [00:00<00:00, 66752.85it/s]
[INFO] Quantizing
[INFO] Quantized model with 4.501 bits per weight.

The average lands slightly above 4 bits because the quantizer also stores higher-precision scales and biases for each small group of weights; with the default group size of 64 and 16-bit values, that overhead works out to roughly an extra 0.5 bits per weight.

When uploading, your terminal should look similar to this:
Repo created: https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b
Found 7 candidate files to upload
Recovering from metadata files: 100% | 7/7 [00:00<00:00, 6835.89it/s]
---------- 2025-01-25 13:48:00 (0:00:00) ----------
Files: hashed 3/7 (8.2K/1.0G) | pre-uploaded: 0/0 (0.0/1.0G) (+7 unsure) | committed: 0/7 (0.0/1.0G) | ignored: 0
Workers: hashing: 4 | get upload mode: 2 | pre-uploading: 0 | committing: 0 | waiting: 0
---
tokenizer.json: 100% | 11.4M/11.4M [00:03<00:00, 3.29MB/s]
model.safetensors: 1% | 14.6M/1.00G [00:01<01:48, 9.07MB/s]
model.safetensors: 44% | 444M/1.00G [00:29<00:25, 21.5MB/s]

After the upload process is complete, this should be the final response:
All files have been processed! Exiting worker.
INFO:huggingface_hub._upload_large_folder:All files have been processed! Exiting worker.
---------- 2025-01-25 13:49:10 (0:01:10) ----------
Files: hashed 7/7 (1.0G/1.0G) | pre-uploaded: 2/2 (1.0G/1.0G) | committed: 7/7 (1.0G/1.0G) | ignored: 0
Workers: hashing: 0 | get upload mode: 0 | pre-uploading: 0 | committing: 0 | waiting: 0
---
Upload successful, go to https://huggingface.co/rudrankriyam/deepseek-r1-redistill-qwen-1.5b for details.

And we have our model quantized and available to use with MLX!
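To take the quantized model for a quick spin, mlx-lm also ships a generation CLI that pulls the model straight from the Hub. The flags below match the version I used; check mlx_lm.generate --help if yours differs:

# Download the quantized model from the Hub and generate a reply
mlx_lm.generate \
  --model rudrankriyam/deepseek-r1-redistill-qwen-1.5b \
  --prompt "Why is the sky blue?"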
Moving Forward
With this guide, you have learned how to quantize a Hugging Face model to the MLX format and upload it to the Hub.
Next, I recommend exploring how to fine-tune these models for specific use cases or integrate them directly into your iOS app. You can experiment with quantization levels to find the perfect setup for your needs.
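For example, if the default 4-bit setting feels too lossy for your use case, you can trade size for quality with more bits per weight. This assumes the --q-bits and --q-group-size flags, which mlx_lm.convert exposes for tuning quantization; confirm with --help:

# 8-bit quantization with a smaller group size for finer-grained scales
mlx_lm.convert \
  --hf-path mobiuslabsgmbh/DeepSeek-R1-ReDistill-Qwen-1.5B-v1.0 \
  -q --q-bits 8 --q-group-size 32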
If you encounter any issues or have some fun experiences to share, reach out to me on Twitter @rudrankriyam. I would love to hear about your projects and experiments!
Happy MLXing!