This chapter is taken from my book on "Exploring Cursor for iOS Development". You can read more about it here:
I recently gifted myself a device with an A18 Pro chip, and I want to see how much I can utilize its power for on-device intelligence. No, I am not talking about Apple Intelligence. I want to see how it works with OpenAI’s Whisper model for speech-to-text, and with LLMs that are small yet powerful enough to run on the 6.9” display monster.
I came across WhisperKit, an open-source Swift package integrating the speech recognition model with Apple’s CoreML framework for efficient and local inference on Apple devices.
I tried the sample project and could feel the difference between using it on the 14 Pro Max and the 16 Pro Max, with the latter being much faster. I decided to learn how it works and create a barebones sample project called Whispering for anyone who wants a starting point for working with WhisperKit.
But I do not have any idea where to start! A few lines in the README.md explain how to get started, but not enough for my understanding. The sample project is a single complex file of 1,000+ lines, which is too difficult for me to comprehend.
So, I decided to see how much I could use Cursor to learn from the sample project while creating Whispering in the editor. This post is about my journey, and I hope you can pick up a few tricks and tips to make the most of this feature!
Note: While I am using Cursor for this, you can follow the same methodology with GitHub Copilot in VS Code, or with Windsurf.
Getting Started
This is the current content of ContentView:
import SwiftUI
import WhisperKit

struct ContentView: View {
    var body: some View {
        VStack {
            Image(systemName: "globe")
                .imageScale(.large)
                .foregroundStyle(.tint)
            Text("Hello, world!")
        }
        .padding()
    }
}
The first trick I use with open-source projects is to change the GitHub URL from “g” to “u”. This gives you the whole repository, or a set of files, as LLM context, along with a token count so you know how many tokens you will use. For example, you can do a long-context chat with 200K tokens on Claude in Cursor Chat, which is sufficient for a small- to medium-sized open-source project.
First, I will use the README.md to have Chat summarise how to use the package in my project. The URL changes from:
https://github.com/argmaxinc/WhisperKit/blob/main/README.md
to:
https://uithub.com/argmaxinc/WhisperKit/blob/main/README.md
The whole project at the time of writing this chapter is around 176K tokens, but I do not need that much context initially.
Using Chat
To open the chat window, you can use the keyboard shortcut Shift + Command + L in Cursor, or Control + Command + I for Copilot Chat.
I started with a simple prompt, giving it the contents of README.md and then asking it:
<instruction>
Please explain in detail how to get started with WhisperKit
</instruction>
It gave me a summary of WhisperKit and a simple implementation based on the few lines of code in README.md.
I decided to just mess around and try it out myself. I would usually apply the code directly, but this time I wrote the code myself so that I had context for what was happening.
Over months of using Cursor and AI-assisted coding, I have realized that taking the shortcut of letting AI code for you is fun, but there should be a balance so you do not become alienated from your own codebase.
Initial Code
However, I love the tab tab tab, so while writing the code myself, I utilized the fast autocompletion of Cursor Tab.
After some tab-driven development, this is what my class looks like:
struct ContentView: View {
    @StateObject private var transcriptionService = TranscriptionService()
    @State private var transcriptionResult = ""

    var body: some View {
        VStack {
            Image(systemName: "globe")
                .imageScale(.large)
                .foregroundStyle(.tint)
            Text(transcriptionResult)
        }
        .padding()
        .task {
            await transcriptionService.setup()
            let result = await transcriptionService.transcribe(audio: URL(fileURLWithPath: Bundle.main.path(forResource: "test", ofType: "mp3")!))
            transcriptionResult = result
        }
    }
}
#Preview {
    ContentView()
}
class TranscriptionService: ObservableObject {
    private var whisperKit: WhisperKit?

    init() {
    }

    func setup() async {
        do {
            whisperKit = try await WhisperKit()
        } catch {
            print("Error setting up WhisperKit: \(error)")
        }
    }

    func transcribe(audio: URL) async -> String {
        do {
            let result = try await whisperKit?.transcribe(audioPath: audio.absoluteString)
            return result?.first?.text ?? ""
        } catch {
            print("Error transcribing audio: \(error)")
            return ""
        }
    }
}
I create an instance of WhisperKit and transcribe the audio at a particular path. That’s a good start. Based on the documentation, this should show me how WhisperKit works.
Adding Test Audio
I want Cursor to add a random speech audio file called test to the project. I will add @ContentView.swift for context to the chat conversation:
@ContentView.swift Please add the test file to the project and ensure that the code I have written is correct and will work well according to the documentation of WhisperKit.
Claude gave me an improved version of the code but said I should add the file myself. I asked again, this time with @Web so that it could access a file:
Can you add the file from the @Web to the project yourself and to the xcodeproj?
The response was disappointing: it could not directly add files to the Xcode project or modify the file system. However, it gave me links to a clear male speech sample, an audio clip from Mozilla’s dataset, and a short TED talk sample!
I ran the command to download the clear English speech example from the Open Speech Repository:
curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
I know that Xcode 16 uses the folder structure, so I can change the directory into the Whispering folder and run the command there; the file should be added automatically. So, in the terminal, which can be opened via Control + Shift + ` (backtick), I used the inline feature (Command + K) to generate the command:
Change directory to inside Whispering folder and run the
curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav command
It generated the following command:
cd Whispering && curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
I could have written cd Whispering && myself, but I did it this way to show those unaware that you can use the inline feature to generate more complex commands, too. The command added the test.mp3 file inside Xcode, with the target membership of Whispering, too!
Improved Code Suggestions
Coming back to the suggestions I got in the Chat, it added a waveform image and a progress view that is shown while the given audio file is being transcribed:
struct ContentView: View {
    @StateObject private var transcriptionService = TranscriptionService()
    @State private var transcriptionResult = ""
    @State private var isTranscribing = false

    var body: some View {
        VStack {
            Image(systemName: "waveform")
                .imageScale(.large)
                .foregroundStyle(.tint)

            if isTranscribing {
                ProgressView("Transcribing...")
            }

            Text(transcriptionResult)
                .padding()
        }
        .padding()
        .task {
            isTranscribing = true

            do {
                await transcriptionService.setup()

                if let path = Bundle.main.path(forResource: "test", ofType: "mp3") {
                    let url = URL(fileURLWithPath: path)
                    transcriptionResult = await transcriptionService.transcribe(audio: url)
                } else {
                    transcriptionResult = "Error: Audio file not found"
                }
            } catch {
                transcriptionResult = "Error: \(error.localizedDescription)"
            }

            isTranscribing = false
        }
    }
}
class TranscriptionService: ObservableObject {
    private var whisperKit: WhisperKit?

    func setup() async {
        do {
            let config = WhisperKitConfig(model: "base")
            whisperKit = try await WhisperKit(config)
        } catch {
            print("Error setting up WhisperKit: \(error)")
        }
    }

    func transcribe(audio: URL) async -> String {
        do {
            let result = try await whisperKit?.transcribe(audioPath: audio.path)
            return result?.text ?? ""
        } catch {
            print("Error transcribing audio: \(error)")
            return "Error: \(error.localizedDescription)"
        }
    }
}
It also explicitly added the “base” model to the configuration before initializing the instance of WhisperKit. When I ran the project, I got the following error:
/Users/rudrankriyam/Downloads/Whispering/Whispering/ContentView.swift:71:22: error: value of type '[TranscriptionResult]' has no member 'text'
return result?.text ?? ""
~~~~~~~ ^~~~
When you see the error in the Terminal, you can directly add those particular lines in Chat using Shift + Command + L or the equivalent in VS Code, and then hit enter:
Ah, I see the error. The transcription result structure has changed in WhisperKit. Let's fix the code:
It gave me the solution to join the text from the transcription result:
if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
    // Join all segments into a single string
    return results.map { $0.text }.joined(separator: " ")
}
I investigated why it went wrong and realized that the older method that returns a single TranscriptionResult is deprecated, and the newer one returns an array of TranscriptionResult instead. The Chat relied only on the README.md information, so it followed the older method. But in README.md, the example was on one line, where Swift implicitly took the type of result as TranscriptionResult, which works, but Claude did it in two lines. AI is not going to take over the world yet, haha.
And with that, we successfully built the project with SweetPad inside Cursor/VS Code (covered in subsequent chapters)!
** BUILD SUCCEEDED ** [11.178 sec]
First Build
After running the first build and waiting a while, I got the error message:
Error transcribing audio: Error Domain=com.apple.coreaudio.avfaudio Code=1685348671 "(null)" UserInfo={failed call=ExtAudioFileOpenURL((__bridge CFURLRef)fileURL, &_extAudioFile)}
My first instinct, out of habit, is to search for it on Google (yeah, I am not used to Perplexity), but I want to use the debug-with-AI feature instead. I added those lines in Chat using Shift + Command + L or the equivalent in VS Code, and then hit enter:
This error occurs because there's an issue with the audio file format or access. Let's fix this by ensuring proper audio file handling:
It added error handling and better file validation to the class:
class TranscriptionService: ObservableObject {
    private var whisperKit: WhisperKit?

    func setup() async {
        do {
            let config = WhisperKitConfig(model: "base")
            whisperKit = try await WhisperKit(config)
        } catch {
            print("Error setting up WhisperKit: \(error)")
        }
    }

    func transcribe(audio: URL) async -> String {
        do {
            // Verify file exists and is readable
            guard FileManager.default.fileExists(atPath: audio.path) else {
                return "Error: Audio file not found at path"
            }

            // Print file information for debugging
            print("Attempting to transcribe file at: \(audio.path)")
            print("File exists: \(FileManager.default.fileExists(atPath: audio.path))")

            // Try to transcribe
            if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
                return results.map { $0.text }.joined(separator: " ")
            }

            return "No transcription results"
        } catch {
            print("Error transcribing audio: \(error)")
            return "Error: \(error.localizedDescription)"
        }
    }
}
Then, it modified ContentView to ensure proper file handling:
struct ContentView: View {
    @StateObject private var transcriptionService = TranscriptionService()
    @State private var transcriptionResult = ""
    @State private var isTranscribing = false

    var body: some View {
        VStack {
            Image(systemName: "waveform")
                .imageScale(.large)
                .foregroundStyle(.tint)

            if isTranscribing {
                ProgressView("Transcribing...")
            }

            Text(transcriptionResult)
                .padding()
        }
        .padding()
        .task {
            isTranscribing = true

            // Get the file URL and verify it exists
            if let audioURL = Bundle.main.url(forResource: "test", withExtension: "mp3") {
                print("Found audio file at: \(audioURL.path)")

                // Verify file exists
                if FileManager.default.fileExists(atPath: audioURL.path) {
                    transcriptionResult = await transcriptionService.transcribe(audio: audioURL)
                } else {
                    transcriptionResult = "Error: Audio file not found in bundle"
                }
            } else {
                transcriptionResult = "Error: Could not create URL for audio file"
            }

            isTranscribing = false
        }
    }
}
Running the app again got me the following in the console:
Found audio file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
Attempting to transcribe file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
File exists: true
But I still got the audio-related error. (In hindsight, that is not too surprising: the curl command likely saved a WAV file from the Open Speech Repository under an .mp3 extension.) Asking it again, it recommended converting the audio to WAV format instead:
This error typically occurs with audio format issues. Let's try converting the audio to a WAV format with specific parameters that WhisperKit works well with. Here's what we'll do:
I ran the given command, and it automatically added the test.wav to the project:
ffmpeg -i test.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav
This created a WAV file with:
- 16kHz sample rate
- Mono channel
- 16-bit PCM encoding
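As a quick sanity check (my addition, not something the chat suggested), you can read the converted file back with AVAudioFile and confirm those parameters took effect:

import AVFoundation

// Sketch: inspect the bundled test.wav and print its format.
// Expected after the ffmpeg conversion: 16000 Hz, 1 channel.
if let url = Bundle.main.url(forResource: "test", withExtension: "wav"),
   let file = try? AVAudioFile(forReading: url) {
    let format = file.processingFormat
    print("Sample rate: \(format.sampleRate), channels: \(format.channelCount)")
}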
Then, it updated the ContentView as:
.task {
    isTranscribing = true

    await transcriptionService.setup()

    if let audioPath = Bundle.main.path(forResource: "test", ofType: "wav") { // Changed to wav
        print("Audio path: \(audioPath)")
        transcriptionResult = await transcriptionService.transcribe(audioPath: audioPath)
    } else {
        transcriptionResult = "Error: Audio file not found"
    }

    isTranscribing = false
}
Running the project again gave me a successful transcription, on-device!
Improving with Speech Recognition
The next step in improving the app is to have an option to record the user’s speech directly and then feed the audio buffer to WhisperKit. I do not see any mention of how to do this, but I know the sample project does it.
So, this time, we provide it with the context of the sample app instead:
https://uithub.com/argmaxinc/WhisperKit/tree/main/Examples/WhisperAX/WhisperAX?accept=text%2Fhtml&maxTokens=10000000&ext=swift
At the time of writing, it is roughly 22.6K tokens, which is good enough for a normal chat. I will create a new chat and provide it with the above context:
<instruction>
I want to extend the app to accept the user's speech when they click the start recording button. When they click the stop button, it should send the audio buffer to WhisperKit to transcribe and show the result in a view, a scrollable view for larger texts.
Follow Apple's Human Interface Guidelines to create a clean user interface with accessibility in mind.
</instruction>
It provided me with the following code:
// ... existing code ...

struct ContentView: View {
    // Add new state variables for recording
    @State private var audioRecorder: AVAudioRecorder?
    @State private var recordingSession: AVAudioSession?
    @State private var isRecording = false
    @State private var transcribedText = ""
    @State private var recordingURL: URL?

    // ... existing state variables ...

    var body: some View {
        NavigationSplitView(columnVisibility: $columnVisibility) {
            // ... existing sidebar code ...
        } detail: {
            VStack {
                // Transcription results view
                ScrollView {
                    Text(transcribedText)
                        .frame(maxWidth: .infinity, alignment: .leading)
                        .padding()
                        .accessibilityLabel("Transcription results")
                }

                // Recording controls
                HStack(spacing: 20) {
                    Button(action: {
                        if isRecording {
                            stopRecording()
                        } else {
                            startRecording()
                        }
                    }) {
                        Image(systemName: isRecording ? "stop.circle.fill" : "mic.circle.fill")
                            .resizable()
                            .frame(width: 44, height: 44)
                            .foregroundColor(isRecording ? .red : .blue)
                    }
                    .accessibilityLabel(isRecording ? "Stop recording" : "Start recording")
                    .disabled(modelState != .loaded)

                    if isRecording {
                        // Show recording indicator
                        Text("Recording...")
                            .foregroundColor(.red)
                    }
                }
                .padding()
            }
        }
        .onAppear {
            setupAudioSession()
        }
    }

    // Setup audio recording
    private func setupAudioSession() {
        recordingSession = AVAudioSession.sharedInstance()

        do {
            try recordingSession?.setCategory(.playAndRecord, mode: .default)
            try recordingSession?.setActive(true)
        } catch {
            print("Failed to set up recording session: \(error)")
        }
    }

    private func startRecording() {
        let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.wav")
        recordingURL = audioFilename

        let settings = [
            AVFormatIDKey: Int(kAudioFormatLinearPCM),
            AVSampleRateKey: 16000,
            AVNumberOfChannelsKey: 1,
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue
        ]

        do {
            audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
            audioRecorder?.record()
            isRecording = true
        } catch {
            print("Could not start recording: \(error)")
        }
    }

    private func stopRecording() {
        audioRecorder?.stop()
        isRecording = false

        // Transcribe the recorded audio
        if let url = recordingURL {
            Task {
                if let results = try await whisperKit?.transcribe(audioPath: url.path) {
                    transcribedText = results.map { $0.text }.joined(separator: " ")
                }
            }
        }
    }

    private func getDocumentsDirectory() -> URL {
        FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
    }
}
Key changes made:
- Added state variables for managing audio recording and the transcribed text
- Added a ScrollView to display transcription results
- Added recording controls with start/stop functionality
- Implemented audio session setup and recording functions
- Added accessibility labels for better VoiceOver support
- Added error handling for audio setup and recording
- Integrated with WhisperKit for transcription when recording stops
The UI follows Apple's Human Interface Guidelines by:
- Using standard system icons
- Providing clear visual feedback for recording state
- Including accessibility labels
- Using appropriate spacing and padding
- Implementing a scrollable view for longer transcriptions
- Disabling recording when the model isn't loaded
- Using semantic colors (red for recording, blue for start)
The code handles the complete flow:
- Setting up audio session on app launch
- Starting/stopping recording with proper audio settings
- Saving recorded audio to a temporary file
- Transcribing the audio using WhisperKit
- Displaying results in a scrollable view
It also mentioned adding the following key to the Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>We need access to your microphone to record audio for transcription.</string>
Xcode no longer provides an Info.plist file that you can directly open as source code unless you add a key to it, so I manually added the key under the Info tab of the Whispering target.
The build succeeded on the first attempt, so it was time to run it on the device! Claude took the “clean UI” part to heart, and there is only one button to record. It records well and transcribes as expected!
Extending to macOS
When I added the macOS target, I remembered that AVAudioSession is unavailable on that platform:
/Users/rudrankriyam/Downloads/Whispering/Whispering/ContentView.swift:96:33: error: 'AVAudioSession' is unavailable in macOS
private var recordingSession: AVAudioSession?
This gives us another excellent opportunity to utilize the Chat feature. I added the above lines in Chat using Shift + Command + L or the equivalent in VS Code and then hit enter.
I like how Claude is smart enough to add conditionals:
Ah, I see you're developing for macOS. Let's modify the code to handle audio recording for both macOS and iOS using conditional compilation:
The major changes are in the TranscriptionService:
class TranscriptionService: ObservableObject {
    private var whisperKit: WhisperKit?
    private var audioRecorder: AVAudioRecorder?

    #if os(iOS)
    private var recordingSession: AVAudioSession?
    #endif

    private var recordingURL: URL?

    init() {
        setupAudioSession()
    }

    private func setupAudioSession() {
        #if os(iOS)
        recordingSession = AVAudioSession.sharedInstance()

        do {
            try recordingSession?.setCategory(.playAndRecord, mode: .default)
            try recordingSession?.setActive(true)
        } catch {
            print("Failed to set up recording session: \(error)")
        }
        #endif
    }

    func setup() async {
        do {
            let config = WhisperKitConfig(model: "base")
            whisperKit = try await WhisperKit(config)
        } catch {
            print("Error setting up WhisperKit: \(error)")
        }
    }

    func startRecording() async {
        let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.wav")
        recordingURL = audioFilename

        let settings: [String: Any] = [
            AVFormatIDKey: Int(kAudioFormatLinearPCM),
            AVSampleRateKey: 16000,
            AVNumberOfChannelsKey: 1,
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue,
            AVLinearPCMBitDepthKey: 16,
            AVLinearPCMIsFloatKey: false,
            AVLinearPCMIsBigEndianKey: false
        ]

        do {
            #if os(macOS)
            // Request microphone access on macOS
            if let accessGranted = try? await AVCaptureDevice.requestAccess(for: .audio) {
                guard accessGranted else {
                    print("Microphone access denied")
                    return
                }
            }
            #endif

            audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
            audioRecorder?.record()
        } catch {
            print("Could not start recording: \(error)")
        }
    }

    // rest remains same
}
And that was the only change required to have a macOS app as well!
Moving Forward
I was able to take this sample project further to create an on-device transcription menu bar app called Whispering:
Play with it, and let me know what you think!
Happy whispering!