This chapter is taken from my book on "Exploring Cursor for iOS Development". You can read more about it here:

Exploring AI Assisted Coding for iOS Development
Note: Use the discount code STUDENT for a 44% discount, only if you are a student. :]

The book offers a practical guide to using Cursor and VS Code Copilot in your daily Swift and SwiftUI workflows. It includes ready-to-use scripts for creating Xcode projects and Xcursor, a CustomGPT for chatting about anything related to Cursor, VS Code Copilot, and iOS development (beta). It is written for iOS and Apple platform developers looking to improve their workflow with AI assistance and reduce friction.

I recently gifted myself a device with an A18 Pro chip, and I want to see how much of its power I can use for on-device intelligence. No, I am not talking about Apple Intelligence. I want to see how it handles OpenAI’s Whisper model for speech-to-text, along with LLMs that are small yet powerful enough to run on the 6.9” display monster.

I came across WhisperKit, an open-source Swift package that integrates the Whisper speech recognition model with Apple’s CoreML framework for efficient, local inference on Apple devices.

GitHub - argmaxinc/WhisperKit: On-device Speech Recognition for Apple Silicon
On-device Speech Recognition for Apple Silicon. Contribute to argmaxinc/WhisperKit development by creating an account on GitHub.

I tried the sample project and could feel the difference between running it on the 14 Pro Max and the 16 Pro Max, with the latter being much faster. I decided to learn how it works and create a barebones sample project called Whispering for anyone who wants a starting point for working with WhisperKit.

GitHub - rudrankriyam/WhisperKit-Sample: A sample project to work with WhisperKit across Apple Platforms
A sample project to work with WhisperKit across Apple Platforms - rudrankriyam/WhisperKit-Sample

But I had no idea where to start! The README.md mentions a few lines on how to get started, but not enough for my understanding. The sample project is a single, complex file with more than 1,000 lines, which is too difficult for me to comprehend.

So, I decided to see how much I could use Cursor to learn from the sample project while creating Whispering in the editor. This post is about my journey, and I hope you can pick up a few tricks and tips to make the most of this feature!

Note: While I am using Cursor for this, you can follow the same methodology with Copilot in VS Code or with Windsurf.

Getting Started

This is the current content of ContentView:

import SwiftUI
import WhisperKit

struct ContentView: View {
    var body: some View {
        VStack {
            Image(systemName: "globe")
                .imageScale(.large)
                .foregroundStyle(.tint)
            Text("Hello, world!")
        }
        .padding()
    }
}

The first trick I use with open-source projects is to change the GitHub URL from “g” to “u”. This gives you the whole repository (or a set of files) as LLM-ready context, along with a token count so you know how many tokens you will use. For example, you can do a long-context chat with 200K tokens on Claude in Cursor Chat, which is sufficient for a small- to medium-sized open-source project.

First, I will give Chat the README.md and ask it to summarize how to use the package in my project. The URL changes from:

https://github.com/argmaxinc/WhisperKit/blob/main/README.md

to:

https://uithub.com/argmaxinc/WhisperKit/blob/main/README.md

The whole project at the time of writing this chapter is around 176K tokens, but I do not need that much context initially.
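
If you want to script that host swap, a tiny helper covers it. This is nothing more than the github.com to uithub.com substitution described above, so treat it as a convenience sketch rather than anything Cursor needs:

import Foundation

// Swap the host of a GitHub URL to uithub.com.
// Only the host changes; uithub's optional query parameters are not covered here.
func uithubURL(from gitHubURL: URL) -> URL? {
  var components = URLComponents(url: gitHubURL, resolvingAgainstBaseURL: false)
  components?.host = "uithub.com"
  return components?.url
}

// Example:
// uithubURL(from: URL(string: "https://github.com/argmaxinc/WhisperKit/blob/main/README.md")!)
// returns https://uithub.com/argmaxinc/WhisperKit/blob/main/README.md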

Using Chat

To open the chat window, you can use the keyboard shortcut Shift + Command + L on Cursor or Control + Command + I on Copilot Chat.

I started with a simple prompt, giving it the contents of README.md and then asking it:

<instruction>
Please explain in detail how to get started with WhisperKit
</instruction>

It gave me a summary of WhisperKit and a simple implementation based on the few lines of code in README.md. I decided to just mess around and try it out myself. I would usually apply the code directly, but this time, I wrote it out myself so I had context for what is happening.

Over months of using Cursor and AI-assisted coding, I have realized that letting AI shortcut the coding for you is fun, but there should be a balance so you do not become alienated from your own codebase.

Initial Code

However, I love the tab tab tab, so while writing the code myself, I utilized the fast autocompletion of the Cursor Tab.

After some tab-driven development, this is what my class looks like:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""

  var body: some View {
    VStack {
      Image(systemName: "globe")
        .imageScale(.large)
        .foregroundStyle(.tint)

      Text(transcriptionResult)
    }
    .padding()
    .task {
      await transcriptionService.setup()
      let result = await transcriptionService.transcribe(audio: URL(fileURLWithPath: Bundle.main.path(forResource: "test", ofType: "mp3")!))
      transcriptionResult = result
    }
  }
}

#Preview {
  ContentView()
}

class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?

  init() {
  }

  func setup() async {
    do {
      whisperKit = try await WhisperKit()
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }

  func transcribe(audio: URL) async -> String {
    do {
      let result = try await whisperKit?.transcribe(audioPath: audio.absoluteString)
      return result?.first?.text ?? ""
    } catch {
      print("Error transcribing audio: \(error)")
      return ""
    }
  }
}

I create an instance of WhisperKit and transcribe the audio at a particular path. That’s a good start. Based on the documentation, this should show me how WhisperKit works.

Adding Test Audio

I want Cursor to add a random speech audio file called test to the project. I will add @ContentView.swift to the chat conversation for context:

@ContentView.swift Please add the test file to the project and ensure that the code I have written is correct and will work well according to the documentation of WhisperKit.

Claude gave me an improved version of the code but said I should add the file myself. I asked again, this time with @Web so that it could search the web for a file:

Can you add the file from the @Web to the project yourself and to the xcodeproj?

The response was disappointing: it cannot directly add files to the Xcode project or modify the file system. However, it gave me links to a clear male speech sample, an audio clip from Mozilla’s dataset, and a short TED Talk sample!

I ran the command to download the clear English speech sample:

curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

I know that Xcode 16 uses a synchronized folder structure, so if I change the directory into the Whispering folder and run the command there, Xcode should pick up the file automatically. So, in the terminal, which can be opened via Control + Shift + ` (backtick), I used the inline feature (Command + K) to generate the command:

Change directory to inside Whispering folder and run the
curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav command

It generated the following command:

cd Whispering && curl -o test.mp3 https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

I could have written cd Whispering && myself, but I did it this way for those unaware that you can also use the inline feature to generate more complex commands. The command added test.mp3 to the project in Xcode, with target membership of Whispering as well!

Improved Code Suggestions

Coming back to the suggestions I got in the Chat, it added a waveform image and a progress view that is shown while the given audio file is being transcribed:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""
  @State private var isTranscribing = false

  var body: some View {
    VStack {
      Image(systemName: "waveform")
        .imageScale(.large)
        .foregroundStyle(.tint)
      
      if isTranscribing {
        ProgressView("Transcribing...")
      }
      
      Text(transcriptionResult)
        .padding()
    }
    .padding()
    .task {
      isTranscribing = true
      do {
        await transcriptionService.setup()
        
        if let path = Bundle.main.path(forResource: "test", ofType: "mp3") {
          let url = URL(fileURLWithPath: path)
          transcriptionResult = await transcriptionService.transcribe(audio: url)
        } else {
          transcriptionResult = "Error: Audio file not found"
        }
      } catch {
        transcriptionResult = "Error: \(error.localizedDescription)"
      }
      isTranscribing = false
    }
  }
}

class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?

  func setup() async {
    do {
      let config = WhisperKitConfig(model: "base")
      whisperKit = try await WhisperKit(config)
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }

  func transcribe(audio: URL) async -> String {
    do {
      let result = try await whisperKit?.transcribe(audioPath: audio.path)
      return result?.text ?? ""
    } catch {
      print("Error transcribing audio: \(error)")
      return "Error: \(error.localizedDescription)"
    }
  }
}

It also explicitly specified the “base” model to fetch in the configuration before initializing the WhisperKit instance. When I ran the project, I got the following error:

/Users/rudrankriyam/Downloads/Whispering/Whispering/ContentView.swift:71:22: error: value of type '[TranscriptionResult]' has no member 'text'
      return result?.text ?? ""
             ~~~~~~~ ^~~~

When you see the error in the terminal, you can select those lines and add them directly to Chat using Shift + Command + L (or the equivalent in VS Code), then hit Enter:

Ah, I see the error. The transcription result structure has changed in WhisperKit. Let's fix the code:

It gave me the solution to join the text from the transcription result:

if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
  // Join all segments into a single string
  return results.map { $0.text }.joined(separator: " ")
}

I investigated why it went wrong and realized that the older method, which returns a single TranscriptionResult, is deprecated; the newer one returns an array of TranscriptionResult instead. Chat relied only on the README.md, so it followed the older method. In README.md, though, the example is a one-liner where Swift implicitly resolves the result as a single TranscriptionResult, which works, whereas Claude split it into two lines. AI is not going to take over the world just yet, haha.

And with that, the project successfully built inside Cursor/VS Code using SweetPad (covered in subsequent chapters)!

** BUILD SUCCEEDED ** [11.178 sec]

First Build

After running the first build and waiting a while, I got the error message:

Error transcribing audio: Error Domain=com.apple.coreaudio.avfaudio Code=1685348671 "(null)" UserInfo={failed call=ExtAudioFileOpenURL((__bridge CFURLRef)fileURL, &_extAudioFile)}

My first instinct, out of habit, was to search for it on Google (yeah, I am not used to Perplexity), but I wanted to use the debug-with-AI feature instead. I added those lines to Chat using Shift + Command + L (or the equivalent in VS Code) and hit Enter:

This error occurs because there's an issue with the audio file format or access. Let's fix this by ensuring proper audio file handling:

It added error handling and better file validation to the class:

class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?
  
  func setup() async {
    do {
      let config = WhisperKitConfig(model: "base")
      whisperKit = try await WhisperKit(config)
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }
  
  func transcribe(audio: URL) async -> String {
    do {
      // Verify file exists and is readable
      guard FileManager.default.fileExists(atPath: audio.path) else {
        return "Error: Audio file not found at path"
      }
      
      // Print file information for debugging
      print("Attempting to transcribe file at: \(audio.path)")
      print("File exists: \(FileManager.default.fileExists(atPath: audio.path))")
      
      // Try to transcribe
      if let results = try await whisperKit?.transcribe(audioPath: audio.path) {
        return results.map { $0.text }.joined(separator: " ")
      }
      return "No transcription results"
    } catch {
      print("Error transcribing audio: \(error)")
      return "Error: \(error.localizedDescription)"
    }
  }
}

Then, it modified ContentView to ensure proper file handling:

struct ContentView: View {
  @StateObject private var transcriptionService = TranscriptionService()
  @State private var transcriptionResult = ""
  @State private var isTranscribing = false
  
  var body: some View {
    VStack {
      Image(systemName: "waveform")
        .imageScale(.large)
        .foregroundStyle(.tint)
      
      if isTranscribing {
        ProgressView("Transcribing...")
      }
      
      Text(transcriptionResult)
        .padding()
    }
    .padding()
    .task {
      isTranscribing = true
      
      // Get the file URL and verify it exists
      if let audioURL = Bundle.main.url(forResource: "test", withExtension: "mp3") {
        print("Found audio file at: \(audioURL.path)")
        
        // Verify file exists
        if FileManager.default.fileExists(atPath: audioURL.path) {
          transcriptionResult = await transcriptionService.transcribe(audio: audioURL)
        } else {
          transcriptionResult = "Error: Audio file not found in bundle"
        }
      } else {
        transcriptionResult = "Error: Could not create URL for audio file"
      }
      
      isTranscribing = false
    }
  }
}

Running the app again got me the following in the console:

Found audio file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
Attempting to transcribe file at: /private/var/containers/Bundle/Application/6CE5A9E7-22AF-4FC5-8154-E1B64B5027A8/Whispering.app/test.mp3
File exists: true

But I still got the audio-related error. Asking again, it recommended converting the audio to WAV format instead:

This error typically occurs with audio format issues. Let's try converting the audio to a WAV format with specific parameters that WhisperKit works well with. Here's what we'll do:

I ran the given command from the Whispering folder, and the resulting test.wav was automatically added to the project:

ffmpeg -i test.mp3 -ar 16000 -ac 1 -c:a pcm_s16le test.wav

This created a WAV file with the following properties (you can sanity-check them from Swift with the snippet after this list):

  • 16kHz sample rate
  • Mono channel
  • 16-bit PCM encoding
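
To confirm from Swift that the converted file really has these properties, a quick check with AVAudioFile works. This is a sketch of my own (not something Chat generated), and it assumes test.wav ended up in the app bundle the same way test.mp3 did:

import AVFoundation

// Quick sanity check of the converted file's format (assumes test.wav is in the bundle).
func printAudioFileInfo() {
  guard let url = Bundle.main.url(forResource: "test", withExtension: "wav") else {
    print("test.wav not found in bundle")
    return
  }

  do {
    let file = try AVAudioFile(forReading: url)
    let format = file.fileFormat
    print("Sample rate: \(format.sampleRate) Hz")   // expecting 16000
    print("Channels: \(format.channelCount)")       // expecting 1
    print("Duration: \(Double(file.length) / format.sampleRate) seconds")
  } catch {
    print("Could not open audio file: \(error)")
  }
}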

Then, it updated the ContentView as:

.task {
  isTranscribing = true
  
  await transcriptionService.setup()
  
  if let audioURL = Bundle.main.url(forResource: "test", withExtension: "wav") {  // Changed to wav
    print("Audio path: \(audioURL.path)")
    transcriptionResult = await transcriptionService.transcribe(audio: audioURL)
  } else {
    transcriptionResult = "Error: Audio file not found"
  }
  
  isTranscribing = false
}

Running the project again gave me a successful transcription, on-device!

Improving with Speech Recognition

The next step in improving the app is to add an option to record the user’s speech directly and then feed the audio to WhisperKit. I do not see any mention of how to do this, but I know the sample project does it.

So, this time, we provide it with the context of the sample app instead:

https://uithub.com/argmaxinc/WhisperKit/tree/main/Examples/WhisperAX/WhisperAX?accept=text%2Fhtml&maxTokens=10000000&ext=swift

At the time of writing, it is roughly 22.6K tokens, which is good enough for a normal chat. I will create a new chat and provide it with the above context:

<instruction>
I want to extend the app to accept the user's speech when they click the start recording button. When they click the stop button, it should send the audio buffer to WhisperKit to transcribe and show the result in a view, a scrollable view for larger texts.

Follow Apple's Human Interface Guidelines to create a clean user interface with accessibility in mind.
</instruction>

It provided me with the following code:

// ... existing code ...

struct ContentView: View {
  // Add new state variables for recording
  @State private var audioRecorder: AVAudioRecorder?
  @State private var recordingSession: AVAudioSession?
  @State private var isRecording = false
  @State private var transcribedText = ""
  @State private var recordingURL: URL?
  
  // ... existing state variables ...

  var body: some View {
    NavigationSplitView(columnVisibility: $columnVisibility) {
      // ... existing sidebar code ...
    } detail: {
      VStack {
        // Transcription results view
        ScrollView {
          Text(transcribedText)
            .frame(maxWidth: .infinity, alignment: .leading)
            .padding()
            .accessibilityLabel("Transcription results")
        }
        
        // Recording controls
        HStack(spacing: 20) {
          Button(action: {
            if isRecording {
              stopRecording()
            } else {
              startRecording() 
            }
          }) {
            Image(systemName: isRecording ? "stop.circle.fill" : "mic.circle.fill")
              .resizable()
              .frame(width: 44, height: 44)
              .foregroundColor(isRecording ? .red : .blue)
          }
          .accessibilityLabel(isRecording ? "Stop recording" : "Start recording")
          .disabled(modelState != .loaded)
          
          if isRecording {
            // Show recording indicator
            Text("Recording...")
              .foregroundColor(.red)
          }
        }
        .padding()
      }
    }
    .onAppear {
      setupAudioSession()
    }
  }
  
  // Setup audio recording
  private func setupAudioSession() {
    recordingSession = AVAudioSession.sharedInstance()
    do {
      try recordingSession?.setCategory(.playAndRecord, mode: .default)
      try recordingSession?.setActive(true)
    } catch {
      print("Failed to set up recording session: \(error)")
    }
  }
  
  private func startRecording() {
    let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.wav")
    recordingURL = audioFilename
    
    let settings = [
      AVFormatIDKey: Int(kAudioFormatLinearPCM),
      AVSampleRateKey: 16000,
      AVNumberOfChannelsKey: 1,
      AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue
    ]
    
    do {
      audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
      audioRecorder?.record()
      isRecording = true
    } catch {
      print("Could not start recording: \(error)")
    }
  }
  
  private func stopRecording() {
    audioRecorder?.stop()
    isRecording = false
    
    // Transcribe the recorded audio
    if let url = recordingURL {
      Task {
        if let results = try await whisperKit?.transcribe(audioPath: url.path) {
          transcribedText = results.map { $0.text }.joined(separator: " ")
        }
      }
    }
  }
  
  private func getDocumentsDirectory() -> URL {
    FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
  }
}

Key changes made:
- Added state variables for managing audio recording and the transcribed text
- Added a ScrollView to display transcription results
- Added recording controls with start/stop functionality
- Implemented audio session setup and recording functions
- Added accessibility labels for better VoiceOver support
- Added error handling for audio setup and recording
- Integrated with WhisperKit for transcription when recording stops

The UI follows Apple's Human Interface Guidelines by:
- Using standard system icons
- Providing clear visual feedback for recording state
- Including accessibility labels
- Using appropriate spacing and padding
- Implementing a scrollable view for longer transcriptions
- Disabling recording when the model isn't loaded
- Using semantic colors (red for recording, blue for start)

The code handles the complete flow:
- Setting up audio session on app launch
- Starting/stopping recording with proper audio settings
- Saving recorded audio to a temporary file
- Transcribing the audio using WhisperKit
- Displaying results in a scrollable view

It also mentioned adding the following key to the Info.plist:

<key>NSMicrophoneUsageDescription</key>
<string>We need access to your microphone to record audio for transcription.</string>

Xcode no longer provides an Info.plist file that you can open directly as source code until you add a key to it, so I manually added the key under the Info tab of the Whispering target.
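
The generated code never asks for microphone permission explicitly on iOS; the system prompt appears the first time recording starts, as long as the usage description above is present. If you prefer to request access up front, a small sketch like this works (my own addition, using AVAudioSession’s long-standing callback API; newer OS versions also offer AVAudioApplication for the same purpose):

import AVFoundation

// Ask for microphone access up front on iOS instead of waiting for the first recording.
// On macOS, the AVCaptureDevice-based request shown later in this chapter applies instead.
func requestMicrophoneAccess(completion: @escaping (Bool) -> Void) {
  #if os(iOS)
  AVAudioSession.sharedInstance().requestRecordPermission { granted in
    DispatchQueue.main.async {
      completion(granted)
    }
  }
  #else
  completion(true)
  #endif
}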

The build succeeded on the first attempt, so it was time to run it on the device! Claude took the “clean UI” part to heart, and there is only one button, to record. It records well and transcribes as expected!

Extending to macOS

When I added the macOS target, I remembered that AVAudioSession is unavailable on that platform:

/Users/rudrankriyam/Downloads/Whispering/Whispering/ContentView.swift:96:33: error: 'AVAudioSession' is unavailable in macOS
  private var recordingSession: AVAudioSession?

This gives us another excellent opportunity to use the Chat feature. I added the above lines to Chat using Shift + Command + L (or the equivalent in VS Code) and hit Enter.

I like how Claude is smart enough to add conditionals:

Ah, I see you're developing for macOS. Let's modify the code to handle audio recording for both macOS and iOS using conditional compilation:

The major changes are in the TranscriptionService:

class TranscriptionService: ObservableObject {
  private var whisperKit: WhisperKit?
  private var audioRecorder: AVAudioRecorder?
  #if os(iOS)
  private var recordingSession: AVAudioSession?
  #endif
  private var recordingURL: URL?
  
  init() {
    setupAudioSession()
  }
  
  private func setupAudioSession() {
    #if os(iOS)
    recordingSession = AVAudioSession.sharedInstance()
    do {
      try recordingSession?.setCategory(.playAndRecord, mode: .default)
      try recordingSession?.setActive(true)
    } catch {
      print("Failed to set up recording session: \(error)")
    }
    #endif
  }
  
  func setup() async {
    do {
      let config = WhisperKitConfig(model: "base")
      whisperKit = try await WhisperKit(config)
    } catch {
      print("Error setting up WhisperKit: \(error)")
    }
  }
  
  func startRecording() async {
    let audioFilename = getDocumentsDirectory().appendingPathComponent("recording.wav")
    recordingURL = audioFilename
    
    let settings: [String: Any] = [
      AVFormatIDKey: Int(kAudioFormatLinearPCM),
      AVSampleRateKey: 16000,
      AVNumberOfChannelsKey: 1,
      AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue,
      AVLinearPCMBitDepthKey: 16,
      AVLinearPCMIsFloatKey: false,
      AVLinearPCMIsBigEndianKey: false
    ]
    
    do {
      #if os(macOS)
      // Request microphone access on macOS
      if let accessGranted = try? await AVCaptureDevice.requestAccess(for: .audio) {
        guard accessGranted else {
          print("Microphone access denied")
          return
        }
      }
      #endif
      
      audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
      audioRecorder?.record()
    } catch {
      print("Could not start recording: \(error)")
    }
  }
  // rest remains same
}

And that was the only change required to have a macOS app as well!

Moving Forward

I was able to take this sample project further to create an on-device transcription menu bar app called Whispering:

GitHub - rudrankriyam/Whispering
Contribute to rudrankriyam/Whispering development by creating an account on GitHub.

Play with it, and let me know what you think!

Happy whispering!
