I am working on a suite of features for my personal app, and one of them is an audio recorder with transcription for my daily walks.

I quickly wanted to prototype something, and ended up using Groq and AIProxy for it. Since I wanted English only and the cheapest option, I tried distil-whisper-large-v3-en, a distilled, English-only version of OpenAI's Whisper.

In this post, I build the audio recording feature in SwiftUI using AVFoundation. I walk you through:

  • Setting up the recording functionality,
  • Managing audio files, and
  • Using Groq’s endpoint via AIProxy for converting speech to text.

My goal is to quickly draft blog posts, but the same setup works for a note-taking app, a podcast tool, or anything else that involves audio!

Setting Up the Audio Recorder

The first step is creating an AudioRecorderManager class. This manages everything: starting and stopping the recording, keeping track of the duration, and handling transcription errors. It is my first time using AVAudioRecorder from AVFoundation for this purpose, so I had to learn how to process the input along the way. I tried.

Here is how I have set up AudioRecorderManager:

import SwiftUI
import AVFoundation
import AIProxy

class AudioRecorderManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    enum State: Equatable {
        case idle
        case recording(duration: TimeInterval)
        case transcribing
        case error(String)
    }

    @Published private(set) var state: State = .idle
    @Published private(set) var transcribedText = ""

    private var audioRecorder: AVAudioRecorder?
    private var recordingTimer: Timer?

    // Constants based on Groq limitations
    private let maxFileSize: Int64 = 25 * 1024 * 1024 // 25 MB
    private let minimumRecordingDuration: TimeInterval = 0.01
    let minimumBilledDuration: TimeInterval = 10.0
    
    // Not my actual key
    private let groqService = AIProxy.groqService(
        partialKey: "v2|4d4a3384|U3leIUk",
        serviceURL: "https://api.aiproxy.pro/5a2402d8"
    )

    // Convenience accessor used by the UI later on; 0 when not recording
    var recordingDuration: TimeInterval {
        if case .recording(let duration) = state { return duration }
        return 0
    }
}

Groq has specific file limitations you must consider when designing the feature:

  • Max File Size: 25 MB per file.
  • Minimum File Length: 0.01 seconds.
  • Minimum Billed Length: 10 seconds (even if your file is shorter, you will be charged for 10 seconds).
  • Supported File Types: mp3, mp4, mpeg, mpga, m4a, wav, webm.
  • Single Audio Track Only: If your file has multiple audio tracks, only the first track is processed.
  • Supported Response Formats: json, verbose_json, text.

To meet these requirements, we set a max file size of 25 MB and downsample the audio to 16,000 Hz mono, as per Groq’s preprocessing guidelines.
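
As a quick sanity check (my own back-of-the-envelope math, not from Groq's docs), the 25 MB cap at the bitrate I use below works out to well under the file limit for any realistic walk:

// Rough estimate of how long a recording can get before hitting 25 MB.
let maxFileSizeBits = 25.0 * 1024 * 1024 * 8 // 25 MB in bits
let bitRate = 32_000.0                       // the 32 kbps AAC bitrate used below
let maxSeconds = maxFileSizeBits / bitRate   // ≈ 6,554 s, roughly 109 minutes

Plenty for a daily walk.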

To start recording, you first need to configure AVAudioSession, which is the part I usually mess up. Remember to set the correct category (.playAndRecord here) and activate the session before recording.

func startRecording() {
    guard case .idle = state else { return }

    let audioSession = AVAudioSession.sharedInstance()

    do {
        try audioSession.setCategory(.playAndRecord, mode: .default)
        try audioSession.setActive(true)

        let documentsPath = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        let audioFilename = documentsPath.appendingPathComponent("recording.m4a")

        // Settings optimized for Groq transcription
        let settings: [String: Any] = [
            AVFormatIDKey: Int(kAudioFormatMPEG4AAC),
            AVSampleRateKey: 16000, // 16 kHz as per Groq's preprocessing
            AVNumberOfChannelsKey: 1, // Mono
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue,
            AVEncoderBitRateKey: 32000 // Lower bitrate to help manage file size
        ]

        audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
        audioRecorder?.delegate = self
        audioRecorder?.record()

        startTimer()
        state = .recording(duration: 0)
    } catch {
        state = .error("Could not start recording: \(error.localizedDescription)")
    }
}
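
One thing the snippet above assumes is microphone permission. Your target needs an NSMicrophoneUsageDescription entry in Info.plist, and it is worth requesting access explicitly before the first recording. A minimal sketch (my addition, not part of the manager above):

// Ask for microphone access up front. Without NSMicrophoneUsageDescription
// in Info.plist, the app crashes on first microphone use.
func requestMicrophonePermission(completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        DispatchQueue.main.async {
            completion(granted)
        }
    }
}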

Then, I use a timer to track the recording duration and check the file size dynamically. If the file reaches 25 MB, the recording stops automatically to stay within Groq's limit.

private func startTimer() {
    recordingTimer?.invalidate()
    recordingTimer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { [weak self] _ in
        guard let self = self else { return }
        if case .recording(let duration) = self.state {
            let newDuration = duration + 0.1
            self.state = .recording(duration: newDuration)

            // Check file size
            if let fileSize = try? FileManager.default.attributesOfItem(atPath: self.audioRecorder?.url.path ?? "")[.size] as? Int64,
               fileSize >= self.maxFileSize {
                self.stopRecording()
                self.state = .error("Recording stopped: File size limit reached (25 MB)")
            }
        }
    }
}
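
You might have noticed stopRecording() being called above without ever being shown. A minimal version that fits the rest of the class just stops the recorder, tears down the timer, and resets the state:

// Stops the recorder, invalidates the timer, and returns to the idle state.
func stopRecording() {
    audioRecorder?.stop()
    recordingTimer?.invalidate()
    recordingTimer = nil
    state = .idle

    // Hand the audio session back to other apps (optional, but polite)
    try? AVAudioSession.sharedInstance().setActive(false, options: .notifyOthersOnDeactivation)
}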

Transcribing Audio

Once I have the recorded audio, I can transcribe it using Groq’s service via AIProxy:

func transcribeAudio() async {
    guard case .idle = state, let audioFileURL = audioRecorder?.url else {
        // @Published properties should only be mutated on the main thread
        await MainActor.run {
            state = .error("No audio file found")
        }
        return
    }

    await MainActor.run {
        state = .transcribing
    }

    do {
        let audioData = try Data(contentsOf: audioFileURL)
        let fileSize = audioData.count

        guard fileSize <= maxFileSize else {
            throw NSError(domain: "AudioRecorder", code: 1, userInfo: [NSLocalizedDescriptionKey: "File size exceeds 25 MB limit"])
        }

        let requestBody = GroqTranscriptionRequestBody(
            file: audioData,
            model: "distil-whisper-large-v3-en",
            responseFormat: "json"
        )

        let response = try await groqService.createTranscriptionRequest(body: requestBody)

        await MainActor.run {
            self.transcribedText = response.text ?? "No transcription available"
            self.state = .idle
        }
    } catch {
        await MainActor.run {
            self.state = .error("Transcription error: \(error.localizedDescription)")
        }
    }
}
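
Since startRecording() always writes to the same recording.m4a, old recordings get overwritten anyway, but I like cleaning up once the transcription succeeds. An optional helper along these lines (a sketch, not required for the feature):

// Delete the temporary audio file once it has been transcribed.
private func deleteRecording() {
    guard let url = audioRecorder?.url else { return }
    try? FileManager.default.removeItem(at: url)
}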

Building the UI

Now that the recording and transcription are set up, let's create a simple UI using SwiftUI. For this feature, I will set up a basic view that shows the recording duration, manages start/stop functionality, and displays transcription results or errors.

I will start with a ContentView that uses AudioRecorderManager as its state object. This way, I can bind the state of the recording process directly to the UI components:

struct ContentView: View {
    @StateObject private var audioManager = AudioRecorderManager()

    // .recording carries an associated duration, so compare with `if case`
    // instead of `==` (which would only match one exact duration value)
    private var isRecording: Bool {
        if case .recording = audioManager.state { return true }
        return false
    }

    var body: some View {
        VStack(spacing: 20) {
            // Display the recording duration
            Text(timeString(from: audioManager.recordingDuration))
                .font(.largeTitle)
                .monospacedDigit()

            // Start/Stop Button
            Button(action: {
                if case .recording = audioManager.state {
                    audioManager.stopRecording()
                    Task {
                        await audioManager.transcribeAudio()
                    }
                } else {
                    audioManager.startRecording()
                }
            }) {
                Text(isRecording ? "Stop Recording" : "Start Recording")
                    .padding()
                    .background(isRecording ? Color.red : Color.blue)
                    .foregroundColor(.white)
                    .cornerRadius(10)
            }

            // Display progress or error messages
            if case .transcribing = audioManager.state {
                ProgressView("Transcribing...")
            }

            if case .error(let errorMessage) = audioManager.state {
                Text(errorMessage)
                    .foregroundColor(.red)
                    .padding()
            }

            // Show the transcription text if available
            if !audioManager.transcribedText.isEmpty {
                Text("Transcription:")
                    .font(.headline)
                Text(audioManager.transcribedText)
                    .padding()
                    .background(Color.gray.opacity(0.2))
                    .cornerRadius(10)
            }

            // Note about minimum billing duration
            if audioManager.recordingDuration < audioManager.minimumBilledDuration && !isRecording {
                Text("Note: Minimum billed duration is 10 seconds")
                    .font(.caption)
                    .foregroundColor(.gray)
            }
        }
        .padding()
    }

    private func timeString(from timeInterval: TimeInterval) -> String {
        let minutes = Int(timeInterval) / 60
        let seconds = Int(timeInterval) % 60
        let tenths = Int((timeInterval * 10).truncatingRemainder(dividingBy: 10))
        return String(format: "%02d:%02d.%d", minutes, seconds, tenths)
    }
}
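
If you want to run this as-is, all that is missing is an entry point. A minimal one (the app name is just a placeholder):

@main
struct AudioNotesApp: App { // placeholder name
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}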

This setup gives a basic but functional UI that ties everything together. Nothing fancy (yet).


Moving Forward

You can build on this foundation by adding a list of past recordings, more advanced transcription options, or even different models depending on your needs!

Enjoy transcribing!
