I am working on a suite of features for my personal app, and one of them is an audio recorder and transcriber for my daily walks.
I wanted to prototype something quickly, and ended up using Groq and AIProxy for it. Since I wanted English only and the cheapest option, I tried the distil-whisper-large-v3-en model, a distilled version of OpenAI's Whisper.
In this post, I build the audio recording feature in SwiftUI using AVFoundation. I walk you through:
- Setting up the recording functionality,
- Managing audio files, and
- Using Groq’s endpoint via AIProxy for converting speech to text.
My goal is to quickly draft blog posts, but this also works for a note-taking app, a podcast tool, or anything else that involves audio!
Setting Up the Audio Recorder
The first step is creating an AudioRecorderManager class. This manages everything: starting and stopping the recording, keeping track of the duration, and handling transcription errors. It is my first time using AVAudioRecorder from AVFoundation for this purpose, and I learned how to process the input along the way. Or at least, I tried.
Here is how I have set up AudioRecorderManager:
import SwiftUI
import AVFoundation
import AIProxy

class AudioRecorderManager: NSObject, ObservableObject, AVAudioRecorderDelegate {
    enum State: Equatable {
        case idle
        case recording(duration: TimeInterval)
        case transcribing
        case error(String)
    }

    @Published private(set) var state: State = .idle
    @Published private(set) var transcribedText = ""

    // Convenience accessors used by the SwiftUI view later on
    var isRecording: Bool {
        if case .recording = state { return true }
        return false
    }

    var recordingDuration: TimeInterval {
        if case .recording(let duration) = state { return duration }
        return 0
    }

    private var audioRecorder: AVAudioRecorder?
    private var recordingTimer: Timer?

    // Constants based on Groq limitations
    private let maxFileSize: Int64 = 25 * 1024 * 1024 // 25 MB
    private let minimumRecordingDuration: TimeInterval = 0.01
    let minimumBilledDuration: TimeInterval = 10.0

    // Not my actual key
    private let groqService = AIProxy.groqService(
        partialKey: "v2|4d4a3384|U3leIUk",
        serviceURL: "https://api.aiproxy.pro/5a2402d8"
    )
}
Groq has specific file limitations you must consider when designing the feature:
- Max File Size: 25 MB per file.
- Minimum File Length: 0.01 seconds.
- Minimum Billed Length: 10 seconds (even if your file is shorter, you will be charged for 10 seconds).
- Supported File Types: mp3, mp4, mpeg, mpga, m4a, wav, webm.
- Single Audio Track Only: if your file has multiple audio tracks, only the first track is processed.
- Supported Response Formats: json, verbose_json, text.
To meet these requirements, we set a max file size of 25 MB and downsample the audio to 16,000 Hz mono, as per Groq’s preprocessing guidelines.
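A quick back-of-the-envelope check on those numbers: at a 32 kbps AAC bitrate, 25 MB holds about (25 × 1024 × 1024 × 8) / 32,000 ≈ 6,550 seconds of audio, roughly 109 minutes, so the cap is generous for a daily walk.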
When starting a recording, the part I usually mess up is AVAudioSession. Remember to set the correct category and activate the session for recording and playback.
func startRecording() {
    guard case .idle = state else { return }

    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(.playAndRecord, mode: .default)
        try audioSession.setActive(true)

        let documentsPath = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask)[0]
        let audioFilename = documentsPath.appendingPathComponent("recording.m4a")

        // Settings optimized for Groq transcription
        let settings: [String: Any] = [
            AVFormatIDKey: Int(kAudioFormatMPEG4AAC),
            AVSampleRateKey: 16000, // 16 kHz as per Groq's preprocessing
            AVNumberOfChannelsKey: 1, // Mono
            AVEncoderAudioQualityKey: AVAudioQuality.high.rawValue,
            AVEncoderBitRateKey: 32000 // Lower bitrate to help manage file size
        ]

        audioRecorder = try AVAudioRecorder(url: audioFilename, settings: settings)
        audioRecorder?.delegate = self
        audioRecorder?.record()
        startTimer()
        state = .recording(duration: 0)
    } catch {
        state = .error("Could not start recording: \(error.localizedDescription)")
    }
}
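One thing this snippet quietly assumes: the app is already allowed to use the microphone. On iOS that means adding an NSMicrophoneUsageDescription entry to Info.plist, and ideally requesting permission before the first recording. A minimal sketch (the wrapper function and the call site are illustrative, not part of AudioRecorderManager):

import AVFoundation

// Ask for microphone access; iOS shows the system prompt on the first call.
func requestMicrophoneAccess(completion: @escaping (Bool) -> Void) {
    AVAudioSession.sharedInstance().requestRecordPermission { granted in
        DispatchQueue.main.async {
            completion(granted)
        }
    }
}

// Usage: gate startRecording() on the result.
requestMicrophoneAccess { granted in
    if granted {
        audioManager.startRecording()
    }
}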
Then, I use a timer to track the recording duration and check the file size dynamically. If the file exceeds 25 MB, the recording stops automatically to prevent exceeding Groq’s limit.
private func startTimer() {
    recordingTimer?.invalidate()
    recordingTimer = Timer.scheduledTimer(withTimeInterval: 0.1, repeats: true) { [weak self] _ in
        guard let self = self else { return }
        if case .recording(let duration) = self.state {
            let newDuration = duration + 0.1
            self.state = .recording(duration: newDuration)

            // Stop early if the file has grown past Groq's 25 MB limit
            if let path = self.audioRecorder?.url.path,
               let fileSize = try? FileManager.default.attributesOfItem(atPath: path)[.size] as? Int64,
               fileSize >= self.maxFileSize {
                self.stopRecording()
                self.state = .error("Recording stopped: File size limit reached (25 MB)")
            }
        }
    }
}
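Both the timer above and the UI below call stopRecording(), which I have not shown yet. A minimal version just tears down the recorder and the timer; deactivating the session is optional, but polite to other audio apps:

func stopRecording() {
    audioRecorder?.stop()
    recordingTimer?.invalidate()
    recordingTimer = nil
    // Optional: give the audio session back to the system
    try? AVAudioSession.sharedInstance().setActive(false)
    state = .idle
}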
Transcribing Audio
Once I have the recorded audio, I can transcribe it using Groq’s service via AIProxy:
func transcribeAudio() async {
    guard case .idle = state, let audioFileURL = audioRecorder?.url else {
        await MainActor.run {
            state = .error("No audio file found")
        }
        return
    }
    await MainActor.run {
        state = .transcribing
    }
    do {
        let audioData = try Data(contentsOf: audioFileURL)
        let fileSize = audioData.count
        guard fileSize <= maxFileSize else {
            throw NSError(domain: "AudioRecorder", code: 1, userInfo: [NSLocalizedDescriptionKey: "File size exceeds 25 MB limit"])
        }
        let requestBody = GroqTranscriptionRequestBody(
            file: audioData,
            model: "distil-whisper-large-v3-en",
            responseFormat: "json"
        )
        let response = try await groqService.createTranscriptionRequest(body: requestBody)
        await MainActor.run {
            self.transcribedText = response.text ?? "No transcription available"
            self.state = .idle
        }
    } catch {
        await MainActor.run {
            self.state = .error("Transcription error: \(error.localizedDescription)")
        }
    }
}
Building the UI
Now that the recording and transcription are set up, let's create a simple UI using SwiftUI. For this feature, I will set up a basic view that shows the recording duration, manages start/stop functionality, and displays transcription results or errors.
I will start with a ContentView that uses AudioRecorderManager as its state object. This way, I can bind the state of the recording process directly to the UI components:
struct ContentView: View {
    @StateObject private var audioManager = AudioRecorderManager()

    var body: some View {
        VStack(spacing: 20) {
            // Display the recording duration
            Text(timeString(from: audioManager.recordingDuration))
                .font(.largeTitle)
                .monospacedDigit()

            // Start/Stop Button
            Button(action: {
                if case .recording = audioManager.state {
                    audioManager.stopRecording()
                    Task {
                        await audioManager.transcribeAudio()
                    }
                } else {
                    audioManager.startRecording()
                }
            }) {
                Text(audioManager.isRecording ? "Stop Recording" : "Start Recording")
                    .padding()
                    .background(audioManager.isRecording ? Color.red : Color.blue)
                    .foregroundColor(.white)
                    .cornerRadius(10)
            }

            // Display progress or error messages
            if case .transcribing = audioManager.state {
                ProgressView("Transcribing...")
            }
            if case .error(let errorMessage) = audioManager.state {
                Text(errorMessage)
                    .foregroundColor(.red)
                    .padding()
            }

            // Show the transcription text if available
            if !audioManager.transcribedText.isEmpty {
                Text("Transcription:")
                    .font(.headline)
                Text(audioManager.transcribedText)
                    .padding()
                    .background(Color.gray.opacity(0.2))
                    .cornerRadius(10)
            }

            // Note about minimum billing duration
            if audioManager.recordingDuration < audioManager.minimumBilledDuration && !audioManager.isRecording {
                Text("Note: Minimum billed duration is 10 seconds")
                    .font(.caption)
                    .foregroundColor(.gray)
            }
        }
        .padding()
    }

    private func timeString(from timeInterval: TimeInterval) -> String {
        let minutes = Int(timeInterval) / 60
        let seconds = Int(timeInterval) % 60
        let tenths = Int((timeInterval * 10).truncatingRemainder(dividingBy: 10))
        return String(format: "%02d:%02d.%d", minutes, seconds, tenths)
    }
}
This setup gives a basic but functional UI that ties everything together. Nothing fancy (yet).
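If you want to try this in a fresh project, the only missing piece is the standard SwiftUI entry point (the app name here is just a placeholder):

import SwiftUI

@main
struct WalkRecorderApp: App {
    var body: some Scene {
        WindowGroup {
            ContentView()
        }
    }
}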
Moving Forward
You can build on this foundation, adding features like a list of past recordings, more advanced transcription options, or even integrating different models depending on your needs!
Enjoy transcribing!