Exploring Foundation Models: Bridging Gemini Video with CustomSegment

Apple added image input to the system language model in Xcode 27, but not video. Gemini takes video. So: can you keep using Apple's LanguageModelSession as your app-facing API while putting Gemini behind it?
Yes. The mechanism is Transcript.CustomSegment plus a custom LanguageModelExecutor.
I built the demo with Codex doing most of the research and Swift. The first answer we landed on was wrong, the neat sample I almost shipped hid four custom types, and a "this is the code" snippet turned out to be pseudocode. I kept those detours in, because you may hit them too.
One thing up front:
Apple's system model is not processing the video. Foundation Models runs the session and the transcript. Gemini does the video understanding, through an executor you write yourself.
TL;DR
- SystemLanguageModel supports image input in Xcode 27. It does not support video.
- Gemini 3.5 Flash supports video.
- Transcript.CustomSegment carries opaque, typed bytes through a transcript. It does not interpret them.
- A custom LanguageModelExecutor is the layer that gives those bytes meaning, by translating them into Gemini inline_data.
- Apple's framework owns the session, prompt builder, transcript, executor lifecycle, and response.content. Gemini owns the understanding. The on-device model is never invoked.
If you just want the code, here is the full Foundation Models Framework Lab PR #147. Everything below is how to get there.
The Question After WWDC 2026
Xcode 27 opens Foundation Models beyond the system model. Apple added image input and shipped the LanguageModel protocol for plugging other local or server models into LanguageModelSession. The docs call it using any large language model with the framework.
The system model got image attachments, but there is no video attachment API. Gemini takes video: per Google's video understanding docs, small videos go straight into generateContent as base64 inline_data next to a text prompt.
Firebase AI Logic already exposes that in Swift:
    let model = FirebaseAI.firebaseAI(backend: .googleAI())
        .generativeModel(modelName: "gemini-3.5-flash")
    let video = InlineDataPart(
        data: try Data(contentsOf: videoURL),
        mimeType: "video/mp4"
    )
    let response = try await model.generateContent(
        video,
        "Describe this video."
    )
That works, but it is a Firebase request, not a Foundation Models one. I wanted LanguageModelSession as the app-facing API, with Gemini as the model underneath. Xcode 27 has the pieces to build that bridge.
How I Got There
I did not get this on the first try, and each wrong turn ended up being a useful check.
Separate "can the model do it" from "does the SDK expose it"
Two capabilities are easy to conflate:
- whether the Gemini model can understand video, and
- whether a Swift package or adapter exposes that capability.
The docs answered the first immediately: Gemini takes video, and Firebase AI Logic exposes InlineDataPart on Apple platforms. The second was murkier. Firebase's Foundation Models adapter documented images and PDFs, but not video or audio.
The tempting conclusion was "Gemini video works through Firebase's native API, but not through LanguageModelSession." That is a reasonable reading of the capability page, and it is also too final.
Read the source, not just the capability pages
The capability page documents what Firebase chose to expose, not what the underlying types can do. Reading the source changed the conclusion.
Firebase's InlineDataPart and FileDataPart already conform to Transcript.CustomSegment. The plumbing exists. But the preview GeminiLanguageModelExecutor rejects custom prompt segments with unsupportedTranscriptContent.
Apple's foundation-models-utilities package did not close the gap either. Its ChatCompletionsLanguageModel translates image attachments but rejects custom segments in the version inspected.
The source tells a different story:
- the architecture supports custom media
- Firebase already represents its media as custom segments
- the tested executor just stops before forwarding those prompt segments
So this is an adapter gap, not a limit of LanguageModelSession itself.
Design a test that can fail
Instead of building the wrapper and calling it a win the moment it returned text, I ran three routes with the same prompt and the same video:
| Route | The question it answers |
|---|---|
| Direct Gemini API | Does this key, model, and MP4 work with no Foundation Models in the path? |
| Firebase Foundation Models adapter | Does the stock adapter forward its own custom media segment? |
| Custom executor | Can the same video pass through LanguageModelSession when you translate it yourself? |
The direct request is the control: if I had built only the wrapper and it failed, I would not know whether the model, key, encoding, networking, or transcript conversion was at fault. A small generated MP4 sent straight to Gemini 3.5 Flash returned HTTP 200, a correct description, and a VIDEO token count, ruling all of that out before any Foundation Models code ran.
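The VIDEO token check can be made mechanical instead of eyeballed. This is a minimal sketch, assuming the response's usageMetadata lists per-modality prompt token counts under promptTokensDetails (field names as I read them in Google's REST reference, so verify them against the current docs). The type names, the helper, and the sample numbers are hypothetical and not taken from the project.

```swift
import Foundation

// Hypothetical mirror of the slice of Gemini's usage metadata this check needs.
struct ControlUsageMetadata: Decodable {
    struct ModalityTokenCount: Decodable {
        let modality: String
        let tokenCount: Int?
    }
    let promptTokensDetails: [ModalityTokenCount]?
}

struct ControlResponse: Decodable {
    let usageMetadata: ControlUsageMetadata?
}

/// Returns the prompt tokens attributed to video, or 0 if none are reported.
func videoPromptTokens(in responseBody: Data) throws -> Int {
    let decoded = try JSONDecoder().decode(ControlResponse.self, from: responseBody)
    let details = decoded.usageMetadata?.promptTokensDetails ?? []
    return details
        .filter { $0.modality == "VIDEO" }
        .reduce(0) { $0 + ($1.tokenCount ?? 0) }
}

// Made-up sample body; real counts depend on the clip.
let sampleControlBody = Data("""
{
  "usageMetadata": {
    "promptTokensDetails": [
      { "modality": "TEXT", "tokenCount": 5 },
      { "modality": "VIDEO", "tokenCount": 300 }
    ]
  }
}
""".utf8)
```

A result of zero on a request that returned HTTP 200 would suggest the video part never reached the model, which is exactly what the control is meant to rule out.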
A note on keys: keep the Gemini API key out of the repository, and rotate it if it ever shows up in plaintext anywhere, including a chat with an agent.
A green build is not runtime proof
Two environment issues here were easy to misread as product failures.
The first build picked up the installed Xcode 26 SDK, where the Xcode 27 LanguageModel and LanguageModelExecutor protocols do not exist. After switching to the Xcode 27 beta toolchain, the target compiled, but freshly built debug bundles were being launch-suspended by the macOS 27 beta before reaching main. A blank window looks like a broken feature when it is really an OS-beta launch issue.
So I treated "it builds" and "it runs" as separate facts, and compiled the executor source into a small standalone macOS harness that ran the same transcript conversion and generation channel without the flaky GUI launch:
| Route | Result |
|---|---|
| Direct Gemini API | Success, described the clip correctly |
| Firebase adapter | Rejected the custom prompt segment with unsupportedTranscriptContent |
| Custom executor | Success, through LanguageModelSession |
The harness was not the product, just proof the bridge worked while the beta OS broke the normal launch path. The signed app ran the UI fine.
Distinguish "received the video" from "understood the video"
A trivial clip proves transport, not comprehension. Swapping in a deliberately hard 12-second benchmark, with objects crossing in both directions, an occlusion, a rotating shape that changes color, a moving blackout bar, OCR labels, and a final reorder, exposed real model mistakes: a horizontal sweep read as vertical motion, a reversed rotation direction, and a disagreement about layering during the overlap.
Those are not transport failures. A transport failure looks like an unsupported-input error, an HTTP failure, an empty response, or a description of nothing in the clip. The model received and analyzed a hard timeline and got spatial details wrong. That is a different problem, and one an easy clip would have hidden.
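One way to keep that distinction honest in a test harness is to classify each route's outcome before judging the description. The sketch below is an assumption about how such a harness could look, with hypothetical RouteOutcome and Diagnosis types. It only catches the three symptoms code can detect; a description of nothing in the clip still needs a human to notice.

```swift
import Foundation

// Hypothetical outcome of one route in the three-route test.
enum RouteOutcome {
    case unsupportedInput(String)       // e.g. the adapter rejects the segment
    case httpFailure(statusCode: Int)
    case text(String)                   // the model's description, possibly empty
}

enum Diagnosis: Equatable {
    case transportFailure(reason: String)
    case needsHumanReview               // video arrived; accuracy still has to be judged
}

func diagnose(_ outcome: RouteOutcome) -> Diagnosis {
    switch outcome {
    case .unsupportedInput(let detail):
        return .transportFailure(reason: "unsupported input: \(detail)")
    case .httpFailure(let statusCode):
        return .transportFailure(reason: "HTTP \(statusCode)")
    case .text(let description):
        let trimmed = description.trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.isEmpty
            ? .transportFailure(reason: "empty response")
            : .needsHumanReview
    }
}
```

Anything that lands in needsHumanReview is where the hard benchmark clip earns its keep: the transport worked, and the remaining errors belong to the model.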
Drop dependencies you no longer need
The first wrapper reused Firebase's InlineDataPart, so Firebase stayed in the package graph even after its stock adapter left the screen. By the end, Firebase was there for one media type and a couple of type checks, which did not justify a whole dependency tree for a single endpoint.
Reading the Xcode 27 protocol interface directly, I replaced InlineDataPart with a project-defined VideoSegment, fixed the executor downcast, removed Firebase from the graph, and reran the live request. Still worked.
Let review find the boundary cases
Opening the pull request surfaced four real issues the happy path never exercised:
- imported movie data could remain memory-mapped after the security-scoped file access ended
- AVPlayer could open a selected URL without holding its security scope
- whitespace around an environment API key needed trimming at the HTTP boundary
- every selected movie was labeled video/mp4 instead of deriving its MIME type from UTType
Each was fixed, rebuilt, linted, and pushed, with MIME probes covering MP4, MOV, M4V, and WebM and a secret scan confirming the key was not committed.
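The MIME fix is the easiest of the four to show in isolation. This is a sketch of the idea rather than the project's code: videoMIMEType(for:) is a hypothetical helper, and the fallback table is my own assumption for platforms where UniformTypeIdentifiers is unavailable. It also assumes a deployment target where UTType exists (macOS 11 or iOS 14 and later).

```swift
import Foundation
#if canImport(UniformTypeIdentifiers)
import UniformTypeIdentifiers
#endif

/// Best-effort MIME type for a movie file, based on its path extension.
/// Returns nil for unknown extensions instead of guessing "video/mp4".
func videoMIMEType(for url: URL) -> String? {
    let ext = url.pathExtension.lowercased()
    guard !ext.isEmpty else { return nil }

    #if canImport(UniformTypeIdentifiers)
    // Ask the system first; unknown extensions yield a type with no MIME type.
    if let mimeType = UTType(filenameExtension: ext, conformingTo: .movie)?.preferredMIMEType {
        return mimeType
    }
    #endif

    // Hand-written fallback (an assumption of this sketch, not system data).
    let fallback = [
        "mp4": "video/mp4",
        "mov": "video/quicktime",
        "m4v": "video/x-m4v",
        "webm": "video/webm",
    ]
    return fallback[ext]
}
```

Returning nil lets the caller decide whether to refuse the file or ask the user, and whatever string comes back should still be checked against the formats the provider accepts.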
Trace it until you can explain it
After the demo worked, the call site still hid several custom types I could not yet explain. The last step was not new functionality: I had the agent name every type, open the real file, and trace one request from LanguageModelSession to Gemini and back. The next sections are that walkthrough.
The Call Site
The final usage is compact:
    let video = VideoSegment(
        data: try Data(contentsOf: videoURL),
        mimeType: "video/mp4"
    )
    let model = GeminiDeveloperVideoLanguageModel(
        apiKey: apiKey,
        modelName: "gemini-3.5-flash"
    )
    let session = LanguageModelSession(model: model)
    let response = try await session.respond {
        video
        "Describe this video."
    }
    print(response.content)
That is the whole call site. Behind it are four types you write yourself:
- VideoSegment
- GeminiDeveloperVideoLanguageModel
- GeminiDeveloperVideoLanguageModelExecutor
- GeminiDeveloperAPIClient
None ship with Apple, Google, or Firebase. The snippet just calls them; the four types are where the work is.
Code samples in posts also get simplified, which can mislead. A trimmed version of the executor branch might look like:
    case let .custom(segment):
        guard let video = segment as? VideoSegment else {
            throw UnsupportedSegmentError()
        }
        parts.append(.inlineData(
            data: video.content.data,
            mimeType: video.content.mimeType
        ))
That communicates the idea, but it is pseudocode. There is no UnsupportedSegmentError type in the project; the real implementation uses unsupportedTranscriptError(_:detail:), checks whether custom media is allowed for that transcript entry, and returns a framework LanguageModelError.unsupportedTranscriptContent. The actual branch is shown later.
What CustomSegment Actually Does
Foundation Models stores a conversation as a Transcript. Each prompt or response contains ordered segments: text, structured content, attachments, or custom content. The Transcript segment documentation defines the custom case as:
    case custom(any Transcript.CustomSegment)
A custom segment needs an identifier and a Content value that is Codable, Equatable, and Sendable. The video segment stores the bytes and MIME type:
    import Foundation
    import FoundationModels
    struct VideoSegment: Transcript.CustomSegment {
        struct Content: Codable, Equatable, Sendable {
            let data: Data
            let mimeType: String
        }
        let id: String
        let content: Content
        init(
            id: String = UUID().uuidString,
            data: Data,
            mimeType: String
        ) {
            self.id = id
            content = Content(data: data, mimeType: mimeType)
        }
    }
CustomSegment does not decode frames, understand motion, or add video support to the system model. It gives the transcript a typed container for content the framework does not define itself.
The protocol also receives a default PromptRepresentable conformance, which is why video can appear directly inside the respond result builder. Foundation Models turns it into a .custom(video) segment next to the text segment. At this point the bytes are only being carried through the transcript. Something still has to interpret them.
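To make "only being carried" concrete, here is a small sketch that round-trips the same Content shape through Codable. It uses a standalone copy of the struct without the Transcript.CustomSegment conformance, so it does not need the Xcode 27 SDK, and the bytes are placeholders rather than a real movie. How Foundation Models itself stores custom segments is not something this sketch claims to show.

```swift
import Foundation

// Standalone copy of VideoSegment.Content for illustration only.
struct Content: Codable, Equatable, Sendable {
    let data: Data
    let mimeType: String
}

// Placeholder bytes, not a real movie file.
let original = Content(data: Data([0x00, 0x01, 0x02, 0x03]), mimeType: "video/mp4")

// JSONEncoder writes Data as a base64 string by default.
let encoded = try JSONEncoder().encode(original)
let decoded = try JSONDecoder().decode(Content.self, from: encoded)

// The bytes and MIME type survive unchanged; nothing inspected the payload.
print(decoded == original) // prints "true"
```

Nothing in that round trip knows the payload is video. The MIME type is just a string travelling next to the bytes until an executor decides what to do with it.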
Defining the Custom Language Model
The LanguageModel conformance is intentionally small:
    struct GeminiDeveloperVideoLanguageModel: LanguageModel {
        typealias Executor = GeminiDeveloperVideoLanguageModelExecutor
        let capabilities = LanguageModelCapabilities(capabilities: [])
        let executorConfiguration: Executor.Configuration
        init(apiKey: String, modelName: String) {
            executorConfiguration = Executor.Configuration(
                apiKey: apiKey,
                modelName: modelName
            )
        }
    }
The important line is the Executor type alias. It pairs this model description with the type that performs inference, and it supplies the configuration Foundation Models uses when creating that executor.
The capabilities array is empty because this experiment does not implement guided generation, tool calling, reasoning, or Apple's image attachment path. Video is carried as custom transcript content rather than declared as a built-in capability.
Apple describes LanguageModelExecutor as the bridge between framework types and the system that generates tokens. That is exactly what it is here: an adapter from Foundation Models transcripts to the Gemini REST API.
The Executor Is the Real Bridge
The executor owns the Gemini client and implements the method Foundation Models calls for each generation request. This is the real implementation, with only the platform availability attributes removed:
    struct GeminiDeveloperVideoLanguageModelExecutor: LanguageModelExecutor {
        typealias Model = GeminiDeveloperVideoLanguageModel
        struct Configuration: Hashable, Sendable {
            let apiKey: String
            let modelName: String
        }
        private let configuration: Configuration
        private let client: GeminiDeveloperAPIClient
        init(configuration: Configuration) throws {
            self.configuration = configuration
            client = GeminiDeveloperAPIClient(
                apiKey: configuration.apiKey,
                modelName: configuration.modelName
            )
        }
    }
    extension GeminiDeveloperVideoLanguageModelExecutor {
        func respond(
            to request: LanguageModelExecutorGenerationRequest,
            model: GeminiDeveloperVideoLanguageModel,
            streamingInto channel: LanguageModelExecutorGenerationChannel
        ) async throws {
            try validate(request)
            let convertedTranscript = try convert(request.transcript)
            let response = try await client.generateContent(
                contents: convertedTranscript.contents,
                systemInstruction: convertedTranscript.systemInstruction
            )
            guard !response.text.isEmpty else {
                throw GeminiDeveloperAPIError.noTextResponse
            }
            await send(response, for: request.id, into: channel)
        }
    }
There are four steps inside that method:
- Reject the Foundation Models features this experiment does not support.
- Convert the session transcript into Gemini roles and parts.
- Send the HTTP request to Gemini.
- Feed Gemini's response back into the Foundation Models generation channel.
The full path looks like this:
    LanguageModelSession
    -> Transcript.Prompt
        -> .custom(VideoSegment)
        -> .text("Describe this video.")
    -> GeminiDeveloperVideoLanguageModelExecutor
    -> Gemini generateContent REST API
    -> LanguageModelExecutorGenerationChannel
    -> response.content
Translating the Transcript
The executor receives the whole Transcript, not just the latest string. It walks each entry and maps Foundation Models roles to Gemini roles:
- instructions become Gemini's system_instruction
- prompts become user content
- previous responses become model content
- tool calls and tool output are rejected, because this experiment does not implement them
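The role mapping in that list can be sketched without the framework. The types here are hypothetical stand-ins: SketchEntry replaces Foundation Models' transcript entries and carries text only, and the Gemini-side structs model just the role, parts, and system_instruction fields. The real conversion in the project walks the actual Transcript and also handles custom segments.

```swift
import Foundation

// Hypothetical stand-ins for transcript entries (text only).
enum SketchEntry {
    case instructions(String)
    case prompt(String)
    case response(String)
    case toolCall(name: String)
}

struct GeminiPart: Encodable, Equatable { let text: String }

struct GeminiContent: Encodable, Equatable {
    let role: String?          // omitted from the JSON when nil
    let parts: [GeminiPart]
}

struct GeminiRequest: Encodable {
    let systemInstruction: GeminiContent?
    let contents: [GeminiContent]

    enum CodingKeys: String, CodingKey {
        case systemInstruction = "system_instruction"
        case contents
    }
}

struct UnsupportedEntry: Error { let detail: String }

func convert(_ entries: [SketchEntry]) throws -> GeminiRequest {
    var instructionParts: [GeminiPart] = []
    var contents: [GeminiContent] = []
    for entry in entries {
        switch entry {
        case .instructions(let text):
            instructionParts.append(GeminiPart(text: text))
        case .prompt(let text):
            contents.append(GeminiContent(role: "user", parts: [GeminiPart(text: text)]))
        case .response(let text):
            contents.append(GeminiContent(role: "model", parts: [GeminiPart(text: text)]))
        case .toolCall(let name):
            // Mirrors the experiment's choice to reject tool entries outright.
            throw UnsupportedEntry(detail: "Tool call \(name) is not supported.")
        }
    }
    let system = instructionParts.isEmpty
        ? nil
        : GeminiContent(role: nil, parts: instructionParts)
    return GeminiRequest(systemInstruction: system, contents: contents)
}
```

Because the whole transcript is converted each time, earlier prompts and responses reappear as alternating user and model contents, which is how a multi-turn session keeps its context on the Gemini side.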
Inside each entry, text and custom segments become Gemini parts. This is the exact branch that recognizes the video:
    case let .custom(customSegment):
        guard allowsCustomMedia else {
            throw unsupportedTranscriptError(
                entry,
                detail: "Gemini system instructions support text only."
            )
        }
        if let video = customSegment as? VideoSegment {
            parts.append(.inlineData(
                data: video.content.data,
                mimeType: video.content.mimeType
            ))
        } else {
            throw unsupportedTranscriptError(
                entry,
                detail: "Only VideoSegment custom segments are supported."
            )
        }
This downcast is where the custom segment gains meaning. Foundation Models only knows it is custom content. The executor is the layer that knows this particular custom content should become Gemini inline data.
Sending the Video to Gemini
GeminiDeveloperAPIClient is another project type, not a Google SDK type. It models only the slice of the Gemini API this experiment needs.
The bytes are base64-encoded into an inline_data part:
    struct Part: Codable, Sendable {
        let text: String?
        let inlineData: InlineData?
        static func inlineData(data: Data, mimeType: String) -> Part {
            Part(
                text: nil,
                inlineData: .init(
                    mimeType: mimeType,
                    data: data.base64EncodedString()
                )
            )
        }
        enum CodingKeys: String, CodingKey {
            case text
            case inlineData = "inline_data"
        }
    }
The resulting request body is equivalent to this JSON:
    {
      "contents": [
        {
          "role": "user",
          "parts": [
            {
              "inline_data": {
                "mime_type": "video/mp4",
                "data": "<base64 video bytes>"
              }
            },
            {
              "text": "Describe this video."
            }
          ]
        }
      ]
    }
The client sends that body with URLSession. This is the real request construction from the project:
    let requestAPIKey = apiKey.trimmingCharacters(in: .whitespacesAndNewlines)
    guard !requestAPIKey.isEmpty else {
        throw GeminiDeveloperAPIError.apiKeyMissing
    }
    guard let encodedModelName = modelName.addingPercentEncoding(
        withAllowedCharacters: .urlPathAllowed
    ),
        let url = URL(
            string: "https://generativelanguage.googleapis.com/v1beta/models/\(encodedModelName):generateContent"
        ) else {
        throw GeminiDeveloperAPIError.invalidModelName(modelName)
    }
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue(requestAPIKey, forHTTPHeaderField: "x-goog-api-key")
    request.httpBody = try JSONEncoder().encode(
        RequestBody(
            contents: contents,
            systemInstruction: systemInstruction
        )
    )
    let (data, response) = try await urlSession.data(for: request)
No Firebase in the final demo. Firebase would work too, and adds production features like App Check and per-user rate limits, but the experiment needed one endpoint, and writing the translation by hand keeps the Foundation Models boundary visible.
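What happens to data after that call is ordinary JSON decoding. The sketch below shows one plausible shape for it, assuming the usual generateContent response layout of candidates, content parts, usageMetadata, and modelVersion. The type names and the sample body are hypothetical; the project's own response types may differ.

```swift
import Foundation

// Hypothetical mirror of the response fields the executor reads.
struct GenerateContentResponse: Decodable {
    struct Candidate: Decodable {
        struct Content: Decodable {
            struct Part: Decodable { let text: String? }
            let parts: [Part]?
        }
        let content: Content?
    }
    struct UsageMetadata: Decodable {
        let promptTokenCount: Int?
        let candidatesTokenCount: Int?
    }

    let candidates: [Candidate]?
    let usageMetadata: UsageMetadata?
    let modelVersion: String?

    /// Joined text parts of the first candidate; empty if there are none.
    var text: String {
        let parts = candidates?.first?.content?.parts ?? []
        return parts.compactMap { $0.text }.joined()
    }
}

// Made-up sample body for illustration.
let sampleResponseBody = Data("""
{
  "candidates": [
    { "content": { "role": "model", "parts": [ { "text": "A red square moves left to right." } ] } }
  ],
  "usageMetadata": { "promptTokenCount": 305, "candidatesTokenCount": 9 },
  "modelVersion": "gemini-3.5-flash"
}
""".utf8)

let decodedResponse = try JSONDecoder().decode(GenerateContentResponse.self, from: sampleResponseBody)
```

Making every field optional keeps the decoder from throwing on a blocked or empty candidate, so the executor can raise its own noTextResponse error when text comes back empty.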
Returning Through LanguageModelSession
After decoding Gemini's response, the executor sends generation events through the channel Foundation Models supplied, rather than returning a String:
    await channel.send(
        .response(
            entryID: requestID.uuidString,
            action: .appendText(
                response.text,
                tokenCount: response.usageMetadata?.candidatesTokenCount ?? 0
            )
        )
    )
The implementation also sends provider metadata, the resolved model version, and Gemini's input and output token usage. LanguageModelSession consumes those events, updates its transcript, and produces the familiar response.content at the call site.
This is what makes it genuinely use Foundation Models instead of just naming a network call LanguageModelSession. The framework owns the session, prompt construction, transcript, executor lifecycle, and response assembly; the executor owns the provider translation and the request.
Who Does What
The boundary in one table:
| Layer | Responsibility |
|---|---|
| LanguageModelSession | Prompt building, transcript, request lifecycle, response assembly |
| VideoSegment | Carries video bytes and MIME type as opaque custom content |
| GeminiDeveloperVideoLanguageModel | Describes the model and supplies executor configuration |
| GeminiDeveloperVideoLanguageModelExecutor | Converts transcript entries and segments into Gemini request types |
| GeminiDeveloperAPIClient | Performs the HTTP request and decodes the response |
| Gemini 3.5 Flash | Interprets the video and generates the answer |
| SystemLanguageModel | Nothing. Apple's on-device model is not invoked. |
This is video input at the framework and session layer. It is not native video input for Apple's system model.
What the Demo Does Not Handle
This is a focused experiment, not a full Gemini provider. What it leaves out:
- Inline size limits. Google recommends inline video only for total requests under 20MB. Larger or reusable videos should use the Files API.
- Key security. A production app should not ship an unrestricted Gemini key. Put the request behind a backend, or use something like Firebase AI Logic with App Check.
- Real streaming. The client calls generateContent and waits for the complete response before appending it. It does not consume Gemini's streaming endpoint.
- Repeated video data. The executor converts the whole transcript on every request, so a multi-turn session may resend an earlier inline video, which is expensive.
- Framework features. Guided generation and tool calling are explicitly rejected. The model declares no built-in capabilities.
- Beta APIs. LanguageModel, LanguageModelExecutor, and CustomSegment are Xcode 27 beta and may change.
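The inline size limit is the easiest of these to guard against up front, because base64 inflates the payload by about a third before the request is even built. A minimal sketch, assuming 20MB means 20 * 1024 * 1024 bytes and allowing a rough 4 KB for the rest of the JSON body; both numbers, and the helper names, are my assumptions rather than anything in the project or Google's docs.

```swift
import Foundation

/// Length in bytes of the padded base64 encoding of `byteCount` raw bytes.
func base64EncodedLength(forByteCount byteCount: Int) -> Int {
    ((byteCount + 2) / 3) * 4
}

enum VideoTransport: Equatable {
    case inline      // send as inline_data in generateContent
    case filesAPI    // upload first, then reference the file
}

// Assumed interpretation of the 20MB recommendation; treat the boundary as approximate.
let inlineRequestLimit = 20 * 1024 * 1024

func transport(forVideoByteCount byteCount: Int, otherRequestBytes: Int = 4_096) -> VideoTransport {
    let estimatedRequestSize = base64EncodedLength(forByteCount: byteCount) + otherRequestBytes
    return estimatedRequestSize < inlineRequestLimit ? .inline : .filesAPI
}
```

With these assumptions a 16 MiB movie already lands on the Files API side, since its base64 form is a little over 21 MiB. The demo has no such check and always sends inline.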
The demo proved the transport and executor architecture; a production provider would also need auth, uploads, streaming, capability mapping, retries, caching, and tests.
Verifying Agent-Written Code
Most of this was built with an agent moving quickly across documentation, source, Swift interfaces, project configuration, HTTP, SwiftUI, a standalone harness, package resolution, and the pull request. The code was working and reviewed before I actually understood it.
The most useful question I asked was "where does this code live?" Asking an agent to explain code tends to produce another clean abstraction. Asking where it lives forces a concrete answer: a file, a symbol, and a line you can open.
Before calling something a small example, I want to be able to answer:
- Which identifiers come from the SDK, and which did I define?
- What control test proves the underlying provider capability?
- Which file and method cross the framework-provider boundary?
- What did the stock path fail with?
- What part of the result came from Apple, from my adapter, and from Gemini?
- Is the snippet complete, call-site-only, or intentionally pseudocode?
- Can I explain the request and response path without asking the agent again?
The Short Version
VideoSegment conforms to Transcript.CustomSegment. A custom LanguageModelExecutor reads it from the session transcript, converts it to Gemini inline_data, calls generateContent, and sends Gemini's text back through LanguageModelExecutorGenerationChannel. Foundation Models manages the session; Gemini, not Apple's system model, handles the video.
Every part of that maps to a concrete type and method in the implementation. The full code is in Foundation Models Framework Lab PR #147; read GeminiDeveloperVideoLanguageModel.swift for the exact executor, transcript conversion, REST client, errors, metadata, and usage reporting.
What's Next
In Xcode 27, LanguageModelSession is becoming an interface over different providers. CustomSegment lets a provider carry content the framework does not define, and LanguageModelExecutor gives it a place to interpret that content. Together they let an app keep one session API while picking a model that can actually do the task. That is what made routing video to Gemini work here.