Exploring Foundation Models: Bridging Gemini Video with CustomSegment

A developer routes a desert sunset video through transcript layers to a cloud model and receives a structured response: Foundation Models carries the transcript, a custom executor translates the video segment, and Gemini does the actual understanding.

Apple added image input to the system language model in Xcode 27, but not video. Gemini takes video. So: can you keep using Apple's LanguageModelSession as your app-facing API while putting Gemini behind it?

Yes. The mechanism is Transcript.CustomSegment plus a custom LanguageModelExecutor.

I built the demo with Codex doing most of the research and Swift. The first answer we landed on was wrong, the neat sample I almost shipped hid four custom types, and a "this is the code" snippet turned out to be pseudocode. I kept those detours in, because you may hit them too.

One thing up front:

Apple's system model is not processing the video. Foundation Models runs the session and the transcript. Gemini does the video understanding, through an executor you write yourself.

TL;DR

SystemLanguageModel supports image input in Xcode 27. It does not support video.
Gemini 3.5 Flash supports video.
Transcript.CustomSegment carries opaque, typed bytes through a transcript. It does not interpret them.
A custom LanguageModelExecutor is the layer that gives those bytes meaning, by translating them into Gemini inline_data.
Apple's framework owns the session, prompt builder, transcript, executor lifecycle, and response.content. Gemini owns the understanding. The on-device model is never invoked.

If you just want the code, it is all in Foundation Models Framework Lab PR #147. Everything below is how to get there.

The Question After WWDC 2026

Xcode 27 opens Foundation Models beyond the system model. Apple added image input and shipped the LanguageModel protocol for plugging other local or server models into LanguageModelSession. The docs call it using any large language model with the framework.

The system model got image attachments, but there is no video attachment API. Gemini takes video: per Google's video understanding docs, small videos go straight into generateContent as base64 inline_data next to a text prompt.

Firebase AI Logic already exposes that in Swift:

let model = FirebaseAI.firebaseAI(backend: .googleAI())
    .generativeModel(modelName: "gemini-3.5-flash")
 
let video = InlineDataPart(
    data: try Data(contentsOf: videoURL),
    mimeType: "video/mp4"
)
 
let response = try await model.generateContent(
    video,
    "Describe this video."
)

That works, but it is a Firebase request, not a Foundation Models one. I wanted LanguageModelSession as the app-facing API, with Gemini as the model underneath. Xcode 27 has the pieces to build that bridge.

How I Got There

I did not get this on the first try, and each wrong turn ended up being a useful check.

Separate "can the model do it" from "does the SDK expose it"

Two capabilities are easy to conflate:

whether the Gemini model can understand video, and
whether a Swift package or adapter exposes that capability.

The docs answered the first immediately: Gemini takes video, and Firebase AI Logic exposes InlineDataPart on Apple platforms. The second was murkier. Firebase's Foundation Models adapter documented images and PDFs, but not video or audio.

The tempting conclusion was "Gemini video works through Firebase's native API, but not through LanguageModelSession." That is a reasonable reading of the capability page, and it is also too final.

Read the source, not just the capability pages

The capability page documents what Firebase chose to expose, not what the underlying types can do. Reading the source changed the conclusion.

Firebase's InlineDataPart and FileDataPart already conform to Transcript.CustomSegment. The plumbing exists. But the preview GeminiLanguageModelExecutor rejects custom prompt segments with unsupportedTranscriptContent.

Apple's foundation-models-utilities package did not close the gap either. Its ChatCompletionsLanguageModel translates image attachments but rejects custom segments in the version inspected.

Put together, the source shows three things:

the architecture supports custom media
Firebase already represents its media as custom segments
the tested executor just stops before forwarding those prompt segments

So this is an adapter gap, not a limit of LanguageModelSession itself.

Design a test that can fail

Instead of building the wrapper and calling it a win the moment it returned text, I ran three routes with the same prompt and the same video:

Route	The question it answers
Direct Gemini API	Does this key, model, and MP4 work with no Foundation Models in the path?
Firebase Foundation Models adapter	Does the stock adapter forward its own custom media segment?
Custom executor	Can the same video pass through `LanguageModelSession` when you translate it yourself?

The direct request is the control: had I built only the wrapper and seen it fail, I would not have known whether the model, key, encoding, networking, or transcript conversion was at fault. A small generated MP4 sent straight to Gemini 3.5 Flash returned HTTP 200, a correct description, and a VIDEO token count. That ruled out the model, key, encoding, and network before any Foundation Models code ran, leaving transcript conversion as the only unknown.

A note on keys: keep the Gemini API key out of the repository, and rotate it if it ever shows up in plaintext anywhere, including a chat with an agent.
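One way to keep the key out of source is to read it from the environment and trim it before use. This is a minimal sketch, not the project's code: the helper name loadAPIKey and the variable name GEMINI_API_KEY are assumptions made for illustration.

```swift
import Foundation

/// Hypothetical helper: reads a key from an environment dictionary,
/// trims surrounding whitespace, and treats an empty value as missing.
func loadAPIKey(
    named name: String = "GEMINI_API_KEY", // assumed variable name
    from environment: [String: String] = ProcessInfo.processInfo.environment
) -> String? {
    guard let raw = environment[name] else { return nil }
    let trimmed = raw.trimmingCharacters(in: .whitespacesAndNewlines)
    return trimmed.isEmpty ? nil : trimmed
}

// Usage with an injected dictionary, so no real secret is involved.
let key = loadAPIKey(from: ["GEMINI_API_KEY": "  example-key\n"])
print(key ?? "missing")
```

Passing the environment in as a parameter keeps the helper testable without touching a real key.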

A green build is not runtime proof

Two environment issues here were easy to misread as product failures.

The first build picked up the installed Xcode 26 SDK, where the Xcode 27 LanguageModel and LanguageModelExecutor protocols do not exist. After switching to the Xcode 27 beta toolchain, the target compiled, but freshly built debug bundles were being launch-suspended by the macOS 27 beta before reaching main. A blank window looks like a broken feature when it is really an OS-beta launch issue.

So I treated "it builds" and "it runs" as separate facts, and compiled the executor source into a small standalone macOS harness that ran the same transcript conversion and generation channel without the flaky GUI launch:

Route	Result
Direct Gemini API	Success, described the clip correctly
Firebase adapter	Rejected the custom prompt segment with `unsupportedTranscriptContent`
Custom executor	Success, through `LanguageModelSession`

The harness was not the product, just proof the bridge worked while the beta OS broke the normal launch path. The signed app ran the UI fine.

Distinguish "received the video" from "understood the video"

A trivial clip proves transport, not comprehension. Swapping in a deliberately hard 12-second benchmark, with objects crossing in both directions, an occlusion, a rotating shape that changes color, a moving blackout bar, OCR labels, and a final reorder, exposed real model mistakes: a horizontal sweep read as vertical motion, a reversed rotation direction, and a disagreement about layering during the overlap.

Those are not transport failures. A transport failure looks like an unsupported-input error, an HTTP failure, an empty response, or a description of nothing in the clip. The model received and analyzed a hard timeline and got spatial details wrong. That is a different problem, and one an easy clip would have hidden.

Drop dependencies you no longer need

The first wrapper reused Firebase's InlineDataPart, so Firebase stayed in the package graph even after its stock adapter left the screen. By the end it was there for one media type and a couple of type checks, which did not justify a whole dependency tree for a single endpoint.

Reading the Xcode 27 protocol interface directly, I replaced InlineDataPart with a project-defined VideoSegment, fixed the executor downcast, removed Firebase from the graph, and reran the live request. Still worked.

Let review find the boundary cases

Opening the pull request surfaced four real issues the happy path never exercised:

imported movie data could remain memory-mapped after the security-scoped file access ended
AVPlayer could open a selected URL without holding its security scope
whitespace around an environment API key needed trimming at the HTTP boundary
every selected movie was labeled video/mp4 instead of deriving its MIME type from UTType

Each was fixed, rebuilt, linted, and pushed, with MIME probes covering MP4, MOV, M4V, and WebM and a secret scan confirming the key was not committed.
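The MIME fix can be sketched roughly as follows. This is not the project's implementation: videoMIMEType(for:) is a hypothetical helper, and the fallback table is my own assumption for platforms or extensions where UTType has no answer. UTType(filenameExtension:) and preferredMIMEType are real UniformTypeIdentifiers APIs on Apple platforms.

```swift
import Foundation
#if canImport(UniformTypeIdentifiers)
import UniformTypeIdentifiers
#endif

/// Hypothetical helper: derives a video MIME type from a file URL's extension.
func videoMIMEType(for url: URL) -> String? {
    let ext = url.pathExtension.lowercased()
    guard !ext.isEmpty else { return nil }

    #if canImport(UniformTypeIdentifiers)
    // Ask the system type database first, and accept only movie types.
    if let type = UTType(filenameExtension: ext),
       type.conforms(to: .movie),
       let mime = type.preferredMIMEType {
        return mime
    }
    #endif

    // Small hand-written fallback table (an assumption, not from the project).
    let fallback = [
        "mp4": "video/mp4",
        "mov": "video/quicktime",
        "m4v": "video/x-m4v",
        "webm": "video/webm",
    ]
    return fallback[ext]
}

print(videoMIMEType(for: URL(fileURLWithPath: "/tmp/clip.mov")) ?? "unknown")
```

Returning nil for unknown extensions lets the caller reject the file instead of mislabeling it as video/mp4.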

Trace it until you can explain it

After the demo worked, the call site still hid several custom types I could not yet explain. The last step was not new functionality: I had the agent name every type, open the real file, and trace one request from LanguageModelSession to Gemini and back. The next sections are that walkthrough.

The Call Site

The final usage is compact:

let video = VideoSegment(
    data: try Data(contentsOf: videoURL),
    mimeType: "video/mp4"
)
 
let model = GeminiDeveloperVideoLanguageModel(
    apiKey: apiKey,
    modelName: "gemini-3.5-flash"
)
 
let session = LanguageModelSession(model: model)
 
let response = try await session.respond {
    video
    "Describe this video."
}
 
print(response.content)

That is the whole call site. Behind it are four types you write yourself:

VideoSegment
GeminiDeveloperVideoLanguageModel
GeminiDeveloperVideoLanguageModelExecutor
GeminiDeveloperAPIClient

None ship with Apple, Google, or Firebase. The snippet just calls them; the four types are where the work is.

Code samples in posts also get simplified, which can mislead. A trimmed version of the executor branch might look like:

case let .custom(segment):
    guard let video = segment as? VideoSegment else {
        throw UnsupportedSegmentError()
    }
 
    parts.append(.inlineData(
        data: video.content.data,
        mimeType: video.content.mimeType
    ))

That communicates the idea, but it is pseudocode. There is no UnsupportedSegmentError type in the project; the real implementation first checks whether custom media is allowed for that transcript entry, then builds its error with unsupportedTranscriptError(_:detail:) and throws it as the framework's LanguageModelError.unsupportedTranscriptContent. The actual branch is shown later.

What CustomSegment Actually Does

Foundation Models stores a conversation as a Transcript. Each prompt or response contains ordered segments: text, structured content, attachments, or custom content. The Transcript segment documentation defines the custom case as:

case custom(any Transcript.CustomSegment)

A custom segment needs an identifier and a Content value that is Codable, Equatable, and Sendable. The video segment stores the bytes and MIME type:

import Foundation
import FoundationModels
 
struct VideoSegment: Transcript.CustomSegment {
    struct Content: Codable, Equatable, Sendable {
        let data: Data
        let mimeType: String
    }
 
    let id: String
    let content: Content
 
    init(
        id: String = UUID().uuidString,
        data: Data,
        mimeType: String
    ) {
        self.id = id
        content = Content(data: data, mimeType: mimeType)
    }
}

CustomSegment does not decode frames, understand motion, or add video support to the system model. It gives the transcript a typed container for content the framework does not define itself.

The protocol also receives a default PromptRepresentable conformance, which is why video can appear directly inside the respond result builder. Foundation Models turns it into a .custom(video) segment next to the text segment. At this point the bytes are only being carried through the transcript. Something still has to interpret them.
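To make the result-builder step less magical, here is a toy model of it. None of these types are the real FoundationModels ones: ToySegment, ToyPromptPiece, ToyVideo, and ToyPromptBuilder are stand-ins I made up to show how a builder can turn a mixed list of values into ordered segments. Apple's actual builder may work differently.

```swift
import Foundation

// Stand-in for a transcript segment; NOT the real FoundationModels type.
enum ToySegment {
    case text(String)
    case custom(id: String, byteCount: Int)
}

// Stand-in for the "can appear in a prompt" protocol.
protocol ToyPromptPiece {
    var toySegments: [ToySegment] { get }
}

extension String: ToyPromptPiece {
    var toySegments: [ToySegment] { [.text(self)] }
}

struct ToyVideo: ToyPromptPiece {
    let id: String
    let data: Data
    // The bytes are only carried along; nothing here interprets them.
    var toySegments: [ToySegment] { [.custom(id: id, byteCount: data.count)] }
}

@resultBuilder
enum ToyPromptBuilder {
    // Each expression in the closure becomes its own list of segments.
    static func buildExpression<P: ToyPromptPiece>(_ piece: P) -> [ToySegment] {
        piece.toySegments
    }
    // The lists are concatenated in source order.
    static func buildBlock(_ groups: [ToySegment]...) -> [ToySegment] {
        groups.flatMap { $0 }
    }
}

func toyPrompt(@ToyPromptBuilder _ build: () -> [ToySegment]) -> [ToySegment] {
    build()
}

let segments = toyPrompt {
    ToyVideo(id: "clip-1", data: Data([0, 1, 2]))
    "Describe this video."
}
print(segments.count)
```

The point of the toy is the ordering: the custom segment and the text segment come out side by side, in the order they were written.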

Defining the Custom Language Model

The LanguageModel conformance is intentionally small:

struct GeminiDeveloperVideoLanguageModel: LanguageModel {
    typealias Executor = GeminiDeveloperVideoLanguageModelExecutor
 
    let capabilities = LanguageModelCapabilities(capabilities: [])
    let executorConfiguration: Executor.Configuration
 
    init(apiKey: String, modelName: String) {
        executorConfiguration = Executor.Configuration(
            apiKey: apiKey,
            modelName: modelName
        )
    }
}

The important line is the Executor type alias. It pairs this model description with the type that performs inference, and it supplies the configuration Foundation Models uses when creating that executor.

The capabilities array is empty because this experiment does not implement guided generation, tool calling, reasoning, or Apple's image attachment path. Video is carried as custom transcript content rather than declared as a built-in capability.

Apple describes LanguageModelExecutor as the bridge between framework types and the system that generates tokens. That is exactly what it is here: an adapter from Foundation Models transcripts to the Gemini REST API.

The Executor Is the Real Bridge

The executor owns the Gemini client and implements the method Foundation Models calls for each generation request. This is the real implementation, with only the platform availability attributes removed:

struct GeminiDeveloperVideoLanguageModelExecutor: LanguageModelExecutor {
    typealias Model = GeminiDeveloperVideoLanguageModel
 
    struct Configuration: Hashable, Sendable {
        let apiKey: String
        let modelName: String
    }
 
    private let configuration: Configuration
    private let client: GeminiDeveloperAPIClient
 
    init(configuration: Configuration) throws {
        self.configuration = configuration
        client = GeminiDeveloperAPIClient(
            apiKey: configuration.apiKey,
            modelName: configuration.modelName
        )
    }
}
 
extension GeminiDeveloperVideoLanguageModelExecutor {
    func respond(
        to request: LanguageModelExecutorGenerationRequest,
        model: GeminiDeveloperVideoLanguageModel,
        streamingInto channel: LanguageModelExecutorGenerationChannel
    ) async throws {
        try validate(request)
 
        let convertedTranscript = try convert(request.transcript)
        let response = try await client.generateContent(
            contents: convertedTranscript.contents,
            systemInstruction: convertedTranscript.systemInstruction
        )
 
        guard !response.text.isEmpty else {
            throw GeminiDeveloperAPIError.noTextResponse
        }
 
        await send(response, for: request.id, into: channel)
    }
}

There are four steps inside that method:

Reject the Foundation Models features this experiment does not support.
Convert the session transcript into Gemini roles and parts.
Send the HTTP request to Gemini.
Feed Gemini's response back into the Foundation Models generation channel.

The full path looks like this:

LanguageModelSession
    -> Transcript.Prompt
        -> .custom(VideoSegment)
        -> .text("Describe this video.")
    -> GeminiDeveloperVideoLanguageModelExecutor
    -> Gemini generateContent REST API
    -> LanguageModelExecutorGenerationChannel
    -> response.content

Translating the Transcript

The executor receives the whole Transcript, not just the latest string. It walks each entry and maps Foundation Models roles to Gemini roles:

instructions become Gemini's system_instruction
prompts become user content
previous responses become model content
tool calls and tool output are rejected, because this experiment does not implement them
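The role mapping in that list can be sketched with stand-in types. This is a simplified illustration under my own assumptions, not the project's convert method: ToyEntry, ToyContent, ToyConverted, and ToyConversionError are hypothetical, and the sketch handles text only.

```swift
import Foundation

// Stand-ins for transcript entry kinds; NOT the real FoundationModels types.
enum ToyEntry {
    case instructions(String)
    case prompt(String)
    case response(String)
    case toolCalls
    case toolOutput
}

struct ToyContent: Equatable {
    let role: String // "user" or "model", the roles used in Gemini contents
    let text: String
}

struct ToyConverted: Equatable {
    var systemInstruction: String?
    var contents: [ToyContent] = []
}

enum ToyConversionError: Error, Equatable {
    case unsupported(String)
}

func convert(_ entries: [ToyEntry]) throws -> ToyConverted {
    var result = ToyConverted()
    for entry in entries {
        switch entry {
        case .instructions(let text):
            // Instructions leave the message list and become the system instruction.
            result.systemInstruction = text
        case .prompt(let text):
            result.contents.append(ToyContent(role: "user", text: text))
        case .response(let text):
            result.contents.append(ToyContent(role: "model", text: text))
        case .toolCalls, .toolOutput:
            // Rejected rather than silently dropped.
            throw ToyConversionError.unsupported("tools are not implemented")
        }
    }
    return result
}

let converted = try convert([
    .instructions("Be brief."),
    .prompt("Describe this video."),
    .response("A ball rolls left."),
    .prompt("What color is it?"),
])
print(converted.contents.map(\.role))
```

Throwing on tool entries, instead of skipping them, keeps an unsupported session from producing a quietly incomplete request.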

Inside each entry, text and custom segments become Gemini parts. This is the exact branch that recognizes the video:

case let .custom(customSegment):
    guard allowsCustomMedia else {
        throw unsupportedTranscriptError(
            entry,
            detail: "Gemini system instructions support text only."
        )
    }
 
    if let video = customSegment as? VideoSegment {
        parts.append(.inlineData(
            data: video.content.data,
            mimeType: video.content.mimeType
        ))
    } else {
        throw unsupportedTranscriptError(
            entry,
            detail: "Only VideoSegment custom segments are supported."
        )
    }

This downcast is where the custom segment gains meaning. Foundation Models only knows it is custom content. The executor is the layer that knows this particular custom content should become Gemini inline data.

Sending the Video to Gemini

GeminiDeveloperAPIClient is another project type, not a Google SDK type. It models only the slice of the Gemini API this experiment needs.

The bytes are base64-encoded into an inline_data part:

struct Part: Codable, Sendable {
    let text: String?
    let inlineData: InlineData?
 
    static func inlineData(data: Data, mimeType: String) -> Part {
        Part(
            text: nil,
            inlineData: .init(
                mimeType: mimeType,
                data: data.base64EncodedString()
            )
        )
    }
 
    enum CodingKeys: String, CodingKey {
        case text
        case inlineData = "inline_data"
    }
}

The resulting request body is equivalent to this JSON:

{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "inline_data": {
            "mime_type": "video/mp4",
            "data": "<base64 video bytes>"
          }
        },
        {
          "text": "Describe this video."
        }
      ]
    }
  ]
}
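A quick way to check that shape is to encode a small body and print it. This is a sketch with renamed types: the Sketch-prefixed structs mirror the Part shown earlier, but SketchInlineData, its coding keys, and the request wrapper are my assumptions, since the post does not show them. The three bytes are placeholders, not a real MP4.

```swift
import Foundation

// Assumed shape of the inline payload; the snake_case keys match the JSON above.
struct SketchInlineData: Codable {
    let mimeType: String
    let data: String // base64 text

    enum CodingKeys: String, CodingKey {
        case mimeType = "mime_type"
        case data
    }
}

struct SketchPart: Codable {
    var text: String? = nil
    var inlineData: SketchInlineData? = nil

    enum CodingKeys: String, CodingKey {
        case text
        case inlineData = "inline_data"
    }
}

struct SketchContent: Codable {
    let role: String
    let parts: [SketchPart]
}

struct SketchRequestBody: Codable {
    let contents: [SketchContent]
}

let fakeVideo = Data([0x00, 0x01, 0x02]) // placeholder bytes, not a real MP4
let body = SketchRequestBody(contents: [
    SketchContent(role: "user", parts: [
        SketchPart(inlineData: SketchInlineData(
            mimeType: "video/mp4",
            data: fakeVideo.base64EncodedString()
        )),
        SketchPart(text: "Describe this video."),
    ])
])

let encoder = JSONEncoder()
// Sorted keys give stable output; unescaped slashes keep "video/mp4" readable.
encoder.outputFormatting = [.prettyPrinted, .sortedKeys, .withoutEscapingSlashes]
let json = String(decoding: try encoder.encode(body), as: UTF8.self)
print(json)
```

Because the optional fields are nil in one part each, the synthesized encoder omits them, so every part carries either text or inline_data and never a null.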

The client sends that body with URLSession. This is the real request construction from the project:

let requestAPIKey = apiKey.trimmingCharacters(in: .whitespacesAndNewlines)
guard !requestAPIKey.isEmpty else {
    throw GeminiDeveloperAPIError.apiKeyMissing
}
 
guard let encodedModelName = modelName.addingPercentEncoding(
    withAllowedCharacters: .urlPathAllowed
),
    let url = URL(
        string: "https://generativelanguage.googleapis.com/v1beta/models/\(encodedModelName):generateContent"
    ) else {
    throw GeminiDeveloperAPIError.invalidModelName(modelName)
}
 
var request = URLRequest(url: url)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")
request.setValue(requestAPIKey, forHTTPHeaderField: "x-goog-api-key")
request.httpBody = try JSONEncoder().encode(
    RequestBody(
        contents: contents,
        systemInstruction: systemInstruction
    )
)
 
let (data, response) = try await urlSession.data(for: request)

No Firebase in the final demo. Firebase would work too, and adds production features like App Check and per-user rate limits, but the experiment needed one endpoint, and writing the translation by hand keeps the Foundation Models boundary visible.

Returning Through LanguageModelSession

After decoding Gemini's response, the executor sends generation events through the channel Foundation Models supplied, rather than returning a String:

await channel.send(
    .response(
        entryID: requestID.uuidString,
        action: .appendText(
            response.text,
            tokenCount: response.usageMetadata?.candidatesTokenCount ?? 0
        )
    )
)

The implementation also sends provider metadata, the resolved model version, and Gemini's input and output token usage. LanguageModelSession consumes those events, updates its transcript, and produces the familiar response.content at the call site.

This is what makes the demo a real use of Foundation Models rather than a network call that happens to be named LanguageModelSession. The framework owns the session, prompt construction, transcript, executor lifecycle, and response assembly; the executor owns the provider translation and the request.
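The decoding step before those channel events can be sketched like this. SketchResponse is a hypothetical type, not the project's; the field names follow my understanding of Gemini's generateContent response and should be checked against the current API reference. The sample JSON uses made-up values and is not output from a real request.

```swift
import Foundation

// Sketch of the response slice needed here; not the project's type.
struct SketchResponse: Decodable {
    struct Candidate: Decodable {
        struct Content: Decodable {
            struct Part: Decodable { let text: String? }
            let parts: [Part]?
        }
        let content: Content?
    }
    struct Usage: Decodable {
        struct ModalityCount: Decodable {
            let modality: String
            let tokenCount: Int?
        }
        let promptTokenCount: Int?
        let candidatesTokenCount: Int?
        let promptTokensDetails: [ModalityCount]?
    }
    let candidates: [Candidate]?
    let usageMetadata: Usage?

    /// Concatenated text of the first candidate, or "" if there is none.
    var text: String {
        candidates?.first?.content?.parts?.compactMap(\.text).joined() ?? ""
    }

    /// Prompt tokens attributed to video input, if the API reported them.
    var videoTokenCount: Int? {
        usageMetadata?.promptTokensDetails?
            .first { $0.modality == "VIDEO" }?.tokenCount
    }
}

// Illustrative payload with invented numbers.
let sample = """
{
  "candidates": [
    { "content": { "role": "model", "parts": [ { "text": "A red ball rolls left." } ] } }
  ],
  "usageMetadata": {
    "promptTokenCount": 300,
    "candidatesTokenCount": 7,
    "promptTokensDetails": [
      { "modality": "TEXT", "tokenCount": 5 },
      { "modality": "VIDEO", "tokenCount": 295 }
    ]
  }
}
"""
let decoded = try JSONDecoder().decode(SketchResponse.self, from: Data(sample.utf8))
print(decoded.text, decoded.videoTokenCount ?? 0)
```

Keeping every field optional means a response without usage metadata still decodes, and the empty-text case can be handled by the caller as its own error.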

Who Does What

The boundary in one table:

Layer	Responsibility
`LanguageModelSession`	Prompt building, transcript, request lifecycle, response assembly
`VideoSegment`	Carries video bytes and MIME type as opaque custom content
`GeminiDeveloperVideoLanguageModel`	Describes the model and supplies executor configuration
`GeminiDeveloperVideoLanguageModelExecutor`	Converts transcript entries and segments into Gemini request types
`GeminiDeveloperAPIClient`	Performs the HTTP request and decodes the response
Gemini 3.5 Flash	Interprets the video and generates the answer
`SystemLanguageModel`	Nothing. Apple's on-device model is not invoked.

This is video input at the framework and session layer. It is not native video input for Apple's system model.

What the Demo Does Not Handle

This is a focused experiment, not a full Gemini provider. What it leaves out:

Inline size limits. Google recommends inline video only when the total request size is under 20 MB. Larger or reusable videos should use the Files API.
Key security. A production app should not ship an unrestricted Gemini key. Put the request behind a backend, or use something like Firebase AI Logic with App Check.
Real streaming. The client calls generateContent and waits for the complete response before appending it. It does not consume Gemini's streaming endpoint.
Repeated video data. The executor converts the whole transcript on every request, so a multi-turn session may resend an earlier inline video, which is expensive.
Framework features. Guided generation and tool calling are explicitly rejected. The model declares no built-in capabilities.
Beta APIs. LanguageModel, LanguageModelExecutor, and CustomSegment are Xcode 27 beta and may change.
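For the inline size limit, a pre-flight check is cheap because base64 growth is predictable. This is a sketch under stated assumptions: fitsInline is a hypothetical helper, the 20 MB figure is the one quoted above, and reading it as 20,000,000 bytes plus a 64 KB allowance for the prompt and JSON wrapper is my own conservative guess.

```swift
import Foundation

/// Base64 emits 4 characters for every 3 input bytes, rounded up.
func base64Length(ofByteCount count: Int) -> Int {
    ((count + 2) / 3) * 4
}

/// Hypothetical budget check. `limit` assumes 20 MB means 20 * 1_000_000
/// bytes; `overhead` is a rough allowance for prompt text and JSON wrapper.
func fitsInline(
    videoByteCount: Int,
    overhead: Int = 64 * 1024,
    limit: Int = 20 * 1_000_000
) -> Bool {
    base64Length(ofByteCount: videoByteCount) + overhead <= limit
}

// A 14 MB file grows to about 18.7 MB once encoded and still fits;
// a 15 MB file encodes to exactly 20 MB and leaves no room for the rest.
print(fitsInline(videoByteCount: 14_000_000))
print(fitsInline(videoByteCount: 15_000_000))
```

The practical consequence is that the raw file budget is closer to 15 MB than 20 MB, since the encoded payload is about a third larger than the file.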

The demo proved the transport and executor architecture; a production provider would also need auth, uploads, streaming, capability mapping, retries, caching, and tests.

Verifying Agent-Written Code

Most of this was built with an agent moving quickly across documentation, source, Swift interfaces, project configuration, HTTP, SwiftUI, a standalone harness, package resolution, and the pull request. The code was working and reviewed before I actually understood it.

The most useful question I asked was "where does this code live?" Asking an agent to explain code tends to produce another clean abstraction. Asking where it lives forces a concrete answer: a file, a symbol, and a line you can open.

Before calling something a small example, I want to be able to answer:

Which identifiers come from the SDK, and which did I define?
What control test proves the underlying provider capability?
Which file and method cross the framework-provider boundary?
What did the stock path fail with?
What part of the result came from Apple, from my adapter, and from Gemini?
Is the snippet complete, call-site-only, or intentionally pseudocode?
Can I explain the request and response path without asking the agent again?

The Short Version

VideoSegment conforms to Transcript.CustomSegment. A custom LanguageModelExecutor reads it from the session transcript, converts it to Gemini inline_data, calls generateContent, and sends Gemini's text back through LanguageModelExecutorGenerationChannel. Foundation Models manages the session; Gemini, not Apple's system model, handles the video.

Every part of that maps to a concrete type and method in the implementation. The full code is in Foundation Models Framework Lab PR #147; read GeminiDeveloperVideoLanguageModel.swift for the exact executor, transcript conversion, REST client, errors, metadata, and usage reporting.

What's Next

In Xcode 27, LanguageModelSession is becoming an interface over different providers. CustomSegment lets a provider carry content the framework does not define, and LanguageModelExecutor gives it a place to interpret that content. Together they let an app keep one session API while picking a model that can actually do the task. That is what made routing video to Gemini work here.