Every collaborative session in a Spacebar Space is, by construction, a labeled multi-turn trajectory: voice, canvas, browser interaction, tool use, and natural recovery — captured as a byproduct of real work. The bottleneck for frontier model training isn't compute. It's exactly this kind of data.
The Salesforce APIGen-MT paper opens by stating directly that high-quality data capturing realistic human-agent dynamics is scarce and expensive to collect manually. PC Agent-E identifies computer-use trajectories as a critical bottleneck. Fireworks's multi-turn RL guide is blunt: supervised fine-tuning on golden trajectories breaks down because the model never sees recovery paths from failure.
That recovery data only exists in sessions where humans and agents are working together on real tasks, in a real multimodal environment, with the freedom to make mistakes and correct them. Spacebar generates exactly that as a byproduct of collaboration.
Sessions span voice, canvas, browser, and chat. They include ambiguity, clarification turns, mistakes, and corrections: exactly the situations on which current frontier models are undertrained.
APIGen-MT · tau-bench · BFCL
Synthetic data gets you a long way. Spacebar doesn't replace it — it provides the real distribution that synthetic pipelines use as a seed and validation set.
Fireworks multi-turn RL guide · PC Agent-E
The field is moving toward high-realism, multimodal, human-agent interaction data. Spacebar generates it as a byproduct of real work, not in a lab and not synthetically.
Every session is replayable frame by frame and exportable in formats compatible with standard RL training pipelines. Here is the field-level schema.
Timestamped voice turns per participant: raw audio (WebM/Opus), ASR transcript, speaker identity (role-scoped), turn boundaries, and VAD confidence. Overlapping speech captured separately per stream — not mixed.
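As an illustration of how per-stream voice turns might be consumed, here is a minimal Python sketch. The `VoiceTurn` field names and the `overlapping_turns` helper are assumptions made for this example, not the documented schema. Because each participant is captured on a separate stream, overlapping speech can be found by comparing turn boundaries alone.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class VoiceTurn:
    # Hypothetical field names, chosen for illustration only.
    speaker_role: str      # role-scoped identity, e.g. "human" or "agent"
    start_ms: int          # turn boundary (start), ms since session start
    end_ms: int            # turn boundary (end)
    transcript: str        # ASR transcript of the turn
    vad_confidence: float  # voice-activity-detection confidence in [0, 1]
    audio_uri: str         # pointer to the per-stream WebM/Opus file


def overlapping_turns(turns):
    """Return pairs of turns from different speakers whose intervals overlap."""
    ordered = sorted(turns, key=lambda t: t.start_ms)
    pairs = []
    for a, b in combinations(ordered, 2):
        # a starts no later than b, so they overlap when b starts before a ends.
        if a.speaker_role != b.speaker_role and b.start_ms < a.end_ms:
            pairs.append((a, b))
    return pairs
```

The pairwise scan is quadratic in the number of turns, which is acceptable for a single session; a sweep over sorted boundaries would be the natural replacement at larger scale.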
Full CRDT event log: every object create, update, delete, and move with precise timestamps and actor identity. Vector-level fidelity — strokes, shapes, and annotations are structured data, not raster images. Replayable to any point in session history.
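To make "replayable to any point in session history" concrete, the sketch below rebuilds canvas state from an event log up to a chosen timestamp. The event keys (`ts_ms`, `op`, `object_id`, `props`) are hypothetical, and the log is assumed to be already linearized. A real CRDT merge also has to resolve concurrent edits, which this example does not attempt.

```python
def replay_canvas(events, until_ms):
    """Rebuild canvas state from all events with ts_ms <= until_ms.

    Each event is a dict with hypothetical keys:
    ts_ms, actor, op ("create" | "update" | "move" | "delete"), object_id, props.
    """
    state = {}
    for ev in sorted(events, key=lambda e: e["ts_ms"]):
        if ev["ts_ms"] > until_ms:
            break  # everything after this point is in the "future"
        op, oid = ev["op"], ev["object_id"]
        if op == "create":
            state[oid] = dict(ev["props"])
        elif op in ("update", "move") and oid in state:
            # Updates and moves both overwrite a subset of properties.
            state[oid].update(ev["props"])
        elif op == "delete":
            state.pop(oid, None)
    return state
```

Because strokes and shapes are structured objects rather than pixels, the replayed state can be diffed, filtered by actor, or rendered at any resolution.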
Continuous video of the human's browser session, not screenshots and not accessibility tree snapshots. Mouse trajectories, scroll behavior, hover states, click positions, and abandoned inputs are all visible. Synchronized with canvas and voice streams by wall-clock timestamp.
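Synchronization by wall-clock timestamp means separate streams can be interleaved into a single timeline after the fact. A minimal sketch follows, assuming each stream is a list of event dicts already sorted by a hypothetical `ts_ms` key. The stream names are whatever the caller passes in.

```python
import heapq


def merge_streams(**streams):
    """Interleave per-modality event lists into one timeline.

    Each keyword argument is a list of dicts sorted by the hypothetical
    'ts_ms' key. Returns (ts_ms, stream_name, event) tuples in time order.
    """
    tagged = [
        [(ev["ts_ms"], name, ev) for ev in events]
        for name, events in streams.items()
    ]
    # heapq.merge compares only the extracted key, so event dicts are never compared.
    return list(heapq.merge(*tagged, key=lambda item: item[0]))


timeline = merge_streams(
    voice=[{"ts_ms": 100, "text": "click the export button"}],
    browser=[{"ts_ms": 140, "event": "hover"}, {"ts_ms": 260, "event": "click"}],
    canvas=[{"ts_ms": 200, "op": "create"}],
)
```

Video frames themselves would be addressed by offset into the media file; only their timestamps need to appear in a merged timeline like this one.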
Every tool call: name, input parameters, execution result, latency, and whether the result was accepted, modified, or retried. Full loop structure (observe → call → result → observe again) preserved at call-graph level.
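The accepted/modified/retried flag is what makes recovery paths extractable. The sketch below shows one way a consumer might pull them out. The `ToolCall` fields and the `recovery_episodes` grouping rule are assumptions for illustration: an episode here is a run of retried calls followed by the call that finally resolved it.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    ACCEPTED = "accepted"
    MODIFIED = "modified"
    RETRIED = "retried"


@dataclass
class ToolCall:
    # Hypothetical record layout mirroring the fields listed above.
    name: str
    params: dict
    result: str
    latency_ms: int
    outcome: Outcome


def recovery_episodes(calls):
    """Group calls into episodes: one or more retries, then a resolving call."""
    episodes, current = [], []
    for call in calls:
        current.append(call)
        if call.outcome is not Outcome.RETRIED:
            if len(current) > 1:  # at least one retry preceded the resolution
                episodes.append(current)
            current = []
    # Trailing retries with no resolving call are dropped as unresolved.
    return episodes
```

Episodes extracted this way are the failure-then-correction sequences that supervised fine-tuning on golden trajectories never exposes a model to.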
Session ID, participant roles (not PII), duration, space configuration, model provider, memory tier states at session start and end, consent tier, and licensing classification.
In-browser perception stream results: face landmark positions (478 points), hand keypoints (21 per hand), gesture classifications, attention score, and engagement signal — derived locally, only results transmitted. Synchronized with all other streams.
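A consumer of the perception stream would probably want a shape check before training on it. The following sketch validates the landmark counts stated above. The frame layout (`face_landmarks`, `hands`) is hypothetical, and treating `None` as "no face detected" is an assumption of this example.

```python
FACE_LANDMARKS = 478   # face landmark points per frame, as stated above
HAND_KEYPOINTS = 21    # keypoints per detected hand, as stated above


def validate_perception_frame(frame):
    """Return a list of human-readable problems; an empty list means the frame is well-formed."""
    errors = []
    face = frame.get("face_landmarks")
    # Assumption: None means no face was detected in this frame.
    if face is not None and len(face) != FACE_LANDMARKS:
        errors.append(f"expected {FACE_LANDMARKS} face landmarks, got {len(face)}")
    for i, hand in enumerate(frame.get("hands", [])):
        if len(hand) != HAND_KEYPOINTS:
            errors.append(f"hand {i}: expected {HAND_KEYPOINTS} keypoints, got {len(hand)}")
    return errors
```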
Frontier labs currently have three options for human-agent interaction data. Here is what each actually provides.
Spacebar: Natural (humans are using a real product, not performing for a system)
Human annotation: Structured but performed (annotators following task scripts)
Synthetic generation: Plausible but not real (model-generated approximations of human behavior)

Spacebar: Voice + canvas + browser video + tool calls + perception signals, all synchronized
Human annotation: Typically text or structured annotation; some include screen recordings
Synthetic generation: Text-only or text + tool call; no video, no voice, no canvas

Spacebar: Naturally occurring (mistakes, corrections, and abandoned attempts captured as they happen)
Human annotation: Can be scripted but expensive; rarely includes genuine error recovery
Synthetic generation: Structurally absent (synthetic pipelines optimize for golden paths, not failure modes)

Spacebar: Long-horizon (full work sessions, not isolated task clips)
Human annotation: Typically task-scoped; multi-turn but bounded
Synthetic generation: Bounded by prompt length and generation cost

Spacebar: Byproduct of real product usage; no separate collection pipeline
Human annotation: Dedicated annotation workforce; separate ingestion pipeline
Synthetic generation: Automated generation; no human labor after initial design

Spacebar: High-realism anchor and validation set; seed for synthetic pipelines
Human annotation: Large-volume training data; good coverage but limited behavioral depth
Synthetic generation: Scalable volume at low cost; known distribution bias
Frontier labs will not touch data that cannot answer these questions cleanly. Here are the answers.
Every trajectory is collected under explicit user consent at the session level. The consent flow specifies exactly what data is captured, how it will be used, and the licensing tier. Users can mark sessions as private (never used for training), internal (available for their organization's own fine-tuning), or licensed (available to approved third parties under signed agreements). No session data is used for training without affirmative consent.
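The three consent tiers translate directly into a delivery filter. Here is a minimal sketch, assuming each session record carries a hypothetical `consent_tier` string. A missing or unrecognized value is treated as no consent, in line with the affirmative-consent rule above.

```python
from enum import Enum


class ConsentTier(Enum):
    PRIVATE = "private"    # never used for training
    INTERNAL = "internal"  # the organization's own fine-tuning only
    LICENSED = "licensed"  # approved third parties under signed agreements


def third_party_eligible(sessions):
    """Keep only sessions whose owner affirmatively chose the licensed tier."""
    eligible = []
    for session in sessions:
        try:
            tier = ConsentTier(session.get("consent_tier"))
        except ValueError:
            continue  # missing or unknown tier: no affirmative consent
        if tier is ConsentTier.LICENSED:
            eligible.append(session)
    return eligible
```

Failing closed on unknown values is the important design choice: a schema change or a data-entry error should shrink the licensable set, never grow it.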
Session-level licensing with three tiers: private, internal, and licensed. Licensed sessions are available to approved third parties under executed data licensing agreements. Agreements specify permitted use, territory, sublicensing rights, and audit access. We do not license data to parties we cannot audit.
Voice transcripts are run through PII detection before delivery. Canvas state is delivered with user identity replaced by role-scoped identifiers. Raw audio and video are never transmitted without explicit per-session consent. Perception signals (face landmarks, gesture data) are derived in-browser and only the computed results are transmitted — raw video never leaves the device unless a session is explicitly licensed for video delivery.
Licensed trajectories may be used only for the purposes specified in the agreement. Repurposing, sublicensing without consent, and use for identity inference are prohibited. Labs receive structured trajectory files, not access to our systems. We maintain a complete audit log of all data deliveries.
Before any trajectory is made available for licensing, it passes through a redaction pipeline: PII scrubbing on transcripts, replacement of identifying metadata, removal of any canvas objects flagged as sensitive by the session owner, and quality review. Labs can request additional redaction passes under agreement.
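The redaction steps above can be sketched as a single pass over a trajectory. Everything here is illustrative: the trajectory keys, the `role_ids` mapping, and the two regular expressions are assumptions of this example, and pattern matching alone catches only the most obvious identifiers, well short of production-grade PII detection.

```python
import re

# Deliberately simple patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(trajectory, role_ids):
    """Return a redacted copy of a trajectory dict (hypothetical layout).

    role_ids maps real speaker identities to role-scoped identifiers.
    """
    out = dict(trajectory)
    out["turns"] = [
        {
            **turn,
            "speaker": role_ids[turn["speaker"]],
            "transcript": PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", turn["transcript"])),
        }
        for turn in trajectory["turns"]
    ]
    # Drop canvas objects the session owner flagged as sensitive.
    out["canvas_objects"] = [
        obj for obj in trajectory["canvas_objects"] if not obj.get("sensitive")
    ]
    return out
```

The function builds new lists rather than editing in place, so the unredacted original stays available for the quality-review step.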
For labs that need high-volume, prescribed-scenario data — specific task types, specific tool combinations, specific error-recovery sequences — we support structured data collection programs with scenario templates, facilitator training, quality review, and delivery pipelines. This is a separate enterprise engagement. Contact partnerships@spacebar.ai.
Pencil Spaces, our own platform built entirely on Spacebar, has run over 200 million minutes of paid customer sessions across four years. The licensable subset, which we are actively expanding, depends on consent tier. Volume at any given consent tier is disclosed under NDA as part of the initial conversation. The growth curve follows active users, not a fixed production schedule, which is the structural advantage over annotation-based pipelines.
Yes, a sample is available. We provide an anonymized trajectory in the standard delivery format (including voice, canvas event log, browser interaction metadata, and tool call records) under a simple NDA. This is the right first step. Contact partnerships@spacebar.ai and we will send it within 48 hours.
Trajectories are delivered as structured JSON with separate media attachments (audio as WebM/Opus, browser video as MP4, perception signals as timestamped CSV). The schema is documented and versioned. We can deliver to S3, GCS, or Azure Blob under a signed agreement. Custom format requirements for specific training pipelines are handled under enterprise agreements.
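For a sense of what ingesting a delivery could look like, here is a short sketch that parses a JSON trajectory manifest and a timestamped perception CSV. The key names (`schema_version`, `attachments`) and the CSV columns (`ts_ms`, `attention`) are placeholders invented for this example. The actual layout is defined by the versioned schema documentation.

```python
import csv
import io
import json


def load_trajectory(manifest_text, perception_csv_text):
    """Parse a trajectory manifest (JSON) and its perception signals (CSV)."""
    manifest = json.loads(manifest_text)
    if "schema_version" not in manifest:
        raise ValueError("manifest is missing schema_version")
    reader = csv.DictReader(io.StringIO(perception_csv_text))
    perception = [
        {"ts_ms": int(row["ts_ms"]), "attention": float(row["attention"])}
        for row in reader
    ]
    return manifest, perception


manifest, perception = load_trajectory(
    '{"schema_version": "1.0", "session_id": "s-1",'
    ' "attachments": {"audio": "voice.webm", "video": "browser.mp4"}}',
    "ts_ms,attention\n0,0.91\n33,0.88\n",
)
```

Checking the schema version first lets a training pipeline reject deliveries it was not written for instead of silently misreading fields.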
Scale and Surge produce volume at controlled quality — annotators following task scripts, typically text or structured annotation, with some screen recordings. Spacebar trajectories are behaviorally richer: continuous browser video, synchronized voice, canvas state at vector fidelity, and natural recovery paths that annotation pipelines can script but cannot genuinely reproduce. The two are complementary: Scale for volume and coverage, Spacebar for the high-realism anchor distribution. Most labs will want both.
Genuinely continuous video — the full WebRTC stream of the human's browser session, not screenshots taken before and after each action. This is the key difference from annotation-based pipelines and from screenshot-based computer-use agents. The video captures mouse trajectories, hover states, scroll behavior, and abandoned inputs — the behavioral signal that exists between actions, not just the states before and after. See the shared browser section on the main page for the technical detail.
Initial conversation to sample delivery: 48 hours. Sample to executed agreement: depends on legal review on your side, typically 2–4 weeks. First data delivery after agreement: 1 week. We are set up to move fast on this: the operational work is already done, so the licensing agreement is the bottleneck, not the data preparation.