Every other approach to shared browser infrastructure is a takeover model: the agent drives, the human steps in when something breaks, and control passes back. Spacebar drops both halves of that model: there is no driver and no handoff. The human and the agent are co-present in the same live tab, simultaneously and natively, with full behavioral signal flowing in both directions.
Browserbase, Browser Use, and every headless-browser agent stack built on the same primitives share a fundamental architectural assumption: the agent is in control, the human is a backup. That assumption produces two problems — one product, one technical.
Screenshot-based agents observe two states: before an action and after it. Everything in between (the hover that revealed a tooltip, the scroll that paused at a section, the field that was almost filled before the user reconsidered) is invisible. The agent learns from a flipbook. The behavioral signal that actually encodes how humans navigate and decide is precisely what gets discarded.
This is not a UX comparison. It is an architectural one. The differences below are structural — they follow from the design of the system, not from implementation quality.
The shared browser in a Spacebar Space is a server-provisioned Chrome instance streamed over WebRTC. Both the human and the agent connect to the same session through the Spacebar event system. Here is the stack.
Continuous WebRTC video stream of the human's browser interaction — mouse, scroll, typing, hover states — plus live DOM access via Chrome DevTools Protocol. Both streams are timestamped and synchronized with the Space event log. This is what makes the observation model fundamentally different from screenshots.
Symmetric input routing — human inputs (mouse events, keyboard events, scroll events) and agent inputs are both routed to the same Chrome instance through the same event system. Neither party has priority. Both streams are multiplexed in real time. The human's browser behavior is native because the events are native.
Server-provisioned Chrome instance managed as a Space resource. Session state (URL, DOM, cookies, local storage) persists across participant joins and leaves. The session can be replayed frame-by-frame from the event log. Isolated per Space; no cross-tenant state.
Headless browser infrastructure, similar to what Browserbase, Browser Use, and Playwright use at the primitive level. The distinction is not at this layer; it is in the observation and control layers above.
The Space event stream — the same CRDT-backed, per-space-ownership-locked substrate that handles canvas state, voice turns, and tool calls. Browser events are first-class citizens of the event log alongside everything else. This is what enables synchronized multi-modal context for the agent.
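Because browser events sit in the same log as everything else, the replay property of the session can be pictured as a fold over that log. The sketch below is a minimal illustration, not Spacebar's implementation: the event tuple shape, the `navigate` and `input` event kinds, and the `replay` function are all assumptions made for the example, and it reconstructs logical state (URL and field values) rather than video frames.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical event shape: (timestamp_ms, kind, target, value).
Event = Tuple[int, str, str, str]


@dataclass
class SessionState:
    url: str = "about:blank"
    fields: Dict[str, str] = field(default_factory=dict)


def replay(log: List[Event], until_ms: int) -> SessionState:
    """Rebuild session state by applying every event with timestamp <= until_ms."""
    state = SessionState()
    for ts_ms, kind, target, value in sorted(log, key=lambda e: e[0]):
        if ts_ms > until_ms:
            break
        if kind == "navigate":
            # A navigation replaces the page, so previously typed values are dropped.
            state.url = value
            state.fields.clear()
        elif kind == "input":
            state.fields[target] = value
    return state


log = [
    (0, "navigate", "", "https://example.com/form"),
    (100, "input", "#email", "a@"),
    (250, "input", "#email", "a@b.co"),
]
midway = replay(log, until_ms=150)   # state as it was partway through typing
final = replay(log, until_ms=300)    # state after the last event
```

Replaying to an intermediate timestamp is what distinguishes a log-backed session from a snapshot: any moment between events can be reconstructed, not only the end state.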
The behavioral signal available to an agent running in a Spacebar shared browser is categorically richer than what screenshot-based systems provide. This is not a matter of degree. It is a structural difference in what information exists.
Screenshots: PNG frames of the rendered page before and after each action. At roughly 10,000 tokens each, they are expensive to process and limited to isolated moments.
Accessibility tree snapshots — structured but lossy. Visual presentation, hover states, and animation states are not captured.
Action results — the outcome of tool calls (click, type, scroll). Not the process. Not what the human considered before acting.
No hover state — tooltips, dropdowns, and previews triggered by mouse proximity are invisible to the agent.
No trajectory — the path between states is not recorded. Only the states themselves.
No abandoned inputs — actions that were started and cancelled leave no trace. The agent sees only completed actions.
No behavioral timing — there is no signal about hesitation, reconsideration, or uncertainty in the interaction.
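To make the contrast concrete, the sketch below derives two of the missing signals, hesitation and abandoned inputs, from a continuous event stream. The event tuple shape, the event kinds, and the 1500 ms pause threshold are illustrative assumptions, not Spacebar's schema. A before/after screenshot pair of the same interaction would show only the first and last states.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical event shape: (timestamp_ms, kind, target, value).
Event = Tuple[int, str, str, str]


def hesitations(events: List[Event], min_gap_ms: int = 1500) -> List[Tuple[int, int]]:
    """Return (pause_start_ms, pause_length_ms) for every gap of at least min_gap_ms."""
    ordered = sorted(events, key=lambda e: e[0])
    return [
        (prev[0], nxt[0] - prev[0])
        for prev, nxt in zip(ordered, ordered[1:])
        if nxt[0] - prev[0] >= min_gap_ms
    ]


def abandoned_inputs(events: List[Event]) -> List[str]:
    """Return fields that held text at some point but are empty at the end of the stream."""
    ever_filled: Set[str] = set()
    latest: Dict[str, str] = {}
    for _, kind, target, value in sorted(events, key=lambda e: e[0]):
        if kind == "input":
            latest[target] = value
            if value:
                ever_filled.add(target)
    return sorted(t for t in ever_filled if latest[t] == "")


events = [
    (0, "hover", "#pricing", ""),
    (400, "input", "#coupon", "SAVE"),
    (900, "input", "#coupon", ""),      # the user cleared the field again
    (3200, "click", "#checkout", ""),   # after a long pause
]
pauses = hesitations(events)
dropped = abandoned_inputs(events)
```

In this example the coupon field is empty both before and after the interaction, so a screenshot pair would show no trace of it, while the event stream records both the attempt and the pause that followed.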
The behavioral richness of Spacebar's browser signal is the structural reason the shared browser is central to the training data thesis. A human working in a Spacebar Space is not performing for a system; they are working naturally. That produces the high-realism anchor distribution that frontier model training needs but cannot get from annotation pipelines. See spacebar.ai/labs for the full argument.
Yes. The embedded browser in a Space is a server-provisioned Chrome instance that both the human and the agent connect to through the Space event system. Both can scroll, click, and type — their actions are multiplexed to the same instance in real time. Neither party is watching a video of the other's screen; they are operating the same live session. This is the same multiplayer model as cursor positions on the Spacebar canvas — both parties are participants, not drivers and passengers.
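One way to picture this multiplexing is as a merge of two time-ordered input streams into a single stream for the browser instance. The sketch below is an assumption-laden illustration rather than Spacebar's routing code: `InputEvent` and `multiplex` are hypothetical names, and ordering by timestamp alone is a simplification.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class InputEvent:
    ts_ms: int     # when the input was issued, in milliseconds
    source: str    # "human" or "agent"
    kind: str      # e.g. "click", "key", "scroll"
    payload: str   # illustrative detail such as a selector or a key name


def multiplex(human: Iterable[InputEvent], agent: Iterable[InputEvent]) -> Iterator[InputEvent]:
    """Merge two already time-ordered streams into one, ordered by timestamp only.

    The source of an event never affects its position, except that heapq.merge
    yields the first argument's event first when two timestamps are equal.
    """
    return heapq.merge(human, agent, key=lambda e: e.ts_ms)


human_inputs = [
    InputEvent(10, "human", "key", "a"),
    InputEvent(30, "human", "key", "b"),
]
agent_inputs = [InputEvent(20, "agent", "click", "#submit")]

# The agent's click lands between the human's two keystrokes.
merged = list(multiplex(human_inputs, agent_inputs))
```

The merged stream interleaves both parties by time, which is the behavior the text describes: the agent's click and the human's typing both reach the same instance, with neither stream waiting on the other.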
Browserbase Live View is a remote desktop window onto an agent-controlled Chrome session — the human sees what the agent is doing and can "take over" by sending inputs through the same remote stream. In Spacebar, there is no takeover. Neither party has default control. The human is not observing the agent's session — they are co-present in a shared session. The agent observes a continuous video feed rather than screenshots. The human interacts natively rather than through a remote stream. These are structural differences, not UX refinements.
Yes. The agent receives a continuous WebRTC video stream of the human's browser interaction — the full session video, not screenshots taken before and after each action. Mouse trajectories, scroll pauses, hover states, abandoned form inputs, and typing rhythm are all visible. This is the behavioral telemetry that screenshot-based agents structurally cannot capture. It is also the primary reason Spacebar trajectories are valuable as training data — they contain the recovery signals and behavioral nuance that synthetic and annotation-based pipelines cannot produce.
The architecture difference sits above the headless-browser primitive, not within it. Spacebar uses similar infrastructure at the base layer — a server-provisioned, programmatically controlled browser instance. The distinction is in the observation layer (continuous video vs. screenshots), the control layer (multiplexed simultaneous input vs. sequential handoff), and the integration layer (the browser session is a first-class citizen of the Space event stream, synchronized with voice, canvas, and tool calls). The limitation of the screenshot model is not specific to Browserbase's implementation — it is inherent to any screenshot-based observation architecture.
The server-provisioned Chrome instance captures its display output as a WebRTC stream, which is delivered to both the human participant (as the browser they see and interact with) and the agent (as a video input alongside DOM access via Chrome DevTools Protocol). The video is encoded for minimal latency in the human-facing direction and for full behavioral fidelity in the agent-facing direction. Both streams are timestamped in the Space event log, synchronized with audio and canvas events to millisecond precision.
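A shared millisecond timeline means an agent can ask what else was happening around any browser event. The following sketch assumes a log that is already sorted by timestamp and uses a hypothetical `(timestamp_ms, modality, detail)` entry shape; `context_window` is an illustrative helper, not part of any Spacebar SDK.

```python
import bisect
from typing import List, Tuple

# Hypothetical log entry: (timestamp_ms, modality, detail).
LogEntry = Tuple[int, str, str]


def context_window(log: List[LogEntry], center_ms: int, radius_ms: int) -> List[LogEntry]:
    """Return every entry whose timestamp lies within radius_ms of center_ms.

    The log must already be sorted by timestamp.
    """
    timestamps = [entry[0] for entry in log]
    lo = bisect.bisect_left(timestamps, center_ms - radius_ms)
    hi = bisect.bisect_right(timestamps, center_ms + radius_ms)
    return log[lo:hi]


log = [
    (100, "voice", "hello"),
    (950, "browser", "hover #plan"),
    (1000, "browser", "click #buy"),
    (1040, "canvas", "note added"),
    (5000, "voice", "done"),
]

# Everything within 100 ms of the click at t=1000, across modalities.
nearby = context_window(log, center_ms=1000, radius_ms=100)
```

Because all modalities share one clock, the same query returns the hover that preceded the click and the canvas note that followed it, without any cross-stream alignment step.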
Both inputs are routed to the browser instance and both are reflected in the video stream. There is no locking mechanism — the same multiplayer model as the canvas. If the agent clicks a button while the human is typing in a field, both actions occur. For most real-time collaboration scenarios, this is desirable: the human and agent are working together, not taking turns. For scenarios requiring strict turn-taking, the agent's permission model can be configured to be observer-only during human turns.
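The observer-only option can be sketched as a gate in front of the shared input path. This is a hedged illustration only: `AgentMode`, `InputRouter`, and `submit` are hypothetical names, and how Spacebar actually exposes its permission model is not specified here.

```python
from enum import Enum
from typing import List, Tuple


class AgentMode(Enum):
    PARTICIPANT = "participant"      # agent inputs are applied alongside human inputs
    OBSERVER_ONLY = "observer_only"  # agent may watch but its inputs are dropped


class InputRouter:
    """Applies inputs from both parties unless the agent is in observer-only mode."""

    def __init__(self) -> None:
        self.agent_mode = AgentMode.PARTICIPANT
        self.applied: List[Tuple[str, str]] = []

    def submit(self, source: str, action: str) -> bool:
        """Apply the action and return True, or return False if it was dropped."""
        if source == "agent" and self.agent_mode is AgentMode.OBSERVER_ONLY:
            return False
        self.applied.append((source, action))
        return True


router = InputRouter()
router.submit("human", "type:hello")
router.submit("agent", "click:#submit")        # applied: default mode has no lock

router.agent_mode = AgentMode.OBSERVER_ONLY    # strict turn-taking during a human turn
router.submit("agent", "click:#cancel")        # dropped
router.submit("human", "type:!")               # human input is unaffected
```

Note that the default path contains no lock at all; the restriction exists only as an opt-in mode on the agent side, which matches the turn-taking configuration described above.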
The shared browser is a standard Space primitive, available through the same SDK as canvas objects, voice, and presence. space.browser.open(url) provisions the session; the resulting browser object exposes the video stream, DOM access, and input routing through the standard event API. No custom deployment required. For specialized configurations — multi-party supervised sessions, prescribed-scenario training runs, custom input routing policies — contact engineering@spacebar.ai.