Can the human and agent really interact with the same browser tab simultaneously?

Yes. The embedded browser in a Space is a server-provisioned Chrome instance that both the human and the agent connect to through the Space event system. Both can scroll, click, and type — their actions are multiplexed to the same instance in real time. Neither party is watching a video of the other's screen; they are operating the same live session.

How does this compare to Browserbase's Live View?

Browserbase Live View is a remote desktop window onto an agent-controlled Chrome session — the human sees what the agent is doing and can take over by sending inputs through the same remote stream. In Spacebar, there is no takeover. Neither party has default control. The human is not observing the agent's session — they are co-present in a shared session. The agent observes a continuous video feed rather than screenshots.

Does the agent see the human's mouse movements and hesitations?

Yes. The agent receives a continuous WebRTC video stream of the human's browser interaction — the full session video, not screenshots taken before and after each action. Mouse trajectories, scroll pauses, hover states, abandoned form inputs, and typing rhythm are all visible.

Does Spacebar use Browserbase or similar infrastructure under the hood?

The architecture difference sits above the headless-browser primitive, not within it. Spacebar uses similar infrastructure at the base layer — a server-provisioned, programmatically controlled browser instance. The distinction is in the observation layer (continuous video vs. screenshots), the control layer (multiplexed simultaneous input vs. sequential handoff), and the integration layer.

How does the continuous video stream work technically?

The server-provisioned Chrome instance captures its display output as a WebRTC stream, which is delivered to both the human participant (as the browser they see and interact with) and the agent (as a video input alongside DOM access via Chrome DevTools Protocol). Both streams are timestamped in the Space event log, synchronized with audio and canvas events to millisecond precision.
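To make the synchronization claim concrete, here is a minimal sketch of how a timestamped DOM event could be matched to the video frame that was on screen when it happened. The frame timestamps, event shapes, and the `frame_at` helper are all hypothetical illustrations, not part of the Spacebar SDK; the only thing taken from the text above is that both streams carry millisecond timestamps on a shared clock.

```python
from bisect import bisect_right

def frame_at(frame_times_ms, event_time_ms):
    """Return the index of the last frame captured at or before the event, or None."""
    i = bisect_right(frame_times_ms, event_time_ms)
    return i - 1 if i > 0 else None

# Hypothetical capture at roughly 30 fps: one frame about every 33 ms.
frame_times_ms = [0, 33, 67, 100, 133]

# Hypothetical DOM events stamped on the same clock as the video frames.
dom_events = [
    {"t_ms": 40, "type": "focus", "target": "#email"},
    {"t_ms": 100, "type": "input", "target": "#email"},
    {"t_ms": 140, "type": "blur", "target": "#email"},
]

# Pair each DOM event with the frame the viewer would have been seeing.
aligned = [(e["type"], frame_at(frame_times_ms, e["t_ms"])) for e in dom_events]
print(aligned)
```

With a shared clock, the lookup is a binary search, so an agent could jump from any structural event to its visual context cheaply.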

What happens when the human and agent both try to interact at the same moment?

Both inputs are routed to the browser instance and both are reflected in the video stream. There is no locking mechanism — the same multiplayer model as the canvas. If the agent clicks a button while the human is typing in a field, both actions occur. For scenarios requiring strict turn-taking, the agent's permission model can be configured to be observer-only during human turns.
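The behavior described above can be sketched as a router with no lock and an optional observer-only set. Everything here (`SharedBrowser`, `InputRouter`, the action strings) is a hypothetical stand-in written for illustration; it is not the Spacebar permission API, whose actual shape is not described on this page.

```python
from dataclasses import dataclass, field

@dataclass
class SharedBrowser:
    """Stand-in for the shared Chrome instance: it only records applied inputs."""
    applied: list = field(default_factory=list)

    def apply(self, actor, action):
        self.applied.append((actor, action))

@dataclass
class InputRouter:
    browser: SharedBrowser
    observer_only: set = field(default_factory=set)  # actors whose inputs are dropped

    def route(self, actor, action):
        """Forward an input unless the actor is currently observer-only."""
        if actor in self.observer_only:
            return False
        self.browser.apply(actor, action)
        return True

browser = SharedBrowser()
router = InputRouter(browser)

# No lock: interleaved inputs from both parties all reach the browser.
router.route("human", "type:h")
router.route("agent", "click:#submit")
router.route("human", "type:i")

# Strict turn-taking: make the agent observer-only during the human's turn.
router.observer_only.add("agent")
dropped = not router.route("agent", "click:#cancel")
print(browser.applied, dropped)
```

The design choice worth noting is that turn-taking is layered on top as a permission, rather than built into the routing path as a lock.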

Is this available in the standard SDK or only through a custom deployment?

It is available in the standard SDK via space.browser.open(url). Contact engineering@spacebar.ai for implementation details.

Not a remote window.
A shared tab.

Every other approach to shared browser infrastructure is a takeover model: the agent drives, the human steps in when something breaks, control passes back. Spacebar inverts both halves. The human and the agent are co-present in the same live tab — simultaneously, natively, with full behavioral signal flowing in both directions.

The problem with every other approach

The takeover model and why it fails.

Browserbase, Browser Use, and every headless-browser agent stack built on the same primitives share a fundamental architectural assumption: the agent is in control, the human is a backup. That assumption produces two problems — one product, one technical.

The product problem

Remote control produces guarded behavior.

When a human enters a Browserbase Live View session, they know they are remote-controlling something fragile. They slow down. They retry less. They hesitate. They perform for the system rather than using it. Guarded behavior is not representative behavior. The resulting interaction is distorted — not useless, but not what humans actually do when they work on the web.

The technical problem

Discrete screenshots discard most of the signal.

Screenshot-based agents observe two states: before an action, after an action. Everything between — the hover that revealed a tooltip, the scroll that paused at a section, the field that was almost filled before reconsidering — is invisible. The agent learns from a flipbook. The behavioral signal that actually encodes how humans navigate and decide is precisely what gets discarded.
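A toy illustration of the flipbook problem, under simplifying assumptions: the trace below is invented, and the before/after observer is modeled as seeing only committed actions (clicks and submits). Real screenshot agents differ in detail, but the shape of the loss is the same.

```python
# Hypothetical trace of one checkout interaction (timestamps in ms).
trace = [
    {"t": 0,    "kind": "hover",  "target": "#help-icon"},  # tooltip revealed
    {"t": 400,  "kind": "scroll", "target": "#pricing"},    # pause at a section
    {"t": 900,  "kind": "type",   "target": "#promo", "text": "SAVE10"},
    {"t": 1500, "kind": "clear",  "target": "#promo"},      # reconsidered
    {"t": 2100, "kind": "click",  "target": "#checkout"},   # the committed action
]

COMMITTED = {"click", "submit"}

def flipbook_view(events):
    """Simplified before/after observer: only committed actions are attributable."""
    return [e for e in events if e["kind"] in COMMITTED]

def film_view(events):
    """Continuous observer: every event is retained, in order."""
    return list(events)

# Everything the before/after observer never sees.
lost = [e["kind"] for e in trace if e not in flipbook_view(trace)]
print(lost)
```

In this trace, four of five events vanish, including the typed-then-cleared promo code, which leaves the field empty in both the before and after states.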

Head-to-head comparison

Every dimension that matters.

This is not a UX comparison. It is an architectural one. The differences below are structural — they follow from the design of the system, not from implementation quality.

Spacebar vs. Browserbase / Browser Use / Cloudflare Browser Run

Who drives (control model)

Spacebar: Human and agent share control simultaneously. Neither party is "in control" by default. Both can click, scroll, and type on the same live instance at the same time. Neither blocks the other.

Browserbase / Browser Use / Cloudflare Browser Run: Agent drives. Human intervenes via remote desktop overlay (Live View) when the agent gets stuck — typically on MFA, CAPTCHAs, or ambiguous UI states. After intervention, control passes back to the agent. One driver at a time.

What the agent observes (perception model)

Spacebar: Continuous video feed of the human's browser interaction — the full WebRTC stream, not frames. Every hover state, scroll position, mouse trajectory, typing rhythm, and abandoned input is visible. Plus live DOM access for structural context alongside the video.

Browserbase / Browser Use / Cloudflare Browser Run: Discrete screenshots (typically PNG frames) before and after each action, or accessibility tree snapshots. The agent sees states, not transitions. Agent Browser's wireframe approach reduces token cost, but at the expense of behavioral signal. Each screenshot costs roughly 10,000 tokens.

What the human experiences (UX model)

Spacebar: Native browsing. The embedded browser in a Space behaves exactly like a local browser tab. Clicks, scrolls, and typing feel immediate because they are — the human is operating a server-provisioned browser, not watching a video of one.

Browserbase / Browser Use / Cloudflare Browser Run: Remote control. Browserbase Live View is a remote desktop window onto an agent-controlled Chrome instance. Latency is inherent to the stream. Users report guarded, cautious behavior — they know they are driving something they do not own.

Observation fidelity (signal richness)

Spacebar: Film. Continuous signal. The mouse hover that revealed a tooltip, the scroll that paused, the field the user almost filled before reconsidering — all visible, all timestamped, all synchronized with canvas and voice streams.

Browserbase / Browser Use / Cloudflare Browser Run: Flipbook. Discrete states. Information between states is discarded. The behavioral signal that encodes how humans navigate, decide, and recover — the most valuable data for agent training — is precisely what the screenshot model cannot capture.

Concurrent occupancy (multi-party model)

Spacebar: True simultaneous occupancy. Human and agent interact with the same live instance at the same time. Actions stream to both parties in real time — the same multiplayer model as cursor positions on the Spacebar canvas.

Browserbase / Browser Use / Cloudflare Browser Run: Sequential handoff. One party holds control while the other watches. The Convergence pattern and similar approaches automate the handoff timing, but the underlying model is still "now you, now me." True simultaneous co-presence is not supported.

Multi-party support (number of actors)

Spacebar: Any number of humans and agents can be present in the same shared browser session simultaneously. Each participant's actions are individually attributed and streamed. Useful for collaborative review, supervised agent runs, and training scenarios.

Browserbase / Browser Use / Cloudflare Browser Run: Designed for one agent plus one human override. Multi-party scenarios require custom orchestration outside the standard API.

Integration surface (what it connects to)

Spacebar: The shared browser is one surface in a live Space — alongside voice, canvas, documents, and tool calls. The agent's browser actions, voice turns, and canvas changes are all timestamped in the same event stream. Context is never lost between modalities.

Browserbase / Browser Use / Cloudflare Browser Run: Standalone browser infrastructure. Integration with other modalities (voice, canvas, documents) requires external orchestration. Browser state and LLM context exist in separate systems.

Architecture position (where it sits)

Spacebar: Above the headless-browser primitive. Spacebar uses infrastructure like Browserbase under the hood. The distinction is in the observation and control layer built on top — multiplayer by design, continuous video by design, multimodal by design.

Browserbase / Browser Use / Cloudflare Browser Run: The headless-browser primitive itself. Browserbase is excellent infrastructure for programmatic browser automation. The limitations described here are inherent to that layer, not to Browserbase's implementation quality.

Architecture

How the shared browser actually works.

The shared browser in a Spacebar Space is a server-provisioned Chrome instance streamed over WebRTC. Both the human and the agent connect to the same session through the Spacebar event system. Here is the stack.

Spacebar shared browser architecture

Observation layer

Continuous WebRTC video stream of the human's browser interaction — mouse, scroll, typing, hover states — plus live DOM access via Chrome DevTools Protocol. Both streams are timestamped and synchronized with the Space event log. This is what makes the observation model fundamentally different from screenshots.

Control layer

Symmetric input routing — human inputs (mouse events, keyboard events, scroll events) and agent inputs are both routed to the same Chrome instance through the same event system. Neither party has priority. Both streams are multiplexed in real time. The human's browser behavior is native because the events are native.

Session layer

Server-provisioned Chrome instance managed as a Space resource. Session state (URL, DOM, cookies, local storage) persists across participant joins and leaves. The session can be replayed frame-by-frame from the event log. Isolated per Space; no cross-tenant state.
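As a rough sketch of what replay from an event log involves, the function below folds logged events into session state as of a given timestamp. This is state-level replay rather than the frame-by-frame video replay described above, and the event schema (`navigate`, `input`, `actor`) is an assumption made for the example, not Spacebar's actual log format.

```python
def replay(events, until_ms):
    """Fold every event with t <= until_ms into a reconstructed session state."""
    state = {"url": None, "fields": {}}
    for e in sorted(events, key=lambda e: e["t"]):
        if e["t"] > until_ms:
            break
        if e["kind"] == "navigate":
            state["url"] = e["url"]
            state["fields"] = {}  # a newly loaded page starts with empty fields
        elif e["kind"] == "input":
            state["fields"][e["target"]] = e["value"]
    return state

# Hypothetical log: the agent navigates, then the human fills in a field.
log = [
    {"t": 0,   "actor": "agent", "kind": "navigate", "url": "https://example.com/signup"},
    {"t": 500, "actor": "human", "kind": "input", "target": "#email", "value": "a@"},
    {"t": 900, "actor": "human", "kind": "input", "target": "#email", "value": "a@example.com"},
]

print(replay(log, 600))
print(replay(log, 1000))
```

Because the state at any moment is a pure function of the log prefix, a participant who joins late can be brought up to date the same way a reviewer scrubs backwards.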

Infrastructure layer

Headless browser infrastructure — similar to what Browserbase, Browser Use, and Playwright use at the primitive level. The distinction is not here. It is in the observation and control layers above.

Multiplayer layer

The Space event stream — the same CRDT-backed, per-space-ownership-locked substrate that handles canvas state, voice turns, and tool calls. Browser events are first-class citizens of the event log alongside everything else. This is what enables synchronized multi-modal context for the agent.
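The idea of browser events sharing one log with other modalities can be illustrated with a plain timestamp merge. This sketch is deliberately not a CRDT and does not model ownership; the tuples and stream names are invented, and it only shows what a single ordered timeline gives the agent.

```python
import heapq

# Hypothetical per-modality streams, each already sorted by timestamp (ms).
browser = [(120, "browser", "click #buy"), (480, "browser", "scroll")]
voice = [(100, "voice", "human: 'try the annual plan'"), (450, "voice", "agent: 'done'")]
canvas = [(300, "canvas", "note added")]

# Merge the sorted streams into one timeline ordered by timestamp.
timeline = list(heapq.merge(browser, voice, canvas, key=lambda e: e[0]))

for t, source, what in timeline:
    print(f"{t:>4} ms  {source:<8} {what}")
```

Read top to bottom, the merged timeline shows the spoken request immediately before the click that answered it, which is context that separate logs would force the agent to reconstruct.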

Signal comparison

What the agent actually receives.

The behavioral signal available to an agent running in a Spacebar shared browser is categorically richer than what screenshot-based systems provide. This is not a matter of degree. It is a structural difference in what information exists.

Spacebar — what the agent sees

Full video stream — continuous WebRTC feed, not frames. Every second of the human's browser interaction is visible.

Mouse trajectories — not just click positions. Where the cursor traveled, what it hovered over, how long it paused.

Scroll behavior — velocity, direction, pauses. The scroll that slowed at a section before continuing encodes attention.

Hover states — tooltips revealed, dropdowns exposed, link previews shown. The human's exploration of the UI is fully visible.

Abandoned inputs — fields that were focused and cleared, text typed and deleted, submissions started and cancelled.

Typing rhythm — keystroke cadence, backspace frequency, hesitation patterns. Encodes confidence and uncertainty.

Live DOM access — structured page state alongside the video. The agent has both the visual signal and the semantic structure simultaneously.

Synchronized context — browser actions are timestamped alongside voice turns, canvas changes, and tool calls in the same event stream.
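Several of the signals listed above can be expressed as simple features over a timestamped event list. The sketch below assumes a hypothetical event schema (`focus`, `key`, `blur`) and computes three of them: keystroke gaps, backspace ratio, and abandoned fields. It illustrates the kind of information involved, not how Spacebar or any agent actually extracts it.

```python
# Hypothetical events for one field: the user types, deletes, and leaves it empty.
events = [
    {"t": 0,    "kind": "focus", "target": "#coupon"},
    {"t": 200,  "kind": "key",   "target": "#coupon", "key": "S"},
    {"t": 350,  "kind": "key",   "target": "#coupon", "key": "A"},
    {"t": 900,  "kind": "key",   "target": "#coupon", "key": "Backspace"},
    {"t": 1000, "kind": "key",   "target": "#coupon", "key": "Backspace"},
    {"t": 1400, "kind": "blur",  "target": "#coupon", "value": ""},
]

def keystroke_gaps(evts):
    """Milliseconds between consecutive keystrokes (typing rhythm)."""
    times = [e["t"] for e in evts if e["kind"] == "key"]
    return [b - a for a, b in zip(times, times[1:])]

def backspace_ratio(evts):
    """Share of keystrokes that were deletions."""
    keys = [e["key"] for e in evts if e["kind"] == "key"]
    return keys.count("Backspace") / len(keys) if keys else 0.0

def abandoned_fields(evts):
    """Fields that received keystrokes but were empty when focus left."""
    typed = {e["target"] for e in evts if e["kind"] == "key"}
    return sorted(
        e["target"] for e in evts
        if e["kind"] == "blur" and e["value"] == "" and e["target"] in typed
    )

gaps = keystroke_gaps(events)
print(gaps, max(gaps), backspace_ratio(events), abandoned_fields(events))
```

The longest gap here falls right before the first backspace, which is the hesitation-then-reversal pattern the list above describes.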

Browserbase / Browser Use — what the agent sees

Screenshots — PNG frames of the rendered page before and after each action. Roughly 10,000 tokens each; expensive and limited.

Accessibility tree snapshots — structured but lossy. Visual presentation, hover states, and animation states are not captured.

Action results — the outcome of tool calls (click, type, scroll). Not the process. Not what the human considered before acting.

No hover state — tooltips, dropdowns, and previews triggered by mouse proximity are invisible to the agent.

No trajectory — the path between states is not recorded. Only the states themselves.

No abandoned inputs — actions that were started and cancelled leave no trace. The agent sees only completed actions.

No behavioral timing — there is no signal about hesitation, reconsideration, or uncertainty in the interaction.
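To put the screenshot cost in perspective, here is a back-of-the-envelope calculation using the approximate 10,000-token figure quoted above. The two counting schemes (a fresh before and after frame per action, or reusing each "after" frame as the next "before") are assumptions for illustration; actual agent stacks vary.

```python
TOKENS_PER_SCREENSHOT = 10_000  # approximate figure quoted above

def tokens_without_reuse(actions):
    """One screenshot before and one after every action."""
    return actions * 2 * TOKENS_PER_SCREENSHOT

def tokens_with_reuse(actions):
    """Each 'after' frame doubles as the next action's 'before' frame."""
    return (actions + 1) * TOKENS_PER_SCREENSHOT if actions > 0 else 0

# A hypothetical 25-action task.
print(tokens_without_reuse(25))
print(tokens_with_reuse(25))
```

Either way, a modest 25-action task lands in the hundreds of thousands of tokens while still recording only the states, not what happened between them.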


Why this matters for training data

The behavioral richness of Spacebar's browser signal is the structural reason the shared browser is central to the training data thesis. A human working in a Spacebar Space is not performing for a system — they are working naturally. That produces the high-realism anchor distribution that frontier model training needs but cannot get from annotation pipelines. See spacebar.ai/labs for the full argument.

Why native operability matters

Guarded behavior is not representative behavior.

When a human enters a Browserbase Live View session, they know they are remote-controlling something fragile. The feedback loop is slightly off — events are routed through a stream, not executed locally. Users report slowing down, retrying less, and generally performing careful, deliberate actions rather than natural, habitual ones.

This is not a latency problem that can be engineered away. It is a behavioral consequence of the remote control model itself. The human is not working. They are demonstrating. The resulting interaction data reflects demonstration behavior, not work behavior.

Spacebar's embedded browser executes events natively on the server-provisioned instance. The human's clicks and keystrokes are not routed through a remote desktop stream — they are the direct input to the browser. The human does not feel like they are remote-controlling something fragile, because they are not. They are browsing. That produces natural behavior — which is the only kind of behavior worth recording.

This is also why the training data argument holds. A human working naturally in a Spacebar Space is generating trajectories that reflect how humans actually use the web. A human performing carefully in a Live View session is generating trajectories that reflect how humans behave when they know they are being observed. These are different distributions. For training a model to work with humans on real tasks, only one of them is useful.

Frequently asked questions
