KasmVNC + Docker: Building Isolated GUI Environments at Scale

Running a single ROS2 development environment with Gazebo, RViz2, and rqt in a browser is a solved problem. There are several open-source projects that do it reasonably well. Running hundreds of them simultaneously, isolated from each other, on shared infrastructure, with per-tool display routing and sub-second session startup times — that's a different problem entirely.

This is a technical deep-dive into the architecture we built for RoboLab. It's long, it's detailed, and it's aimed at people who are building similar systems or who want to understand how browser-native simulation actually works under the hood.

Prerequisites: Familiarity with Docker, Linux display servers (X11/Xvfb), VNC/WebSocket protocols, and basic ROS2 concepts. This is not an introductory post.

The Problem Space

ROS2 tooling is designed for desktop Linux. Gazebo renders via OpenGL. RViz2 needs an X11 display. rqt uses Qt widgets. None of these were designed with "stream to a browser" in mind.

The naive approach is to run a full virtual desktop environment (XFCE, LXDE) inside a container, stick VNC on top of it, and stream the whole thing via noVNC. This works. It's also wasteful, slow to start, and gives you one monolithic stream that students have to navigate like a remote desktop session rather than a clean web interface.

We needed something different: per-tool display isolation, where each GUI application gets its own virtual display, its own VNC session, and its own stream that can be embedded independently in the browser UI.

Why KasmVNC

KasmVNC is a fork of TigerVNC with a built-in WebSocket server and a modern web client. The key difference from standard VNC+WebSocket proxies (noVNC, etc.) is that KasmVNC handles the WebSocket transport natively — no separate websockify process needed, lower latency, better compression.

It also supports running in a headless mode with Xvfb as the display backend, which is exactly what we need: no physical display required, fully containerisable, fully scriptable.

# Per-tool display setup inside each container
# Each tool gets its own DISPLAY number

start_tool() {  # usage: start_tool <display> <rfbport> <command...>
  local display="$1" port="$2"; shift 2
  Xvfb ":${display}" -screen 0 1280x720x24 &
  # Wait for the X socket so the VNC server and the tool don't race Xvfb
  until [ -S "/tmp/.X11-unix/X${display}" ]; do sleep 0.1; done
  kasmvnc -display ":${display}" -rfbport "${port}" &
  DISPLAY=":${display}" "$@" &
}

start_tool 10 5910 gz sim worlds/turtlebot3.world  # Gazebo
start_tool 11 5911 rviz2                           # RViz2
start_tool 12 5912 rqt                             # rqt

Each tool runs on its own X display number, with its own KasmVNC instance listening on its own port. The browser embeds each stream in its own iframe, completely independently. A crash in Gazebo doesn't kill RViz2's stream. A heavy render in one tool doesn't stall another tool's capture and encode loop, although all three still share the container's CPU and GPU allocation.
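
The display and port pairs in the setup script follow the usual VNC convention of 5900 plus the display number. A minimal sketch of that mapping, assuming the convention holds for every tool added later (the helper name rfb_port_for_display is hypothetical, not part of KasmVNC):

```shell
# Hypothetical helper: derive the listening port from an X display number
# using the 5900 + display convention (:10 -> 5910, :11 -> 5911, :12 -> 5912).
rfb_port_for_display() {
  display="${1#:}"              # accept ":10" or "10"
  case "$display" in
    ''|*[!0-9]*) echo "invalid display: $1" >&2; return 1 ;;
  esac
  echo $((5900 + display))
}

# Print the mapping for the three tools used in this post
for d in :10 :11 :12; do
  echo "display $d -> port $(rfb_port_for_display "$d")"
done
```

Deriving the port from the display number keeps the two from drifting apart when another tool is added to the container.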

Container Architecture

Each user session gets one Docker container. The container image is built from Ubuntu 22.04 with ROS2 Humble and all tools pre-installed. Session startup time is the container cold-start time — we target under 8 seconds.

Per-user container (Docker): ROS2 Humble | Gazebo :10 | RViz2 :11 | rqt :12 | Theia IDE :3000
  → nginx reverse proxy → WebSocket router
Host: K8s pod
  → Ingress controller → Browser client

The container runs a lightweight nginx reverse proxy that routes incoming WebSocket connections to the correct KasmVNC port based on the URL path. /stream/gazebo → port 5910, /stream/rviz → port 5911, and so on. This lets the browser embed each tool by connecting to a different path on the same container endpoint.
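
The path-to-port routing can be sketched as a small generator that prints one nginx location block per tool. This is an illustration rather than our production config: the function name nginx_stream_locations, the /stream/rqt path, and the one-hour read timeout are assumptions, while the proxy directives are the standard ones nginx needs to pass a WebSocket upgrade through.

```shell
# Hypothetical generator for the in-container nginx routes.
# Each argument is a "name:port" pair, e.g. gazebo:5910.
nginx_stream_locations() {
  for pair in "$@"; do
    name="${pair%%:*}"   # text before the first colon
    port="${pair##*:}"   # text after the last colon
    cat <<EOF
location /stream/${name}/ {
    proxy_pass http://127.0.0.1:${port}/;
    proxy_http_version 1.1;
    proxy_set_header Upgrade \$http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;
}
EOF
  done
}

# Emit the routes for the three streamed tools
nginx_stream_locations gazebo:5910 rviz:5911 rqt:5912
```

The output is meant to be included inside a server block. The long read timeout is there because nginx otherwise closes a proxied connection that stays silent for 60 seconds, which a paused simulation can easily do.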

The GPU Problem

Gazebo's renderer uses OpenGL. In a containerised environment with no display hardware attached, you have two options: software rendering via Mesa's llvmpipe (CPU-based OpenGL), or GPU passthrough via the NVIDIA Container Toolkit or similar.

Our infrastructure uses a mix. GPU nodes run NVIDIA A10G instances with the container toolkit — these get hardware-accelerated Gazebo at full frame rate. CPU-only nodes fall back to LLVMpipe, which is adequate for the complexity of most teaching scenarios (a TurtleBot3 in an empty room is very different from a warehouse with hundreds of dynamic objects).

# GPU-accelerated container startup
docker run --gpus all \
  --device /dev/dri \
  -e DISPLAY=:10 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=graphics,utility \
  ayroxlabs/robolab:latest

# Mesa software fallback (no GPU required)
docker run \
  -e DISPLAY=:10 \
  -e LIBGL_ALWAYS_SOFTWARE=1 \
  -e GALLIUM_DRIVER=llvmpipe \
  ayroxlabs/robolab:latest

Session routing assigns GPU nodes preferentially to sessions running active Gazebo simulation. Learning path sessions that haven't launched the simulation yet can run on CPU nodes and migrate to GPU nodes on demand — though in practice the startup overhead of migration makes it simpler to just provision generously.
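
The choice between the two launch variants can be sketched as a helper that picks rendering flags per node type. The function names and the nvidia-smi probe are assumptions for illustration, and the real orchestrator decides placement before the container starts rather than probing on the node.

```shell
# Hypothetical helper: rendering flags for each node type,
# mirroring the two docker run variants above.
render_args() {
  case "$1" in
    gpu) echo "--gpus all --device /dev/dri -e NVIDIA_DRIVER_CAPABILITIES=graphics,utility" ;;
    cpu) echo "-e LIBGL_ALWAYS_SOFTWARE=1 -e GALLIUM_DRIVER=llvmpipe" ;;
    *)   echo "unknown node type: $1" >&2; return 1 ;;
  esac
}

# Assumption: a node counts as GPU-capable if nvidia-smi exists and lists a device.
detect_node_type() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi -L >/dev/null 2>&1; then
    echo gpu
  else
    echo cpu
  fi
}

# Print the resulting command instead of running it
echo docker run $(render_args "$(detect_node_type)") -e DISPLAY=:10 ayroxlabs/robolab:latest
```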

Session Isolation and the Shared Display Problem

X11 display isolation between containers comes for free. Each container runs its own Xvfb instances on whatever display numbers it chooses, and the X sockets behind those numbers (/tmp/.X11-unix/X10 and so on) live in the container's own filesystem and network namespaces, so display :10 in one container has nothing to do with display :10 in another.

What's trickier is ROS2 network isolation. By default, ROS2 uses DDS (Data Distribution Service) for communication, and DDS discovery is multicast-based. Without isolation, all ROS2 nodes across all containers on the same host would see each other's topics — a significant security and correctness problem.

Solution: ROS_DOMAIN_ID. Each container gets a ROS_DOMAIN_ID environment variable in the range 0–101 that is unique among the sessions on its host. DDS derives its discovery and data ports from the domain ID, so nodes in different domains do not discover each other even on the same host network.

# Container launch with isolated ROS2 domain
# domain_id is assigned per-session by the orchestrator

docker run \
  -e ROS_DOMAIN_ID=${SESSION_DOMAIN_ID} \
  -e ROS_LOCALHOST_ONLY=1 \
  ayroxlabs/robolab:latest

ROS_LOCALHOST_ONLY=1 further restricts DDS discovery to the loopback interface, so even with a misconfigured domain ID, cross-container communication is blocked at the network level.
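
Because only 102 domain IDs are available, the orchestrator has to track which ones are in use on each host. A minimal sketch of that allocation, assuming the in-use IDs are passed as arguments (next_free_domain_id is a hypothetical name, and this is not the orchestrator's actual code):

```shell
# Hypothetical allocator: print the lowest ROS_DOMAIN_ID in 0..101
# that is not in the list of IDs already in use on this host.
next_free_domain_id() {
  id=0
  while [ "$id" -le 101 ]; do
    taken=0
    for used in "$@"; do
      if [ "$used" -eq "$id" ]; then
        taken=1
        break
      fi
    done
    if [ "$taken" -eq 0 ]; then
      echo "$id"
      return 0
    fi
    id=$((id + 1))
  done
  echo "no free ROS_DOMAIN_ID on this host" >&2
  return 1
}

# Example: IDs 0, 1 and 3 are taken, so the next session gets 2
echo "ROS_DOMAIN_ID=$(next_free_domain_id 0 1 3)"
```

When the function fails, the host is full and the session has to be placed on another node.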

Latency Tuning

For a streaming GUI, latency is the dominant user experience metric. The pipeline from GPU framebuffer to pixels in the student's browser is:

1. Gazebo renders frame to X framebuffer (GPU → VRAM)
2. Xvfb reads framebuffer from VRAM (or system RAM for software rendering)
3. KasmVNC captures the changed regions (using XDamage extension for efficiency)
4. KasmVNC encodes changed regions (JPEG for complex scenes, ZRLE for UI elements)
5. Encoded data transmitted via WebSocket to browser
6. Browser decodes and paints to canvas

Steps 3–4 are where the most tuning is available. KasmVNC's -jpeg-quality and -jpeg-subsampling parameters trade image quality for bandwidth and encode time. For Gazebo (highly textured, lots of motion), we use aggressive JPEG compression, because users are watching a simulation, not inspecting a photograph. For RViz2 and rqt (mostly UI chrome with sparse data), we use ZRLE encoding, which handles solid-color regions efficiently.

# Gazebo stream — prioritise low latency over quality
kasmvnc -display :10 \
  -rfbport 5910 \
  -jpeg-quality 60 \
  -jpeg-subsampling 4X \
  -frame-rate 30

# RViz2 stream — prioritise visual fidelity
kasmvnc -display :11 \
  -rfbport 5911 \
  -jpeg-quality 85 \
  -jpeg-subsampling 2X \
  -frame-rate 20
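
To see why the encoding step matters so much, it helps to work out what the streams would cost uncompressed. The sketch below assumes 3 bytes per pixel, taken from the 24-bit depth in the Xvfb setup, and that every frame is sent in full. Damage tracking makes the real figure lower, so treat these as upper bounds.

```shell
# raw_bytes_per_second WIDTH HEIGHT BYTES_PER_PIXEL FPS
# Upper bound on stream bandwidth if every frame were sent uncompressed.
raw_bytes_per_second() {
  echo $(( $1 * $2 * $3 * $4 ))
}

echo "Gazebo raw: $(raw_bytes_per_second 1280 720 3 30) bytes/s"  # 82944000
echo "RViz2 raw:  $(raw_bytes_per_second 1280 720 3 20) bytes/s"  # 55296000
```

At roughly 83 MB/s for Gazebo alone, an uncompressed stream is far beyond what a typical browser connection can sustain, which is why the quality and subsampling settings carry so much weight.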

The Theia IDE Integration

Eclipse Theia is a browser-native IDE built on the same extension protocol as VS Code. It runs as a Node.js server inside the container and exposes a web interface — no VNC involved, it's native WebSocket from the start.

The ROS2 language support is provided by a combination of the Python language server (Pylance-compatible) and a custom ROS2 package graph extension that parses the active workspace's package.xml files and colcon build output to provide package-aware autocomplete for rclpy APIs.

The IDE and the simulation tools share the same container filesystem, so code written in Theia is immediately available to ROS2 nodes — no sync step, no file transfer. The student writes a node in the editor, runs ros2 run in the terminal, and the node launches in the same environment.

What We'd Do Differently

Some things we'd reconsider if starting from scratch:

Container-per-user is expensive. The isolation model is clean, but the resource overhead is significant. We're exploring a pod-per-user model with more aggressive resource limits and process-level isolation via user namespaces — keeping the isolation guarantees at lower per-session cost.

KasmVNC's frame rate cap. KasmVNC isn't built for the frame rates that make Gazebo feel fluid. At 30fps over a lossy connection, fast camera movements in Gazebo feel sluggish. We've been evaluating WebRTC-based alternatives (specifically the open-source WebRTC streaming from NVIDIA's CloudXR work) but haven't shipped a replacement yet.

Cold-start latency. 8 seconds is good but not great. The dominant cost is the ROS2 daemon startup and Gazebo model loading. We're experimenting with pre-warmed container pools that have ROS2 initialised but not Gazebo running — shaving another 3–4 seconds off the perceived startup time.

This architecture is open to discussion. If you're building something similar — cloud robotics platforms, browser-based simulation tools, containerised GUI streaming — we'd love to compare notes. Reach out.

The full RoboLab platform is live at app.ayroxlabs.com. The Explorer tier is free — you can see the architecture in action without a credit card.

AyroX Labs

Building embodied AI infrastructure — simulation platforms, digital twins, and real robot systems.