Making Conversational UI Components Multimodal: Text, Voice, and System Actions
Build secure multimodal assistant UIs: voice input, LLM responses, and safe local actions with modular JS components and an Action Broker.
Ship conversational, multimodal assistant UIs without sacrificing security or UX
Teams building internal assistants or customer-facing agent widgets face the same treadmill: implement voice input, show LLM responses, and let the assistant do things on the host (open files, schedule meetings, run tasks) — but doing that fast and safely is hard. This guide shows a modular, production-ready approach for JavaScript components that accept voice, render LLM responses, and trigger local actions while minimizing security, privacy, and UX risks.
The landscape in 2026 — why this matters now
By 2026 we’re seeing three forces converge:
- Ubiquity of multimodal assistants — voice + text + action flows are expected in both desktop native apps (see Anthropic’s Cowork preview) and web experiences.
- On-device speech/LLM inference plus cloud function-calling APIs reduce latency and increase privacy options.
- Regulation and platform controls — browser standards, app-store policies, and enterprise governance demand explicit, auditable permission models for any assistant that touches local resources.
That means you must design assistant components with modular boundaries: capture input, produce or render a response, and execute actions through a controlled broker. We'll show patterns for React, Vue, vanilla JS, and Web Components that follow this principle.
Core architecture — the modular pattern
Design your assistant UI around three clear layers:
- Input Layer — handles microphone, keyboard, and file attachments. Converts audio to text if needed (local or cloud ASR).
- LLM Layer — sends contextual prompts to an LLM or local model and receives structured responses. Prefer structured outputs (JSON) or function-calling to reduce hallucination.
- Action Broker — the only component permitted to perform privileged operations (open file, schedule, read directory). It enforces policies, prompts users, logs actions, and requires short-lived capability tokens for external tool integration.
Keep the UI components dumb: they should present the conversation and request actions via the broker API. This separation keeps the attack surface small and lets you swap implementations (Electron, Tauri, cloud web) without changing conversational logic.
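To make that boundary concrete, here is a minimal sketch of a broker client the UI layer could depend on. The requestApproval/execute method names match the component examples later in this guide; the transport argument (IPC, HTTP, or a Tauri command) and the approved/actionId/capabilityToken response fields are assumptions about your broker's contract, not a fixed API.

// Sketch: the only API the UI components call; all privileged work happens behind it.
export function createActionBrokerClient(transport) {
  let lastApproval = null
  return {
    // Ask the privileged side to validate the manifest and show its own confirmation UI.
    async requestApproval(action) {
      const res = await transport.send('broker/approve', { action })
      lastApproval = res.approved ? res : null
      return Boolean(res.approved)
    },
    // Execute only with the short-lived capability token returned by a prior approval.
    // `action` is accepted for parity with the component examples; execution keys off the approved actionId.
    async execute(action) {
      if (!lastApproval) throw new Error('execute() requires a prior approval')
      const { actionId, capabilityToken } = lastApproval
      lastApproval = null // tokens are single-use in this sketch
      return transport.send('broker/execute', { actionId, capabilityToken })
    }
  }
}

The React, Vue, and Web Component recipes below assume an object with exactly this two-method surface.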
Why an Action Broker?
- Centralizes permissions and auditing.
- Enforces least privilege and runtime confirmation flows.
- Decouples LLM intent from execution — you never run LLM-suggested code directly in the page.
Security principles (non-negotiable)
- Least privilege: grant only the scopes required for each action. Use short-lived capability tokens.
- User confirmation: require explicit user approval for sensitive actions (file access, API keys, shell commands).
- Sandbox execution: run local actions in a privileged background process or native API (not in the renderer). E.g., Tauri or a secure WebExtension background script.
- Action manifests: LLM responses that suggest an action must include an action manifest that the broker validates.
- Audit trail: sign and persist user approvals and action outcomes for debugging and compliance.
- Limit tool use: use rate limits, size limits for files, and domain/permission whitelists for networked actions.
Design rule: never execute raw code from the LLM. Always convert suggestions into a typed action object that the Action Broker validates.
Action manifest: schema and example
Standardize the shape of executable actions the LLM can emit. A lightweight example manifest in JSON (type-first):
{
  "type": "open_file", // enum: open_file|schedule|create_todo|send_email
  "target": {
    "path": "/Users/alex/notes/project.md",
    "preview": true
  },
  "meta": {
    "requester": "assistant",
    "confidence": 0.92
  }
}
The broker must validate allowed types, canonicalize paths (no ../ escapes), check scopes, and ask the user if the action touches sensitive locations.
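A minimal broker-side validator along those lines might look like the following sketch (Node.js). The allowed and sensitive roots are hypothetical policy values; in practice they would come from configuration, and anything flagged sensitive would trigger the confirmation flow described later.

// Sketch: broker-side manifest validation (illustrative names and policy values).
import path from 'node:path'

const ALLOWED_TYPES = new Set(['open_file', 'schedule', 'create_todo', 'send_email'])
const ALLOWED_ROOTS = ['/Users/alex/notes']          // example scope from policy
const SENSITIVE_ROOTS = ['/Users/alex/.ssh', '/etc'] // always require explicit confirmation

export function validateManifest(manifest) {
  if (!ALLOWED_TYPES.has(manifest.type)) {
    return { ok: false, reason: 'unknown action type' }
  }
  if (manifest.type === 'open_file') {
    if (typeof manifest.target?.path !== 'string') return { ok: false, reason: 'missing file path' }
    // Canonicalize to defeat ../ escapes, then check the result stays inside an allowed root.
    const resolved = path.resolve(manifest.target.path)
    const inScope = ALLOWED_ROOTS.some(root => resolved === root || resolved.startsWith(root + path.sep))
    if (!inScope) return { ok: false, reason: 'path outside allowed scope' }
    const sensitive = SENSITIVE_ROOTS.some(root => resolved.startsWith(root))
    return { ok: true, requiresConfirmation: true, sensitive, resolvedPath: resolved }
  }
  // Non-file actions still go through user confirmation by default.
  return { ok: true, requiresConfirmation: true }
}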
Voice input options (2026 update)
Use a hybrid approach depending on your threat model and latency needs:
- Web Speech API / SpeechRecognition — supported in many browsers for quick prototypes, but a fallback is needed for cross-browser parity.
- On-device ASR models — running a small speech model in the browser or Electron process reduces data leakage and latency; popular for enterprise offline scenarios in 2026.
- Cloud ASR with differential privacy — send raw audio to cloud ASR when acceptable; ensure PII filtering and short retention.
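A small capture helper can encode that choice at runtime: prefer the browser or on-device recognizer when present, and only fall back to cloud transcription when policy allows. The cloudTranscribe callback and the fixed five-second clip below are illustrative placeholders, not a recommended recording strategy.

// Sketch: pick a transcription strategy based on capability and policy (illustrative).
export async function captureUtterance({ allowCloud = false, cloudTranscribe } = {}) {
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition
  if (Recognition) {
    // Browser recognition; note that some engines still route audio to a vendor service,
    // so verify the behavior of your target browsers against your threat model.
    return new Promise((resolve, reject) => {
      const rec = new Recognition()
      rec.lang = 'en-US'
      rec.onresult = e => resolve(e.results[0][0].transcript)
      rec.onerror = e => reject(e.error)
      rec.start()
    })
  }
  if (allowCloud && cloudTranscribe) {
    // Record a short clip with MediaRecorder and hand it to your cloud ASR wrapper.
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
    const chunks = []
    const recorder = new MediaRecorder(stream)
    recorder.ondataavailable = e => chunks.push(e.data)
    await new Promise(resolve => {
      recorder.onstop = resolve
      recorder.start()
      setTimeout(() => recorder.stop(), 5000)
    })
    stream.getTracks().forEach(t => t.stop())
    return cloudTranscribe(new Blob(chunks, { type: 'audio/webm' }))
  }
  throw new Error('No permitted speech input path available; fall back to text entry')
}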
Integration recipes with code
Below are minimal, opinionated examples you can adapt. Each sample follows the modular architecture: Input -> LLM -> Broker. For brevity, these examples use a mock LLM API and a client-side Action Broker that delegates to a privileged backend.
React (16.8+)
/* Minimal React assistant component */
import {useState, useRef} from 'react'

function Assistant({llmClient, actionBroker}){
  const [messages, setMessages] = useState([])
  const recognitionRef = useRef(null)

  function startVoice(){
    // Feature-detect both the unprefixed and webkit-prefixed constructors before instantiating
    const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition
    if(!Recognition) return // fall back to text entry when speech input is unavailable
    const rec = new Recognition()
    rec.lang = 'en-US'
    rec.onresult = e => {
      const text = e.results[0][0].transcript
      handleUserText(text)
    }
    rec.start()
    recognitionRef.current = rec
  }

  async function handleUserText(text){
    setMessages(m => [...m, {role: 'user', text}])
    const resp = await llmClient.send({input: text})
    setMessages(m => [...m, {role: 'assistant', text: resp.output}])
    if(resp.action){
      // Show the suggested action and request execution through the broker
      const approved = await actionBroker.requestApproval(resp.action)
      if(approved) await actionBroker.execute(resp.action)
      setMessages(m => [...m, {role: 'system', text: approved ? 'Action completed' : 'Action cancelled'}])
    }
  }

  return (
    <div>
      <button onClick={startVoice}>Speak</button>
      <div>{messages.map((msg, i) => <p key={i}><strong>{msg.role}</strong>: {msg.text}</p>)}</div>
    </div>
  )
}

Vue 3 (Composition API)
<script setup>
import {ref} from 'vue'

const messages = ref([])
const llmClient = useLLMClient() // inject your client
const broker = useActionBroker() // inject your broker wrapper

function startVoice(){
  const rec = new (window.SpeechRecognition || window.webkitSpeechRecognition)()
  rec.onresult = e => handleUserText(e.results[0][0].transcript)
  rec.start()
}

async function handleUserText(text){
  messages.value.push({role: 'user', text})
  const resp = await llmClient.send({input: text})
  messages.value.push({role: 'assistant', text: resp.output})
  if(resp.action){
    const ok = await broker.requestApproval(resp.action)
    if(ok) await broker.execute(resp.action)
    messages.value.push({role: 'system', text: ok ? 'Done' : 'Cancelled'})
  }
}
</script>

<template>
  <button @click="startVoice">Speak</button>
  <div v-for="(m, i) in messages" :key="i"><strong>{{m.role}}</strong>: {{m.text}}</div>
</template>
Vanilla JS + Web Component
class AssistantWidget extends HTMLElement{
  constructor(){
    super()
    this.attachShadow({mode: 'open'})
    this.shadowRoot.innerHTML = `<button id="btn">Speak</button><div id="log"></div>`
    this.log = this.shadowRoot.getElementById('log')
  }

  connectedCallback(){
    this.shadowRoot.getElementById('btn').addEventListener('click', () => this.startVoice())
  }

  async startVoice(){
    const rec = new (window.SpeechRecognition || window.webkitSpeechRecognition)()
    rec.onresult = async e => {
      const text = e.results[0][0].transcript
      this.append('user', text)
      const resp = await window.llmClient.send({input: text})
      this.append('assistant', resp.output)
      if(resp.action){
        const ok = await window.actionBroker.requestApproval(resp.action)
        if(ok) await window.actionBroker.execute(resp.action)
        this.append('system', ok ? 'Action executed' : 'Cancelled')
      }
    }
    rec.start()
  }

  append(role, text){
    const p = document.createElement('p')
    // Use textContent, not innerHTML, so model output cannot inject markup into the page
    p.textContent = `${role}: ${text}`
    this.log.appendChild(p)
  }
}

customElements.define('assistant-widget', AssistantWidget)
Action Broker: sample contract and server-side responsibilities
The Action Broker should run in a privileged environment (native backend, background service worker, or native host). Its responsibilities:
- Validate action manifests against policy.
- Request user confirmation via a secure modal (cannot be spoofed by page JS).
- Perform the action in a sandboxed process and return results or error codes.
- Emit signed audit records for every approval/execution.
Example API surface (client-to-broker):
POST /broker/approve { action: {...}, userId, sessionId }
POST /broker/execute { actionId, capabilityToken }
GET /broker/actions/:id/status
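If the broker sits behind an HTTP surface like this one, the transport assumed in the earlier broker-client sketch can be a couple of fetch calls. The response fields (approved, actionId, capabilityToken, state) are assumptions about what your broker returns.

// Sketch: HTTP transport for the broker surface above (response shape is assumed).
export function createHttpBrokerTransport(baseUrl, { userId, sessionId }) {
  return {
    async send(route, body) {
      const res = await fetch(`${baseUrl}/${route}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ ...body, userId, sessionId })
      })
      if (!res.ok) throw new Error(`Broker error ${res.status}`)
      return res.json()
    },
    // Poll GET /broker/actions/:id/status until the sandboxed execution settles.
    async waitForCompletion(actionId) {
      for (;;) {
        const res = await fetch(`${baseUrl}/broker/actions/${actionId}/status`)
        const status = await res.json()
        if (status.state !== 'pending') return status
        await new Promise(r => setTimeout(r, 500))
      }
    }
  }
}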
UX patterns: confirmation, undo, and transparency
- Progressive disclosure: show summarized action suggestions first, then details after the user taps "preview".
- Explainability: require the assistant to include a short rationale with every action (source citations if the action was prompted by a document).
- One-tap undo: any destructive action should be undoable for a time window; log the state before changes (see the sketch after this list).
- Visual affordances: use badges, icons, and color to denote actions that touch local state versus external services.
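For the one-tap undo item, a small broker-side helper can snapshot state before a destructive action and keep the reversal available for a fixed window. The snapshot, apply, and restore hooks below are hypothetical callbacks each action implementation would supply.

// Sketch: time-boxed undo for destructive actions (snapshot/apply/restore are hypothetical hooks).
export async function runUndoable({ snapshot, apply, restore }, windowMs = 30000) {
  const before = await snapshot()   // capture state prior to the change
  const result = await apply()      // perform the destructive action
  let undoOpen = true
  const timer = setTimeout(() => { undoOpen = false }, windowMs)
  return {
    result,
    async undo() {
      if (!undoOpen) throw new Error('Undo window has expired')
      clearTimeout(timer)
      undoOpen = false
      await restore(before)         // roll back using the pre-change snapshot
    }
  }
}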
Performance and accessibility
Measure these metrics:
- Time to first response for LLM output (how long until the first streamed token or sentence appears).
- End-to-end latency for voice->ASR->LLM->action.
- Memory usage for on-device models (especially in Electron/desktop wrappers).
Accessibility checklist:
- Keyboard-only operation for all flows.
- ARIA live regions for assistant responses (see the sketch after this checklist).
- Alternative text entry when voice is unavailable.
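For the live-region item, the assistant's transcript container can announce new replies without stealing focus. A minimal sketch, assuming the markup shown in the comment:

// Sketch: announce assistant replies via an ARIA live region.
// Assumes: <div id="assistant-log" role="log" aria-live="polite"></div> in the component template.
const log = document.getElementById('assistant-log')

function appendAssistantMessage(text) {
  const p = document.createElement('p')
  p.textContent = text   // textContent keeps model output from injecting markup
  log.appendChild(p)     // aria-live="polite" makes screen readers announce the new content
}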
Local execution environments: Electron vs Tauri vs Web
Choice of host impacts security and capabilities:
- Tauri (recommended 2026): smaller binary sizes, stronger Rust-based isolation, and granular permission models. Good for enterprise desktop assistants that need file system access with minimal attack surface.
- Electron: mature ecosystem and tooling, but requires careful hardening (keep nodeIntegration off, enable contextIsolation and sandboxing, set a strict CSP, avoid the legacy remote module, and move privileged code into the main process or a separate native helper). See the hardening sketch after this list.
- Pure Web: limited file system access (File System Access API) and stricter browser security; great for public-facing assistant widgets that should never touch system-level resources.
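For the Electron route, here is a minimal sketch of that hardening; the broker:* channel names and the handler bodies are placeholders for your broker logic.

// main.js — sketch of a hardened BrowserWindow; channel names are illustrative.
const { app, BrowserWindow, ipcMain } = require('electron')
const path = require('node:path')

app.whenReady().then(() => {
  const win = new BrowserWindow({
    webPreferences: {
      contextIsolation: true,  // renderer cannot reach into preload/Node internals
      nodeIntegration: false,  // no Node APIs in the renderer
      sandbox: true,           // OS-level sandbox for the renderer process
      preload: path.join(__dirname, 'preload.js')
    }
  })
  win.loadFile('index.html')
  // Privileged work stays in the main process behind explicit IPC handlers.
  ipcMain.handle('broker:approve', (_event, action) => { /* validate manifest + native confirmation */ })
  ipcMain.handle('broker:execute', (_event, payload) => { /* sandboxed execution */ })
})

// preload.js — expose only the narrow broker surface to the page.
const { contextBridge, ipcRenderer } = require('electron')
contextBridge.exposeInMainWorld('actionBroker', {
  requestApproval: action => ipcRenderer.invoke('broker:approve', action),
  execute: payload => ipcRenderer.invoke('broker:execute', payload)
})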
Mitigating hallucinations and unsafe tool calls
LLMs still hallucinate or overreach. Reduce unsafe calls by:
- Using structured outputs (JSON) or function-calling primitives supported by major LLM vendors.
- Applying validators on the client/broker that cross-check suggested actions against the user's context and policy.
- Employing a two-step confirm model: the assistant suggests X, the user confirms, and the broker still validates before execution; never execute on the user's confirmation alone. A schema-validation sketch follows this list.
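One way to apply the first two points is to validate the model's emitted action against a JSON Schema before it ever reaches the broker. The sketch below uses the Ajv validator and mirrors the manifest shape from earlier; treat the schema details as a starting point rather than a finished policy.

// Sketch: reject malformed or overreaching tool calls before they reach the broker (uses Ajv).
import Ajv from 'ajv'

const actionSchema = {
  type: 'object',
  additionalProperties: false,
  required: ['type'],
  properties: {
    type: { enum: ['open_file', 'schedule', 'create_todo', 'send_email'] },
    target: {
      type: 'object',
      additionalProperties: false,
      properties: {
        path: { type: 'string', maxLength: 1024 },
        preview: { type: 'boolean' }
      }
    },
    meta: { type: 'object' }
  }
}

const validateAction = new Ajv().compile(actionSchema)

export function parseLLMAction(raw) {
  const candidate = typeof raw === 'string' ? JSON.parse(raw) : raw
  if (!validateAction(candidate)) {
    // Schema failure means "no action": render the reply as plain text instead of executing anything.
    return { ok: false, errors: validateAction.errors }
  }
  return { ok: true, action: candidate }
}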
Auditing, telemetry, and developer tooling
Ship an audit log that records:
- Action manifest hashed and signed.
- User response (approve/deny), timestamps, and session identifiers.
- Execution results and error codes.
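A minimal sketch of building such a record with Node's built-in crypto module follows; the HMAC stands in for whatever signing scheme your compliance requirements dictate, and the field names are illustrative.

// Sketch: hash the manifest and sign the audit record (HMAC used for illustration).
import { createHash, createHmac } from 'node:crypto'

export function buildAuditRecord({ manifest, decision, userId, sessionId, result }, signingKey) {
  const record = {
    manifestHash: createHash('sha256').update(JSON.stringify(manifest)).digest('hex'),
    decision,   // 'approve' | 'deny'
    userId,
    sessionId,
    result,     // execution outcome or error code
    timestamp: new Date().toISOString()
  }
  // Sign the serialized record so tampering with any field invalidates the signature.
  record.signature = createHmac('sha256', signingKey).update(JSON.stringify(record)).digest('hex')
  return record
}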
Expose a developer debugging panel that replays action decisions with context and the raw LLM response. This helps quickly tune prompts and tighten validators.
Checklist: vetting LLMs and third-party components (2026)
Before integrating any LLM or UI kit into an assistant use this checklist:
- Does the vendor support structured function-calling or tool-use patterns?
- Can the model be run on-device for sensitive workloads?
- Does the library expose a documented action manifest schema and validation hooks?
- Are security hardening docs, maintenance SLAs, and licensing clear?
- Is there a crystal-clear privacy policy and retention rule for audio/text logs?
Case study: secure file-open flow in a desktop assistant (concise)
Scenario: the user asks the assistant to "open last project notes". High-level steps we implemented:
- Voice -> ASR (on-device) => text "open last project notes".
- LLM responds with structured action: {type: 'open_file', path_hint: 'project.md', candidates: [...] }.
- Client displays candidates; user taps "Open".
- Action Broker validates path canonicalization, checks scope (allowed directories), prompts a native modal showing the exact file path; user approves.
- Broker opens file in a sandboxed viewer and returns a signed audit entry.
Result: fast, natural flow with user control and an audit trail. Anthropic's Cowork preview and Apple's increased coupling between assistant and local resources show how common this pattern has become — but also why explicit permission and validation are essential.
Testing strategy
- Unit test LLM -> action manifest parsing and validators.
- End-to-end tests with a fake broker to validate UI confirmation logic and error handling.
- Security tests: attempt malformed manifests, path traversal, and prompt-injection simulations (a sketch of such tests follows this list). Consider red-team-style exercises, such as simulating a compromised autonomous agent, to validate your response playbooks.
- Accessibility audits and keyboard navigation coverage.
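For the security tests in particular, here is a sketch of unit tests against the hypothetical validateManifest helper from the earlier validator sketch, using Node's built-in test runner; the import path is an assumption about your project layout.

// Sketch: security-focused unit tests for manifest validation (node:test).
import { test } from 'node:test'
import assert from 'node:assert/strict'
import { validateManifest } from './broker/validate.js' // hypothetical module from the validator sketch

test('rejects unknown action types', () => {
  const res = validateManifest({ type: 'run_shell', target: { path: '/tmp/x' } })
  assert.equal(res.ok, false)
})

test('blocks path traversal out of the allowed scope', () => {
  const res = validateManifest({ type: 'open_file', target: { path: '/Users/alex/notes/../../.ssh/id_rsa' } })
  assert.equal(res.ok, false)
})

test('flags approved paths for explicit confirmation', () => {
  const res = validateManifest({ type: 'open_file', target: { path: '/Users/alex/notes/project.md' } })
  assert.equal(res.ok, true)
  assert.equal(res.requiresConfirmation, true)
})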
Future predictions and 2026 trends to plan for
- More standardized function-calling and action schemas across LLM vendors — move to typed manifests.
- Growth of on-device speech + LLM stacks for privacy-sensitive assistants.
- Platform-level permissions for assistants in browsers and OSes — expect APIs that let users grant scoped assistant capabilities.
- Higher enterprise demand for signed audit trails and policy enforcement at the broker layer.
Actionable takeaways
- Split responsibilities: Input, LLM, and Action Broker. Keep UI components permissionless.
- Require typed action manifests and broker-side validation before any privileged operation.
- Prefer native confirmation modals and sandboxed execution for local actions. Use Tauri for strong defaults where possible.
- Instrument everything: logs, signatures, and replayable developer tooling to tune prompts and catch regressions.
- Design UX for transparency: summaries, previews, and undo windows reduce user anxiety and errors.
Next steps: code kit and checklist you can use today
Download or scaffold a starter that implements:
- React/Vue/Web Component conversational UI.
- Mock LLM with structured action emission.
- Action Broker skeleton with validation and a native-like confirmation modal.
If you want a pragmatic starter: implement the broker as a small Tauri backend with a Rust validator, and host the UI inside the webview. That gives you tight control over permissions and a small trusted computing base.
Final note on trust and responsibility
Multimodal assistants that can touch local resources are powerful, but they change trust boundaries. Design for explicit user consent, observable behavior, and auditable outcomes. In 2026, users and regulators expect more — not less — transparency.
Call to action
Ready to ship a secure multimodal assistant? Get our starter kit with React, Vue, vanilla, and Tauri Action Broker examples. It includes validators, an action manifest schema, and a checklist for enterprise readiness. Download the kit, run the included tests, and prototype a safe voice-enabled assistant in a day.