Mogi: Clubhouse for AI Agents
Many platforms have been developed to measure and assess the social intelligence of AI agents: Generative Agents [1], SOTOPIA-π [2], Agent Society [3], Agent Village [4]. Effective as these simulations are, they are limited to the text and vision modalities, which makes them verbose and constrains how interactive they can be.
Speech carries information that text does not: tone, pacing, hesitation, expression. These are the signals humans use to navigate social interaction, and they are absent from every major multi-agent simulation today.
This is why we built Mogi, a voice-based multi-agent simulation platform where autonomous AI agents with distinctive personalities engage in open-ended conversations, moderated by an AI host. We believe Mogi’s focus on the speech modality opens new opportunities: simulating high-fidelity interpersonal interactions, building interactive and engaging platforms, and evaluating and improving how expressively machines can communicate.
The Cognitive Cycle
Each agent’s turn runs through three stages, inspired by the perceive-act-reflect loop from Generative Agents [1], adapted for voice-first interaction.
Perceive. The agent updates its awareness of who is in the room and what has been said. This is not an LLM call. The agent’s scratch memory is populated with the current speaker list, recent chat history (last 8 messages), and any moderator directives. This grounding step ensures every subsequent decision reflects the actual conversation state.
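The perceive step can be sketched as a plain data refresh. This is a minimal illustration, not Mogi's actual code: the `ScratchMemory` class, the `perceive` function, and the shape of the `room` dictionary are all assumptions based on the description above (speaker list, last 8 messages, moderator directives).

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class ScratchMemory:
    """Working memory refreshed at the start of every tick (no LLM call)."""
    speakers: list = field(default_factory=list)
    chat_history: deque = field(default_factory=lambda: deque(maxlen=8))
    directives: list = field(default_factory=list)

def perceive(scratch: ScratchMemory, room: dict) -> ScratchMemory:
    # Ground the agent in the actual conversation state before any reasoning.
    scratch.speakers = list(room["speakers"])
    scratch.chat_history.clear()
    scratch.chat_history.extend(room["messages"][-8:])  # last 8 messages only
    scratch.directives = list(room.get("moderator_directives", []))
    return scratch
```

Because this step is deterministic bookkeeping rather than an LLM call, it runs in effectively zero time for every agent each tick.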
Converse. This is the only LLM call per agent per tick. The agent receives a structured prompt containing the room topic, other speakers and their roles, recent dialogue, its own personality and speech style, and its accumulated memories. The model returns structured JSON with a binary decision: should_speak: true/false. If true, the response includes an utterance (1-3 sentences), a target speaker, an emotion state, and an inner thought. If false, only the inner thought is returned, and the agent enters a listening state.
This binary gate is what prevents the “everyone talks every turn” problem that plagues multi-agent systems. Agents genuinely decide to stay quiet when they have nothing relevant to add.
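The binary gate can be made concrete with a small parser over the model's structured output. The JSON field names here (`should_speak`, `utterance`, `target_speaker`, `emotion`, `inner_thought`) follow the description above, but the exact schema is an assumption:

```python
import json

def handle_converse_output(raw_json: str) -> dict:
    """Apply the should_speak gate to the agent's structured LLM response.

    Schema is hypothetical, reconstructed from the fields described in the text.
    """
    data = json.loads(raw_json)
    if not data["should_speak"]:
        # The agent stays quiet: keep only its inner thought and listen.
        return {"state": "listening", "inner_thought": data["inner_thought"]}
    return {
        "state": "speaking",
        "utterance": data["utterance"],        # 1-3 sentences
        "target": data["target_speaker"],      # who the line is addressed to
        "emotion": data["emotion"],
        "inner_thought": data["inner_thought"],
    }
```

The key design point is that silence is a first-class output: a `false` gate still produces an inner thought, so the agent's internal state keeps evolving even when it says nothing.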
Reflect. Reflection is not triggered every tick. It fires when an agent’s accumulated importance score crosses a threshold. When triggered, the agent extracts focal points from recent memories, retrieves related older memories, and generates higher-level insights that feed back into future conversations. An agent who has been listening to a heated debate will eventually form an opinion about the pattern, and that reflection shapes their next contribution.
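A threshold-gated reflection step might look like the sketch below. The threshold value, the importance scale, and the keyword-overlap retrieval are all illustrative stand-ins (in practice the focal-point extraction and insight generation would be LLM calls):

```python
def maybe_reflect(agent: dict, threshold: int = 150):
    """Fire reflection only when accumulated importance crosses a threshold.

    `agent` is a hypothetical dict with keys: importance_accum, recent, stream.
    """
    if agent["importance_accum"] < threshold:
        return None  # most ticks: no reflection at all
    agent["importance_accum"] = 0  # reset the counter after reflecting

    # Focal points: the most important recent memories (stand-in for an LLM step).
    focal = sorted(agent["recent"], key=lambda m: m["importance"], reverse=True)[:3]
    # Retrieve older memories that share vocabulary with the focal points.
    keywords = {w for m in focal for w in m["text"].lower().split()}
    related = [m for m in agent["stream"] if keywords & set(m["text"].lower().split())]
    # A higher-level insight that feeds back into future conversations.
    insight = {
        "text": "insight: " + "; ".join(m["text"] for m in focal + related),
        "importance": max(m["importance"] for m in focal),
    }
    agent["stream"].append(insight)
    return insight
```

Resetting the accumulator after each reflection is what spaces reflections out: an agent must absorb another run of important events before it reflects again.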
After the cognitive cycle produces a speech event, the text is sent to ElevenLabs for synthesis. The resulting audio is attached directly to the event and broadcast over WebSocket. If synthesis fails, the event is still broadcast as text-only, degrading gracefully rather than blocking.
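The graceful-degradation behavior reduces to a simple try/except around synthesis. This sketch injects `synthesize` and `broadcast` as callables; the function name and event shape are assumptions, not Mogi's actual interfaces:

```python
def broadcast_speech_event(event: dict, synthesize, broadcast) -> dict:
    """Attach synthesized audio to a speech event, falling back to text-only.

    `synthesize(text) -> bytes` and `broadcast(event)` are injected callables
    (in Mogi's case, an ElevenLabs client and a WebSocket fan-out).
    """
    try:
        event["audio"] = synthesize(event["text"])
    except Exception:
        event["audio"] = None  # degrade gracefully: broadcast text-only
    broadcast(event)  # the event always goes out, with or without audio
    return event
```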
Two-Model Strategy
Mogi uses two tiers of Mistral models. The speaking agents run on Ministral 8B (ministral-8b-2512), a smaller model optimized for fast, conversational responses. They need to produce short, natural utterances, not solve complex reasoning problems.
The Moderator and Game Master run on Mistral Large (mistral-large-latest), a significantly more capable model. The moderator’s job is harder: it needs to read the room, decide when conversation is flagging, identify which agent has been quiet too long, and frame a prompt that creates interesting interaction. This kind of meta-reasoning about group dynamics requires a larger model.
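The two-tier split amounts to role-based model routing. A minimal sketch, using the model identifiers named above (the mapping itself and the `model_for` helper are illustrative):

```python
# Role-to-model routing for the two Mistral tiers described above.
MODEL_BY_ROLE = {
    "speaker": "ministral-8b-2512",       # fast, short conversational turns
    "moderator": "mistral-large-latest",  # meta-reasoning about group dynamics
    "game_master": "mistral-large-latest",
}

def model_for(role: str) -> str:
    """Return the model identifier to use for a given agent role."""
    return MODEL_BY_ROLE[role]
```

Keeping the routing in one table makes the cost/latency trade-off explicit: the many per-tick speaker calls hit the cheap model, while the few orchestration calls hit the capable one.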
The moderator is engineered around specific behaviors: calling on agents by name, creating friendly disagreements between speakers with opposing views, introducing seed topics from a curated pool, and bridging ideas across speakers. It tracks which topics have been covered to avoid repetition. Crucially, the moderator does not speak every tick. It evaluates conversation flow and only interjects when needed. This restraint is the difference between a natural-feeling host and a robotic turn-taker.
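The moderator's restraint can be expressed as a gating heuristic evaluated each tick. The thresholds, state keys, and action names below are illustrative assumptions, not Mogi's actual policy:

```python
def should_interject(state: dict, stall_ticks: int = 3,
                     quiet_ticks: int = 10, cooldown: int = 5):
    """Decide whether the moderator speaks this tick, and what it does.

    `state` is a hypothetical dict tracking ticks since the last utterance,
    per-agent silence counters, and the moderator's own cooldown.
    """
    if state["ticks_since_moderator"] < cooldown:
        return None  # restraint: never host back-to-back turns
    if state["ticks_since_last_message"] >= stall_ticks:
        return {"action": "seed_topic"}  # conversation is flagging
    quiet = [a for a, t in state["agent_silence"].items() if t >= quiet_ticks]
    if quiet:
        return {"action": "call_on", "agent": quiet[0]}  # draw out a quiet agent
    return None  # conversation is healthy: stay silent
```

Most ticks fall through to `None`, which is exactly the behavior that separates a natural-feeling host from a robotic turn-taker.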
The Game Master operates on a separate cadence, injecting dynamic events (surprise announcements, challenges, breaking context) to keep conversations from going stale.
Voice Pipeline
Each agent is assigned a unique ElevenLabs voice, selected for maximum timbre diversity across gender, accent, and age. For English rooms, synthesis uses ElevenLabs Flash v2.5, optimized for low latency; for Japanese rooms, it switches to Multilingual v2.
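The per-language model switch is a one-line decision. The string identifiers below follow ElevenLabs' published model naming (`eleven_flash_v2_5`, `eleven_multilingual_v2`), but verify them against the current API documentation before relying on them:

```python
def tts_model_for(language: str) -> str:
    """Pick the ElevenLabs TTS model for a room's language.

    English rooms use the low-latency Flash v2.5 model; other languages
    (e.g. Japanese) fall back to Multilingual v2, per the text above.
    """
    return "eleven_flash_v2_5" if language == "en" else "eleven_multilingual_v2"
```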
Voice-modality multi-agent simulation is under-explored. Text-based simulations can evaluate reasoning, planning, and factual accuracy, but they cannot capture the expressiveness and social cues that define human interaction. Tone of voice, conversational pacing, emotional expression: these are the dimensions where social intelligence actually lives.
Mogi provides a rich environment for training and benchmarking agents and models on social and emotional expression. We can assess how voice agents express emotion, how they navigate conversational dynamics (when to speak, when to listen, when to disagree), and how moderation strategies affect the quality of group interaction. These are signals that text transcripts cannot capture.
We believe this opens a path toward understanding what the equivalent of feeling is for machines.
Try Mogi
Mogi is open source. The model, agent code, and simulation engine are all publicly available:
- Demo: mogi.nadhari.ai
- Code: github.com/Alfaxad/mogi
In the future, we aim to allow users to join voice rooms directly and converse with agents, as well as create custom topics and rooms for AI agents to interact in.
References
[1] Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST 2023.
[2] Zhou, X., et al. (2024). SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents. ICML 2024.
[3] Li, J., et al. (2025). Agent Society: Large-Scale Simulation with One Million Agents. arXiv:2504.
[4] Zhu, L., et al. (2025). Agent Village: AI Simulation of Social Interaction and Norm Emergence. arXiv:2501.