UX Redesign and Simplifying Multi-Model Comparisons for AI Product – Kashikoi
Kashikoi is a simulation engine for benchmarking AI agents. We redesigned its agent comparison interface to streamline multi-model evaluation workflows and reduce cognitive friction for AI teams.

Kashikoi
Simulation Engine for Benchmarking AI Products

YC Batch
Spring 2025
Industry
Artificial Intelligence, Developer Tools
Challenge
Kashikoi's simulation engine interface ran into viewport limits when AI teams compared multiple LLM responses at once. Its vertical layout fit only three to four models, yet teams needed to benchmark five or more agents concurrently. Performance scores also sat apart from their corresponding answers, forcing users to mentally map results across separate interface sections and adding needless cognitive burden to evaluation sessions. Recent benchmarking research notes that teams evaluate agents across diverse environments that demand sophisticated comparison tooling, with the AI agent market projected to grow 45.8% annually through 2030 as autonomous systems spread across industries.
Our Approach
We implemented a horizontal scrolling architecture, applying Progressive Disclosure and the Law of Proximity to display any number of LLM responses side by side, each paired with its performance metrics. The redesign introduced tab-based navigation that separates the comparison view from graph analytics, leveraging Chunking to reduce cognitive load, and placed each score directly beneath its response, eliminating mental-mapping overhead per the Spatial Contiguity principle from cognitive load theory. Research on agent evaluation frameworks shows that standardized comparison interfaces provide the quantitative feedback needed for systematic improvement, turning optimization from an art into a science. We also optimized the dashboard UX for conversions by restructuring it into question-by-question sections with horizontally scrollable responses, using Visual Hierarchy to guide attention and consistent model iconography to support Recognition Over Recall.
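The question-centric regrouping described above can be sketched as a small data transform. This is an illustrative sketch, not Kashikoi's actual API: the record shape, field names, and `groupByQuestion` helper are assumptions. The point is that each score stays attached to its answer (Spatial Contiguity) while the sections pivot from model-ordered to question-ordered (Chunking).

```typescript
// Flat simulation results, one record per (model, question) pair.
// This shape is hypothetical; Kashikoi's real data model may differ.
interface ResultRecord {
  model: string;
  question: string;
  answer: string;
  score: number; // performance metric for this specific answer
}

// A question-centric section: one benchmark question, with every
// model's response (and its score) side by side.
interface QuestionSection {
  question: string;
  responses: { model: string; answer: string; score: number }[];
}

// Regroup model-ordered records into question-ordered sections,
// keeping each score paired with the answer it belongs to.
function groupByQuestion(records: ResultRecord[]): QuestionSection[] {
  const sections = new Map<string, QuestionSection>();
  for (const r of records) {
    let section = sections.get(r.question);
    if (!section) {
      section = { question: r.question, responses: [] };
      sections.set(r.question, section);
    }
    section.responses.push({ model: r.model, answer: r.answer, score: r.score });
  }
  // Map preserves insertion order, so sections follow question order.
  return [...sections.values()];
}

// Example: three models answering two questions.
const records: ResultRecord[] = [
  { model: "GPT-4o", question: "Q1", answer: "…", score: 0.91 },
  { model: "Claude", question: "Q1", answer: "…", score: 0.88 },
  { model: "DeepSeek", question: "Q1", answer: "…", score: 0.85 },
  { model: "GPT-4o", question: "Q2", answer: "…", score: 0.79 },
  { model: "Claude", question: "Q2", answer: "…", score: 0.83 },
  { model: "DeepSeek", question: "Q2", answer: "…", score: 0.80 },
];

const sections = groupByQuestion(records);
```

Because the grouping is open-ended, adding a model such as DeepSeek simply appends one more response per section, which is what lets the layout scale past a fixed three-to-four-model limit.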
Outcomes
The redesign supports five or more concurrent LLM evaluations without viewport constraints, while contextual score placement eliminates cognitive switching between answers and metrics. Tab-based separation clarified the information architecture, applying Progressive Disclosure to keep complex benchmarking sessions from overwhelming users. The side-by-side horizontal layout enables direct visual comparison along natural left-to-right scanning patterns, consistent with Serial Position Effect research. A consolidated interview-details section provides essential context at a glance using Chunking principles, and the addition of DeepSeek demonstrated that the architecture flexes to evolving AI agent benchmarking needs while reducing user drop-off on the setup screen.
| Before | After | Why |
|---|---|---|
| Vertical stacking with 3-4 model limit | Horizontal scrolling supporting 5+ models | Fitts's Law & Cognitive Load: expanding the viewport horizontally enables unlimited comparisons without scrolling fatigue |
| Performance metrics separated from answers in right panel | Scores positioned directly beneath each response | Law of Proximity: grouping related information reduces mental mapping and improves comprehension speed |
| Combined graph and comparison in single view | Separate tabs for Comparison View and Graph View | Progressive Disclosure: users access complex analytics separately, reducing initial cognitive burden |
| Fixed layout accommodating maximum 4 LLMs | Flexible architecture demonstrated with DeepSeek addition | Mental Model: interface adapts to growing benchmark requirements without structural redesign |
| Dual back buttons and mixed context sections | Unified "All Images" gallery and consolidated interview details | Hick's Law: reducing navigation choices accelerates decision-making and clarifies user paths |
| Sequential question-answer blocks for each model | Question-centric sections with side-by-side model responses | Chunking: organizing by question rather than model enables direct performance comparison per prompt |
| Vertical scrolling between disconnected elements | Left-to-right horizontal scanning with model icons | Serial Position Effect & Visual Anchors: natural reading pattern improves retention and comparison efficiency |
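The first two rows of the table can be sketched in markup terms. The snippet below is a minimal, hypothetical rendering helper (not Kashikoi's implementation): a flex container with `overflow-x: auto` lets response cards extend past the viewport horizontally instead of stacking vertically, and each card carries its score directly beneath its answer.

```typescript
// One response card per model; hypothetical shape for illustration.
interface Response {
  model: string;
  answer: string;
  score: number;
}

// Render a question section as a horizontally scrollable row of cards.
// Inline styles keep the sketch self-contained; a real app would use CSS.
function renderSection(question: string, responses: Response[]): string {
  const cards = responses
    .map(r =>
      `<div class="card">` +
      `<h4>${r.model}</h4>` +
      `<p>${r.answer}</p>` +
      // Score sits right beneath its answer (Spatial Contiguity).
      `<span class="score">${r.score.toFixed(2)}</span>` +
      `</div>`)
    .join("");
  // overflow-x:auto lets the row grow past 4-5 models without
  // forcing a vertical stack or squeezing the cards.
  return `<section><h3>${question}</h3>` +
         `<div style="display:flex;overflow-x:auto;gap:16px">${cards}</div>` +
         `</section>`;
}

const html = renderSection("Q1", [
  { model: "GPT-4o", answer: "…", score: 0.91 },
  { model: "DeepSeek", answer: "…", score: 0.85 },
]);
```

Because the card row is a single scrollable track, appending a new model is an O(1) layout change: no grid columns to rebalance and no structural redesign, which matches the "flexible architecture" row above.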