UX Redesign and Simplifying Multi-Model Comparisons for AI Product – Kashikoi

Kashikoi is a simulation engine to benchmark AI agents. We redesigned Kashikoi's AI agent comparison interface to streamline multi-model evaluation workflows and reduce cognitive friction for AI teams.

Kashikoi

Simulation Engine for Benchmarking AI Products

YC Batch

Spring 2025

Industry

Artificial Intelligence, Developer Tools

Challenge

Kashikoi's simulation engine interface ran into viewport constraints when AI teams compared multiple LLM responses at once. The vertical comparison layout fit only three to four models, which limited scalability for teams that needed to benchmark five or more agents concurrently. Performance scores appeared in a separate section from the answers they described, so users had to mentally map results across the interface, adding unnecessary cognitive burden during evaluation sessions. According to recent benchmarking research, teams evaluate agents across diverse environments that demand sophisticated comparison capabilities, and the AI agent market is projected to grow at 45.8% annually through 2030 as autonomous systems proliferate across industries.

Our Approach

We implemented a horizontal scrolling architecture that applies the Progressive Disclosure and Law of Proximity principles to display unlimited LLM responses side by side, each paired with its performance metrics. The redesign introduced tab-based navigation that separates the comparison view from graph analytics, using Chunking to reduce cognitive load. Placing each score directly beneath its response removed the mental mapping overhead, following the Spatial Contiguity principle from cognitive load theory. Research on agent evaluation frameworks indicates that standardized comparison interfaces provide quantitative feedback for systematic improvement, turning optimization from an art into a science. We also optimized the dashboard screen UX for conversions by creating question-by-question sections with horizontally scrollable responses, using Visual Hierarchy to guide attention and applying Recognition Over Recall through consistent model iconography.
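The question-by-question layout described above implies a particular data shape: one section per question, holding a row of model responses with each score attached to its answer. The following is a minimal sketch of that grouping step, not Kashikoi's actual implementation. Every name in it (`ModelResult`, `ResponseCard`, `QuestionSection`, `groupByQuestion`, and all field names) is a hypothetical chosen for illustration.

```typescript
// Hypothetical input shape: one flat record per (question, model) pair.
// Field names are assumptions, not Kashikoi's real schema.
interface ModelResult {
  questionId: string;
  question: string;
  model: string;
  answer: string;
  score: number;
}

// One card in a horizontally scrollable row. The score travels with
// the answer so the UI can render it directly beneath the response.
interface ResponseCard {
  model: string;
  answer: string;
  score: number;
}

// One question-centric section holding any number of model responses.
interface QuestionSection {
  questionId: string;
  question: string;
  responses: ResponseCard[];
}

// Groups flat results by question, preserving the order in which
// questions and models first appear in the input.
function groupByQuestion(results: ModelResult[]): QuestionSection[] {
  const sections = new Map<string, QuestionSection>();
  for (const r of results) {
    let section = sections.get(r.questionId);
    if (!section) {
      section = { questionId: r.questionId, question: r.question, responses: [] };
      sections.set(r.questionId, section);
    }
    section.responses.push({ model: r.model, answer: r.answer, score: r.score });
  }
  return Array.from(sections.values());
}
```

Because each section's `responses` array has no fixed length, adding another model to the benchmark only appends a card to each row. That is one way the layout could scale past the earlier three-to-four model limit without structural changes.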

Outcomes

The redesign delivered scalable comparison that supports 5+ concurrent LLM evaluations without viewport constraints, while contextual score placement eliminated cognitive switching between answers and metrics. Tab-based separation clarified the information architecture, applying Progressive Disclosure to reduce overwhelm during complex benchmarking sessions. The side-by-side horizontal layout enabled direct visual comparison following natural left-to-right scanning patterns, consistent with Serial Position Effect research. The consolidated interview details section provided essential context at a glance using Chunking principles. Adding DeepSeek demonstrated that the architecture scales, showing the interface's flexibility for evolving AI agent benchmarking needs, and the redesign reduced user drop-off on the setup screen.

Before	After	Why
Vertical stacking with 3-4 model limit	Horizontal scrolling supporting 5+ models	Fitts's Law & Cognitive Load - Expanding the viewport horizontally enables unlimited comparisons without vertical scrolling fatigue
Performance metrics separated from answers in right panel	Scores positioned directly beneath each response	Law of Proximity - Related information grouped together reduces mental mapping and improves comprehension speed
Combined graph and comparison in single view	Separated tabs for Comparison View and Graph View	Progressive Disclosure - Users access complex analytics separately, reducing initial cognitive burden
Fixed layout accommodating maximum 4 LLMs	Flexible architecture demonstrated with DeepSeek addition	Mental Model - Interface adapts to growing benchmark requirements without structural redesign
Dual back buttons and mixed context sections	Unified "All Images" gallery and consolidated interview details	Hick's Law - Reducing navigation choices accelerates decision-making and clarifies user paths
Sequential question-answer blocks for each model	Question-centric sections with side-by-side model responses	Chunking - Organizing by question rather than model enables direct performance comparison per prompt
Vertical scrolling between disconnected elements	Left-to-right horizontal scanning with model icons	Serial Position Effect & Visual Anchors - Natural reading pattern improves retention and comparison efficiency
