UX Redesign and Simplifying Multi-Model Comparisons for AI Product – Kashikoi

Kashikoi is a simulation engine to benchmark AI agents. We redesigned Kashikoi's AI agent comparison interface to streamline multi-model evaluation workflows and reduce cognitive friction for AI teams.

Kashikoi

Simulation Engine for Benchmarking AI Products

YC Batch

Spring 2025

Industry

Artificial Intelligence, Developer Tools

Challenge

Kashikoi's simulation engine interface ran into viewport constraints when AI teams compared multiple LLM responses at once. The vertical comparison layout fit only three to four models, which limited scalability because teams needed to benchmark five or more agents concurrently. Performance scores sat in a separate panel from the answers they belonged to, so users had to mentally map results across interface sections, adding unnecessary cognitive burden during evaluation sessions. According to recent benchmarking research, teams evaluate agents across diverse environments, which demands sophisticated comparison capabilities. The AI agent market is also projected to grow at 45.8% annually through 2030 as autonomous systems proliferate across industries.

Our Approach

We implemented a horizontal scrolling architecture, applying the Progressive Disclosure and Law of Proximity principles, to display LLM responses side by side without a fixed limit on model count, each paired with its performance metrics. The redesign introduced tab-based navigation that separates the comparison view from graph analytics, using Chunking to reduce cognitive load. Placing each score directly beneath its response applies the Spatial Contiguity principle from multimedia learning and cognitive load research and removes the mental mapping overhead. Research on agent evaluation frameworks indicates that standardized comparison interfaces provide quantitative feedback for systematic improvement, turning optimization from an art into a science. We also optimized the dashboard screen UX for conversions by creating question-by-question sections with horizontally scrollable responses, using Visual Hierarchy to guide attention and applying Recognition Over Recall through consistent model iconography.
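The question-by-question grouping described above can be pictured as a small data transformation. The following is a minimal sketch, not Kashikoi's actual data model: the `ModelRun`, `ScoredResponse`, and `QuestionSection` shapes, the `toQuestionSections` function, the model labels other than DeepSeek, and all scores are hypothetical and chosen only for illustration.

```typescript
// Hypothetical shapes for illustration only; not Kashikoi's real schema.
interface ModelRun {
  model: string;
  answers: { questionId: string; text: string; score: number }[];
}

interface ScoredResponse {
  model: string;
  text: string;
  score: number; // kept on the response so it can render directly beneath it
}

interface QuestionSection {
  questionId: string;
  responses: ScoredResponse[]; // one entry per model, laid out left to right
}

// Pivot model-centric runs into question-centric sections.
// Sections keep first-seen question order; responses keep model order.
function toQuestionSections(runs: ModelRun[]): QuestionSection[] {
  const sections = new Map<string, QuestionSection>();
  for (const run of runs) {
    for (const answer of run.answers) {
      let section = sections.get(answer.questionId);
      if (section === undefined) {
        section = { questionId: answer.questionId, responses: [] };
        sections.set(answer.questionId, section);
      }
      section.responses.push({
        model: run.model,
        text: answer.text,
        score: answer.score,
      });
    }
  }
  return Array.from(sections.values());
}

// Made-up example data: three models answering two questions.
const exampleRuns: ModelRun[] = [
  {
    model: "model-a",
    answers: [
      { questionId: "q1", text: "Answer A1", score: 0.8 },
      { questionId: "q2", text: "Answer A2", score: 0.6 },
    ],
  },
  {
    model: "model-b",
    answers: [
      { questionId: "q1", text: "Answer B1", score: 0.7 },
      { questionId: "q2", text: "Answer B2", score: 0.9 },
    ],
  },
  {
    model: "deepseek",
    answers: [
      { questionId: "q1", text: "Answer D1", score: 0.9 },
      { questionId: "q2", text: "Answer D2", score: 0.5 },
    ],
  },
];

const exampleSections = toQuestionSections(exampleRuns);
```

In this sketch, adding another model only appends one more entry to each section's `responses` array, which is the property a horizontally scrollable row relies on: the section structure does not change as the model count grows.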


Outcomes

The redesign supported 5+ concurrent LLM evaluations without viewport constraints, and contextual score placement eliminated cognitive switching between answers and metrics. Tab-based separation clarified the information architecture, applying Progressive Disclosure to reduce overwhelm during complex benchmarking sessions. The side-by-side horizontal layout enabled direct visual comparison along natural left-to-right scanning patterns, consistent with Serial Position Effect research. The consolidated interview details section provided essential context at a glance using Chunking principles. Adding DeepSeek as a further model demonstrated that the architecture can flex with evolving AI agent benchmarking needs, and the redesign reduced user dropoff on the setup screen.


Before: Vertical stacking with 3-4 model limit
After: Horizontal scrolling supporting 5+ models
Why: Fitts's Law & Cognitive Load - Expanding viewport horizontally enables unlimited comparisons without scrolling fatigue

Before: Performance metrics separated from answers in right panel
After: Scores positioned directly beneath each response
Why: Law of Proximity - Related information grouped together reduces mental mapping and improves comprehension speed

Before: Combined graph and comparison in single view
After: Separated tabs for Comparison View and Graph View
Why: Progressive Disclosure - Users access complex analytics separately, reducing initial cognitive burden

Before: Fixed layout accommodating maximum 4 LLMs
After: Flexible architecture demonstrated with DeepSeek addition
Why: Mental Model - Interface adapts to growing benchmark requirements without structural redesign

Before: Dual back buttons and mixed context sections
After: Unified "All Images" gallery and consolidated interview details
Why: Hick's Law - Reducing navigation choices accelerates decision-making and clarifies user paths

Before: Sequential question-answer blocks for each model
After: Question-centric sections with side-by-side model responses
Why: Chunking - Organizing by question rather than model enables direct performance comparison per prompt

Before: Vertical scrolling between disconnected elements
After: Left-to-right horizontal scanning with model icons
Why: Serial Position Effect & Visual Anchors - Natural reading pattern improves retention and comparison efficiency

