Research Report

A mixed-methods study · N = 48 + 7

The history drawer no one opens again.

Subject: UX of LLM chat interfaces
Question: Why do users abandon their conversation history?
Languages: Chinese · English · Spanish
Instruments: Survey (48) · Semi-structured interviews (7)
48 · Survey respondents across 3 languages
7 · In-depth semi-structured interviews
52% · Have used LLM apps for more than a year
50+ · Conversations held in a single account
12× · Higher abandonment when topics feel scattered
§ 01
Motivation

The trash-can problem.

Most apps auto-title conversations from the first few exchanges, before the discussion has gone anywhere. Titles drift toward each other. Eventually, finding an old thread costs more than re-asking, so users re-ask. The history drawer turns into a wastebasket no one opens.

We wanted to understand the lived shape of that decay. How do people actually use these tools? How do they handle the slow accumulation of threads? And which interface changes do they themselves wish for? A small mixed-methods study (survey at the population level, interviews at the motivational level) was designed to find empirical footing for the next generation of LLM chat UIs.

Recruitment ran through public social channels in three languages. We deliberately balanced veterans against first-month newcomers: only talking to power users would miss the friction beginners hit on day one; only talking to beginners would miss the structural failures that surface at 100 conversations. Every respondent gave informed consent; interview participants received a privacy briefing before recording.

The survey targeted retrieval behavior as its core construct: when a user wants old content back, do they go dig in the sidebar, or do they open a fresh window and re-ask? Subjective dimensions (interface satisfaction, retrieval difficulty, perceived organization) were measured on 5-point Likert scales.

Interviews were semi-structured. Each participant walked us through their daily management ritual: how they start a thread, when they rename, whether they delete, the hesitation before deletion. We also probed the affective layer, that hard-to-name irritation when the drawer becomes a mess. The closing prompt asked each interviewee to describe (or sketch) their ideal interface.

§ 02
Method

Two instruments, one drawer.

01 / RECRUITMENT

A balanced sample, not a heroic one.

Recruited via public posts in CN/EN/ES. Veterans and newcomers in deliberate balance, because each group sees a different failure mode of the same UI.

  • 48 valid surveys
  • 7 deep-dive interviews
  • 3 language communities
02 / QUANTITATIVE

Likert + behavioral counts.

Demographics, frequency, and a battery of 5-point scales targeting interface satisfaction, retrieval difficulty, and perceived organization. Eleven concrete pain points selectable as multi-check.

  • Cluster analysis on feature preference
  • Logistic regression on abandonment
  • Equivalence testing (TOST) for null effects
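Equivalence testing is the least familiar of the three analyses, so a minimal sketch may help: a one-sample TOST asking whether a Likert mean is within ±δ of the neutral midpoint. This uses a normal approximation to the t distribution (adequate at n ≈ 48); the data below are illustrative, not the study's.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_one_sample(xs, midpoint=3.0, delta=0.5, alpha=0.05):
    """Two one-sided tests: is the sample mean within +/-delta of
    `midpoint`? Normal approximation to the t distribution, which is
    adequate at n ~ 48. Returns (tost_p, equivalent)."""
    n = len(xs)
    se = stdev(xs) / sqrt(n)
    diff = mean(xs) - midpoint
    p_lower = 1 - NormalDist().cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = NormalDist().cdf((diff - delta) / se)      # H0: diff >= +delta
    p = max(p_lower, p_upper)  # reject both one-sided nulls -> equivalence
    return p, p < alpha

# Ratings hugging the midpoint are declared equivalent to neutral:
ratings = [3] * 30 + [2] * 9 + [4] * 9
p, eq = tost_one_sample(ratings)  # eq is True here
```

The point of TOST is that it distinguishes "statistically equivalent to neutral" from "merely underpowered to detect a difference," which a plain t-test cannot do.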
03 / QUALITATIVE

Watch the ritual, then ask for the ideal.

Semi-structured interviews in which each participant demoed their actual sidebar (opening, renaming, deleting, hesitating) and finished by drawing their dream interface.

  • Walk-through of daily flow
  • Affective probe: the "messy drawer" feeling
  • Sketched ideal-state UI
§ 03
Sample profile

Who we heard from.

Tenure

Long enough to accumulate a problem.

Over half of the sample has used LLM apps for more than a year. The bulk of respondents have lived through the slow accumulation of threads that produces structural retrieval failure.

> 1 YEAR · 52.1%
≤ 1 YEAR · 47.9%
n = 48 · > half are veterans
Skill self-rating

A skew toward the experienced.

Beginner and intermediate users were intentionally included to surface first-touch friction, but the bulk of the sample sits in the long-tail use range where structural problems live.

Beginner + Intermediate · 25%
Advanced · 46%
Expert · 29%
App × use case

The "moat by use case" hypothesis fails.

We expected differentiation: Copilot for code, ChatGPT for writing, Claude for research. The data refused that story. Across the board, whichever app a user prefers becomes a general-purpose tool covering coding, writing, learning, lookup, and analysis.

Cramér's V = 0.049 · well below the 0.10 small-effect threshold · TOST (α = 0.05) rejects "meaningful association" → no app-vs-task moat exists yet
ChatGPT · 25/48 · Coding / Writing / Learning
Gemini · 20/48 · Coding (primary)
Claude · 10/48 · Coding (primary)
Copilot · Grok · ≤7/48 · Mixed
DeepSeek · Doubao · Qwen · ≤7/48 · Long tail
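The Cramér's V above summarizes the strength of the app × use-case association via a chi-square on the contingency table. A stdlib-only sketch, with a hypothetical table rather than the study's raw counts:

```python
def cramers_v(table):
    """Cramér's V for an r x c contingency table (list of rows).
    V = sqrt(chi2 / (n * (min(r, c) - 1))); 0 = independence, 1 = perfect
    association."""
    n = sum(map(sum, table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(row_tot)) for j in range(len(col_tot))
    )
    return (chi2 / (n * (min(len(row_tot), len(col_tot)) - 1))) ** 0.5

# Hypothetical app-by-task counts; near-proportional rows give a tiny V:
v = cramers_v([[8, 9, 8], [7, 6, 7], [1, 1, 1]])
```

A value of 0.049, as reported, sits far below conventional small-effect cutoffs, which is what licenses the "no moat by use case" reading.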
§ 04
A very even frustration

Same temperature, different vocabulary.

Pain-point checklist · multi-select

Eleven things people came prepared to complain about.

Of the items below, the share who selected each as a real pain point in their own use:

  • Auto-titles are meaningless · 14.6%
  • Similar topics scattered everywhere · 41.7%
  • Too many threads to find the one I want ·
  • Important conversations get buried ·
  • Can't remember which thread had it ·
  • Search is broken or absent ·
  • No way to group by topic / project ·
  • The list itself loads slowly ·
  • Can't find the answer inside a long thread ·
  • Worry about losing info if I delete ·
  • I don't have any trouble ·
Skill × specificity

Experts name the pain. Everyone feels it.

50% of high-skill users flagged either "auto-titles" or "too many threads"; only 16.7% of casual users did the same.

Fisher exact p = 0.042 · OR = 4.85
Overall frustration · 5-pt scale
High-skill
M = 3.06
Casual
M = 3.17
TOST equivalence p = 0.0425 · Cohen's d = 0.105 → equivalent felt frustration

Experts can articulate which drawer hinge is broken, because 69.7% of them already hold 50+ conversations. Casual users (33.3% past 50) just feel the friction without the vocabulary.
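The Fisher exact p in the card above is a hypergeometric sum over 2×2 tables with the same margins. A stdlib-only sketch of the two-sided version (summing every table no more probable than the observed one); the counts in the usage line are hypothetical:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    sums hypergeometric probabilities of every same-margin table that
    is no more probable than the observed one."""
    n, r1, c1 = a + b + c + d, a + b, a + c

    def prob(x):  # P(first cell == x) under fixed margins
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - (n - r1)), min(r1, c1)
    return sum(p for x in range(lo, hi + 1)
               if (p := prob(x)) <= p_obs * (1 + 1e-9))

# Hypothetical skill-band x pain-naming counts:
p = fisher_exact_two_sided(9, 9, 2, 10)
```

Exact tests like this are the right choice here because several cells are small at n = 48, where chi-square approximations degrade.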

§ 05
Three persistent pains

"Tolerable" hides three specific failures.

Overall satisfaction lands at 6.35/10 (SD = 1.92, t(47) = 3.07, p < 0.01); subjective frustration sits at 3.08/5 (SD = 1.05, equivalent to neutral). The headline number is "fine." The interview transcripts are not.

Pain 01

Auto-generated titles

Title-accuracy rating averages 2.94 (SD = 1.00), equivalent to neutral on the TOST test (δ = 0.5, p = 0.002). One veteran said ChatGPT and Claude titles were "close enough." Another noted that Gemini's mobile voice mode opens a fresh thread per question; the auto-titles are so generic that one thread silently absorbs many topics.

14.6% select as pain · M = 2.94 / 5 · TOST: equivalent to neutral
Pain 02

Users want finer organization tools

Demand for tagging averages M = 4.12 (SD = 0.76, t = 10.2, p < 0.001), the highest item in the battery. In-thread bookmarks come in at M = 3.79 (SD = 0.97, p < 0.001) but with wider spread. Interviews surface concrete asks: prefixed brackets like [work] as ad-hoc tags, Drive-style folders by project name, the ability to star and pin threads.

Tagging M = 4.12 · In-thread bookmark M = 3.79
Pain 03

Model capability is currently masking the UX problem.

Multiple interviewees said the same thing differently. The drawer is broken, but the model is what they're paying for. The implication is uncomfortable: as model capability converges across vendors over the next two years, the moat shifts to interface, and the interface is the part nobody has invested in.

History management matters, but it isn't decisive. I just go with whichever model is strongest. / P-04 · advanced user · two-year tenure
§ 06
Where frustration comes from

Volume doesn't hurt.
Scatter does.

Logistic regression · "Have you given up looking in the past month?"

One predictor breaks the model open.

Four candidate predictors entered a single logistic regression. Only one came out significant, and not by a small margin.

Similar topics scattered · OR 12.07 ★
Auto-titles unhelpful · n.s.
Too many threads · n.s.
Search is broken · n.s.
12×
Scatter raises abandonment 12-fold. 95% CI [2.69, 72.93], p = 0.003. The CI is wide (read with care) but the direction is unambiguous. 41.7% of the sample is in the affected band.

The design implication is unusually clean: for the ~40% of users who feel scatter, optimizing titles or beefing up search is second-order. What helps them is keeping same-topic content together by default.
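For intuition about where an OR of 12.07 with a CI of [2.69, 72.93] comes from: with a single binary predictor, the logistic coefficient collapses to the 2×2 log odds ratio, so the estimate and its Wald CI can be sketched directly from cell counts. The counts below are hypothetical, not the study's:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_wald_ci(a, b, c, d, alpha=0.05):
    """Odds ratio and Wald CI for a 2x2 table:
    exposed:   a abandoned search, b did not
    unexposed: c abandoned search, d did not"""
    or_hat = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of log(OR)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lo = exp(log(or_hat) - z * se)
    hi = exp(log(or_hat) + z * se)
    return or_hat, lo, hi

# Small cells inflate the SE of log(OR), which is why a CI can span
# [2.69, 72.93] at n = 48 while the point estimate stays large:
or_hat, lo, hi = odds_ratio_wald_ci(12, 8, 3, 25)  # hypothetical counts
```

The asymmetric, wide interval is a direct consequence of the 1/cell terms in the SE: one small cell dominates the uncertainty.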

Mediation · Volume → Frustration → Satisfaction

Volume is a distal cause. Retrieval failure is the proximate one.

Number of conversations ──→ Subjective frustration · β = 0.003 · p = 0.442 · null
Subjective frustration ──→ Overall satisfaction · β = -0.665 · p = 0.016 · live
Number of conversations ··→ Overall satisfaction · direct path ≈ 0 · TOST confirmed null

Volume alone doesn't make people miserable. Repeated retrieval failure inside that volume does. Storing more is fine. Storing more without findability is the failure mode.
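The path estimates follow a standard Baron–Kenny style OLS decomposition. A numpy sketch on synthetic data shaped like the reported result (frustration drives satisfaction; raw conversation count drives neither); all variable names are illustrative:

```python
import numpy as np

def mediation_paths(count, frustration, satisfaction):
    """Baron-Kenny style OLS decomposition.
    a:      count -> frustration
    b:      frustration -> satisfaction (controlling count)
    direct: count -> satisfaction (controlling frustration)
    A null a-path or direct path indicates volume acts only through
    (or not even through) the mediator."""
    ones = np.ones_like(count)
    a = np.linalg.lstsq(np.column_stack([ones, count]),
                        frustration, rcond=None)[0][1]
    coefs = np.linalg.lstsq(np.column_stack([ones, count, frustration]),
                            satisfaction, rcond=None)[0]
    return a, coefs[2], coefs[1]   # (a, b, direct)

rng = np.random.default_rng(0)
count = rng.integers(5, 120, size=200).astype(float)
frustration = rng.normal(3.0, 1.0, size=200)          # independent of count
satisfaction = 8.0 - 0.7 * frustration + rng.normal(0, 0.3, size=200)
a, b, direct = mediation_paths(count, frustration, satisfaction)
# a and direct land near zero; b lands near -0.7
```

In the study's data the a-path itself is null, which is the stronger claim: volume does not even raise frustration on average; only retrieval failure does.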

Initiative beats skill

Whoever organizes a little, retrieves a lot.

Self-rated tech skill explains essentially zero variance in satisfaction (β = 0.067, p = 0.838, R² < 0.1%). The story changes when we replace skill with an "organizational initiative" composite (rename frequency + tidy-up willingness + topic-splitting habit, z-summed).

Tech skill → Satisfaction
β = 0.07
R² < 0.1% · p = 0.838 · indistinguishable from noise
Initiative → Retrieval (control: skill)
partial r = 0.345
p = 0.017 · willingness alone explains R² = 7.8% of satisfaction (p = 0.054)

A small amount of organizing effort, regardless of the user's technical fluency, reliably yields a smoother retrieval experience. That's a design lever: make organizing low-cost, and the lever is open to everyone.
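A partial correlation of this kind can be computed by residualizing both variables on the control and correlating the residuals. A short numpy sketch on synthetic data (names illustrative; 'skill' is constructed to be unrelated to the other two, mirroring the null skill finding):

```python
import numpy as np

def partial_corr(x, y, control):
    """Correlation between x and y after regressing `control`
    (with an intercept) out of both."""
    X = np.column_stack([np.ones_like(control), control])

    def resid(v):
        beta, *_ = np.linalg.lstsq(X, v, rcond=None)
        return v - X @ beta

    return float(np.corrcoef(resid(x), resid(y))[0, 1])

rng = np.random.default_rng(1)
skill = rng.normal(size=300)
initiative = rng.normal(size=300)
retrieval = initiative + rng.normal(scale=0.5, size=300)
r = partial_corr(initiative, retrieval, skill)  # stays large: the
# initiative-retrieval link survives controlling for skill
```

When the control is genuinely unrelated, the partial r barely moves from the raw r, which is exactly the pattern reported for initiative versus skill.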

§ 07
Three mental models

Three drawers, three minds.

K-Means clustering (k = 3) on feature-preference scores produced three groups with sharply different shapes. ANOVA F(2, 45) = 34.75, p < 0.001, η² = 0.61. The separation is too clean to be sampling noise.
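The clustering step is plain k-means on the feature-preference scores. A minimal Lloyd's-algorithm sketch in numpy, with illustrative data (a production analysis would standardize the battery first and validate k with a silhouette or gap statistic; k = 3 here follows the study):

```python
import numpy as np

def kmeans(X, k=3, iters=100, seed=0, init=None):
    """Minimal Lloyd's algorithm. Returns (labels, centers).
    `init` optionally supplies explicit starting centers; otherwise
    k distinct rows of X are sampled."""
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, dtype=float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)])
    for _ in range(iters):
        # distance of every point to every center, then nearest-center labels
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

The follow-up ANOVA on cluster membership (F(2, 45) = 34.75 in the study) is what certifies that the three groups differ on the preference scores rather than being an artifact of the partitioning itself.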

31/48 · 65%

Full-text search

The biggest group. Wants something close to a search engine over their entire history. Treats every past conversation as a queryable document.

Mental model
drawer ≈ inbox
10/48 · 21%

Frequency-sorted

Treats threads as a reusable function library. Wants the most-used items pinned to the top, surfaced by recency-of-use rather than recency-of-creation.

Mental model
drawer ≈ toolbox
7/48 · 15%

Timeline browsers

No appetite for organization. Relies on recent memory and the default reverse-chronological list. Actively rejects manual folders (Kruskal-Wallis H = 12.78, p = 0.002).

Mental model
drawer ≈ scroll

Cluster identity is not tied to occupation (Fisher exact p = 0.483, Cramér's V = 0.28). Treat the clusters as a feature-prioritization map, not a demographic taxonomy.

§ 08
Which features actually move the needle

The single feature that mattered.

We correlated each of ten candidate features with overall satisfaction. The pattern is hard to misread: features that hand control back to the user beat features that try to take it away.

Feature · correlation w/ satisfaction · verdict
In-thread bookmarks · r = 0.282 · p = 0.026 · ★ ONLY SIGNIFICANT
Manual folders · r = 0.227 · p = 0.06 · marginal, n.s.
Auto-grouping · r = 0.018 · TOST: null
Auto-link related threads · r = -0.004 · TOST: null
Timeline view · r = 0.051 · TOST: null
Full-text search · r = -0.047 · TOST: null

Correlation isn't causation. It could be that already-organized users like both bookmarks and the current UI. But when every "let the system do it" feature flatlines simultaneously, the signal lands cleanly: outsourcing organization to the AI doesn't move the experience. Letting the user mark a single line at the moment they realize it matters does.

Sub-finding

Pinning is for the already-organized.

Splitting respondents by whether they selected "scattered topics" as a pain reveals an asymmetry. Users without scatter pain rate the pin/top-of-list feature significantly higher than those with it (M = 4.39 vs. 3.95, t = 2.07, df = 44.1, p = 0.022 one-tailed, Cohen's d = 0.59). Pinning is a "nice-to-have" for users who already have their house in order, not a fix for those drowning.

§ 09
Composite personas

Three users, three drawers.

Synthesized from interview transcripts and cluster membership. Each persona aggregates several real respondents who shared a mental model. They are illustrative, not census categories.

Frequency-sorted · power user
Carlos
P-01
Software architect · 2 years · Heavy power user

Opens a new thread per coding question, treating each as a reusable function. His system held up at small scale and broke when the threads stacked up, most acutely on Gemini mobile, where every voice question becomes its own session under a generic auto-title.

I started re-asking, because finding the original was slower than typing it again./ P-01

Wants a Spotify-style sidebar: most-used at the top, sorted by frequency. Has migrated to third-party clients (Cherry Studio) for tag and filter support. Has zero appetite for AI-driven categorization.

Full-text search · academic
Maria
P-02
PhD candidate · 14 months · 73 threads

Uses LLM as external memory for dissertation work: Derrida one day, Victorian novels the next, chapter revisions in between. Daily use is fine. Pulling a single argument back is brutal.

Twenty minutes opening seven threads called "Literary Analysis," then I just gave up and re-asked the question./ P-02

Asks for thread-level organization (tags, folders), but the only feature correlated with satisfaction is sub-thread bookmarking. Currently copies key paragraphs into Notion to build her own index outside the chat app.

Timeline · passive browser
James
P-03
Marketing consultant · 7 months · ~30 threads

Asked what he wants from history management, he is direct: he scrolls until something looks familiar. Reverse-chronological is enough; he usually just wants yesterday's brainstorm or last week's notes.

I scroll till I see something I recognize. That's it./ P-03

Adds 8–10 threads/week, heading toward the ~50-thread breakdown point. Refused folders ("feels like homework"). Said the "related threads" recommendation feels useful, because it asks nothing of him.

§ 10
From data to design

Four principles.

01

Keep agency with the user

The most counter-intuitive finding is how little automation buys. Auto-grouping, auto-linking, timeline view, full-text search: all four flatlined under TOST. In-thread bookmarks (r = 0.282) led the table; "initiative" independently predicted satisfaction (partial r = 0.345). Both signals point the same direction: user participation has value of its own.

Instead of silently filing untouched threads into algorithm-shaped folders, give people lightweight manual tools (bookmarking a paragraph, splitting a thread) that activate at the moment of intent.

02

Prompt at the topic switch

The interview pattern: a single thread quietly drifts from one topic into another, and the early content gets buried. This is the same failure mode as scatter (OR = 12). It also poisons the model's own context window.

A non-intrusive cue when the system detects a topic shift ("This looks like a new topic. Continue here, or split from this point?") makes splitting the path of least resistance without punishing users who genuinely want a long thread. Conversational UIs should remember they're closer to "talking to someone" than to "selecting a session."

03

Design for 100, not 10

Expert pain is a scale problem, not a skill problem. The current interface holds at 10 threads (James's comfort zone) and cracks at 100 (Carlos's crisis). That gap shouldn't be patched late.

From day one: server-side full-text indexing instead of client-side Ctrl+F that fails on lazy-loaded items; multiple sort views (chronological, frequency, manual folder) instead of one canonical reverse-time list.

04

Smart defaults, manual overrides

James won't use folders. Carlos will live and die by them. A single design must serve both. Smart defaults (most-used threads pinned automatically this week, related-thread suggestions when starting a new prompt) work for the passive group; explicit folders, tags, and pinning serve the active group.

The principle: invisible affordances by default, visible affordances on request.

It's like sticking a label in front of the title. / P-05 · ad-hoc tagging via brackets
I just want a Google Drive, folders by project name. / P-06 · folder request
Folders feel like homework. I scroll instead. / P-03 · timeline cluster
§ 11
Conclusion

Five things worth taking.

Capabilities are converging across vendors. Differentiation is migrating to the surface. The vendor that makes its drawer actually usable wins a lasting edge, and right now, no one does.

  1. 01

    Scatter is the strongest single predictor of giving up.

    Among all candidates, only "similar topics scattered" reaches significance (OR = 12.07, p = 0.003). Volume, search quality, and titles do not.

    OR = 12.07 · 95% CI [2.69, 72.93]
  2. 02

    Initiative predicts satisfaction; tech skill does not.

    Controlling for self-rated skill, organizational initiative still independently explains satisfaction. Skill itself is a non-event.

    partial r = 0.345 · p = 0.017
    tech β = 0.067 · p = 0.838
  3. 03

    "Let the system handle it" features confirm null.

    Auto-grouping, auto-linking, timeline view, full-text search: all four were equivalence-tested to zero against satisfaction.

    TOST confirmed null × 4
  4. 04

    In-thread bookmarks are the only feature linked to satisfaction.

    A low-friction manual tool, applied at the moment a user knows they care, is the single feature whose adoption tracks happiness.

    r = 0.282 · p = 0.026
  5. 05

    Volume itself is innocent.

    Mediation analysis confirms the path from count to frustration is null. Frustration arises from structural retrieval failure inside volume, not volume per se.

    count → frustration
    TOST confirmed null
Models will converge. Letting users command their own conversations is increasingly close to a real differentiator.
§ 12
Limitations

What this study can't say.

Sample size

N = 48 caps statistical power.

The findings are best read as a stable directional signal, not a deterministic verdict. We used equivalence tests throughout to distinguish "real null" from "underpowered to detect." Marginal items remain trends only.

Design

Cross-sectional. Causality is open.

Even where initiative correlates with satisfaction, reverse causation is plausible: organized personalities may report higher satisfaction independent of any tool effect.

Self-report

Skill ratings are subjective.

Tech-skill rests on self-evaluation, with the inherent bias of self-report.

What's next

A/B test for the agency mechanism.

Long-term tracking and A/B trials are the right next step: when we actually ship a more user-controllable interface, do organizing behavior and satisfaction move? That's how the agency hypothesis goes from correlation to cause.