Traces and evals

Production conversations become model evidence

1chat treats the conversation as the primary unit of inspection, labeling, and future eval generation.

Trace model

A single user conversation can contain many LLM calls: classification, retrieval planning, answer generation, tool repair, summarization, and follow-up handling. 1chat stores each call but assembles them by `conversation_id` so reviewers see the thread first and the individual calls second.

{
  "trace_id": "tr_abc",
  "conversation_id": "tax-session-42",
  "calls": [
    {
      "request_id": "req_1",
      "surface": "openai.chat",
      "requested_model": "1chat/auto",
      "selected_model": "gemini/gemini-2.5-flash-lite",
      "route_reason": "Default production candidate for tax triage",
      "charge_usd": 0.000010,
      "latency_ms": 812
    }
  ]
}
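As a rough illustration of assembling calls by `conversation_id`, the sketch below groups a flat list of call records into threads and totals cost and latency per thread. It assumes each call record carries its own `conversation_id` and that records arrive in chronological order; the function names `assemble_conversations` and `summarize_thread` are hypothetical and not part of 1chat.

```python
from collections import defaultdict


def assemble_conversations(calls):
    """Group call records by conversation_id, preserving input order.

    Assumption: each record is a dict with a "conversation_id" key and the
    input list is already in chronological order.
    """
    threads = defaultdict(list)
    for call in calls:
        threads[call["conversation_id"]].append(call)
    return dict(threads)


def summarize_thread(thread):
    """Total cost and latency for one conversation's calls."""
    return {
        "call_count": len(thread),
        "total_charge_usd": sum(c["charge_usd"] for c in thread),
        "total_latency_ms": sum(c["latency_ms"] for c in thread),
    }
```

A viewer built this way can render the thread-level summary first and expose the individual call records on demand.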

Human labels

The first labeling surface should stay lightweight: thumbs up or down, a required reason, and an optional note. The label is attached to the call, but the viewer should show the surrounding conversation context so that reviewers judge the answer as the user experienced it.

Thumbs up

The answer satisfied the user or met the task rubric.

Thumbs down

The answer was wrong, incomplete, unsafe, too slow, too expensive, or off policy.

Reason

A compact category such as missed caveat, invalid JSON, hallucinated citation, or user corrected.

Note

Freeform context for domain experts and future evaluator design.
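One way to picture the four fields above is as a small record attached to a single call. This is a minimal sketch, not the 1chat schema: the class name `HumanLabel`, the field names, and the `"up"`/`"down"` encoding are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HumanLabel:
    """A reviewer label attached to one LLM call (hypothetical shape)."""

    request_id: str              # the call being labeled
    verdict: str                 # "up" or "down" (assumed encoding)
    reason: str                  # required compact category, e.g. "missed caveat"
    note: Optional[str] = None   # optional freeform context

    def __post_init__(self):
        if self.verdict not in ("up", "down"):
            raise ValueError("verdict must be 'up' or 'down'")
        if not self.reason.strip():
            raise ValueError("reason is required")
```

Rejecting an empty reason at construction time keeps the "required reason" rule enforced wherever labels are created, not only in the labeling UI.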

Auto-label suggestions

1chat can infer weak labels from the next user turn. If a user says “no, that is not what I meant,” the prior model call should be suggested as thumbs down. If a user says “perfect” or “exactly,” the prior call can be suggested as thumbs up. These suggestions should not be treated as ground truth until the product team decides how much trust to give them.
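The idea can be sketched with simple phrase matching on the follow-up turn. This is an illustrative heuristic only, not how 1chat infers labels: the phrase lists, the function name `suggest_label`, and the returned fields are assumptions, and a real system would likely need a broader or learned signal.

```python
import re
from typing import Optional

# Hypothetical phrase lists. Negative patterns are checked first so that
# "not exactly" is not mistaken for praise.
NEGATIVE_PATTERNS = [r"\bnot what i meant\b", r"\bnot (?:exactly|perfect)\b"]
POSITIVE_PATTERNS = [r"\bperfect\b", r"\bexactly\b"]


def suggest_label(next_user_turn: str) -> Optional[dict]:
    """Return a weak label suggestion for the prior call, or None."""
    text = next_user_turn.lower()
    for verdict, patterns in (("down", NEGATIVE_PATTERNS), ("up", POSITIVE_PATTERNS)):
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                return {
                    "verdict": verdict,
                    "evidence": match.group(0),  # the phrase that triggered it
                    "accepted": False,           # a suggestion, not ground truth
                }
    return None
```

Keeping `accepted` false by default means a suggestion only becomes a label after a reviewer or an explicit team policy confirms it.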

Why this matters

Many teams do not have explicit feedback buttons. Follow-up turns are often the highest-volume signal they already have, especially in support, tutoring, tax, and agent workflows.

Eval set creation

1. Filter traces by project, task hint, selected model, label, cost, latency, user segment, or route decision.
2. Select the calls or full conversations that represent a business-critical task class.
3. Create an eval set with the prompt, expected behavior, label metadata, and original surrounding context.
4. Run candidate models against the set asynchronously so production latency is unaffected.
5. Promote candidates into routing only after they clear the project quality floor.
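The steps above can be sketched as three small functions: filter, build, and gate. This is a simplified outline under stated assumptions: the call fields `prompt`, `context`, and `label`, the function names, and the use of a pass rate as the quality floor are hypothetical choices, and the asynchronous candidate run itself is left out.

```python
from __future__ import annotations

from typing import Any


def filter_calls(calls: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Keep calls whose fields equal every criterion (exact match only)."""
    return [c for c in calls if all(c.get(k) == v for k, v in criteria.items())]


def build_eval_set(name: str, calls: list[dict[str, Any]], expected_behavior: str) -> dict[str, Any]:
    """Bundle prompt, expected behavior, label metadata, and context per call."""
    return {
        "name": name,
        "items": [
            {
                "source_request_id": c["request_id"],
                "prompt": c["prompt"],
                "expected_behavior": expected_behavior,
                "label": c.get("label"),
                "context": c.get("context", []),
            }
            for c in calls
        ],
    }


def clears_quality_floor(passed: list[bool], floor: float) -> bool:
    """True if the candidate's pass rate meets the floor; empty runs never pass."""
    if not passed:
        return False
    return sum(passed) / len(passed) >= floor
```

Under these assumptions, a candidate that passes three of four items clears a floor of 0.75, and only then would it be considered for routing.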

Reviewer experience

Show the full conversation timeline first.
Let reviewers click any LLM call to inspect raw request, raw response, selected model, and route reason.
Keep labeling controls visible near the model output, not buried in a separate workflow.
Show inferred label suggestions as suggestions with confidence and evidence.
Make “create eval set from selected traces” a primary action once reviewers have filtered a useful group.