Skip to content

Triage Production Errors

Quickly focus on error-level events, capture the surrounding context, and hand off a concise summary for incident response.

Who This Is For

  • On-call engineers triaging incidents across multiple services.
  • SREs verifying that a fix reduced error volume.
  • Developers preparing error samples for debugging.

Before You Start

  • Kelora installed (see Installation for download options).
  • Access to the relevant log files or streams. Examples below use examples/simple_json.jsonl; swap in your own paths.
  • Familiarity with the log format you are reading (-j for JSON, -f logfmt for logfmt, -f combined for web access logs, etc.).

Step 1: Scope the Log Set

Decide which files and time ranges matter for the investigation. Use level filtering for the fastest coarse grained scan.

kelora -j examples/simple_json.jsonl \
  -l error,critical \
  --since "2024-01-15T10:00:00Z" \
  --until 30m
  • -l (or --levels) executes as soon as it appears, so place it before heavier stages to prune work early.
  • You can mix absolute ISO timestamps with relative offsets to anchor investigations (as shown above).
  • Prefer explicit formats (-j, -f logfmt, -f combined) over -f auto detection to avoid surprises.
  • For a directory of files, pass a glob (logs/app/*.jsonl) or feed a file list via find … -print0 | xargs -0 kelora ….

Step 2: Narrow to Relevant Signals

Add scripted filters to isolate services, customers, or error types. Combine filter expressions with level-based filters for precision.

kelora -j examples/simple_json.jsonl \
  -l error,critical \
  --filter 'e.service == "orders" || e.message.contains("timeout")' \
  -k timestamp,service,message

Guidance:

  • Prefer --filter (Rhai expression) for field checks, pattern scans, or numerical comparisons.
  • Chain multiple --filter flags if it reads better than a long expression.
  • Use safe accessors (e.get_path("error.code", "unknown")) when fields may be missing.

Step 3: Pull Contextual Events

Show surrounding events to understand what happened before and after each error. Combine context flags with multiline handling when stack traces are present.

kelora -j examples/simple_json.jsonl \
  -l error,critical \
  -B 2 \
  -A 1
  • -B/-A (short for --before-context/--after-context) mimic grep’s context flags. Use them sparingly to avoid flooding output.
  • If the log contains multi-line traces, run the relevant strategy from Choose a Multiline Strategy first, then apply level/context filtering.

Step 4: Summarise Severity and Ownership

Track counts while you inspect events so you can state who is affected and how often it happens.

kelora -j examples/simple_json.jsonl \
  -l error,critical \
  -e 'track_count(e.service)' \
  -e 'track_count(e.get_path("error.code", "unknown"))' \
  --metrics \
  --stats
  • track_count() tallies per-key counts; combine with --metrics to view the table at the end.
  • --stats reports records processed, parse failures, and throughput so you can mention data quality in the incident summary.

Step 5: Export Evidence

Produce a shareable artifact once you know which events matter.

kelora -j examples/simple_json.jsonl \
  -l error,critical \
  -e 'e.error_code = e.get_path("error.code", "unknown")' \
  -k timestamp,service,error_code,message \
  -F csv \
  -o errors.csv

Alternatives:

  • -J (or -F json) for structured archives.
  • -q or --silent when piping into shell scripts that only care about exit codes.

Variations

  • Web server failures

    kelora -f combined examples/simple_combined.log \
      --filter 'e.status >= 500' \
      -k timestamp,status,request
    

  • Search compressed archives in parallel

    kelora -j logs/2024-04-*.jsonl.gz \
      --parallel \
      -l error,critical \
      --filter 'e.message.contains("timeout")'
    

  • Pivot by customer or shard

    kelora -j examples/simple_json.jsonl \
      -l error,critical \
      -e 'track_count(e.account_id)' \
      --metrics
    

  • Fail fast in CI

    kelora -q -j build/kelora.log -l error \
      && echo "no errors" \
      || (echo "errors detected" >&2; exit 1)
    

Validate and Hand Off

  • Inspect --stats output to ensure parse error counts are zero (exit code 1 indicates parse/runtime failures).
  • Sample a few events with --take 20 before exporting to confirm filters captured the right incidents.
  • Note any masking or redaction you applied so downstream teams know whether to consult raw logs.

See Also