Triage Production Errors¶
Quickly focus on error-level events, capture the surrounding context, and hand off a concise summary for incident response.
Who This Is For¶
- On-call engineers triaging incidents across multiple services.
- SREs verifying that a fix reduced error volume.
- Developers preparing error samples for debugging.
Before You Start¶
- Kelora installed (see Installation for download options).
- Access to the relevant log files or streams. Examples below use `examples/simple_json.jsonl`; swap in your own paths.
- Familiarity with the log format you are reading (`-j` for JSON, `-f logfmt` for logfmt, `-f combined` for web access logs, etc.).
Step 1: Scope the Log Set¶
Decide which files and time ranges matter for the investigation. Use level filtering for the fastest coarse-grained scan.
kelora -j examples/simple_json.jsonl \
-l error,critical \
--since "2024-01-15T10:00:00Z" \
--until 30m
- `-l` (or `--levels`) executes as soon as it appears, so place it before heavier stages to prune work early.
- You can mix absolute ISO timestamps with relative offsets to anchor investigations (as shown above).
- Prefer explicit formats (`-j`, `-f logfmt`, `-f combined`) over `-f auto` detection to avoid surprises.
- For a directory of files, pass a glob (`logs/app/*.jsonl`) or feed a file list via `find … -print0 | xargs -0 kelora …` (see the sketch after this list).
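For the file-list route, a minimal sketch; the `logs/app` directory and `*.jsonl` pattern are placeholders for your own layout:

# Adjust the directory and pattern to your own log layout.
find logs/app -name '*.jsonl' -print0 \
| xargs -0 kelora -j -l error,critical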
Step 2: Narrow to Relevant Signals¶
Add scripted filters to isolate services, customers, or error types. Combine filter expressions with level-based filters for precision.
kelora -j examples/simple_json.jsonl \
-l error,critical \
--filter 'e.service == "orders" || e.message.contains("timeout")' \
-k timestamp,service,message
Guidance:
- Prefer `--filter` (Rhai expressions) for field checks, pattern scans, or numerical comparisons.
- Chain multiple `--filter` flags if it reads better than a long expression (see the sketch after this list).
- Use safe accessors (`e.get_path("error.code", "unknown")`) when fields may be missing.
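For example, a short sketch that chains two `--filter` flags and uses the safe accessor for a field that may be absent (the `orders` service value is taken from the earlier example and is purely illustrative):

kelora -j examples/simple_json.jsonl \
-l error,critical \
--filter 'e.service == "orders"' \
--filter 'e.get_path("error.code", "unknown") != "unknown"' \
-k timestamp,service,message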
Step 3: Pull Contextual Events¶
Show surrounding events to understand what happened before and after each error. Combine context flags with multiline handling when stack traces are present.
- `-B`/`-A` (short for `--before-context`/`--after-context`) mimic grep’s context flags. Use them sparingly to avoid flooding output (a sketch follows this list).
- If the log contains multi-line traces, run the relevant strategy from Choose a Multiline Strategy first, then apply level/context filtering.
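A minimal sketch, assuming three lines of context on each side is enough; the counts are illustrative, not a recommendation:

kelora -j examples/simple_json.jsonl \
-l error,critical \
-B 3 -A 3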
Step 4: Summarise Severity and Ownership¶
Track counts while you inspect events so you can state who is affected and how often it happens.
kelora -j examples/simple_json.jsonl \
-l error,critical \
-e 'track_count(e.service)' \
-e 'track_count(e.get_path("error.code", "unknown"))' \
--metrics \
--stats
- `track_count()` tallies per-key counts; combine with `--metrics` to view the table at the end.
- `--stats` reports records processed, parse failures, and throughput so you can mention data quality in the incident summary.
Step 5: Export Evidence¶
Produce a shareable artifact once you know which events matter.
kelora -j examples/simple_json.jsonl \
-l error,critical \
-e 'e.error_code = e.get_path("error.code", "unknown")' \
-k timestamp,service,error_code,message \
-F csv \
-o errors.csv
Alternatives:
- `-J` (or `-F json`) for structured archives (see the sketch after this list).
- `-q` or `--silent` when piping into shell scripts that only care about exit codes.
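For instance, a sketch of the JSON route (the errors.jsonl filename is just an example):

kelora -j examples/simple_json.jsonl \
-l error,critical \
-J \
-o errors.jsonl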
Variations¶
- Web server failures (see the sketch after this list)
- Search compressed archives in parallel
- Pivot by customer or shard
- Fail fast in CI
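As a sketch of the first variation, assuming `-f combined` parsing exposes the HTTP status as `e.status` (the `access.log` path and the field name are assumptions, not confirmed by this guide):

# access.log and e.status are assumptions; verify against your combined-format field names.
kelora -f combined access.log \
--filter 'e.status >= 500'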
Validate and Hand Off¶
- Inspect `--stats` output to ensure parse error counts are zero (exit code 1 indicates parse/runtime failures).
- Sample a few events with `--take 20` before exporting to confirm filters captured the right incidents (a sketch follows this list).
- Note any masking or redaction you applied so downstream teams know whether to consult raw logs.
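A quick sampling sketch to run before the export in Step 5:

kelora -j examples/simple_json.jsonl \
-l error,critical \
--take 20 \
-k timestamp,service,message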
See Also¶
- Concept: Error Handling for pipeline failure semantics.
- Build a Service Health Snapshot for ongoing health metrics.
- Design Streaming Alerts to turn this triage flow into a live alert.
- CLI Reference: Filtering for complete flag and expression details.