Process Archives at Scale

Crunch large log archives—compressed or not—while balancing throughput, ordering, and resource usage.

When This Guide Helps

  • Daily or monthly log archives need to be scanned for regressions or security events.
  • You want to benchmark --parallel vs sequential performance on your hardware.
  • Pipelines must stay reliable while processing tens of gigabytes of data.

Before You Start

  • Examples use logs/2024-*.jsonl.gz as a placeholder. Replace with your archive paths.
  • Verify disk bandwidth and CPU capacity; the fastest settings differ between laptops and servers.
  • Have a small sample ready for functional testing before running across huge datasets (see the sketch below).
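
One way to carve out such a sample from a compressed archive, using only standard tools (paths are placeholders):

zcat logs/2024-04-01.jsonl.gz | head -n 1000 | gzip > sample.jsonl.gz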

Step 1: Measure a Sequential Baseline

Start simple to confirm filters, transformations, and outputs are correct.

time kelora -j logs/2024-04-01.jsonl.gz \
  -l error \
  -k timestamp,service,message \
  -J > errors-sample.json

  • Record elapsed time, CPU usage, and memory footprint (/usr/bin/time -v can help; see the example after this list).
  • Keep this script for future regression checks.
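
For example, GNU time's verbose mode reports peak memory alongside elapsed time (on macOS, the bundled BSD time takes -l instead of -v):

/usr/bin/time -v kelora -j logs/2024-04-01.jsonl.gz \
  -l error \
  -k timestamp,service,message \
  -J > errors-sample.json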

Step 2: Enable Parallel Processing

Use --parallel to leverage multiple cores; let Kelora auto-detect the thread count first.

time kelora -j logs/2024-04-*.jsonl.gz \
  --parallel \
  -l error \
  -e 'track_count(e.service)' \
  --metrics

  • By default, the thread count equals the number of logical CPU cores.
  • For heavily I/O-bound workloads, try --threads at roughly twice the core count; for CPU-heavy transforms, keep threads at or below the core count (see the sketch below).
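
As a sketch, the thread count can also be set explicitly with --threads (used again under Variations); on Linux, nproc reports the logical core count:

time kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --threads "$(( $(nproc) * 2 ))" \
  -l error \
  --metrics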

Step 3: Tune Batch Size and Ordering

Batch size controls how many events each worker processes before flushing.

kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --batch-size 5000 \
  -l error \
  --stats

Guidelines:

  • 1000 (default) balances throughput and memory.
  • Increase to 5000–10000 for simple filters on machines with ample RAM.
  • Decrease to 200–500 when transformations are heavy or memory is constrained.

Ordering options:

  • --file-order name for deterministic alphabetical processing.
  • --file-order mtime to process archives by modification time, so the oldest or newest data is scanned first (example below).
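
For instance, a modification-time-ordered pass over a month of archives (a sketch combining the flags above):

kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --file-order mtime \
  -l error \
  --stats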

Step 4: Drop Ordering When Safe

If output order is irrelevant (metrics-only runs, or exports that get sorted downstream), add --unordered for higher throughput.

kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --unordered \
  -e 'track_count(e.service)' \
  -e 'track_count("errors")' \
  --metrics

  • --unordered flushes worker buffers immediately; expect non-deterministic ordering in the output stream.
  • Combine with --metrics or -s when you only care about aggregates.
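
To quantify the gain on your hardware, time the same aggregation with and without --unordered:

time kelora -j logs/2024-04-*.jsonl.gz --parallel \
  -e 'track_count(e.service)' --metrics

time kelora -j logs/2024-04-*.jsonl.gz --parallel --unordered \
  -e 'track_count(e.service)' --metrics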

Step 5: Automate and Monitor

Wrap the tuned command in a script so you can schedule it via cron or CI.

#!/usr/bin/env bash
set -euo pipefail

ARCHIVE_GLOB="/var/log/app/app-$(date +%Y-%m)-*.jsonl.gz"
OUTPUT="reports/errors-$(date +%Y-%m-%d).json"

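# ARCHIVE_GLOB is deliberately unquoted so the shell expands the glob into file arguments.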
kelora -j $ARCHIVE_GLOB \
  --parallel --unordered --batch-size 5000 \
  -l error \
  -k timestamp,service,message \
  -J > "$OUTPUT"

kelora -j "$OUTPUT" --stats

  • Log --stats output for traceability (processed lines, parse errors, throughput).
  • Capture timing with /usr/bin/time or hyperfine to detect future regressions.
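
A minimal hyperfine run over the tuned command (assuming -q suppresses normal output, as in the performance sweep below):

hyperfine --warmup 1 \
  'kelora -j logs/2024-04-*.jsonl.gz --parallel --unordered --batch-size 5000 -l error -q'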

Variations

  • Mixed compression

    kelora -j logs/2024-04-*.jsonl logs/2024-04-*.jsonl.gz \
      --parallel --threads 8 \
      -l critical \
      --stats
    

  • Recursive discovery

    find /archives/app -name "*.jsonl.gz" -print0 |
      xargs -0 kelora -j --parallel --unordered \
        -e 'track_count(e.service)' \
        --metrics
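    # Note: very long file lists make xargs split the work across several
    # kelora invocations, which fragments the aggregated metrics.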
    

  • Performance sweep

    for size in 500 1000 5000; do
      echo "Batch size $size"
      time kelora -j logs/2024-04-*.jsonl.gz --parallel --batch-size $size -q
    done
    

Performance Checklist

  • CPU-bound? Lower thread count or batch size to reduce contention.
  • I/O-bound? Increase threads, run from fast storage, or decompress archives ahead of time.
  • Memory spikes? Reduce batch size, avoid large window_* or emit_each() operations, or process archives sequentially.
  • Need reproducible results? Avoid --unordered, process files in a consistent order, and archive the command used.

See Also