# Process Archives at Scale
Crunch large log archives—compressed or not—while balancing throughput, ordering, and resource usage.
## When This Guide Helps
- Daily or monthly log archives need to be scanned for regressions or security events.
- You want to benchmark `--parallel` vs. sequential performance on your hardware.
- Pipelines must stay reliable while processing tens of gigabytes of data.
## Before You Start
- Examples use `logs/2024-*.jsonl.gz` as a placeholder. Replace it with your archive paths.
- Verify disk bandwidth and CPU capacity; the fastest settings differ between laptops and servers.
- Have a small sample ready for functional testing before running across huge datasets.
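One way to build such a sample, assuming gzip-compressed JSONL archives and standard shell tools (the path is a placeholder):

```bash
# Carve the first 1,000 events out of one archive for quick functional tests.
zcat logs/2024-04-01.jsonl.gz | head -n 1000 | gzip > logs/sample.jsonl.gz
```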
## Step 1: Measure a Sequential Baseline
Start simple to confirm filters, transformations, and outputs are correct.
time kelora -j logs/2024-04-01.jsonl.gz \
  -l error \
  -k timestamp,service,message \
  -J > errors-sample.json
- Record elapsed time, CPU usage, and memory footprint (`/usr/bin/time -v` can help; example below).
- Keep this script for future regression checks.
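A sketch of the baseline run wrapped in GNU time, assuming it is installed at `/usr/bin/time` (the `-v` flag is GNU-specific and is not provided by the shell built-in `time`):

```bash
# GNU time reports wall-clock time, CPU usage, and peak resident memory.
/usr/bin/time -v kelora -j logs/2024-04-01.jsonl.gz \
  -l error \
  -k timestamp,service,message \
  -J > errors-sample.json
```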
## Step 2: Enable Parallel Processing
Use `--parallel` to leverage multiple cores. Let Kelora auto-detect the thread count first.
time kelora -j logs/2024-04-*.jsonl.gz \
  --parallel \
  -l error \
  -e 'track_count(e.service)' \
  --metrics
- Default threads = number of logical CPU cores.
- On heavily I/O-bound workloads, try more threads with `--threads` at roughly 2× the core count; for CPU-heavy transforms, use a thread count equal to or below the core count (see the sketch below).
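A minimal sketch with an explicit thread count, assuming `--threads` combines with `--parallel` as in the example above; 16 is a placeholder standing in for 2× your own logical core count:

```bash
# Explicit thread count for an I/O-bound scan; 16 assumes an 8-core machine.
time kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --threads 16 \
  -l error \
  --metrics
```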
## Step 3: Tune Batch Size and Ordering
Batch size (`--batch-size`) controls how many events each worker processes before flushing.
Guidelines:
- 1000 (default) balances throughput and memory.
- Increase to 5000–10000 for simple filters on machines with ample RAM.
- Decrease to 200–500 when transformations are heavy or memory is constrained.
Ordering options:
- `--file-order name` for deterministic alphabetical processing.
- `--file-order mtime` to scan oldest or newest archives first.
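Combining both knobs, a sketch that pairs `--batch-size` (used again in Step 5) with modification-time ordering; the paths and batch size are placeholders:

```bash
# Larger batches suit a simple level filter; mtime ordering scans archives
# in modification-time order rather than alphabetically.
kelora -j logs/2024-*.jsonl.gz \
  --parallel --batch-size 5000 --file-order mtime \
  -l error \
  -J > errors-month.json
```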
## Step 4: Drop Ordering When Safe
If output order is irrelevant (metrics only, or exports that will be sorted afterwards), add `--unordered` for higher throughput.
kelora -j logs/2024-04-*.jsonl.gz \
  --parallel --unordered \
  -e 'track_count(e.service)' \
  -e 'track_count("errors")' \
  --metrics
- `--unordered` flushes worker buffers immediately; expect non-deterministic ordering in the output stream.
- Combine with `--metrics` or `-s` when you only care about aggregates.
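To measure what ordering costs on your hardware, time the same aggregate-only pipeline both ways; this is a sketch, and the absolute numbers depend on your archives and storage:

```bash
# Ordered run first, then unordered; compare the two elapsed times.
time kelora -j logs/2024-04-*.jsonl.gz --parallel \
  -e 'track_count(e.service)' --metrics > /dev/null
time kelora -j logs/2024-04-*.jsonl.gz --parallel --unordered \
  -e 'track_count(e.service)' --metrics > /dev/null
```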
## Step 5: Automate and Monitor
Wrap the tuned command in a script so you can schedule it via cron or CI.
#!/usr/bin/env bash
set -euo pipefail

# Current month's archives; the glob stays unquoted below so the shell expands it.
ARCHIVE_GLOB="/var/log/app/app-$(date +%Y-%m)-*.jsonl.gz"
OUTPUT="reports/errors-$(date +%Y-%m-%d).json"
mkdir -p "$(dirname "$OUTPUT")"

kelora -j $ARCHIVE_GLOB \
  --parallel --unordered --batch-size 5000 \
  -l error \
  -k timestamp,service,message \
  -J > "$OUTPUT"

kelora -j "$OUTPUT" --stats
- Log `--stats` output for traceability (processed lines, parse errors, throughput).
- Capture timing with `/usr/bin/time` or `hyperfine` to detect future regressions (example below).
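For instance, `hyperfine` averages multiple runs and reports the spread, which makes slow drifts easier to spot than a single `time` sample; the command body below is the tuned pipeline from this step:

```bash
# Benchmark the tuned command over several runs; hyperfine prints the mean,
# min/max, and standard deviation.
hyperfine --warmup 1 \
  'kelora -j logs/2024-04-*.jsonl.gz --parallel --unordered --batch-size 5000 -l error --metrics'
```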
## Variations
- Mixed compression
- Recursive discovery
- Performance sweep (see the sketch below)
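For the performance sweep, one option is `hyperfine`'s parameter scan (external tooling, not a Kelora feature), sweeping the thread count across a range:

```bash
# Try thread counts 2, 4, ..., 16 and compare the mean runtime of each.
hyperfine --warmup 1 \
  --parameter-scan threads 2 16 -D 2 \
  'kelora -j logs/2024-04-*.jsonl.gz --parallel --threads {threads} -l error --metrics'
```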
## Performance Checklist
- CPU-bound? Lower thread count or batch size to reduce contention.
- I/O-bound? Increase threads, run from fast storage, or decompress archives ahead of time.
- Memory spikes? Reduce batch size, avoid large `window_*` or `emit_each()` operations, or process archives sequentially.
- Need reproducible results? Avoid `--unordered`, process files in a consistent order, and archive the exact command used.
## See Also
- Roll Up Logs with Span Windows for time-based summaries after processing archives.
- Prepare CSV Exports for Analytics to clean flattened outputs before sharing.
- Concept: Performance Model for a deeper explanation of Kelora’s execution modes.