Processing Architecture

Understanding how Kelora processes logs through its multi-layer architecture.

Overview

Kelora's processing model consists of three distinct layers operating on different data types:

  1. Input Layer - File/stdin handling and decompression
  2. Line-Level Processing - Raw string filtering and event boundary detection
  3. Event-Level Processing - Structured data transformation and output

This layered architecture enables efficient streaming with low memory usage while supporting both sequential and parallel processing modes.


Layer 1: Input Layer

Input Sources

Stdin Mode:

  • Activated when no files specified or file is "-"
  • Background thread reads from stdin via channel
  • Supports one stdin source (error if "-" appears multiple times)
  • Useful for piping: tail -f app.log | kelora -j

File Mode:

  • Processes one or more files sequentially
  • Tracks current filename for context
  • Supports --file-order for processing sequence:
    - cli (default) - Process in CLI argument order
    - name - Sort alphabetically
    - mtime - Sort by modification time (oldest first)

Examples:

# Stdin mode
tail -f app.log | kelora -j

# File mode with ordering
kelora *.log --file-order mtime

# Mixed stdin and files
kelora file1.log - file2.log  # stdin in middle

Automatic Decompression

Kelora automatically detects and decompresses compressed input using magic-byte detection (not file extensions):

Supported Formats:

  • Gzip - Magic bytes 1F 8B 08 (.gz files or gzipped stdin)
  • Zstd - Magic bytes 28 B5 2F FD (.zst files or zstd stdin)
  • Plain - No magic bytes, passthrough

Behavior:

  • Transparent decompression before any processing
  • Works on both files and stdin
  • ZIP files explicitly rejected with error message
  • Decompression happens in Input Layer

Examples:

kelora app.log.gz                    # Auto-detected gzip
kelora app.log.zst --parallel        # Auto-detected zstd
gzip -c app.log | kelora -j          # Gzipped stdin

Reader Threading

Sequential Mode:

  • Spawns background reader thread
  • Sends lines via bounded channel (1024-line buffer)
  • Main thread processes lines one at a time
  • Supports multiline timeout flush (default: 200ms)

Parallel Mode:

  • Reader batches lines (default: 1000 lines, 200ms timeout)
  • Worker pool processes batches concurrently
  • No cross-batch state (impacts multiline, spans)


Layer 2: Line-Level Processing

Operations on raw string lines before parsing into events.

Line Skipping (--skip-lines)

Skip the first N lines of input (useful for CSV headers or preambles).

kelora data.csv --skip-lines 1

Line Filtering (--ignore-lines, --keep-lines)

Regex-based filtering on raw lines before parsing:

  • --ignore-lines <REGEX> - Skip lines matching pattern
  • --keep-lines <REGEX> - Keep only lines matching pattern

Resilient mode: Skip non-matching lines and continue processing.
Strict mode: Abort on regex error.

# Ignore health checks before parsing
kelora access.log --ignore-lines 'health-check'

# Keep only lines starting with timestamp
kelora app.log --keep-lines '^\d{4}-\d{2}-\d{2}'

Section Selection

Extract specific sections from logs based on start/end markers:

Flags:

  • --section-start <REGEX> - Begin section (default: exclude marker)
  • --section-from <REGEX> - Begin section (include marker)
  • --section-through <REGEX> - End section (include marker)
  • --section-before <REGEX> - End section (exclude marker)
  • --max-sections <N> - Limit number of sections

State Machine:

NotStarted → (match start) → InSection → (match end) → BetweenSections → ...

Example:

# Extract sections between markers
kelora system.log \
    --section-from '=== Test Started ===' \
    --section-through '=== Test Completed ==='
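
The exclude-marker variants compose the same way. A minimal sketch, assuming the flags behave as listed above, that drops both marker lines from the output and stops after two sections:

# Extract up to two sections, excluding the marker lines themselves
kelora system.log \
    --section-start '=== Test Started ===' \
    --section-before '=== Test Completed ===' \
    --max-sections 2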

Event Aggregation (Multiline)

Detects event boundaries to combine multiple lines into single events before parsing.

Four Strategies:

1. Timestamp Strategy (auto-detect timestamp headers)

kelora app.log --multiline timestamp

Detects lines starting with timestamps as new events. Continuation lines (stack traces, wrapped messages) are appended to the current event.

2. Indent Strategy (whitespace continuation)

kelora app.log --multiline indent

Lines starting with whitespace are continuations of the previous event.

3. Regex Strategy (custom patterns)

kelora app.log \
    --multiline regex \
    --multiline-start '^\[' \
    --multiline-end '^\['

Define custom start/end patterns for event boundaries.

4. All Strategy (entire input as one event)

kelora config.json --multiline all

Buffers the entire input as a single event (use for structured files).

Multiline Timeout:

  • Sequential mode: Flush incomplete events after timeout (default: 200ms)
  • Parallel mode: Flush at batch boundaries (no timeout)

Critical: Multiline creates event boundaries before parsing. Each complete event string is then parsed into structured data.


Layer 3: Event-Level Processing

Operations on parsed events (maps/objects).

Parsing

Convert complete event strings into structured maps:

Event string → Parser → Event map (e.field accessible)

Parsers: json, logfmt, syslog, combined, csv, tsv, cols, etc.

Script Stages (Pipeline Core)

User script stages execute in exact CLI order:

  • --filter <EXPR> - Boolean filter (true = keep, false = skip)
  • --exec <SCRIPT> - Transform/process event
  • --exec-file <PATH> - Execute script from file (alias: -E)

Multiple instances are allowed and may be intermixed in any order.

Example:

# Stage 1: keep 4xx/5xx; Stage 2: flag them; Stage 3: narrow to 4xx; Stage 4: track 4xx paths
kelora -j app.log \
    --filter 'e.status >= 400' \
    --exec 'e.alert = true' \
    --filter 'e.status < 500' \
    --exec 'track_count(e.path)'

Each stage processes the output of the previous stage sequentially.

Complete Stage Ordering

After user script stages, built-in filters execute:

  1. User script stages (--filter, --exec in CLI order)
  2. Timestamp filtering (--since, --until)
  3. Level filtering (--levels, --exclude-levels)
  4. Key filtering (--keys, --exclude-keys)

Note: User filters execute before built-in filtering like --levels.
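
For instance, in the following command the user filter sees every parsed event, and the built-in --levels filter then narrows its output (a sketch assuming events carry a service field):

kelora -j app.log \
    --filter 'e.service == "api"' \
    --levels error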

Span Processing

Groups events into spans for aggregation:

Count-based Spans:

kelora -j app.log --span 100 \
    --span-close 'print("Span complete: " + meta.span_id)'

Closes the span every N events that pass filters.

Time-based Spans:

kelora -j app.log --span 5m \
    --span-close 'track_sum("requests", meta.span_event_count)'

Closes the span on aligned time windows (5m, 1h, 30s, etc.).

Span Processing Flow:

  1. Event passes through filters/execs
  2. Span processor assigns span_id and SpanStatus
  3. Event processed with span context
  4. When span closes → --span-close hook executes
  5. Hook has access to meta.span_id, meta.span_start, meta.span_end, metrics

Constraints:

  • Spans force sequential mode (incompatible with --parallel)
  • Span state maintained across events
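
A minimal close-hook sketch using the documented meta fields (meta.span_start and meta.span_end are also available for reporting window boundaries):

kelora -j app.log --span 15m \
    --span-close 'print("span " + meta.span_id + " closed with "
        + meta.span_event_count + " events")'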

Begin and End Stages

--begin: Execute once before processing any events
--end: Execute once after all events processed

kelora -j app.log \
    --begin 'print("Starting analysis")' \
    --exec 'track_count(e.service)' \
    --end 'print("Services seen: " + metrics.count.len())' \
    --metrics

In parallel mode:

  • --begin runs sequentially before worker pool starts
  • --end runs sequentially after workers complete (with merged metrics)

Context Lines

Show surrounding lines around matches:

  • --before-context N / -B N - Show N lines before match
  • --after-context N / -A N - Show N lines after match
  • --context N / -C N - Show N lines before and after

Requires active filtering (--filter, --levels, --since, etc.).

kelora -j app.log \
    --filter 'e.level == "ERROR"' \
    --before-context 2 \
    --after-context 2

Output Stage

Format and emit events:

  • Apply --keys field selection
  • Convert timestamps (--convert-ts, --show-ts-local, --show-ts-utc)
  • Format output (--output-format: default, json, csv, etc.)
  • Apply --take limit
  • Write to stdout or files

kelora -j app.log \
    --keys timestamp,service,message \
    -F json \
    --take 100

Parallel Processing Model

Kelora's --parallel mode is batch-parallel, not stage-parallel.

Architecture

Sequential:  Line → Line filters → Multiline → Parse → Script stages → Output
             (one at a time)

Parallel:    Batch of lines → Worker pool
             Each worker: Line filters → Multiline → Parse → Script stages
             Results → Ordering buffer → Output

Where:

  • Line filters = --skip-lines, --ignore-lines, --section-start, etc.
  • Multiline = Event boundary detection (aggregates multiple lines into events)
  • Script stages = --filter and --exec in CLI order

How It Works

  1. Reader thread batches lines (default: 1000 lines, 200ms timeout)
  2. Worker pool processes batches independently (default: CPU count workers)
  3. Each worker has its own Pipeline instance
  4. Results merged with ordering preservation (default) or unordered (--unordered)
  5. Stats/metrics merged from all workers

Configuration:

kelora -j large.log \
    --parallel \
    --threads 8 \
    --batch-size 2000 \
    --batch-timeout 500

Constraints and Tradeoffs

Incompatible Features:

  • Spans - Cannot maintain span state across batches (forces sequential)
  • Cross-event context - Each batch processed independently

Multiline Behavior:

  • Multiline chunking happens per batch
  • Event boundaries may not span batch boundaries
  • Consider larger batch sizes for multiline workloads

Ordering:

  • Default: Preserve input order (adds overhead)
  • --unordered: Trade ordering for maximum throughput
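
When output order doesn't matter (for example, when results feed an aggregation anyway), --unordered removes the reordering overhead; a sketch:

kelora -j large.log --parallel --unordered \
    --filter 'e.status >= 500'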

Best For:

  • Large files with independent events
  • CPU-bound transformations (regex, hashing, calculations)
  • High-throughput batch processing

Not Ideal For:

  • Real-time streaming (use sequential)
  • Cross-event analysis (use spans in sequential mode)
  • Small files (overhead exceeds benefit)


Metrics and Statistics

Kelora maintains two tracking systems:

User Metrics (--metrics)

Populated by Rhai functions in --exec scripts:

kelora -j app.log \
    --exec 'track_count(e.service)' \
    --exec 'track_sum("total_bytes", e.bytes)' \
    --exec 'track_avg("response_time", e.duration_ms)' \
    --exec 'track_unique("users", e.user_id)' \
    --metrics

Available Functions:

  • track_count(key) - Increment counter
  • track_inc(key, value) - Add to counter
  • track_sum(key, value) - Sum values
  • track_avg(key, value) - Calculate average
  • track_unique(key, value) - Collect unique values

Access in --end stage:

kelora -j app.log \
    --exec 'track_count(e.service)' \
    --end 'print("Total services: " + metrics.count.len())' \
    --metrics

Output:

  • Printed to stderr with --metrics
  • Written to JSON file with --metrics-file metrics.json
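
Both sinks can be used in one run; a sketch that prints metrics to stderr and also writes them to a file:

kelora -j app.log \
    --exec 'track_count(e.service)' \
    --metrics \
    --metrics-file metrics.json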

Internal Statistics (--stats)

Auto-collected counters:

  • events_created - Parsed events
  • events_output - Output events
  • events_filtered - Filtered events
  • discovered_levels - Log levels seen
  • discovered_keys - Field names seen
  • Parse errors, filter errors, etc.

kelora -j app.log --stats

Parallel Metrics Merging

In parallel mode:

  • Each worker maintains local tracking state
  • GlobalTracker merges worker states after processing:
    - Counters: summed
    - Unique sets: unioned
    - Averages: recomputed from sums and counts
  • Merged metrics available in --end stage
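
Because merging completes before --end runs, the same tracking pattern works unchanged in parallel mode; a sketch, assuming events carry a host field:

kelora -j large.log --parallel \
    --exec 'track_count(e.host)' \
    --end 'print("hosts seen: " + metrics.count.len())' \
    --metrics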


Error Handling

Resilient Mode (Default)

  • Parse errors: Skip line, continue processing
  • Filter errors: Treat as false, skip event
  • Transform errors: Return original event unchanged
  • Summary: Show error count at end

kelora -j app.log --verbose  # Show errors as they occur

Strict Mode (--strict)

  • Any error: Abort immediately with exit code 1
  • No summary: Program exits on first error

kelora -j app.log --strict

Verbosity Levels

  • -v / --verbose - Show detailed errors (level 1)
  • -vv - More verbose (level 2)
  • -vvv - Maximum verbosity (level 3)

Quiet Modes

  • -q - Suppress diagnostics
  • -qq - Suppress diagnostics and events
  • -qqq - Suppress diagnostics, events, and script output

Complete Data Flow

┌─────────────────────────────────────────┐
│  Layer 1: Input                         │
├─────────────────────────────────────────┤
│  • Stdin or Files (--file-order)        │
│  • Automatic decompression (gzip/zstd)  │
│  • Reader thread spawning               │
└──────────────┬──────────────────────────┘
               │ Raw lines
┌──────────────▼──────────────────────────┐
│  Layer 2: Line-Level Processing         │
├─────────────────────────────────────────┤
│  • --skip-lines (skip first N)          │
│  • --section-start/through (sections)   │
│  • --ignore-lines/--keep-lines (regex)  │
│  • Multiline chunker (event boundaries) │
└──────────────┬──────────────────────────┘
               │ Complete event strings
┌──────────────▼──────────────────────────┐
│  Layer 3: Event-Level Processing        │
├─────────────────────────────────────────┤
│  • Parser → Event map                   │
│  • Span preparation (assign span_id)    │
│  • Script stages (--filter/--exec)      │
│    - User stages in CLI order           │
│    - Timestamp filtering (--since)      │
│    - Level filtering (--levels)         │
│    - Key filtering (--keys)             │
│  • Span close hooks (--span-close)      │
│  • Output formatting                    │
└──────────────┬──────────────────────────┘
               │ Formatted output
           stdout/files

Parallel Mode Differences:

Line batching (1000 lines) → Worker pool
Each worker independently:
  - Line-level processing
  - Event-level processing
Results → Ordering buffer → Merged output
Metrics → GlobalTracker → Merged stats


Performance Characteristics

Streaming

  • Low memory usage - Events processed and discarded
  • Real-time capable - Works with tail -f and live streams
  • No lookahead - Cannot access future events (except with --window)

Sequential vs Parallel

Sequential (default):

  • Events processed in order
  • Lower memory usage
  • Predictable output order
  • Supports spans and cross-event state
  • Best for streaming and interactive use

Parallel (--parallel):

  • Events processed in batches across cores
  • Higher throughput for CPU-bound work
  • Higher memory usage (batching + worker pools)
  • Limited cross-event features
  • Best for batch processing large files

Optimization Tips

Early filtering:

# Good: Cheap filters first
kelora -j app.log \
    --levels error \
    --filter 'e.message.has_matches(r"expensive.*regex")'

# Less efficient: Expensive filter on all events
kelora -j app.log \
    --filter 'e.message.has_matches(r"expensive.*regex")' \
    --levels error

Use --keys to reduce output processing:

kelora -j app.log --keys timestamp,message -F json

Parallel for CPU-bound transformations:

kelora -j large.log \
    --parallel \
    --exec 'e.hash = e.content.hash("sha256")' \
    --batch-size 1000

Use --take for quick exploration:

kelora -j large.log --take 100


See Also