Parsing Custom Formats¶
Learn how to parse custom log formats using Kelora's column specification syntax.
What You'll Learn¶
By the end of this tutorial, you'll be able to:
- Parse whitespace- and separator-delimited custom logs
- Use column specifications with names, multi-token captures, and skips
- Apply supported type annotations for automatic conversion
- Handle timestamp fields and greedy captures
- Combine custom parsing with filtering and transformation
Prerequisites¶
- Completed the Quickstart
- Basic understanding of log formats
- Familiarity with command-line operations
Overview¶
Many applications generate custom log formats that don't match standard formats like JSON or syslog. Kelora's cols:<spec> format lets you define custom column specifications to parse these logs.
Time: ~15 minutes
Step 1: Understanding Column Specifications¶
The basic syntax for column parsing is:
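kelora -f 'cols:<spec>' [file...]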
Each token in the specification describes how many delimited columns to consume and which field name to assign.
Token types:
- `field` - Consume a single column into field
- `field(N)` - Consume N columns (joined with spaces or the provided separator)
- `-` / `-(N)` - Skip one or N columns entirely
- `*field` - Capture every remaining column into field (must be last)
- `field:type` - Apply a type annotation (int, float, bool, string) after extraction
Combine these building blocks to describe almost any column-based log format.
Example Log Format¶
Let's start with a simple application log:
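ERROR api timeout
INFO db connected
WARN api slow_response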
Each line has three whitespace-delimited fields: level, service, message.
Step 2: Basic Whitespace-Delimited Parsing¶
For whitespace-delimited fields, just list the field names:
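kelora -f 'cols:level service message' app.log

level='ERROR' service='api' message='timeout'
level='INFO' service='db' message='connected'
level='WARN' service='api' message='slow_response'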
How it works:
- `level` - First whitespace-delimited column
- `service` - Second column
- `message` - Third column
Step 3: Combining Multiple Columns¶
Sometimes a field spans more than one column. Use the (N) suffix to join multiple tokens.
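echo "2024-01-15 10:30:00 ERROR Connection failed to database" | \
  kelora -f 'cols:timestamp(2) level *message'

timestamp='2024-01-15 10:30:00' level='ERROR' message='Connection failed to database'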
How it works:
- `timestamp(2)` - Consumes the first two columns (2024-01-15 and 10:30:00)
- `level` - Third column
- `*message` - Everything else in the line
Step 4: Adding Type Annotations¶
Convert fields to specific types using annotations:
echo "200 1234 0.123
404 5678 0.456
500 9012 0.789" | \
kelora -f 'cols:status:int bytes:int response_time:float'
Supported types:
- `int` - Integer (i64) conversion
- `float` - Floating-point (f64) conversion
- `bool` - Boolean conversion (true/false, yes/no, 1/0)
- `string` - Explicitly keep as string (useful when mixing annotations)
After type conversion, you can use numeric operations immediately:
echo "200 1234 0.123" | \
kelora -f 'cols:status:int bytes:int response_time:float' \
--filter 'e.status >= 400'
Step 5: Handling Timestamps¶
Kelora automatically looks for common timestamp field names such as timestamp, ts, or time. You can also point it at a specific field with --ts-field and describe the format with --ts-format when needed.
echo "2024-01-15T10:30:00Z ERROR Connection failed" | \
kelora -f 'cols:timestamp level *message' --ts-field timestamp
Kelora will parse the timestamp field so that --since, --until, and timestamp-aware formatting work. If your timestamps do not include a timezone, provide --ts-format and optionally --input-tz:
kelora -f 'cols:timestamp(2) level *message' \
--ts-field timestamp \
--ts-format '%Y-%m-%d %H:%M:%S' \
--input-tz 'UTC'
Step 6: Custom Separators¶
For non-whitespace separators, use --cols-sep. The related --prefix-sep option accepts the same kind of separator string when extracting a prefix:
# Multi-character separator with prefix extraction
kelora --extract-prefix node --prefix-sep ' >>> ' \
-f 'cols:timestamp *message' \
examples/prefix_custom.log -n 5
timestamp='2024-01-15T10:00:00Z' message='Cluster node started' node='[node-1]'
timestamp='2024-01-15T10:00:05Z' message='Joining cluster at 192.168.1.100' node='[node-1]'
timestamp='2024-01-15T10:00:10Z' message='Cluster node started' node='[node-2]'
timestamp='2024-01-15T10:00:15Z' message='Joining cluster at 192.168.1.100' node='[node-2]'
timestamp='2024-01-15T10:00:20Z' message='Connected to 1 peer(s)' node='[node-1]'
Works with any separator string:
# Comma-separated
kelora -f 'cols:level service message' --cols-sep ','
# Tab-separated
kelora -f 'cols:level service message' --cols-sep $'\t'
# Multi-character separator
kelora -f 'cols:level service message' --cols-sep ' :: '
Step 7: Real-World Example - Application Logs¶
Let's parse a custom application log format:
Log format:
[2024-01-15 10:30:00] ERROR api Connection failed to database
[2024-01-15 10:30:01] INFO db Query executed successfully
[2024-01-15 10:30:02] WARN api Slow response: 2500ms
Column specification:
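cols:raw_ts(2) level service *message

Here raw_ts(2) joins the two bracketed tokens back into a single raw_ts field.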
But the timestamp is wrapped in brackets. We need to clean it up.
Using Regex in Exec¶
kelora -f 'cols:raw_ts(2) level service *message' app.log \
-e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
-e 'e.raw_ts = ()' \
--ts-field timestamp \
-k timestamp,level,service,message
What this does:
1. Parse raw_ts from the first two whitespace columns (brackets included)
2. Extract the timestamp using regex
3. Remove the temporary raw_ts field
4. Output cleaned fields
Step 8: Combining with Transformations¶
Parse a custom format and add computed fields:
cat app.log | \
kelora -f 'cols:timestamp level service *message' \
--ts-field timestamp \
-e 'if e.message.contains("ms") { e.duration = e.message.extract_re(r"(\d+)ms", 1).to_int() }' \
--filter 'e.has_path("duration") && e.duration > 1000' \
-k timestamp,service,duration,message
Pipeline:
1. Parse the custom format
2. Extract duration from the message if present
3. Filter for slow requests (>1000ms)
4. Output relevant fields
Step 9: Working with Mixed Columns¶
Some logs have consistent positions but variable content. Whitespace delimiters handle this flexibly.
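For example (the host and service field names are illustrative):

echo "2024-01-15T10:30:00Z host-a api INFO Request served
2024-01-15T10:30:01Z host-b worker WARN Retrying failed job" | \
  kelora -f 'cols:timestamp host service level *message'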
If you need to ignore specific columns, use skip tokens and multi-column captures:
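# Skip one column with '-', or several with '-(N)'
kelora -f 'cols:timestamp - level *message' app.log
# Join a two-token timestamp while skipping the next column
kelora -f 'cols:timestamp(2) - level *message' app.log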
Step 10: Complete Example¶
Let's parse a complex custom format with everything we've learned:
Input log (custom_app.log):
[2024-01-15 10:30:00] 200 api 1234 0.123 Success
[2024-01-15 10:30:01] 404 web 5678 0.456 Not found
[2024-01-15 10:30:02] 500 api 9012 0.789 Server error
Parse specification:
kelora -f 'cols:raw_ts(2) status:int service bytes:int latency:float *message' custom_app.log \
-e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
-e 'e.raw_ts = ()' \
--ts-field timestamp \
-e 'e.is_error = e.status >= 400' \
-e 'e.is_slow = e.latency > 0.5' \
--filter 'e.is_error || e.is_slow' \
-k timestamp,status,service,message
What happens:
1. Parse fields with type annotations
2. Extract the timestamp from brackets
3. Mark HTTP errors (>= 400) and slow responses
4. Filter for problematic events
5. Output cleaned fields
Common Patterns¶
Pattern 1: Log Level + Message¶
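kelora -f 'cols:level *message' app.log \
  --filter 'e.level == "ERROR"'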
Pattern 2: Timestamp + Level + Service + Message¶
kelora -f 'cols:timestamp level service *message' app.log \
--ts-field timestamp \
--filter 'e.level == "ERROR"'
Pattern 3: Type Conversion with Greedy Capture¶
kelora -f 'cols:status:int bytes:int duration:float *path' access.log \
--filter 'e.status >= 500' \
-e 'track_avg("latency", e.duration)' \
-m
Pattern 4: Extract and Transform¶
kelora -f 'cols:timestamp level *data' app.log \
--ts-field timestamp \
-e 'e.values = e.data.split(",")' \
-e 'e.count = e.values.len()' \
-k timestamp,level,count
Tips and Best Practices¶
Use Whitespace Delimiters When Possible¶
Simpler and more flexible:
# Good - whitespace delimited
kelora -f 'cols:level service message' app.log
# Skip columns rather than relying on alignment
kelora -f 'cols:level - *message' app.log
Use Type Annotations Early¶
Convert types during parsing, not in exec:
# Good - parse as int
kelora -f 'cols:status:int bytes:int' --filter 'e.status >= 400'
# Less efficient - convert in exec
kelora -f 'cols:status bytes' -e 'e.status = e.status.to_int()' --filter 'e.status >= 400'
Name Fields Descriptively¶
# Good
kelora -f 'cols:timestamp level service *message'
# Less clear
kelora -f 'cols:col1 col2 col3 col4'
Use Greedy Capture for Messages¶
Always use * for the last field if it's a free-form message:
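# Without the star, message captures a single column (values illustrative)
kelora -f 'cols:level message' app.log    # message='Connection'
# With the star, message captures the rest of the line
kelora -f 'cols:level *message' app.log   # message='Connection failed to database'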
Combine with Prefix Extraction¶
For Docker Compose-style logs:
docker compose logs | \
kelora --extract-prefix container \
-f 'cols:timestamp level *message' \
--ts-field timestamp
More Recipes to Practice¶
CSV with type annotations:
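A command along these lines produces the output below (the file name and exact spec are inferred from the output):

kelora -f 'cols:timestamp method path status:int bytes:int duration_ms:int' \
  --cols-sep ',' access.csv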
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status=200 bytes=1234 duration_ms=45
timestamp='2024-01-15T10:00:05Z' method='GET' path='/api/users' status=200 bytes=5678 duration_ms=120
timestamp='2024-01-15T10:00:10Z' method='POST' path='/api/login' status=200 bytes=234 duration_ms=67
timestamp='2024-01-15T10:00:15Z' method='GET' path='/api/users/123' status=200 bytes=890 duration_ms=34
timestamp='2024-01-15T10:00:20Z' method='PUT' path='/api/users/123' status=200 bytes=456 duration_ms=89
timestamp='2024-01-15T10:00:25Z' method='GET' path='/static/logo.png' status=200 bytes=12345 duration_ms=12
timestamp='2024-01-15T10:00:30Z' method='GET' path='/api/posts' status=200 bytes=9876 duration_ms=156
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes=0 duration_ms=5
timestamp='2024-01-15T10:00:40Z' method='POST' path='/api/posts' status=201 bytes=567 duration_ms=234
timestamp='2024-01-15T10:00:45Z' method='GET' path='/api/comments' status=200 bytes=4321 duration_ms=98
timestamp='2024-01-15T10:00:50Z' method='DELETE' path='/api/posts/456' status=204 bytes=0 duration_ms=45
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes=123 duration_ms=12
timestamp='2024-01-15T10:01:00Z' method='GET' path='/api/search' status=200 bytes=7890 duration_ms=345
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes=0 duration_ms=5
timestamp='2024-01-15T10:01:10Z' method='GET' path='/api/stats' status=200 bytes=2345 duration_ms=234
timestamp='2024-01-15T10:01:15Z' method='GET' path='/api/health' status=200 bytes=89 duration_ms=8
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes=567 duration_ms=1234
timestamp='2024-01-15T10:01:25Z' method='GET' path='/api/retry' status=200 bytes=890 duration_ms=89
timestamp='2024-01-15T10:01:30Z' method='PATCH' path='/api/users/789' status=200 bytes=345 duration_ms=123
timestamp='2024-01-15T10:01:35Z' method='GET' path='/robots.txt' status=200 bytes=234 duration_ms=4
timestamp='2024-01-15T10:01:40Z' method='GET' path='/api/feed' status=200 bytes=15678 duration_ms=456
timestamp='2024-01-15T10:01:45Z' method='POST' path='/api/logout' status=200 bytes=123 duration_ms=34
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes=567 duration_ms=7
timestamp='2024-01-15T10:01:55Z' method='GET' path='/api/metrics' status=200 bytes=3456 duration_ms=67
timestamp='2024-01-15T10:02:00Z' method='GET' path='/api/version' status=200 bytes=89 duration_ms=12
TSV with type annotations:
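Similarly inferred from the output (note the tab separator passed as $'\t'; the file name is illustrative):

kelora -f 'cols:timestamp user_id:int username action success:bool' \
  --cols-sep $'\t' audit.tsv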
timestamp='2024-01-15T10:00:00Z' user_id=101 username='alice' action='login' success=true
timestamp='2024-01-15T10:00:10Z' user_id=102 username='bob' action='login' success=true
timestamp='2024-01-15T10:00:20Z' user_id=103 username='charlie' action='login' success=false
timestamp='2024-01-15T10:01:00Z' user_id=101 username='alice' action='view_profile' success=true
timestamp='2024-01-15T10:01:30Z' user_id=104 username='diana' action='login' success=true
timestamp='2024-01-15T10:02:00Z' user_id=102 username='bob' action='edit_post' success=true
timestamp='2024-01-15T10:02:30Z' user_id=101 username='alice' action='upload_file' success=false
timestamp='2024-01-15T10:03:00Z' user_id=105 username='eve' action='login' success=true
timestamp='2024-01-15T10:03:30Z' user_id=103 username='charlie' action='reset_password' success=true
timestamp='2024-01-15T10:04:00Z' user_id=104 username='diana' action='delete_account' success=true
timestamp='2024-01-15T10:05:00Z' user_id=102 username='bob' action='logout' success=true
timestamp='2024-01-15T10:05:30Z' user_id=105 username='eve' action='view_dashboard' success=true
timestamp='2024-01-15T10:06:00Z' user_id=101 username='alice' action='logout' success=true
timestamp='2024-01-15T10:07:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:07:30Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:08:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:09:00Z' user_id=105 username='eve' action='change_settings' success=true
timestamp='2024-01-15T10:10:00Z' user_id=103 username='charlie' action='login' success=true
timestamp='2024-01-15T10:11:00Z' user_id=107 username='grace' action='login' success=true
timestamp='2024-01-15T10:12:00Z' user_id=103 username='charlie' action='logout' success=true
Columns format with multi-token timestamp:
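Inferred from the output below: ts(2) joins the separate date and time columns, and *msg captures the remainder:

kelora -f 'cols:ts(2) level *msg' app.log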
ts='2024-01-15 10:00:00' level='INFO' msg='server Application started successfully\n'
ts='2024-01-15 10:00:05' level='DEBUG' msg='config Loading configuration from file\n'
ts='2024-01-15 10:00:10' level='WARN' msg='memory Memory usage at 85% threshold\n'
ts='2024-01-15 10:00:15' level='ERROR' msg='network Connection timeout after 5000ms\n'
ts='2024-01-15 10:00:20' level='INFO' msg='api Received GET request for /users\n'
ts='2024-01-15 10:00:25' level='DEBUG' msg='cache Cache hit for key user:123\n'
ts='2024-01-15 10:00:30' level='INFO' msg='api Response sent with status 200\n'
ts='2024-01-15 10:00:35' level='CRITICAL' msg='disk Disk space below 1GB on /var\n'
ts='2024-01-15 10:00:40' level='ERROR' msg='auth Invalid credentials for admin\n'
ts='2024-01-15 10:00:45' level='WARN' msg='rate Rate limit approaching for IP\n'
Strict mode for catching malformed rows:
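The exact command isn't shown in this tutorial; a sketch assuming strict mode is enabled with a --strict flag (check kelora --help) on comma-separated rows:

# Assumes --strict rejects rows whose column count doesn't match the spec
kelora -f 'cols:timestamp level message extra' --cols-sep ',' --strict rows.csv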
timestamp='2024-01-15T10:00:00Z' level='INFO' message='Valid 4-column row' extra='data'
timestamp='2024-01-15T10:00:05Z' level='WARN' message='Missing extra column'
timestamp='2024-01-15T10:00:10Z' level='ERROR' message='Extra' extra='column'
timestamp='2024-01-15T10:00:15Z' level='INFO' message='Valid row' extra='more_data'
timestamp='2024-01-15T10:00:20Z' level='DEBUG'
timestamp='2024-01-15T10:00:25Z' level='INFO' message='Another valid' extra='row'
timestamp='2024-01-15T10:00:30Z' level='WARN' message='Quoted "comma' extra=' inside"'
timestamp='2024-01-15T10:00:35Z' level='ERROR' message='Missing' extra=''
timestamp='2024-01-15T10:00:40Z' level='INFO' message='Valid' extra='data'
timestamp='2024-01-15T10:00:45Z' level='CRITICAL' message='Final' extra='row'
Inline type annotations and strict mode are perfect for catching malformed rows during ingestion before they reach downstream systems.
Troubleshooting¶
Fields Not Parsing Correctly¶
Problem: Fields are misaligned or missing.
Solution: Check separators and column counts:
# Debug by outputting all fields
kelora -f 'cols:field1 field2 *field3' app.log -n 3
# Try different separator
kelora -f 'cols:field1 field2 field3' --cols-sep '|' app.log -n 3
Timestamp Not Recognized¶
Problem: Timestamp field not working with --since.
Solution: Tell Kelora which field holds the timestamp and provide the format if needed:
kelora -f 'cols:timestamp level *message' app.log \
--ts-field timestamp \
--ts-format '%Y-%m-%d %H:%M:%S'
Type Conversion Failures¶
Problem: Integer/float conversions failing.
Solution: Use get_path with defaults in resilient mode:
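A minimal sketch, assuming get_path takes a fallback default as its second argument:

# Parse without the :int annotation, then convert with a default (assumed signature)
kelora -f 'cols:status bytes' \
  -e 'e.status = e.get_path("status", "0").to_int()' \
  --filter 'e.status >= 400'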
Greedy Field Capturing Too Much¶
Problem: Last field includes separators.
Solution: Be specific with earlier fields:
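# Too loose: *rest swallows the level and service columns too
kelora -f 'cols:timestamp *rest' app.log
# Better: name the leading columns so the greedy capture starts after them
kelora -f 'cols:timestamp level service *message' app.log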
Next Steps¶
Now that you understand custom format parsing, explore:
- Working with Time - Advanced timestamp handling
- Format Reference - Complete format documentation
- Scripting Transforms - Advanced event transformation