
Parsing Custom Formats

Learn how to parse custom log formats using Kelora's column specification syntax.

What You'll Learn

By the end of this tutorial, you'll be able to:

  • Parse whitespace- and separator-delimited custom logs
  • Use column specifications with names, multi-token captures, and skips
  • Apply supported type annotations for automatic conversion
  • Handle timestamp fields and greedy captures
  • Combine custom parsing with filtering and transformation

Prerequisites

  • Completed the Quickstart
  • Basic understanding of log formats
  • Familiarity with command-line operations

Overview

Many applications generate custom log formats that don't match standard formats like JSON or syslog. Kelora's cols:<spec> format lets you define custom column specifications to parse these logs.

Time: ~15 minutes

Step 1: Understanding Column Specifications

The basic syntax for column parsing is:

-f 'cols:field1 field2 *rest'

Each token in the specification describes how many delimited columns to consume and which field name to assign.

Token types:

  • field - Consume a single column into field
  • field(N) - Consume N columns (joined with spaces or the provided separator)
  • - / -(N) - Skip one or N columns entirely
  • *field - Capture every remaining column into field (must be last)
  • field:type - Apply a type annotation (int, float, bool, string) after extraction

Combine these building blocks to describe almost any column-based log format.
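Conceptually, these token rules amount to a small loop over split columns. Here is a hypothetical Python re-implementation for illustration only (not Kelora's actual code — for example, Kelora keeps the raw remainder for greedy captures, while this sketch rejoins columns with the separator):

```python
import re

# Hypothetical converters for the documented annotations (illustrative only)
CASTS = {"int": int, "float": float, "string": str,
         "bool": lambda s: s.strip().lower() in ("true", "yes", "1")}

def parse_cols(spec, line, sep=None):
    """Apply a cols:<spec>-style specification to one log line."""
    cols = line.split(sep)          # sep=None splits on any whitespace run
    out, i = {}, 0
    for token in spec.split():
        name, _, type_ = token.partition(":")
        m = re.fullmatch(r"(\*?)([\w-]+)(?:\((\d+)\))?", name)
        greedy, field, n = m.group(1), m.group(2), int(m.group(3) or 1)
        if greedy:                  # *field: capture every remaining column
            value, i = (sep or " ").join(cols[i:]), len(cols)
        elif field == "-":          # - / -(N): skip columns entirely
            i += n
            continue
        else:                       # field / field(N): consume N columns
            value, i = (sep or " ").join(cols[i:i + n]), i + n
        out[field] = CASTS.get(type_, str)(value)
    return out
```

For example, `parse_cols("ts(2) level *msg", "2024-01-15 10:00:00 INFO server started")` yields `ts='2024-01-15 10:00:00'`, `level='INFO'`, `msg='server started'`.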

Example Log Format

Let's start with a simple application log:

ERROR api Connection failed
INFO  db  Query executed
WARN  api Slow response time

Each line has three whitespace-delimited fields: level, service, message.

Step 2: Basic Whitespace-Delimited Parsing

For whitespace-delimited fields, just list the field names. The three-field example above parses with:

kelora -f 'cols:level service *message' app.log

The bundled example file adds a two-column timestamp, captured here with ts(2) (explained in Step 3):

kelora -f 'cols:ts(2) level *msg' examples/cols_fixed.log -n 3
ts='2024-01-15 10:00:00' level='INFO' msg='server  Application started successfully\n'
ts='2024-01-15 10:00:05' level='DEBUG' msg='config  Loading configuration from file\n'
ts='2024-01-15 10:00:10' level='WARN' msg='memory  Memory usage at 85% threshold\n'

How it works:

  • ts(2) - Joins the first two whitespace-delimited columns into one field
  • level - The next column
  • *msg - Everything remaining on the line

Step 3: Combining Multiple Columns

Sometimes a field spans more than one column. Use the (N) suffix to join multiple columns into one field.

echo "2024-01-15 10:30:00 INFO Connection failed" | \
    kelora -f 'cols:timestamp(2) level *message'

How it works:

  • timestamp(2) - Consumes the first two columns (2024-01-15 and 10:30:00)
  • level - Third column
  • *message - Everything else in the line

Step 4: Adding Type Annotations

Convert fields to specific types using annotations:

echo "200 1234 0.123
404 5678 0.456
500 9012 0.789" | \
    kelora -f 'cols:status:int bytes:int response_time:float'

Supported types:

  • int - Integer (i64) conversion
  • float - Floating-point (f64) conversion
  • bool - Boolean conversion (true/false, yes/no, 1/0)
  • string - Explicitly keep as string (useful when mixing annotations)
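The bool coercion described above can be sketched in Python; the accepted spellings are taken from the list here, and Kelora's exact behavior may differ:

```python
def to_bool(raw):
    """Coerce the documented truthy/falsy spellings to a bool."""
    value = raw.strip().lower()
    if value in ("true", "yes", "1"):
        return True
    if value in ("false", "no", "0"):
        return False
    raise ValueError(f"not a recognized boolean: {raw!r}")
```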

After type conversion, you can use numeric operations immediately:

echo "200 1234 0.123" | \
    kelora -f 'cols:status:int bytes:int response_time:float' \
    --filter 'e.status >= 400'

Step 5: Handling Timestamps

Kelora automatically looks for common timestamp field names such as timestamp, ts, or time. You can also point it at a specific field with --ts-field and describe the format with --ts-format when needed.

echo "2024-01-15T10:30:00Z ERROR Connection failed" | \
    kelora -f 'cols:timestamp level *message' --ts-field timestamp

Kelora will parse the timestamp field so that --since, --until, and timestamp-aware formatting work. If your timestamps do not include a timezone, provide --ts-format and optionally --input-tz:

kelora -f 'cols:timestamp(2) level *message' \
    --ts-field timestamp \
    --ts-format '%Y-%m-%d %H:%M:%S' \
    --input-tz 'UTC'
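What --ts-format and --input-tz accomplish conceptually can be seen with Python's strptime (an illustration of the semantics, not Kelora internals):

```python
from datetime import datetime, timezone

# '%Y-%m-%d %H:%M:%S' carries no zone info, so parsing yields a naive datetime
naive = datetime.strptime("2024-01-15 10:30:00", "%Y-%m-%d %H:%M:%S")

# An input timezone (here UTC) then anchors it to an absolute instant,
# which is what makes range filters like --since/--until meaningful
aware = naive.replace(tzinfo=timezone.utc)
```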

Step 6: Custom Separators

For non-whitespace separators, use --cols-sep. Prefix extraction has its own separator flag, --prefix-sep, shown here with a multi-character separator:

# Extract a node prefix, then parse the remaining columns
kelora --extract-prefix node --prefix-sep ' >>> ' \
    -f 'cols:timestamp *message' \
    examples/prefix_custom.log -n 5
timestamp='2024-01-15T10:00:00Z' message='Cluster node started' node='[node-1]'
timestamp='2024-01-15T10:00:05Z' message='Joining cluster at 192.168.1.100' node='[node-1]'
timestamp='2024-01-15T10:00:10Z' message='Cluster node started' node='[node-2]'
timestamp='2024-01-15T10:00:15Z' message='Joining cluster at 192.168.1.100' node='[node-2]'
timestamp='2024-01-15T10:00:20Z' message='Connected to 1 peer(s)' node='[node-1]'

Works with any separator string:

# Comma-separated
kelora -f 'cols:level service message' --cols-sep ','

# Tab-separated
kelora -f 'cols:level service message' --cols-sep $'\t'

# Multi-character separator
kelora -f 'cols:level service message' --cols-sep ' :: '
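Splitting on the full separator string, padding spaces included, keeps the fields clean — a quick Python check of the ' :: ' case:

```python
# str.split on the whole separator keeps the padding spaces out of the fields
line = "INFO :: auth :: user logged in"
level, service, message = line.split(" :: ")
```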

Step 7: Real-World Example - Application Logs

Let's parse a custom application log format:

Log format:

[2024-01-15 10:30:00] ERROR api Connection failed to database
[2024-01-15 10:30:01] INFO  db  Query executed successfully
[2024-01-15 10:30:02] WARN  api Slow response: 2500ms

Column specification:

kelora -f 'cols:raw_ts(2) level service *message' app.log

The bracketed timestamp spans two whitespace-delimited columns, so raw_ts(2) joins them — but the brackets are still attached, and we need to clean them up.

Using Regex in Exec

kelora -f 'cols:raw_ts(2) level service *message' app.log \
    -e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
    -e 'e.raw_ts = ()' \
    --ts-field timestamp \
    -k timestamp,level,service,message

What this does:

1. Parse raw_ts from the first two columns (brackets included)
2. Extract the timestamp using a regex
3. Remove the temporary raw_ts field
4. Output the cleaned fields
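The regex step can be checked in isolation. Here is the same pattern applied with Python's re module (mirroring what the extract_re call does):

```python
import re

# The non-greedy group grabs just the text between the brackets
raw_ts = "[2024-01-15 10:30:00]"
timestamp = re.search(r"\[(.*?)\]", raw_ts).group(1)
```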

Step 8: Combining with Transformations

Parse a custom format and add computed fields:

cat app.log | \
    kelora -f 'cols:timestamp level service *message' \
    --ts-field timestamp \
    -e 'if e.message.contains("ms") { e.duration = e.message.extract_re(r"(\d+)ms", 1).to_int() }' \
    --filter 'e.has_path("duration") && e.duration > 1000' \
    -k timestamp,service,duration,message

Pipeline:

1. Parse the custom format
2. Extract the duration from the message if present
3. Filter for slow requests (>1000ms)
4. Output the relevant fields
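The extract-if-present-then-filter logic of this pipeline mirrors the following Python sketch (sample messages invented for illustration):

```python
import re

messages = ["Slow response: 2500ms", "Query executed", "Retry after 800ms"]

slow = []
for msg in messages:
    m = re.search(r"(\d+)ms", msg)      # extract a duration only if present
    if m and int(m.group(1)) > 1000:    # then filter for slow requests
        slow.append(msg)
```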

Step 9: Working with Mixed Columns

Some logs have consistent positions but variable content:

ERROR  2024-01-15 Connection failed
INFO   2024-01-15 Query OK
WARN   2024-01-15 High memory usage

Use whitespace delimiters for flexibility:

kelora -f 'cols:level date message' app.log

If you need to ignore specific columns, use skip tokens and multi-column captures:

kelora -f 'cols:level - message(2)' app.log

Step 10: Complete Example

Let's parse a complex custom format with everything we've learned:

Input log (custom_app.log):

[2024-01-15 10:30:00] 200 api    1234 0.123 Success
[2024-01-15 10:30:01] 404 web    5678 0.456 Not found
[2024-01-15 10:30:02] 500 api    9012 0.789 Server error

Parse specification:

kelora -f 'cols:raw_ts(2) status:int service bytes:int latency:float *message' custom_app.log \
    -e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
    -e 'e.raw_ts = ()' \
    --ts-field timestamp \
    -e 'e.is_error = e.status >= 400' \
    -e 'e.is_slow = e.latency > 0.5' \
    --filter 'e.is_error || e.is_slow' \
    -k timestamp,status,service,message

What happens:

1. Parse the fields with type annotations
2. Extract the timestamp from the brackets
3. Mark HTTP errors (status >= 400) and slow responses (latency > 0.5)
4. Filter for problematic events
5. Output the cleaned fields
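The error/slow predicate is easy to check on sample data. A Python sketch of the same filter (values taken from the input log above):

```python
# Rows as they would look after typed parsing
rows = [
    {"status": 200, "latency": 0.123},
    {"status": 404, "latency": 0.456},
    {"status": 500, "latency": 0.789},
]

# Same predicate as --filter 'e.is_error || e.is_slow'
problems = [r for r in rows if r["status"] >= 400 or r["latency"] > 0.5]
```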

Common Patterns

Pattern 1: Log Level + Message

kelora -f 'cols:level message' app.log -l error,warn

Pattern 2: Timestamp + Level + Service + Message

kelora -f 'cols:timestamp level service *message' app.log \
    --ts-field timestamp \
    --filter 'e.level == "ERROR"'

Pattern 3: Type Conversion with Greedy Capture

kelora -f 'cols:status:int bytes:int duration:float *path' access.log \
    --filter 'e.status >= 500' \
    -e 'track_avg("latency", e.duration)' \
    -m

Pattern 4: Extract and Transform

kelora -f 'cols:timestamp level *data' app.log \
    --ts-field timestamp \
    -e 'e.values = e.data.split(",")' \
    -e 'e.count = e.values.len()' \
    -k timestamp,level,count

Tips and Best Practices

Use Whitespace Delimiters When Possible

Simpler and more flexible:

# Good - whitespace delimited
kelora -f 'cols:level service message' app.log

# Skip columns rather than relying on alignment
kelora -f 'cols:level - *message' app.log

Use Type Annotations Early

Convert types during parsing, not in exec:

# Good - parse as int
kelora -f 'cols:status:int bytes:int' --filter 'e.status >= 400'

# Less efficient - convert in exec
kelora -f 'cols:status bytes' -e 'e.status = e.status.to_int()' --filter 'e.status >= 400'

Name Fields Descriptively

# Good
kelora -f 'cols:timestamp level service *message'

# Less clear
kelora -f 'cols:col1 col2 col3 col4'

Use Greedy Capture for Messages

Always use * for the last field if it's a free-form message:

kelora -f 'cols:level service *message'

Combine with Prefix Extraction

For Docker Compose-style logs:

docker compose logs | \
    kelora --extract-prefix container \
           -f 'cols:timestamp level *message' \
           --ts-field timestamp

More Recipes to Practice

CSV with type annotations:

kelora -f "csv status:int bytes:int duration_ms:int" examples/simple_csv.csv
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status=200 bytes=1234 duration_ms=45
timestamp='2024-01-15T10:00:05Z' method='GET' path='/api/users' status=200 bytes=5678
  duration_ms=120
timestamp='2024-01-15T10:00:10Z' method='POST' path='/api/login' status=200 bytes=234 duration_ms=67
timestamp='2024-01-15T10:00:15Z' method='GET' path='/api/users/123' status=200 bytes=890
  duration_ms=34
timestamp='2024-01-15T10:00:20Z' method='PUT' path='/api/users/123' status=200 bytes=456
  duration_ms=89
timestamp='2024-01-15T10:00:25Z' method='GET' path='/static/logo.png' status=200 bytes=12345
  duration_ms=12
timestamp='2024-01-15T10:00:30Z' method='GET' path='/api/posts' status=200 bytes=9876
  duration_ms=156
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes=0 duration_ms=5
timestamp='2024-01-15T10:00:40Z' method='POST' path='/api/posts' status=201 bytes=567
  duration_ms=234
timestamp='2024-01-15T10:00:45Z' method='GET' path='/api/comments' status=200 bytes=4321
  duration_ms=98
timestamp='2024-01-15T10:00:50Z' method='DELETE' path='/api/posts/456' status=204 bytes=0
  duration_ms=45
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes=123 duration_ms=12
timestamp='2024-01-15T10:01:00Z' method='GET' path='/api/search' status=200 bytes=7890
  duration_ms=345
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes=0 duration_ms=5
timestamp='2024-01-15T10:01:10Z' method='GET' path='/api/stats' status=200 bytes=2345
  duration_ms=234
timestamp='2024-01-15T10:01:15Z' method='GET' path='/api/health' status=200 bytes=89 duration_ms=8
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes=567
  duration_ms=1234
timestamp='2024-01-15T10:01:25Z' method='GET' path='/api/retry' status=200 bytes=890 duration_ms=89
timestamp='2024-01-15T10:01:30Z' method='PATCH' path='/api/users/789' status=200 bytes=345
  duration_ms=123
timestamp='2024-01-15T10:01:35Z' method='GET' path='/robots.txt' status=200 bytes=234 duration_ms=4
timestamp='2024-01-15T10:01:40Z' method='GET' path='/api/feed' status=200 bytes=15678
  duration_ms=456
timestamp='2024-01-15T10:01:45Z' method='POST' path='/api/logout' status=200 bytes=123
  duration_ms=34
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes=567 duration_ms=7
timestamp='2024-01-15T10:01:55Z' method='GET' path='/api/metrics' status=200 bytes=3456
  duration_ms=67
timestamp='2024-01-15T10:02:00Z' method='GET' path='/api/version' status=200 bytes=89 duration_ms=12

TSV with type annotations:

kelora -f "tsv: user_id:int success:bool" examples/simple_tsv.tsv
timestamp='2024-01-15T10:00:00Z' user_id=101 username='alice' action='login' success=true
timestamp='2024-01-15T10:00:10Z' user_id=102 username='bob' action='login' success=true
timestamp='2024-01-15T10:00:20Z' user_id=103 username='charlie' action='login' success=false
timestamp='2024-01-15T10:01:00Z' user_id=101 username='alice' action='view_profile' success=true
timestamp='2024-01-15T10:01:30Z' user_id=104 username='diana' action='login' success=true
timestamp='2024-01-15T10:02:00Z' user_id=102 username='bob' action='edit_post' success=true
timestamp='2024-01-15T10:02:30Z' user_id=101 username='alice' action='upload_file' success=false
timestamp='2024-01-15T10:03:00Z' user_id=105 username='eve' action='login' success=true
timestamp='2024-01-15T10:03:30Z' user_id=103 username='charlie' action='reset_password' success=true
timestamp='2024-01-15T10:04:00Z' user_id=104 username='diana' action='delete_account' success=true
timestamp='2024-01-15T10:05:00Z' user_id=102 username='bob' action='logout' success=true
timestamp='2024-01-15T10:05:30Z' user_id=105 username='eve' action='view_dashboard' success=true
timestamp='2024-01-15T10:06:00Z' user_id=101 username='alice' action='logout' success=true
timestamp='2024-01-15T10:07:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:07:30Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:08:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:09:00Z' user_id=105 username='eve' action='change_settings' success=true
timestamp='2024-01-15T10:10:00Z' user_id=103 username='charlie' action='login' success=true
timestamp='2024-01-15T10:11:00Z' user_id=107 username='grace' action='login' success=true
timestamp='2024-01-15T10:12:00Z' user_id=103 username='charlie' action='logout' success=true

Columns format with multi-token timestamp:

kelora -f "cols:ts(2) level *msg:string" examples/cols_fixed.log
ts='2024-01-15 10:00:00' level='INFO' msg='server  Application started successfully\n'
ts='2024-01-15 10:00:05' level='DEBUG' msg='config  Loading configuration from file\n'
ts='2024-01-15 10:00:10' level='WARN' msg='memory  Memory usage at 85% threshold\n'
ts='2024-01-15 10:00:15' level='ERROR' msg='network Connection timeout after 5000ms\n'
ts='2024-01-15 10:00:20' level='INFO' msg='api     Received GET request for /users\n'
ts='2024-01-15 10:00:25' level='DEBUG' msg='cache   Cache hit for key user:123\n'
ts='2024-01-15 10:00:30' level='INFO' msg='api     Response sent with status 200\n'
ts='2024-01-15 10:00:35' level='CRITICAL' msg='disk    Disk space below 1GB on /var\n'
ts='2024-01-15 10:00:40' level='ERROR' msg='auth    Invalid credentials for admin\n'
ts='2024-01-15 10:00:45' level='WARN' msg='rate    Rate limit approaching for IP\n'

Strict mode for catching malformed rows:

kelora -f "csv status:int" --strict examples/errors_csv_ragged.csv
timestamp='2024-01-15T10:00:00Z' level='INFO' message='Valid 4-column row' extra='data'
timestamp='2024-01-15T10:00:05Z' level='WARN' message='Missing extra column'
timestamp='2024-01-15T10:00:10Z' level='ERROR' message='Extra' extra='column'
timestamp='2024-01-15T10:00:15Z' level='INFO' message='Valid row' extra='more_data'
timestamp='2024-01-15T10:00:20Z' level='DEBUG'
timestamp='2024-01-15T10:00:25Z' level='INFO' message='Another valid' extra='row'
timestamp='2024-01-15T10:00:30Z' level='WARN' message='Quoted "comma' extra=' inside"'
timestamp='2024-01-15T10:00:35Z' level='ERROR' message='Missing' extra=''
timestamp='2024-01-15T10:00:40Z' level='INFO' message='Valid' extra='data'
timestamp='2024-01-15T10:00:45Z' level='CRITICAL' message='Final' extra='row'

Inline type annotations and strict mode are perfect for catching malformed rows during ingestion before they reach downstream systems.

Troubleshooting

Fields Not Parsing Correctly

Problem: Fields are misaligned or missing.

Solution: Check separators and column counts:

# Debug by outputting all fields
kelora -f 'cols:field1 field2 *field3' app.log -n 3

# Try different separator
kelora -f 'cols:field1 field2 field3' --cols-sep '|' app.log -n 3

Timestamp Not Recognized

Problem: Timestamp field not working with --since.

Solution: Tell Kelora which field holds the timestamp and provide the format if needed:

kelora -f 'cols:timestamp level *message' app.log \
    --ts-field timestamp \
    --ts-format '%Y-%m-%d %H:%M:%S'

Type Conversion Failures

Problem: Integer/float conversions failing.

Solution: Convert defensively with a helper that falls back to a default on failure:

kelora -f 'cols:status bytes' -e 'e.status_int = to_int_or(e.status, 0)'

Greedy Field Capturing Too Much

Problem: Last field includes separators.

Solution: Be specific with earlier fields:

kelora -f 'cols:level service message(2) *remainder'

Next Steps

Now that you understand custom format parsing, explore: