Parsing Custom Formats¶
Learn how to parse custom log formats using Kelora's column specification syntax.
What You'll Learn¶
By the end of this tutorial, you'll be able to:
- Parse whitespace- and separator-delimited custom logs
- Use column specifications with names, multi-token captures, and skips
- Apply supported type annotations for automatic conversion
- Handle timestamp fields and greedy captures
- Combine custom parsing with filtering and transformation
Prerequisites¶
- Completed the Quickstart
- Basic understanding of log formats
- Familiarity with command-line operations
Overview¶
Many applications generate custom log formats that don't match standard formats like JSON or syslog. Kelora's cols:<spec> format lets you define custom column specifications to parse these logs.
Time: ~15 minutes
Step 1: Understanding Column Specifications¶
The basic syntax for column parsing is:
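kelora -f 'cols:<spec>' [file...]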
Each token in the specification describes how many delimited columns to consume and which field name to assign.
Token types:
- `field` - Consume a single column into field
- `field(N)` - Consume N columns (joined with spaces or the provided separator)
- `-` / `-(N)` - Skip one or N columns entirely
- `*field` - Capture every remaining column into field (must be last)
- `field:type` - Apply a type annotation (int, float, bool, string) after extraction
Combine these building blocks to describe almost any column-based log format.
Example Log Format¶
Let's start with a simple application log:
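ERROR api timeout
INFO db connected
WARN api slow_response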
Each line has three whitespace-delimited fields: level, service, message.
Step 2: Basic Whitespace-Delimited Parsing¶
For whitespace-delimited fields, just list the field names:
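kelora -f 'cols:level service message' app.log

level='ERROR' service='api' message='timeout'
level='INFO' service='db' message='connected'
level='WARN' service='api' message='slow_response'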
How it works:
- `level` - First whitespace-delimited column
- `service` - Second column
- `message` - Third column
Step 3: Combining Multiple Columns¶
Sometimes a field spans more than one column. Use the (N) suffix to join multiple tokens.
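echo "2024-01-15 10:30:00 ERROR Connection failed to database" | \
  kelora -f 'cols:timestamp(2) level *message'

timestamp='2024-01-15 10:30:00' level='ERROR' message='Connection failed to database'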
How it works:
- `timestamp(2)` - Consumes the first two columns (2024-01-15 and 10:30:00)
- `level` - Third column
- `*message` - Everything else in the line
Step 4: Adding Type Annotations¶
Convert fields to specific types using annotations:
echo "200 1234 0.123
404 5678 0.456
500 9012 0.789" | \
kelora -f 'cols:status:int bytes:int response_time:float'
Supported types:
- `int` - Integer (i64) conversion
- `float` - Floating-point (f64) conversion
- `bool` - Boolean conversion (true/false, yes/no, 1/0)
- `string` - Explicitly keep as string (useful when mixing annotations)
After type conversion, you can use numeric operations immediately:
echo "200 1234 0.123" | \
kelora -f 'cols:status:int bytes:int response_time:float' \
--filter 'e.status >= 400'
Step 5: Handling Timestamps¶
Kelora automatically looks for common timestamp field names such as timestamp, ts, or time. You can also point it at a specific field with --ts-field and describe the format with --ts-format when needed.
echo "2024-01-15T10:30:00Z ERROR Connection failed" | \
kelora -f 'cols:timestamp level *message' --ts-field timestamp
Kelora will parse the timestamp field so that --since, --until, and timestamp-aware formatting work. If your timestamps do not include a timezone, provide --ts-format and optionally --input-tz:
kelora -f 'cols:timestamp(2) level *message' \
--ts-field timestamp \
--ts-format '%Y-%m-%d %H:%M:%S' \
--input-tz 'UTC'
Step 6: Custom Separators¶
For non-whitespace separators, use --cols-sep. The related --prefix-sep option accepts the same kind of separator string when extracting a prefix:
# Multi-character separator with prefix extraction
kelora --extract-prefix node --prefix-sep ' >>> ' \
-f 'cols:timestamp *message' \
examples/prefix_custom.log -n 5
timestamp='2024-01-15T10:00:00Z' message='Cluster node started' node='[node-1]'
timestamp='2024-01-15T10:00:05Z' message='Joining cluster at 192.168.1.100' node='[node-1]'
timestamp='2024-01-15T10:00:10Z' message='Cluster node started' node='[node-2]'
timestamp='2024-01-15T10:00:15Z' message='Joining cluster at 192.168.1.100' node='[node-2]'
timestamp='2024-01-15T10:00:20Z' message='Connected to 1 peer(s)' node='[node-1]'
Works with any separator string:
# Comma-separated
kelora -f 'cols:level service message' --cols-sep ','
# Tab-separated
kelora -f 'cols:level service message' --cols-sep $'\t'
# Multi-character separator
kelora -f 'cols:level service message' --cols-sep ' :: '
Step 7: Real-World Example - Application Logs¶
Let's parse a custom application log format:
Log format:
[2024-01-15 10:30:00] ERROR api Connection failed to database
[2024-01-15 10:30:01] INFO db Query executed successfully
[2024-01-15 10:30:02] WARN api Slow response: 2500ms
Column specification:
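cols:raw_ts(2) level service *message

Here raw_ts(2) joins the two bracketed tokens back into a single raw_ts field.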
But the timestamp is wrapped in brackets. We need to clean it up.
Using Regex in Exec¶
kelora -f 'cols:raw_ts(2) level service *message' app.log \
-e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
-e 'e.raw_ts = ()' \
--ts-field timestamp \
-k timestamp,level,service,message
What this does:
1. Parse raw_ts from the first two whitespace columns (brackets included)
2. Extract the timestamp using regex
3. Remove the temporary raw_ts field
4. Output cleaned fields
Step 8: Combining with Transformations¶
Parse a custom format and add computed fields:
cat app.log | \
kelora -f 'cols:timestamp level service *message' \
--ts-field timestamp \
-e 'if e.message.contains("ms") { e.duration = e.message.extract_re(r"(\d+)ms", 1).to_int() }' \
--filter 'e.has_path("duration") && e.duration > 1000' \
-k timestamp,service,duration,message
Pipeline:
1. Parse the custom format
2. Extract duration from the message if present
3. Filter for slow requests (>1000ms)
4. Output relevant fields
Step 9: Working with Mixed Columns¶
Some logs have consistent positions but variable content. Whitespace delimiters handle this flexibly.
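For example (the host and service field names are illustrative):

echo "2024-01-15T10:30:00Z host-a api INFO Request served
2024-01-15T10:30:01Z host-b worker WARN Retrying failed job" | \
  kelora -f 'cols:timestamp host service level *message'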
If you need to ignore specific columns, use skip tokens and multi-column captures:
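# Skip one column with '-', or several with '-(N)'
kelora -f 'cols:timestamp - level *message' app.log
# Join a two-token timestamp while skipping the next column
kelora -f 'cols:timestamp(2) - level *message' app.log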
Step 10: Complete Example¶
Let's parse a complex custom format with everything we've learned:
Input log (custom_app.log):
[2024-01-15 10:30:00] 200 api 1234 0.123 Success
[2024-01-15 10:30:01] 404 web 5678 0.456 Not found
[2024-01-15 10:30:02] 500 api 9012 0.789 Server error
Parse specification:
kelora -f 'cols:raw_ts(2) status:int service bytes:int latency:float *message' custom_app.log \
-e 'e.timestamp = e.raw_ts.extract_re(r"\[(.*?)\]", 1)' \
-e 'e.raw_ts = ()' \
--ts-field timestamp \
-e 'e.is_error = e.status >= 400' \
-e 'e.is_slow = e.latency > 0.5' \
--filter 'e.is_error || e.is_slow' \
-k timestamp,status,service,message
What happens:
1. Parse fields with type annotations
2. Extract the timestamp from brackets
3. Mark HTTP errors (>= 400) and slow responses
4. Filter for problematic events
5. Output cleaned fields
Common Patterns¶
Pattern 1: Log Level + Message¶
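kelora -f 'cols:level *message' app.log \
  --filter 'e.level == "ERROR"'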
Pattern 2: Timestamp + Level + Service + Message¶
kelora -f 'cols:timestamp level service *message' app.log \
--ts-field timestamp \
--filter 'e.level == "ERROR"'
Pattern 3: Type Conversion with Greedy Capture¶
kelora -f 'cols:status:int bytes:int duration:float *path' access.log \
--filter 'e.status >= 500' \
-e 'track_avg("latency", e.duration)' \
-m
Pattern 4: Extract and Transform¶
kelora -f 'cols:timestamp level *data' app.log \
--ts-field timestamp \
-e 'e.values = e.data.split(",")' \
-e 'e.count = e.values.len()' \
-k timestamp,level,count
Tips and Best Practices¶
Use Whitespace Delimiters When Possible¶
Simpler and more flexible:
# Good - whitespace delimited
kelora -f 'cols:level service message' app.log
# Skip columns rather than relying on alignment
kelora -f 'cols:level - *message' app.log
Use Type Annotations Early¶
Convert types during parsing, not in exec:
# Good - parse as int
kelora -f 'cols:status:int bytes:int' --filter 'e.status >= 400'
# Less efficient - convert in exec
kelora -f 'cols:status bytes' -e 'e.status = e.status.to_int()' --filter 'e.status >= 400'
Name Fields Descriptively¶
# Good
kelora -f 'cols:timestamp level service *message'
# Less clear
kelora -f 'cols:col1 col2 col3 col4'
Use Greedy Capture for Messages¶
Always use * for the last field if it's a free-form message:
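# Without the star, message captures a single column (values illustrative)
kelora -f 'cols:level message' app.log    # message='Connection'
# With the star, message captures the rest of the line
kelora -f 'cols:level *message' app.log   # message='Connection failed to database'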
Combine with Prefix Extraction¶
For Docker Compose-style logs:
docker compose logs | \
kelora --extract-prefix container \
-f 'cols:timestamp level *message' \
--ts-field timestamp
More Recipes to Practice¶
CSV with type annotations:
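A command along these lines produces the output below (the file name and exact spec are inferred from the output):

kelora -f 'cols:timestamp method path status:int bytes:int duration_ms:int' \
  --cols-sep ',' access.csv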
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status=200 bytes=1234 duration_ms=45
timestamp='2024-01-15T10:00:05Z' method='GET' path='/api/users' status=200 bytes=5678 duration_ms=120
timestamp='2024-01-15T10:00:10Z' method='POST' path='/api/login' status=200 bytes=234 duration_ms=67
timestamp='2024-01-15T10:00:15Z' method='GET' path='/api/users/123' status=200 bytes=890 duration_ms=34
timestamp='2024-01-15T10:00:20Z' method='PUT' path='/api/users/123' status=200 bytes=456 duration_ms=89
timestamp='2024-01-15T10:00:25Z' method='GET' path='/static/logo.png' status=200 bytes=12345 duration_ms=12
timestamp='2024-01-15T10:00:30Z' method='GET' path='/api/posts' status=200 bytes=9876 duration_ms=156
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes=0 duration_ms=5
timestamp='2024-01-15T10:00:40Z' method='POST' path='/api/posts' status=201 bytes=567 duration_ms=234
timestamp='2024-01-15T10:00:45Z' method='GET' path='/api/comments' status=200 bytes=4321 duration_ms=98
timestamp='2024-01-15T10:00:50Z' method='DELETE' path='/api/posts/456' status=204 bytes=0 duration_ms=45
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes=123 duration_ms=12
timestamp='2024-01-15T10:01:00Z' method='GET' path='/api/search' status=200 bytes=7890 duration_ms=345
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes=0 duration_ms=5
timestamp='2024-01-15T10:01:10Z' method='GET' path='/api/stats' status=200 bytes=2345 duration_ms=234
timestamp='2024-01-15T10:01:15Z' method='GET' path='/api/health' status=200 bytes=89 duration_ms=8
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes=567 duration_ms=1234
timestamp='2024-01-15T10:01:25Z' method='GET' path='/api/retry' status=200 bytes=890 duration_ms=89
timestamp='2024-01-15T10:01:30Z' method='PATCH' path='/api/users/789' status=200 bytes=345 duration_ms=123
timestamp='2024-01-15T10:01:35Z' method='GET' path='/robots.txt' status=200 bytes=234 duration_ms=4
timestamp='2024-01-15T10:01:40Z' method='GET' path='/api/feed' status=200 bytes=15678 duration_ms=456
timestamp='2024-01-15T10:01:45Z' method='POST' path='/api/logout' status=200 bytes=123 duration_ms=34
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes=567 duration_ms=7
timestamp='2024-01-15T10:01:55Z' method='GET' path='/api/metrics' status=200 bytes=3456 duration_ms=67
timestamp='2024-01-15T10:02:00Z' method='GET' path='/api/version' status=200 bytes=89 duration_ms=12
TSV with type annotations:
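Similarly inferred from the output (note the tab separator passed as $'\t'; the file name is illustrative):

kelora -f 'cols:timestamp user_id:int username action success:bool' \
  --cols-sep $'\t' audit.tsv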
timestamp='2024-01-15T10:00:00Z' user_id=101 username='alice' action='login' success=true
timestamp='2024-01-15T10:00:10Z' user_id=102 username='bob' action='login' success=true
timestamp='2024-01-15T10:00:20Z' user_id=103 username='charlie' action='login' success=false
timestamp='2024-01-15T10:01:00Z' user_id=101 username='alice' action='view_profile' success=true
timestamp='2024-01-15T10:01:30Z' user_id=104 username='diana' action='login' success=true
timestamp='2024-01-15T10:02:00Z' user_id=102 username='bob' action='edit_post' success=true
timestamp='2024-01-15T10:02:30Z' user_id=101 username='alice' action='upload_file' success=false
timestamp='2024-01-15T10:03:00Z' user_id=105 username='eve' action='login' success=true
timestamp='2024-01-15T10:03:30Z' user_id=103 username='charlie' action='reset_password' success=true
timestamp='2024-01-15T10:04:00Z' user_id=104 username='diana' action='delete_account' success=true
timestamp='2024-01-15T10:05:00Z' user_id=102 username='bob' action='logout' success=true
timestamp='2024-01-15T10:05:30Z' user_id=105 username='eve' action='view_dashboard' success=true
timestamp='2024-01-15T10:06:00Z' user_id=101 username='alice' action='logout' success=true
timestamp='2024-01-15T10:07:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:07:30Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:08:00Z' user_id=106 username='frank' action='login' success=false
timestamp='2024-01-15T10:09:00Z' user_id=105 username='eve' action='change_settings' success=true
timestamp='2024-01-15T10:10:00Z' user_id=103 username='charlie' action='login' success=true
timestamp='2024-01-15T10:11:00Z' user_id=107 username='grace' action='login' success=true
timestamp='2024-01-15T10:12:00Z' user_id=103 username='charlie' action='logout' success=true
Columns format with multi-token timestamp:
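Inferred from the output below: ts(2) joins the separate date and time columns, and *msg captures the remainder:

kelora -f 'cols:ts(2) level *msg' app.log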
ts='2024-01-15 10:00:00' level='INFO' msg='server Application started successfully\n'
ts='2024-01-15 10:00:05' level='DEBUG' msg='config Loading configuration from file\n'
ts='2024-01-15 10:00:10' level='WARN' msg='memory Memory usage at 85% threshold\n'
ts='2024-01-15 10:00:15' level='ERROR' msg='network Connection timeout after 5000ms\n'
ts='2024-01-15 10:00:20' level='INFO' msg='api Received GET request for /users\n'
ts='2024-01-15 10:00:25' level='DEBUG' msg='cache Cache hit for key user:123\n'
ts='2024-01-15 10:00:30' level='INFO' msg='api Response sent with status 200\n'
ts='2024-01-15 10:00:35' level='CRITICAL' msg='disk Disk space below 1GB on /var\n'
ts='2024-01-15 10:00:40' level='ERROR' msg='auth Invalid credentials for admin\n'
ts='2024-01-15 10:00:45' level='WARN' msg='rate Rate limit approaching for IP\n'
Strict mode for catching malformed rows:
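The exact command isn't shown in this tutorial; a sketch assuming strict mode is enabled with a --strict flag (check kelora --help) on comma-separated rows:

# Assumes --strict rejects rows whose column count doesn't match the spec
kelora -f 'cols:timestamp level message extra' --cols-sep ',' --strict rows.csv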
timestamp='2024-01-15T10:00:00Z' level='INFO' message='Valid 4-column row' extra='data'
timestamp='2024-01-15T10:00:05Z' level='WARN' message='Missing extra column'
timestamp='2024-01-15T10:00:10Z' level='ERROR' message='Extra' extra='column'
timestamp='2024-01-15T10:00:15Z' level='INFO' message='Valid row' extra='more_data'
timestamp='2024-01-15T10:00:20Z' level='DEBUG'
timestamp='2024-01-15T10:00:25Z' level='INFO' message='Another valid' extra='row'
timestamp='2024-01-15T10:00:30Z' level='WARN' message='Quoted "comma' extra=' inside"'
timestamp='2024-01-15T10:00:35Z' level='ERROR' message='Missing' extra=''
timestamp='2024-01-15T10:00:40Z' level='INFO' message='Valid' extra='data'
timestamp='2024-01-15T10:00:45Z' level='CRITICAL' message='Final' extra='row'
Inline type annotations and strict mode are perfect for catching malformed rows during ingestion before they reach downstream systems.
Troubleshooting¶
Fields Not Parsing Correctly¶
Problem: Fields are misaligned or missing.
Solution: Check separators and column counts:
# Debug by outputting all fields
kelora -f 'cols:field1 field2 *field3' app.log -n 3
# Try different separator
kelora -f 'cols:field1 field2 field3' --cols-sep '|' app.log -n 3
Timestamp Not Recognized¶
Problem: Timestamp field not working with --since.
Solution: Tell Kelora which field holds the timestamp and provide the format if needed:
kelora -f 'cols:timestamp level *message' app.log \
--ts-field timestamp \
--ts-format '%Y-%m-%d %H:%M:%S'
Type Conversion Failures¶
Problem: Integer/float conversions failing.
Solution: Use get_path with defaults in resilient mode:
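A minimal sketch, assuming get_path takes a fallback default as its second argument:

# Parse without the :int annotation, then convert with a default (assumed signature)
kelora -f 'cols:status bytes' \
  -e 'e.status = e.get_path("status", "0").to_int()' \
  --filter 'e.status >= 400'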
Greedy Field Capturing Too Much¶
Problem: Last field includes separators.
Solution: Be specific with earlier fields:
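# Too loose: *rest swallows the level and service columns too
kelora -f 'cols:timestamp *rest' app.log
# Better: name the leading columns so the greedy capture starts after them
kelora -f 'cols:timestamp level service *message' app.log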
Next Steps¶
Now that you understand custom format parsing, explore:
- Working with Time - Advanced timestamp handling
- Format Reference - Complete format documentation
- Scripting Transforms - Advanced event transformation