Process CSV Data¶
Parse and analyze CSV/TSV files with type annotations, header handling, and field transformations.
Problem¶
You have CSV or TSV files and need to parse them with proper type handling, filter rows, aggregate values, or transform the data for further analysis.
Solutions¶
Basic CSV Parsing¶
Parse CSV files with automatic header detection:
# Parse CSV with headers
kelora -f csv data.csv -n 5
# Parse TSV (tab-separated)
kelora -f tsv data.tsv -n 5
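For intuition, here is a rough standard-library Python equivalent of what header-based parsing produces (an illustration, not kelora's implementation; the sample data is hypothetical):

```python
import csv
import io

# Sample CSV with a header row (hypothetical data matching the examples below)
data = "timestamp,method,status\n2024-01-15T10:00:00Z,GET,200\n"

# csv.DictReader treats the first row as headers, like kelora -f csv
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["method"])  # each row becomes a field -> value mapping

# TSV is the same idea with a tab delimiter, like kelora -f tsv
tsv = "timestamp\tmethod\tstatus\n2024-01-15T10:00:00Z\tGET\t200\n"
tsv_rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(tsv_rows[0]["status"])
```

Note that without type annotations every value, including `status`, stays a string.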
CSV with Type Annotations¶
Specify field types for proper numeric handling:
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status:int='200' bytes:int='1234' duration:int='45'
timestamp='2024-01-15T10:00:05Z' method='POST' path='/api/login' status:int='200' bytes:int='456' duration:int='78'
timestamp='2024-01-15T10:00:10Z' method='GET' path='/api/users' status:int='200' bytes:int='9876' duration:int='123'
Type annotations enable:
- Numeric comparisons and arithmetic
- Sorting by numeric values
- Proper aggregations (sum, average, etc.)
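The difference is easy to demonstrate in plain Python: without conversion, field values compare lexicographically as strings, which breaks sorting and makes arithmetic impossible.

```python
# Raw CSV fields are strings (hypothetical duration values)
durations = ["9", "100", "45"]

# Lexicographic order: "100" sorts before "45" and "9"
print(sorted(durations))

# With int conversion (what a :int annotation does), order and math work
nums = [int(d) for d in durations]
print(sorted(nums))
print(sum(nums))
```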
CSV without Headers¶
Process CSV files without header rows:
# No headers - use csvnh (CSV No Header)
kelora -f csvnh data.csv
# Access by index: _1, _2, _3, etc.
kelora -f csvnh data.csv -e 'e.timestamp = e._1; e.status = e._2.to_int()'
# TSV without headers
kelora -f tsvnh data.tsv
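Conceptually, positional access works like this Python sketch (the `_1`/`_2` mapping is built by hand here; kelora provides it for you):

```python
import csv
import io

# Headerless CSV: two hypothetical rows of timestamp,status
data = "2024-01-15T10:00:00Z,200\n2024-01-15T10:00:05Z,404\n"

for row in csv.reader(io.StringIO(data)):
    # Expose positions as _1, _2, ... like kelora's csvnh format
    event = {f"_{i + 1}": v for i, v in enumerate(row)}
    # Equivalent of: e.timestamp = e._1; e.status = e._2.to_int()
    event["timestamp"] = event["_1"]
    event["status"] = int(event["_2"])
    print(event["status"])
```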
Filter CSV Rows¶
Filter based on field values:
# Filter by status code using CLI type annotations
kelora -f 'csv status:int' examples/simple_csv.csv \
--filter 'e.status >= 400'
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes='123' duration_ms='12'
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes='567' duration_ms='1234'
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes='567' duration_ms='7'
Aggregate CSV Data¶
Calculate statistics across rows:
# Count by status code
kelora -f 'csv status:int' data.csv \
-e 'track_count(e.status)' \
--metrics
# Sum bytes transferred
kelora -f 'csv bytes:int' data.csv \
-e 'track_sum("total_bytes", e.bytes)' \
--metrics
# Track unique values
kelora -f csv data.csv \
-e 'track_unique("methods", e.method)' \
--metrics
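In plain Python terms, the three trackers roughly correspond to a counter, a running sum, and a set. This is an analogy, not kelora's internals:

```python
from collections import Counter

# Hypothetical parsed rows
rows = [
    {"status": 200, "bytes": 1234, "method": "GET"},
    {"status": 404, "bytes": 0, "method": "GET"},
    {"status": 200, "bytes": 456, "method": "POST"},
]

# track_count(e.status): per-value counts
status_counts = Counter(r["status"] for r in rows)
# track_sum("total_bytes", e.bytes): running total under a named key
total_bytes = sum(r["bytes"] for r in rows)
# track_unique("methods", e.method): distinct values seen
methods = {r["method"] for r in rows}

print(status_counts[200], total_bytes, sorted(methods))
```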
Transform CSV Fields¶
Create new fields or modify existing ones:
# Calculate derived fields
kelora -f 'csv bytes:int duration:int' data.csv \
-e 'e.throughput = e.bytes / e.duration'
# Normalize timestamps
kelora -f csv data.csv \
-e 'e.hour = to_datetime(e.timestamp).hour()'
# Extract path components
kelora -f csv data.csv \
-e 'e.endpoint = e.path.split("/")[1]' \
-e 'track_count(e.endpoint)' \
--metrics
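These transformations map to ordinary field rewrites. A plain-Python version of the same three steps (the sample row is hypothetical):

```python
from datetime import datetime

row = {
    "timestamp": "2024-01-15T10:00:00Z",
    "path": "/api/users",
    "bytes": 9876,
    "duration": 123,
}

# Derived field, like e.throughput = e.bytes / e.duration
row["throughput"] = row["bytes"] / row["duration"]

# Hour extraction, like to_datetime(e.timestamp).hour()
row["hour"] = datetime.fromisoformat(
    row["timestamp"].replace("Z", "+00:00")
).hour

# First path component, like e.path.split("/")[1]
row["endpoint"] = row["path"].split("/")[1]

print(row["hour"], row["endpoint"])
```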
Convert CSV to Other Formats¶
# CSV to JSON
kelora -f csv data.csv -J > output.json
# CSV to logfmt
kelora -f csv data.csv -F logfmt > output.log
# CSV to JSON with selected fields
kelora -f csv data.csv -k timestamp,status,bytes -F json
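Conversion is just re-serializing parsed rows. A standard-library sketch of CSV-to-JSON with field selection (not kelora code; the data is hypothetical):

```python
import csv
import io
import json

data = "timestamp,method,status,bytes\n2024-01-15T10:00:00Z,GET,200,1234\n"
keep = ["timestamp", "status", "bytes"]  # like -k timestamp,status,bytes

out = []
for row in csv.DictReader(io.StringIO(data)):
    # Project each row down to the selected fields, in the requested order
    out.append({k: row[k] for k in keep})

print(json.dumps(out[0]))
```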
Handle Ragged/Malformed CSV¶
Process CSV files with missing or extra fields:
# Resilient mode (default) - skip bad rows
kelora -f csv data.csv --stats
# See parsing errors
kelora -f csv data.csv --verbose
# Strict mode - abort on first error
kelora -f csv data.csv --strict
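The resilient-vs-strict distinction can be sketched with `csv.DictReader`, which fills missing fields with `None` (an illustration only; kelora's exact ragged-row semantics may differ):

```python
import csv
import io

# Second data row is missing a field; third has an extra one
data = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"

ok, skipped = [], 0
for row in csv.DictReader(io.StringIO(data)):
    # Resilient mode: short rows get None for missing fields; skip them.
    # A strict mode would raise/abort here instead of continuing.
    if None in row.values():
        skipped += 1
        continue
    ok.append(row)

print(len(ok), skipped)
```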
Select Specific Columns¶
Output only desired fields:
# Select specific fields
kelora -f csv data.csv -k timestamp,method,status
# Exclude sensitive fields
kelora -f csv data.csv --exclude-keys email,ip_address
# Reorder fields in output
kelora -f csv data.csv -k status,method,path -F csv
Real-World Examples¶
Find Slow API Calls¶
kelora -f 'csv path method status:int duration:int' api_log.csv \
--filter 'e.duration > 1000' \
-e 'track_count(e.path)' \
-k path,method,duration --metrics
Calculate Response Time Percentiles¶
kelora -f 'csv duration:int' data.csv \
--window 10000 --end '
let times = window_numbers("duration");
print("p50: " + times.percentile(50));
print("p95: " + times.percentile(95));
print("p99: " + times.percentile(99))
'
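To check the numbers by hand, a nearest-rank percentile takes only a few lines of Python (kelora's `percentile()` may use a different method, such as interpolation):

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value at or above rank p/100 * n."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical duration samples
durations = [45, 78, 123, 5, 12, 1234, 7, 900, 30, 60]
print(percentile(durations, 50), percentile(durations, 95))
```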
Export Error Rows¶
# Write only error rows back out as CSV
kelora -f 'csv status:int' data.csv \
  --filter 'e.status >= 400' \
  -F csv > errors.csv
Group by Time Windows¶
kelora -f csv data.csv \
-e 'e.hour = to_datetime(e.timestamp).format("%Y-%m-%d %H:00")' \
-e 'track_count(e.hour)' \
--metrics
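The bucketing step is a plain truncate-and-count. In Python terms (hypothetical timestamps):

```python
from collections import Counter
from datetime import datetime

timestamps = [
    "2024-01-15T10:00:00Z",
    "2024-01-15T10:59:59Z",
    "2024-01-15T11:05:00Z",
]

# Truncate each timestamp to its hour bucket,
# like format("%Y-%m-%d %H:00"), then count per bucket
buckets = Counter(
    datetime.fromisoformat(t.replace("Z", "+00:00")).strftime("%Y-%m-%d %H:00")
    for t in timestamps
)
print(buckets["2024-01-15 10:00"], buckets["2024-01-15 11:00"])
```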
Clean and Normalize Data¶
kelora -f csv raw_data.csv \
-e 'e.email = e.email.to_lower().strip()' \
-e 'e.status = to_int_or(e.status, 0)' \
-e 'e.timestamp = to_datetime(e.timestamp).to_iso()' \
-F csv > cleaned.csv
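A Python sketch of the same cleanup, including a `to_int_or()`-style safe conversion (the helper here is a stand-in, not kelora's implementation):

```python
def to_int_or(value, default):
    # Safe conversion: fall back to a default on unparseable input
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

# Hypothetical messy row
row = {"email": "  Alice@Example.COM ", "status": "n/a"}
row["email"] = row["email"].lower().strip()   # like to_lower().strip()
row["status"] = to_int_or(row["status"], 0)   # like to_int_or(e.status, 0)
print(row["email"], row["status"])
```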
Tips¶
Type Handling:
- Use :int for status codes, counts, IDs
- Use :float for durations, measurements, rates
- Use :bool for flags (accepts: true/false, 1/0, yes/no)
- Untyped fields remain as strings
Performance:
- Add --parallel for large CSV files
- Use --batch-size to control memory usage
- Filter early to reduce processing overhead
Error Handling:
- Use --verbose to see parsing errors
- Use --stats to see skip/error counts
- Ragged CSV (missing fields) creates partial events
Field Access:
- Headers become field names (spaces → underscores)
- Without headers, use _1, _2, _3, etc.
- Use to_int_or()/to_float_or() for safe type conversion
See Also¶
- Fan Out Nested Structures - Process nested CSV data
- Analyze Web Traffic - Similar patterns for log files
- Function Reference - Type conversion functions
- CLI Reference - Format specifications