Process CSV Data

Parse and analyze CSV/TSV files with type annotations, header handling, and field transformations.

Problem

You have CSV or TSV files and need to parse them with proper type handling, filter rows, aggregate values, or transform the data for further analysis.

Solutions

Basic CSV Parsing

Parse CSV files with automatic header detection:

# Parse CSV with headers
kelora -f csv data.csv -n 5

# Parse TSV (tab-separated)
kelora -f tsv data.tsv -n 5

CSV with Type Annotations

Specify field types for proper numeric handling:

# Type annotations can be in the CSV header itself
kelora -f csv examples/csv_typed.csv -n 3
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status:int='200' bytes:int='1234' duration:int='45'
timestamp='2024-01-15T10:00:05Z' method='POST' path='/api/login' status:int='200' bytes:int='456' duration:int='78'
timestamp='2024-01-15T10:00:10Z' method='GET' path='/api/users' status:int='200' bytes:int='9876' duration:int='123'

Type annotations enable:

  • Numeric comparisons and arithmetic
  • Sorting by numeric values
  • Proper aggregations (sum, average, etc.)
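
The reason annotations matter is that an untyped field is just a string, and string comparison is lexicographic rather than numeric. A rough Python sketch of the distinction (illustration only, not kelora's implementation):

```python
# Illustration only: why typed fields matter for comparison and math.
row = {"status": "200", "bytes": "1234"}   # untyped: every field is a string

# String comparison is lexicographic, not numeric:
assert "1234" < "45"          # '1' sorts before '4'

# An :int annotation converts the field once at parse time:
typed = {"status": int(row["status"]), "bytes": int(row["bytes"])}
assert typed["bytes"] > 45            # numeric comparison now behaves as expected
assert typed["status"] + 1 == 201     # arithmetic works on the converted value
```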

CSV without Headers

Process CSV files without header rows:

# No headers - use csvnh (CSV No Header)
kelora -f csvnh data.csv

# Access by index: _1, _2, _3, etc.
kelora -f csvnh data.csv -e 'e.timestamp = e._1; e.status = e._2.to_int()'

# TSV without headers
kelora -f tsvnh data.tsv
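
Conceptually, headerless parsing just numbers the columns from left to right. A minimal Python sketch of that naming scheme (the `_1`, `_2`, ... field names are kelora's convention; the helper function below is hypothetical):

```python
import csv
import io

def parse_headerless(line: str) -> dict:
    """Name columns positionally, mirroring kelora's _1, _2, ... fields."""
    fields = next(csv.reader(io.StringIO(line)))
    return {f"_{i}": value for i, value in enumerate(fields, start=1)}

event = parse_headerless("2024-01-15T10:00:00Z,200,GET")
assert event == {"_1": "2024-01-15T10:00:00Z", "_2": "200", "_3": "GET"}
```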

Filter CSV Rows

Filter based on field values:

# Filter by status code using CLI type annotations
kelora -f 'csv status:int' examples/simple_csv.csv \
  --filter 'e.status >= 400'
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes='123' duration_ms='12'
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes='567' duration_ms='1234'
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes='567' duration_ms='7'

Aggregate CSV Data

Calculate statistics across rows:

# Count by status code
kelora -f 'csv status:int' data.csv \
  -e 'track_count(e.status)' \
  --metrics

# Sum bytes transferred
kelora -f 'csv bytes:int' data.csv \
  -e 'track_sum("total_bytes", e.bytes)' \
  --metrics

# Track unique values
kelora -f csv data.csv \
  -e 'track_unique("methods", e.method)' \
  --metrics
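
Semantically, these trackers behave like a counter, a running sum, and a set, each keyed by name. A Python approximation of what the three calls accumulate (the plain variables below stand in for kelora's metrics store, which is an assumption for illustration):

```python
from collections import Counter

rows = [
    {"status": 200, "bytes": 512, "method": "GET"},
    {"status": 404, "bytes": 0,   "method": "GET"},
    {"status": 200, "bytes": 256, "method": "POST"},
]

counts = Counter(r["status"] for r in rows)   # ~ track_count(e.status)
total = sum(r["bytes"] for r in rows)         # ~ track_sum("total_bytes", e.bytes)
methods = {r["method"] for r in rows}         # ~ track_unique("methods", e.method)

assert counts[200] == 2 and counts[404] == 1
assert total == 768
assert methods == {"GET", "POST"}
```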

Transform CSV Fields

Create new fields or modify existing ones:

# Calculate derived fields
kelora -f 'csv bytes:int duration:int' data.csv \
  -e 'e.throughput = e.bytes / e.duration'

# Normalize timestamps
kelora -f csv data.csv \
  -e 'e.hour = to_datetime(e.timestamp).hour()'

# Extract path components
kelora -f csv data.csv \
  -e 'e.endpoint = e.path.split("/")[1]' \
  -e 'track_count(e.endpoint)' \
  --metrics
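
The endpoint extraction works because splitting a leading-slash path yields an empty first element, so index 1 is always the first path segment. In Python terms:

```python
# Splitting "/api/users" on "/" gives ["", "api", "users"]; index 1 is the
# first segment, which is what e.path.split("/")[1] picks out in the recipe.
paths = ["/api/users", "/api/login", "/favicon.ico"]
endpoints = [p.split("/")[1] for p in paths]
assert endpoints == ["api", "api", "favicon.ico"]
```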

Convert CSV to Other Formats

# CSV to JSON
kelora -f csv data.csv -J > output.json

# CSV to logfmt
kelora -f csv data.csv -F logfmt > output.log

# CSV to JSON with selected fields
kelora -f csv data.csv -k timestamp,status,bytes -F json
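
To check the shape of the converted output: the conversion is conceptually one JSON object per CSV row, keyed by header. A stdlib-only Python equivalent (a sketch of the mapping, not kelora's encoder):

```python
import csv
import io
import json

raw = "timestamp,status\n2024-01-15T10:00:00Z,200\n"
rows = list(csv.DictReader(io.StringIO(raw)))   # header row names the fields
lines = [json.dumps(r) for r in rows]           # one JSON object per CSV row
assert json.loads(lines[0]) == {"timestamp": "2024-01-15T10:00:00Z", "status": "200"}
```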

Handle Ragged/Malformed CSV

Process CSV files with missing or extra fields:

# Resilient mode (default) - skip bad rows
kelora -f csv data.csv --stats

# See parsing errors
kelora -f csv data.csv --verbose

# Strict mode - abort on first error
kelora -f csv data.csv --strict

Select Specific Columns

Output only desired fields:

# Select specific fields
kelora -f csv data.csv -k timestamp,method,status

# Exclude sensitive fields
kelora -f csv data.csv --exclude-keys email,ip_address

# Reorder fields in output
kelora -f csv data.csv -k status,method,path -F csv

Real-World Examples

Find Slow API Calls

kelora -f 'csv path method status:int duration:int' api_log.csv \
  --filter 'e.duration > 1000' \
  -e 'track_count(e.path)' \
  -k path,method,duration --metrics

Calculate Response Time Percentiles

kelora -f 'csv duration:int' data.csv \
  --window 10000 --end '
    let times = window_numbers("duration");
    print("p50: " + times.percentile(50));
    print("p95: " + times.percentile(95));
    print("p99: " + times.percentile(99))
  '
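
The percentile calls above reduce to sorting the windowed values and indexing into them. A nearest-rank sketch in Python (kelora's exact interpolation rule may differ):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

durations = [12, 45, 78, 123, 1234, 5, 7, 90, 60, 30]
assert percentile(durations, 50) == 45
assert percentile(durations, 95) == 1234
```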

Export Error Rows

kelora -f 'csv status:int' data.csv \
  --filter 'e.status >= 500' \
  -J > errors.json

Group by Time Windows

kelora -f csv data.csv \
  -e 'e.hour = to_datetime(e.timestamp).format("%Y-%m-%d %H:00")' \
  -e 'track_count(e.hour)' \
  --metrics

Clean and Normalize Data

kelora -f csv raw_data.csv \
  -e 'e.email = e.email.to_lower().strip()' \
  -e 'e.status = to_int_or(e.status, 0)' \
  -e 'e.timestamp = to_datetime(e.timestamp).to_iso()' \
  -F csv > cleaned.csv

Tips

Type Handling:

  • Use :int for status codes, counts, IDs
  • Use :float for durations, measurements, rates
  • Use :bool for flags (accepts: true/false, 1/0, yes/no)
  • Untyped fields remain as strings

Performance:

  • Add --parallel for large CSV files
  • Use --batch-size to control memory usage
  • Filter early to reduce processing overhead

Error Handling:

  • Use --verbose to see parsing errors
  • Use --stats to see skip/error counts
  • Ragged CSV (missing fields) creates partial events

Field Access:

  • Headers become field names (spaces → underscores)
  • Without headers, use _1, _2, _3 etc.
  • Use to_int_or() / to_float_or() for safe type conversion
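
The safe-conversion pattern returns the parsed value or a fallback instead of raising on bad input. A hypothetical Python equivalent of `to_int_or()`, mirroring the behavior described here:

```python
def to_int_or(value, default):
    """Parse value as int, falling back to default on malformed input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

assert to_int_or("200", 0) == 200
assert to_int_or("N/A", 0) == 0
assert to_int_or(None, -1) == -1
```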

See Also