Process CSV Data

Parse and analyze CSV/TSV files with type annotations, header handling, and field transformations.

Problem

You have CSV or TSV files and need to parse them with proper type handling, filter rows, aggregate values, or transform the data for further analysis.

Solutions

Basic CSV Parsing

Parse CSV files with automatic header detection:

# Parse CSV with headers
kelora -f csv data.csv -n 5

# Parse TSV (tab-separated)
kelora -f tsv data.tsv -n 5

CSV with Type Annotations

Specify field types for proper numeric handling:

# Type annotations can be in the CSV header itself
kelora -f csv examples/csv_typed.csv -n 3
timestamp='2024-01-15T10:00:00Z' method='GET' path='/' status:int='200' bytes:int='1234' duration:int='45'
timestamp='2024-01-15T10:00:05Z' method='POST' path='/api/login' status:int='200' bytes:int='456' duration:int='78'
timestamp='2024-01-15T10:00:10Z' method='GET' path='/api/users' status:int='200' bytes:int='9876' duration:int='123'

Type annotations enable:

  • Numeric comparisons and arithmetic
  • Sorting by numeric values
  • Proper aggregations (sum, average, etc.)
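
The reason annotations matter is that an untyped field is just a string, and string comparison is lexicographic rather than numeric. A rough Python sketch of the distinction (illustration only, not kelora's implementation):

```python
# Illustration only: why typed fields matter for comparison and math.
row = {"status": "200", "bytes": "1234"}   # untyped: every field is a string

# String comparison is lexicographic, not numeric:
assert "1234" < "45"          # '1' sorts before '4'

# An :int annotation converts the field once at parse time:
typed = {"status": int(row["status"]), "bytes": int(row["bytes"])}
assert typed["bytes"] > 45            # numeric comparison now behaves as expected
assert typed["status"] + 1 == 201     # arithmetic works on the converted value
```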

CSV without Headers

Process CSV files without header rows:

# No headers - use csvnh (CSV No Header)
kelora -f csvnh data.csv

# Access by index: _1, _2, _3, etc.
kelora -f csvnh data.csv -e 'e.timestamp = e._1; e.status = e._2.to_int()'

# TSV without headers
kelora -f tsvnh data.tsv
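
Conceptually, headerless parsing just numbers the columns from left to right. A minimal Python sketch of that naming scheme (the `_1`, `_2`, ... field names are kelora's convention; the helper function below is hypothetical):

```python
import csv
import io

def parse_headerless(line: str) -> dict:
    """Name columns positionally, mirroring kelora's _1, _2, ... fields."""
    fields = next(csv.reader(io.StringIO(line)))
    return {f"_{i}": value for i, value in enumerate(fields, start=1)}

event = parse_headerless("2024-01-15T10:00:00Z,200,GET")
assert event == {"_1": "2024-01-15T10:00:00Z", "_2": "200", "_3": "GET"}
```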

Filter CSV Rows

Filter based on field values:

# Filter by status code using CLI type annotations
kelora -f 'csv status:int' examples/simple_csv.csv \
  --filter 'e.status >= 400'
timestamp='2024-01-15T10:00:35Z' method='GET' path='/favicon.ico' status=404 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:00:55Z' method='GET' path='/admin' status=401 bytes='123' duration_ms='12'
timestamp='2024-01-15T10:01:05Z' method='POST' path='/api/upload' status=413 bytes='0' duration_ms='5'
timestamp='2024-01-15T10:01:20Z' method='GET' path='/api/users' status=500 bytes='567' duration_ms='1234'
timestamp='2024-01-15T10:01:50Z' method='GET' path='/nonexistent' status=404 bytes='567' duration_ms='7'

Aggregate CSV Data

Calculate statistics across rows:

# Count by status code
kelora -f 'csv status:int' data.csv \
  -e 'track_count(e.status)' \
  --metrics

# Sum bytes transferred
kelora -f 'csv bytes:int' data.csv \
  -e 'track_sum("total_bytes", e.bytes)' \
  --metrics

# Track unique values
kelora -f csv data.csv \
  -e 'track_unique("methods", e.method)' \
  --metrics
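
Semantically, these trackers behave like a counter, a running sum, and a set, each keyed by name. A Python approximation of what the three calls accumulate (the plain variables below stand in for kelora's metrics store, which is an assumption for illustration):

```python
from collections import Counter

rows = [
    {"status": 200, "bytes": 512, "method": "GET"},
    {"status": 404, "bytes": 0,   "method": "GET"},
    {"status": 200, "bytes": 256, "method": "POST"},
]

counts = Counter(r["status"] for r in rows)   # ~ track_count(e.status)
total = sum(r["bytes"] for r in rows)         # ~ track_sum("total_bytes", e.bytes)
methods = {r["method"] for r in rows}         # ~ track_unique("methods", e.method)

assert counts[200] == 2 and counts[404] == 1
assert total == 768
assert methods == {"GET", "POST"}
```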

Transform CSV Fields

Create new fields or modify existing ones:

# Calculate derived fields
kelora -f 'csv bytes:int duration:int' data.csv \
  -e 'e.throughput = e.bytes / e.duration'

# Normalize timestamps
kelora -f csv data.csv \
  -e 'e.hour = to_datetime(e.timestamp).hour()'

# Extract path components
kelora -f csv data.csv \
  -e 'e.endpoint = e.path.split("/")[1]' \
  -e 'track_count(e.endpoint)' \
  --metrics
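
The endpoint extraction works because splitting a leading-slash path yields an empty first element, so index 1 is always the first path segment. In Python terms:

```python
# Splitting "/api/users" on "/" gives ["", "api", "users"]; index 1 is the
# first segment, which is what e.path.split("/")[1] picks out in the recipe.
paths = ["/api/users", "/api/login", "/favicon.ico"]
endpoints = [p.split("/")[1] for p in paths]
assert endpoints == ["api", "api", "favicon.ico"]
```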

Convert CSV to Other Formats

# CSV to JSON
kelora -f csv data.csv -J > output.json

# CSV to logfmt
kelora -f csv data.csv -F logfmt > output.log

# CSV to JSON with selected fields
kelora -f csv data.csv -k timestamp,status,bytes -F json
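
To check the shape of the converted output: the conversion is conceptually one JSON object per CSV row, keyed by header. A stdlib-only Python equivalent (a sketch of the mapping, not kelora's encoder):

```python
import csv
import io
import json

raw = "timestamp,status\n2024-01-15T10:00:00Z,200\n"
rows = list(csv.DictReader(io.StringIO(raw)))   # header row names the fields
lines = [json.dumps(r) for r in rows]           # one JSON object per CSV row
assert json.loads(lines[0]) == {"timestamp": "2024-01-15T10:00:00Z", "status": "200"}
```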

Handle Ragged/Malformed CSV

Process CSV files with missing or extra fields:

# Resilient mode (default) - skip bad rows
kelora -f csv data.csv --stats

# See parsing errors
kelora -f csv data.csv --verbose

# Strict mode - abort on first error
kelora -f csv data.csv --strict

Select Specific Columns

Output only desired fields:

# Select specific fields
kelora -f csv data.csv -k timestamp,method,status

# Exclude sensitive fields
kelora -f csv data.csv --exclude-keys email,ip_address

# Reorder fields in output
kelora -f csv data.csv -k status,method,path -F csv

Real-World Examples

Find Slow API Calls

kelora -f 'csv path method status:int duration:int' api_log.csv \
  --filter 'e.duration > 1000' \
  -e 'track_count(e.path)' \
  -k path,method,duration --metrics

Calculate Response Time Percentiles

kelora -f 'csv duration:int' data.csv \
  --window 10000 --end '
    let times = window_numbers("duration");
    print("p50: " + times.percentile(50));
    print("p95: " + times.percentile(95));
    print("p99: " + times.percentile(99))
  '
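
The percentile calls above reduce to sorting the windowed values and indexing into them. A nearest-rank sketch in Python (kelora's exact interpolation rule may differ):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

durations = [12, 45, 78, 123, 1234, 5, 7, 90, 60, 30]
assert percentile(durations, 50) == 45
assert percentile(durations, 95) == 1234
```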

Export Error Rows

kelora -f 'csv status:int' data.csv \
  --filter 'e.status >= 500' \
  -J > errors.json

Group by Time Windows

kelora -f csv data.csv \
  -e 'e.hour = to_datetime(e.timestamp).format("%Y-%m-%d %H:00")' \
  -e 'track_count(e.hour)' \
  --metrics

Clean and Normalize Data

kelora -f csv raw_data.csv \
  -e 'e.email = e.email.to_lower().strip()' \
  -e 'e.status = to_int_or(e.status, 0)' \
  -e 'e.timestamp = to_datetime(e.timestamp).to_iso()' \
  -F csv > cleaned.csv

Tips

Type Handling:

  • Use :int for status codes, counts, IDs
  • Use :float for durations, measurements, rates
  • Use :bool for flags (accepts: true/false, 1/0, yes/no)
  • Untyped fields remain as strings

Performance:

  • Add --parallel for large CSV files
  • Use --batch-size to control memory usage
  • Filter early to reduce processing overhead

Error Handling:

  • Use --verbose to see parsing errors
  • Use --stats to see skip/error counts
  • Ragged CSV (missing fields) creates partial events

Field Access:

  • Headers become field names (spaces → underscores)
  • Without headers, use _1, _2, _3 etc.
  • Use to_int_or() / to_float_or() for safe type conversion
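
The safe-conversion pattern returns the parsed value or a fallback instead of raising on bad input. A hypothetical Python equivalent of `to_int_or()`, mirroring the behavior described here:

```python
def to_int_or(value, default):
    """Parse value as int, falling back to default on malformed input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

assert to_int_or("200", 0) == 200
assert to_int_or("N/A", 0) == 0
assert to_int_or(None, -1) == -1
```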

See Also