Prepare CSV Exports for Analytics

Clean up CSV or TSV logs, enforce types, and produce analysis-ready datasets for spreadsheets, SQL engines, or BI tools.

Use This Guide When

  • Your application or appliance emits CSV/TSV logs that need quick filtering or enrichment.
  • You need to normalise raw exports before loading them into DuckDB, Pandas, or spreadsheets.
  • You want to ship lightweight metrics without building a full ETL job.

Before You Start

  • Examples use examples/simple_csv.csv. Replace paths with your real data.
  • Confirm whether the file has a header row. Use -f csvnh/tsvnh for header-less data.
  • Decide which columns require numeric or boolean types; type annotations enable filters and arithmetic.

Step 1: Inspect the Source File

Glance at a handful of rows to understand column names and potential anomalies.

kelora -f csv examples/simple_csv.csv -n 5

  • Look for blank fields, mixed types, or embedded quotes.
  • Use --stats during early runs to catch parsing errors.
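
For example, the same quick look with parse statistics included (this just combines flags already used in this guide):

kelora -f csv examples/simple_csv.csv -n 5 --stats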

Step 2: Enforce Column Types

Typed columns allow numerical comparisons and aggregations.

kelora -f 'csv status:int bytes:int duration_ms:int' examples/simple_csv.csv \
  --take 3

Alternatives:

  • Embed annotations in the header row (e.g., status:int,duration_ms:int), as sketched below.
  • For TSV, replace csv with tsv in the format string.
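
For instance, if the header row already reads timestamp,method,path,status:int,bytes:int,duration_ms:int, the format string can stay plain (a sketch; annotated.csv and its columns are assumed):

kelora -f csv annotated.csv --take 3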

Step 3: Clean and Transform Rows

Normalise values, derive new columns, or drop unnecessary data.

kelora -f 'csv status:int bytes:int duration_ms:int' examples/simple_csv.csv \
  -e 'e.endpoint = e.path.split("/")[1]' \
  -e 'e.duration_s = e.duration_ms / 1000.0' \
  -e 'e.success = e.status < 400' \
  -k timestamp,method,endpoint,status,duration_s,success

Tips:

  • Use to_int_or() / to_float_or() for defensive conversions (sketched below).
  • Remove sensitive data with assignments like e.user_email = () before export.
  • Apply strip() when whitespace is inconsistent.
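
A minimal sketch of the first and third tips on the example file, assuming to_int_or() accepts a fallback value:

kelora -f csv examples/simple_csv.csv \
  -e 'e.bytes = e.bytes.to_int_or(0)' \
  -e 'e.method = e.method.strip()' \
  -k timestamp,method,status,bytes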

Step 4: Summarise for Sanity Checks

Validate assumptions and produce quick metrics prior to export.

kelora -f 'csv status:int duration_ms:int' examples/simple_csv.csv \
  -e 'track_count("total")' \
  -e 'if e.status >= 500 { track_count("errors") }' \
  -e 'track_sum("latency_total_ms", e.duration_ms)' \
  -e 'track_count("latency_samples")' \
  --end '
    if metrics.contains("latency_total_ms") && metrics["latency_samples"] != 0 {
      let avg = metrics["latency_total_ms"] / metrics["latency_samples"];
      print("avg_latency_ms=" + avg.to_string());
    }
  ' \
  --metrics

  • Compare counts with what you expect from upstream systems.
  • The --end block above divides latency_total_ms by latency_samples for the average; per-service stats can use additional key prefixes (see the sketch below).
  • Use track_bucket() to build histograms or track_unique() to measure cardinality.
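
One sketch of that prefix pattern, keyed on the method column because the example file has no obvious service column; it assumes the track_* helpers accept any string expression as the key:

kelora -f 'csv duration_ms:int' examples/simple_csv.csv \
  -e 'track_sum(e.method + ".latency_total_ms", e.duration_ms)' \
  -e 'track_count(e.method + ".latency_samples")' \
  --metrics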

Step 5: Export the Prepared Dataset

Write the cleaned result to CSV, JSON, or logfmt depending on downstream needs.

kelora -f 'csv status:int duration_ms:int bytes:int' examples/simple_csv.csv \
  --filter 'e.status < 500' \
  -k timestamp,method,path,status,duration_ms,bytes \
  -F csv > cleaned.csv

Other formats:

  • -J or -F json for ingestion into document stores (see the sketch below).
  • -F logfmt when shipping data to systems that expect key=value lines.
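
For instance, the same export switched to JSON; cleaned.json is just an assumed output name:

kelora -f 'csv status:int duration_ms:int bytes:int' examples/simple_csv.csv \
  --filter 'e.status < 500' \
  -k timestamp,method,path,status,duration_ms,bytes \
  -F json > cleaned.json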

Variations

  • Files without headers

    kelora -f csvnh data.csv \
      -e 'e.timestamp = e._1; e.status = e._2.to_int(); e.bytes = e._3.to_int()' \
      -k timestamp,status,bytes
    

  • Large archives

    kelora -f 'csv status:int bytes:int' logs/2024-*.csv.gz \
      --parallel \
      -e 'track_sum("bytes_total", e.bytes)' \
      --metrics
    

  • Ragged data validation

    kelora -f csv raw.csv \
      --stats \
      --verbose \
      --filter 'meta.parse_errors > 0'
    
    Switch to --strict to abort on the first malformed row once cleanup is complete.
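
    A sketch of that stricter pass, reusing only flags shown earlier; validated.csv is an assumed output name:

    kelora -f csv raw.csv --strict -F csv > validated.csv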

Quality Checklist

  • Compare row counts between raw and cleaned files; mismatches often indicate parse errors or filters that were too aggressive (see the sketch after this list).
  • If you drop columns, annotate the export with a README describing what was removed.
  • Capture --stats output in automation logs so data consumers see parsing success rates.
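
A quick way to compare counts with plain shell; header rows and any --filter you applied can both contribute to the difference:

wc -l examples/simple_csv.csv cleaned.csv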

See Also