Pseudonymize Identifiers for Analytics

Replace user identifiers with consistent, non-reversible tokens so you can analyse behaviour without exposing raw PII.

Why Use This Approach

Analysts need to count or group by user, tenant, or device while respecting privacy.
Logs must be exported outside the production boundary (vendors, BI tools).
You want deterministic transformations that do not require a central mapping database.

Before You Start

Set the KELORA_SECRET environment variable to a strong, random string. Kelora derives pseudonyms from this secret.
Plan domain separation keys ("users", "sessions", "tenants"). Different domains prevent correlating identities across datasets.
Decide how you will rotate the secret. Rotating changes all generated tokens; coordinate with downstream consumers.
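
The three points above can be illustrated with a keyed hash. The sketch below is an assumption for illustration only: it uses HMAC-SHA256 and a hypothetical helper named sketch_pseudonym, and it is not Kelora's actual derivation, which this guide does not specify. It shows a token that is stable for the same secret and domain, and that changes when either the domain or the secret changes.

```python
import hashlib
import hmac


def sketch_pseudonym(secret: bytes, domain: str, value: str) -> str:
    """Illustrative keyed token (hypothetical helper, NOT Kelora's algorithm)."""
    # A NUL byte separates domain and value so ("ab", "c") and ("a", "bc") differ.
    message = domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    # Truncated to 24 hex characters purely to keep the example readable.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:24]


secret = b"example-secret-for-illustration-only"
same_a = sketch_pseudonym(secret, "users", "alice@example.com")
same_b = sketch_pseudonym(secret, "users", "alice@example.com")
other_domain = sketch_pseudonym(secret, "sessions", "alice@example.com")
rotated = sketch_pseudonym(b"a-different-secret", "users", "alice@example.com")

# Same secret + domain + input: stable. New domain or new secret: unrelated token.
print(same_a == same_b, same_a == other_domain, same_a == rotated)
# prints: True False False
```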

Step 1: Configure the Secret

Export the secret in the shell or set it within your scheduler/service definition.

export KELORA_SECRET="$(openssl rand -hex 32)"

Keep the value in a secret manager or environment configuration, not in source control.
All Kelora invocations must share the same secret to maintain consistency.
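
If the pipeline is launched from a Python wrapper rather than a shell, a comparable secret can be generated with the standard library, and a missing value can be caught before any data is processed. This is a minimal sketch; generate_secret and require_secret are hypothetical helpers, and the minimum length of 32 characters is an assumed policy, not a Kelora requirement.

```python
import secrets
from typing import Mapping


def generate_secret() -> str:
    # 32 random bytes rendered as 64 hex characters, like `openssl rand -hex 32`.
    return secrets.token_hex(32)


def require_secret(env: Mapping[str, str], name: str = "KELORA_SECRET",
                   min_length: int = 32) -> str:
    # Hypothetical pre-flight check: fail fast rather than run without a usable secret.
    value = env.get(name, "")
    if len(value) < min_length:
        raise RuntimeError(f"{name} is missing or shorter than {min_length} characters")
    return value


# Demo environment; a real wrapper would pass os.environ instead.
demo_env = {"KELORA_SECRET": generate_secret()}
print(len(require_secret(demo_env)))  # prints: 64
```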

Step 2: Generate Pseudonyms for Primary Identifiers

Use the pseudonym() function to create consistent tokens per domain.

kelora -j examples/security_audit.jsonl \
  -e 'e.user_id = pseudonym(e.user_email, "users")' \
  -e 'e.session_id = pseudonym(e.session_token, "sessions")' \
  -e 'e.user_email = (); e.session_token = ()' \
  -F json

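Passing a different domain per identifier type ("users", "sessions") means the same input yields unrelated tokens in each context, so tokens cannot be matched across them.
Drop the original fields in the same invocation, as the final -e stage above does by assigning ().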
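
For unit tests or prototyping outside Kelora, the same replace-then-drop pattern can be sketched on a plain dictionary. The token function is an HMAC-based stand-in (an assumption, not Kelora's pseudonym()), and pseudonymize_event and FIELD_RULES are hypothetical names.

```python
import hashlib
import hmac


def sketch_token(secret: bytes, domain: str, value: str) -> str:
    # Stand-in for pseudonym(); HMAC-SHA256 is an assumption, not Kelora's algorithm.
    message = domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:24]


# (raw field, output field, domain) triples mirroring the command above.
FIELD_RULES = (
    ("user_email", "user_id", "users"),
    ("session_token", "session_id", "sessions"),
)


def pseudonymize_event(event: dict, secret: bytes) -> dict:
    out = dict(event)  # work on a copy so the caller's event is untouched
    for raw_field, token_field, domain in FIELD_RULES:
        raw = out.pop(raw_field, None)  # the raw value never reaches the output
        if raw is not None:
            out[token_field] = sketch_token(secret, domain, str(raw))
    return out


event = {"user_email": "alice@example.com", "session_token": "s-123", "action": "login"}
print(sorted(pseudonymize_event(event, b"demo-secret")))
# prints: ['action', 'session_id', 'user_id']
```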

Step 3: Preserve Useful Context

Extract secondary attributes (domains, regions) before removing raw data so analysts retain dimensional information.

kelora -j examples/security_audit.jsonl \
  -e 'let email = e.user_email.parse_email();
        e.user_domain = email.domain;
        e.user_id = pseudonym(e.user_email, "users");
        e.user_email = ()' \
  -F json

Combine with masking (mask_ip) when you also need approximate locations or network groupings.
Track unique pseudonyms (track_unique) to estimate the active user base without storing cleartext identifiers.
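
To make "keep the dimension, drop the identifier" concrete, here is a small standard-library sketch. Both helpers are hypothetical stand-ins: email_domain only approximates what parse_email() exposes as domain, and mask_ipv4 assumes a /24 mask, which may differ from what mask_ip does by default.

```python
import ipaddress


def email_domain(address: str) -> str:
    # Text after the last "@", lowercased; empty string when there is no "@".
    _, separator, domain = address.rpartition("@")
    return domain.lower() if separator else ""


def mask_ipv4(address: str, keep_bits: int = 24) -> str:
    # Zero the host bits so only the network part survives (assumed /24 default).
    network = ipaddress.ip_network(f"{address}/{keep_bits}", strict=False)
    return str(network.network_address)


print(email_domain("Alice@Example.COM"))  # prints: example.com
print(mask_ipv4("203.0.113.77"))          # prints: 203.0.113.0
```

The domain and the masked address still support grouping by organisation or network, while the mailbox name and host address are gone.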

Step 4: Compare Against Hashing

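When you only need grouping and anonymity guarantees do not matter, a plain hash may suffice. Neither approach is reversible; the difference is that a hash needs no secret, so anyone can recompute it. Run both side by side and compare the unique counts so downstream consumers can see whether collisions occur.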

kelora -j app.log \
  -e 'e.user_hash = e.user_id.hash("xxh3")' \
  -e 'e.user_anon = pseudonym(e.user_id, "users")' \
  -e 'track_unique("hashes", e.user_hash)' \
  -e 'track_unique("pseudonyms", e.user_anon)' \
  --metrics

hash("xxh3") is fast, but it is neither keyed nor cryptographic; use pseudonym() whenever anonymity guarantees matter.
Cryptographic hashes (sha256, blake3) make collisions negligible, but anyone who can guess an input can hash it and confirm a match. Pseudonyms resist this because computing them requires the secret.
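
The guessing attack described above can be demonstrated in a few lines. This sketch uses SHA-256 for the unkeyed hash and HMAC-SHA256 as an assumed stand-in for a secret-keyed pseudonym (not Kelora's algorithm); all names are hypothetical.

```python
import hashlib
import hmac


def plain_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_token(secret: bytes, value: str) -> str:
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def confirm_guesses(published: set, candidates: list, transform) -> list:
    # The attacker recomputes the transform for each guess and checks for a match.
    return [guess for guess in candidates if transform(guess) in published]


secret = b"known-only-to-the-pipeline"
users = ["alice@example.com", "bob@example.com"]
candidates = ["alice@example.com", "mallory@example.com"]

hashed_export = {plain_hash(u) for u in users}
keyed_export = {keyed_token(secret, u) for u in users}


def attacker_guess(value: str) -> str:
    # Without the real secret, the attacker can only try a key of their own.
    return keyed_token(b"attacker-guess", value)


print(confirm_guesses(hashed_export, candidates, plain_hash))    # ['alice@example.com']
print(confirm_guesses(keyed_export, candidates, attacker_guess))  # []
```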

Step 5: Validate Consistency

Confirm that the same inputs produce identical pseudonyms across runs and that expected fields are gone.

kelora -j pseudonymized.json \
  -qq \
  --filter '!e.contains("user_email") && !e.contains("session_token")' \
  --stats

kelora -j examples/security_audit.jsonl \
  -e 'track_unique("tokens", pseudonym(e.user_email, "users"))' \
  --metrics

Use small fixture files in tests to guard against accidental changes to the pseudonym rules.
When rotating secrets, run both the old and new pipeline on a sample and confirm that every token changes while the number of distinct tokens stays the same.
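
A fixture-based check might look like the sketch below. It assumes the pipeline output is one JSON object per line and that the token field is named user_id as in Step 2; leaked_fields and tokens_match are hypothetical helper names.

```python
import json

FORBIDDEN = ("user_email", "session_token")


def leaked_fields(jsonl_text: str) -> list:
    # Return (line_number, field) for every raw field that survived in the output.
    leaks = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        event = json.loads(line)
        leaks.extend((number, field) for field in FORBIDDEN if field in event)
    return leaks


def tokens_match(run_a: str, run_b: str, field: str = "user_id") -> bool:
    # Two runs over the same input should emit identical token sequences.
    def extract(text: str) -> list:
        return [json.loads(line).get(field) for line in text.splitlines() if line.strip()]
    return extract(run_a) == extract(run_b)


clean = '{"user_id": "t1", "action": "login"}\n{"user_id": "t2"}\n'
leaky = '{"user_id": "t1"}\n{"user_email": "a@example.com", "user_id": "t2"}\n'

print(leaked_fields(clean))        # prints: []
print(leaked_fields(leaky))        # prints: [(2, 'user_email')]
print(tokens_match(clean, clean))  # prints: True
```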

Variations

Tenant isolation

kelora -j multi_tenant.log \
  -e 'let tenant = e.tenant_name;
        e.tenant_id = pseudonym(tenant, "tenants");
        e.user_id = pseudonym(e.user_email, tenant);
        e.tenant_name = (); e.user_email = ()' \
  -F json

Because the tenant name serves as the domain for user tokens, a user's token stays consistent inside a tenant but cannot be correlated across tenants.
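
The effect of using the tenant as the domain can be sketched as follows. The token function is an HMAC-based stand-in (an assumption, not Kelora's pseudonym()), and the tenant names are invented for the example.

```python
import hashlib
import hmac


def sketch_token(secret: bytes, domain: str, value: str) -> str:
    # Stand-in for pseudonym(); HMAC-SHA256 is an assumption, not Kelora's algorithm.
    message = domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:24]


secret = b"demo-secret"
acme = sketch_token(secret, "acme", "alice@example.com")
acme_again = sketch_token(secret, "acme", "alice@example.com")
globex = sketch_token(secret, "globex", "alice@example.com")

# Stable within a tenant, unrelated across tenants.
print(acme == acme_again, acme == globex)  # prints: True False
```

One design point to weigh: in this sketch a tenant literally named "users" would share a domain with the fixed "users" domain from Step 2. If tenant names are not under your control, prefixing the domain (for example "tenant:" plus the name) avoids that overlap.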

Selective reversibility

Keep a secure mapping of (original_id, pseudonym) when legal/compliance teams need an escalation path.
Store the mapping in an encrypted datastore; never in the output log files themselves.
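
A minimal sketch of keeping the mapping out of the exported events follows. The token function is an HMAC-based stand-in (an assumption, not Kelora's pseudonym()), split_for_escalation is a hypothetical helper, and the in-memory dictionary only stands in for the encrypted datastore, whose encryption and access control are out of scope here.

```python
import hashlib
import hmac


def sketch_token(secret: bytes, domain: str, value: str) -> str:
    # Stand-in for pseudonym(); HMAC-SHA256 is an assumption, not Kelora's algorithm.
    message = domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:24]


def split_for_escalation(event: dict, secret: bytes):
    # Return the sanitized event plus a (token, original) pair for the restricted store.
    raw = event["user_email"]
    token = sketch_token(secret, "users", raw)
    sanitized = {key: value for key, value in event.items() if key != "user_email"}
    sanitized["user_id"] = token
    return sanitized, (token, raw)


escalation_map = {}  # placeholder for an encrypted, access-controlled datastore
event = {"user_email": "alice@example.com", "action": "login"}
sanitized, (token, original) = split_for_escalation(event, b"demo-secret")
escalation_map[token] = original

print("user_email" in sanitized)               # prints: False
print(escalation_map[sanitized["user_id"]])    # prints: alice@example.com
```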

Secret rotation checklist

Deploy the new secret value.
Run both old and new pipelines in parallel for validation.
Notify downstream teams that historical comparisons may require dual ingestion.
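
The parallel validation step can be sketched as a comparison over a sample of identifiers. The token function is an HMAC-based stand-in (an assumption, not Kelora's pseudonym()), and compare_rotation is a hypothetical helper. After a rotation you would expect no unchanged tokens and the same number of distinct tokens under both secrets.

```python
import hashlib
import hmac


def sketch_token(secret: bytes, domain: str, value: str) -> str:
    # Stand-in for pseudonym(); HMAC-SHA256 is an assumption, not Kelora's algorithm.
    message = domain.encode("utf-8") + b"\x00" + value.encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:24]


def compare_rotation(values: list, old_secret: bytes, new_secret: bytes,
                     domain: str = "users") -> dict:
    old = [sketch_token(old_secret, domain, v) for v in values]
    new = [sketch_token(new_secret, domain, v) for v in values]
    return {
        "unchanged_tokens": sum(o == n for o, n in zip(old, new)),
        "old_unique": len(set(old)),
        "new_unique": len(set(new)),
    }


sample = ["alice@example.com", "bob@example.com", "alice@example.com"]
print(compare_rotation(sample, b"old-secret", b"new-secret"))
# prints: {'unchanged_tokens': 0, 'old_unique': 2, 'new_unique': 2}
```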

Operational Tips

Do not log the secret or intermediate values in --verbose output.
Combine with the "Sanitize Logs Before Sharing" guide to remove residual text-based identifiers.
Review legal requirements around pseudonymisation; some jurisdictions treat pseudonyms as personal data if the secret is accessible.