Extract and Mask Sensitive Data
Safely process logs containing PII, credentials, and sensitive information using masking, pseudonymization, and redaction techniques.
Problem
Your logs contain sensitive data such as IP addresses, emails, credit card numbers, or other personally identifiable information (PII). You need to analyze the logs while protecting that information and complying with privacy regulations.
Solutions
Mask IP Addresses
Mask IP addresses to protect privacy while maintaining network analysis capabilities:
# Mask last octet (default)
kelora -j app.log \
-e 'e.client_ip = e.client_ip.mask_ip()'
# Mask last 2 octets for stronger anonymization
kelora -j app.log \
-e 'e.client_ip = e.client_ip.mask_ip(2)'
# Mask last 3 octets
kelora -j app.log \
-e 'e.client_ip = e.client_ip.mask_ip(3)'
Results:
- 192.168.1.100 with mask_ip(1) becomes 192.168.1.0
- 192.168.1.100 with mask_ip(2) becomes 192.168.0.0
- 192.168.1.100 with mask_ip(3) becomes 192.0.0.0
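To make the octet arithmetic concrete, here is a minimal Python sketch of the same idea. It is not Kelora's implementation; `mask_ipv4` is a hypothetical helper that zeroes the trailing octets of a dotted-quad IPv4 address and nothing more.

```python
def mask_ipv4(ip: str, octets: int = 1) -> str:
    """Zero the last `octets` octets of a dotted-quad IPv4 address.

    Hypothetical helper for illustration; Kelora's mask_ip() may differ
    in edge cases such as IPv6 or malformed input.
    """
    parts = ip.split(".")
    valid = len(parts) == 4 and all(
        p.isascii() and p.isdigit() and 0 <= int(p) <= 255 for p in parts
    )
    if not valid:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    if not 1 <= octets <= 4:
        raise ValueError("octets must be between 1 and 4")
    # Keep the leading octets and replace the rest with zeros.
    return ".".join(parts[: 4 - octets] + ["0"] * octets)


for n in (1, 2, 3):
    print(n, mask_ipv4("192.168.1.100", n))
```

Masking more octets keeps less of the network prefix, so the choice is a trade-off between anonymity and how much network-level analysis remains possible.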
Generate Pseudonyms
Create consistent pseudonyms using domain separation (requires the KELORA_SECRET environment variable):
# Set secret for pseudonymization
export KELORA_SECRET="your-secret-key-here"
# Pseudonymize user emails
kelora -j users.log \
-e 'e.user_pseudo = pseudonym(e.email, "users")' \
-e 'e.email = ()'
# Pseudonymize with different domains
kelora -j app.log \
-e 'e.user_id = pseudonym(e.user_email, "users")' \
-e 'e.session_id = pseudonym(e.session_token, "sessions")' \
-e 'e.user_email = ()' \
-e 'e.session_token = ()'
Benefits:
- Same input always produces same pseudonym (consistency)
- Different domains produce different pseudonyms (separation)
- Cannot be reversed to recover the original value (one-way)
- Requires secret key (security)
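The four properties above are what a keyed hash with a domain label provides. The following Python sketch shows one common way to get them, using HMAC-SHA256. This is an assumption for illustration only: the source does not say which construction `pseudonym()` uses, and `pseudonym_sketch`, its truncation length, and the example secret are all hypothetical.

```python
import hashlib
import hmac


def pseudonym_sketch(value: str, domain: str, secret: bytes, length: int = 16) -> str:
    """Keyed, domain-separated pseudonym (illustrative, not Kelora's algorithm)."""
    domain_bytes = domain.encode("utf-8")
    # Length-prefix the domain so ("ab", "c") and ("a", "bc") cannot collide.
    message = len(domain_bytes).to_bytes(4, "big") + domain_bytes + value.encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return digest[:length]


secret = b"example-only-secret"  # hypothetical; use a long random value in practice

user = pseudonym_sketch("alice@example.com", "users", secret)
again = pseudonym_sketch("alice@example.com", "users", secret)
other_domain = pseudonym_sketch("alice@example.com", "sessions", secret)

print(user == again)         # consistency: same input, same domain
print(user != other_domain)  # separation: same input, different domain
```

Because the secret is part of the computation, someone who sees only the pseudonyms cannot test guesses against them without also knowing the key.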
Hash Sensitive Fields
Hash data for grouping and deduplication without exposing original values:
# SHA-256 hash (default)
kelora -j app.log \
-e 'e.user_hash = e.username.hash()' \
-e 'e.username = ()'
# Fast hash for grouping (xxh3)
kelora -j app.log \
-e 'e.session_hash = e.session_id.hash("xxh3")' \
-e 'e.session_id = ()'
# MD5 for legacy compatibility
kelora -j app.log \
-e 'e.email_hash = e.email.hash("md5")' \
-e 'e.email = ()'
# BLAKE3 for high-speed cryptographic hashing
kelora -j app.log \
-e 'e.api_key_hash = e.api_key.hash("blake3")' \
-e 'e.api_key = ()'
Available algorithms: sha256 (default), sha1, md5, xxh3, blake3
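As a rough illustration of why hashing still supports grouping and deduplication, the Python sketch below counts events per hashed username. It uses only `hashlib` from the standard library, so it covers sha256 and sha1; xxh3 and blake3 would need third-party packages and are not shown. `hash_field` is a hypothetical helper, not a Kelora function.

```python
import hashlib
from collections import Counter


def hash_field(value: str, algorithm: str = "sha256") -> str:
    """Hex digest of a field value using a hashlib algorithm name."""
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()


events = [{"username": "alice"}, {"username": "bob"}, {"username": "alice"}]

# Equal inputs give equal digests, so counts per user survive the hashing.
counts = Counter(hash_field(event["username"]) for event in events)

print(len(counts))                  # number of distinct users
print(max(counts.values()))         # events for the most active user
```

Note that an unkeyed hash of a guessable value (a username, an email) can be matched against a list of candidates, which is one reason to reach for a keyed pseudonym when the data leaves your control.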
Remove Sensitive Fields
Delete fields entirely from events:
# Remove single field
kelora -j app.log \
-e 'e.password = ()'
# Remove multiple sensitive fields
kelora -j app.log \
-e 'e.password = (); e.api_key = (); e.ssn = (); e.credit_card = ()'
# Conditional removal based on level
kelora -j app.log \
-e 'if e.level != "DEBUG" { e.stack_trace = (); e.locals = () }'
# Remove all except specific fields
kelora -j app.log \
-e 'let timestamp = e.timestamp; let level = e.level; let message = e.message' \
-e 'e = ()' \
-e 'e.timestamp = timestamp; e.level = level; e.message = message'
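The examples above mix two strategies: dropping named fields (a denylist) and keeping only named fields (an allowlist). A small Python sketch of the difference, using hypothetical helpers `drop_fields` and `keep_fields` on a plain dictionary:

```python
def drop_fields(event: dict, denylist: set) -> dict:
    """Denylist: remove the named fields, keep everything else."""
    return {key: value for key, value in event.items() if key not in denylist}


def keep_fields(event: dict, allowlist: list) -> dict:
    """Allowlist: keep only the named fields that are present."""
    return {key: event[key] for key in allowlist if key in event}


event = {
    "timestamp": "2024-01-01T00:00:00Z",
    "level": "INFO",
    "message": "login ok",
    "password": "hunter2",
    "debug_token": "abc123",  # a field nobody thought to denylist
}

print(drop_fields(event, {"password"}))
print(keep_fields(event, ["timestamp", "level", "message"]))
```

The allowlist is the safer default for exports, because a sensitive field that appears later (like `debug_token` here) is excluded without anyone having to anticipate it.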
Extract and Redact Pattern
Extract sensitive data for analysis, then redact from output:
# Extract credit card type, then redact number
kelora -j transactions.log \
-e 'e.card_type = e.card_number.extract_re(r"^(\\d{4})", 1)' \
-e 'e.card_last4 = e.card_number.extract_re(r"(\\d{4})$", 1)' \
-e 'e.card_number = "REDACTED"'
# Extract email domain, remove local part
kelora -j users.log \
-e 'let email = e.email.parse_email()' \
-e 'e.email_domain = email.domain' \
-e 'e.email = ()'
# Extract URL domain, remove path/query
kelora -j requests.log \
-e 'let url = e.request_url.parse_url()' \
-e 'e.domain = url.host' \
-e 'e.has_query = url.query != ""' \
-e 'e.request_url = ()'
Mask Within Text Fields
Find and mask sensitive patterns in log messages:
# Mask credit card numbers in message
kelora -j app.log \
-e 'e.message = e.message.replace(r"\\b\\d{16}\\b", "[CARD_REDACTED]")'
# Mask email addresses
kelora -j app.log \
-e 'e.message = e.message.replace(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "[EMAIL_REDACTED]")'
# Mask API keys (various formats)
kelora -j app.log \
-e 'e.message = e.message.replace(r"api[_-]?key[=:]\\s*[a-zA-Z0-9]{20,}", "api_key=[REDACTED]")'
# Mask IP addresses in text
kelora -j app.log \
-e 'let ip = e.message.extract_ip()' \
-e 'if ip != "" { e.message = e.message.replace(ip, ip.mask_ip(2)) }'
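Before running patterns like these over production logs, it can help to try them on sample strings. The Python sketch below applies the same three regular expressions with `re.sub`; the `redact` helper and the sample message are made up for illustration, and Python's regex dialect may differ from Kelora's in small ways.

```python
import re

# Same patterns as the commands above, written as Python raw strings.
CARD_RE = re.compile(r"\b\d{16}\b")
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
API_KEY_RE = re.compile(r"api[_-]?key[=:]\s*[a-zA-Z0-9]{20,}")


def redact(message: str) -> str:
    """Replace card numbers, email addresses, and API keys with placeholders."""
    message = CARD_RE.sub("[CARD_REDACTED]", message)
    message = EMAIL_RE.sub("[EMAIL_REDACTED]", message)
    message = API_KEY_RE.sub("api_key=[REDACTED]", message)
    return message


sample = "user bob@example.com paid with 4111111111111111 using api_key=abcdefghij0123456789XYZ"
print(redact(sample))
# → user [EMAIL_REDACTED] paid with [CARD_REDACTED] using api_key=[REDACTED]
```

The card pattern only catches 16 contiguous digits, so numbers written with spaces or dashes, or cards with a different length, pass through unmasked. Test against the formats that actually occur in your logs.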
Real-World Examples
GDPR-Compliant Log Processing
export KELORA_SECRET="gdpr-compliance-secret"
kelora -j user_activity.log \
-e 'e.user_pseudo = pseudonym(e.user_email, "users")' \
-e 'e.ip_masked = e.ip_address.mask_ip(2)' \
-e 'e.user_email = (); e.ip_address = (); e.full_name = ()' \
-k timestamp,user_pseudo,ip_masked,action \
-J > anonymized.json
Security Audit Log Sanitization
kelora -j security.log \
-e 'e.username_hash = e.username.hash("sha256")' \
-e 'e.source_ip = e.source_ip.mask_ip(2)' \
-e 'e.session_hash = e.session_id.hash("xxh3")' \
-e 'e.username = (); e.session_id = ()' \
--filter 'e.event_type != "login_success"' \
-J > sanitized_audit.json
Payment Log Redaction
kelora -j payments.log \
-e 'e.card_last4 = e.card_number.extract_re(r"(\\d{4})$", 1)' \
-e 'e.card_bin = e.card_number.extract_re(r"^(\\d{6})", 1)' \
-e 'e.amount_bucket = floor(e.amount / 100) * 100' \
-e 'e.card_number = (); e.cvv = (); e.expiry = ()' \
-k timestamp,card_bin,card_last4,amount_bucket,status
Multi-Tenant Data Isolation
export KELORA_SECRET="tenant-separation-key"
kelora -j multi_tenant.log \
-e 'e.tenant_id = pseudonym(e.tenant_name, "tenants")' \
-e 'e.user_id = pseudonym(e.user_email, e.tenant_name)' \
-e 'e.tenant_name = (); e.user_email = (); e.user_name = ()' \
-J > isolated_logs.json
Database Query Sanitization
kelora -j db_queries.log \
-e 'e.query_hash = e.query.hash("xxh3")' \
-e 'e.table = e.query.extract_re(r"FROM\\s+(\\w+)", 1)' \
-e 'e.has_where = e.query.contains("WHERE")' \
-e 'e.query = "REDACTED"' \
-k timestamp,user,table,duration_ms,rows,query_hash
API Key and Token Redaction
kelora -f combined access.log \
-e 'if e.path.contains("key=") || e.path.contains("token=") {
let params = e.path.after("?").parse_query_params();
if params.contains("key") { e.has_api_key = true };
if params.contains("token") { e.has_token = true };
e.path = e.path.before("?") + "?[PARAMS_REDACTED]"
}' \
-k timestamp,ip,method,path,status
Healthcare Data De-identification
export KELORA_SECRET="hipaa-compliance-secret"
kelora -j health_records.log \
-e 'e.patient_id = pseudonym(e.patient_ssn, "patients")' \
-e 'e.provider_id = pseudonym(e.doctor_name, "providers")' \
-e 'e.age_bracket = floor(e.age / 10) * 10' \
-e 'e.zip_prefix = e.zip_code.slice(":3")' \
-e 'e.patient_ssn = (); e.patient_name = (); e.doctor_name = ()' \
-e 'e.age = (); e.zip_code = (); e.phone = ()' \
-J > deidentified.json
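The `age_bracket` and `zip_prefix` fields are examples of generalization: an exact value is replaced with a coarser one that many people share. The arithmetic can be sketched in Python as follows; the helper names and the sample record are hypothetical.

```python
import math


def age_bracket(age: float, width: int = 10) -> int:
    """Round an age down to the start of its bracket, e.g. 47 -> 40."""
    return math.floor(age / width) * width


def zip_prefix(zip_code: str, keep: int = 3) -> str:
    """Keep only the leading characters of a postal code."""
    return zip_code[:keep]


record = {"age": 47, "zip_code": "94110"}
generalized = {
    "age_bracket": age_bracket(record["age"]),
    "zip_prefix": zip_prefix(record["zip_code"]),
}
print(generalized)  # → {'age_bracket': 40, 'zip_prefix': '941'}
```

Wider brackets and shorter prefixes put more people in each group, at the cost of analytical detail. Whether a given width satisfies a particular regulation is a policy question that this sketch does not answer.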
Session Tracking Without PII
export KELORA_SECRET="session-tracking-secret"
kelora -j app.log \
-e 'e.session_hash = pseudonym(e.session_id, "sessions")' \
-e 'e.user_hash = pseudonym(e.user_id, "users")' \
-e 'e.ip_network = e.ip.mask_ip(2)' \
-e 'e.session_id = (); e.user_id = (); e.ip = ()' \
-e 'track_unique("sessions", e.session_hash)' \
--metrics
Extract URLs Without Query Parameters
kelora -j web_requests.log \
-e 'let url = e.full_url.parse_url()' \
-e 'e.base_url = url.scheme + "://" + url.host + url.path' \
-e 'e.has_params = url.query != ""' \
-e 'e.param_count = if url.query == "" { 0 } else { url.query.split("&").len() }' \
-e 'e.full_url = ()' \
-k timestamp,base_url,has_params,param_count
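For comparison, the same decomposition can be sketched with Python's standard `urllib.parse`. This is only an illustration of the idea, and `strip_query` is a hypothetical helper; the field names mirror the command above.

```python
from urllib.parse import parse_qsl, urlsplit


def strip_query(full_url: str) -> dict:
    """Split a URL into a query-free base plus summary facts about the query."""
    parts = urlsplit(full_url)
    # Use hostname/port rather than netloc so any user:password@ prefix is dropped.
    host = parts.hostname or ""
    if parts.port:
        host += f":{parts.port}"
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "base_url": f"{parts.scheme}://{host}{parts.path}",
        "has_params": bool(parts.query),
        "param_count": len(params),
    }


print(strip_query("https://example.com/search?q=alice&token=abc123"))
print(strip_query("https://example.com/home"))
```

One detail worth checking in any implementation is the empty-query case: splitting an empty string on "&" typically yields one empty element, so a naive count reports 1 parameter for a URL that has none.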
Automation and Export
Automated Daily Sanitization
#!/bin/bash
export KELORA_SECRET="$(cat /secure/kelora.secret)"
kelora -j /var/log/app-$(date +%Y-%m-%d).log \
-e 'e.user_id = pseudonym(e.email, "users")' \
-e 'e.ip = e.ip.mask_ip(2)' \
-e 'e.email = (); e.phone = ()' \
--parallel \
-J > "/archive/sanitized-$(date +%Y-%m-%d).json"
Validate Redaction
# Check for common sensitive patterns in output
kelora -j sanitized.log -qq \
--filter 'e.message.has_matches("\\b\\d{3}-\\d{2}-\\d{4}\\b")' \
&& echo "WARNING: SSN pattern found!" \
|| echo "No SSN patterns detected"
# Check for email addresses
kelora -j sanitized.log -qq \
--filter 'e.message.has_matches("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}")' \
&& echo "WARNING: Email addresses found!" \
|| echo "No email addresses detected"
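An independent second check, written outside the tool that did the redaction, can catch mistakes in the pipeline itself. The Python sketch below scans JSON-lines output for the same SSN and email patterns across every string field, not just `message`. It assumes one JSON object per line; `scan_lines` and the sample events are hypothetical.

```python
import json
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
}


def scan_lines(lines) -> dict:
    """Count, per pattern, how many events match in any top-level string field."""
    hits = {name: 0 for name in PATTERNS}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        text = " ".join(v for v in event.values() if isinstance(v, str))
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                hits[name] += 1
    return hits


sample = [
    '{"message": "user logged in", "user": "u_93af"}',
    '{"message": "contact bob@example.com"}',
    '{"message": "ssn 123-45-6789 seen"}',
]
print(scan_lines(sample))  # → {'ssn': 1, 'email': 1}
```

To scan a real file, pass an open file handle, for example `scan_lines(open("sanitized.log"))`. A result of all zeros means only that these specific patterns were not found, not that the output is free of sensitive data.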
Export to Secure Storage
kelora -j sensitive.log \
-e 'e.user_hash = e.email.hash("sha256")' \
-e 'e.email = ()' \
--parallel \
-J | gzip > secure_logs_$(date +%Y%m%d).json.gz
Tips
Security:
- Always use a strong, random KELORA_SECRET for pseudonymization
- Store secrets securely (environment variables, secret managers)
- Use domain separation in pseudonym() to prevent cross-context correlation
- Prefer pseudonym() over hash() when consistency across datasets is needed
- Use hash("xxh3") for grouping without cryptographic security needs
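One reason the secret key matters: a plain hash of a guessable value can be matched against a list of candidates, while a keyed hash cannot be tested without the key. The Python sketch below demonstrates this with SHA-256 and HMAC-SHA256. It is a general illustration and makes no claim about how Kelora's hash() or pseudonym() are implemented; all names and values are made up.

```python
import hashlib
import hmac


def plain_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value: str, secret: bytes) -> str:
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


# An attacker with a list of likely emails and a leaked digest.
candidates = ["alice@example.com", "bob@example.com", "carol@example.com"]

leaked_plain = plain_hash("bob@example.com")
recovered_plain = next((c for c in candidates if plain_hash(c) == leaked_plain), None)
print(recovered_plain)  # → bob@example.com

# With a keyed hash, guessing requires the secret as well.
real_secret = b"long-random-secret-kept-private"  # hypothetical value
leaked_keyed = keyed_hash("bob@example.com", real_secret)
attacker_secret = b"attacker-guess"
recovered_keyed = next(
    (c for c in candidates if keyed_hash(c, attacker_secret) == leaked_keyed), None
)
print(recovered_keyed)  # → None
```

This is also why the strength of the secret matters: a short or predictable key can itself be guessed, which brings back the dictionary attack.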
Compliance:
- Document your masking strategy for audit purposes
- Test redaction patterns against sample sensitive data
- Use --stats to track how many events were processed
- Consider using --metrics to count redacted fields
- Validate output before sharing with third parties
Performance:
- Use --parallel for large-scale sanitization
- Hashing with xxh3 is faster than cryptographic hashes for grouping
- Remove fields early in the pipeline to reduce processing overhead
- Use mask_ip() instead of regex for IP address masking
Validation:
# Count events with potential PII
kelora -j output.log -q \
-e 'if e.message.has_matches("\\b\\d{16}\\b") { track_count("cards") }' \
-e 'if e.message.has_matches("@") { track_count("emails") }' \
--metrics
# Sample output for manual inspection
kelora -j sanitized.log -n 100 | less
Common Patterns:
// Mask all IPs in event
for field in ["client_ip", "server_ip", "source_ip"] {
    if e.contains(field) {
        e[field] = e[field].mask_ip(2)
    }
}
// Remove all fields containing "password" or "secret"
// (iterate over a copy of the key list so the map can be modified safely)
for field in e.keys() {
    let name = field.to_lower();
    if name.contains("password") || name.contains("secret") {
        e[field] = ()
    }
}
Troubleshooting
Pseudonym consistency issues:
# Ensure KELORA_SECRET is set and consistent
[ -n "$KELORA_SECRET" ] && echo "KELORA_SECRET is set" || echo "KELORA_SECRET is not set"
# Must be the same across runs for consistent pseudonyms
IP masking not working:
# Validate IP format first
kelora -j app.log \
--filter 'e.ip.is_ipv4()' \
-e 'e.ip = e.ip.mask_ip()'
Pattern not matching:
# Test regex patterns
kelora -j app.log \
-e 'e.test_match = e.field.has_matches("your_pattern")' \
--filter 'e.test_match' \
-n 10
Hashing performance:
// Use faster algorithms for non-cryptographic needs
e.group_id = e.user_id.hash("xxh3");    // Fast, non-cryptographic
e.secure_id = e.user_id.hash("blake3"); // Cryptographic and fast
e.legacy_id = e.user_id.hash("sha256"); // Cryptographic
See Also
- Function Reference - Complete list of masking and hashing functions
- Configuration System - Centralize redaction defaults
- Monitor Application Health - Health metrics without exposing PII
- Analyze Web Traffic - Web log analysis with IP masking