Sensitive Data Taxonomy for Terraform State Sanitization
Executive Summary
This document defines a taxonomy of sensitive data found in Terraform state files that must be detected, filtered, and sanitized before the state is loaded into the Backstage catalog. The taxonomy is organized by sensitivity level, resource type, and compliance requirement.
Sensitivity Classifications
CRITICAL (Must Always Filter)
Data that, if exposed, would immediately compromise security or violate compliance regulations.
HIGH (Filter Unless Explicitly Allowed)
Data that poses significant security risk but may be needed for specific operational use cases with proper controls.
MEDIUM (Configurable Based on Client Policy)
Data that should be evaluated based on client-specific security posture and risk tolerance.
LOW (Informational, May Be Allowed)
Data that is typically safe to expose but should be evaluated in context.
1. Authentication & Authorization Credentials
1.1 Private Keys (CRITICAL)
Terraform Attributes:
- private_key
- private_key_pem
- private_key_openssh
- tls_private_key
- service_account_key
- service_account_private_key
- certificate_private_key
- ssh_private_key
- rsa_private_key
- ecdsa_private_key
- ed25519_private_key
Resource Types Affected:
- google_service_account_key
- google_compute_instance (SSH keys in metadata)
- tls_private_key
- aws_iam_access_key
- azurerm_key_vault_secret
Detection Patterns:
/-----BEGIN ((RSA|EC|DSA|OPENSSH|ENCRYPTED) )?PRIVATE KEY-----/ # also matches the unlabelled PKCS#8 header
/private_key["\s]*[:=]["\s]*[A-Za-z0-9+/=]{64,}/
/\bAIza[0-9A-Za-z_-]{35}\b/ # Google API keys
Sanitization Action:
- REDACT: Replace entire value with [REDACTED:PRIVATE_KEY]
- AUDIT: Log resource type, attribute path, key length, algorithm type
- ALERT: Trigger security notification for manual review
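The detection pattern and the REDACT and AUDIT actions above can be combined in one small function. The following is a minimal Python sketch rather than a reference implementation: the function name `redact_private_key` and the audit field names are hypothetical, the regex accepts both labelled PEM headers and the unlabelled PKCS#8 header, and the ALERT step is assumed to be handled by the caller.

```python
import re

# PEM private key header; the label (RSA, EC, ...) is optional so that the
# unlabelled PKCS#8 form "-----BEGIN PRIVATE KEY-----" also matches.
PRIVATE_KEY_HEADER = re.compile(
    r"-----BEGIN (?:(RSA|EC|DSA|OPENSSH|ENCRYPTED) )?PRIVATE KEY-----"
)


def redact_private_key(attribute_path: str, value: str):
    """Return (sanitized_value, audit_record).

    audit_record is None when the value does not look like a private key.
    The field names in the record are illustrative only.
    """
    match = PRIVATE_KEY_HEADER.search(value)
    if match is None:
        return value, None
    audit = {
        "attribute_path": attribute_path,
        "key_length": len(value),
        # Taken from the PEM header label; "PKCS8" stands for "no label".
        "algorithm": match.group(1) or "PKCS8",
    }
    return "[REDACTED:PRIVATE_KEY]", audit
```

The audit record carries only the length and header label, so it can be logged without exposing key material.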
1.2 Passwords & Secrets (CRITICAL)
Terraform Attributes:
- password
- admin_password
- root_password
- initial_password
- master_password
- db_password
- database_password
- user_password
- secret
- api_secret
- client_secret
- webhook_secret
- encryption_key
- master_key
Resource Types Affected:
- google_sql_database_instance
- google_secret_manager_secret
- google_compute_instance (metadata)
- google_cloudfunctions_function (environment variables)
- google_cloud_run_service (environment variables)
Detection Patterns:
/(password|secret|key)["\s]*[:=]["\s]*["'][^"']{8,}["']/i
/\b[A-Za-z0-9_-]{32,}\b/ # Generic long random strings
Sanitization Action:
- REDACT: Replace with [REDACTED:SECRET]
- AUDIT: Log resource type, secret length, creation timestamp
- HASH: Store SHA256 hash for change detection without exposing value
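The HASH action can be illustrated with the standard library alone. This is a minimal sketch under assumptions: the function names `redact_secret` and `secret_changed` and the shape of the returned dictionary are hypothetical and not part of the taxonomy.

```python
import hashlib


def redact_secret(value: str) -> dict:
    """Replace a secret with a placeholder plus a SHA-256 digest."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {
        "value": "[REDACTED:SECRET]",
        "sha256": f"sha256:{digest}",
        "length": len(value),  # logged for the AUDIT action
    }


def secret_changed(old: dict, new: dict) -> bool:
    """Compare two sanitized records without access to the plaintext."""
    return old["sha256"] != new["sha256"]
```

A plain digest of a short or guessable password can be brute-forced offline, so a keyed hash (for example `hmac` with a per-client key) may be a better fit where that matters; the plain SHA-256 here simply follows the action as written.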
1.3 Access Tokens & API Keys (CRITICAL)
Terraform Attributes:
- access_token
- refresh_token
- auth_token
- bearer_token
- api_key
- api_token
- oauth_token
- github_token
- gitlab_token
- slack_token
Provider-Specific Patterns:
Google Cloud:
- AIza[0-9A-Za-z_-]{35} # Google API Key
- ya29\.[0-9A-Za-z_-]{68,} # OAuth 2.0 Access Token
AWS:
- AKIA[0-9A-Z]{16} # AWS Access Key ID
- [0-9a-zA-Z/+=]{40} # AWS Secret Access Key
GitHub:
- ghp_[0-9a-zA-Z]{36} # GitHub Personal Access Token
- gho_[0-9a-zA-Z]{36} # GitHub OAuth Token
Slack:
- xoxb-[0-9]{11}-[0-9]{11}-[0-9a-zA-Z]{24} # Slack Bot Token
Sanitization Action:
- REDACT: Replace with [REDACTED:TOKEN]
- FINGERPRINT: Store last 4 characters for identification
- ROTATE: Trigger automated token rotation workflow
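As an illustration of the REDACT and FINGERPRINT actions, the sketch below matches a value against a few of the provider patterns listed above and keeps only the last four characters. The names `TOKEN_PATTERNS` and `redact_token` and the returned fields are hypothetical; the AWS secret key, OAuth, and Slack patterns are left out for brevity, and rotation is assumed to be triggered elsewhere.

```python
import re

# A subset of the provider-specific patterns from this section.
TOKEN_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_-]{35}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[0-9a-zA-Z]{36}"),
    "github_oauth": re.compile(r"gho_[0-9a-zA-Z]{36}"),
}


def redact_token(value: str):
    """Return a sanitized record, or None if no provider pattern matches."""
    for provider, pattern in TOKEN_PATTERNS.items():
        if pattern.fullmatch(value):
            return {
                "value": "[REDACTED:TOKEN]",
                "provider": provider,
                "fingerprint": value[-4:],  # last 4 characters only
            }
    return None
```

Using `fullmatch` keeps the check strict for single attribute values; scanning free text such as startup scripts would call for `search` or `finditer` instead.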
2. Network & Infrastructure Secrets
2.1 Private IP Addresses (HIGH)
Terraform Attributes:
- private_ip_address
- internal_ip
- private_cluster_config.private_endpoint
- ip_address (when in private range)
- network_interface.internal_ip
Detection Patterns:
/\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/ # 10.0.0.0/8
/\b172\.(1[6-9]|2[0-9]|3[0-1])\.\d{1,3}\.\d{1,3}\b/ # 172.16.0.0/12
/\b192\.168\.\d{1,3}\.\d{1,3}\b/ # 192.168.0.0/16
Resource Types Affected:
- google_compute_instance
- google_container_cluster (private endpoint)
- google_compute_address (address_type = INTERNAL)
- google_sql_database_instance
Sanitization Action:
- MASK: Replace with 10.x.x.x or [PRIVATE_IP]
- PRESERVE_SUBNET: Optionally keep subnet info: 10.1.x.x
- CONFIGURABLE: Allow per-client policy (some clients may allow internal IPs)
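The MASK and PRESERVE_SUBNET actions can be sketched with Python's `ipaddress` module, which is used here in place of the regexes above so that values such as 110.1.2.3 are not mistaken for 10.0.0.0/8 addresses. The function name `mask_private_ip` and its `preserve_subnet` flag are hypothetical.

```python
import ipaddress

# The three RFC 1918 ranges listed in the detection patterns.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def mask_private_ip(value: str, preserve_subnet: bool = False) -> str:
    """Mask an RFC 1918 IPv4 address; return anything else unchanged."""
    try:
        addr = ipaddress.IPv4Address(value)
    except ValueError:
        return value  # not an IPv4 address
    if not any(addr in net for net in PRIVATE_NETWORKS):
        return value  # public or otherwise out of scope
    octets = str(addr).split(".")
    keep = 2 if preserve_subnet else 1
    return ".".join(octets[:keep] + ["x"] * (4 - keep))
```

For example, `mask_private_ip("10.1.2.3")` gives `10.x.x.x`, and with `preserve_subnet=True` it gives `10.1.x.x`. A per-client policy would decide which of the two to call, or whether to skip masking.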
2.2 Connection Strings & URIs (CRITICAL)
Terraform Attributes:
- connection_string
- connection_url
- jdbc_url
- database_url
- redis_url
- mongodb_uri
Detection Patterns:
# PostgreSQL
/postgres(ql)?:\/\/[^:]+:[^@]+@[^\/]+\/\w+/
# MySQL
/mysql:\/\/[^:]+:[^@]+@[^\/]+\/\w+/
# MongoDB
/mongodb(\+srv)?:\/\/[^:]+:[^@]+@[^\/]+\/\w+/
# Redis
/redis:\/\/:[^@]+@[^\/]+/
Sanitization Action:
- PARSE: Extract protocol, host, database name
- PRESERVE: Keep protocol://[USER]:[REDACTED]@host:port/database
- AUDIT: Log the parsed connection details (never the password) separately for troubleshooting
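The PARSE and PRESERVE steps can be sketched with `urllib.parse` instead of per-database regexes. This is an illustration under assumptions: the function name `redact_connection_string` is hypothetical, the user name is kept as-is, and URIs without an embedded password are returned unchanged.

```python
from urllib.parse import urlsplit


def redact_connection_string(uri: str) -> str:
    """Replace only the password component of a URI-style connection string."""
    parts = urlsplit(uri)
    if parts.password is None:
        return uri  # nothing to redact
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    user = parts.username or ""
    return f"{parts.scheme}://{user}:[REDACTED]@{host}{rest}"
```

Note that `hostname` is lower-cased by `urlsplit` and IPv6 brackets are dropped, so a production version would need to handle those cases separately. Query parameters are kept verbatim, which means a password passed as a query parameter would not be caught by this sketch.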
2.3 SSH Configuration (HIGH)
Terraform Attributes:
- ssh_keys (in metadata)
- ssh_authorized_keys
- ssh_config
- known_hosts
Resource Types Affected:
- google_compute_instance
- google_compute_instance_template
- google_os_login_ssh_public_key
Sanitization Action:
- REDACT_PRIVATE: Remove private keys entirely
- HASH_PUBLIC: Replace public key content with fingerprint
- PRESERVE_TYPE: Keep key type (ssh-rsa, ssh-ed25519, etc.)
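HASH_PUBLIC and PRESERVE_TYPE can be sketched as follows. The example assumes an authorized_keys-style line (`<type> <base64 blob> [comment]`) and produces an OpenSSH-style SHA256 fingerprint, that is, the unpadded base64 of the SHA-256 digest of the decoded key blob. The function name `fingerprint_public_key` is hypothetical, and the `user:` prefix used in GCE `ssh-keys` metadata would have to be stripped before calling it.

```python
import base64
import hashlib


def fingerprint_public_key(line: str) -> str:
    """Return '<key type> SHA256:<fingerprint>' for one public key line."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected '<type> <base64-key> [comment]'")
    key_type, blob_b64 = fields[0], fields[1]
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 without trailing '=' padding.
    fingerprint = base64.b64encode(digest).decode("ascii").rstrip("=")
    return f"{key_type} SHA256:{fingerprint}"
```

The comment field, which often contains a user name or email address, is dropped by design.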
3. GCP-Specific Sensitive Data
3.1 Service Account Keys (CRITICAL)
Resource Type:
google_service_account_key
Attributes:
- private_key
- private_key_type
- public_key
- valid_after
- valid_before
Sanitization Rules:
private_key:
  action: REDACT
  replacement: "[REDACTED:SA_KEY]"
private_key_type:
  action: PRESERVE
  reason: "Needed for key rotation tracking"
public_key:
  action: HASH
  method: SHA256
  reason: "Public keys can identify service account without exposing credentials"
3.2 Cloud SQL Admin Passwords (CRITICAL)
Resource Type:
google_sql_database_instance
google_sql_user
Attributes:
- root_password
- password
- settings.backup_configuration.binary_log_enabled
- settings.ip_configuration.private_network
Sanitization Rules:
root_password:
  action: REDACT
  replacement: "[REDACTED:SQL_PASSWORD]"
password:
  action: REDACT
  replacement: "[REDACTED:SQL_PASSWORD]"
settings.ip_configuration.private_network:
  action: PRESERVE
  reason: "Network reference needed for topology visualization"
settings.ip_configuration.authorized_networks[].value:
  action: MASK_LAST_OCTET
  reason: "Show network ranges without exposing exact IPs"
3.3 KMS Crypto Keys (HIGH)
Resource Type:
google_kms_crypto_key
google_kms_secret_ciphertext
Attributes:
- ciphertext
- plaintext
- additional_authenticated_data
Sanitization Rules:
ciphertext:
  action: REDACT
  replacement: "[REDACTED:CIPHERTEXT]"
  preserve_length: true # For size analysis
plaintext:
  action: REDACT
  replacement: "[REDACTED:PLAINTEXT]"
  alert: CRITICAL
additional_authenticated_data:
  action: EVALUATE
  rules:
    - if contains sensitive patterns: REDACT
    - else: PRESERVE
3.4 Secret Manager Secrets (CRITICAL)
Resource Type:
google_secret_manager_secret
google_secret_manager_secret_version
Attributes:
- secret_data
- payload.data
Sanitization Rules:
secret_data:
  action: REDACT_COMPLETELY
  replacement: "[REDACTED:SECRET_MANAGER]"
  preserve_metadata:
    - name
    - version
    - create_time
    - labels
payload.data:
  action: REDACT_COMPLETELY
3.5 IAM Policy Bindings (MEDIUM)
Resource Type:
google_project_iam_*
google_organization_iam_*
Attributes:
- members[]
- role
- condition
Sanitization Rules:
members:
  action: EVALUATE_PER_MEMBER
  rules:
    - "user:*@external.com": REDACT_DOMAIN
    - "serviceAccount:*@*.iam.gserviceaccount.com": PRESERVE
    - "group:*": PRESERVE_IF_INTERNAL
role:
  action: PRESERVE
  reason: "Needed for RBAC visualization"
condition.expression:
  action: ANALYZE_FOR_SECRETS
  pattern: /password|secret|token/i
  if_match: REDACT_CLAUSE
4. Environment Variables (HIGH)
4.1 Cloud Run / Cloud Functions
Resource Types:
google_cloud_run_service
google_cloudfunctions_function
Attributes:
- spec.template.spec.containers[].env[]
- environment_variables
Sanitization Rules:
environment_variables:
  action: EVALUATE_PER_VAR
  rules:
    # Secrets
    - name matches /password|secret|key|token/i: REDACT_VALUE
    # Configuration (safe)
    - name matches /region|project|bucket|dataset/i: PRESERVE
    # Connection strings
    - value matches connection_string_pattern: REDACT_CREDENTIALS_ONLY
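The three EVALUATE_PER_VAR rules can be read as an ordered chain, as in the sketch below. Several things here are assumptions: the function name `sanitize_env`, the `[REDACTED:SECRET]` placeholder, the simplified `URI_CREDENTIALS` regex standing in for `connection_string_pattern`, and the choice to preserve variables that match no rule.

```python
import re

SECRET_NAME = re.compile(r"password|secret|key|token", re.IGNORECASE)
SAFE_NAME = re.compile(r"region|project|bucket|dataset", re.IGNORECASE)
# Simplified stand-in for connection_string_pattern: user:password@ after "://".
URI_CREDENTIALS = re.compile(r"(?<=://)([^:/@]*):([^@/]+)@")


def sanitize_env(env: dict) -> dict:
    """Apply the rules in the order they are listed; first match wins."""
    sanitized = {}
    for name, value in env.items():
        if SECRET_NAME.search(name):
            sanitized[name] = "[REDACTED:SECRET]"       # REDACT_VALUE
        elif SAFE_NAME.search(name):
            sanitized[name] = value                      # PRESERVE
        elif URI_CREDENTIALS.search(value):
            # REDACT_CREDENTIALS_ONLY: keep user and host, drop the password.
            sanitized[name] = URI_CREDENTIALS.sub(r"\1:[REDACTED]@", value)
        else:
            sanitized[name] = value  # assumption: unmatched variables pass through
    return sanitized
```

Because the first matching rule wins, a variable whose name looks safe but whose value embeds credentials would be preserved; a stricter policy could run the value check before the name checks.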
Common Secret Variable Names:
CRITICAL:
- DATABASE_PASSWORD
- API_KEY
- SECRET_KEY
- JWT_SECRET
- OAUTH_CLIENT_SECRET
- WEBHOOK_SECRET
HIGH:
- DATABASE_URL (contains credentials)
- REDIS_URL (contains password)
- SMTP_PASSWORD
SAFE:
- DATABASE_HOST
- DATABASE_NAME
- API_ENDPOINT
- LOG_LEVEL
5. Metadata & Custom Fields
5.1 Compute Instance Metadata (HIGH)
Resource Type:
google_compute_instance
google_compute_instance_template
Attributes:
- metadata
- metadata_startup_script
Sanitization Rules:
metadata:
  action: DEEP_SCAN
  rules:
    - ssh-keys: REDACT_KEYS
    - startup-script: SCAN_FOR_SECRETS
    - user-data: SCAN_FOR_SECRETS
    - custom fields: PATTERN_MATCH
metadata_startup_script:
  action: SCAN_FOR_PATTERNS
  patterns:
    - /password\s*=\s*["'][^"']+["']/: REDACT
    - /api_key\s*=\s*["'][^"']+["']/: REDACT
    - /export\s+\w*SECRET\w*=/: REDACT_LINE
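The SCAN_FOR_PATTERNS rules for startup scripts can be sketched line by line. The function name `sanitize_script` and the replacement markers are hypothetical; unlike the patterns above, the inline regex here requires the closing quote to match the opening one.

```python
import re

# password = "..." or api_key = '...': redact only the quoted value.
INLINE_SECRET = re.compile(r"""((?:password|api_key)\s*=\s*)(["'])[^"']+\2""")
# export FOO_SECRET=...: redact the whole line.
EXPORT_SECRET = re.compile(r"export\s+\w*SECRET\w*=")


def sanitize_script(script: str) -> str:
    """Apply REDACT to inline assignments and REDACT_LINE to secret exports."""
    cleaned = []
    for line in script.splitlines():
        if EXPORT_SECRET.search(line):
            cleaned.append("# [REDACTED:LINE]")
        else:
            cleaned.append(INLINE_SECRET.sub(r"\1\2[REDACTED]\2", line))
    return "\n".join(cleaned)
```

Keeping the assignment and quotes in place preserves the script's structure for readers of the catalog while removing the value itself.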
5.2 Labels & Tags (LOW)
All Resources:
- labels
- tags
Sanitization Rules:
labels:
  action: EVALUATE_PER_LABEL
  rules:
    # Safe organizational labels
    - environment: PRESERVE
    - team: PRESERVE
    - cost-center: PRESERVE
    # Potentially sensitive
    - owner: ANONYMIZE_IF_EXTERNAL
    - contact: ANONYMIZE_EMAIL
    - backup-key-id: REDACT
6. Networking Details
6.1 VPC & Subnets (MEDIUM)
Resource Types:
google_compute_network
google_compute_subnetwork
Attributes:
- ip_cidr_range
- secondary_ip_range[].ip_cidr_range
Sanitization Rules:
ip_cidr_range:
  action: MASK_IF_PRIVATE
  rules:
    - if private range: "10.x.0.0/16"
    - if public range: PRESERVE (infrastructure info)
secondary_ip_range:
  action: PRESERVE_COUNT_ONLY
  reason: "Show number of secondary ranges without exposing IPs"
6.2 Firewall Rules (MEDIUM)
Resource Type:
google_compute_firewall
Attributes:
- source_ranges[]
- source_tags[]
- allowed[].ports[]
Sanitization Rules:
source_ranges:
  action: EVALUATE_PER_RANGE
  rules:
    - "0.0.0.0/0": FLAG_AS_PUBLIC_EXPOSURE
    - private ranges: MASK_SPECIFIC_IPS
allowed:
  action: PRESERVE_STRUCTURE
  reason: "Firewall topology needed for security posture"
  sensitive_ports: [22, 3389, 5432, 3306]
  if_source_public_and_sensitive_port: ALERT
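The public-exposure check in these rules can be sketched as below. The function names and the finding strings are hypothetical, the input is assumed to be a dictionary shaped like a `google_compute_firewall` resource, and an allow block without `ports` is assumed to open every port.

```python
SENSITIVE_PORTS = {22, 3389, 5432, 3306}  # from the rule above


def port_range(spec: str) -> range:
    """Turn "22" or "8000-9000" into a range of port numbers."""
    low, _, high = spec.partition("-")
    return range(int(low), int(high or low) + 1)


def evaluate_firewall(rule: dict) -> list:
    """Return findings for one firewall rule dictionary."""
    findings = []
    public = "0.0.0.0/0" in rule.get("source_ranges", [])
    if public:
        findings.append("FLAG_AS_PUBLIC_EXPOSURE")
    exposed = set()
    for allowed in rule.get("allowed", []):
        # Assumption: no "ports" entry means all ports are allowed.
        for spec in allowed.get("ports") or ["0-65535"]:
            ports = port_range(spec)
            exposed.update(p for p in SENSITIVE_PORTS if p in ports)
    if public and exposed:
        findings.append("ALERT:" + ",".join(str(p) for p in sorted(exposed)))
    return findings
```

Expanding port ranges matters because a rule such as `3000-3400` exposes 3306 and 3389 without naming either.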
7. Database & Storage Credentials
7.1 BigQuery Dataset Access (HIGH)
Resource Type:
google_bigquery_dataset
google_bigquery_dataset_access
Attributes:
- access[].user_by_email
- access[].group_by_email
- default_encryption_configuration.kms_key_name
Sanitization Rules:
access:
  action: ANONYMIZE_EXTERNAL_USERS
  rules:
    - internal domain: PRESERVE
    - external domain: HASH_EMAIL
    - service accounts: PRESERVE
default_encryption_configuration:
  action: PRESERVE
  reason: "Encryption key reference needed for compliance audit"
7.2 Storage Bucket Policies (MEDIUM)
Resource Type:
google_storage_bucket
google_storage_bucket_iam_*
Attributes:
- iam_configuration
- lifecycle_rule
Sanitization Rules:
iam_configuration.public_access_prevention:
  action: PRESERVE
  reason: "Critical security control visibility"
lifecycle_rule:
  action: PRESERVE
  reason: "Non-sensitive retention policy"
8. Terraform-Specific Sensitive Data
8.1 Provider Configurations (CRITICAL)
Terraform Blocks:
provider "google" {
  credentials  = file("account.json")
  access_token = "..."
}
Sanitization Rules:
provider_config.credentials:
  action: REDACT_COMPLETELY
provider_config.access_token:
  action: REDACT_COMPLETELY
provider_config.project:
  action: PRESERVE
  reason: "Project ID needed for resource relationships"
8.2 Remote State Backends (HIGH)
Backend Configuration:
terraform {
  backend "gcs" {
    bucket      = "tf-state-bucket"
    prefix      = "terraform/state"
    credentials = "..."
  }
}
Sanitization Rules:
backend.credentials:
  action: REDACT
backend.bucket:
  action: PRESERVE
  reason: "State location needed for change tracking"
backend.encryption_key:
  action: REDACT
9. Multi-Cloud Sensitive Data
9.1 AWS Credentials
Resource Types:
aws_iam_access_key
aws_secretsmanager_secret
Patterns:
/AKIA[0-9A-Z]{16}/ # AWS Access Key ID
/[A-Za-z0-9/+=]{40}/ # AWS Secret Access Key
9.2 Azure Credentials
Resource Types:
azurerm_key_vault_secret
azurerm_storage_account
Patterns:
/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/ # Azure Client ID
10. Detection Strategy Summary
Pattern Matching Priority
- Exact Attribute Match (Highest Priority)
  - Check known sensitive attribute names
  - Example: private_key, password, secret
- Regex Pattern Match
  - Scan all string values for credential patterns
  - Example: Private key headers, JWT tokens
- Semantic Analysis
  - Context-aware detection (e.g., "key" in "key_name" vs. "private_key")
  - Relationship analysis (is this a reference or an actual secret?)
- Entropy Analysis
  - High-entropy strings (random characters) are likely secrets
  - Base64-encoded data examination
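Entropy analysis is the least self-explanatory of the four stages, so here is a minimal sketch of a Shannon entropy check. The function names and the `min_length` cutoff are assumptions; the 4.5 bits-per-character default mirrors the `entropy_threshold` used by the custom detection engine later in this document.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_like_secret(text: str, threshold: float = 4.5, min_length: int = 20) -> bool:
    """Flag long strings whose per-character entropy exceeds the threshold."""
    return len(text) >= min_length and shannon_entropy(text) > threshold
```

A string needs at least 23 distinct characters to exceed 4.5 bits per character, so short or repetitive values such as region names never trigger this check, while long random tokens usually do. It is best treated as a last-resort signal after the name and pattern checks.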
11. Sanitization Action Types
Primary Actions
REDACT:
  description: "Replace entire value with placeholder"
  example: "[REDACTED:PRIVATE_KEY]"
  use_case: "Critical secrets that must never be exposed"
MASK:
  description: "Partially hide value while preserving structure"
  example: "10.1.x.x" or "***@example.com"
  use_case: "IPs, emails where structure is useful"
HASH:
  description: "Replace with cryptographic hash"
  example: "sha256:abc123def..."
  use_case: "Change detection without exposure"
PRESERVE:
  description: "Keep original value"
  example: "us-central1"
  use_case: "Non-sensitive operational data"
ANONYMIZE:
  description: "Replace with generic placeholder"
  example: "user-1", "team-a"
  use_case: "Personal information requiring privacy"
12. Per-Client Configuration
Client Risk Profiles
High Security (FinTech, Healthcare):
- Redact all private IPs
- Redact all email addresses
- Mask all connection strings
- No environment variables exposed
Medium Security (General Enterprise):
- Preserve internal IPs (10.x.x.x shown as subnet)
- Preserve internal email domains
- Redact database passwords only
- Show sanitized environment variable names
Low Security (Development, Non-Prod):
- Preserve most infrastructure details
- Redact only critical secrets
- Show configuration structure
13. Audit Trail Requirements
What to Log for Each Sanitization
log_entry:
  timestamp: "2025-01-13T10:30:00Z"
  workspace_id: "ws-abc123"
  resource_type: "google_sql_database_instance"
  resource_name: "prod-db"
  attribute_path: "settings.ip_configuration.authorized_networks[0].value"
  original_value_hash: "sha256:..."
  sanitization_action: "MASK"
  new_value: "10.1.x.x"
  rule_matched: "private_ip_masking_v2"
  sensitivity_level: "HIGH"
  client_policy: "medium_security"
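A log entry with the fields above could be assembled as in the following sketch. The function name `build_audit_entry` and its keyword parameters are hypothetical; only the dictionary keys come from the example entry.

```python
import hashlib
from datetime import datetime, timezone


def build_audit_entry(*, workspace_id, resource_type, resource_name,
                      attribute_path, original_value, new_value,
                      action, rule, level, client_policy) -> dict:
    """Build one audit record; the original value is stored only as a hash."""
    digest = hashlib.sha256(original_value.encode("utf-8")).hexdigest()
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "workspace_id": workspace_id,
        "resource_type": resource_type,
        "resource_name": resource_name,
        "attribute_path": attribute_path,
        "original_value_hash": f"sha256:{digest}",
        "sanitization_action": action,
        "new_value": new_value,
        "rule_matched": rule,
        "sensitivity_level": level,
        "client_policy": client_policy,
    }
```

Keyword-only arguments make call sites self-describing, which helps when the same builder is used from many sanitization rules.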
14. Compliance Mappings
SOC2 Requirements
- CC6.1: Redact all credentials and secrets
- CC6.6: Log all access to sensitive data
- CC6.7: Encrypt audit logs
GDPR Requirements
- Article 25: Anonymize personal data (emails, names)
- Article 32: Pseudonymization of user identifiers
HIPAA Requirements
- §164.514: De-identification of PHI (if in infrastructure metadata)
15. Testing & Validation
Test Cases for Sensitivity Detection
Test Suite:
- Private key in google_service_account_key: MUST DETECT
- Password in SQL user resource: MUST DETECT
- Private IP in compute instance: MUST DETECT
- Environment variable "DATABASE_URL" with credentials: MUST DETECT
- Public IP in compute address: MUST NOT TRIGGER
- Label "environment: production": MUST NOT TRIGGER
- KMS key name reference: MUST NOT TRIGGER (reference, not key itself)
False Positive Prevention
Safe Patterns:
- KMS key names: "projects/.../keyRings/.../cryptoKeys/..."
- Secret Manager references: "projects/.../secrets/.../versions/latest"
- Network references: "projects/.../global/networks/..."
- Public IPs: "34.x.x.x" (safe to expose)
16. Integration with External Tools
Recommended Tools for Enhanced Detection
- terraform-compliance: Policy-as-code testing
- tfsec: Static analysis for security issues
- Checkov: Policy-based scanning
- TruffleHog: Secret detection
- Gitleaks: Credential scanning
Custom Detection Engine
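from enum import Enum
from typing import Any

class SensitivityLevel(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"

class SensitiveDataDetector:
    def __init__(self, config: "ClientConfig"):
        self.rules = load_rules(config)  # ClientConfig and load_rules are defined elsewhere
        self.entropy_threshold = 4.5  # Shannon entropy, bits per character

    def detect(self, attribute_path: str, value: Any) -> SensitivityLevel:
        # 1. Exact attribute name match
        # 2. Regex pattern match
        # 3. Semantic context analysis
        # 4. Entropy analysis
        raise NotImplementedError  # skeleton only; stages follow section 10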
Version History
- v1.0 (2025-01-13): Initial comprehensive taxonomy
- v1.1 (TBD): Add Kubernetes secrets detection
- v1.2 (TBD): Add multi-region compliance rules