Sensitive Data Taxonomy for Terraform State Sanitization

Executive Summary

This document defines a taxonomy of sensitive data found in Terraform state files that must be detected, filtered, and sanitized before the state is loaded into the Backstage catalog. The taxonomy is organized by sensitivity level, resource type, and compliance requirement.

Sensitivity Classifications

CRITICAL (Must Always Filter)

Data that, if exposed, would immediately compromise security or violate compliance regulations.

HIGH (Filter Unless Explicitly Allowed)

Data that poses significant security risk but may be needed for specific operational use cases with proper controls.

MEDIUM (Configurable Based on Client Policy)

Data that should be evaluated based on client-specific security posture and risk tolerance.

LOW (Informational, May Be Allowed)

Data that is typically safe to expose but should be evaluated in context.

1. Authentication & Authorization Credentials

1.1 Private Keys (CRITICAL)

Terraform Attributes:

- private_key
- private_key_pem
- private_key_openssh
- tls_private_key
- service_account_key
- service_account_private_key
- certificate_private_key
- ssh_private_key
- rsa_private_key
- ecdsa_private_key
- ed25519_private_key

Resource Types Affected:

google_service_account_key
google_compute_instance (ssh keys in metadata)
tls_private_key
aws_iam_access_key
azurerm_key_vault_secret

Detection Patterns:

/-----BEGIN (?:(?:RSA|EC|DSA|OPENSSH|ENCRYPTED) )?PRIVATE KEY-----/  # also matches the unlabeled PKCS#8 header used by service account keys
/private_key["\s]*[:=]["\s]*[A-Za-z0-9+/=]{64,}/
/\bAIza[0-9A-Za-z_-]{35}(?![0-9A-Za-z_-])/  # Google API keys

Sanitization Action:

REDACT: Replace entire value with [REDACTED:PRIVATE_KEY]
AUDIT: Log resource type, attribute path, key length, algorithm type
ALERT: Trigger security notification for manual review
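
The REDACT and AUDIT actions above could be combined in a small helper. The following is a minimal sketch, not the project's implementation: the function name redact_private_key, the audit field names, and the widened header pattern (which also accepts DSA keys and the unlabeled PKCS#8 header) are assumptions made for illustration. The ALERT step is left out.

```python
import re

# Assumed header pattern: an optional label, then "PRIVATE KEY".
PEM_HEADER = re.compile(
    r"-----BEGIN (?:(RSA|EC|DSA|OPENSSH|ENCRYPTED) )?PRIVATE KEY-----"
)
PLACEHOLDER = "[REDACTED:PRIVATE_KEY]"


def redact_private_key(attribute_path: str, value: str):
    """Return (sanitized_value, audit_record); audit_record is None when no key is found."""
    match = PEM_HEADER.search(value)
    if match is None:
        return value, None
    audit = {
        "attribute_path": attribute_path,
        # Length of the stored string, not the key size in bits.
        "key_length": len(value),
        # PEM label from the header; unlabeled PKCS#8 keys carry none.
        "pem_label": match.group(1) or "UNSPECIFIED",
    }
    return PLACEHOLDER, audit
```

The audit record deliberately contains no part of the key material, only its length and header label.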

1.2 Passwords & Secrets (CRITICAL)

Terraform Attributes:

- password
- admin_password
- root_password
- initial_password
- master_password
- db_password
- database_password
- user_password
- secret
- api_secret
- client_secret
- webhook_secret
- encryption_key
- master_key

Resource Types Affected:

google_sql_database_instance
google_secret_manager_secret
google_compute_instance (metadata)
google_cloudfunctions_function (environment variables)
google_cloud_run_service (environment variables)

Detection Patterns:

/(password|secret|key)["\s]*[:=]["\s]*["'][^"']{8,}["']/i
/\b[A-Za-z0-9_-]{32,}\b/  # Generic long random strings (also matches long identifiers; confirm with entropy analysis)

Sanitization Action:

REDACT: Replace with [REDACTED:SECRET]
AUDIT: Log resource type, secret length, creation timestamp
HASH: Store SHA256 hash for change detection without exposing value
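
As an illustration of the REDACT and HASH actions together, the sketch below replaces a secret with the placeholder and keeps a SHA-256 digest so that a later run can tell whether the value changed. The function name redact_secret and the returned field names are hypothetical.

```python
import hashlib


def redact_secret(value: str) -> dict:
    """Replace a secret with a placeholder and keep a digest for change detection."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {
        "value": "[REDACTED:SECRET]",
        "sha256": digest,      # equal digests mean the secret did not change
        "length": len(value),  # secret length, as listed under AUDIT
    }
```

A plain SHA-256 of a short or guessable password can be brute-forced offline, so a keyed hash (for example HMAC-SHA256 with a key held outside the catalog) may be a safer choice where that risk matters.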

1.3 Access Tokens & API Keys (CRITICAL)

Terraform Attributes:

- access_token
- refresh_token
- auth_token
- bearer_token
- api_key
- api_token
- oauth_token
- github_token
- gitlab_token
- slack_token

Provider-Specific Patterns:

Google Cloud:
  - AIza[0-9A-Za-z_-]{35}           # Google API Key
  - ya29\.[0-9A-Za-z_-]{68,}        # OAuth 2.0 Access Token

AWS:
  - AKIA[0-9A-Z]{16}                # AWS Access Key ID
  - [0-9a-zA-Z/+=]{40}              # AWS Secret Access Key

GitHub:
  - ghp_[0-9a-zA-Z]{36}             # GitHub Personal Access Token
  - gho_[0-9a-zA-Z]{36}             # GitHub OAuth Token

Slack:
  - xoxb-[0-9]{11}-[0-9]{11}-[0-9a-zA-Z]{24}  # Slack Bot Token

Sanitization Action:

REDACT: Replace with [REDACTED:TOKEN]
FINGERPRINT: Store last 4 characters for identification
ROTATE: Trigger automated token rotation workflow
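
The REDACT and FINGERPRINT actions might look like the sketch below. It uses only the prefixed patterns from the list above; the 40-character AWS secret key pattern is omitted here because it matches any 40-character Base64-like string. The names TOKEN_PATTERNS and redact_tokens are hypothetical, and the ROTATE step is out of scope.

```python
import re

# Prefixed patterns only; each is specific enough to match on its own.
TOKEN_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_-]{35}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_pat": re.compile(r"ghp_[0-9a-zA-Z]{36}"),
    "github_oauth": re.compile(r"gho_[0-9a-zA-Z]{36}"),
}


def redact_tokens(text: str):
    """Return (sanitized_text, findings); each finding keeps only the last 4 characters."""
    findings = []
    for kind, pattern in TOKEN_PATTERNS.items():
        def replace(match, kind=kind):
            findings.append({"kind": kind, "last4": match.group(0)[-4:]})
            return "[REDACTED:TOKEN]"
        text = pattern.sub(replace, text)
    return text, findings
```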

2. Network & Infrastructure Secrets

2.1 Private IP Addresses (HIGH)

Terraform Attributes:

- private_ip_address
- internal_ip
- private_cluster_config.private_endpoint
- ip_address (when in private range)
- network_interface.internal_ip

Detection Patterns:

/\b10\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/                  # 10.0.0.0/8
/\b172\.(1[6-9]|2[0-9]|3[0-1])\.\d{1,3}\.\d{1,3}\b/  # 172.16.0.0/12
/\b192\.168\.\d{1,3}\.\d{1,3}\b/                     # 192.168.0.0/16

Resource Types Affected:

google_compute_instance
google_container_cluster (private endpoint)
google_compute_address (address_type = "INTERNAL")
google_sql_database_instance

Sanitization Action:

MASK: Replace with 10.x.x.x or [PRIVATE_IP]
PRESERVE_SUBNET: Optionally keep subnet info: 10.1.x.x
CONFIGURABLE: Allow per-client policy (some clients may allow internal IPs)
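
The MASK and PRESERVE_SUBNET actions can be sketched with the standard-library ipaddress module, which avoids the boundary problems of matching addresses with regular expressions. The function name mask_private_ip and its preserve_subnet flag are assumptions for illustration.

```python
import ipaddress

# The three private IPv4 ranges covered by the detection patterns above.
PRIVATE_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def mask_private_ip(value: str, preserve_subnet: bool = False) -> str:
    """Mask a private IPv4 address; return anything else unchanged."""
    try:
        address = ipaddress.IPv4Address(value)
    except ValueError:
        return value  # not an IPv4 address
    if not any(address in network for network in PRIVATE_NETWORKS):
        return value  # public address, left as-is
    octets = value.split(".")
    keep = 2 if preserve_subnet else 1
    return ".".join(octets[:keep] + ["x"] * (4 - keep))
```

With preserve_subnet=True the first two octets survive (10.1.x.x); otherwise only the first does (10.x.x.x).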

2.2 Connection Strings & URIs (CRITICAL)

Terraform Attributes:

- connection_string
- connection_url
- jdbc_url
- database_url
- redis_url
- mongodb_uri

Detection Patterns:

# PostgreSQL
/postgres(ql)?:\/\/[^:]+:[^@]+@[^\/]+\/\w+/

# MySQL
/mysql:\/\/[^:]+:[^@]+@[^\/]+\/\w+/

# MongoDB
/mongodb(\+srv)?:\/\/[^:]+:[^@]+@[^\/]+\/\w+/

# Redis
/redis:\/\/:[^@]+@[^\/]+/

Sanitization Action:

PARSE: Extract protocol, host, database name
PRESERVE: Keep protocol://[USER]:[REDACTED]@host:port/database
AUDIT: Log protocol, host, port, and database name separately for troubleshooting; never log the password
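
The PARSE and PRESERVE steps can be sketched with urllib.parse from the standard library instead of per-database regular expressions. The function name sanitize_connection_string is hypothetical. As an added precaution that is not part of the rules above, the sketch drops the query string and fragment, since they can also carry credentials.

```python
from urllib.parse import urlsplit


def sanitize_connection_string(url: str) -> str:
    """Keep protocol, user, host, port, and path; redact the password."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # no embedded password, nothing to redact
    user = parts.username or ""
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Query string and fragment are intentionally not copied over.
    return f"{parts.scheme}://{user}:[REDACTED]@{host}{parts.path}"
```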

2.3 SSH Configuration (HIGH)

Terraform Attributes:

- ssh_keys (in metadata)
- ssh_authorized_keys
- ssh_config
- known_hosts

Resource Types Affected:

google_compute_instance
google_compute_instance_template
google_os_login_ssh_public_key

Sanitization Action:

REDACT_PRIVATE: Remove private keys entirely
HASH_PUBLIC: Replace public key content with fingerprint
PRESERVE_TYPE: Keep key type (ssh-rsa, ssh-ed25519, etc.)
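
HASH_PUBLIC and PRESERVE_TYPE could be implemented as below. This is a sketch that assumes the input is a single line in authorized_keys form ("type base64-blob [comment]"); the function name fingerprint_public_key is hypothetical. The fingerprint is the unpadded Base64 of the SHA-256 digest of the decoded key blob, which is the form OpenSSH prints with a SHA256: prefix.

```python
import base64
import hashlib


def fingerprint_public_key(line: str) -> str:
    """Replace the key blob with its SHA-256 fingerprint, keeping the key type."""
    fields = line.split()
    key_type, blob_b64 = fields[0], fields[1]
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).digest()
    fingerprint = base64.b64encode(digest).decode("ascii").rstrip("=")
    # The trailing comment (often user@host) is dropped.
    return f"{key_type} SHA256:{fingerprint}"
```

Entries in Compute Engine ssh-keys metadata are prefixed with "username:", so that prefix would need to be split off before calling the helper.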

3. GCP-Specific Sensitive Data

3.1 Service Account Keys (CRITICAL)

Resource Type:

google_service_account_key

Attributes:

- private_key
- private_key_type
- public_key
- valid_after
- valid_before

Sanitization Rules:

private_key:
  action: REDACT
  replacement: "[REDACTED:SA_KEY]"

private_key_type:
  action: PRESERVE
  reason: "Needed for key rotation tracking"

public_key:
  action: HASH
  method: SHA256
  reason: "Public keys can identify service account without exposing credentials"

3.2 Cloud SQL Admin Passwords (CRITICAL)

Resource Type:

google_sql_database_instance
google_sql_user

Attributes:

- root_password
- password
- settings.backup_configuration.binary_log_enabled
- settings.ip_configuration.private_network

Sanitization Rules:

root_password:
  action: REDACT
  replacement: "[REDACTED:SQL_PASSWORD]"

password:
  action: REDACT
  replacement: "[REDACTED:SQL_PASSWORD]"

settings.ip_configuration.private_network:
  action: PRESERVE
  reason: "Network reference needed for topology visualization"

settings.ip_configuration.authorized_networks[].value:
  action: MASK_LAST_OCTET
  reason: "Show network ranges without exposing exact IPs"

3.3 KMS Crypto Keys (HIGH)

Resource Type:

google_kms_crypto_key
google_kms_secret_ciphertext

Attributes:

- ciphertext
- plaintext
- additional_authenticated_data

Sanitization Rules:

ciphertext:
  action: REDACT
  replacement: "[REDACTED:CIPHERTEXT]"
  preserve_length: true  # For size analysis

plaintext:
  action: REDACT
  replacement: "[REDACTED:PLAINTEXT]"
  alert: CRITICAL

additional_authenticated_data:
  action: EVALUATE
  rules:
    - if contains sensitive patterns: REDACT
    - else: PRESERVE

3.4 Secret Manager Secrets (CRITICAL)

Resource Type:

google_secret_manager_secret
google_secret_manager_secret_version

Attributes:

- secret_data
- payload.data

Sanitization Rules:

secret_data:
  action: REDACT_COMPLETELY
  replacement: "[REDACTED:SECRET_MANAGER]"
  preserve_metadata:
    - name
    - version
    - create_time
    - labels

payload.data:
  action: REDACT_COMPLETELY

3.5 IAM Policy Bindings (MEDIUM)

Resource Type:

google_project_iam_*
google_organization_iam_*

Attributes:

- members[]
- role
- condition

Sanitization Rules:

members:
  action: EVALUATE_PER_MEMBER
  rules:
    - "user:*@external.com": REDACT_DOMAIN
    - "serviceAccount:*@*.iam.gserviceaccount.com": PRESERVE
    - "group:*": PRESERVE_IF_INTERNAL

role:
  action: PRESERVE
  reason: "Needed for RBAC visualization"

condition.expression:
  action: ANALYZE_FOR_SECRETS
  pattern: /password|secret|token/i
  if_match: REDACT_CLAUSE
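
The per-member rules above might be applied as in the following sketch. The exact output of REDACT_DOMAIN, the handling of groups outside the internal domains, and the function name sanitize_member are assumptions; the taxonomy names the actions but does not define their replacement strings.

```python
def sanitize_member(member: str, internal_domains: set) -> str:
    """Apply the per-member IAM rules to one binding member string."""
    kind, _, identity = member.partition(":")
    if kind == "serviceAccount":
        return member  # PRESERVE
    if kind in ("user", "group"):
        local, _, domain = identity.partition("@")
        if domain in internal_domains:
            return member  # internal users and groups are kept
        if kind == "user":
            return f"user:{local}@[REDACTED_DOMAIN]"  # assumed REDACT_DOMAIN output
        return "group:[REDACTED]"  # assumed handling of external groups
    return member  # e.g. allUsers, domain:..., left unchanged
```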

4. Environment Variables (HIGH)

4.1 Cloud Run / Cloud Functions

Resource Types:

google_cloud_run_service
google_cloudfunctions_function

Attributes:

- spec.template.spec.containers[].env[]
- environment_variables

Sanitization Rules:

environment_variables:
  action: EVALUATE_PER_VAR
  rules:
    # Secrets
    - name matches /password|secret|key|token/i: REDACT_VALUE
    # Configuration (safe)
    - name matches /region|project|bucket|dataset/i: PRESERVE
    # Connection strings
    - value matches connection_string_pattern: REDACT_CREDENTIALS_ONLY

Common Secret Variable Names:

CRITICAL:
  - DATABASE_PASSWORD
  - API_KEY
  - SECRET_KEY
  - JWT_SECRET
  - OAUTH_CLIENT_SECRET
  - WEBHOOK_SECRET

HIGH:
  - DATABASE_URL (contains credentials)
  - REDIS_URL (contains password)
  - SMTP_PASSWORD

SAFE:
  - DATABASE_HOST
  - DATABASE_NAME
  - API_ENDPOINT
  - LOG_LEVEL
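
The EVALUATE_PER_VAR rules can be sketched as a single pass over the variables, applied in the order listed: secret-looking names first, then known-safe names, then values that embed credentials in a URL. The function name sanitize_env and the URL_CREDENTIALS pattern are assumptions for illustration.

```python
import re

SECRET_NAME = re.compile(r"password|secret|key|token", re.IGNORECASE)
SAFE_NAME = re.compile(r"region|project|bucket|dataset", re.IGNORECASE)
# Matches "user:password@" directly after "://" (assumed connection-string shape).
URL_CREDENTIALS = re.compile(r"(?<=://)([^:/@\s]*):([^@/\s]+)@")


def sanitize_env(env: dict) -> dict:
    """Return a copy of env with secret values and URL passwords redacted."""
    sanitized = {}
    for name, value in env.items():
        if SECRET_NAME.search(name):
            sanitized[name] = "[REDACTED:SECRET]"  # REDACT_VALUE
        elif SAFE_NAME.search(name):
            sanitized[name] = value                # PRESERVE
        else:
            # REDACT_CREDENTIALS_ONLY; values without credentials pass through.
            sanitized[name] = URL_CREDENTIALS.sub(r"\1:[REDACTED]@", value)
    return sanitized
```

Name matching is by substring, so a name such as MONKEY_COUNT would be redacted because it contains "key"; that errs on the side of over-redaction.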

5. Metadata & Custom Fields

5.1 Compute Instance Metadata (HIGH)

Resource Type:

google_compute_instance
google_compute_instance_template

Attributes:

- metadata
- metadata_startup_script

Sanitization Rules:

metadata:
  action: DEEP_SCAN
  rules:
    - ssh-keys: REDACT_KEYS
    - startup-script: SCAN_FOR_SECRETS
    - user-data: SCAN_FOR_SECRETS
    - custom fields: PATTERN_MATCH

metadata_startup_script:
  action: SCAN_FOR_PATTERNS
  patterns:
    - /password\s*=\s*["'][^"']+["']/: REDACT
    - /api_key\s*=\s*["'][^"']+["']/: REDACT
    - /export\s+\w*SECRET\w*=/: REDACT_LINE

5.2 Labels & Tags (LOW)

All Resources:

- labels
- tags

Sanitization Rules:

labels:
  action: EVALUATE_PER_LABEL
  rules:
    # Safe organizational labels
    - environment: PRESERVE
    - team: PRESERVE
    - cost-center: PRESERVE

    # Potentially sensitive
    - owner: ANONYMIZE_IF_EXTERNAL
    - contact: ANONYMIZE_EMAIL
    - backup-key-id: REDACT

6. Networking Details

6.1 VPC & Subnets (MEDIUM)

Resource Types:

google_compute_network
google_compute_subnetwork

Attributes:

- ip_cidr_range
- secondary_ip_range[].ip_cidr_range

Sanitization Rules:

ip_cidr_range:
  action: MASK_IF_PRIVATE
  rules:
    - if private range: "10.x.0.0/16"
    - if public range: PRESERVE (infrastructure info)

secondary_ip_range:
  action: PRESERVE_COUNT_ONLY
  reason: "Show number of secondary ranges without exposing IPs"

6.2 Firewall Rules (MEDIUM)

Resource Type:

google_compute_firewall

Attributes:

- source_ranges[]
- source_tags[]
- allowed[].ports[]

Sanitization Rules:

source_ranges:
  action: EVALUATE_PER_RANGE
  rules:
    - "0.0.0.0/0": FLAG_AS_PUBLIC_EXPOSURE
    - private ranges: MASK_SPECIFIC_IPS

allowed:
  action: PRESERVE_STRUCTURE
  reason: "Firewall topology needed for security posture"
  sensitive_ports: [22, 3389, 5432, 3306]
  if_source_public_and_sensitive_port: ALERT
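
The if_source_public_and_sensitive_port check can be sketched as follows. The function names are hypothetical. The sketch assumes ports arrive as strings such as "22" or "8000-8080", and that an allowed block without ports opens every port, which is how GCP firewall rules behave.

```python
SENSITIVE_PORTS = {22, 3389, 5432, 3306}


def expand_ports(spec: str) -> range:
    """Turn "22" or "8000-8080" into an inclusive range of port numbers."""
    start, _, end = spec.partition("-")
    return range(int(start), int(end or start) + 1)


def public_sensitive_ports(source_ranges: list, allowed: list) -> list:
    """Return the sensitive ports reachable from 0.0.0.0/0, sorted ascending."""
    if "0.0.0.0/0" not in source_ranges:
        return []
    exposed = set()
    for rule in allowed:
        # No ports listed means all ports are allowed for that protocol.
        specs = rule.get("ports") or ["0-65535"]
        for spec in specs:
            ports = expand_ports(spec)
            exposed.update(p for p in SENSITIVE_PORTS if p in ports)
    return sorted(exposed)
```

A non-empty result is the condition under which the ALERT would fire.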

7. Database & Storage Credentials

7.1 BigQuery Dataset Access (HIGH)

Resource Type:

google_bigquery_dataset
google_bigquery_dataset_access

Attributes:

- access[].user_by_email
- access[].group_by_email
- default_encryption_configuration.kms_key_name

Sanitization Rules:

access:
  action: ANONYMIZE_EXTERNAL_USERS
  rules:
    - internal domain: PRESERVE
    - external domain: HASH_EMAIL
    - service accounts: PRESERVE

default_encryption_configuration:
  action: PRESERVE
  reason: "Encryption key reference needed for compliance audit"

7.2 Storage Bucket Policies (MEDIUM)

Resource Type:

google_storage_bucket
google_storage_bucket_iam_*

Attributes:

- iam_configuration
- lifecycle_rule

Sanitization Rules:

iam_configuration.public_access_prevention:
  action: PRESERVE
  reason: "Critical security control visibility"

lifecycle_rule:
  action: PRESERVE
  reason: "Non-sensitive retention policy"

8. Terraform-Specific Sensitive Data

8.1 Provider Configurations (CRITICAL)

Terraform Blocks:

provider "google" {
  credentials = file("account.json")
  access_token = "..."
}

Sanitization Rules:

provider_config.credentials:
  action: REDACT_COMPLETELY

provider_config.access_token:
  action: REDACT_COMPLETELY

provider_config.project:
  action: PRESERVE
  reason: "Project ID needed for resource relationships"

8.2 Remote State Backends (HIGH)

Backend Configuration:

terraform {
  backend "gcs" {
    bucket = "tf-state-bucket"
    prefix = "terraform/state"
    credentials = "..."
  }
}

Sanitization Rules:

backend.credentials:
  action: REDACT

backend.bucket:
  action: PRESERVE
  reason: "State location needed for change tracking"

backend.encryption_key:
  action: REDACT

9. Multi-Cloud Sensitive Data

9.1 AWS Credentials

Resource Types:

aws_iam_access_key
aws_secretsmanager_secret

Patterns:

/AKIA[0-9A-Z]{16}/        # AWS Access Key ID
/[A-Za-z0-9/+=]{40}/      # AWS Secret Access Key

9.2 Azure Credentials

Resource Types:

azurerm_key_vault_secret
azurerm_storage_account

Patterns:

/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/  # UUID format (Azure client, tenant, and subscription IDs; an identifier, not a secret by itself)

10. Detection Strategy Summary

Pattern Matching Priority

1. Exact Attribute Match (Highest Priority)
   - Check known sensitive attribute names
   - Example: private_key, password, secret
2. Regex Pattern Match
   - Scan all string values for credential patterns
   - Example: Private key headers, JWT tokens
3. Semantic Analysis
   - Context-aware detection (e.g., "key" in "key_name" vs. "private_key")
   - Relationship analysis (is this a reference or an actual secret?)
4. Entropy Analysis
   - High-entropy strings (random characters) are likely secrets
   - Examine Base64-encoded data
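
Entropy analysis can be sketched with the Shannon entropy of a string's character distribution, measured in bits per character. The function names are hypothetical; the 4.5 default is the threshold used in the detector skeleton in section 16.

```python
import math
from collections import Counter


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    counts = Counter(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_random(value: str, threshold: float = 4.5) -> bool:
    """True when the string's entropy exceeds the threshold."""
    return shannon_entropy(value) > threshold
```

Entropy is bounded by log2 of the number of distinct characters, so a string needs at least 23 distinct characters to exceed 4.5 bits, and a hex-encoded secret can never exceed 4.0. A threshold of 4.5 therefore only catches long strings over a large alphabet (such as Base64), which is one reason entropy is the last check rather than the first.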

11. Sanitization Action Types

Primary Actions

REDACT:
  description: "Replace entire value with placeholder"
  example: "[REDACTED:PRIVATE_KEY]"
  use_case: "Critical secrets that must never be exposed"

MASK:
  description: "Partially hide value while preserving structure"
  example: "10.1.x.x" or "***@example.com"
  use_case: "IPs, emails where structure is useful"

HASH:
  description: "Replace with cryptographic hash"
  example: "sha256:abc123def..."
  use_case: "Change detection without exposure"

PRESERVE:
  description: "Keep original value"
  example: "us-central1"
  use_case: "Non-sensitive operational data"

ANONYMIZE:
  description: "Replace with generic placeholder"
  example: "user-1", "team-a"
  use_case: "Personal information requiring privacy"

12. Per-Client Configuration

Client Risk Profiles

High Security (FinTech, Healthcare):
  - Redact all private IPs
  - Redact all email addresses
  - Mask all connection strings
  - No environment variables exposed

Medium Security (General Enterprise):
  - Preserve internal IPs (10.x.x.x shown as subnet)
  - Preserve internal email domains
  - Redact database passwords only
  - Show sanitized environment variable names

Low Security (Development, Non-Prod):
  - Preserve most infrastructure details
  - Redact only critical secrets
  - Show configuration structure

13. Audit Trail Requirements

What to Log for Each Sanitization

log_entry:
  timestamp: "2025-01-13T10:30:00Z"
  workspace_id: "ws-abc123"
  resource_type: "google_sql_database_instance"
  resource_name: "prod-db"
  attribute_path: "settings.ip_configuration.authorized_networks[0].value"
  original_value_hash: "sha256:..."
  sanitization_action: "MASK"
  new_value: "10.1.x.x"
  rule_matched: "private_ip_masking_v2"
  sensitivity_level: "HIGH"
  client_policy: "medium_security"
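
A log entry of this shape could be assembled as in the sketch below. The function name audit_entry is hypothetical; the field names follow the example above. The original value is only ever stored as a digest.

```python
import hashlib
from datetime import datetime, timezone


def audit_entry(original_value: str, new_value: str, **context) -> dict:
    """Build one audit record; context carries fields such as resource_type."""
    entry = dict(context)
    digest = hashlib.sha256(original_value.encode("utf-8")).hexdigest()
    # Computed fields are set last so context cannot overwrite them.
    entry.update({
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "original_value_hash": "sha256:" + digest,
        "new_value": new_value,
    })
    return entry
```

For values drawn from a small space, such as private IPv4 addresses, an unkeyed digest can be reversed by enumeration, so the audit log itself still needs access controls.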

14. Compliance Mappings

SOC2 Requirements

CC6.1: Redact all credentials and secrets
CC6.6: Log all access to sensitive data
CC6.7: Encrypt audit logs

GDPR Requirements

Article 25: Anonymize personal data (emails, names)
Article 32: Pseudonymization of user identifiers

HIPAA Requirements

§164.514: De-identification of PHI (if in infrastructure metadata)

15. Testing & Validation

Test Cases for Sensitivity Detection

Test Suite:
  - Private key in google_service_account_key: MUST DETECT
  - Password in SQL user resource: MUST DETECT
  - Private IP in compute instance: MUST DETECT
  - Environment variable "DATABASE_URL" with credentials: MUST DETECT
  - Public IP in compute address: MUST NOT TRIGGER
  - Label "environment: production": MUST NOT TRIGGER
  - KMS key name reference: MUST NOT TRIGGER (reference, not key itself)

False Positive Prevention

Safe Patterns:
  - KMS key names: "projects/.../keyRings/.../cryptoKeys/..."
  - Secret Manager references: "projects/.../secrets/.../versions/latest"
  - Network references: "projects/.../global/networks/..."
  - Public IPs: "34.x.x.x" (safe to expose)

16. Integration with External Tools

Recommended Tools for Enhanced Detection

Terraform-compliance: Policy-as-code testing
tfsec: Static analysis for security issues
Checkov: Policy-based scanning
TruffleHog: Secret detection
GitLeaks: Credential scanning

Custom Detection Engine

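from __future__ import annotations  # annotations below are not evaluated at definition time

from typing import Any

# ClientConfig, load_rules, and SensitivityLevel are assumed to be defined elsewhere.
class SensitiveDataDetector:
    def __init__(self, config: ClientConfig):
        self.rules = load_rules(config)
        self.entropy_threshold = 4.5  # Shannon entropy

    def detect(self, attribute_path: str, value: Any) -> SensitivityLevel:
        # 1. Exact attribute name match
        # 2. Regex pattern match
        # 3. Entropy analysis
        # 4. Semantic context analysis
        raise NotImplementedError("detection steps are not implemented yet")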

Version History

v1.0 (2025-01-13): Initial comprehensive taxonomy
v1.1 (TBD): Add Kubernetes secrets detection
v1.2 (TBD): Add multi-region compliance rules
