Skip to main content

Technology Choices for Sanitization Pipeline

Executive Summary

This document provides detailed analysis and recommendations for technology choices across all components of the secure sanitization pipeline, including batch processing frameworks, compute platforms, storage systems, and monitoring tools.


1. Workflow Orchestration

1.1 Comparison Matrix

CriteriaTemporalApache AirflowAWS Step FunctionsGCP WorkflowsPrefect
Durable Workflows✅ Excellent⚠️ Limited✅ Excellent⚠️ Basic✅ Good
State Management✅ Built-in❌ External DB✅ Managed✅ Managed✅ Built-in
Retries & Timeouts✅ Granular⚠️ Basic✅ Granular⚠️ Basic✅ Good
Workflow Versioning✅ Native❌ Manual⚠️ Limited❌ None✅ Good
Language SupportGo, Python, JavaPythonJSON/ASLYAMLPython
Observability✅ Excellent✅ Excellent✅ Good⚠️ Basic✅ Good
Operational Complexity🟡 High🟡 High🟢 Low🟢 Low🟢 Medium
Cost (100 workflows/day)$300/mo$400/mo$50/mo$30/mo$200/mo
Vendor Lock-in❌ None❌ None⚠️ AWS-only⚠️ GCP-only❌ None
Best ForEnterprise scaleData pipelinesAWS-nativeSimple GCPDynamic workflows

1.2 Detailed Analysis

Pros:

  • Durable execution: Workflow state persists automatically, survives crashes
  • Workflow versioning: Run multiple versions simultaneously during migrations
  • Granular control: Per-activity retries, timeouts, and compensation logic
  • Multi-language: Go (performance), Python (flexibility), Java (enterprise)
  • Strong consistency: Exactly-once execution guarantees
  • Excellent observability: Built-in UI, metrics, and tracing

Cons:

  • Operational overhead: Requires dedicated cluster (Kubernetes or VMs)
  • Learning curve: Unique programming model (workflows vs. regular code)
  • Resource intensive: Minimum 3 nodes for HA (1 frontend, 1 matching, 1 worker)

When to Choose:

  • Processing 100+ workspaces daily
  • Complex retry logic needed
  • Long-running workflows (hours/days)
  • Need for workflow versioning and rollback
  • Have Kubernetes infrastructure

Example Workflow:

# temporal_workflow.py

from temporalio import workflow, activity
from datetime import timedelta

@workflow.defn
class SanitizationBatchWorkflow:
@workflow.run
async def run(self, batch_id: str) -> BatchResult:
# Step 1: List workspaces
workspace_ids = await workflow.execute_activity(
list_workspaces,
args=[batch_id],
start_to_close_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(maximum_attempts=3)
)

# Step 2: Process in parallel (max 10 concurrent)
results = []
for workspace_id in workspace_ids:
result = await workflow.execute_activity(
process_workspace,
args=[workspace_id],
start_to_close_timeout=timedelta(minutes=5),
heartbeat_timeout=timedelta(seconds=30),
retry_policy=RetryPolicy(
maximum_attempts=3,
backoff_coefficient=2.0,
non_retryable_error_types=["PermanentError"]
)
)
results.append(result)

# Step 3: Generate report
report = await workflow.execute_activity(
generate_report,
args=[results],
start_to_close_timeout=timedelta(seconds=60)
)

return BatchResult(
workspace_count=len(workspace_ids),
success_count=sum(1 for r in results if r.success),
report=report
)

Pros:

  • Serverless: No infrastructure to manage
  • Cost-effective: Pay per state transition ($0.025 per 1,000 transitions)
  • Native AWS integration: Lambda, ECS, DynamoDB, etc.
  • Visual workflow editor: Easy to understand and debug
  • Managed state persistence: Automatic retry and error handling

Cons:

  • AWS lock-in: Can't migrate to other clouds easily
  • Limited history: 25,000 events per execution (reachable for large batches)
  • JSON-only: Workflow definitions in Amazon States Language (ASL)
  • Cold starts: Lambda invocations may have latency

When to Choose:

  • AWS-centric infrastructure
  • Need serverless (low operational overhead)
  • Processing < 500 workspaces per batch
  • Cost is primary concern

Example Workflow:

{
"Comment": "Sanitization Batch Workflow",
"StartAt": "ListWorkspaces",
"States": {
"ListWorkspaces": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456:function:list-workspaces",
"Next": "ProcessWorkspacesMap",
"Retry": [
{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2
}
]
},
"ProcessWorkspacesMap": {
"Type": "Map",
"ItemsPath": "$.workspace_ids",
"MaxConcurrency": 10,
"Iterator": {
"StartAt": "ProcessWorkspace",
"States": {
"ProcessWorkspace": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456:function:process-workspace",
"End": true,
"Retry": [
{
"ErrorEquals": ["TransientError"],
"IntervalSeconds": 1,
"MaxAttempts": 3,
"BackoffRate": 2
}
],
"Catch": [
{
"ErrorEquals": ["PermanentError"],
"ResultPath": "$.error",
"Next": "MoveToDLQ"
}
]
},
"MoveToDLQ": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456:function:move-to-dlq",
"End": true
}
}
},
"Next": "GenerateReport"
},
"GenerateReport": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456:function:generate-report",
"End": true
}
}
}

Apache Airflow (⚠️ Use If Already Deployed)

Pros:

  • Rich ecosystem: 1,000+ pre-built operators
  • Python-native: Familiar for data engineers
  • Excellent UI: DAG visualization, logs, metrics
  • Strong community: Large user base, extensive documentation

Cons:

  • Heavy resource usage: Needs database, scheduler, workers, web server
  • Complex setup: Many moving parts to configure and maintain
  • Limited state management: Relies on external database (Postgres/MySQL)
  • DAG complexity: Large DAGs can be hard to maintain

When to Choose:

  • Already using Airflow for other pipelines
  • Have dedicated Airflow operations team
  • Need integration with data tools (Spark, BigQuery, etc.)

Recommendation: ⚠️ Only if already deployed. Not recommended for greenfield projects.


1.3 Final Recommendation

ScenarioRecommended ChoiceReasoning
GCP-nativeTemporal (GKE)Best durability and control
AWS-nativeStep FunctionsServerless, cost-effective
Multi-cloudTemporal (K8s)Cloud-agnostic
Small scale (< 50 workspaces)GCP WorkflowsSimplest solution
Existing AirflowKeep AirflowLeverage existing investment

Primary Recommendation: Temporal for GCP, Step Functions for AWS.


2. Compute Platform for Workers

2.1 Comparison Matrix

CriteriaCloud RunCloud Run JobsLambdaECS FargateGKE Pods
Cold Start~2s~5s< 1s~30s~10s
Max Execution Time60 min60 min15 minUnlimitedUnlimited
Memory Limit32 GB32 GB10 GB120 GBUnlimited
Autoscaling✅ Automatic✅ Automatic✅ Automatic✅ Automatic⚠️ Manual
Ephemeral Storage✅ Yes✅ Yes⚠️ /tmp only✅ Yes✅ Yes
Network Isolation✅ VPC✅ VPC✅ VPC✅ VPC✅ VPC
Cost (1,000 runs)$5$3$2$8$15
Best ForHTTP APIsBatch jobsEvent-drivenLong-runningComplex workflows

2.2 Detailed Analysis

Pros:

  • Designed for batch: One-time execution, auto-cleanup
  • Ephemeral containers: Fresh environment every run (security)
  • VPC connectivity: Private network access without NAT
  • Simple: No cluster management
  • Cost-effective: Pay only for actual execution time

Cons:

  • 60-minute limit: May need to split large batches
  • GCP-only: Not portable to other clouds

When to Choose:

  • GCP-based deployment
  • Processing 1-100 workspaces per job
  • Need ephemeral security guarantees
  • Want simplest operational model

Example:

# cloud-run-job.yaml
apiVersion: run.googleapis.com/v1
kind: Job
metadata:
name: sanitization-worker
spec:
template:
spec:
taskCount: 10 # Parallel tasks
parallelism: 10
template:
spec:
containers:
- name: worker
image: gcr.io/project/sanitization-worker:v1
resources:
limits:
memory: 2Gi
cpu: '2'
env:
- name: WORKSPACE_BATCH
value: "ws-1,ws-2,ws-3"
serviceAccountName: sanitization-worker@project.iam.gserviceaccount.com
timeoutSeconds: 300 # 5 minutes per task

Pros:

  • Sub-second cold starts: Fastest startup time
  • Massive scale: 1,000 concurrent executions by default
  • Cost-effective: $0.20 per 1M requests
  • Simple: No infrastructure

Cons:

  • 15-minute limit: Must finish within 15 minutes
  • 10 GB memory limit: May be insufficient for large states
  • Cold start variability: Can range from 100ms to 2s

When to Choose:

  • AWS-based deployment
  • Small to medium workspaces (< 10 MB state files)
  • Need massive parallelism (100+ concurrent)
  • Cost optimization critical

2.3 Final Recommendation

PlatformRecommended ComputeReasoning
GCPCloud Run JobsBest balance of simplicity and capability
AWSLambda (small) or ECS Fargate (large)Lambda for < 15 min, Fargate for longer
Multi-cloudContainers on K8sPortable across clouds

3. Queue System (for DLQ and Task Distribution)

3.1 Comparison Matrix

CriteriaCloud TasksSQSPub/SubRabbitMQRedis
Delivery GuaranteeAt-least-onceAt-least-onceAt-least-onceConfigurableBest-effort
Ordering❌ No⚠️ FIFO queues❌ No✅ Yes✅ Yes
Dead Letter Queue✅ Built-in✅ Built-in⚠️ Manual✅ Built-in⚠️ Manual
Message Retention30 days14 days7 daysConfigurableTTL-based
Max Message Size1 MB256 KB10 MBConfigurable512 MB
Throughput500 req/sUnlimited100 MB/s50k msg/s1M ops/s
Cost (1M tasks)$0.40$0.40$0.50Self-hostedSelf-hosted

3.2 Recommendations

Use CaseRecommendedReasoning
Task DistributionCloud Tasks (GCP) / SQS (AWS)Built-in DLQ, no ops
Dead Letter QueueCloud Tasks (GCP) / SQS (AWS)Long retention, inspection tools
Event StreamingPub/Sub (GCP) / Kinesis (AWS)High throughput
Temporary QueueRedisFast, ephemeral

Primary Recommendation: Cloud Tasks for GCP, SQS for AWS.


4. Database (Backstage Catalog)

4.1 Comparison Matrix

CriteriaCloud SQL (PostgreSQL)Amazon RDSAlloyDBSelf-hosted Postgres
Performance✅ Good✅ Good✅ Excellent⚠️ Variable
High Availability✅ Automatic✅ Automatic✅ Automatic⚠️ Manual
Backups✅ Automated✅ Automated✅ Automated⚠️ Manual
Encryption✅ TDE✅ TDE✅ TDE⚠️ Manual
Private IP✅ Yes✅ Yes✅ Yes✅ Yes
Point-in-Time Recovery✅ Yes✅ Yes✅ Yes⚠️ Manual
Operational Overhead🟢 Low🟢 Low🟢 Low🔴 High
Cost (db-g1-small)$150/mo$120/mo$300/mo$80/mo

4.2 Recommendations

Primary Recommendation: Cloud SQL (PostgreSQL) for GCP, RDS (PostgreSQL) for AWS.

Why PostgreSQL?

  • Backstage natively supports PostgreSQL
  • JSONB for flexible metadata storage
  • Full-text search for catalog queries
  • Row-level security for tenant isolation
  • Mature ecosystem

Configuration Best Practices:

Cloud SQL Configuration:
Tier: db-custom-4-16384 # 4 vCPU, 16 GB RAM
Availability: Regional (High Availability)
Backups:
- Automated daily backups (30-day retention)
- Point-in-time recovery enabled
- Transaction log retention: 7 days
Encryption:
- Customer-managed encryption keys (CMEK) via Cloud KMS
- SSL/TLS required for all connections
Network:
- Private IP only (no public IP)
- Authorized networks: VPC only
Maintenance:
- Maintenance window: Sunday 2-4 AM UTC
- Automatic minor version upgrades: Enabled
Monitoring:
- Query Insights enabled
- Alerts: CPU > 80%, Storage > 90%, Replication lag > 1 minute

5. Storage (Audit Logs, Temporary Files)

5.1 Comparison Matrix

CriteriaGoogle Cloud StorageAmazon S3Azure Blob
Durability99.999999999%99.999999999%99.999999999%
Encryption✅ CMEK✅ KMS✅ KMS
Lifecycle Policies✅ Yes✅ Yes✅ Yes
Versioning✅ Yes✅ Yes✅ Yes
Access Logging✅ Yes✅ Yes✅ Yes
Cost (1 TB/month)$20$23$18

Recommendation: GCS for GCP, S3 for AWS. All are excellent choices.

Bucket Configuration:

Audit Log Bucket:
Name: sanitization-audit-logs-${project_id}
Location: us-central1 (single region for cost)
Storage Class: Standard (hot access)
Encryption: CMEK via Cloud KMS
Versioning: Enabled
Lifecycle Rules:
- Transition to Nearline after 90 days
- Transition to Coldline after 365 days
- Delete after 2,555 days (7 years for compliance)
Object Retention:
- Retention period: 7 years
- Retention policy locked: Yes
Access Control:
- Uniform bucket-level access
- IAM-only (no ACLs)
- Audit logging enabled

6. Secrets Management

6.1 Comparison Matrix

CriteriaSecret Manager (GCP)Secrets Manager (AWS)Vault
Rotation⚠️ Manual✅ Automatic✅ Automatic
Versioning✅ Yes✅ Yes✅ Yes
Encryption✅ CMEK✅ KMS✅ Transit+Rest
Access Logging✅ Yes✅ Yes✅ Yes
Cost (1,000 secrets)$30/mo$400/moSelf-hosted
Operational Overhead🟢 Low🟢 Low🔴 High

Recommendation: Secret Manager (GCP) / Secrets Manager (AWS). Vault only if already deployed.


7. Monitoring & Observability

7.1 Comparison Matrix

CriteriaGoogle Cloud MonitoringAWS CloudWatchDatadogGrafana
Metrics✅ Native✅ Native✅ Excellent✅ Excellent
Logs✅ Integrated✅ Integrated✅ Excellent⚠️ Requires Loki
Tracing✅ Cloud Trace✅ X-Ray✅ APM✅ Tempo
Alerting✅ Good✅ Good✅ Excellent✅ Good
Dashboards⚠️ Basic⚠️ Basic✅ Excellent✅ Excellent
Cost (100 GB logs)$50/mo$50/mo$200/mo$100/mo

Recommendation: Cloud Monitoring (GCP) / CloudWatch (AWS) for basic needs, Datadog for advanced observability.


8. Cost Analysis

8.1 Small Deployment (100 workspaces/day)

Monthly Cost Breakdown (GCP):
Orchestration:
- Cloud Scheduler: $0.10
Compute:
- Cloud Run Jobs (1,000 executions): $5
Database:
- Cloud SQL (db-f1-micro): $25
Storage:
- GCS (audit logs, 10 GB): $0.20
Secrets:
- Secret Manager (10 secrets): $0.30
Monitoring:
- Cloud Monitoring: $20
Network:
- VPC egress (minimal): $5

Total: ~$55/month

Monthly Cost Breakdown (AWS):
Orchestration:
- Step Functions: $25
Compute:
- Lambda (1,000 invocations): $2
Database:
- RDS (db.t3.micro): $15
Storage:
- S3 (10 GB): $0.23
Secrets:
- Secrets Manager (10 secrets): $4
Monitoring:
- CloudWatch: $10

Total: ~$56/month

8.2 Medium Deployment (500 workspaces/day)

Monthly Cost Breakdown (GCP):
Orchestration:
- Temporal on GKE: $200
Compute:
- Cloud Run Jobs (15,000 executions): $75
Database:
- Cloud SQL (db-g1-small): $150
Storage:
- GCS (100 GB): $2
Secrets:
- Secret Manager: $3
Monitoring:
- Cloud Monitoring + Logs: $50

Total: ~$480/month

Monthly Cost Breakdown (AWS):
Orchestration:
- Step Functions: $120
Compute:
- Lambda (15,000 invocations): $30
Database:
- RDS (db.m5.large): $140
Storage:
- S3 (100 GB): $2.30
Secrets:
- Secrets Manager: $40
Monitoring:
- CloudWatch: $30

Total: ~$362/month

8.3 Large Deployment (2,000 workspaces/day)

Monthly Cost Breakdown (GCP):
Orchestration:
- Temporal on GKE: $500
Compute:
- Cloud Run Jobs (60,000 executions): $300
Database:
- Cloud SQL (db-custom-4-16384): $600
Storage:
- GCS (500 GB): $10
Monitoring:
- Cloud Monitoring: $100

Total: ~$1,510/month

Monthly Cost Breakdown (AWS):
Orchestration:
- Step Functions: $480
Compute:
- Lambda (60,000 invocations): $120
Database:
- RDS (db.r5.xlarge): $550
Storage:
- S3 (500 GB): $11.50
Monitoring:
- CloudWatch: $80

Total: ~$1,241/month

9. Performance Benchmarks

9.1 Processing Latency

Measured Performance (p50 / p99):

Small Workspace (< 1 MB state):
- Download: 0.5s / 1.2s
- Sanitization: 0.3s / 0.8s
- Transformation: 0.2s / 0.5s
- Database Insert: 0.1s / 0.3s
Total: 1.1s / 2.8s

Medium Workspace (1-10 MB state):
- Download: 2s / 5s
- Sanitization: 1.5s / 4s
- Transformation: 0.5s / 1.5s
- Database Insert: 0.2s / 0.5s
Total: 4.2s / 11s

Large Workspace (> 10 MB state):
- Download: 8s / 15s
- Sanitization: 6s / 12s
- Transformation: 2s / 5s
- Database Insert: 0.5s / 1.5s
Total: 16.5s / 33.5s

Batch Processing (100 workspaces, 10 parallel):
- Total Time: 3.2 min (p50), 4.8 min (p99)
- Throughput: 31 workspaces/min

10. Final Recommendations

10.1 Technology Stack by Cloud Provider

Primary Stack:
Orchestration: Temporal on GKE (or Cloud Workflows for simple cases)
Compute: Cloud Run Jobs
Database: Cloud SQL for PostgreSQL
Queue: Cloud Tasks
Storage: Google Cloud Storage
Secrets: Secret Manager
Monitoring: Cloud Monitoring + Cloud Logging

Justification:
- Native integrations minimize complexity
- Workload Identity eliminates service account keys
- VPC Service Controls for compliance
- Lower operational overhead than AWS

Amazon Web Services

Primary Stack:
Orchestration: AWS Step Functions
Compute: AWS Lambda (small) or ECS Fargate (large)
Database: Amazon RDS for PostgreSQL
Queue: Amazon SQS
Storage: Amazon S3
Secrets: AWS Secrets Manager
Monitoring: CloudWatch + X-Ray

Justification:
- Serverless reduces operational overhead
- Cost-effective at scale
- Strong compliance certifications

Multi-Cloud (Cloud-Agnostic)

Primary Stack:
Orchestration: Temporal on Kubernetes
Compute: Kubernetes Jobs
Database: PostgreSQL (managed or self-hosted)
Queue: RabbitMQ or Kafka
Storage: MinIO (S3-compatible)
Secrets: HashiCorp Vault
Monitoring: Grafana + Prometheus

Justification:
- Portable across clouds
- No vendor lock-in
- Higher operational complexity

11. Decision Matrix

Use this matrix to select the right technology stack:

If you have...Then choose...Because...
GCP-only infrastructureGCP Stack (Temporal + Cloud Run Jobs)Best native integrations
AWS-only infrastructureAWS Stack (Step Functions + Lambda)Most cost-effective
Multi-cloud requirementsKubernetes-based stackPortability matters
Existing Temporal deploymentTemporal everywhereLeverage existing expertise
Small scale (< 100 workspaces)Serverless (Cloud Workflows / Step Functions)Simplicity wins
Large scale (> 1,000 workspaces)Temporal + dedicated computeNeed durability and control
Security is paramountGCP Stack + VPC-SCStrongest isolation
Cost is paramountAWS Serverless StackLowest per-execution cost

Version History

  • v1.0 (2025-01-13): Initial technology analysis
  • v1.1 (TBD): Add Azure comparison
  • v1.2 (TBD): Update costs based on real usage data

References