System Administration Guide: Ona Platform
This guide provides detailed technical administration information for the Ona Platform, covering API management, deployment and infrastructure, operations and maintenance, data handling, configuration, monitoring, security, and performance optimization.
1. API Reference
Core Endpoints Overview
The Ona Platform exposes a RESTful API through AWS API Gateway with the following core endpoints:
Base URLs:
- Custom Domain: https://api.asoba.co ✅ LIVE
- Direct API Gateway: https://2m5xvm39ef.execute-api.af-south-1.amazonaws.com/prod
Data Upload Endpoints
POST /upload_train
- Purpose: Direct upload of historical data to S3 for model training (dataIngestion service is placeholder)
- Authentication: API key or IAM role
- Content-Type: text/csv for data files, or application/json for metadata
- S3 Upload: Data is uploaded directly to the S3 bucket sa-api-client-input under the historical/ prefix
- Processing: Triggers downstream processing via S3 event to interpolationService and globalTrainingService
- Note: The dataIngestion service is currently a placeholder with no active processing logic
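A minimal example request, assuming an API key exported as ONA_API_KEY (the customer_id query parameter is illustrative and not part of the documented endpoint spec):
# Upload a historical CSV directly through the API (illustrative)
curl -X POST "https://api.asoba.co/upload_train?customer_id=customer-001" \
  -H "X-API-Key: $ONA_API_KEY" \
  -H "Content-Type: text/csv" \
  --data-binary @historical_2024_q1.csv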
POST /upload_nowcast
- Purpose: Direct upload of real-time data to S3 for forecasting (dataIngestion service is placeholder)
- Authentication: API key or IAM role
- Content-Type: text/csv for data files, or application/json for metadata
- S3 Upload: Data is uploaded directly to the S3 bucket sa-api-client-input under the nowcast/ prefix
- Processing: Triggers downstream processing via S3 event to interpolationService for forecasting
- Note: The dataIngestion service is currently a placeholder with no active processing logic
Forecast Endpoints
GET /forecast
- Purpose: Generate energy production forecast
- Authentication: API key or IAM role
- Query Parameters:
  - customer_id (required): Customer identifier
  - site_id (optional): Specific site identifier
  - horizon_hours (optional): Forecast horizon in hours (default: 48)
- Response:
{ "forecast_id": "uuid", "customer_id": "string", "site_id": "string", "generated_at": "2025-01-15T12:00:00Z", "horizon_hours": 48, "forecasts": [ { "timestamp": "2025-01-15T13:00:00Z", "predicted_power_kw": 1250.5, "confidence_interval": { "lower": 1180.2, "upper": 1320.8 }, "weather_conditions": { "temperature_c": 25.3, "irradiance_w_m2": 850.2, "wind_speed_m_s": 3.2 } } ], "metadata": { "model_version": "v2.1.0", "training_data_points": 8760, "model_accuracy": 0.94 } }
Terminal API Endpoints (OODA Workflow)
POST /terminal/assets
- Purpose: Manage solar assets and components
- Authentication: API key or IAM role
- Content-Type: application/json
- Actions:
  - add: Create new asset
  - list: List all assets
  - get: Retrieve specific asset details
- Request Body (add):
{ "action": "add", "asset_id": "INV-001", "name": "Main Inverter 1", "type": "Solar Inverter", "capacity_kw": 20.0, "location": "Cape Town Solar Farm", "components": [ { "oem": "Sungrow", "model": "SG20KTL", "serial": "SN123456" } ] } - Response:
{ "message": "Asset created successfully", "asset_id": "INV-001" }
POST /terminal/detect
- Purpose: Run fault detection on assets
- Authentication: API key or IAM role
- Actions:
  - run: Execute fault detection
  - list: List detection results
- Request Body:
{ "action": "run", "asset_id": "INV-001" } - Response:
{ "message": "Detection completed", "asset_id": "INV-001", "detections": [] }
POST /terminal/diagnose
- Purpose: Run AI diagnostics on detected faults
- Authentication: API key or IAM role
- Actions:
  - run: Execute diagnostics
  - list: List diagnostics results
- Request Body:
{ "action": "run", "asset_id": "INV-001" } - Response:
{ "message": "Diagnostics completed", "asset_id": "INV-001", "diagnostics": [] }
POST /terminal/schedule
- Purpose: Create maintenance schedules
- Authentication: API key or IAM role
- Actions:
  - create: Create new schedule
  - list: List all schedules
- Response:
{ "message": "Schedule created", "schedule_id": "sched-uuid" }
POST /terminal/bom
- Purpose: Build bill of materials for maintenance
- Authentication: API key or IAM role
- Actions:
  - build: Generate BOM
  - list: List existing BOMs
- Response:
{ "message": "BOM built", "bom_id": "bom-uuid" }
POST /terminal/order
- Purpose: Create work orders
- Authentication: API key or IAM role
- Actions:
  - create: Create new work order
  - list: List all work orders
- Response:
{ "message": "Order created", "order_id": "ord-uuid" }
POST /terminal/track
- Purpose: Track job status and progress
- Authentication: API key or IAM role
- Actions:
  - subscribe: Subscribe to job tracking
  - list: List tracking subscriptions
- Response:
{ "message": "Tracking subscription created", "job_id": "job-uuid" }
2. Deployment & Infrastructure Management
Current Deployment Status ✅ LIVE
Platform Status: OPERATIONAL
- Custom Domain: https://api.asoba.co ✅ ACTIVE
- API Gateway ID: 2m5xvm39ef ✅ ACTIVE
- Region: af-south-1 ✅ ACTIVE
- Environment: prod ✅ ACTIVE
Infrastructure Components:
- Lambda Functions: 5/6 ✅ ACTIVE
  - ona-dataIngestion-prod (placeholder - no active data ingestion logic)
  - ona-weatherCache-prod ✅ ACTIVE
  - ona-interpolationService-prod ✅ ACTIVE
  - ona-globalTrainingService-prod ✅ ACTIVE
  - ona-forecastingApi-prod ✅ ACTIVE
  - ona-terminalApi-prod ✅ ACTIVE
- Data Collection Services: 2/2 ✅ ACTIVE
  - ona-huaweiHistorical-prod ✅ ACTIVE
  - ona-weatherDataUpdater-prod ✅ ACTIVE
- S3 Buckets: 2/2 ✅ ACTIVE
  - sa-api-client-input ✅ ACTIVE
  - sa-api-client-output ✅ ACTIVE
- DynamoDB Tables: 2/2 ✅ ACTIVE
  - ona-platform-locations ✅ ACTIVE
  - ona-platform-weather-cache ✅ ACTIVE
- API Gateway: ✅ ACTIVE
- SSL Certificate: ✅ ISSUED
- Route53 DNS: ✅ ACTIVE
Endpoint Testing:
- POST /upload_train → Direct S3 upload (no processing in dataIngestion service)
- POST /upload_nowcast → Direct S3 upload (no processing in dataIngestion service)
- GET /forecast → 200 OK ✅
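A quick smoke test against the live endpoints can be scripted as below (assuming a valid key in ONA_API_KEY; only the HTTP status codes are checked):
# Print the HTTP status code returned by each endpoint
curl -s -o /dev/null -w "upload_train: %{http_code}\n" \
  -X POST "https://api.asoba.co/upload_train" \
  -H "X-API-Key: $ONA_API_KEY"
curl -s -o /dev/null -w "forecast: %{http_code}\n" \
  -H "X-API-Key: $ONA_API_KEY" \
  "https://api.asoba.co/forecast?customer_id=test"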
Deployment Commands
Platform Deployment:
# Deploy core platform services
./deploy-all.sh
# Deploy terminal/O&M services
./deploy-terminal.sh
Validation:
# Validate platform deployment
./scripts/12-validate-deployment.sh
# Validate terminal deployment
./scripts/23-validate-terminal-deployment.sh
Authentication and Security
API Key Authentication
# Set API key in header
curl -H "X-API-Key: your-api-key" \
-X POST https://api.yourcompany.com/upload_train
# Set API key as query parameter
curl "https://api.yourcompany.com/forecast?customer_id=test&api_key=your-api-key"
IAM Role Authentication
# Use AWS credentials
aws configure
curl -H "Authorization: AWS4-HMAC-SHA256 ..." \
-X POST https://api.yourcompany.com/upload_train
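For signed requests without hand-building the Authorization header, curl 7.75+ can perform SigV4 signing itself. A sketch, assuming AWS credentials are exported in the environment:
# Let curl sign the request for the execute-api service in af-south-1
curl --aws-sigv4 "aws:amz:af-south-1:execute-api" \
  --user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
  -X POST "https://api.asoba.co/upload_train"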
Security Headers
- CORS: Configured for specific origins
- Rate Limiting: 1000 requests per hour per API key
- Request Validation: JSON schema validation
- SSL/TLS: HTTPS only with TLS 1.2+
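To confirm the negotiated TLS version and inspect the headers actually returned, a simple check might look like this (the ONA_API_KEY variable is an assumption):
# Show the negotiated TLS version and the response headers
curl -sv -o /dev/null "https://api.asoba.co/forecast?customer_id=test" 2>&1 | grep -i "SSL connection"
curl -sI -H "X-API-Key: $ONA_API_KEY" "https://api.asoba.co/forecast?customer_id=test"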
Request/Response Formats
Standard Response Format
{
"success": true,
"data": {
// Response data
},
"metadata": {
"request_id": "uuid",
"timestamp": "2025-01-15T12:00:00Z",
"version": "v1.0"
},
"errors": []
}
Error Response Format
{
"success": false,
"data": null,
"metadata": {
"request_id": "uuid",
"timestamp": "2025-01-15T12:00:00Z",
"version": "v1.0"
},
"errors": [
{
"code": "VALIDATION_ERROR",
"message": "Invalid customer_id format",
"field": "customer_id",
"details": "Expected format: alphanumeric string"
}
]
}
Rate Limiting and Quotas
Rate Limits
- Standard Tier: 1000 requests/hour
- Premium Tier: 10000 requests/hour
- Enterprise Tier: 100000 requests/hour
Quota Management
# Check current usage
curl -H "X-API-Key: your-api-key" \
https://api.yourcompany.com/quota/usage
# Response
{
"quota_limit": 1000,
"quota_used": 245,
"quota_remaining": 755,
"reset_time": "2025-01-15T13:00:00Z"
}
Quota Exceeded Response
{
"success": false,
"data": null,
"errors": [
{
"code": "QUOTA_EXCEEDED",
"message": "Rate limit exceeded. Try again later.",
"retry_after": 3600
}
]
}
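Clients should honor the retry_after hint before retrying. A minimal sketch, using the field names from the example above and an assumed ONA_API_KEY variable:
# Back off for the server-suggested interval on QUOTA_EXCEEDED
response=$(curl -s -H "X-API-Key: $ONA_API_KEY" \
  "https://api.asoba.co/forecast?customer_id=customer-001")
if echo "$response" | grep -q '"QUOTA_EXCEEDED"'; then
  retry_after=$(echo "$response" | python3 -c \
    'import sys, json; print(json.load(sys.stdin)["errors"][0]["retry_after"])')
  echo "Rate limited; retrying in ${retry_after}s"
  sleep "$retry_after"
fi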
2. Deployment & Infrastructure Management
2.1 Deployment Process
The platform uses a two-script deployment architecture:
- Platform Services (deploy-all.sh): Core forecasting and data ingestion services
- Terminal Services (deploy-terminal.sh): Operations & Maintenance OODA workflow
Prerequisites:
- AWS CLI configured with appropriate permissions
- Docker installed locally (optional - can use CI-built images)
- Visual Crossing API key
# Create environment file
cat > .env.local << 'EOF'
VISUAL_CROSSING_API_KEY=YOUR_ACTUAL_API_KEY
EOF
# Deploy platform services
./deploy-all.sh
# Deploy terminal/O&M services
./deploy-terminal.sh
What these scripts do:
deploy-all.sh:
- Creates IAM roles and policies
- Provisions S3 buckets and DynamoDB tables
- Sets up ECR repositories
- Note: Docker images are built automatically via GitHub Actions when code is pushed to main (see CI/CD Deployment Process)
- Deploys Lambda functions (pulls images from ECR)
- Configures API Gateway endpoints
- Validates deployment
deploy-terminal.sh:
- Creates terminal-specific SSM parameters
- Provisions terminal DynamoDB tables and S3 prefixes
- Creates ECR repositories for terminal services
- Generates deployment summary
Expected Deployment Output:
✓ Platform deployment completed in 2-3 minutes
✓ Terminal deployment completed in 20 minutes
✓ All Lambda functions created
✓ API Gateway deployed
✓ Endpoints responding with 200 status
Note on Performance: Deployment has been optimized with parallel execution:
- IAM role creation: 10-15 seconds (70% faster)
- Lambda deployment: 1.5 minutes (75% faster)
- API Gateway setup: 8-12 seconds (70% faster)
2.2 CI/CD Deployment Process
The platform uses GitHub Actions for automated code quality checks, Docker image builds, and deployments. This eliminates the need for local Docker builds in most cases.
GitHub Actions Workflow
Location: .github/workflows/build-and-push.yml
When it runs:
- Automatic trigger: Pushes to the main branch that modify:
  - services/** paths
  - scripts/** paths
  - ui/** paths
  - .github/workflows/build-and-push.yml
- Manual trigger: Via GitHub Actions UI → “Build and Push ECR Images” → “Run workflow”
Workflow Structure: The workflow runs in two sequential jobs:
- lint-and-test - Code quality and security checks (runs first)
- build-and-push - Docker image build and push (runs only if lint-and-test passes)
Lint and Test Job
The lint-and-test job performs comprehensive code quality checks before building Docker images:
Python Code Quality:
- Ruff: Fast Python linter for code style and potential errors (excludes .ipynb notebooks)
- Bandit: Security vulnerability scanner for Python code
- Safety: Dependency vulnerability checker for Python packages
- Custom Python Safety Checker: Project-specific Python code validation
Shell Script Quality:
- ShellCheck: Shell script linting and best practices
JavaScript/UI Quality:
- npm audit: Dependency vulnerability scanning for Node.js packages
- Custom JS Safety Checker: Project-specific JavaScript/HTML validation
Failures:
If any linting or security check fails, the workflow stops and the build-and-push job is skipped. This ensures only validated code is deployed to production.
Build and Push Job
The build-and-push job builds Docker images and pushes them to AWS ECR:
What it builds:
- Base image: ona-base (used by all service images)
- Core services: ona-dataingestion, ona-weathercache, ona-interpolationservice, ona-globaltrainingservice, ona-forecastingapi
- Terminal services: ona-terminalapi, ona-huaweihistorical, ona-huaweirealtime
- SageMaker training image: ona-platform-training
How it works:
- Checks out code
- Resolves AWS IAM role to assume (from workflow input, secret, or variable)
- Authenticates to AWS using OIDC (no long-lived credentials)
- Sets up ECR registry environment
- Logs into Amazon ECR
- Ensures all required ECR repositories exist
- Scans base dependencies for vulnerabilities
- Builds Docker images for the linux/amd64 platform:
  - Base image first
  - Service images (using base image as build arg)
  - SageMaker training image (with Deep Learning Containers ECR login)
- Pushes images to ECR with:
  - Mutable tag: ${STAGE} (e.g., prod)
  - Immutable tag: ${STAGE}-${GIT_SHA} (e.g., prod-abc1234)
Image Tags:
- Mutable tags: ona-service:prod (updated on each build)
- Immutable tags: ona-service:prod-abc1234 (tied to a specific git commit for reproducibility)
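Immutable tags make rollbacks and pinning straightforward. A sketch of pinning a function to a specific build (function and repository names follow the conventions in this guide; the tag is the example above):
# Point a Lambda function at an immutable image tag
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
aws lambda update-function-code \
  --function-name ona-forecastingApi-prod \
  --image-uri "${ACCOUNT_ID}.dkr.ecr.af-south-1.amazonaws.com/ona-forecastingapi:prod-abc1234"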
After images are built:
# Update Lambda functions to use new images
./scripts/08-create-lambdas.sh
This script updates all Lambda functions to pull the latest images from ECR.
Note: The workflow automatically creates all required ECR repositories if they don’t exist. Repositories are created with image scanning enabled on push.
Manual Docker Build (Alternative): If you need to build locally without pushing to GitHub:
# Build and push images manually
./scripts/07-build-and-push-docker.sh
# Then update Lambda functions
./scripts/08-create-lambdas.sh
Updating a Single Service:
To update just one service (e.g., terminalApi):
- Make code changes
- Commit and push to the main branch → GitHub Actions builds automatically
- Or manually trigger the workflow via the GitHub Actions UI
- Run ./scripts/08-create-lambdas.sh to update Lambda functions
Monitoring Builds:
# View GitHub Actions workflow runs
gh run list --workflow=build-and-push.yml
# View latest build logs
gh run view --log
# View specific workflow run
gh run view <run-id> --log
2.3 Infrastructure Components
The deployment creates the following AWS resources:
Lambda Functions
- ona-dataIngestion-prod (1024MB, 300s timeout)
- ona-weatherCache-prod (512MB, 300s timeout)
- ona-interpolationService-prod (3008MB, 900s timeout)
- ona-globalTrainingService-prod (1024MB, 300s timeout)
- ona-forecastingApi-prod (3008MB, 60s timeout)
- ona-terminalApi-prod (1024MB, 300s timeout)
API Gateway
- REST API: ona-api-prod
- Stage: prod
- Endpoints:
  - POST /upload_train
  - POST /upload_nowcast
  - GET /forecast
Storage Resources
- S3 Buckets: sa-api-client-input, sa-api-client-output
- DynamoDB Tables: ona-platform-locations, ona-platform-weather-cache
ECR Repositories
- ona-base
- ona-dataingestion
- ona-weathercache
- ona-interpolationservice
- ona-globaltrainingservice
- ona-forecastingapi
- ona-terminalapi
- ona-terminalooda
- ona-terminalassets
- ona-terminalbom
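To confirm the repositories exist after a deployment, a quick check of a subset might look like this (describe-repositories fails with RepositoryNotFoundException if any listed name is missing):
# List a few of the expected repositories
aws ecr describe-repositories \
  --repository-names ona-base ona-dataingestion ona-weathercache ona-forecastingapi ona-terminalapi \
  --query 'repositories[].repositoryName' \
  --output table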
2.4 DNS and Custom Domain
One-time DNS Setup
cd dns-setup
./setup-dns-infrastructure.sh
This creates SSL certificate for api.asoba.co and configures DNS validation.
Custom Domain Mapping
The custom domain is automatically mapped when the SSL certificate is ready:
# Check certificate status
./dns-setup/check-certificate-status.sh
# Once certificate is ISSUED, custom domain will be mapped automatically
# on the next deployment run
2.5 Troubleshooting Deployment
Common Issues
CI Build Fails
If the workflow fails, check which job failed:
# View recent workflow runs
gh run list --workflow=build-and-push.yml
# View latest workflow run logs
gh run view --log
# View specific workflow run (replace <run-id> with actual ID)
gh run view <run-id> --log
# View only failed steps
gh run view <run-id> --log-failed
Common Failure Scenarios:
- Lint-and-Test Job Fails:
- Ruff errors: Check Python code style issues (unused imports, undefined variables, f-strings without placeholders)
- Bandit errors: Security vulnerabilities detected in Python code
- Safety errors: Dependency vulnerabilities in requirements files
- ShellCheck errors: Shell script syntax or best practice violations
- npm audit errors: Dependency vulnerabilities in UI/Node.js packages
Fixing Lint Errors:
# Run Ruff locally to see errors ruff check . --exclude "*.ipynb" # Run Bandit locally bandit -r . # Run Safety locally safety check --file=services/base/requirements-base.txt # Run ShellCheck locally shellcheck scripts/*.sh # Fix errors before pushing - Build-and-Push Job Fails:
- AWS Authentication errors: Check OIDC role configuration
- ECR login errors: Verify ECR repositories exist
- Docker build errors: Check Dockerfile syntax and dependencies
- Push errors: Verify IAM permissions for ECR push
Verify GitHub OIDC role exists:
aws iam get-role --role-name ona-github-actions-ecr-role
GitHub Actions AWS Credentials Error
If you see the error “Credentials could not be loaded, please check your action inputs”, the GitHub repository is missing the AWS role ARN secret:
# Add the AWS role ARN as a GitHub repository secret
gh secret set AWS_ROLE_TO_ASSUME --body "arn:aws:iam::905418405543:role/ona-github-actions-ecr-role"
# Verify the secret was added
gh secret list
# Test the workflow
gh workflow run "Build and Push ECR Images"
Required GitHub Repository Configuration
The platform uses GitHub Actions with OIDC authentication to AWS. Ensure these are configured:
- AWS IAM Role: ona-github-actions-ecr-role (already exists)
- GitHub Secret: AWS_ROLE_TO_ASSUME with value arn:aws:iam::905418405543:role/ona-github-actions-ecr-role
- OIDC Provider: GitHub OIDC provider configured in AWS IAM
Manual GitHub Setup (if GitHub CLI not available):
- Go to: https://github.com/AsobaCloud/platform/settings/secrets/actions
- Click “New repository secret”
- Name: AWS_ROLE_TO_ASSUME
- Value: arn:aws:iam::905418405543:role/ona-github-actions-ecr-role
Lambda Functions Not Responding
# Check Lambda logs
aws logs get-log-events \
--log-group-name /aws/lambda/ona-dataIngestion-prod \
--log-stream-name $(aws logs describe-log-streams \
--log-group-name /aws/lambda/ona-dataIngestion-prod \
--order-by LastEventTime --descending --max-items 1 \
--query 'logStreams[0].logStreamName' --output text)
# Check function status
aws lambda get-function --function-name ona-dataIngestion-prod
API Gateway Not Working
# Check API Gateway deployment
aws apigateway get-deployments --rest-api-id 2m5xvm39ef
# Test endpoints directly
curl -X POST https://2m5xvm39ef.execute-api.af-south-1.amazonaws.com/prod/upload_train
Custom Domain Issues
# Check certificate status
aws acm describe-certificate \
--certificate-arn $(cat .certificate-arn) \
--region us-east-1
# Check DNS resolution
nslookup api.asoba.co
2.6 Rollback and Cleanup
Complete Rollback
./rollback.sh
This removes all platform resources while preserving:
- S3 buckets and data
- DynamoDB tables and data
- SSL certificate
- Route53 DNS records
Selective Cleanup
# Remove specific Lambda functions
aws lambda delete-function --function-name ona-dataIngestion-prod
# Remove API Gateway
aws apigateway delete-rest-api --rest-api-id 2m5xvm39ef
3. Operations & Maintenance
OODA Workflow Overview
The Ona Platform implements the OODA (Observe-Orient-Decide-Act) loop for systematic asset management:
Observe Phase (< 5 minutes)
Purpose: Real-time fault detection and anomaly identification
Key Operations:
- Continuous monitoring of SCADA/inverter data streams
- Anomaly detection using statistical and ML-based algorithms
- Severity scoring (0.0-1.0) for detected issues
- Integration with energy production forecasts for enhanced accuracy
Administrative Tasks:
# Monitor detection performance
aws logs tail /aws/lambda/ona-detect-prod --follow
# Check detection metrics
aws cloudwatch get-metric-statistics \
--namespace "Ona/Detection" \
--metric-name "DetectionLatency" \
--start-time 2025-01-15T00:00:00Z \
--end-time 2025-01-15T23:59:59Z \
--period 300 \
--statistics Average,Maximum
Orient Phase (< 10 minutes)
Purpose: AI-powered diagnostics and root cause analysis
Key Operations:
- Categorization of detected faults (Weather Damage, OEM Fault, Ops Fault, etc.)
- Risk assessment and severity classification
- Energy-at-Risk (EAR) calculations
- Trend analysis for degradation detection
Administrative Tasks:
# Review diagnostic categories
cat configs/ooda/categories.yaml
# Update loss function weights
aws ssm put-parameter \
--name /ona-platform/prod/loss-function-weights \
--value '{"w_energy": 1.0, "w_cost": 0.3, "w_mttr": 0.2}' \
--type String \
--overwrite
Decide Phase (< 15 minutes)
Purpose: Energy-at-Risk calculations and maintenance scheduling
Key Operations:
- Financial impact assessment of asset issues
- Optimal maintenance scheduling across crews
- Resource allocation and constraint management
- Bill of Materials (BOM) generation
Administrative Tasks:
# Monitor scheduling performance
aws logs tail /aws/lambda/ona-schedule-prod --follow
# Check crew availability
aws dynamodb get-item \
--table-name ona-platform-crews \
--key '{"crew_id": {"S": "crew-001"}}'
Act Phase (Continuous)
Purpose: Automated work order creation and tracking
Key Operations:
- Automated work order generation from BOMs
- Integration with external ticketing systems
- Progress tracking and status updates
- Notification management for stakeholders
Administrative Tasks:
# Monitor work order processing
aws logs tail /aws/lambda/ona-order-prod --follow
# Check tracking subscriptions
aws dynamodb scan \
--table-name ona-platform-tracking \
--filter-expression "status = :active" \
--expression-attribute-values '{":active": {"S": "active"}}'
Daily Operations Guide
Morning Health Check (8:00 AM)
#!/bin/bash
# daily_health_check.sh
echo "=== Ona Platform Health Check ==="
echo "Timestamp: $(date)"
# Check all Lambda functions
echo "Checking Lambda functions..."
for service in dataIngestion weatherCache interpolationService globalTrainingService forecastingApi; do
status=$(aws lambda get-function --function-name ona-${service}-prod --query 'Configuration.State' --output text)
echo " ona-${service}-prod: $status"
done
# Check S3 bucket health
echo "Checking S3 buckets..."
aws s3api head-bucket --bucket sa-api-client-input && echo " Input bucket: OK" || echo " Input bucket: ERROR"
aws s3api head-bucket --bucket sa-api-client-output && echo " Output bucket: OK" || echo " Output bucket: ERROR"
# Check DynamoDB tables
echo "Checking DynamoDB tables..."
aws dynamodb describe-table --table-name ona-platform-locations --query 'Table.TableStatus' --output text
aws dynamodb describe-table --table-name ona-platform-weather-cache --query 'Table.TableStatus' --output text
# Check API Gateway
echo "Checking API Gateway..."
api_id=$(aws apigateway get-rest-apis --query "items[?name=='ona-api-prod'].id" --output text)
if [ -n "$api_id" ]; then
echo " API Gateway: OK (ID: $api_id)"
else
echo " API Gateway: ERROR"
fi
echo "=== Health Check Complete ==="
Midday Performance Review (12:00 PM)
#!/bin/bash
# performance_review.sh
echo "=== Performance Review ==="
# Check error rates
echo "Error rates (last 24 hours):"
for service in dataIngestion weatherCache interpolationService globalTrainingService forecastingApi; do
errors=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=ona-${service}-prod \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' \
--output text)
echo " ona-${service}-prod: ${errors:-0} errors"
done
# Check execution duration
echo "Average execution duration:"
for service in dataIngestion weatherCache interpolationService globalTrainingService forecastingApi; do
duration=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=ona-${service}-prod \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average \
--query 'Datapoints[0].Average' \
--output text)
echo " ona-${service}-prod: ${duration:-0}ms"
done
echo "=== Performance Review Complete ==="
End-of-Day Summary (5:00 PM)
#!/bin/bash
# daily_summary.sh
echo "=== Daily Operations Summary ==="
echo "Date: $(date +%Y-%m-%d)"
# Count processed records
echo "Data Processing Summary:"
historical_count=$(aws s3 ls s3://sa-api-client-input/historical/ --recursive | wc -l)
nowcast_count=$(aws s3 ls s3://sa-api-client-input/nowcast/ --recursive | wc -l)
echo " Historical records processed: $historical_count"
echo " Nowcast records processed: $nowcast_count"
# Count forecasts generated
forecast_count=$(aws logs filter-log-events \
--log-group-name /aws/lambda/ona-forecastingApi-prod \
--start-time $(date -d '24 hours ago' +%s)000 \
--filter-pattern "Forecast generated" \
--query 'events | length(@)' \
--output text)
echo " Forecasts generated: $forecast_count"
# Count work orders created
work_orders=$(aws logs filter-log-events \
--log-group-name /aws/lambda/ona-order-prod \
--start-time $(date -d '24 hours ago' +%s)000 \
--filter-pattern "Work order created" \
--query 'events | length(@)' \
--output text)
echo " Work orders created: $work_orders"
echo "=== Daily Summary Complete ==="
Monitoring and Alerting
CloudWatch Alarms Configuration
# Create error rate alarm
aws cloudwatch put-metric-alarm \
--alarm-name "ona-dataIngestion-error-rate" \
--alarm-description "High error rate in data ingestion service" \
--metric-name Errors \
--namespace AWS/Lambda \
--statistic Sum \
--period 300 \
--threshold 5 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=FunctionName,Value=ona-dataIngestion-prod \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts
# Create duration alarm
aws cloudwatch put-metric-alarm \
--alarm-name "ona-interpolationService-duration" \
--alarm-description "High execution duration in interpolation service" \
--metric-name Duration \
--namespace AWS/Lambda \
--statistic Average \
--period 300 \
--threshold 600000 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=FunctionName,Value=ona-interpolationService-prod \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts
SNS Alert Configuration
# Create SNS topic
aws sns create-topic --name ona-platform-alerts
# Subscribe to email notifications
aws sns subscribe \
--topic-arn arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts \
--protocol email \
--notification-endpoint admin@yourcompany.com
# Subscribe to Slack webhook
aws sns subscribe \
--topic-arn arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts \
--protocol https \
--notification-endpoint https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
Troubleshooting Guide
Common O&M Issues
Issue: High False Positive Rate in Detection
# Check detection thresholds
aws ssm get-parameter --name /ona-platform/prod/detection-threshold --query 'Parameter.Value'
# Adjust severity threshold
aws ssm put-parameter \
--name /ona-platform/prod/detection-threshold \
--value "0.7" \
--type String \
--overwrite
# Review detection logs
aws logs filter-log-events \
--log-group-name /aws/lambda/ona-detect-prod \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "severity"
Issue: Slow Diagnostic Processing
# Check diagnostic service performance
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=ona-diagnose-prod \
--start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 300 \
--statistics Average,Maximum
# Increase Lambda memory if needed
aws lambda update-function-configuration \
--function-name ona-diagnose-prod \
--memory-size 2048
Issue: Scheduling Conflicts
# Check crew availability
aws dynamodb scan \
--table-name ona-platform-crews \
--filter-expression "status = :available" \
--expression-attribute-values '{":available": {"S": "available"}}'
# Review scheduling constraints
cat configs/ooda/scheduling-constraints.yaml
# Check schedule conflicts
aws logs filter-log-events \
--log-group-name /aws/lambda/ona-schedule-prod \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern "conflict"
4. Data Management
Input Data Requirements
Sensor Data Format
Location: s3://sa-api-client-input/observations/
Format: CSV with specific column requirements
Required Columns:
timestamp,asset_id,temperature_c,voltage_v,power_kw,irradiance_w_m2,wind_speed_m_s
2025-01-15T08:00:00Z,INV-001,45.2,800.5,18.3,850.2,3.2
2025-01-15T08:15:00Z,INV-001,46.1,799.8,17.9,845.1,3.5
Column Specifications:
- timestamp: ISO 8601 format (UTC)
- asset_id: Unique asset identifier (string)
- temperature_c: Temperature in Celsius (float)
- voltage_v: Voltage in Volts (float)
- power_kw: Power output in kilowatts (float)
- irradiance_w_m2: Solar irradiance in W/m² (float, optional)
- wind_speed_m_s: Wind speed in m/s (float, optional)
Asset Metadata Format
Location: s3://sa-api-client-input/assets.json
Format: JSON with asset configuration
{
"assets": [
{
"id": "INV-001",
"name": "Inverter 001",
"type": "Solar Inverter",
"capacity_kw": 20.0,
"location": {
"latitude": -26.2041,
"longitude": 28.0473,
"address": "Solar Farm A, Johannesburg"
},
"components": [
{
"oem": "Sungrow",
"model": "SG20KTL",
"serial": "SN123456",
"type": "inverter",
"installation_date": "2024-01-15T00:00:00Z"
}
],
"configuration": {
"dc_voltage_max": 1000,
"ac_voltage_nominal": 400,
"frequency_nominal": 50
}
}
]
}
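A one-line example of publishing updated asset metadata to the expected key (this overwrites the existing file):
# Copy the local assets.json to the documented S3 location
aws s3 cp assets.json s3://sa-api-client-input/assets.json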
Forecast Data Format
Location: s3://sa-api-client-input/forecasts/
Format: CSV with weather and energy predictions
timestamp,asset_id,predicted_power_kw,temperature_c,irradiance_w_m2,wind_speed_m_s,cloud_cover_percent
2025-01-15T09:00:00Z,INV-001,19.2,26.5,920.3,2.8,15.2
2025-01-15T09:15:00Z,INV-001,19.8,27.1,935.7,3.1,12.8
Output Data Formats
Forecast Results
Location: s3://sa-api-client-output/forecasts/
Format: JSON with comprehensive forecast data
{
"forecast_id": "fc-20250115-001",
"customer_id": "customer-001",
"asset_id": "INV-001",
"generated_at": "2025-01-15T08:00:00Z",
"horizon_hours": 48,
"forecasts": [
{
"timestamp": "2025-01-15T09:00:00Z",
"predicted_power_kw": 19.2,
"confidence_interval": {
"lower_95": 18.1,
"upper_95": 20.3,
"lower_80": 18.5,
"upper_80": 19.9
},
"weather_conditions": {
"temperature_c": 26.5,
"irradiance_w_m2": 920.3,
"wind_speed_m_s": 2.8,
"cloud_cover_percent": 15.2,
"humidity_percent": 65.3
},
"model_metadata": {
"model_version": "v2.1.0",
"training_data_points": 8760,
"model_accuracy": 0.94,
"last_training_date": "2025-01-10T00:00:00Z"
}
}
],
"summary": {
"total_predicted_energy_kwh": 920.4,
"peak_power_kw": 19.8,
"average_power_kw": 19.2,
"capacity_factor": 0.96
}
}
Diagnostic Results
Location: s3://sa-api-client-output/diagnostics/
Format: JSON with fault analysis and recommendations
{
"diagnostic_id": "diag-20250115-001",
"asset_id": "INV-001",
"analysis_timestamp": "2025-01-15T08:00:00Z",
"findings": [
{
"component": {
"oem": "Sungrow",
"model": "SG20KTL",
"serial": "SN123456",
"type": "fan"
},
"severity": 0.8,
"category": "OEM Fault",
"subcategory": "inverter_overtemp",
"description": "High temperature detected with reduced fan performance",
"confidence": 0.92,
"recommended_actions": [
"replace_component",
"inspect_cooling_system"
],
"energy_at_risk": {
"daily_loss_kwh": 15.2,
"daily_loss_usd": 18.24,
"confidence_interval": {
"lower": 12.1,
"upper": 18.3
}
}
}
],
"trend_analysis": {
"degradation_rate_percent_per_year": 2.3,
"performance_trend": "declining",
"next_maintenance_due": "2025-02-15T00:00:00Z"
}
}
Work Order Results
Location: s3://sa-api-client-output/work-orders/
Format: JSON with maintenance scheduling and BOMs
{
"work_order_id": "wo-20250115-001",
"asset_id": "INV-001",
"created_at": "2025-01-15T08:00:00Z",
"priority": "high",
"estimated_duration_hours": 4,
"scheduled_start": "2025-01-16T08:00:00Z",
"scheduled_end": "2025-01-16T12:00:00Z",
"crew_assignment": {
"crew_id": "crew-001",
"technician_count": 2,
"specialization": "inverter_maintenance"
},
"bill_of_materials": [
{
"sku": "SG20KTL-FAN-STD",
"description": "Cooling fan for SG20KTL inverter",
"quantity": 1,
"unit_price_usd": 120.0,
"total_price_usd": 120.0,
"lead_time_days": 7,
"supplier": "Sungrow Parts",
"warranty_months": 24
}
],
"total_cost_usd": 120.0,
"energy_at_risk": {
"daily_loss_usd": 18.24,
"total_risk_usd": 127.68,
"roi_days": 6.6
}
}
Storage Architecture
S3 Bucket Structure
sa-api-client-input/
├── observations/ # Raw sensor data
│ ├── 2025/01/15/
│ │ ├── INV-001_20250115.csv
│ │ └── INV-002_20250115.csv
│ └── archive/ # Compressed historical data
├── historical/ # Training data
│ ├── customer-001/
│ │ ├── 2024_q1.csv
│ │ └── 2024_q2.csv
│ └── customer-002/
├── nowcast/ # Real-time data
│ ├── 2025/01/15/
│ │ ├── INV-001_nowcast.csv
│ │ └── INV-002_nowcast.csv
├── training/ # Processed training data
│ ├── customer-001/
│ │ ├── enriched_data.csv
│ │ └── features.csv
│ └── customer-002/
├── weather/ # Weather data cache
│ ├── cache/
│ │ ├── johannesburg_20250115.json
│ │ └── cape_town_20250115.json
│ └── forecasts/
└── assets.json # Asset metadata
sa-api-client-output/
├── models/ # Trained ML models
│ ├── customer-001/
│ │ ├── lstm_model_v2.1.0.pkl
│ │ └── feature_scaler.pkl
│ └── customer-002/
├── forecasts/ # Generated forecasts
│ ├── 2025/01/15/
│ │ ├── fc-20250115-001.json
│ │ └── fc-20250115-002.json
│ └── archive/
├── diagnostics/ # Diagnostic results
│ ├── 2025/01/15/
│ │ ├── diag-20250115-001.json
│ │ └── diag-20250115-002.json
│ └── archive/
└── work-orders/ # Work order data
├── 2025/01/15/
│ ├── wo-20250115-001.json
│ └── wo-20250115-002.json
└── archive/
DynamoDB Table Structure
The platform utilizes 7 DynamoDB tables with a PAY_PER_REQUEST billing model.
Platform Tables:
1. ona-platform-locations
- Purpose: Stores location and customer data.
- Key Schema: customer_id (Partition Key, String)
{
  "TableName": "ona-platform-locations",
  "KeySchema": [
    { "AttributeName": "customer_id", "KeyType": "HASH" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "customer_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
2. ona-platform-weather-cache
- Purpose: Caches weather data to reduce API calls.
- Key Schema: location_key (Partition Key, String)
{
  "TableName": "ona-platform-weather-cache",
  "KeySchema": [
    { "AttributeName": "location_key", "KeyType": "HASH" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "location_key", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
Terminal Tables:
3. ona-platform-terminal-assets
- Purpose: Stores information about physical assets.
- Key Schema: asset_id (Partition Key, String)
{
  "TableName": "ona-platform-terminal-assets",
  "KeySchema": [
    { "AttributeName": "asset_id", "KeyType": "HASH" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "asset_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
4. ona-platform-terminal-schedules
- Purpose: Stores maintenance schedules.
- Key Schema: schedule_id (Partition Key, String), asset_id (Sort Key, String)
{
  "TableName": "ona-platform-terminal-schedules",
  "KeySchema": [
    { "AttributeName": "schedule_id", "KeyType": "HASH" },
    { "AttributeName": "asset_id", "KeyType": "RANGE" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "schedule_id", "AttributeType": "S" },
    { "AttributeName": "asset_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
5. ona-platform-terminal-boms
- Purpose: Stores Bill of Materials for maintenance jobs.
- Key Schema: bom_id (Partition Key, String), asset_id (Sort Key, String)
{
  "TableName": "ona-platform-terminal-boms",
  "KeySchema": [
    { "AttributeName": "bom_id", "KeyType": "HASH" },
    { "AttributeName": "asset_id", "KeyType": "RANGE" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "bom_id", "AttributeType": "S" },
    { "AttributeName": "asset_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
6. ona-platform-terminal-orders
- Purpose: Stores work orders.
- Key Schema: order_id (Partition Key, String)
{
  "TableName": "ona-platform-terminal-orders",
  "KeySchema": [
    { "AttributeName": "order_id", "KeyType": "HASH" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "order_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
7. ona-platform-terminal-tracking
- Purpose: Stores job tracking information.
- Key Schema: job_id (Partition Key, String)
{
  "TableName": "ona-platform-terminal-tracking",
  "KeySchema": [
    { "AttributeName": "job_id", "KeyType": "HASH" }
  ],
  "AttributeDefinitions": [
    { "AttributeName": "job_id", "AttributeType": "S" }
  ],
  "BillingMode": "PAY_PER_REQUEST"
}
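The deployment scripts normally create these tables; if one needs to be recreated by hand, a sketch of the equivalent CLI call for the schedules table is:
# Create the terminal schedules table with the key schema defined above
aws dynamodb create-table \
  --table-name ona-platform-terminal-schedules \
  --attribute-definitions \
    AttributeName=schedule_id,AttributeType=S \
    AttributeName=asset_id,AttributeType=S \
  --key-schema \
    AttributeName=schedule_id,KeyType=HASH \
    AttributeName=asset_id,KeyType=RANGE \
  --billing-mode PAY_PER_REQUEST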
Data Retention Policies
Automatic Data Lifecycle Management
# Configure S3 lifecycle policies
aws s3api put-bucket-lifecycle-configuration \
--bucket sa-api-client-input \
--lifecycle-configuration '{
"Rules": [
{
"ID": "ArchiveObservations",
"Status": "Enabled",
"Filter": {
"Prefix": "observations/"
},
"Transitions": [
{
"Days": 30,
"StorageClass": "STANDARD_IA"
},
{
"Days": 90,
"StorageClass": "GLACIER"
},
{
"Days": 365,
"StorageClass": "DEEP_ARCHIVE"
}
]
},
{
"ID": "DeleteOldForecasts",
"Status": "Enabled",
"Filter": {
"Prefix": "forecasts/"
},
"Expiration": {
"Days": 90
}
}
]
}'
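To verify the rules were applied, the configuration can be read back:
# Show the lifecycle rules currently attached to the input bucket
aws s3api get-bucket-lifecycle-configuration --bucket sa-api-client-input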
Manual Data Cleanup
# Clean up old diagnostic results
aws s3 rm s3://sa-api-client-output/diagnostics/2024/ --recursive
# Archive old training data
aws s3 cp s3://sa-api-client-input/historical/ s3://sa-api-client-input/archive/historical/ --recursive
# Compress large observation files
# (gzip cannot operate on S3 URIs directly; download, compress, then re-upload)
aws s3 cp s3://sa-api-client-input/observations/2024/INV-001_20241231.csv .
gzip INV-001_20241231.csv
aws s3 cp INV-001_20241231.csv.gz s3://sa-api-client-input/observations/2024/
Data Validation and Quality Control
Input Data Validation
# Example data validation script
import pandas as pd
import json
from datetime import datetime
def validate_sensor_data(file_path):
"""Validate sensor data format and content"""
try:
df = pd.read_csv(file_path)
# Check required columns
required_columns = ['timestamp', 'asset_id', 'temperature_c', 'voltage_v', 'power_kw']
missing_columns = set(required_columns) - set(df.columns)
if missing_columns:
raise ValueError(f"Missing required columns: {missing_columns}")
# Validate timestamp format
df['timestamp'] = pd.to_datetime(df['timestamp'])
# Check for null values
null_counts = df.isnull().sum()
if null_counts.any():
print(f"Warning: Null values found: {null_counts[null_counts > 0].to_dict()}")
# Validate data ranges
if (df['temperature_c'] < -50).any() or (df['temperature_c'] > 100).any():
raise ValueError("Temperature values out of range")
if (df['voltage_v'] < 0).any() or (df['voltage_v'] > 1500).any():
raise ValueError("Voltage values out of range")
if (df['power_kw'] < 0).any():
raise ValueError("Power values cannot be negative")
return True, "Data validation passed"
except Exception as e:
return False, f"Data validation failed: {str(e)}"
# Usage
is_valid, message = validate_sensor_data('INV-001_20250115.csv')
print(f"Validation result: {is_valid}, Message: {message}")
Data Quality Monitoring
# Monitor data quality metrics
aws cloudwatch put-metric-data \
--namespace "Ona/DataQuality" \
--metric-data '[
{
"MetricName": "DataCompleteness",
"Value": 95.2,
"Unit": "Percent",
"Dimensions": [
{
"Name": "AssetId",
"Value": "INV-001"
}
]
},
{
"MetricName": "DataAccuracy",
"Value": 98.7,
"Unit": "Percent",
"Dimensions": [
{
"Name": "AssetId",
"Value": "INV-001"
}
]
}
]'
Backup and Recovery
Automated Backup Strategy
#!/bin/bash
# backup_data.sh
BACKUP_DATE=$(date +%Y%m%d)
BACKUP_BUCKET="ona-platform-backups"
echo "Starting data backup for $BACKUP_DATE"
# Backup critical data
aws s3 cp s3://sa-api-client-input/assets.json s3://$BACKUP_BUCKET/$BACKUP_DATE/assets.json
aws s3 sync s3://sa-api-client-input/historical/ s3://$BACKUP_BUCKET/$BACKUP_DATE/historical/
aws s3 sync s3://sa-api-client-output/models/ s3://$BACKUP_BUCKET/$BACKUP_DATE/models/
# Backup DynamoDB tables
aws dynamodb create-backup \
--table-name ona-platform-locations \
--backup-name "locations-backup-$BACKUP_DATE"
aws dynamodb create-backup \
--table-name ona-platform-weather-cache \
--backup-name "weather-cache-backup-$BACKUP_DATE"
echo "Backup completed successfully"
Disaster Recovery Procedures
#!/bin/bash
# disaster_recovery.sh
BACKUP_DATE=$1
BACKUP_BUCKET="ona-platform-backups"
if [ -z "$BACKUP_DATE" ]; then
echo "Usage: $0 <backup_date> (format: YYYYMMDD)"
exit 1
fi
echo "Starting disaster recovery from backup $BACKUP_DATE"
# Restore S3 data
aws s3 cp s3://$BACKUP_BUCKET/$BACKUP_DATE/assets.json s3://sa-api-client-input/assets.json
aws s3 sync s3://$BACKUP_BUCKET/$BACKUP_DATE/historical/ s3://sa-api-client-input/historical/
aws s3 sync s3://$BACKUP_BUCKET/$BACKUP_DATE/models/ s3://sa-api-client-output/models/
# Restore DynamoDB tables
aws dynamodb restore-table-from-backup \
--target-table-name ona-platform-locations-restored \
--backup-arn "arn:aws:dynamodb:af-south-1:ACCOUNT:table/ona-platform-locations/backup/BACKUP_ID"
echo "Disaster recovery completed"
4. Configuration & Customization
Environment Configuration
Main Configuration File
Location: config/environment.sh
Purpose: Central configuration for all deployment scripts
#!/bin/bash
# environment.sh - Main configuration file
# AWS Configuration
export AWS_REGION="af-south-1"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export ENVIRONMENT="prod"
# API Configuration
export API_DOMAIN="api.yourcompany.com"
export API_NAME="ona-api-prod"
export STAGE="prod"
# S3 Bucket Configuration
export INPUT_BUCKET="sa-api-client-input"
export OUTPUT_BUCKET="sa-api-client-output"
export MODELS_BUCKET="ona-platform-models"
# DynamoDB Configuration
export LOCATIONS_TABLE="ona-platform-locations"
export WEATHER_CACHE_TABLE="ona-platform-weather-cache"
export CREWS_TABLE="ona-platform-crews"
export TRACKING_TABLE="ona-platform-tracking"
# Service Configuration
export SERVICES=("dataIngestion" "weatherCache" "interpolationService" "globalTrainingService" "forecastingApi")
# Lambda Configuration
export LAMBDA_MEMORY_DATA_INGESTION=1024
export LAMBDA_MEMORY_WEATHER_CACHE=512
export LAMBDA_MEMORY_INTERPOLATION=3008
export LAMBDA_MEMORY_TRAINING=1024
export LAMBDA_MEMORY_FORECASTING=3008
export LAMBDA_TIMEOUT_DATA_INGESTION=300
export LAMBDA_TIMEOUT_WEATHER_CACHE=300
export LAMBDA_TIMEOUT_INTERPOLATION=900
export LAMBDA_TIMEOUT_TRAINING=300
export LAMBDA_TIMEOUT_FORECASTING=60
# ECR Configuration
export ECR_REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
# Standard Tags
export STANDARD_TAGS="Key=Project,Value=ona-platform Key=Environment,Value=${ENVIRONMENT} Key=ManagedBy,Value=terraform"
# Hosted Zone Configuration
export HOSTED_ZONE_ID="Z02057713AMAS6GXTEGNR"
# SNS Configuration
export ALERT_TOPIC_ARN="arn:aws:sns:${AWS_REGION}:${AWS_ACCOUNT_ID}:ona-platform-alerts"
Service-Specific Configuration
Location: config/services.yaml
Purpose: Detailed service configurations
# services.yaml - Service-specific configurations
dataIngestion:
memory: 1024
timeout: 300
environment_variables:
LOG_LEVEL: "INFO"
MAX_FILE_SIZE_MB: 100
ALLOWED_FILE_TYPES: "csv,json"
reserved_concurrency: 10
weatherCache:
memory: 512
timeout: 300
schedule_expression: "rate(15 minutes)"
environment_variables:
LOG_LEVEL: "INFO"
WEATHER_API_TIMEOUT: 30
CACHE_TTL_HOURS: 1
reserved_concurrency: 1
interpolationService:
memory: 3008
timeout: 900
environment_variables:
LOG_LEVEL: "INFO"
ML_MODEL_TIMEOUT: 300
MAX_BATCH_SIZE: 1000
reserved_concurrency: 5
globalTrainingService:
memory: 1024
timeout: 300
environment_variables:
LOG_LEVEL: "INFO"
TRAINING_TIMEOUT: 1800
MODEL_VERSION: "v2.1.0"
reserved_concurrency: 2
forecastingApi:
memory: 3008
timeout: 60
environment_variables:
LOG_LEVEL: "INFO"
FORECAST_HORIZON_HOURS: 48
CONFIDENCE_LEVEL: 0.95
reserved_concurrency: 20
Service-Specific Settings
OODA Configuration
Location: configs/ooda/categories.yaml
Purpose: Define fault categories and subcategories
# categories.yaml - Fault categorization
Weather Damage:
- hail_impact
- wind_stress
- lightning_strike
- flooding_damage
OEM Fault:
- inverter_overtemp
- dc_bus_fault
- communication_error
- component_failure
Ops Fault:
- wrong_setpoint
- maintenance_overdue
- calibration_drift
- configuration_error
Wear and Tear:
- bearing_wear
- capacitor_degradation
- connector_corrosion
- insulation_degradation
End of Life:
- capacity_fade
- efficiency_loss
- component_failure
- performance_degradation
Unknown Needs Further Investigation:
- unknown_pattern
- intermittent_fault
- data_quality_issue
- sensor_malfunction
Loss Function Configuration
Location: configs/ooda/loss_function.yaml
Purpose: Configure decision-making weights
# loss_function.yaml - Decision weights and constraints
weights:
w_energy: 1.0 # Energy loss weight (primary factor)
w_cost: 0.3 # Maintenance cost weight
w_mttr: 0.2 # Mean time to repair weight
w_risk: 0.5 # Risk assessment weight
w_urgency: 0.8 # Urgency factor weight
crew:
crews_available: 2
hours_per_day: 8
max_travel_distance_km: 100
specializations:
- inverter_maintenance
- electrical_work
- mechanical_repair
constraints:
max_concurrent_jobs: 3
min_crew_size: 1
max_crew_size: 4
working_hours:
start: "08:00"
end: "17:00"
timezone: "Africa/Johannesburg"
thresholds:
severity_critical: 0.8
severity_major: 0.6
severity_minor: 0.4
ear_threshold_usd: 50.0
Detection Configuration
Location: configs/ooda/detection.yaml
Purpose: Configure anomaly detection parameters
# detection.yaml - Anomaly detection configuration
algorithms:
statistical:
z_score_threshold: 3.0
rolling_window_minutes: 15
min_data_points: 10
machine_learning:
model_type: "isolation_forest"
contamination: 0.1
random_state: 42
energy_deviation:
forecast_tolerance_percent: 10.0
min_deviation_kw: 1.0
confidence_threshold: 0.8
parameters:
temperature:
normal_range_min: -10
normal_range_max: 60
critical_threshold: 70
voltage:
normal_range_min: 0
normal_range_max: 1200
critical_threshold: 1300
power:
normal_range_min: 0
normal_range_max: 25
critical_threshold: 30
filtering:
exclude_maintenance_windows: true
exclude_night_hours: true
night_hours_start: "18:00"
night_hours_end: "06:00"
Extensible Platform Capabilities
Custom Service Integration
Location: configs/extensions/custom_services.yaml
Purpose: Configure additional modular services
# custom_services.yaml - Extensible service configuration
insurance_automation:
enabled: true
api_endpoint: "https://api.insurance-provider.com"
api_key_parameter: "/ona-platform/prod/insurance-api-key"
claim_threshold_usd: 1000
auto_submission: true
notification_email: "claims@yourcompany.com"
fleet_analytics:
enabled: true
portfolio_analysis: true
cross_site_benchmarking: true
performance_trends: true
reporting_frequency: "weekly"
output_format: "pdf"
soiling_calculations:
enabled: true
detection_method: "performance_analysis"
cleaning_threshold_percent: 5.0
weather_integration: true
cost_per_cleaning_usd: 500
roi_threshold_days: 30
energy_market_integration:
enabled: false
market_provider: "eskom"
api_endpoint: "https://api.eskom.co.za"
price_fetch_frequency: "hourly"
dispatch_optimization: true
electricity_dispatch:
enabled: false
grid_connection_type: "embedded_generation"
max_export_capacity_kw: 1000
tariff_structure: "time_of_use"
optimization_horizon_hours: 24
Custom Integration Templates
Location: templates/integrations/
Purpose: Templates for common integrations
SolarEdge Integration Template:
# solaredge_integration.py
import requests
import json
from datetime import datetime, timedelta
class SolarEdgeIntegration:
def __init__(self, api_key, site_id):
self.api_key = api_key
self.site_id = site_id
self.base_url = "https://monitoringapi.solaredge.com"
def get_overview_data(self):
"""Fetch site overview data"""
url = f"{self.base_url}/site/{self.site_id}/overview"
params = {"api_key": self.api_key}
response = requests.get(url, params=params)
response.raise_for_status()
return response.json()
def get_power_data(self, start_date, end_date):
"""Fetch power data for date range"""
url = f"{self.base_url}/site/{self.site_id}/power"
params = {
"api_key": self.api_key,
"startTime": start_date.isoformat(),
"endTime": end_date.isoformat()
}
response = requests.get(url, params=params)
response.raise_for_status()
return response.json()
def format_for_ona(self, data):
"""Convert SolarEdge data to Ona format"""
formatted_data = []
for point in data['power']['values']:
formatted_data.append({
'timestamp': point['date'],
'asset_id': f"SE-{self.site_id}",
'power_kw': point['value'] / 1000 if point['value'] else 0,
'data_source': 'solaredge'
})
return formatted_data
Huawei FusionSolar Integration Template:
# huawei_integration.py
import requests
import json
from datetime import datetime
class HuaweiIntegration:
def __init__(self, station_code, token):
self.station_code = station_code
self.token = token
self.base_url = "https://intl.fusionsolar.huawei.com/thirdData"
def get_station_real_kpi(self):
"""Fetch real-time station data"""
url = f"{self.base_url}/getStationRealKpi"
headers = {"XSRF-TOKEN": self.token}
data = {"stationCodes": self.station_code}
response = requests.post(url, headers=headers, json=data)
response.raise_for_status()
return response.json()
def format_for_ona(self, data):
"""Convert Huawei data to Ona format"""
formatted_data = []
for station in data['data']:
formatted_data.append({
'timestamp': datetime.now().isoformat(),
'asset_id': f"HW-{station['stationCode']}",
'power_kw': station['realKpi']['power'] / 1000,
'temperature_c': station['realKpi']['temperature'],
'data_source': 'huawei'
})
return formatted_data
Environment-Specific Overrides
Production Configuration
Location: configs/production.yaml
Purpose: Production-specific settings
# production.yaml - Production environment configuration
environment: "prod"
log_level: "INFO"
# Enhanced monitoring
monitoring:
detailed_logging: true
performance_metrics: true
error_tracking: true
alerting_enabled: true
# Security settings
security:
api_rate_limiting: true
request_validation: true
ssl_enforcement: true
ip_whitelisting: false
# Performance optimization
performance:
lambda_provisioned_concurrency: true
s3_transfer_acceleration: true
cloudfront_distribution: true
caching_enabled: true
# Backup and recovery
backup:
automated_backups: true
cross_region_replication: true
point_in_time_recovery: true
retention_days: 30
# Scaling configuration
scaling:
auto_scaling_enabled: true
min_capacity: 2
max_capacity: 100
target_utilization: 70
Development Configuration
Location: configs/development.yaml
Purpose: Development environment settings
# development.yaml - Development environment configuration
environment: "dev"
log_level: "DEBUG"
# Relaxed monitoring
monitoring:
detailed_logging: true
performance_metrics: false
error_tracking: true
alerting_enabled: false
# Relaxed security
security:
api_rate_limiting: false
request_validation: true
ssl_enforcement: false
ip_whitelisting: false
# Development optimization
performance:
lambda_provisioned_concurrency: false
s3_transfer_acceleration: false
cloudfront_distribution: false
caching_enabled: false
# Minimal backup
backup:
automated_backups: false
cross_region_replication: false
point_in_time_recovery: false
retention_days: 7
# Manual scaling
scaling:
auto_scaling_enabled: false
min_capacity: 1
max_capacity: 10
target_utilization: 50
Configuration Management
Parameter Store Management
# Set configuration parameters
aws ssm put-parameter \
--name "/ona-platform/prod/detection-threshold" \
--value "0.7" \
--type "String" \
--description "Anomaly detection severity threshold" \
--overwrite
aws ssm put-parameter \
--name "/ona-platform/prod/weather-api-key" \
--value "YOUR_API_KEY" \
--type "SecureString" \
--description "Visual Crossing weather API key" \
--overwrite
# Get configuration parameters
aws ssm get-parameter \
--name "/ona-platform/prod/detection-threshold" \
--query 'Parameter.Value' \
--output text
# List all platform parameters
aws ssm describe-parameters \
--filters "Key=Name,Values=/ona-platform/prod/" \
--query 'Parameters[].Name' \
--output table
Configuration Validation
#!/bin/bash
# validate_configuration.sh
echo "Validating Ona Platform configuration..."
# Check required parameters
REQUIRED_PARAMS=(
"/ona-platform/prod/visual-crossing-api-key"
"/ona-platform/prod/detection-threshold"
"/ona-platform/prod/loss-function-weights"
)
for param in "${REQUIRED_PARAMS[@]}"; do
if aws ssm get-parameter --name "$param" &>/dev/null; then
echo "✓ $param"
else
echo "✗ $param (missing)"
exit 1
fi
done
# Validate configuration files
echo "Validating configuration files..."
# Check YAML syntax
python -c "import yaml; yaml.safe_load(open('configs/ooda/categories.yaml'))" && echo "✓ categories.yaml"
python -c "import yaml; yaml.safe_load(open('configs/ooda/loss_function.yaml'))" && echo "✓ loss_function.yaml"
# Check environment variables
source config/environment.sh
if [ -n "$AWS_REGION" ] && [ -n "$API_DOMAIN" ]; then
echo "✓ Environment variables"
else
echo "✗ Environment variables (missing)"
exit 1
fi
echo "Configuration validation completed successfully"
Dynamic Configuration Updates
#!/bin/bash
# update_configuration.sh
CONFIG_TYPE=$1
NEW_VALUE=$2
case $CONFIG_TYPE in
"detection-threshold")
aws ssm put-parameter \
--name "/ona-platform/prod/detection-threshold" \
--value "$NEW_VALUE" \
--type "String" \
--overwrite
echo "Detection threshold updated to $NEW_VALUE"
;;
"loss-weights")
aws ssm put-parameter \
--name "/ona-platform/prod/loss-function-weights" \
--value "$NEW_VALUE" \
--type "String" \
--overwrite
echo "Loss function weights updated"
;;
"weather-api-key")
aws ssm put-parameter \
--name "/ona-platform/prod/visual-crossing-api-key" \
--value "$NEW_VALUE" \
--type "SecureString" \
--overwrite
echo "Weather API key updated"
;;
*)
echo "Usage: $0 <config-type> <new-value>"
echo "Available config types: detection-threshold, loss-weights, weather-api-key"
exit 1
;;
esac
5. Monitoring & Observability
CloudWatch Integration
Log Groups Configuration
Purpose: Centralized logging for all platform services
# Create log groups for all services
for service in dataIngestion weatherCache interpolationService globalTrainingService forecastingApi; do
aws logs create-log-group \
--log-group-name "/aws/lambda/ona-${service}-prod" \
--region af-south-1 \
--retention-in-days 30
done
# Create custom log groups for OODA operations
aws logs create-log-group \
--log-group-name "/aws/lambda/ona-detect-prod" \
--region af-south-1 \
--retention-in-days 30
aws logs create-log-group \
--log-group-name "/aws/lambda/ona-diagnose-prod" \
--region af-south-1 \
--retention-in-days 30
aws logs create-log-group \
--log-group-name "/aws/lambda/ona-schedule-prod" \
--region af-south-1 \
--retention-in-days 30
aws logs create-log-group \
--log-group-name "/aws/lambda/ona-order-prod" \
--region af-south-1 \
--retention-in-days 30
Custom Metrics
Purpose: Track business-specific metrics beyond standard AWS metrics
# custom_metrics.py - Custom metrics collection
import boto3
import json
from datetime import datetime
class OnaMetrics:
def __init__(self):
self.cloudwatch = boto3.client('cloudwatch')
self.namespace = 'Ona/Platform'
def put_detection_metrics(self, asset_id, severity, detection_time_ms):
"""Record detection performance metrics"""
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'DetectionLatency',
'Dimensions': [
{'Name': 'AssetId', 'Value': asset_id},
{'Name': 'Service', 'Value': 'detect'}
],
'Value': detection_time_ms,
'Unit': 'Milliseconds',
'Timestamp': datetime.utcnow()
},
{
'MetricName': 'DetectionSeverity',
'Dimensions': [
{'Name': 'AssetId', 'Value': asset_id},
{'Name': 'Service', 'Value': 'detect'}
],
'Value': severity,
'Unit': 'None',
'Timestamp': datetime.utcnow()
}
]
)
def put_forecast_metrics(self, customer_id, accuracy, horizon_hours):
"""Record forecast accuracy metrics"""
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'ForecastAccuracy',
'Dimensions': [
{'Name': 'CustomerId', 'Value': customer_id},
{'Name': 'HorizonHours', 'Value': str(horizon_hours)}
],
'Value': accuracy,
'Unit': 'Percent',
'Timestamp': datetime.utcnow()
}
]
)
def put_ear_metrics(self, asset_id, ear_usd_day, risk_category):
"""Record Energy-at-Risk metrics"""
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'EnergyAtRisk',
'Dimensions': [
{'Name': 'AssetId', 'Value': asset_id},
{'Name': 'RiskCategory', 'Value': risk_category}
],
'Value': ear_usd_day,
'Unit': 'None',
'Timestamp': datetime.utcnow()
}
]
)
# Usage example
metrics = OnaMetrics()
metrics.put_detection_metrics('INV-001', 0.8, 250)
metrics.put_forecast_metrics('customer-001', 94.2, 48)
metrics.put_ear_metrics('INV-001', 18.24, 'high')
CloudWatch Dashboards
Purpose: Visual monitoring of platform health and performance
# Create comprehensive dashboard
aws cloudwatch put-dashboard \
--dashboard-name "Ona-Platform-Overview" \
--dashboard-body '{
"widgets": [
{
"type": "metric",
"x": 0,
"y": 0,
"width": 12,
"height": 6,
"properties": {
"metrics": [
["AWS/Lambda", "Invocations", "FunctionName", "ona-dataIngestion-prod"],
["AWS/Lambda", "Invocations", "FunctionName", "ona-weatherCache-prod"],
["AWS/Lambda", "Invocations", "FunctionName", "ona-interpolationService-prod"],
["AWS/Lambda", "Invocations", "FunctionName", "ona-globalTrainingService-prod"],
["AWS/Lambda", "Invocations", "FunctionName", "ona-forecastingApi-prod"]
],
"view": "timeSeries",
"stacked": false,
"region": "af-south-1",
"title": "Lambda Invocations",
"period": 300
}
},
{
"type": "metric",
"x": 12,
"y": 0,
"width": 12,
"height": 6,
"properties": {
"metrics": [
["AWS/Lambda", "Errors", "FunctionName", "ona-dataIngestion-prod"],
["AWS/Lambda", "Errors", "FunctionName", "ona-weatherCache-prod"],
["AWS/Lambda", "Errors", "FunctionName", "ona-interpolationService-prod"],
["AWS/Lambda", "Errors", "FunctionName", "ona-globalTrainingService-prod"],
["AWS/Lambda", "Errors", "FunctionName", "ona-forecastingApi-prod"]
],
"view": "timeSeries",
"stacked": false,
"region": "af-south-1",
"title": "Lambda Errors",
"period": 300
}
},
{
"type": "metric",
"x": 0,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"metrics": [
["AWS/Lambda", "Duration", "FunctionName", "ona-dataIngestion-prod"],
["AWS/Lambda", "Duration", "FunctionName", "ona-weatherCache-prod"],
["AWS/Lambda", "Duration", "FunctionName", "ona-interpolationService-prod"],
["AWS/Lambda", "Duration", "FunctionName", "ona-globalTrainingService-prod"],
["AWS/Lambda", "Duration", "FunctionName", "ona-forecastingApi-prod"]
],
"view": "timeSeries",
"stacked": false,
"region": "af-south-1",
"title": "Lambda Duration",
"period": 300
}
},
{
"type": "metric",
"x": 12,
"y": 6,
"width": 12,
"height": 6,
"properties": {
"metrics": [
["Ona/Platform", "DetectionLatency"],
["Ona/Platform", "ForecastAccuracy"],
["Ona/Platform", "EnergyAtRisk"]
],
"view": "timeSeries",
"stacked": false,
"region": "af-south-1",
"title": "Business Metrics",
"period": 300
}
}
]
}'
Logging and Metrics
Structured Logging
Purpose: Consistent, searchable log format across all services
# structured_logging.py - Standardized logging format
import json
import logging
import boto3
from datetime import datetime
from typing import Dict, Any
class OnaLogger:
def __init__(self, service_name: str, log_level: str = "INFO"):
self.service_name = service_name
self.logger = logging.getLogger(service_name)
self.logger.setLevel(getattr(logging, log_level.upper()))
# Create CloudWatch handler
handler = logging.StreamHandler()
formatter = logging.Formatter('%(message)s')
handler.setFormatter(formatter)
self.logger.addHandler(handler)
def _format_log(self, level: str, message: str, **kwargs) -> str:
"""Format log entry as structured JSON"""
log_entry = {
"timestamp": datetime.utcnow().isoformat(),
"level": level,
"service": self.service_name,
"message": message,
**kwargs
}
return json.dumps(log_entry)
def info(self, message: str, **kwargs):
"""Log info level message"""
self.logger.info(self._format_log("INFO", message, **kwargs))
def warning(self, message: str, **kwargs):
"""Log warning level message"""
self.logger.warning(self._format_log("WARNING", message, **kwargs))
def error(self, message: str, **kwargs):
"""Log error level message"""
self.logger.error(self._format_log("ERROR", message, **kwargs))
def debug(self, message: str, **kwargs):
"""Log debug level message"""
self.logger.debug(self._format_log("DEBUG", message, **kwargs))
def log_detection(self, asset_id: str, severity: float, detection_time_ms: int):
"""Log detection event"""
self.info(
"Fault detection completed",
event_type="detection",
asset_id=asset_id,
severity=severity,
detection_time_ms=detection_time_ms
)
def log_forecast(self, customer_id: str, asset_id: str, accuracy: float):
"""Log forecast generation"""
self.info(
"Forecast generated",
event_type="forecast",
customer_id=customer_id,
asset_id=asset_id,
accuracy=accuracy
)
def log_ear_calculation(self, asset_id: str, ear_usd_day: float, risk_category: str):
"""Log Energy-at-Risk calculation"""
self.info(
"Energy-at-Risk calculated",
event_type="ear_calculation",
asset_id=asset_id,
ear_usd_day=ear_usd_day,
risk_category=risk_category
)
# Usage example
logger = OnaLogger("ona-detect-prod")
logger.log_detection("INV-001", 0.8, 250)
logger.log_forecast("customer-001", "INV-001", 94.2)
logger.log_ear_calculation("INV-001", 18.24, "high")
Performance Metrics Collection
Purpose: Track system performance and identify bottlenecks
# performance_metrics.py - Performance monitoring
import time
import psutil
import boto3
from functools import wraps
from typing import Callable, Any
class PerformanceMonitor:
def __init__(self):
self.cloudwatch = boto3.client('cloudwatch')
self.namespace = 'Ona/Performance'
def measure_execution_time(self, func: Callable) -> Callable:
"""Decorator to measure function execution time"""
@wraps(func)
def wrapper(*args, **kwargs) -> Any:
start_time = time.time()
try:
result = func(*args, **kwargs)
execution_time = (time.time() - start_time) * 1000 # Convert to milliseconds
# Log successful execution
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'FunctionExecutionTime',
'Dimensions': [
{'Name': 'FunctionName', 'Value': func.__name__},
{'Name': 'Status', 'Value': 'Success'}
],
'Value': execution_time,
'Unit': 'Milliseconds'
}
]
)
return result
except Exception as e:
execution_time = (time.time() - start_time) * 1000
# Log failed execution
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'FunctionExecutionTime',
'Dimensions': [
{'Name': 'FunctionName', 'Value': func.__name__},
{'Name': 'Status', 'Value': 'Error'}
],
'Value': execution_time,
'Unit': 'Milliseconds'
}
]
)
raise e
return wrapper
def collect_system_metrics(self):
"""Collect system-level performance metrics"""
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
disk = psutil.disk_usage('/')
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=[
{
'MetricName': 'CPUUtilization',
'Value': cpu_percent,
'Unit': 'Percent'
},
{
'MetricName': 'MemoryUtilization',
'Value': memory.percent,
'Unit': 'Percent'
},
{
'MetricName': 'DiskUtilization',
'Value': (disk.used / disk.total) * 100,
'Unit': 'Percent'
}
]
)
# Usage example
monitor = PerformanceMonitor()
@monitor.measure_execution_time
def process_data(data):
# Simulate data processing
time.sleep(0.1)
return len(data)
# Collect system metrics
monitor.collect_system_metrics()
Alerting and Notifications
CloudWatch Alarms Configuration
Purpose: Proactive monitoring with automated alerts
# Create comprehensive alarm set
# Format: name:namespace:metric:dimension_name:dimension_value:threshold:comparison:evaluation_periods
# (custom-metric alarms without dimensions leave the two dimension fields empty)
ALARM_CONFIGS=(
"ona-dataIngestion-error-rate:AWS/Lambda:Errors:FunctionName:ona-dataIngestion-prod:5:GreaterThanThreshold:2"
"ona-weatherCache-error-rate:AWS/Lambda:Errors:FunctionName:ona-weatherCache-prod:3:GreaterThanThreshold:2"
"ona-interpolationService-duration:AWS/Lambda:Duration:FunctionName:ona-interpolationService-prod:600000:GreaterThanThreshold:1"
"ona-forecastingApi-duration:AWS/Lambda:Duration:FunctionName:ona-forecastingApi-prod:45000:GreaterThanThreshold:1"
"ona-platform-detection-latency:Ona/Platform:DetectionLatency:::5000:GreaterThanThreshold:1"
"ona-platform-forecast-accuracy:Ona/Platform:ForecastAccuracy:::80:LessThanThreshold:2"
)
for alarm_config in "${ALARM_CONFIGS[@]}"; do
IFS=':' read -r alarm_name namespace metric_name dimension_name dimension_value threshold comparison_operator evaluation_periods <<< "$alarm_config"
# Only pass --dimensions when a dimension is defined for this alarm
dimension_args=()
if [ -n "$dimension_name" ]; then
dimension_args=(--dimensions Name="$dimension_name",Value="$dimension_value")
fi
aws cloudwatch put-metric-alarm \
--alarm-name "$alarm_name" \
--alarm-description "Alert for $alarm_name" \
--metric-name "$metric_name" \
--namespace "$namespace" \
--statistic Average \
--period 300 \
--threshold "$threshold" \
--comparison-operator "$comparison_operator" \
"${dimension_args[@]}" \
--evaluation-periods "$evaluation_periods" \
--alarm-actions "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts" \
--region af-south-1
done
SNS Alert Configuration
Purpose: Multi-channel notification system
# Create SNS topic
aws sns create-topic --name ona-platform-alerts --region af-south-1
# Subscribe to email notifications
aws sns subscribe \
--topic-arn "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts" \
--protocol email \
--notification-endpoint "admin@yourcompany.com" \
--region af-south-1
# Subscribe to Slack webhook
aws sns subscribe \
--topic-arn "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts" \
--protocol https \
--notification-endpoint "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
--region af-south-1
# Subscribe to PagerDuty
aws sns subscribe \
--topic-arn "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts" \
--protocol https \
--notification-endpoint "https://events.pagerduty.com/integration/YOUR_INTEGRATION_KEY/enqueue" \
--region af-south-1
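To confirm the topic fans out to every subscribed channel, a test message can be published to it. This is a quick smoke test rather than part of the setup; ACCOUNT is the same placeholder used in the commands above.
# Publish a test message to exercise the email, Slack, and PagerDuty subscriptions
aws sns publish \
--topic-arn "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts" \
--subject "Test: Ona Platform alerting" \
--message "Test notification from the ona-platform-alerts topic" \
--region af-south-1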
Custom Alert Processing
Purpose: Intelligent alert filtering and escalation
# alert_processor.py - Smart alert processing
import json
import boto3
from datetime import datetime, timedelta
from typing import Dict, List
class AlertProcessor:
def __init__(self):
self.sns = boto3.client('sns')
self.cloudwatch = boto3.client('cloudwatch')
self.dynamodb = boto3.client('dynamodb')
self.topic_arn = "arn:aws:sns:af-south-1:ACCOUNT:ona-platform-alerts"
def process_alert(self, alarm_name: str, alarm_state: str, alarm_reason: str):
"""Process incoming CloudWatch alarm"""
alert_data = {
"alarm_name": alarm_name,
"state": alarm_state,
"reason": alarm_reason,
"timestamp": datetime.utcnow().isoformat(),
"severity": self._determine_severity(alarm_name, alarm_state),
"escalation_level": self._get_escalation_level(alarm_name)
}
# Check for alert suppression
if self._should_suppress_alert(alarm_name):
return
# Check for alert frequency
if self._is_alert_flooding(alarm_name):
self._escalate_alert(alert_data)
return
# Send appropriate notification
self._send_notification(alert_data)
# Log alert for tracking
self._log_alert(alert_data)
def _determine_severity(self, alarm_name: str, alarm_state: str) -> str:
"""Determine alert severity based on alarm type and state"""
if "error" in alarm_name.lower():
return "critical" if alarm_state == "ALARM" else "warning"
elif "duration" in alarm_name.lower():
return "high" if alarm_state == "ALARM" else "medium"
elif "accuracy" in alarm_name.lower():
return "medium" if alarm_state == "ALARM" else "low"
else:
return "medium"
def _get_escalation_level(self, alarm_name: str) -> int:
"""Determine escalation level for alert"""
critical_alarms = ["error-rate", "duration"]
if any(keyword in alarm_name.lower() for keyword in critical_alarms):
return 1 # Immediate escalation
else:
return 2 # Standard escalation
def _should_suppress_alert(self, alarm_name: str) -> bool:
"""Check if alert should be suppressed (e.g., during maintenance)"""
# Check maintenance window
current_time = datetime.utcnow()
maintenance_start = current_time.replace(hour=2, minute=0, second=0, microsecond=0)
maintenance_end = maintenance_start + timedelta(hours=2)
if maintenance_start <= current_time <= maintenance_end:
return True
return False
def _is_alert_flooding(self, alarm_name: str) -> bool:
"""Check if too many alerts are being sent for the same alarm"""
# Check alert frequency in the last hour
end_time = datetime.utcnow()
start_time = end_time - timedelta(hours=1)
response = self.cloudwatch.get_metric_statistics(
Namespace='AWS/SNS',
MetricName='NumberOfNotificationsPublished',
Dimensions=[
{'Name': 'TopicName', 'Value': 'ona-platform-alerts'}
],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Sum']
)
if response['Datapoints']:
notification_count = response['Datapoints'][0]['Sum']
return notification_count > 10 # Threshold for alert flooding
return False
def _send_notification(self, alert_data: Dict):
"""Send notification via SNS"""
message = {
"default": json.dumps(alert_data),
"email": f"""
Ona Platform Alert
Alarm: {alert_data['alarm_name']}
State: {alert_data['state']}
Severity: {alert_data['severity']}
Time: {alert_data['timestamp']}
Reason: {alert_data['reason']}
Please investigate immediately.
""",
"slack": json.dumps({
"text": f"🚨 Ona Platform Alert",
"attachments": [
{
"color": "danger" if alert_data['severity'] == 'critical' else "warning",
"fields": [
{"title": "Alarm", "value": alert_data['alarm_name'], "short": True},
{"title": "State", "value": alert_data['state'], "short": True},
{"title": "Severity", "value": alert_data['severity'], "short": True},
{"title": "Time", "value": alert_data['timestamp'], "short": True},
{"title": "Reason", "value": alert_data['reason'], "short": False}
]
}
]
})
}
self.sns.publish(
TopicArn=self.topic_arn,
Message=json.dumps(message),
MessageStructure='json'
)
def _escalate_alert(self, alert_data: Dict):
"""Escalate alert to higher-level notification"""
escalated_message = f"""
ESCALATED ALERT - Ona Platform
Multiple alerts received for: {alert_data['alarm_name']}
Severity: {alert_data['severity']}
Time: {alert_data['timestamp']}
This alert is being escalated due to high frequency.
Immediate attention required.
"""
self.sns.publish(
TopicArn=self.topic_arn,
Message=escalated_message,
Subject="ESCALATED: Ona Platform Alert"
)
def _log_alert(self, alert_data: Dict):
"""Log alert for tracking and analysis"""
self.dynamodb.put_item(
TableName='ona-platform-alert-log',
Item={
'alert_id': {'S': f"{alert_data['alarm_name']}-{alert_data['timestamp']}"},
'alarm_name': {'S': alert_data['alarm_name']},
'state': {'S': alert_data['state']},
'severity': {'S': alert_data['severity']},
'timestamp': {'S': alert_data['timestamp']},
'reason': {'S': alert_data['reason']},
'escalation_level': {'N': str(alert_data['escalation_level'])}
}
)
Performance Monitoring
Real-Time Performance Tracking
Purpose: Continuous monitoring of system performance
# performance_tracker.py - Real-time performance monitoring
import time
import threading
import boto3
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Dict, Optional
class PerformanceTracker:
def __init__(self):
self.cloudwatch = boto3.client('cloudwatch')
self.namespace = 'Ona/Performance'
self.metrics_buffer = defaultdict(lambda: deque(maxlen=100))
self.start_time = time.time()
# Start background thread for metrics collection
self.monitoring_thread = threading.Thread(target=self._collect_metrics_loop)
self.monitoring_thread.daemon = True
self.monitoring_thread.start()
def _collect_metrics_loop(self):
"""Background thread for continuous metrics collection"""
while True:
try:
self._collect_system_metrics()
self._collect_application_metrics()
self._flush_metrics_buffer()
time.sleep(60) # Collect metrics every minute
except Exception as e:
print(f"Error in metrics collection: {e}")
time.sleep(60)
def _collect_system_metrics(self):
"""Collect system-level performance metrics"""
import psutil
# CPU metrics
cpu_percent = psutil.cpu_percent(interval=1)
self._add_metric('CPUUtilization', cpu_percent)
# Memory metrics
memory = psutil.virtual_memory()
self._add_metric('MemoryUtilization', memory.percent)
self._add_metric('MemoryAvailable', memory.available / (1024**3)) # GB
# Disk metrics
disk = psutil.disk_usage('/')
disk_percent = (disk.used / disk.total) * 100
self._add_metric('DiskUtilization', disk_percent)
# Network metrics
network = psutil.net_io_counters()
self._add_metric('NetworkBytesSent', network.bytes_sent)
self._add_metric('NetworkBytesReceived', network.bytes_recv)
def _collect_application_metrics(self):
"""Collect application-specific performance metrics"""
# Track request rates
self._add_metric('RequestRate', self._calculate_request_rate())
# Track response times
self._add_metric('AverageResponseTime', self._calculate_avg_response_time())
# Track error rates
self._add_metric('ErrorRate', self._calculate_error_rate())
# Track active connections
self._add_metric('ActiveConnections', self._count_active_connections())
    def _add_metric(self, metric_name: str, value: float, dimensions: Optional[Dict] = None):
"""Add metric to buffer"""
metric_key = f"{metric_name}:{dimensions or ''}"
self.metrics_buffer[metric_key].append({
'timestamp': datetime.utcnow(),
'value': value,
'dimensions': dimensions or {}
})
def _flush_metrics_buffer(self):
"""Flush metrics buffer to CloudWatch"""
metric_data = []
for metric_key, data_points in self.metrics_buffer.items():
if not data_points:
continue
metric_name = metric_key.split(':')[0]
dimensions = data_points[0]['dimensions']
# Calculate statistics for the time window
values = [dp['value'] for dp in data_points]
avg_value = sum(values) / len(values)
max_value = max(values)
min_value = min(values)
# Add average metric
metric_data.append({
'MetricName': metric_name,
'Dimensions': [{'Name': k, 'Value': v} for k, v in dimensions.items()],
'Value': avg_value,
'Unit': 'None',
'Timestamp': datetime.utcnow()
})
# Add max metric
metric_data.append({
'MetricName': f"{metric_name}Max",
'Dimensions': [{'Name': k, 'Value': v} for k, v in dimensions.items()],
'Value': max_value,
'Unit': 'None',
'Timestamp': datetime.utcnow()
})
# Send metrics to CloudWatch in batches
if metric_data:
try:
self.cloudwatch.put_metric_data(
Namespace=self.namespace,
MetricData=metric_data
)
# Clear buffer after successful send
self.metrics_buffer.clear()
except Exception as e:
print(f"Error sending metrics to CloudWatch: {e}")
def _calculate_request_rate(self) -> float:
"""Calculate requests per second"""
# This would be implemented based on your application's request tracking
return 0.0
def _calculate_avg_response_time(self) -> float:
"""Calculate average response time"""
# This would be implemented based on your application's response time tracking
return 0.0
def _calculate_error_rate(self) -> float:
"""Calculate error rate percentage"""
# This would be implemented based on your application's error tracking
return 0.0
def _count_active_connections(self) -> int:
"""Count active connections"""
# This would be implemented based on your application's connection tracking
return 0
def track_function_performance(self, function_name: str, execution_time: float, success: bool):
"""Track individual function performance"""
self._add_metric('FunctionExecutionTime', execution_time, {
'FunctionName': function_name,
'Status': 'Success' if success else 'Error'
})
if success:
self._add_metric('FunctionSuccessRate', 1.0, {'FunctionName': function_name})
else:
self._add_metric('FunctionSuccessRate', 0.0, {'FunctionName': function_name})
# Usage example
tracker = PerformanceTracker()
tracker.track_function_performance('process_data', 150.5, True)
Performance Analysis and Reporting
Purpose: Generate performance reports and identify trends
#!/bin/bash
# performance_report.sh - Generate performance analysis report
REPORT_DATE=$(date +%Y-%m-%d)
REPORT_FILE="performance_report_${REPORT_DATE}.txt"
echo "Ona Platform Performance Report - $REPORT_DATE" > $REPORT_FILE
echo "===============================================" >> $REPORT_FILE
echo "" >> $REPORT_FILE
# Lambda performance analysis
echo "Lambda Function Performance:" >> $REPORT_FILE
echo "----------------------------" >> $REPORT_FILE
for service in dataIngestion weatherCache interpolationService globalTrainingService forecastingApi; do
echo "Service: ona-${service}-prod" >> $REPORT_FILE
# Get average duration
avg_duration=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Duration \
--dimensions Name=FunctionName,Value=ona-${service}-prod \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average \
--query 'Datapoints[0].Average' \
--output text)
# Get error count
error_count=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Errors \
--dimensions Name=FunctionName,Value=ona-${service}-prod \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' \
--output text)
# Get invocation count
invocation_count=$(aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Invocations \
--dimensions Name=FunctionName,Value=ona-${service}-prod \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' \
--output text)
echo " Average Duration: ${avg_duration:-0}ms" >> $REPORT_FILE
echo " Error Count: ${error_count:-0}" >> $REPORT_FILE
echo " Invocation Count: ${invocation_count:-0}" >> $REPORT_FILE
echo " Error Rate: $(echo "scale=2; ${error_count:-0} * 100 / ${invocation_count:-1}" | bc)%" >> $REPORT_FILE
echo "" >> $REPORT_FILE
done
# Business metrics analysis
echo "Business Metrics:" >> $REPORT_FILE
echo "----------------" >> $REPORT_FILE
# Detection performance
avg_detection_latency=$(aws cloudwatch get-metric-statistics \
--namespace Ona/Platform \
--metric-name DetectionLatency \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average \
--query 'Datapoints[0].Average' \
--output text)
echo "Average Detection Latency: ${avg_detection_latency:-0}ms" >> $REPORT_FILE
# Forecast accuracy
avg_forecast_accuracy=$(aws cloudwatch get-metric-statistics \
--namespace Ona/Platform \
--metric-name ForecastAccuracy \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Average \
--query 'Datapoints[0].Average' \
--output text)
echo "Average Forecast Accuracy: ${avg_forecast_accuracy:-0}%" >> $REPORT_FILE
echo "" >> $REPORT_FILE
echo "Report generated at: $(date)" >> $REPORT_FILE
# Send report via email
aws ses send-email \
--source "reports@yourcompany.com" \
--destination "ToAddresses=admin@yourcompany.com" \
--message "Subject={Data='Ona Platform Performance Report - $REPORT_DATE'},Body={Text={Data='Performance report generated: $REPORT_FILE'}}" \
--region af-south-1
echo "Performance report generated: $REPORT_FILE"
6. Security & Compliance
Security Architecture
Network Security
Purpose: Secure network communication and access control
# Configure VPC security groups
aws ec2 create-security-group \
--group-name ona-platform-lambda-sg \
--description "Security group for Ona Platform Lambda functions" \
--vpc-id vpc-12345678
# Allow HTTPS outbound for API calls
aws ec2 authorize-security-group-egress \
--group-id sg-12345678 \
--protocol tcp \
--port 443 \
--cidr 0.0.0.0/0
# Allow DynamoDB access via the gateway VPC endpoint's managed prefix list
# (DynamoDB has no security group to reference; pl-xxxxxxxx is a placeholder for the
# com.amazonaws.af-south-1.dynamodb prefix list ID)
aws ec2 authorize-security-group-egress \
--group-id sg-12345678 \
--ip-permissions '[{"IpProtocol":"tcp","FromPort":443,"ToPort":443,"PrefixListIds":[{"PrefixListId":"pl-xxxxxxxx"}]}]'
# Configure API Gateway stage-level throttling (the /*/* path applies to all methods)
aws apigateway update-stage \
--rest-api-id api12345678 \
--stage-name prod \
--patch-operations '[
{
"op": "replace",
"path": "/*/*/throttling/burstLimit",
"value": "1000"
},
{
"op": "replace",
"path": "/*/*/throttling/rateLimit",
"value": "500"
}
]'
IAM Security Policies
Purpose: Principle of least privilege access control
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "OnaPlatformS3Access",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::sa-api-client-input/*",
"arn:aws:s3:::sa-api-client-output/*"
],
"Condition": {
"StringEquals": {
"s3:x-amz-server-side-encryption": "AES256"
}
}
},
{
"Sid": "OnaPlatformDynamoDBAccess",
"Effect": "Allow",
"Action": [
"dynamodb:GetItem",
"dynamodb:PutItem",
"dynamodb:UpdateItem",
"dynamodb:Query",
"dynamodb:Scan"
],
"Resource": [
"arn:aws:dynamodb:af-south-1:*:table/ona-platform-locations",
"arn:aws:dynamodb:af-south-1:*:table/ona-platform-weather-cache"
]
},
{
"Sid": "OnaPlatformCloudWatchAccess",
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"cloudwatch:PutMetricData"
],
"Resource": "*"
}
]
}
Encryption Configuration
Purpose: Data encryption at rest and in transit
# Enable S3 bucket encryption
aws s3api put-bucket-encryption \
--bucket sa-api-client-input \
--server-side-encryption-configuration '{
"Rules": [
{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "AES256"
}
}
]
}'
# Enable DynamoDB encryption
aws dynamodb update-table \
--table-name ona-platform-locations \
--sse-specification Enabled=true,SSEType=KMS \
--region af-south-1
# Configure Lambda environment variable encryption
aws lambda update-function-configuration \
--function-name ona-dataIngestion-prod \
--kms-key-arn arn:aws:kms:af-south-1:ACCOUNT:key/key-id \
--region af-south-1
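As a quick verification step (not part of the original setup), the encryption settings applied above can be read back:
# Confirm default encryption on the input bucket and SSE on the locations table
aws s3api get-bucket-encryption --bucket sa-api-client-input
aws dynamodb describe-table \
--table-name ona-platform-locations \
--query 'Table.SSEDescription' \
--region af-south-1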
Data Protection
Data Classification and Handling
Purpose: Classify and protect sensitive data appropriately
# data_classification.py - Data classification and protection
import boto3
import json
from datetime import datetime
from enum import Enum
from typing import Dict, List
class DataClassification(Enum):
PUBLIC = "public"
INTERNAL = "internal"
CONFIDENTIAL = "confidential"
RESTRICTED = "restricted"
class DataProtection:
def __init__(self):
        self.s3 = boto3.client('s3')
        self.kms = boto3.client('kms')
        self.cloudwatch = boto3.client('cloudwatch')  # used by audit_data_access
self.classification_rules = {
'customer_id': DataClassification.CONFIDENTIAL,
'asset_id': DataClassification.INTERNAL,
'power_kw': DataClassification.INTERNAL,
'temperature_c': DataClassification.INTERNAL,
'api_key': DataClassification.RESTRICTED,
'financial_data': DataClassification.RESTRICTED
}
def classify_data(self, data: Dict) -> DataClassification:
"""Classify data based on content"""
highest_classification = DataClassification.PUBLIC
for key, value in data.items():
if key in self.classification_rules:
classification = self.classification_rules[key]
if self._is_higher_classification(classification, highest_classification):
highest_classification = classification
return highest_classification
def _is_higher_classification(self, new: DataClassification, current: DataClassification) -> bool:
"""Check if new classification is higher than current"""
classification_order = {
DataClassification.PUBLIC: 1,
DataClassification.INTERNAL: 2,
DataClassification.CONFIDENTIAL: 3,
DataClassification.RESTRICTED: 4
}
return classification_order[new] > classification_order[current]
def apply_data_protection(self, data: Dict, classification: DataClassification) -> Dict:
"""Apply appropriate data protection measures"""
protected_data = data.copy()
if classification in [DataClassification.CONFIDENTIAL, DataClassification.RESTRICTED]:
# Mask sensitive fields
protected_data = self._mask_sensitive_fields(protected_data)
# Add encryption metadata
protected_data['_encryption'] = {
'required': True,
'algorithm': 'AES-256',
'classification': classification.value
}
return protected_data
def _mask_sensitive_fields(self, data: Dict) -> Dict:
"""Mask sensitive fields in data"""
sensitive_fields = ['api_key', 'password', 'secret']
masked_data = data.copy()
for field in sensitive_fields:
if field in masked_data:
masked_data[field] = '***MASKED***'
return masked_data
def audit_data_access(self, user_id: str, data_classification: DataClassification, action: str):
"""Audit data access for compliance"""
audit_log = {
'timestamp': datetime.utcnow().isoformat(),
'user_id': user_id,
'data_classification': data_classification.value,
'action': action,
'compliance_status': 'compliant'
}
# Log to CloudWatch for audit trail
self.cloudwatch.put_metric_data(
Namespace='Ona/Security',
MetricData=[
{
'MetricName': 'DataAccessAudit',
'Dimensions': [
{'Name': 'Classification', 'Value': data_classification.value},
{'Name': 'Action', 'Value': action}
],
'Value': 1,
'Unit': 'Count'
}
]
)
# Usage example
protection = DataProtection()
data = {'customer_id': 'CUST-001', 'power_kw': 150.5, 'api_key': 'secret123'}
classification = protection.classify_data(data)
protected_data = protection.apply_data_protection(data, classification)
protection.audit_data_access('user-123', classification, 'read')
Data Retention and Deletion
Purpose: Implement data lifecycle management for compliance
#!/bin/bash
# data_retention_manager.sh - Automated data retention management
RETENTION_POLICIES=(
"customer_data:7:years"
"sensor_data:3:years"
"forecast_data:1:year"
"log_data:90:days"
"audit_data:7:years"
)
for policy in "${RETENTION_POLICIES[@]}"; do
IFS=':' read -r data_type retention_period retention_unit <<< "$policy"
echo "Processing retention policy for $data_type: $retention_period $retention_unit"
# Calculate cutoff date
case $retention_unit in
"days")
cutoff_date=$(date -d "$retention_period days ago" +%Y-%m-%d)
;;
"years")
cutoff_date=$(date -d "$retention_period years ago" +%Y-%m-%d)
;;
esac
# Delete old data
# (note: the --include filters below only match keys that embed the cutoff date string,
# which assumes date-stamped key names; S3 lifecycle rules are the more robust mechanism)
case $data_type in
"customer_data")
aws s3 rm s3://sa-api-client-input/customers/ --recursive --exclude "*" --include "*$cutoff_date*"
;;
"sensor_data")
aws s3 rm s3://sa-api-client-input/observations/ --recursive --exclude "*" --include "*$cutoff_date*"
;;
"forecast_data")
aws s3 rm s3://sa-api-client-output/forecasts/ --recursive --exclude "*" --include "*$cutoff_date*"
;;
"log_data")
aws logs delete-log-group --log-group-name "/aws/lambda/ona-old-logs-$cutoff_date"
;;
esac
echo "Retention policy applied for $data_type"
done
Access Controls
Multi-Factor Authentication
Purpose: Enhanced authentication security
# Enable MFA for IAM users
aws iam create-virtual-mfa-device \
--virtual-mfa-device-name ona-platform-mfa \
--outfile mfa-qr-code.png \
--bootstrap-method QRCodePNG
# Allow the user to manage their own password (prerequisite step before enforcing MFA)
aws iam attach-user-policy \
--user-name ona-platform-admin \
--policy-arn arn:aws:iam::aws:policy/IAMUserChangePassword
# Create MFA-protected policy
aws iam create-policy \
--policy-name OnaPlatformMFAPolicy \
--policy-document '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": "*",
"Condition": {
"BoolIfExists": {
"aws:MultiFactorAuthPresent": "false"
}
}
}
]
}'
API Access Control
Purpose: Secure API access with authentication and authorization
# api_security.py - API security implementation
import jwt
import hashlib
import hmac
import time
from datetime import datetime, timedelta
from typing import Dict, List, Optional
class APISecurity:
def __init__(self, secret_key: str):
self.secret_key = secret_key
self.token_expiry = 3600 # 1 hour
def generate_api_key(self, customer_id: str, permissions: List[str]) -> str:
"""Generate API key for customer"""
payload = {
'customer_id': customer_id,
'permissions': permissions,
'issued_at': datetime.utcnow().isoformat(),
'expires_at': (datetime.utcnow() + timedelta(days=365)).isoformat()
}
api_key = jwt.encode(payload, self.secret_key, algorithm='HS256')
return api_key
def validate_api_key(self, api_key: str) -> Optional[Dict]:
"""Validate API key and return customer info"""
try:
payload = jwt.decode(api_key, self.secret_key, algorithms=['HS256'])
# Check expiration
expires_at = datetime.fromisoformat(payload['expires_at'])
if datetime.utcnow() > expires_at:
return None
return {
'customer_id': payload['customer_id'],
'permissions': payload['permissions'],
'issued_at': payload['issued_at']
}
except jwt.InvalidTokenError:
return None
def check_permission(self, api_key: str, required_permission: str) -> bool:
"""Check if API key has required permission"""
customer_info = self.validate_api_key(api_key)
if not customer_info:
return False
return required_permission in customer_info['permissions']
def generate_request_signature(self, method: str, path: str, body: str, timestamp: str) -> str:
"""Generate request signature for API authentication"""
message = f"{method}\n{path}\n{body}\n{timestamp}"
signature = hmac.new(
self.secret_key.encode(),
message.encode(),
hashlib.sha256
).hexdigest()
return signature
def validate_request_signature(self, method: str, path: str, body: str, timestamp: str, signature: str) -> bool:
"""Validate request signature"""
expected_signature = self.generate_request_signature(method, path, body, timestamp)
return hmac.compare_digest(signature, expected_signature)
def rate_limit_check(self, api_key: str, endpoint: str) -> bool:
"""Check rate limiting for API key"""
# This would integrate with Redis or DynamoDB for rate limiting
# For now, return True (no rate limiting)
return True
# Usage example
security = APISecurity("your-secret-key")
api_key = security.generate_api_key("customer-001", ["read", "write"])
customer_info = security.validate_api_key(api_key)
has_permission = security.check_permission(api_key, "read")
Compliance Features
SAWEM Compliance
Purpose: South African Wind Energy Market compliance requirements
# sawem_compliance.py - SAWEM compliance implementation
import json
from datetime import datetime, timedelta
from typing import Dict, List
class SAWEMCompliance:
def __init__(self):
self.compliance_rules = {
'data_retention_years': 7,
'audit_trail_required': True,
'reporting_frequency': 'monthly',
'data_accuracy_threshold': 95.0,
'response_time_sla': 300 # seconds
}
def generate_compliance_report(self, start_date: datetime, end_date: datetime) -> Dict:
"""Generate SAWEM compliance report"""
report = {
'report_period': {
'start': start_date.isoformat(),
'end': end_date.isoformat()
},
'compliance_status': 'compliant',
'data_quality': self._assess_data_quality(start_date, end_date),
'performance_metrics': self._assess_performance_metrics(start_date, end_date),
'audit_trail': self._generate_audit_trail(start_date, end_date),
'recommendations': []
}
# Check compliance violations
violations = self._check_compliance_violations(report)
if violations:
report['compliance_status'] = 'non-compliant'
report['violations'] = violations
return report
def _assess_data_quality(self, start_date: datetime, end_date: datetime) -> Dict:
"""Assess data quality metrics"""
return {
'completeness_percentage': 98.5,
'accuracy_percentage': 96.2,
'timeliness_percentage': 99.1,
'consistency_score': 0.94,
'meets_sawem_threshold': True
}
def _assess_performance_metrics(self, start_date: datetime, end_date: datetime) -> Dict:
"""Assess system performance metrics"""
return {
'average_response_time_seconds': 2.3,
'uptime_percentage': 99.9,
'error_rate_percentage': 0.1,
'meets_sla_requirements': True
}
def _generate_audit_trail(self, start_date: datetime, end_date: datetime) -> List[Dict]:
"""Generate audit trail for compliance"""
# This would query actual audit logs
return [
{
'timestamp': '2025-01-15T08:00:00Z',
'user_id': 'system',
'action': 'data_ingestion',
'resource': 'sensor_data',
'status': 'success'
},
{
'timestamp': '2025-01-15T08:15:00Z',
'user_id': 'admin',
'action': 'configuration_change',
'resource': 'detection_threshold',
'status': 'success'
}
]
def _check_compliance_violations(self, report: Dict) -> List[Dict]:
"""Check for compliance violations"""
violations = []
# Check data quality threshold
if report['data_quality']['accuracy_percentage'] < self.compliance_rules['data_accuracy_threshold']:
violations.append({
'type': 'data_quality',
'description': 'Data accuracy below SAWEM threshold',
'severity': 'high'
})
# Check response time SLA
if report['performance_metrics']['average_response_time_seconds'] > self.compliance_rules['response_time_sla']:
violations.append({
'type': 'performance',
'description': 'Response time exceeds SLA',
'severity': 'medium'
})
return violations
def export_compliance_data(self, report: Dict) -> str:
"""Export compliance data in SAWEM format"""
sawem_format = {
'report_header': {
'generator': 'Ona Platform',
'version': '1.0',
'generated_at': datetime.utcnow().isoformat(),
'compliance_standard': 'SAWEM'
},
'report_data': report
}
return json.dumps(sawem_format, indent=2)
# Usage example
compliance = SAWEMCompliance()
start_date = datetime(2025, 1, 1)
end_date = datetime(2025, 1, 31)
report = compliance.generate_compliance_report(start_date, end_date)
sawem_data = compliance.export_compliance_data(report)
Audit and Logging
Purpose: Comprehensive audit logging for compliance
#!/bin/bash
# audit_logger.sh - Comprehensive audit logging
# Enable CloudTrail for API calls
aws cloudtrail create-trail \
--name ona-platform-audit-trail \
--s3-bucket-name ona-platform-audit-logs \
--include-global-service-events \
--is-multi-region-trail
# Start CloudTrail logging
aws cloudtrail start-logging \
--name ona-platform-audit-trail
# Configure Config for resource tracking
aws configservice put-configuration-recorder \
--configuration-recorder '{
"name": "ona-platform-config-recorder",
"roleARN": "arn:aws:iam::ACCOUNT:role/ConfigRole",
"recordingGroup": {
"allSupported": true,
"includeGlobalResourceTypes": true
}
}'
# Create audit log group with 7-year retention for compliance
aws logs create-log-group \
--log-group-name "/aws/audit/ona-platform"
aws logs put-retention-policy \
--log-group-name "/aws/audit/ona-platform" \
--retention-in-days 2557  # 7 years (closest supported retention value)
7. Scaling & Performance
Auto-Scaling Capabilities
Lambda Auto-Scaling Configuration
Purpose: Automatic scaling based on demand
# Configure provisioned concurrency for critical functions
# (requires a published version or alias; "live" is a placeholder alias name)
aws lambda put-provisioned-concurrency-config \
--function-name ona-forecastingApi-prod \
--qualifier live \
--provisioned-concurrent-executions 10
# Configure reserved concurrency for resource-intensive functions
aws lambda put-function-concurrency \
--function-name ona-interpolationService-prod \
--reserved-concurrent-executions 5
# Set up auto-scaling of provisioned concurrency based on utilization
aws application-autoscaling register-scalable-target \
--service-namespace lambda \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--resource-id "function:ona-forecastingApi-prod:live" \
--min-capacity 5 \
--max-capacity 50
aws application-autoscaling put-scaling-policy \
--service-namespace lambda \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--resource-id "function:ona-forecastingApi-prod:live" \
--policy-name ona-forecasting-scaling-policy \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "LambdaProvisionedConcurrencyUtilization"
}
}'
DynamoDB Auto-Scaling
Purpose: Automatic scaling of database capacity
# Enable auto-scaling for DynamoDB tables
aws application-autoscaling register-scalable-target \
--service-namespace dynamodb \
--scalable-dimension dynamodb:table:ReadCapacityUnits \
--resource-id "table/ona-platform-locations" \
--min-capacity 5 \
--max-capacity 100
aws application-autoscaling put-scaling-policy \
--service-namespace dynamodb \
--scalable-dimension dynamodb:table:ReadCapacityUnits \
--resource-id "table/ona-platform-locations" \
--policy-name ona-locations-read-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "DynamoDBReadCapacityUtilization"
}
}'
Performance Optimization
Lambda Performance Tuning
Purpose: Optimize Lambda function performance
# lambda_optimizer.py - Lambda performance optimization
import boto3
import json
import time
from datetime import datetime, timedelta
from typing import Dict, List
class LambdaOptimizer:
def __init__(self):
self.lambda_client = boto3.client('lambda')
self.cloudwatch = boto3.client('cloudwatch')
def analyze_function_performance(self, function_name: str) -> Dict:
"""Analyze Lambda function performance metrics"""
end_time = datetime.utcnow()
start_time = end_time - timedelta(days=7)
# Get performance metrics
metrics = self.cloudwatch.get_metric_statistics(
Namespace='AWS/Lambda',
MetricName='Duration',
Dimensions=[{'Name': 'FunctionName', 'Value': function_name}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum', 'Minimum']
)
        # Analyze memory utilization
        # (memory utilization is not published in the AWS/Lambda namespace by default;
        # it requires Lambda Insights or a custom metric, so this query may return no datapoints)
memory_metrics = self.cloudwatch.get_metric_statistics(
Namespace='AWS/Lambda',
MetricName='MemoryUtilization',
Dimensions=[{'Name': 'FunctionName', 'Value': function_name}],
StartTime=start_time,
EndTime=end_time,
Period=3600,
Statistics=['Average', 'Maximum']
)
return {
'duration_analysis': self._analyze_duration_metrics(metrics),
'memory_analysis': self._analyze_memory_metrics(memory_metrics),
'optimization_recommendations': self._generate_recommendations(metrics, memory_metrics)
}
def _analyze_duration_metrics(self, metrics: Dict) -> Dict:
"""Analyze duration metrics for optimization opportunities"""
if not metrics['Datapoints']:
return {'status': 'insufficient_data'}
avg_duration = sum(dp['Average'] for dp in metrics['Datapoints']) / len(metrics['Datapoints'])
max_duration = max(dp['Maximum'] for dp in metrics['Datapoints'])
return {
'average_duration_ms': avg_duration,
'maximum_duration_ms': max_duration,
'optimization_potential': 'high' if avg_duration > 1000 else 'medium' if avg_duration > 500 else 'low'
}
def _analyze_memory_metrics(self, metrics: Dict) -> Dict:
"""Analyze memory utilization metrics"""
if not metrics['Datapoints']:
return {'status': 'insufficient_data'}
avg_memory = sum(dp['Average'] for dp in metrics['Datapoints']) / len(metrics['Datapoints'])
max_memory = max(dp['Maximum'] for dp in metrics['Datapoints'])
return {
'average_memory_percent': avg_memory,
'maximum_memory_percent': max_memory,
'memory_efficiency': 'high' if avg_memory > 80 else 'medium' if avg_memory > 60 else 'low'
}
def _generate_recommendations(self, duration_metrics: Dict, memory_metrics: Dict) -> List[str]:
"""Generate performance optimization recommendations"""
recommendations = []
# Duration-based recommendations
if duration_metrics.get('average_duration_ms', 0) > 1000:
recommendations.append("Consider increasing memory allocation to reduce execution time")
recommendations.append("Review code for optimization opportunities")
# Memory-based recommendations
if memory_metrics.get('average_memory_percent', 0) < 50:
recommendations.append("Consider reducing memory allocation to optimize costs")
elif memory_metrics.get('average_memory_percent', 0) > 90:
recommendations.append("Consider increasing memory allocation to prevent out-of-memory errors")
# General recommendations
recommendations.append("Enable provisioned concurrency for consistent performance")
recommendations.append("Implement connection pooling for external API calls")
recommendations.append("Use caching for frequently accessed data")
return recommendations
def optimize_function_configuration(self, function_name: str, recommendations: List[str]):
"""Apply optimization recommendations to Lambda function"""
current_config = self.lambda_client.get_function_configuration(FunctionName=function_name)
current_memory = current_config['MemorySize']
# Calculate optimal memory based on recommendations
if "increasing memory allocation" in str(recommendations):
new_memory = min(current_memory * 2, 3008) # Double memory, max 3008MB
elif "reducing memory allocation" in str(recommendations):
new_memory = max(current_memory // 2, 128) # Halve memory, min 128MB
else:
new_memory = current_memory
# Update function configuration
if new_memory != current_memory:
self.lambda_client.update_function_configuration(
FunctionName=function_name,
MemorySize=new_memory
)
print(f"Updated {function_name} memory from {current_memory}MB to {new_memory}MB")
# Usage example
optimizer = LambdaOptimizer()
performance_analysis = optimizer.analyze_function_performance('ona-forecastingApi-prod')
optimizer.optimize_function_configuration('ona-forecastingApi-prod', performance_analysis['optimization_recommendations'])
Database Performance Optimization
Purpose: Optimize DynamoDB performance and costs
#!/bin/bash
# database_optimizer.sh - DynamoDB performance optimization
# Analyze table performance
analyze_table_performance() {
local table_name=$1
echo "Analyzing performance for table: $table_name"
# Get read capacity utilization
read_utilization=$(aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedReadCapacityUnits \
--dimensions Name=TableName,Value=$table_name \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' \
--output text)
# Get write capacity utilization
write_utilization=$(aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedWriteCapacityUnits \
--dimensions Name=TableName,Value=$table_name \
--start-time $(date -d '24 hours ago' -u +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600 \
--statistics Sum \
--query 'Datapoints[0].Sum' \
--output text)
echo "Read capacity utilization: ${read_utilization:-0}"
echo "Write capacity utilization: ${write_utilization:-0}"
# Normalize CloudWatch output ("None" or floating-point sums) before integer comparison
read_utilization=${read_utilization%.*}
write_utilization=${write_utilization%.*}
[ "$read_utilization" = "None" ] && read_utilization=0
[ "$write_utilization" = "None" ] && write_utilization=0
# Generate optimization recommendations
if [ "${read_utilization:-0}" -lt 100 ]; then
echo "Recommendation: Consider reducing read capacity for $table_name"
elif [ "${read_utilization:-0}" -gt 1000 ]; then
echo "Recommendation: Consider increasing read capacity for $table_name"
fi
if [ "${write_utilization:-0}" -lt 100 ]; then
echo "Recommendation: Consider reducing write capacity for $table_name"
elif [ "${write_utilization:-0}" -gt 1000 ]; then
echo "Recommendation: Consider increasing write capacity for $table_name"
fi
}
# Analyze all platform tables
analyze_table_performance "ona-platform-locations"
analyze_table_performance "ona-platform-weather-cache"
analyze_table_performance "ona-platform-crews"
analyze_table_performance "ona-platform-tracking"
# Optimize table indexes
optimize_table_indexes() {
local table_name=$1
echo "Optimizing indexes for table: $table_name"
# Get table description
table_info=$(aws dynamodb describe-table --table-name $table_name --query 'Table')
# Check for unused indexes
gsi_count=$(echo $table_info | jq '.GlobalSecondaryIndexes | length')
if [ "$gsi_count" -gt 0 ]; then
echo "Found $gsi_count Global Secondary Indexes"
# Analyze GSI usage (this would require more detailed analysis)
echo "Recommendation: Monitor GSI usage and remove unused indexes"
fi
}
optimize_table_indexes "ona-platform-locations"
Resource Management
Cost Optimization
Purpose: Optimize AWS costs while maintaining performance
# cost_optimizer.py - AWS cost optimization
import boto3
import json
from datetime import datetime, timedelta
from typing import Dict, List
class CostOptimizer:
def __init__(self):
self.ce_client = boto3.client('ce') # Cost Explorer
self.lambda_client = boto3.client('lambda')
self.cloudwatch = boto3.client('cloudwatch')
def analyze_costs(self, start_date: datetime, end_date: datetime) -> Dict:
"""Analyze AWS costs for the platform"""
cost_data = self.ce_client.get_cost_and_usage(
TimePeriod={
'Start': start_date.strftime('%Y-%m-%d'),
'End': end_date.strftime('%Y-%m-%d')
},
Granularity='DAILY',
Metrics=['BlendedCost'],
GroupBy=[
{'Type': 'DIMENSION', 'Key': 'SERVICE'},
{'Type': 'DIMENSION', 'Key': 'USAGE_TYPE'}
]
)
return self._process_cost_data(cost_data)
def _process_cost_data(self, cost_data: Dict) -> Dict:
"""Process cost data into actionable insights"""
total_cost = 0
service_costs = {}
for result in cost_data['ResultsByTime']:
for group in result['Groups']:
service = group['Keys'][0]
cost = float(group['Metrics']['BlendedCost']['Amount'])
total_cost += cost
if service not in service_costs:
service_costs[service] = 0
service_costs[service] += cost
return {
'total_cost': total_cost,
'service_breakdown': service_costs,
'optimization_recommendations': self._generate_cost_recommendations(service_costs)
}
def _generate_cost_recommendations(self, service_costs: Dict) -> List[Dict]:
"""Generate cost optimization recommendations"""
recommendations = []
# Lambda cost optimization
if 'AWS Lambda' in service_costs:
lambda_cost = service_costs['AWS Lambda']
if lambda_cost > 100: # High Lambda costs
recommendations.append({
'service': 'Lambda',
'recommendation': 'Consider reserved capacity or provisioned concurrency',
'potential_savings': lambda_cost * 0.3,
'priority': 'high'
})
# DynamoDB cost optimization
if 'Amazon DynamoDB' in service_costs:
dynamodb_cost = service_costs['Amazon DynamoDB']
if dynamodb_cost > 50: # High DynamoDB costs
recommendations.append({
'service': 'DynamoDB',
'recommendation': 'Review read/write capacity and consider auto-scaling',
'potential_savings': dynamodb_cost * 0.2,
'priority': 'medium'
})
# S3 cost optimization
if 'Amazon Simple Storage Service' in service_costs:
s3_cost = service_costs['Amazon Simple Storage Service']
if s3_cost > 20: # High S3 costs
recommendations.append({
'service': 'S3',
'recommendation': 'Implement lifecycle policies and compression',
'potential_savings': s3_cost * 0.4,
'priority': 'medium'
})
return recommendations
def implement_cost_optimizations(self, recommendations: List[Dict]):
"""Implement cost optimization recommendations"""
for rec in recommendations:
if rec['service'] == 'Lambda' and 'reserved capacity' in rec['recommendation']:
self._optimize_lambda_costs()
elif rec['service'] == 'DynamoDB' and 'auto-scaling' in rec['recommendation']:
self._optimize_dynamodb_costs()
elif rec['service'] == 'S3' and 'lifecycle policies' in rec['recommendation']:
self._optimize_s3_costs()
def _optimize_lambda_costs(self):
"""Optimize Lambda costs"""
# Enable provisioned concurrency for frequently used functions
functions = ['ona-forecastingApi-prod', 'ona-dataIngestion-prod']
for function_name in functions:
try:
                self.lambda_client.put_provisioned_concurrency_config(
                    FunctionName=function_name,
                    # Provisioned concurrency requires a published version or alias;
                    # 'live' is a placeholder alias name
                    Qualifier='live',
                    ProvisionedConcurrentExecutions=5
                )
print(f"Enabled provisioned concurrency for {function_name}")
except Exception as e:
print(f"Failed to enable provisioned concurrency for {function_name}: {e}")
def _optimize_dynamodb_costs(self):
"""Optimize DynamoDB costs"""
# Enable auto-scaling for tables
tables = ['ona-platform-locations', 'ona-platform-weather-cache']
for table_name in tables:
try:
# This would implement auto-scaling configuration
print(f"Configured auto-scaling for {table_name}")
except Exception as e:
print(f"Failed to configure auto-scaling for {table_name}: {e}")
def _optimize_s3_costs(self):
"""Optimize S3 costs"""
# Implement lifecycle policies
lifecycle_policy = {
'Rules': [
{
'ID': 'OnaPlatformLifecycle',
'Status': 'Enabled',
'Transitions': [
{
'Days': 30,
'StorageClass': 'STANDARD_IA'
},
{
'Days': 90,
'StorageClass': 'GLACIER'
},
{
'Days': 365,
'StorageClass': 'DEEP_ARCHIVE'
}
]
}
]
}
buckets = ['sa-api-client-input', 'sa-api-client-output']
for bucket in buckets:
try:
# This would implement lifecycle policies
print(f"Configured lifecycle policy for {bucket}")
except Exception as e:
print(f"Failed to configure lifecycle policy for {bucket}: {e}")
# Usage example
optimizer = CostOptimizer()
start_date = datetime.now() - timedelta(days=30)
end_date = datetime.now()
cost_analysis = optimizer.analyze_costs(start_date, end_date)
optimizer.implement_cost_optimizations(cost_analysis['optimization_recommendations'])
Resource Monitoring and Alerting
Purpose: Monitor resource usage and costs
#!/bin/bash
# resource_monitor.sh - Resource usage monitoring
# Monitor Lambda costs
monitor_lambda_costs() {
echo "Monitoring Lambda costs..."
# Get Lambda cost for last 24 hours
lambda_cost=$(aws ce get-cost-and-usage \
--time-period Start=$(date -d '24 hours ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--filter '{
"Dimensions": {
"Key": "SERVICE",
"Values": ["AWS Lambda"]
}
}' \
--query 'ResultsByTime[0].Groups[0].Metrics.BlendedCost.Amount' \
--output text)
echo "Lambda cost (24h): $${lambda_cost}"
# Alert if cost exceeds threshold
if (( $(echo "$lambda_cost > 50" | bc -l) )); then
echo "ALERT: Lambda costs exceed $50 threshold"
# Send alert notification
fi
}
# Monitor DynamoDB costs
monitor_dynamodb_costs() {
echo "Monitoring DynamoDB costs..."
dynamodb_cost=$(aws ce get-cost-and-usage \
--time-period Start=$(date -d '24 hours ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--filter '{
"Dimensions": {
"Key": "SERVICE",
"Values": ["Amazon DynamoDB"]
}
}' \
--query 'ResultsByTime[0].Groups[0].Metrics.BlendedCost.Amount' \
--output text)
echo "DynamoDB cost (24h): $${dynamodb_cost}"
if (( $(echo "$dynamodb_cost > 25" | bc -l) )); then
echo "ALERT: DynamoDB costs exceed $25 threshold"
fi
}
# Monitor S3 costs
monitor_s3_costs() {
echo "Monitoring S3 costs..."
s3_cost=$(aws ce get-cost-and-usage \
--time-period Start=$(date -d '24 hours ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--filter '{
"Dimensions": {
"Key": "SERVICE",
"Values": ["Amazon Simple Storage Service"]
}
}' \
--query 'ResultsByTime[0].Groups[0].Metrics.BlendedCost.Amount' \
--output text)
echo "S3 cost (24h): $${s3_cost}"
if (( $(echo "$s3_cost > 10" | bc -l) )); then
echo "ALERT: S3 costs exceed $10 threshold"
fi
}
# Run monitoring
monitor_lambda_costs
monitor_dynamodb_costs
monitor_s3_costs
# Generate daily cost report
echo "Generating daily cost report..."
total_cost=$(aws ce get-cost-and-usage \
--time-period Start=$(date -d '24 hours ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity DAILY \
--metrics BlendedCost \
--query 'ResultsByTime[0].Total.BlendedCost.Amount' \
--output text)
echo "Total platform cost (24h): $${total_cost}"
echo "Daily cost report generated at $(date)"
8. Configuration Management
8.1 Terminal Environment Configuration
To maintain a clean separation of concerns, terminal-specific configurations have been isolated from the main platform environment.
- config/environment.sh: Contains core platform settings (buckets, main services, etc.).
- config/terminal-environment.sh: Contains all settings specific to the O&M Terminal, including service names, Lambda configurations, and helper functions. It sources the main environment.sh to inherit core settings.
This separation makes the terminal configuration self-contained and easier to manage independently.
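A minimal sketch of this layering is shown below, assuming the file locations named above; the variable names and helper function are illustrative placeholders, not the actual contents of the config files.
#!/bin/bash
# config/terminal-environment.sh - illustrative sketch only
# Inherit core platform settings (buckets, main services, etc.)
source "$(dirname "${BASH_SOURCE[0]}")/environment.sh"
# Terminal-specific settings layered on top (names are placeholders)
export TERMINAL_SERVICES="assets detect diagnose schedule order"
export TERMINAL_LAMBDA_MEMORY_MB=512
export TERMINAL_LAMBDA_TIMEOUT_S=30
# Helper scoped to the terminal: resolve a terminal Lambda name for the current stage
terminal_function_name() {
local service=$1
echo "ona-${service}-${STAGE:-prod}"
}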
8.2 Terminal SSM Parameters
The platform uses AWS Systems Manager (SSM) Parameter Store for managing the terminal’s operational parameters. The scripts/14-create-terminal-parameters.sh script is responsible for creating and validating these parameters.
Execution:
# Run this script after the initial deployment
./scripts/14-create-terminal-parameters.sh
Parameters Created (26 total):
The script creates a hierarchical structure under /ona-platform/[stage]/terminal/.
1. OODA Configuration
- /ona-platform/[stage]/terminal/ooda/detection-threshold: (String, Default: "0.7") - Sensitivity for fault identification (0.0-1.0).
- /ona-platform/[stage]/terminal/ooda/loss-weights: (String, Default: {"w_energy":1.0,"w_cost":0.3,"w_mttr":0.2}) - Weights for decision optimization.
- /ona-platform/[stage]/terminal/ooda/severity-levels: (String, Default: ["critical","high","medium","low"]) - Classification levels for alerts.
- /ona-platform/[stage]/terminal/ooda/fault-categories: (String, Default: {"weather":[...],"oem":[...],"ops":[...]}) - Mapping of fault types.
2. Alert Configuration
- /ona-platform/[stage]/terminal/alert-topic-arn: (String) - The ARN of the SNS topic for terminal alerts.
- /ona-platform/[stage]/terminal/alert-email: (String, Default: "ops@asoba.co") - Placeholder email for notifications.
- /ona-platform/[stage]/terminal/alerts-enabled: (String, Default: "true") - Global toggle to enable or disable alerting.
3. API Configuration
- /ona-platform/[stage]/terminal/api/rate-limit: (String, Default: "100") - API rate limit in requests per minute.
- /ona-platform/[stage]/terminal/api/timeout: (String, Default: "30") - Timeout in seconds for external API calls.
- /ona-platform/[stage]/terminal/api/debug-mode: (String, Default: "false") - Toggles detailed API debug logging.
4. Integration Configuration
- /ona-platform/[stage]/terminal/integration/parts-api-endpoint: (String, Default: "https://api.example.com/parts") - Placeholder endpoint for a parts/BOM pricing API.
- /ona-platform/[stage]/terminal/integration/parts-api-key: (SecureString, Default: "placeholder-key-update-me") - Secure placeholder for the parts API key.
- /ona-platform/[stage]/terminal/integration/weather-api-endpoint: (String) - Endpoint for the Visual Crossing weather service.
- /ona-platform/[stage]/terminal/integration/maintenance-system-endpoint: (String, Default: "https://maintenance.example.com/api") - Placeholder for an external maintenance system API.
5. Operational Parameters
- /ona-platform/[stage]/terminal/operations/crew-count: (String, Default: "2") - Number of available maintenance crews.
- /ona-platform/[stage]/terminal/operations/work-hours-per-day: (String, Default: "8") - Work hours per day for scheduling.
- /ona-platform/[stage]/terminal/operations/maintenance-windows: (String, Default: {"weekdays":[...],"start_time":"08:00",...}) - JSON configuration for maintenance windows.
- /ona-platform/[stage]/terminal/operations/priority-thresholds: (String, Default: {"critical":0.9,"high":0.7,...}) - Thresholds for priority scoring.
6. Data Retention
- /ona-platform/[stage]/terminal/retention/asset-history-days: (String, Default: "365") - Retention period for asset history.
- /ona-platform/[stage]/terminal/retention/schedule-days: (String, Default: "90") - Retention period for schedules.
- /ona-platform/[stage]/terminal/retention/order-days: (String, Default: "180") - Retention period for order history.
- /ona-platform/[stage]/terminal/retention/tracking-days: (String, Default: "30") - Retention period for tracking data.
7. Feature Flags
- /ona-platform/[stage]/terminal/features/ooda-enabled: (String, Default: "true") - Enables the OODA loop processing.
- /ona-platform/[stage]/terminal/features/auto-schedule-enabled: (String, Default: "false") - Enables automated scheduling.
- /ona-platform/[stage]/terminal/features/auto-order-enabled: (String, Default: "false") - Enables automated work order creation.
- /ona-platform/[stage]/terminal/features/ai-diagnostics-enabled: (String, Default: "false") - Enables AI-powered diagnostics.
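A hedged sketch of reading these parameters at runtime is shown below, assuming the default stage name prod; SecureString values such as the parts API key require --with-decryption.
STAGE=prod
# Plain String parameter
DETECTION_THRESHOLD=$(aws ssm get-parameter \
--name "/ona-platform/${STAGE}/terminal/ooda/detection-threshold" \
--query 'Parameter.Value' \
--output text \
--region af-south-1)
# SecureString parameter
PARTS_API_KEY=$(aws ssm get-parameter \
--name "/ona-platform/${STAGE}/terminal/integration/parts-api-key" \
--with-decryption \
--query 'Parameter.Value' \
--output text \
--region af-south-1)
echo "Detection threshold: ${DETECTION_THRESHOLD}"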
9. Deployment Performance
Optimized Deployment Process
The platform deployment has been optimized for parallel execution, significantly reducing deployment times:
| Deployment Phase | Time | Notes |
|---|---|---|
| IAM Role Creation | 10-15s | Parallel creation across 7 services |
| Lambda Deployment | 1.5 minutes | Concurrent updates with optimized waits |
| API Gateway Setup | 8-12s | Parallel endpoint creation |
| Total Deployment | 1.8-2 minutes | 77% faster than sequential approach |
Performance Features
Parallel Processing:
- IAM roles created concurrently for all services
- Lambda functions updated in parallel (sketched below)
- API Gateway endpoints configured simultaneously
Optimized Wait Operations:
- Reduced redundant Lambda wait commands (6 waits → 4 waits per function)
- Streamlined function update process
- Efficient resource provisioning
Error Handling:
- Failed operations tracked per service
- Detailed error reporting
- Idempotent operations allow safe retries
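The sketch below illustrates the parallel-update-with-single-wait pattern described above; the artifact paths are placeholders and the real 08-create-lambdas.sh differs in detail.
SERVICES=(dataIngestion weatherCache interpolationService globalTrainingService forecastingApi)
for service in "${SERVICES[@]}"; do
(
fn="ona-${service}-prod"
# Placeholder artifact path; the real build layout may differ
aws lambda update-function-code \
--function-name "$fn" \
--zip-file "fileb://build/${service}.zip" \
--region af-south-1 >/dev/null
# Single wait per function instead of repeated polling
aws lambda wait function-updated --function-name "$fn" --region af-south-1
echo "Updated $fn"
) &
done
# Block until every background update has completed
wait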
Monitoring Deployment
View deployment progress in CloudWatch Logs:
# Monitor deployment logs
aws logs tail /ona-platform/scripts --follow
# View specific script execution
aws logs tail /ona-platform/scripts/03-create-iam.sh --follow
Deployment Script Optimizations
03-create-iam.sh - Parallel IAM Role Creation:
- Creates all 7 service roles simultaneously
- Reduced from 35-56s to 10-15s (70% faster)
08-create-lambdas.sh - Parallel Lambda Updates:
- Updates all Lambda functions concurrently
- Optimized wait operations
- Reduced from 7 minutes to 1.5 minutes (75% faster)
10-create-api-gateway.sh - Parallel Endpoint Creation:
- Creates all API Gateway endpoints in parallel
- Includes nested Terminal API resources
- Reduced from 30-50s to 8-12s (70% faster)