Infrastructure as Code with Terraform: Best Practices from Production
16 min read
Author: Bhakta Bahadur Thapa (@Bhakta7thapa)
Table of Contents
- Infrastructure as Code with Terraform: Best Practices from Production
- The Foundation: Project Structure
- 1. State Management: The Backbone of Reliability
  - Remote State with S3 + DynamoDB
  - State Isolation Strategy
- 2. Module Design: Reusability and Maintainability
  - VPC Module Example
  - Module Variables with Validation
- 3. Environment-Specific Configurations
  - Using Locals for Environment Logic
- 4. Security Best Practices
  - AWS Provider Configuration
  - Secrets Management
- 5. Advanced Patterns and Techniques
  - Dynamic Blocks for Flexible Configuration
  - For_each vs Count
- 6. Testing and Validation
  - Pre-commit Hooks
  - Terratest for Integration Testing
- 7. Deployment Automation
  - CI/CD Pipeline Script
- Common Pitfalls and How to Avoid Them
  - 1. Resource Naming Conflicts
  - 2. State Drift
  - 3. Large State Files
- Conclusion
Infrastructure as Code with Terraform: Best Practices from Production
After managing cloud infrastructure across multiple environments and teams, I've learned that Infrastructure as Code (IaC) isn't just about automating provisioning—it's about creating maintainable, scalable, and secure infrastructure that your entire team can understand and contribute to.
In this post, I'll share the Terraform patterns and practices that have saved me countless hours and prevented numerous production issues.
The Foundation: Project Structure
The way you organize your Terraform code sets the stage for everything else. Here's the structure I've refined over multiple projects:
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── outputs.tf
│   ├── staging/
│   └── production/
├── modules/
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── monitoring/
├── shared/
│   ├── backend.tf
│   └── versions.tf
└── scripts/
    ├── deploy.sh
    └── validate.sh
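The shared/versions.tf in this layout is where the Terraform and provider versions get pinned once and reused by every environment. A minimal sketch of what it might contain (the exact version constraints here are illustrative, not a recommendation):

# shared/versions.tf (illustrative version constraints)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}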
1. State Management: The Backbone of Reliability
Remote State with S3 + DynamoDB
Never, and I mean never, use local state in production. Here's my standard backend configuration:
# shared/backend.tf
terraform {
  backend "s3" {
    bucket         = "bhakta-terraform-state-prod"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-lock-table"
    # Versioning for state recovery is enabled on the bucket itself
    # (it is not a valid backend argument) -- see below.
  }
}

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = var.environment
  }
}
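Since versioning is a property of the bucket rather than of the backend block, the state bucket itself is bootstrapped separately (typically once, in its own small configuration). A sketch of what that might look like, assuming the same bucket name as above:

# Bootstrap configuration for the state bucket (run once, separately)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "bhakta-terraform-state-prod"
}

# Versioning lets you recover earlier versions of the state file
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}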
State Isolation Strategy
I use separate state files for different layers:
# Network layer
terraform {
  backend "s3" {
    key = "network/terraform.tfstate"
  }
}

# Application layer
terraform {
  backend "s3" {
    key = "applications/terraform.tfstate"
  }
}

# Data layer
terraform {
  backend "s3" {
    key = "data/terraform.tfstate"
  }
}
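These blocks set only the key for each layer; the bucket, region, and lock table still need to be supplied. One way to do that without repeating them is a shared partial backend configuration passed at init time; a sketch, assuming the same bucket and table as above:

# shared/backend.hcl -- partial backend configuration (illustrative)
bucket         = "bhakta-terraform-state-prod"
region         = "us-west-2"
encrypt        = true
dynamodb_table = "terraform-lock-table"

# Each layer then initializes with:
#   terraform init -backend-config=../../shared/backend.hcl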
2. Module Design: Reusability and Maintainability
VPC Module Example
Here's how I structure a reusable VPC module:
# modules/vpc/main.tf

# Availability zones used to spread the subnets (referenced below)
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-vpc"
  })
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-private-subnet-${count.index + 1}"
    Type = "Private"
  })
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-public-subnet-${count.index + 1}"
    Type = "Public"
  })
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-igw"
  })
}

# NAT Gateway
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-eip-${count.index + 1}"
  })

  depends_on = [aws_internet_gateway.main]
}

resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-gateway-${count.index + 1}"
  })

  depends_on = [aws_internet_gateway.main]
}
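The module also needs an outputs.tf so callers (and the remote-state and Terratest examples later in this post) can reference what it creates. A minimal sketch of the outputs those examples assume:

# modules/vpc/outputs.tf (outputs assumed by later examples)
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "IDs of the public subnets"
  value       = aws_subnet.public[*].id
}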
Module Variables with Validation
# modules/vpc/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "VPC CIDR must be a valid IPv4 CIDR block."
  }
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)

  validation {
    condition     = length(var.private_subnet_cidrs) >= 2
    error_message = "At least 2 private subnets required for high availability."
  }
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}
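The main.tf shown earlier also references project_name, public_subnet_cidrs, and common_tags, which this excerpt leaves out. A sketch of how those declarations might look:

# modules/vpc/variables.tf (continued) -- declarations assumed by main.tf
variable "project_name" {
  description = "Project name used as a prefix for resource names"
  type        = string
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
}

variable "common_tags" {
  description = "Tags applied to every resource in the module"
  type        = map(string)
  default     = {}
}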
3. Environment-Specific Configurations
Using Locals for Environment Logic
# environments/production/main.tf
locals {
  environment = "production"

  # Environment-specific configurations
  instance_types = {
    web = "t3.large"
    api = "c5.xlarge"
    db  = "r5.2xlarge"
  }

  # Auto-scaling configurations
  scaling_config = {
    min_capacity = 3
    max_capacity = 20
    target_cpu   = 70
  }

  # Security configurations
  enable_encryption = true
  backup_retention  = 30

  common_tags = {
    Environment = local.environment
    Project     = "my-application"
    Owner       = "bhakta-thapa"
    ManagedBy   = "terraform"
  }
}

module "vpc" {
  source = "../../modules/vpc"

  project_name         = "myapp"
  environment          = local.environment
  vpc_cidr             = "10.0.0.0/16"
  private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnet_cidrs  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  common_tags          = local.common_tags
}
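The benefit of keeping this logic in per-environment locals is that dev can diverge without touching any module code. A sketch of what a dev counterpart might look like (the sizes and retention values are illustrative):

# environments/dev/main.tf (illustrative values)
locals {
  environment = "dev"

  # Smaller, cheaper defaults for a non-production environment
  instance_types = {
    web = "t3.small"
    api = "t3.medium"
    db  = "t3.medium"
  }

  scaling_config = {
    min_capacity = 1
    max_capacity = 3
    target_cpu   = 70
  }

  enable_encryption = true
  backup_retention  = 7

  common_tags = {
    Environment = local.environment
    Project     = "my-application"
    Owner       = "bhakta-thapa"
    ManagedBy   = "terraform"
  }
}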
4. Security Best Practices
AWS Provider Configuration
# Configure AWS Provider with assumed role
provider "aws" {
  region = var.aws_region

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/TerraformExecutionRole"
  }

  default_tags {
    tags = {
      ManagedBy = "terraform"
      Project   = var.project_name
    }
  }
}
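This assumes a TerraformExecutionRole already exists in each target account and trusts whichever principal runs Terraform (a CI runner or an operator role). A sketch of how that role's trust relationship might be declared; the trusted principal here is a hypothetical placeholder:

# Hypothetical definition of the execution role Terraform assumes
data "aws_iam_policy_document" "terraform_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      # Placeholder ARN for the CI/CD or operator principal
      identifiers = ["arn:aws:iam::111111111111:role/ci-cd-runner"]
    }
  }
}

resource "aws_iam_role" "terraform_execution" {
  name               = "TerraformExecutionRole"
  assume_role_policy = data.aws_iam_policy_document.terraform_trust.json
}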
Secrets Management
# Use AWS Secrets Manager for sensitive data
data "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.project_name}-${var.environment}-db-credentials"
}

data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = data.aws_secretsmanager_secret.db_credentials.id
}

locals {
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "main" {
  identifier     = "${var.project_name}-${var.environment}-db"
  engine         = "postgres"
  engine_version = "14.9"
  instance_class = local.instance_types.db

  allocated_storage = var.db_allocated_storage

  db_name  = local.db_creds.database
  username = local.db_creds.username
  password = local.db_creds.password

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = local.backup_retention
  backup_window           = "03:00-04:00"
  maintenance_window      = "Sun:04:00-Sun:05:00"

  storage_encrypted = local.enable_encryption
  kms_key_id        = aws_kms_key.database.arn

  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.project_name}-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  tags = local.common_tags
}
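The jsondecode call assumes the secret stores a JSON object with database, username, and password keys. If the secret itself is also managed with Terraform (typically in a separate configuration from the one that reads it), seeding it might look like the sketch below; var.db_password is a hypothetical sensitive input, and its value still ends up in that configuration's state:

# Illustrative only: the JSON shape the db_creds local expects
resource "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.project_name}-${var.environment}-db-credentials"
}

resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    database = "appdb"
    username = "app_user"
    password = var.db_password # hypothetical sensitive variable
  })
}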
5. Advanced Patterns and Techniques
Dynamic Blocks for Flexible Configuration
resource "aws_security_group" "web" {
  name_prefix = "${var.project_name}-web-"
  vpc_id      = module.vpc.vpc_id

  # Dynamic ingress rules
  dynamic "ingress" {
    for_each = var.allowed_ports

    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = local.common_tags
}
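The dynamic block iterates over var.allowed_ports, which isn't defined in the snippet above. A sketch of the shape it expects (the default rules are illustrative):

# Assumed shape of the allowed_ports variable used by the dynamic block
variable "allowed_ports" {
  description = "Ingress rules for the web security group"
  type = list(object({
    port        = number
    cidr_blocks = list(string)
    description = string
  }))

  default = [
    {
      port        = 443
      cidr_blocks = ["0.0.0.0/0"]
      description = "HTTPS from anywhere"
    },
    {
      port        = 80
      cidr_blocks = ["0.0.0.0/0"]
      description = "HTTP from anywhere"
    },
  ]
}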
For_each vs Count
I prefer for_each over count for most use cases: with count, removing an element from the middle of a list shifts every later index and makes Terraform want to recreate those resources, whereas for_each addresses each instance by a stable key.
# Better: Using for_each
resource "aws_instance" "web" {
  for_each = var.web_servers

  ami           = data.aws_ami.amazon_linux.id
  instance_type = each.value.instance_type
  subnet_id     = each.value.subnet_id

  tags = merge(local.common_tags, {
    Name = each.key
    Role = each.value.role
  })
}

# Variable definition
variable "web_servers" {
  description = "Map of web servers to create"
  type = map(object({
    instance_type = string
    subnet_id     = string
    role          = string
  }))

  default = {
    "web-1" = {
      instance_type = "t3.medium"
      subnet_id     = "subnet-12345"
      role          = "frontend"
    }
    "web-2" = {
      instance_type = "t3.medium"
      subnet_id     = "subnet-67890"
      role          = "frontend"
    }
  }
}
6. Testing and Validation
Pre-commit Hooks
I use pre-commit hooks to catch issues early:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.81.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
        args:
          - '--args=--only=terraform_deprecated_interpolation'
          - '--args=--only=terraform_deprecated_index'
          - '--args=--only=terraform_unused_declarations'
          - '--args=--only=terraform_comment_syntax'
          - '--args=--only=terraform_documented_outputs'
          - '--args=--only=terraform_documented_variables'
          - '--args=--only=terraform_typed_variables'
          - '--args=--only=terraform_module_pinned_source'
          - '--args=--only=terraform_naming_convention'
          - '--args=--only=terraform_required_version'
          - '--args=--only=terraform_required_providers'
          - '--args=--only=terraform_standard_module_structure'
          - '--args=--only=terraform_workspace_remote'
Terratest for Integration Testing
// test/terraform_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformVpcModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "project_name":         "test",
            "environment":          "test",
            "vpc_cidr":             "10.0.0.0/16",
            "private_subnet_cidrs": []string{"10.0.1.0/24", "10.0.2.0/24"},
            "public_subnet_cidrs":  []string{"10.0.101.0/24", "10.0.102.0/24"},
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
7. Deployment Automation
CI/CD Pipeline Script
#!/bin/bash
# scripts/deploy.sh

set -e

ENVIRONMENT=${1:-dev}
ACTION=${2:-plan}

echo "🚀 Deploying to $ENVIRONMENT environment..."

cd "environments/$ENVIRONMENT"

# Initialize Terraform
terraform init -upgrade

# Validate configuration
terraform validate

# Security scanning
echo "🔍 Running security scan..."
checkov -d . --framework terraform

# Plan or Apply
case $ACTION in
  "plan")
    terraform plan -out=tfplan
    ;;
  "apply")
    terraform plan -out=tfplan
    echo "📋 Plan generated. Review and approve..."
    read -p "Continue with apply? (y/N): " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
      terraform apply tfplan
    else
      echo "❌ Deployment cancelled"
      exit 1
    fi
    ;;
  "destroy")
    terraform plan -destroy -out=tfplan
    echo "⚠️ DESTROY plan generated. This will DELETE resources!"
    read -p "Are you sure you want to destroy? (y/N): " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
      terraform apply tfplan
    else
      echo "❌ Destroy cancelled"
      exit 1
    fi
    ;;
  *)
    echo "❌ Invalid action. Use: plan, apply, or destroy"
    exit 1
    ;;
esac

echo "✅ Deployment completed successfully!"
Common Pitfalls and How to Avoid Them
1. Resource Naming Conflicts
Problem: Resources with hardcoded names cause conflicts across environments.
Solution: Use consistent naming conventions with variables:
resource "aws_s3_bucket" "app_data" {
  bucket = "${var.project_name}-${var.environment}-app-data-${random_id.bucket_suffix.hex}"

  tags = local.common_tags
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
2. State Drift
Problem: Manual changes outside Terraform cause state drift.
Solution: Implement drift detection in CI/CD:
# Check for drift
terraform plan -detailed-exitcode
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "⚠️ Infrastructure drift detected!"
  # Send notification or fail the pipeline
  exit 1
fi
3. Large State Files
Problem: Monolithic state files become unwieldy and increase blast radius.
Solution: Break infrastructure into logical layers:
# Use data sources to reference other layers
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "bhakta-terraform-state-prod"
    key    = "network/terraform.tfstate"
    region = "us-west-2"
  }
}

# Reference outputs from network layer
resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
  # ... other configuration
}
Conclusion
These patterns have served me well across multiple production environments. The key principles are:
- Consistency - Use standardized structures and naming
- Security - Never expose secrets, use proper IAM roles
- Modularity - Create reusable, testable modules
- Automation - Integrate with CI/CD and testing
- Documentation - Code should be self-documenting
Start with these foundations, and gradually add complexity as your infrastructure needs grow. Remember: good IaC is not just about what you build, but how maintainable and secure it is for your team.
What Terraform patterns have you found most valuable in your infrastructure? I'd love to hear about your experiences!
Next Post Preview: In my next article, I'll dive into "GitOps with ArgoCD: Implementing Continuous Deployment for Kubernetes Applications."
Tags: #Terraform #InfrastructureAsCode #AWS #DevOps #CloudEngineering