Infrastructure as Code with Terraform: Best Practices from Production
16 min read
Author: Bhakta Bahadur Thapa (@Bhakta7thapa)
Table of Contents
- Infrastructure as Code with Terraform: Best Practices from Production
- The Foundation: Project Structure
- 1. State Management: The Backbone of Reliability
  - Remote State with S3 + DynamoDB
  - State Isolation Strategy
- 2. Module Design: Reusability and Maintainability
  - VPC Module Example
  - Module Variables with Validation
- 3. Environment-Specific Configurations
  - Using Locals for Environment Logic
- 4. Security Best Practices
  - AWS Provider Configuration
  - Secrets Management
- 5. Advanced Patterns and Techniques
  - Dynamic Blocks for Flexible Configuration
  - For_each vs Count
- 6. Testing and Validation
  - Pre-commit Hooks
  - Terratest for Integration Testing
- 7. Deployment Automation
  - CI/CD Pipeline Script
- Common Pitfalls and How to Avoid Them
  - 1. Resource Naming Conflicts
  - 2. State Drift
  - 3. Large State Files
- Conclusion
Infrastructure as Code with Terraform: Best Practices from Production
After managing cloud infrastructure across multiple environments and teams, I've learned that Infrastructure as Code (IaC) isn't just about automating provisioning—it's about creating maintainable, scalable, and secure infrastructure that your entire team can understand and contribute to.
In this post, I'll share the Terraform patterns and practices that have saved me countless hours and prevented numerous production issues.
The Foundation: Project Structure
The way you organize your Terraform code sets the stage for everything else. Here's the structure I've refined over multiple projects:
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── outputs.tf
│   ├── staging/
│   └── production/
├── modules/
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── monitoring/
├── shared/
│   ├── backend.tf
│   └── versions.tf
└── scripts/
    ├── deploy.sh
    └── validate.sh
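The shared/versions.tf in this layout is where the Terraform and provider versions get pinned once and reused by every environment. A minimal sketch of what it might contain (the exact version constraints here are illustrative, not a recommendation):

# shared/versions.tf (illustrative version constraints)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}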
1. State Management: The Backbone of Reliability
Remote State with S3 + DynamoDB
Never, and I mean never, use local state in production. Here's my standard backend configuration:
# shared/backend.tf
terraform {
  backend "s3" {
    bucket         = "bhakta-terraform-state-prod"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-lock-table"
    # Versioning for state recovery is enabled on the bucket itself
    # (it is not a valid backend argument) -- see below.
  }
}

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-lock-table"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = var.environment
  }
}
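Since versioning is a property of the bucket rather than of the backend block, the state bucket itself is bootstrapped separately (typically once, in its own small configuration). A sketch of what that might look like, assuming the same bucket name as above:

# Bootstrap configuration for the state bucket (run once, separately)
resource "aws_s3_bucket" "terraform_state" {
  bucket = "bhakta-terraform-state-prod"
}

# Versioning lets you recover earlier versions of the state file
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}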
State Isolation Strategy
I use separate state files for different layers:
# Network layer
terraform {
  backend "s3" {
    key = "network/terraform.tfstate"
  }
}

# Application layer
terraform {
  backend "s3" {
    key = "applications/terraform.tfstate"
  }
}

# Data layer
terraform {
  backend "s3" {
    key = "data/terraform.tfstate"
  }
}
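These blocks set only the key for each layer; the bucket, region, and lock table still need to be supplied. One way to do that without repeating them is a shared partial backend configuration passed at init time; a sketch, assuming the same bucket and table as above:

# shared/backend.hcl -- partial backend configuration (illustrative)
bucket         = "bhakta-terraform-state-prod"
region         = "us-west-2"
encrypt        = true
dynamodb_table = "terraform-lock-table"

# Each layer then initializes with:
#   terraform init -backend-config=../../shared/backend.hcl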
2. Module Design: Reusability and Maintainability
VPC Module Example
Here's how I structure a reusable VPC module:
# modules/vpc/main.tf

# Availability zones used to spread the subnets (referenced below)
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-vpc"
  })
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-private-subnet-${count.index + 1}"
    Type = "Private"
  })
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-public-subnet-${count.index + 1}"
    Type = "Public"
  })
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-igw"
  })
}

# NAT Gateway
resource "aws_eip" "nat" {
  count  = length(var.public_subnet_cidrs)
  domain = "vpc"

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-eip-${count.index + 1}"
  })

  depends_on = [aws_internet_gateway.main]
}

resource "aws_nat_gateway" "main" {
  count         = length(var.public_subnet_cidrs)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = merge(var.common_tags, {
    Name = "${var.project_name}-nat-gateway-${count.index + 1}"
  })

  depends_on = [aws_internet_gateway.main]
}
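The module also needs an outputs.tf so callers (and the remote-state and Terratest examples later in this post) can reference what it creates. A minimal sketch of the outputs those examples assume:

# modules/vpc/outputs.tf (outputs assumed by later examples)
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets"
  value       = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  description = "IDs of the public subnets"
  value       = aws_subnet.public[*].id
}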
Module Variables with Validation
# modules/vpc/variables.tf
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "VPC CIDR must be a valid IPv4 CIDR block."
  }
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)

  validation {
    condition     = length(var.private_subnet_cidrs) >= 2
    error_message = "At least 2 private subnets required for high availability."
  }
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}
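The main.tf shown earlier also references project_name, public_subnet_cidrs, and common_tags, which this excerpt leaves out. A sketch of how those declarations might look:

# modules/vpc/variables.tf (continued) -- declarations assumed by main.tf
variable "project_name" {
  description = "Project name used as a prefix for resource names"
  type        = string
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
}

variable "common_tags" {
  description = "Tags applied to every resource in the module"
  type        = map(string)
  default     = {}
}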
3. Environment-Specific Configurations
Using Locals for Environment Logic
# environments/production/main.tf
locals {
  environment = "production"

  # Environment-specific configurations
  instance_types = {
    web = "t3.large"
    api = "c5.xlarge"
    db  = "r5.2xlarge"
  }

  # Auto-scaling configurations
  scaling_config = {
    min_capacity = 3
    max_capacity = 20
    target_cpu   = 70
  }

  # Security configurations
  enable_encryption = true
  backup_retention  = 30

  common_tags = {
    Environment = local.environment
    Project     = "my-application"
    Owner       = "bhakta-thapa"
    ManagedBy   = "terraform"
  }
}

module "vpc" {
  source = "../../modules/vpc"

  project_name         = "myapp"
  environment          = local.environment
  vpc_cidr             = "10.0.0.0/16"
  private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnet_cidrs  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  common_tags          = local.common_tags
}
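The benefit of keeping this logic in per-environment locals is that dev can diverge without touching any module code. A sketch of what a dev counterpart might look like (the sizes and retention values are illustrative):

# environments/dev/main.tf (illustrative values)
locals {
  environment = "dev"

  # Smaller, cheaper defaults for a non-production environment
  instance_types = {
    web = "t3.small"
    api = "t3.medium"
    db  = "t3.medium"
  }

  scaling_config = {
    min_capacity = 1
    max_capacity = 3
    target_cpu   = 70
  }

  enable_encryption = true
  backup_retention  = 7

  common_tags = {
    Environment = local.environment
    Project     = "my-application"
    Owner       = "bhakta-thapa"
    ManagedBy   = "terraform"
  }
}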
4. Security Best Practices
AWS Provider Configuration
# Configure AWS Provider with assumed role
provider "aws" {
  region = var.aws_region

  assume_role {
    role_arn = "arn:aws:iam::${var.account_id}:role/TerraformExecutionRole"
  }

  default_tags {
    tags = {
      ManagedBy = "terraform"
      Project   = var.project_name
    }
  }
}
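This assumes a TerraformExecutionRole already exists in each target account and trusts whichever principal runs Terraform (a CI runner or an operator role). A sketch of how that role's trust relationship might be declared; the trusted principal here is a hypothetical placeholder:

# Hypothetical definition of the execution role Terraform assumes
data "aws_iam_policy_document" "terraform_trust" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      # Placeholder ARN for the CI/CD or operator principal
      identifiers = ["arn:aws:iam::111111111111:role/ci-cd-runner"]
    }
  }
}

resource "aws_iam_role" "terraform_execution" {
  name               = "TerraformExecutionRole"
  assume_role_policy = data.aws_iam_policy_document.terraform_trust.json
}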
Secrets Management
# Use AWS Secrets Manager for sensitive data
data "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.project_name}-${var.environment}-db-credentials"
}

data "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = data.aws_secretsmanager_secret.db_credentials.id
}

locals {
  db_creds = jsondecode(data.aws_secretsmanager_secret_version.db_credentials.secret_string)
}

resource "aws_db_instance" "main" {
  identifier     = "${var.project_name}-${var.environment}-db"
  engine         = "postgres"
  engine_version = "14.9"
  instance_class = local.instance_types.db

  allocated_storage = var.db_allocated_storage

  db_name  = local.db_creds.database
  username = local.db_creds.username
  password = local.db_creds.password

  vpc_security_group_ids = [aws_security_group.database.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = local.backup_retention
  backup_window           = "03:00-04:00"
  maintenance_window      = "Sun:04:00-Sun:05:00"

  storage_encrypted = local.enable_encryption
  kms_key_id        = aws_kms_key.database.arn

  skip_final_snapshot       = false
  final_snapshot_identifier = "${var.project_name}-${var.environment}-final-snapshot-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  tags = local.common_tags
}
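The jsondecode call assumes the secret stores a JSON object with database, username, and password keys. If the secret itself is also managed with Terraform (typically in a separate configuration from the one that reads it), seeding it might look like the sketch below; var.db_password is a hypothetical sensitive input, and its value still ends up in that configuration's state:

# Illustrative only: the JSON shape the db_creds local expects
resource "aws_secretsmanager_secret" "db_credentials" {
  name = "${var.project_name}-${var.environment}-db-credentials"
}

resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    database = "appdb"
    username = "app_user"
    password = var.db_password # hypothetical sensitive variable
  })
}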
5. Advanced Patterns and Techniques
Dynamic Blocks for Flexible Configuration
resource "aws_security_group" "web" {
  name_prefix = "${var.project_name}-web-"
  vpc_id      = module.vpc.vpc_id

  # Dynamic ingress rules
  dynamic "ingress" {
    for_each = var.allowed_ports

    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = "tcp"
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = local.common_tags
}
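The dynamic block iterates over var.allowed_ports, which isn't defined in the snippet above. A sketch of the shape it expects (the default rules are illustrative):

# Assumed shape of the allowed_ports variable used by the dynamic block
variable "allowed_ports" {
  description = "Ingress rules for the web security group"
  type = list(object({
    port        = number
    cidr_blocks = list(string)
    description = string
  }))

  default = [
    {
      port        = 443
      cidr_blocks = ["0.0.0.0/0"]
      description = "HTTPS from anywhere"
    },
    {
      port        = 80
      cidr_blocks = ["0.0.0.0/0"]
      description = "HTTP from anywhere"
    },
  ]
}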
For_each vs Count
I prefer for_each over count for most use cases: with count, removing an element from the middle of a list shifts every later index and makes Terraform want to recreate those resources, whereas for_each addresses each instance by a stable key.
# Better: Using for_each
resource "aws_instance" "web" {
  for_each = var.web_servers

  ami           = data.aws_ami.amazon_linux.id
  instance_type = each.value.instance_type
  subnet_id     = each.value.subnet_id

  tags = merge(local.common_tags, {
    Name = each.key
    Role = each.value.role
  })
}

# Variable definition
variable "web_servers" {
  description = "Map of web servers to create"
  type = map(object({
    instance_type = string
    subnet_id     = string
    role          = string
  }))

  default = {
    "web-1" = {
      instance_type = "t3.medium"
      subnet_id     = "subnet-12345"
      role          = "frontend"
    }
    "web-2" = {
      instance_type = "t3.medium"
      subnet_id     = "subnet-67890"
      role          = "frontend"
    }
  }
}
6. Testing and Validation
Pre-commit Hooks
I use pre-commit hooks to catch issues early:
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.81.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
      - id: terraform_tflint
        args:
          - '--args=--only=terraform_deprecated_interpolation'
          - '--args=--only=terraform_deprecated_index'
          - '--args=--only=terraform_unused_declarations'
          - '--args=--only=terraform_comment_syntax'
          - '--args=--only=terraform_documented_outputs'
          - '--args=--only=terraform_documented_variables'
          - '--args=--only=terraform_typed_variables'
          - '--args=--only=terraform_module_pinned_source'
          - '--args=--only=terraform_naming_convention'
          - '--args=--only=terraform_required_version'
          - '--args=--only=terraform_required_providers'
          - '--args=--only=terraform_standard_module_structure'
          - '--args=--only=terraform_workspace_remote'
Terratest for Integration Testing
// test/terraform_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestTerraformVpcModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "project_name":         "test",
            "environment":          "test",
            "vpc_cidr":             "10.0.0.0/16",
            "private_subnet_cidrs": []string{"10.0.1.0/24", "10.0.2.0/24"},
            "public_subnet_cidrs":  []string{"10.0.101.0/24", "10.0.102.0/24"},
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
7. Deployment Automation
CI/CD Pipeline Script
#!/bin/bash
# scripts/deploy.sh

set -e

ENVIRONMENT=${1:-dev}
ACTION=${2:-plan}

echo "🚀 Deploying to $ENVIRONMENT environment..."

cd "environments/$ENVIRONMENT"

# Initialize Terraform
terraform init -upgrade

# Validate configuration
terraform validate

# Security scanning
echo "🔍 Running security scan..."
checkov -d . --framework terraform

# Plan or Apply
case $ACTION in
  "plan")
    terraform plan -out=tfplan
    ;;
  "apply")
    terraform plan -out=tfplan
    echo "📋 Plan generated. Review and approve..."
    read -p "Continue with apply? (y/N): " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
      terraform apply tfplan
    else
      echo "❌ Deployment cancelled"
      exit 1
    fi
    ;;
  "destroy")
    terraform plan -destroy -out=tfplan
    echo "⚠️ DESTROY plan generated. This will DELETE resources!"
    read -p "Are you sure you want to destroy? (y/N): " -n 1 -r
    echo
    if [[ $REPLY =~ ^[Yy]$ ]]; then
      terraform apply tfplan
    else
      echo "❌ Destroy cancelled"
      exit 1
    fi
    ;;
  *)
    echo "❌ Invalid action. Use: plan, apply, or destroy"
    exit 1
    ;;
esac

echo "✅ Deployment completed successfully!"
Common Pitfalls and How to Avoid Them
1. Resource Naming Conflicts
Problem: Resources with hardcoded names cause conflicts across environments.
Solution: Use consistent naming conventions with variables:
resource "aws_s3_bucket" "app_data" {
  bucket = "${var.project_name}-${var.environment}-app-data-${random_id.bucket_suffix.hex}"

  tags = local.common_tags
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
2. State Drift
Problem: Manual changes outside Terraform cause state drift.
Solution: Implement drift detection in CI/CD:
# Check for drift
terraform plan -detailed-exitcode
EXIT_CODE=$?

if [ $EXIT_CODE -eq 2 ]; then
  echo "⚠️ Infrastructure drift detected!"
  # Send notification or fail the pipeline
  exit 1
fi
3. Large State Files
Problem: Monolithic state files become unwieldy and increase blast radius.
Solution: Break infrastructure into logical layers:
# Use data sources to reference other layers
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "bhakta-terraform-state-prod"
    key    = "network/terraform.tfstate"
    region = "us-west-2"
  }
}

# Reference outputs from network layer
resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
  # ... other configuration
}
Conclusion
These patterns have served me well across multiple production environments. The key principles are:
- Consistency - Use standardized structures and naming
- Security - Never expose secrets, use proper IAM roles
- Modularity - Create reusable, testable modules
- Automation - Integrate with CI/CD and testing
- Documentation - Code should be self-documenting
Start with these foundations, and gradually add complexity as your infrastructure needs grow. Remember: good IaC is not just about what you build, but how maintainable and secure it is for your team.
What Terraform patterns have you found most valuable in your infrastructure? I'd love to hear about your experiences!
Next Post Preview: In my next article, I'll dive into "GitOps with ArgoCD: Implementing Continuous Deployment for Kubernetes Applications."
Tags: #Terraform #InfrastructureAsCode #AWS #DevOps #CloudEngineering