Sep 27, 2023

A Guide to Continuous Deployment for the Overly Suspicious

Yes, at CrowdSec we’re quite thorough about security, especially regarding infrastructure integrity and customer data protection. On the other hand, we are keen to follow the best practices in the industry as much as possible.

In this article, we want to present our version of the guide for secure Continuous Deployment (CD). The goal of the article is not to advocate the benefits of CD nor to explain what it is and how it should be done, there is so much literature about that, and we do not pretend to have invented anything.

Instead, we’ll focus on how we addressed the security aspects with as little compromise as possible.

Our environment is deployed on AWS and consists of a few lambdas (131, to be precise), RDS instances, and so on. The source code is hosted on GitHub, and we use GitHub Actions and Terraform.

We considered the following challenges:

No long-lived credentials should be stored on GitHub. Even though we’re sure the GitHub team is great at security, things don’t always go according to plan, and we better stay away from unnecessary risks.
The service account should be able to do what the service account should be able to do. Joking aside, the GitHub runner will impersonate an AWS service account that will have limited authorizations because of the least privilege principle. A service account allowed to deploy a lambda A cannot modify a lambda B, let alone deploy an EC2 to do crypto mining.
Manual validation in a different application (in our case, Slack) before deploying in infrastructure.
Monitoring and alerts (still work in progress, and we’ll cover in a future article)

The goal of the article is to share the lessons learned by addressing those challenges.

Credentials

Fortunately, GitHub Actions provides a way to authenticate using OpenID Connect (OIDC) with the major cloud providers. Long story short (with the AWS example), we have to register GitHub as an OIDC provider in our AWS infrastructure, then AWS will be able to validate a token issued by GitHub that is generated for every workflow by the action.

If the token is validated, AWS will respond with short-lived credentials for a given AWS role. Each workflow can impersonate a different AWS role — we encourage having as many AWS roles as reasonably possible with authorizations as granular as possible.

The process and the procedure to use OIDC in GitHub Actions are very well explained in the GitHub documentation. We strongly recommend reading the article to understand the full picture, but you can find below a summary of the steps you need to follow.

AWS OIDC provider setup

This is done only once and allows AWS to recognize GitHub as an identity provider.

From the AWS console, open IAM/Identity providers and add an identity provider, with OpenID Connect.

Or maybe you rather Terraform it? Excellent choice! Here’s how to do it:

resource "aws_iam_openid_connect_provider" "github" {
  url = "https://token.actions.githubusercontent.com"

  client_id_list = [
    "sts.amazonaws.com",
  ]

  thumbprint_list = [
    "6938fd4d98bab03faadb97b34396831e3780aea1"
  ]
}

AWS role setup

Create one role for each GitHub repository that you want to integrate with AWS. Each role should have limited rights — use only the strict minimum permissions required for the resources it deploys.

We use Terraform to create these roles and, obviously, for chicken and egg reasons and mostly to avoid one role with god permissions, we apply this Terraform manually. Here’s an example of a role in Terraform:


resource "aws_iam_role" "github-action-YOUR-PROJECT" {
  name               = "github-action-YOUR-PROJECT-role"
  assume_role_policy = data.aws_iam_policy_document.YOUR-PROJECT_gh_assume_role_policy.json
  inline_policy {
    name   = "role-YOUR-PROJECT-policy"
    policy = data.aws_iam_policy_document.role-YOUR-PROJECT-policy.json
  }
  inline_policy {
    name   = "role-YOUR-PROJECT-terraform-policy"
    policy = data.aws_iam_policy_document.role-YOUR-PROJECT-terraform-policy.json
  }
}

data "aws_iam_policy_document" "YOUR-PROJECT_gh_assume_role_policy" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::${local.env.aws_account}:oidc-provider/token.actions.githubusercontent.com"]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:YOUR_GITHUB_REPO:*"]
    }
  }
}

data "aws_iam_policy_document" "role-YOUR-PROJECT-terraform-policy" {
# Terraform states
  statement {
    actions = [
      "s3:*",
    ]
    resources = [
      "arn:aws:s3:::cs-terraform-backend-${local.env.aws_region}",
      "arn:aws:s3:::cs-terraform-backend-${local.env.aws_region}/*",
    ]
  }
  statement {
    actions = [
      "dynamodb:*",
    ]
    resources = [
      "arn:aws:dynamodb:${local.env.aws_region}:${local.env.aws_account}:table/terraform.lock",
    ]
  }
}

Takeaways:

The assume policy links your GitHub repo with the AWS role, using sub claim in the JWT token.
We separate Terraform remote state permissions from the rest of infrastructure-related permissions. According to the way you manage the Terraform state, you might want to ignore the Terraform policy.
The other permissions are in role-YOUR-PROJECT-policy.json.tf file (tip: name your JSON file with the tf.json extension and you can reference it as data).

Here’s an example of JSON policy:


{
    "data": {
        "aws_iam_policy_document": {
            "role-YOUR-PROJECT-policy": {
                "statement": [
                    {
                        "effect": "Allow",
                        "actions": [
                            ...
                        ],
                        "resources": [
                            ...
                        ]
                    }
                ]
	     }
	}
}

GitHub Actions setup

In this action step, we utilize continuous deployment tools to facilitate the authentication phase and token exchanges between GitHub and AWS, ensuring a seamless integration process.

As a result, the action step will set AWS environment variables with short-lived credentials for the given AWS role. Also, you must add id-token permission to the workflow.


permissions:
  id-token: write # This is required for requesting the JWT
  contents: read  # This is required for actions/checkout
jobs:
  ...
  - steps:
     - ...
     - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: AWS_ROLE_ARN
          role-session-name: github_terraform_action # this is an example
          aws-region: AWS_REGION
     - #terraform plan, etc.
       ...

Least privilege

So, now you have short-lived credentials to AWS. What if an action from your workflow exfiltrates those credentials? I’m sure it cannot happen because you’ve already audited the code of every third-party action you use, and you reference a version/tag/hash of the action, so you’re pretty confident the code could not have been compromised.

Well, just in case you didn’t, you don’t want to give an attacker full control of your infrastructure even for 2 hours.

We decided to use one AWS role for each functional coherent area that is already hosted in the same GitHub repository —that would be 40 roles or so. We give the role that has very limited rights the strict minimum permissions required to do the job.

Has anyone seen what an AWS Permission Policy looks like when you try just not to put * everywhere? My point exactly, it would be insane to manage that by hand for 40+ roles. Here comes our imperfect-but-working-and-good-enough Policies Generator.

The Terraform Policy Generator

Terraform is able to list all the resources in a state since it knows what modifications are required and on which resources, so why not use/write a tool that interprets Terraform logs and generates an AWS Permissions Policy? How hard can that be?

We looked around for tools that already do that and, either we’re not good at looking for existing tools, or there aren’t any! So, we decided to build one and release it as open source for anyone to use.

Behind the scenes: How the Terraform Policies Generator works

The Policies Generator operates with a two-step mechanism to ensure the generated policies align with the resources in your Terraform plan.

Resource extraction

First, the tool goes through your Terraform plan, identifying all resources set to be created, updated, or managed. It does this by parsing the Terraform plan’s JSON data, focusing on both the planned_values and prior_state sections. This gives the generator a comprehensive list of resources it will have to manage permissions.

Policy generation

Next, the generator begins creating IAM policies for these resources. The goal here is to separate read and write permissions, providing granular access controls that stick as closely as possible to the principle of least privilege.

For each AWS service identified in your resources (e.g., lambda, s3, etc.), it does the following.

Read permissions: The generator assigns broad read permissions (like lambda:Describe*, lambda:List*, lambda:Get*, lambda:Read*) to all resources under that service. The rationale is that reading or listing resources generally has fewer security implications than modifying or deleting them. Also, sometimes AWS Terraform provider needs wide read privileges to deploy resources (e.g., the need to list all the available lambdas before deploying the one you want).
Write permissions: The generator takes a more conservative approach to write permissions. For actions like lambda:* that may modify or delete resources, it grants these permissions only for the specific resources extracted from your Terraform plan. This ensures that your CI/CD system has the power to manage resources as described in your plan but only has a free hand to make changes beyond what’s explicitly defined.

In essence, the Policies Generator breaks down your Terraform plan, creates a map of required permissions, and then builds IAM policies that provide just enough access for your CI/CD system to do its job, ensuring a robust and secure infrastructure deployment process.

To use the tool, follow the GitHub documentation.

Terraform Policies Generator in action

First things first, we need to install the tool:


pip install git+ssh://git@github.com/crowdsecurity/tf_policies_generator

Now let’s generate the policy:


terraform plan -out plan.out
terraform show -no-color -json plan.out > output.json
tf_policies_generator -f output.json

This command will generate a policies JSON file called policies.tf.json that looks like this:


{
    "data": {
        "aws_iam_policy_document": {
            "policies": {
                "statement": [
			...
		   ]
            }
       }
   }
}

Note: Make sure to give this file a more relevant name and rename policies to something like role-YOUR-PROJECT-policy, copy it to your terraform folder and you’re done.

Limitations

It’s no surprise that a tool like this didn’t already exist, given how complicated it actually is to create one with so many exceptions to deal with.

Among said exceptions, is the obligation to give READ authorization to a bunch of resources that Terraform will want to list, ending up with * read permissions on a few of them. However, we don’t give * authorizations for the most sensitive resources. For these resources, we should manually modify the generated policy.

Nonetheless, our tool seems good enough even though manual actions are still required here and there.

Manual validation

We use this minimalist yet very efficient GitHub action to implement an approval workflow using Slack.

Basically, our workflows are as follows.

Development environment

In a PR, a comment with tf plan or tf apply, will trigger an on comment workflow. We authenticate with AWS, execute a terraform plan command, and publish the result as PR comments. In the case of apply, the action creates a prompt in a Slack channel and awaits for validation. If there is no validation before timeout or there has been a rejection, the workflow ends with an error, otherwise, Terraform will apply.

Production environment

Quite similar to development workflow but triggered when a release is created or updated. At the end of the procedure, we add in the release notes the user who triggered the deployment and deployment status.

Setting up a manual approval step

We use a GitHub action to force a manual approval step. There’s not so much activity on this repo but this action does no more or no less than we request, so we go for it.

For this, you’ll have to create a Slack application but this is not something we will cover with this article. Let’s just assume you’ve already got your SLACK_APP_TOKEN, SLACK_BOT_TOKEN, SLACK_CHANNEL_ID, and SLACK_SIGNING_SECRET.

The GitHub action step looks like this, inserted just before terraform apply step:


jobs:
  ...
  - steps:
      - ...
      - name: send approval
        if: github.event.release.prerelease == false
        uses: varu3/slack-approval
        env:
          SLACK_APP_TOKEN: ${{ secrets.SLACK_APP_TOKEN }}
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}
          SLACK_SIGNING_SECRET: ${{ secrets.SLACK_SIGNING_SECRET }}
          SLACK_CHANNEL_ID: ${{ secrets.SLACK_CHANNEL_ID }}
        timeout-minutes: 30
      - name: terraform apply
         ...

This step will publish a message in the Slack channel with two actions possible: Accept / Reject. The workflow will wait for the action for 30 minutes and fail if no action has been selected.

Conclusion

We were very frustrated with deploying our infrastructure from local machines with all the potential quality and traceability risks the “manual” method implies.

But we didn’t want to take any chances with security, and trust us, having colleagues who had fun hacking CI/CD systems in their previous lives (no names here 😂) sets the level quite high.

Our takeaway is that we can reduce the obvious risks of letting a cloud-based runner have access to the infrastructure, with a few constraints and the obligation to still perform a few actions manually.

Also, the implementation of Cloudpanel and CrowdSec integration further enhances our security posture. The solution is not bulletproof (there’s no such thing as bulletproof in security anyway!) but is rather hardened by industry standards.