A detailed guide to securely integrating Amazon Managed Grafana with Terraform

Over the last couple of months, I worked extensively on Amazon Managed Grafana to create dashboards for observability. In the course of automating the deployment using Terraform and a CI-CD system, I learned about a few challenges and identified solutions. I’m sharing my learnings in this note so that you do not make the same mistakes I made. In addition, towards the end of this guide, I have links to detailed notes on provisioning a workspace and dashboards in that workspace using Terraform.

Per the AWS documentation, Amazon Managed Grafana is a fully managed and secure data visualization service that you can use to instantly query, correlate, and visualize operational metrics, logs, and traces from multiple data sources. In addition, Amazon Managed Grafana makes it easy to deploy, operate, and scale Grafana.

Amazon Managed Grafana can be divided into two planes: the control plane and the data plane.
The control plane deals with the Amazon Managed Grafana workspace, SSO/Identity Center integration, and managing access to AWS services like Amazon CloudWatch.
The data plane deals with setting up the data source, creating the dashboard, and managing access to folders and dashboards inside the workspace. Recognizing this split early is worthwhile since it reduces the learning curve considerably. As you may have noticed, the data plane depends on the control plane, so the control plane must be provisioned and configured before the data plane can be created and used.

Coming from the DevOps mindset, I wanted to create a repeatable, testable, modular, and efficient system for creating the workspace and dashboards in Amazon Managed Grafana.

Let me walk through how I achieved the objective, beginning with ensuring that the process is repeatable and testable. The simplest way to ensure that is to use an infrastructure-as-code tool. By adopting Terraform to provision the resources for the control and data planes and running Bridgecrew Checkov on the source code, I ensured just that.

Continuing, let’s focus on how the process is modular.
I began with the idea of a single Terraform stack that would automate the provisioning of both the control and data planes. The plan was to create all of these resources in one Terraform project:
-the workspace,
-the Identity Center integration,
-the IAM role and policy for the workspace to assume,
-the data source,
-a folder,
-and the dashboard.
The control plane was automated using the AWS provider for Terraform. The data plane was automated using the Grafana provider for Terraform. Although you can use both providers in the same Terraform stack, I soon realized that deploying the data plane together with the control plane would run into trouble, particularly around managing the workspace API key.

The Grafana workspace API key and the Grafana workspace URL are mandatory values for the Grafana provider for Terraform. Hence, both values must be available before the Grafana provider is configured, which in practice means before the pipeline starts its Terraform run. In addition, the AWS provider for Terraform has a resource called aws_grafana_workspace_api_key that Terraform can use to create the workspace API key.
Hence the process could look like this:
(i) create the control plane resources using the AWS Terraform provider, including the aws_grafana_workspace_api_key,
(ii) store the API key securely, and
(iii) access the API key and provision the data plane resources using the Grafana Terraform provider.

However, there is a limitation to the above process. The workspace API key has a duration field (seconds to live) during which it is active, and once it has expired, there is no way to use that key. So, per Terraform, the resource exists, but it is useless after its expiry, and Terraform does not track that expiry.
Note: To test this out, create an aws_grafana_workspace_api_key resource with seconds_to_live = 60. After 60 seconds (when the key has expired), try to run a terraform apply command. You will get an Apply complete! Resources: 0 added, 0 changed, 0 destroyed. message. So after 60 seconds, the key has expired, but per Terraform, there isn’t any change required.
Another Note: I expect the AWS provider maintainers will address this behavior at a future date. When I was testing this issue, the latest version of the AWS Terraform provider was 4.48.0.
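
To make the note above concrete, here is a minimal Python sketch. It is a toy model, not how Terraform is implemented internally: it assumes a plan only compares the configured attributes with what was recorded in state, while the key's validity depends on the wall clock alone. The function names key_expired and plan_has_changes, and the attribute values, are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone


def key_expired(created_at, seconds_to_live, now):
    """Return True once the key's lifetime has elapsed."""
    return now >= created_at + timedelta(seconds=seconds_to_live)


def plan_has_changes(config, state):
    """Toy model: a plan reports changes only when config and state differ."""
    return config != state


# Hypothetical attribute values for illustration only.
config = {"key_name": "ci-key", "key_role": "ADMIN", "seconds_to_live": 60}
state = dict(config)  # what was recorded when the key was created

created_at = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
two_minutes_later = created_at + timedelta(minutes=2)

expired = key_expired(created_at, config["seconds_to_live"], two_minutes_later)
changes = plan_has_changes(config, state)
print(f"key expired: {expired}, plan has changes: {changes}")
```

In this model the key is expired while the plan still reports no changes, which mirrors what I observed: nothing in the configuration or the recorded state changes when the clock runs out.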

An alternate approach would be to destroy only the workspace API key after each usage in the data plane, so that the next CI-CD pipeline run creates a fresh key during its terraform apply step. Do you see the challenge? This is different from the conventional approach. We seldom destroy resources right after they are provisioned. Moreover, what if there is an error and the run does not finish successfully, leaving the workspace API key resource provisioned and un-deleted? The next run of terraform apply would fail because the aws_grafana_workspace_api_key resource would still exist in state but would not be usable since the key has expired.

This limitation makes it impractical to create the control and data plane resources in one Terraform stack. Hence, my preferred solution was to have separate Terraform stacks for the control and data planes, each with its own CI-CD pipeline.

Regarding the workspace API key, I preferred not to use Terraform to create it. Instead, I created the key using PowerShell (I have used Python to create it too). The question to ask is: why do I need the workspace API key? Because the workspace API key, along with the workspace URL, is a mandatory value for the Grafana Terraform provider. Creating the data plane implies that the control plane exists and is already provisioned. In that case, my CI-CD process for the data plane could look something like this:
– get the value of the Grafana workspace URL (this value is available from the control plane CI-CD pipeline),
– create an API key using PowerShell or Python,
– pass the value of the Grafana workspace URL and workspace API key to the Terraform code,
– provision resources belonging to the data plane using the Grafana Terraform provider, and
– delete the workspace API key using PowerShell or Python.
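
The steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the exact script from my pipeline: it assumes boto3 is installed, that the workspace URL already includes the https:// scheme, and that the Grafana Terraform provider picks up its settings from the GRAFANA_URL and GRAFANA_AUTH environment variables. The helper names, the key name ci-data-plane, the ADMIN role, the 900-second lifetime, and the GRAFANA_WORKSPACE_ID / GRAFANA_WORKSPACE_URL variable names are all hypothetical choices for illustration.

```python
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def temporary_workspace_api_key(client, workspace_id, key_name="ci-data-plane",
                                role="ADMIN", seconds_to_live=900):
    """Create a short-lived workspace API key and always delete it afterwards."""
    response = client.create_workspace_api_key(
        workspaceId=workspace_id,
        keyName=key_name,
        keyRole=role,
        secondsToLive=seconds_to_live,
    )
    try:
        yield response["key"]
    finally:
        # Runs on success and on failure, so no key lingers for the next run.
        client.delete_workspace_api_key(workspaceId=workspace_id, keyName=key_name)


def run_terraform(workspace_url, api_key, workdir="."):
    """Run init -> plan -> apply with the provider settings in the environment."""
    env = dict(os.environ, GRAFANA_URL=workspace_url, GRAFANA_AUTH=api_key)
    for args in (["init", "-input=false"],
                 ["plan", "-input=false", "-out=tfplan"],
                 ["apply", "-input=false", "tfplan"]):
        subprocess.run(["terraform", *args], cwd=workdir, env=env, check=True)


def deploy_data_plane(client, workspace_id, workspace_url, apply=run_terraform):
    """Create the key, provision the data plane, then delete the key."""
    with temporary_workspace_api_key(client, workspace_id) as api_key:
        apply(workspace_url, api_key)


def main():
    import boto3  # third-party; imported lazily so the helpers work without it

    deploy_data_plane(
        boto3.client("grafana"),
        workspace_id=os.environ["GRAFANA_WORKSPACE_ID"],    # hypothetical name
        workspace_url=os.environ["GRAFANA_WORKSPACE_URL"],  # hypothetical name
    )
```

A pipeline step would call main() once per run. Wrapping the key in a context manager is a design choice: the delete call sits in a finally block, so the key is removed even when terraform apply fails partway through.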

Why do I believe this process to be better?
By managing the workspace API key outside of Terraform, I ensured the process of provisioning resources does not include variations. It's still terraform init -> plan -> apply. Security-wise, the workspace API key does not leave the environment where it was provisioned, which is the CI-CD system. The workspace API key is deleted in the same CI-CD pipeline that created it; hence, in the following pipeline run, there are no lingering effects of the previous pipeline run.

And finally, let me address the efficiency aspect of the process. Adopting two separate CI-CD pipelines for the control and data planes of an Amazon Managed Grafana workspace makes the setup easy to understand and maintain. And by managing the workspace API key outside of Terraform, I ensured that the resource provisioning process did not involve any variation or alternate paths.

Finally, to sum it up, here are my recommendations:
-use the AWS Terraform provider to provision the Amazon Managed Grafana workspace (control plane),
-use the Grafana Terraform provider to provision resources inside the Amazon Managed Grafana workspace (data plane),
-manage the workspace API key outside of Terraform (via AWS CLI, Python, or PowerShell), and
-create separate CI-CD pipelines while working with the Amazon Managed Grafana workspace and dashboards.

You can read more about this process in the two notes below. They also link to my GitHub repository, where I have stored the code to provision the control and data plane resources, with separate CI-CD pipelines for the control and data planes.

Create Amazon Managed Grafana Workspace using Terraform.
Create Amazon Managed Grafana Dashboard using Terraform.

I hope you found this note helpful. Let me know if you disagree or have alternate views. I’d love to learn more from your use case.
