Monitoring costs of containerized workloads in EKS using OpenCost and AWS Managed Prometheus / Grafana
- Oleksii Bebych
- Jan 25, 2024
- 4 min read
Problem statement
Using the cloud is convenient and has many advantages: you can allocate as much capacity as you need immediately, deploy globally fairly quickly, and focus on your business instead of maintaining a data center. On the other hand, you need to be careful about costs: understand how cloud providers charge you and monitor your spending continuously. AWS Billing and Cost Management provides detailed reports, and AWS Budgets can help with planning and alerting in unforeseen situations. But when it comes to containerized workloads, especially in EKS, there is no native way to look inside the cluster, understand which workloads cost more than expected, identify overprovisioned resources, and optimize usage. Quite a few tools for this have appeared on the market recently.
In this particular case, we wanted to avoid paying for licenses for new products and to make the most of our existing tooling. OpenCost was chosen as a free and lightweight application that integrates with Prometheus and Grafana, which are already used for overall monitoring. More information about configuring them is in this post.
Solution overview
Opencost remote write to AMP
In the previous post regarding Prometheus remote write configuration, we covered how to install and configure the Kube Prometheus Stack with AWS Managed Prometheus as a persistent storage for metrics.
OpenCost documents a similar capability:

Here is an example of the Helm chart values. The Prometheus and OpenCost Helm charts were installed as part of the infrastructure via Terraform, along with the EKS cluster itself:
### Variables
variable "amp_workspace_id" { type = string }
variable "opencost_helm_version" { default = "1.29.0" }
variable "opencost_service_account_name" { type = string }

data "aws_iam_policy_document" "opencost-oidc-assume-role-policy" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    effect  = "Allow"

    condition {
      test     = "StringEquals"
      variable = "${replace(var.iam_openid_provider_url, "https://", "")}:sub"
      values   = ["system:serviceaccount:${var.namespace}:${var.opencost_service_account_name}"]
    }

    principals {
      identifiers = [var.iam_openid_provider_arn]
      type        = "Federated"
    }
  }
}

resource "aws_iam_role" "opencost-irsa-role" {
  assume_role_policy = data.aws_iam_policy_document.opencost-oidc-assume-role-policy.json
  name               = "${var.eks_cluster_name}-${var.opencost_service_account_name}-role"
}

resource "kubernetes_service_account" "opencost-irsa" {
  automount_service_account_token = true

  metadata {
    name      = var.opencost_service_account_name
    namespace = var.namespace
    annotations = {
      "eks.amazonaws.com/role-arn" = aws_iam_role.opencost-irsa-role.arn
    }
  }
}
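The snippet above also references a few input variables that are not part of the variables block shown (the OIDC provider URL and ARN, the namespace, the cluster name, and the region). A minimal sketch of how they might be declared, with assumed types and an example default for the namespace; adjust to your setup:
### Variables referenced above but not shown (sketch, assumed declarations)
variable "aws_region" { type = string }
variable "namespace" { default = "monitoring" } # the namespace used later in this post
variable "eks_cluster_name" { type = string }
variable "iam_openid_provider_url" { type = string } # EKS OIDC issuer URL, e.g. aws_eks_cluster.this.identity[0].oidc[0].issuer
variable "iam_openid_provider_arn" { type = string } # ARN of the cluster's aws_iam_openid_connect_provider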
### Inline IAM Policy
resource "aws_iam_role_policy" "eks-system-opencost" {
  name   = "opencost-policy"
  role   = aws_iam_role.opencost-irsa-role.id
  policy = <<-EOF
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "aps:RemoteWrite",
          "aps:GetSeries",
          "aps:GetLabels",
          "aps:GetMetricMetadata",
          "aps:QueryMetrics"
        ],
        "Resource": "*"
      }
    ]
  }
  EOF
}
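Since least privilege is the goal here, the "Resource": "*" above could also be narrowed to the specific AMP workspace. This is only a sketch and a possible drop-in replacement for the policy above: the workspace ARN is assumed to follow the standard arn:aws:aps:<region>:<account-id>:workspace/<workspace-id> pattern, and the aws_caller_identity data source is introduced here purely for illustration:
### Optional: scope the inline policy to a single AMP workspace (sketch)
data "aws_caller_identity" "current" {}

resource "aws_iam_role_policy" "eks-system-opencost-scoped" {
  name   = "opencost-policy"
  role   = aws_iam_role.opencost-irsa-role.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata",
        "aps:QueryMetrics"
      ]
      # Assumed AMP workspace ARN format
      Resource = "arn:aws:aps:${var.aws_region}:${data.aws_caller_identity.current.account_id}:workspace/${var.amp_workspace_id}"
    }]
  })
}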
### Opencost helm chart
resource "helm_release" "opencost" {
  name             = "opencost-charts"
  repository       = "https://opencost.github.io/opencost-helm-chart"
  chart            = "opencost"
  version          = var.opencost_helm_version
  create_namespace = false
  namespace        = var.namespace

  values = [<<EOF
serviceAccount:
  create: false
  name: ${kubernetes_service_account.opencost-irsa.metadata[0].name}
opencost:
  ui:
    enabled: false
  prometheus:
    internal:
      enabled: false
    external:
      enabled: false
    amp:
      enabled: true # If true, opencost will be configured to remote_write and query from Amazon Managed Service for Prometheus.
      workspaceId: ${var.amp_workspace_id}
  sigV4Proxy:
    image: public.ecr.aws/aws-observability/aws-sigv4-proxy:1.7
    name: aps
    port: 8005
    region: ${var.aws_region}
    host: "aps-workspaces.${var.aws_region}.amazonaws.com" # The hostname for AMP service.
nodeSelector:
  pool: system
tolerations:
  - key: dedicated
    operator: Equal
    value: system
    effect: NoSchedule
EOF
  ]
}
The key elements are:
Using an IAM role for the service account (IRSA) to follow the least-privilege principle
Overriding several Helm values to use the created service account, disable the UI, enable remote write to AMP, and schedule OpenCost on the "system" nodes, separately from the main workload and tolerating their taints.
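To check whether remote write is actually happening, one option is to look at the logs of the OpenCost pod and its sigv4 proxy container. A quick sketch, assuming the release name "opencost-charts" and the "monitoring" namespace used later in this post; the exact container names depend on the chart version, so list them first:
# list the containers in the OpenCost deployment (names may differ per chart version)
% kubectl get deploy opencost-charts -n monitoring -o jsonpath='{.spec.template.spec.containers[*].name}'
# look for remote-write activity or errors across all containers
% kubectl logs deploy/opencost-charts -n monitoring --all-containers --tail=200 | grep -iE 'remote|error'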
As a result, however, no new metrics from OpenCost appeared in Prometheus. There were no errors in the logs and nothing helpful in discussions on the internet, so I started looking for another way.
Scraping Opencost metrics from Prometheus
Since OpenCost itself did not push its metrics to Amazon Managed Prometheus, I tried another approach. OpenCost creates a Kubernetes service that exposes /metrics:
% kubectl get svc -n monitoring
NAME              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
opencost-charts   ClusterIP   172.20.203.255   <none>        9003/TCP   22h
% kubectl port-forward service/opencost-charts -n monitoring 9003:9003
Forwarding from 127.0.0.1:9003 -> 9003
Forwarding from [::1]:9003 -> 9003
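With the port-forward in place, a quick curl confirms the exporter is serving Prometheus-format metrics; node_total_hourly_cost below is one of the cost metrics OpenCost exposes (see the OpenCost documentation for the full list):
# fetch the metrics through the forwarded port and pick out one of the cost metrics
% curl -s http://localhost:9003/metrics | grep node_total_hourly_cost | head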

So we can add an additional scrape config to our Prometheus operator, which is also deployed via Terraform along with the EKS cluster:
### Prometheus-operator helm
resource "helm_release" "prometheus" {
name = "kube-prometheus-stack"
repository = "https://prometheus-community.github.io/helm-charts"
chart = "kube-prometheus-stack"
version = var.helm_version
create_namespace = true
namespace = var.namespace
values = [<<EOF
alertmanager:
enabled: false
prometheus:
serviceAccount:
create: false
name: ${kubernetes_service_account.irsa.metadata[0].name}
prometheusSpec:
additionalScrapeConfigs: |
- job_name: opencost
honor_labels: true
scrape_interval: 1m
scrape_timeout: 10s
scheme: http
metrics_path: /metrics
static_configs:
- targets: ['opencost-charts:9003']
remoteWrite:
- url: ${var.amp_remote_write_url}
sigv4:
region: ${var.aws_region}
queue_config:
max_samples_per_send: 1000
max_shards: 200
capacity: 2500
nodeSelector:
pool: system
tolerations:
- key: dedicated
operator: Equal
value: system
effect: NoSchedule
prometheusOperator:
enabled: true
nodeSelector:
pool: system
tolerations:
- key: dedicated
operator: Equal
value: system
effect: NoSchedule
grafana:
enabled: false
kube-state-metrics:
enabled: true
nodeSelector:
pool: system
tolerations:
- key: dedicated
operator: Equal
value: system
effect: NoSchedule
EOF
]
}
Connecting to the Prometheus UI via port-forward:
% kubectl port-forward service/kube-prometheus-stack-prometheus -n monitoring 9090:9090
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090
Handling connection for 9090
Checking targets:

The target is up, and new metrics are being scraped:

You can find descriptions of all the new metrics on the OpenCost website.
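Since the Prometheus operator remote-writes everything to AMP, the same metrics should now be queryable from the workspace directly. A quick check with awscurl, assuming the standard AMP query endpoint; substitute your own region and workspace ID:
# query the AMP workspace for one of the OpenCost metrics
% awscurl --service aps --region <aws-region> "https://aps-workspaces.<aws-region>.amazonaws.com/workspaces/<workspace-id>/api/v1/query?query=node_total_hourly_cost"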
So the new architecture looks like this (simplified a bit):

Visualization in Grafana
There are several community Grafana dashboards available, but not all of them actually work. I've found a couple of interesting ones.
For example, 11270-kubecost. This dashboard shows your Kubernetes cluster costs:
Cluster Wide (Live and Estimative)
Relative price of spot instances
Namespace (Live and Estimative)
Price variation between days and weeks
APP (Live and average)
Price comparison with 7 days ago
PVC Costs
And another one: the Kubecost Dashboard for Grafana Cloud.
Conclusion
Among the numerous solutions available for monitoring the costs of containerized workloads in an EKS cluster, in this post we looked at the free and lightweight OpenCost integrated with AWS Managed Prometheus. OpenCost also works as a standalone solution and has its own simple web UI, but in this particular case we already had AWS Managed Prometheus and Grafana for overall monitoring, so we decided to integrate OpenCost with them and keep all metrics and visualization in one place. For some reason, OpenCost did not send metrics directly to AWS Managed Prometheus via the "remote write" configuration, even though it is documented. So, as a workaround, I scraped the OpenCost metrics with the Prometheus operator, which then forwards them to AWS Managed Prometheus, and it works. I also found a couple of detailed Grafana dashboards on the internet and demonstrated what they look like. Monitor your cloud spend =)