Implementing "sleep" in the CloudFormation stack for the delay caused by IAM eventual consistency
- Oleksii Bebych
- Dec 4, 2023
Updated: Dec 7, 2023
Problem statement
Our customer uses Customizations for AWS Control Tower for account vending. Each new account in a specific organizational unit gets a baseline set of resources, for example, IAM roles, a VPC with all networking components, and an ECS cluster for further application deployment. Creating an ECS cluster requires a service-linked role, which should be created explicitly when the cluster is deployed with CloudFormation. So we used the native CloudFormation DependsOn attribute to enforce a strict order of resource creation.

This is the initial CloudFormation stack:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod
Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com
  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn:
      - FargateClusterRole
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]
If the service-linked role did not already exist, the stack sometimes failed. The root cause is the following: CloudFormation sends an API call to create the service-linked role and receives a successful response, but if we look for the role in the IAM console at that same moment, it is not guaranteed to be displayed. It is not obvious, and not everyone knows it, but changes to IAM configuration can take some time to propagate. The AWS documentation describes it this way:
As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any change that you make in IAM (or other AWS services), including tags used in attribute-based access control (ABAC), takes time to become visible from all possible endpoints. Some of the delay results from the time it takes to send the data from server to server, from replication zone to replication zone, and from Region to Region around the world. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out.
So, as a workaround, we had to add a "sleep" step between the creation of the service-linked role and the creation of the ECS cluster, giving the change time to propagate so that the stack deploys reliably.
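A fixed sleep is the simplest approach. For comparison, the sketch below polls until a check succeeds, which is a different technique from the one used in this post. The `wait_until` helper and the fake `role_is_visible` check are hypothetical names. In a real function the check could be an IAM lookup, although a single successful read may not prove that the role has propagated to every endpoint.

```python
import time


def wait_until(check, timeout=60.0, interval=2.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or the timeout expires.

    Returns the number of attempts made; raises TimeoutError otherwise.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(f"condition not met after {attempts} attempts")
        sleep(interval)


# Hypothetical stand-in for an IAM lookup: reports the role as visible on the third call.
calls = {"count": 0}


def role_is_visible():
    calls["count"] += 1
    return calls["count"] >= 3


attempts = wait_until(role_is_visible, timeout=10.0, interval=0.0)
print(attempts)
```

Polling can finish sooner than a fixed delay, but it needs IAM read permissions and more code, which is one reason a plain sleep can be a reasonable trade-off.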
Proposed solution
Unfortunately, something as simple as a "sleep" delay is not available in CloudFormation at the time of writing. So we had a couple of options.
The first idea was to create the service-linked role in an earlier step of account vending, for example, during VPC creation, but this is not a logically clean solution. The service-linked role belongs to the ECS stack, so ideally it should be created within it.
The second idea was to use a CloudFormation custom resource backed by a Lambda function, where we can implement whatever we need, including a "sleep" timeout.
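For context, CloudFormation invokes the function behind a custom resource with a request event and then waits for a response sent to the pre-signed `ResponseURL` from that event. The sketch below builds such a response body by hand. The sample event values are made up, and `build_response` is a hypothetical helper that does roughly what the `cfnresponse` module does for inline Lambda code.

```python
import json


def build_response(event, status, data=None, physical_id=None, reason=""):
    """Build the JSON body CloudFormation expects back from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id or event.get("PhysicalResourceId", "delay"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


# Made-up Create event, trimmed to the fields used here.
sample_event = {
    "RequestType": "Create",
    "ResponseURL": "https://example.com/presigned-url",
    "StackId": "arn:aws:cloudformation:us-east-1:111111111111:stack/ecs/abc",
    "RequestId": "11111111-2222-3333-4444-555555555555",
    "LogicalResourceId": "Delay",
    "ResourceProperties": {"TimeToWait": "20"},
}

body = build_response(sample_event, "SUCCESS", {"Data": "wait complete"})
print(json.dumps(body, indent=2))
```

The body would be sent with an HTTP PUT to `ResponseURL`. The sample passes `TimeToWait` as a string, which is why a handler would convert it with `int()` before sleeping.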

This is the new CloudFormation stack:
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS ECS Fargate cluster'
Parameters:
  CapacityProviderTypes:
    Type: CommaDelimitedList
    AllowedValues:
      - FARGATE
      - FARGATE_SPOT
  EnvironmentTag:
    Type: String
Conditions:
  IsProd: !Equals
    - !Ref EnvironmentTag
    - prod
Resources:
  FargateClusterRole:
    Type: AWS::IAM::ServiceLinkedRole
    Properties:
      AWSServiceName: ecs.amazonaws.com
  FargateCluster:
    Type: AWS::ECS::Cluster
    DependsOn:
      - Delay
    Properties:
      ClusterName: FargeetClusterPal
      CapacityProviders: !Ref CapacityProviderTypes
      ClusterSettings:
        - Name: containerInsights
          Value: enabled
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !If [IsProd, FARGATE, FARGATE_SPOT]
  Delay:
    Type: 'Custom::Delay'
    DependsOn:
      - FargateClusterRole
    Properties:
      ServiceToken: !GetAtt DelayFunction.Arn
      TimeToWait: 20
  ### Custom resource for Delay (sleep), that is natively absent in CloudFormation
  LambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: /
      Policies:
        - PolicyName: "lambda-logs"
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource:
                  - "arn:aws:logs:*:*:*"
  DelayFunction:
    Type: 'AWS::Lambda::Function'
    Properties:
      Handler: "index.handler"
      Timeout: 120
      Role: !GetAtt 'LambdaRole.Arn'
      Runtime: python3.10
      Code:
        ZipFile: |
          import json
          import cfnresponse
          import time
          def handler(event, context):
              time_to_wait = int(event['ResourceProperties']['TimeToWait'])
              print('wait started')
              time.sleep(time_to_wait)
              responseData = {}
              responseData['Data'] = "wait complete"
              print("wait completed")
              cfnresponse.send(event, context, cfnresponse.SUCCESS, responseData)

As a result, we have a couple of new blocks in the CloudFormation template that could be replaced by a single parameter if CloudFormation supported delays natively. Such a feature has been requested since 2020, but it is still absent from CloudFormation. Until it arrives, we can work around this limitation with a Lambda-backed custom resource.
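One possible refinement of the handler in the template above, shown as a sketch rather than the deployed code: skip the sleep on stack deletion, cap the wait below the function timeout, and always report a result so that CloudFormation is not left waiting if something goes wrong. The names `run_delay`, `MAX_WAIT_SECONDS`, and the injected `send` and `sleep` arguments are assumptions made so the logic can be tried locally.

```python
import time

MAX_WAIT_SECONDS = 100  # assumption: stay below the function's 120-second Timeout


def run_delay(event, send, sleep=time.sleep):
    """Sleep on Create/Update only, and always report a result via send()."""
    try:
        if event.get("RequestType") != "Delete":
            wait = int(event["ResourceProperties"]["TimeToWait"])
            sleep(max(0, min(wait, MAX_WAIT_SECONDS)))
        send("SUCCESS", {"Data": "wait complete"})
    except Exception as error:  # report failure instead of leaving the stack waiting
        send("FAILED", {"Data": str(error)})


# Local dry run with fake sleep and send functions.
slept, sent = [], []


def record(status, data):
    sent.append((status, data))


run_delay({"RequestType": "Create", "ResourceProperties": {"TimeToWait": "20"}}, record, slept.append)
run_delay({"RequestType": "Delete", "ResourceProperties": {"TimeToWait": "20"}}, record, slept.append)
run_delay({"RequestType": "Create", "ResourceProperties": {}}, record, slept.append)
print(slept, [status for status, _ in sent])
```

Inside the template, `send` would be a small wrapper around `cfnresponse.send(event, context, status, data)`.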
Conclusion
In this post, we looked at the CloudFormation custom resource as a tool for implementing a "sleep" delay between the creation of dependent resources within a stack. Custom resources are a powerful feature that can be used for many other kinds of logic and for interactions with third parties.