On Monday, 27 November at 15:35 UTC, our monitoring system alerted us that the Meeting Registration System and RIPE NCC Forum were temporarily unreachable. Further investigation showed that changes to our AWS Control Tower setup had left us without control over the services running in our AWS environment, making them unreachable.
Our AWS environment uses Control Tower for governance and operational efficiency. We ran into limitations when updating Control Tower because too many Service Control Policies (SCPs) were applied per Organisational Unit (OU). An SCP states what an AWS account is or is not allowed to do. AWS permits a maximum of five SCPs per OU, and we had six in place.
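The limit described above can be checked mechanically. The following is a minimal sketch in Python, assuming the list of SCPs per OU has already been exported into a plain dictionary; the OU and policy names are hypothetical placeholders, and how our six policies were counted (directly attached versus inherited) is not modelled here.

```python
from typing import Dict, List

# Limit stated above: a maximum of five SCPs per OU.
MAX_SCPS_PER_OU = 5


def find_ous_over_limit(
    scps_per_ou: Dict[str, List[str]], limit: int = MAX_SCPS_PER_OU
) -> Dict[str, int]:
    """Return {ou_name: scp_count} for every OU with more than `limit` SCPs."""
    return {
        ou: len(policies)
        for ou, policies in scps_per_ou.items()
        if len(policies) > limit
    }


# Hypothetical inventory; these names are illustrative, not our real OUs or policies.
scps_per_ou = {
    "ou-example-workloads": [
        "scp-a", "scp-b", "scp-c", "scp-d", "scp-e", "scp-f",
    ],
    "ou-example-sandbox": ["scp-a", "scp-b"],
}

over_limit = find_ous_over_limit(scps_per_ou)
print(over_limit)
```

Running a check like this before a Control Tower update would show which OUs need attention, without saying anything about which policy is safe to remove.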
During a routine check, we observed that all accounts had a duplicate SCP. The policy appeared to be redundant, as its presence made no observable operational difference in Control Tower. One copy of the SCP was inherited from the root OU, and the other had been added manually. Neither the control-tower repository nor our other documentation gave any rationale for the duplication.
To address the redundancy and the limitation on updating Control Tower, we removed the manually added SCP and kept the one inherited from the root OU. We then ran tests in a development account to confirm that the removal did not adversely affect operations. The tests showed no sign of disruption, so we decided to remove the SCP in all other accounts as well.
Initially, the changes appeared to be successful, with no immediate issues. Following these adjustments, however, all services within the AWS accounts became unreachable. When the first errors came in, engineers tried to log into the AWS account but found that resources were no longer available to them. It wasn’t clear whether the accounts could no longer reach the resources or whether the resources had been deleted. Once we identified the removal of the duplicate SCP as the cause, we put it back in place, which resolved the issue.
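AWS documents that SCP allow statements do not simply cascade downwards: an action is permitted in an account only if it is allowed at every level on the path from the root, through each OU, to the account itself. The sketch below models only that rule. It is a simplified illustration with hypothetical target names, it ignores explicit Deny statements and action wildcards other than `*`, and it is not a reconstruction of our actual policies.

```python
from typing import Dict, List, Set


def is_allowed(action: str, path: List[str], allows: Dict[str, Set[str]]) -> bool:
    """Return True only if `action` is allowed at every level in `path`.

    `path` lists the targets from the root down to the account.
    `allows` maps each target to the actions its attached SCPs allow;
    the entry "*" stands for "all actions".
    """
    for target in path:
        allowed_here = allows.get(target, set())
        if "*" not in allowed_here and action not in allowed_here:
            # A missing Allow at any single level blocks the action.
            return False
    return True


# Hypothetical hierarchy: root -> OU -> account.
path = ["root", "ou-example", "account-example"]

# Before: an allow-all policy is present at every level.
before = {"root": {"*"}, "ou-example": {"*"}, "account-example": {"*"}}

# After: the copy at the account level is removed; the root copy remains.
after = {"root": {"*"}, "ou-example": {"*"}, "account-example": set()}

print(is_allowed("ec2:DescribeInstances", path, before))
print(is_allowed("ec2:DescribeInstances", path, after))
```

Under this model, a policy that looks like a duplicate of one higher in the hierarchy can still be the only thing granting access at its own level, which is one way a removal that seems harmless can make every resource in an account unavailable.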
The initial assessment categorised the modification as a low-risk, minor adjustment. This judgement was influenced by the understanding that the duplicate SCPs did not exhibit any operational differences in Control Tower.
This incident underscores the importance of meticulous planning and testing in cloud governance and policy management. By adopting the lessons learned and recommendations, we aim to prevent similar occurrences in the future and maintain a stable and secure AWS environment.