Meeting Registration System and RIPE NCC Forum Temporarily Unreachable

Incident Report for RIPE NCC

Postmortem

On Monday, November 27th at 15:35 UTC, our monitoring system alerted us that the Meeting Registration System and the RIPE NCC Forum were unreachable. Further investigation showed that changes to our AWS Control Tower setup had caused us to lose control of the services running in our AWS environment, making them unreachable.

Background

Our AWS environment uses Control Tower for governance and operational efficiency. We were unable to update Control Tower because too many Service Control Policies (SCPs) were applied per Organisational Unit (OU). SCPs define what an AWS account is or is not allowed to do. AWS permits a maximum of five SCPs per OU, and we had six in place.
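
A check of this kind can be automated. The sketch below is illustrative only and is not the tooling we used: it assumes a client object that exposes the AWS Organizations `list_policies_for_target` call (for example a boto3 `organizations` client), and the function names and the OU IDs in the usage note are hypothetical.

```python
from typing import Dict, List

# Quota as stated in this report: at most five SCPs attached per OU.
MAX_SCPS_PER_TARGET = 5


def list_attached_scps(org_client, target_id: str) -> List[dict]:
    """Return every SCP summary directly attached to one root, OU, or account."""
    policies: List[dict] = []
    kwargs = {"TargetId": target_id, "Filter": "SERVICE_CONTROL_POLICY"}
    while True:
        page = org_client.list_policies_for_target(**kwargs)
        policies.extend(page.get("Policies", []))
        token = page.get("NextToken")
        if not token:
            return policies
        # Results may be paginated; request the next page.
        kwargs["NextToken"] = token


def targets_over_quota(org_client, target_ids: List[str]) -> Dict[str, int]:
    """Map each target that exceeds MAX_SCPS_PER_TARGET to its SCP count."""
    counts = {t: len(list_attached_scps(org_client, t)) for t in target_ids}
    return {t: n for t, n in counts.items() if n > MAX_SCPS_PER_TARGET}
```

With boto3 installed and credentials configured, this could be called as `targets_over_quota(boto3.client("organizations"), ["ou-example-1", "ou-example-2"])`, where the OU IDs are placeholders.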

Incident Details

During a routine check, we observed that all accounts had a duplicate SCP. The policy appeared to be redundant, as its presence made no observable operational difference in Control Tower. One copy was inherited from the root OU and the other had been added manually. Neither the control-tower repository nor other documentation gave any rationale for the duplication.
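
As an illustration of how such duplicates can be listed, the sketch below works on plain dictionaries rather than live AWS data. The function name, the `parent_of` and `attached` structures, and all identifiers are assumptions made for this example. Note that it only reports where the same policy is attached at more than one level; as this incident shows, that alone does not establish that one of the attachments is safe to remove.

```python
from typing import Dict, List, Optional, Set, Tuple


def find_duplicate_attachments(
    parent_of: Dict[str, Optional[str]],
    attached: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Return (target, policy, ancestor) for policies attached to both.

    parent_of maps each root, OU, or account to its parent (None for the root).
    attached maps each target to the IDs of the SCPs attached directly to it.
    """
    duplicates = []
    for target, policies in attached.items():
        ancestor = parent_of.get(target)
        # Walk up the hierarchy and compare against every ancestor.
        while ancestor is not None:
            for policy in sorted(policies & attached.get(ancestor, set())):
                duplicates.append((target, policy, ancestor))
            ancestor = parent_of.get(ancestor)
    return sorted(duplicates)
```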

Action Taken

To address the redundancy and unblock Control Tower updates, the manually added SCP was removed in a development account, retaining the copy inherited from the root OU. Tests in that account showed no indication of any disruption, so the decision was made to remove the SCP in all other accounts.

Outcome

Initially, the changes appeared to be successful, with no immediate issues. Shortly afterwards, however, all services within the AWS accounts became unreachable. When the first errors came in, engineers tried to log into the AWS accounts but found that resources were no longer available to them. It was not clear whether the accounts could no longer reach the resources or whether the resources had been deleted. Once the removal of the duplicate SCP was identified as the cause, the SCP was put back in place, which resolved the issue.
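
This report does not go into the exact mechanism by which removing the SCP made resources unavailable. As general background on AWS Organizations, and not as a finding of this investigation, an action is only permitted in an account if the SCPs attached at every level (the root, each OU on the path, and the account itself) allow it. A policy attached higher up therefore does not substitute for an allowing policy attached lower down. The simplified model below illustrates that rule; it ignores explicit Deny statements, and all names in it are hypothetical.

```python
from typing import Dict, Optional, Set


def is_action_allowed(
    action: str,
    account: str,
    parent_of: Dict[str, Optional[str]],
    allows: Dict[str, Set[str]],
) -> bool:
    """True only if every level from the account up to the root allows the action.

    allows maps each root, OU, or account to the union of actions allowed by
    the SCPs attached directly to it; "*" stands for all actions.
    """
    node: Optional[str] = account
    while node is not None:
        granted = allows.get(node, set())
        if "*" not in granted and action not in granted:
            return False  # implicit deny at this level
        node = parent_of.get(node)
    return True
```

In this model, detaching the only policy that grants an action at one level blocks that action for everything beneath it, even when the root still allows it.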

Reflection on decision making

The initial assessment categorised the modification as a low-risk, minor adjustment. This judgement was influenced by the understanding that the duplicate SCPs did not exhibit any operational differences in Control Tower.

Lessons Learned

Identifying Manual Configurations: it is crucial to recognise and document any configurations or policies applied manually. This helps to maintain a clear understanding of the existing environment and aids in troubleshooting and future modifications.
Integration with Version Control System: all configurations, especially manual changes, should be documented and tracked in a Version Control System. This practice ensures that all changes are recorded, reviewed, and reversible if necessary.

Recommendations and future work

Document and Track Manual Actions: develop a process for documenting all manual actions in the AWS environment, particularly those not originally included in the Landing Zone configuration. This ensures that all changes are traceable and accounted for.
Integrate Manual Configurations into VCS: incorporate any manually added configurations or policies into the Version Control System. This integration will streamline change management and enhance the transparency of the environment's configuration state.
Enhance Change Management Processes: strengthen change management protocols to include thorough testing and review of both manual and automated changes. This approach minimises the risk of unintended consequences from alterations in the system.
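
One way the first two recommendations could be supported is a periodic comparison between the SCP attachments declared in version control and those actually present in the environment. The following is a minimal sketch under that assumption. The `Drift` type, the function name, and the input dictionaries are hypothetical, and loading the declared and live state is left out.

```python
from typing import Dict, NamedTuple, Set


class Drift(NamedTuple):
    undocumented: Dict[str, Set[str]]  # attached live, absent from version control
    missing: Dict[str, Set[str]]       # declared in version control, not attached


def attachment_drift(
    declared: Dict[str, Set[str]],
    live: Dict[str, Set[str]],
) -> Drift:
    """Compare declared and live SCP attachments for each target."""
    undocumented: Dict[str, Set[str]] = {}
    missing: Dict[str, Set[str]] = {}
    for target in sorted(set(declared) | set(live)):
        want = declared.get(target, set())
        have = live.get(target, set())
        if have - want:
            undocumented[target] = have - want  # likely a manual addition
        if want - have:
            missing[target] = want - have  # likely a manual removal
    return Drift(undocumented, missing)
```

Any entry under `undocumented` points to a manual attachment that should either be recorded, with its rationale, or reviewed before removal.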

Timeline

15:10 UTC An engineer removed the duplicate policy from the OUs that had more than five SCPs attached.
15:35 UTC Our monitoring solution notified us of an issue with the reachability of AWS services.
15:50 UTC Investigation started to identify what was wrong.
16:15 UTC The removed SCP was identified as the cause and put back in place. Shortly afterwards the applications began to come back online.

Conclusion

This incident underscores the importance of meticulous planning and testing in cloud governance and policy management. By adopting the lessons learned and recommendations, we aim to prevent similar occurrences in the future and maintain a stable and secure AWS environment.

Posted Nov 28, 2023 - 16:13 CET

Resolved

The issue has been resolved. All systems are accessible as usual.

Posted Nov 27, 2023 - 17:41 CET

Update

We are continuing to investigate this issue.

Posted Nov 27, 2023 - 17:39 CET

Investigating

We are experiencing issues with our event registration system and the RIPE NCC Forum. RIPE 87 meeting registration and the RIPE NCC General Meeting registration might be inaccessible and RIPE 87 attendees might be unable to access their social options pages. We are investigating these issues.

Posted Nov 27, 2023 - 16:45 CET