And again 10:50 - 11:03
Edit: updated with part of the incident report:
1. Timeline:
The incident had the following timeline:
- 10:00 AM Issue appeared on the compute node.
- 10:06 AM The OpenStack team was alerted by our monitoring system and started investigating.
- 10:07 AM A notification of the issue was posted on Twitter.
- 10:15 AM The team found that the server could not be accessed, so a reboot was initiated.
- 10:21 AM The OpenStack team started the firmware upgrade process.
- 10:36 AM An update was posted on Twitter.
- 10:58 AM The server came back up after a final reboot and the team started verifying all instances.
- 11:07 AM The service desk posted on Twitter that the issue was resolved.
2. Incident description:
Due to an error in a memory module in the node, the node got stuck and went offline. As a result, all virtual servers on this node were unreachable.