On the 15th of April we spotted that one of our Kubernetes nodes was going rogue. At the time we assumed it was caused by some underlying infrastructure malfunction. The only error we got from our cloud provider's container service was that Docker was restarting frequently on the node. We also carried out a full check of system resources but didn't find anything obviously broken or out of place. The only anomaly was relatively high CPU usage on the ingestion service.
On the 15th of April, the issue was "resolved" by draining and destroying the unhealthy node. That did in fact make the issue go away for a couple of days, until it started to manifest again.
We have been monitoring the cluster since then, and today we finally identified the root cause: one of the microservices was producing bursts of CPU usage, and Docker was restarting before the offending pod could be evicted. As a short-term fix we have isolated the faulty service in a separate node pool. As a longer-term fix we are restructuring the service to prevent the CPU bursts.
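For reference, pinning a workload to a dedicated node pool is usually done with a node selector plus a toleration for the pool's taint. The snippet below is only an illustrative sketch using the Python Kubernetes client; the deployment name (ingestion-service), namespace, pool label, and taint values are hypothetical placeholders, not our actual manifests.

```python
# Sketch: pin a deployment to an isolated node pool.
# Assumptions: deployment "ingestion-service" in namespace "default";
# a node pool labelled pool=ingestion and tainted dedicated=ingestion:NoSchedule.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                # Schedule the pods only onto nodes in the isolated pool...
                "nodeSelector": {"pool": "ingestion"},
                # ...and tolerate that pool's taint so the pods are allowed there
                # while other workloads are kept out.
                "tolerations": [{
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "ingestion",
                    "effect": "NoSchedule",
                }],
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="ingestion-service", namespace="default", body=patch
)
```

The same result can be achieved by editing the deployment manifest directly; the taint on the pool is what keeps unrelated workloads from landing on those nodes.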