On the 15th of April we spotted that one of our Kubernetes nodes was going rogue. At the time we assumed it was caused by some underlying infrastructure malfunction. The only error we got from our cloud provider's container service was that Docker was restarting frequently on the node. We also carried out a full check of system resources but didn't find anything obviously broken or out of place. The only anomaly was relatively high CPU usage on the ingestion service.
On the 15th of April, the issue was "resolved" by draining and destroying the unhealthy node. That did in fact make the issue go away for a couple of days, until it started to manifest again.
We have been monitoring the cluster since then, and today we finally identified the root cause: one of the microservices was producing bursts of CPU usage, and Docker was restarting before the offending pod could be evicted. As a short-term fix we have isolated the faulty service in a separate node pool. As a longer-term fix we are restructuring the service to prevent the CPU bursts.
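For reference, pinning a workload to a dedicated node pool is usually done with a node selector plus a toleration for the pool's taint. The snippet below is only an illustrative sketch using the Python Kubernetes client; the deployment name (ingestion-service), namespace, pool label, and taint values are hypothetical placeholders, not our actual manifests.

```python
# Sketch: pin a deployment to an isolated node pool.
# Assumptions: deployment "ingestion-service" in namespace "default";
# a node pool labelled pool=ingestion and tainted dedicated=ingestion:NoSchedule.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                # Schedule the pods only onto nodes in the isolated pool...
                "nodeSelector": {"pool": "ingestion"},
                # ...and tolerate that pool's taint so the pods are allowed there
                # while other workloads are kept out.
                "tolerations": [{
                    "key": "dedicated",
                    "operator": "Equal",
                    "value": "ingestion",
                    "effect": "NoSchedule",
                }],
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="ingestion-service", namespace="default", body=patch
)
```

The same result can be achieved by editing the deployment manifest directly; the taint on the pool is what keeps unrelated workloads from landing on those nodes.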