Monitoring a Kubernetes cluster is no small task, and it demands an equally non-trivial approach. To illustrate the kind of failure that can occur, here is an example from one of our AWS deployments.
One of our clusters reported a healthy status: SkyDNS was working and all pods had started. A few minutes later, however, SkyDNS went into the CrashLoopBackOff state. The application containers came up but were dysfunctional, because after their first restart they could no longer reach the database.
The cluster was effectively down, but we could not get a clear picture of what was happening by looking at events and pod statuses.
Once we connected to the master node and viewed the SkyDNS pod's logs, they revealed problems with etcd. High latency on a network-attached disk was causing read and write failures, so etcd could not write to its file system. Although etcd was configured correctly and appeared to be working, it was not consistently available as a Kubernetes service.
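A lightweight probe of etcd's `/health` endpoint can separate "the process is up" from "etcd can actually serve requests", which is exactly the distinction that was hidden here. This is a minimal sketch; the URL and port `2379` assume a default client endpoint, so substitute your own.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """etcd's /health endpoint returns JSON like {"health": "true"}.
    Treat anything else, or unparsable output, as unhealthy."""
    try:
        return json.loads(body).get("health") == "true"
    except (ValueError, AttributeError):
        return False


def probe_etcd(url: str = "http://127.0.0.1:2379/health", timeout: float = 2.0) -> bool:
    """Return True only if etcd answers and reports itself healthy.
    A stalled disk typically shows up here as a timeout or an unhealthy
    response, even while the etcd process itself keeps running."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode())
    except OSError:  # connection refused, DNS failure, timeout, ...
        return False
```

Running such a probe on a schedule would have flagged the etcd degradation long before the SkyDNS crash loop made it visible.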
We evaluated several options for monitoring in the context of Kubernetes:
- A traditional monitoring agent
- Alternative approaches, such as application-specific custom smoke tests
- A classic monitoring solution
There is no shortage of traditional monitoring solutions.
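The second option above, custom smoke tests, can be as simple as checking from inside the cluster that an application's critical dependencies are actually reachable. The sketch below checks cluster DNS (the failure mode in the incident described earlier) and a TCP connection to a database; the service hostnames and port are hypothetical placeholders.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Check that cluster DNS works: in the incident above, pods looked
    fine while SkyDNS was crash-looping; this catches that case."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check that a TCP dependency (e.g. the database) accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_test() -> dict:
    # Hypothetical in-cluster service names; substitute your own.
    return {
        "dns": can_resolve("kubernetes.default.svc.cluster.local"),
        "database": can_connect("db.default.svc.cluster.local", 5432),
    }
```

A smoke test like this, run as a pod in the cluster, would have reported "DNS broken, database unreachable" while `kubectl get pods` was still showing everything as started.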
A typical agent is a lean, very small (single executable), well-tested daemon running on thousands of machines – perfect for a small setup, but typically limited to monitoring a single system. This is its biggest drawback.
One of the problems we found with such monitoring tools is the limited set of built-in checks and the lack of extensibility. Although they can be configured, we had to extend their functionality by writing scripts or special-purpose applications driven through a weak interface.
More importantly, we found it very difficult to connect individual monitoring instances into a coherent, highly available, and resilient network in which each agent collects its own share of the information and collaborates to keep that information up to date.