How to Monitor VMware Infrastructure for Free

If your IaaS backbone is formed by bare-metal machines with the industry leading VMware hypervisor, you’ve probably made this choice for a good reason – efficiency. When talking scale with massive I/O, CPU and/or memory loads, the public Cloud/SaaS offerings such as AWS Lambda are still far away from the classic bare-metal model, which yields much higher performance in a controlled and predictable environment. ESXi (product name – VMware vSphere Hypervisor), VMware’s bare-metal hypervisor, is the industry’s standard hypervisor with roughly 70%+ of the market share in this segment because of its robustness and performance. With a footprint of just 150MB it can support up to 128 vCPUs, 6TB or RAM and all kinds of OSs on top of it.

VMware vSphere is the enterprise product that adds a Virtual Center that automates the management of ESXi hypervisors and adds enterprise class features for mission-critical applications, such as the famous vMotion (ability to move a VM between ESXi hosts without downtime), the high-availability, proactive monitoring or disaster recovery features. These features, however, come at a certain price (currently at 6.5k USD for a single vSphere license, according to VMware’s website). On the other hand the ESXi hypervisor is offered for free. In case your software architecture employs stateless design, e.g. through Dockers/Kubernetes, then you can leverage the agility and robustness of stateless architectures and cut down some of the costly vSphere high availability features, such as shadow VM, vMotion, etc…. For example, in case your monitoring system shows that a container is bogged down, just kill it and start another node. From IaaS perspective this greatly simplifies your IaaS and reduces costs by just employing a number of good old and free ESXi-s…given that you have a good monitoring layer laid out.

In this post, we will review how to easily add free, yet enterprise-class monitoring layer for your light-weight VMware virtualisation tier.

What do We Mean by Syslog?

Let’s first get some terminology in place. The standard way to monitor UX systems is syslog. Syslog is an overloaded with meaning term. When people talk syslog they could mean the syslog daemon, called syslogd, that all UX systems have to collect and dump logs (usually in the /var/log directory). They could also mean the standard kernel functions for logging, called syslog. Less frequently they mean the syslog protocol (e.g. RFC-5424) for relaying logs over the network. There are also log relayers, which purpose is to receive, process and retransmit log messages; and there analysers that understand the syslog protocol, can collect large amounts of logs, extract, systemise and analyse them, even with AI predictive analytics which can help you discover failures before they occur.

Here is a simplistic representation of this ecosystem:

Exactly this kind of chain is used in the enterprise VMware vSphere monitoring feature. So our goal is to achieve a similar stack with free tools.

Choosing Your Syslog

Regardless of the transportation mean (UX domain socket for local or TCP/UDP socket for remote), the RFC-5424 syslog protocol should be observed or major loss of information could happen. Sadly in the pleiad of syslog client and server libraries, a very small number are strictly RFC-5424-compliant by default. If you don’t pick the right combination there is significant configuration and matching hustle associated.

Another important aspect is the performance – or how efficiently the local or remote syslogd can collect and relay messages. Among all contenders, syslog-ng seems to be a popular choice. However, when I led the vSphere monitoring team, we evaluated a number of syslogd libraries carefully and the open-source small and light-weight rsyslog showed much greater (40%+) efficiency, when compared to all others.

Albeit not that rich and popular as syslog-ng, Rsyslog also has all the needed features for enterprise class logging, such as throttling, flood protection, queue de/serialisation configuration for rapid-reliable log queueing, custom extensible log rotation, easily configurable and extensible log processing, etc. As a matter of fact, we were so impressed by its overall scoring that we selected it as the VMware vSphere syslogd, playing a key part in the enterprise VMware vSphere Monitoring layer. All that we need to do in order to have that “enterprise” syslogd is install a Linux flavour OS. If by some chance rsyslog is not present, install the rsyslog package as well, using your favourite package-installation tool. Here is how to do it for Ubuntu and Redhat.

Setup and Configuration

The next step is configuring rsyslogd to collect, process and store messages. This could be an altogether separate article, but luckily rsyslog has good documentation, along with example configurations for the most common case. Log filters and actions are the basic terminology. A filter specifies what messages to get while the action – what to do with them, (e.g. store them in a file). Here is a basic configuration example to get started with the log-storage and retention which, although utterly important, are not the focus of this post.

Now that we have set up a robust syslogd for collecting and relaying our logs, let’s configure the ESXi to send their logs to our remote syslogd (rsyslog). Here is the VMware article on the matter.

Finally, you may want to also relay all collected logs to some log-analysis tool or database that has a decent way to visualise the health state of your entire IaaS thanks to the logs you fed in and even predict failures based on integrated ML algorithms. Such tools are VMware’s LogInsight, the free plan of (limitations apply) Splunk, etc. In order to do that first you’ll need to add a line in the rsyslog configuration instructing it to not only store the received logs but also relay them to the remote log analyser. Here is an example syslog.conf that relays all messages to a remote syslogd or log analyser.

This is it. Now you have added a very capable, yet low-cost, monitoring layer to your infrastructure.

— About the author: Doichin Yordanov is R&D engineer with years experience in scalable enterprise infrastructure software. He has been technical lead of the VMware vSphere Log Collection and Monitoring.