I’m using VMware vSphere Hypervisor (ESXi) to host my Puppet Enterprise infrastructure. I’ve already tuned the OS and PE JVM services. Can I improve PE performance without adding more infrastructure nodes?
Version and installation information
PE version: Any
OS version: Any
vSphere version: 3.5 and later (ESXi with relaxed co-scheduling). vSphere and its ESXi component have identical version numbers.
Solution
Use this overview to avoid common issues without adding nodes. You can tune your VM so that the VMware scheduler gives the guest OS running Puppet better throughput, and scale up efficiently by adding more virtual CPUs. After you scale up, check metrics and scale down if you need to.
VMware’s authoritative Performance Best Practices manuals, KB articles, and documentation provide thorough guidance on tuning your VMs.
Tuning your VMware installation is outside the scope of Puppet Support. If you need help after reading this article, please open a ticket with VMware support or reach out to us for Professional Services.
Avoid snapshot issues
After you take a snapshot, all subsequent changes are written to the disk as a delta file with a name similar to *-000001-delta.vmdk. Over time, the delta file can grow very large, causing I/O performance issues. When a snapshot is consolidated (removed by merging it into the base disk), the contents of the delta file stream through RAM while they are written to the disk. This can slow the hypervisor, and in extreme cases, your VM might freeze. To avoid data loss, wait for the process to complete.
When you’re no longer using old snapshots, get rid of them. If you’re using third-party backup services, you might have performance issues due to snapshot-related files that were not cleaned up. Check for leftover delta files or snapshots.
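If you have access to the ESXi Shell, one quick way to check for leftover snapshots and delta files is with vim-cmd and a filesystem search. This is a minimal sketch: <vmid> is a placeholder for the ID shown by the first command, and the search pattern assumes the standard -delta.vmdk naming for snapshot delta files.
# List registered VMs and their IDs (Vmid column)
vim-cmd vmsvc/getallvms
# Show any snapshots that exist for a given VM
vim-cmd vmsvc/snapshot.get <vmid>
# Search the datastores for leftover snapshot delta files
find /vmfs/volumes/ -name '*-delta.vmdk'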
Update vSphere
If you’re using vSphere version 6.0 or earlier, update to version 6.5 or later to avoid issues. Newer versions of the ESXi scheduler have performance improvements and smarter default settings.
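To confirm which version a host is running before you plan an upgrade, check the host's Summary tab in vSphere, or run either of the following from the ESXi Shell:
# Show the ESXi version and build number
vmware -v
# Show detailed version information
esxcli system version get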
Scaling up: Adding more virtual CPUs
In virtualization, the term CPU refers to your operating system’s logical processors. This number can be determined by cores or threads, either physical or virtual. In ESXi or vSphere, to see the total number of available logical processors, select the host and find the Total Processors or Logical Processors field.
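If you prefer the ESXi Shell to the UI, the same topology information is available from esxcli. The following command reports CPU packages (sockets), cores, threads, and hyperthreading status for the host:
esxcli hardware cpu global get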
Due to the speed of modern processors and latency-sensitive workloads, the distance from the processor to memory impacts performance. In your server’s hardware, the areas where physical cores are closest to the memory banks are called non-uniform memory access (NUMA) nodes. To optimize performance, the VMware scheduler maps the memory for your virtual cores to the same NUMA node if possible. So add new virtual cores that allow access to the closest memory possible.
Typically, each CPU socket has one NUMA node. Since that’s not always the case, refer to your vendor’s documentation for the number of CPU sockets and NUMA nodes on your servers. For the following steps, assume that each CPU socket has one NUMA node.
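As a cross-check against the vendor documentation, the host also reports its own NUMA node count. One way to read it from the ESXi Shell is shown below; the output includes a NUMA Node Count line alongside the physical memory totals:
esxcli hardware memory get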
Steps
1. For a single VM’s hardware, calculate virtual cores per virtual socket. (To check a VM’s current configuration from the ESXi Shell, see the commands after these steps.)
2. For the hypervisor hardware, calculate physical cores per NUMA node.
3. To ensure that you’re not overcommitting resources, check the number of virtual sockets and NUMA nodes. The number of virtual sockets must not exceed the total number of NUMA nodes.
4. Start with one virtual socket. Gradually increase the number of virtual cores until either performance improves, or the number of virtual cores per virtual socket is equal to the number of physical cores per NUMA node.
5. If the number of virtual cores per virtual socket is equal to the number of physical cores per NUMA node and performance hasn’t improved, do one of the following:
   - Disable VM system features such as vSAN and NSX.
   - Add a virtual socket, and evenly divide the virtual cores among the fewest number of NUMA nodes required for the memory and cores that you need.
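Before you start adjusting a VM, it helps to know how it’s currently configured. The following sketch reads the settings from the ESXi Shell; the datastore and VM directory in the grep path are placeholders, and numvcpus and cpuid.coresPerSocket are the standard VMX keys for total virtual CPUs and cores per virtual socket. Dividing the first by the second gives the number of virtual sockets.
# Find the VM's ID and the path to its .vmx file
vim-cmd vmsvc/getallvms
# Inspect the configured vCPU count and cores per virtual socket
grep -iE 'numvcpus|coresPerSocket' /vmfs/volumes/<datastore>/<vm>/<vm>.vmx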
Examples
Refer to your vendor’s documentation for the number of CPU sockets and NUMA nodes on your servers. For the following examples, assume that each CPU socket has one NUMA node.
Suboptimal
8 virtual cores, 4 virtual sockets, 16 physical cores, 2 NUMA nodes
- 8 virtual cores/4 virtual sockets = 2 virtual cores/virtual socket
- 16 physical cores/2 NUMA nodes = 8 physical cores/node
- 4 virtual sockets is greater than 2 NUMA nodes
The number of virtual sockets is greater than the number of NUMA nodes. The memory topology is fragmented, so the OS can’t use the available memory efficiently.
Optimal
8 virtual cores, 2 virtual sockets, 16 physical cores, 2 NUMA nodes
- 8 virtual cores/2 virtual sockets = 4 virtual cores/virtual socket
- 16 physical cores/2 NUMA nodes = 8 physical cores/node
- 2 virtual sockets is equal to 2 NUMA nodes
The number of virtual sockets doesn’t exceed the number of NUMA nodes. If system resources allow it, the scheduler can now optimize for locality or span across NUMA nodes as required.
Optimal
8 virtual cores, 1 virtual socket, 16 physical cores, 2 NUMA nodes
- 8 virtual cores/1 virtual socket = 8 virtual cores/virtual socket
- 16 physical cores/2 NUMA nodes = 8 physical cores/node
- 1 virtual socket is less than 2 NUMA nodes
Because each virtual core maps to a physical core in a single nearby NUMA node, the largest scheduling domain is maintained and the full resources of that NUMA node are used. Little headroom is left over, so to maximize performance, ensure that system features like vSAN and NSX are disabled.
Additional recommendations
In general:
- Keep advanced vNUMA defaults as-is.
- Disable CPU Hot Add (a sample check follows this list).
- Don’t overcommit host resources. Make sure that the number of physical CPUs on your server is greater than the total number of virtual CPUs across your VMs.
- When you’ve exceeded the amount of memory or cores available from one NUMA node, divide the virtual cores evenly among the fewest number of NUMA nodes required for the memory or cores you need. When a VM spans nodes, do not assign an odd number of cores.
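One way to confirm that CPU Hot Add is off is to check the VM’s VMX file; vcpu.hotadd is the VMX key behind that setting, and the path below is a placeholder for your datastore and VM directory. Expect FALSE, or no output if the key has never been set.
grep -i 'vcpu.hotadd' /vmfs/volumes/<datastore>/<vm>/<vm>.vmx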
After scaling up: Make sure that you’ve added resources efficiently
To confirm that you’ve added resources efficiently and that your VM isn’t fighting for resources with itself or other VMs, use esxtop to look at CPU performance metrics. You can use these metrics to improve performance by changing resource allocations for the Puppet VM and other VMs.
Using esxtop to view metrics
In this overview, esxtop is used to view metrics instead of vCenter performance charts. In vCenter, the CPU Ready value discussed below is a summation shown in milliseconds, which doesn’t give context about whether the numbers you’re viewing are good or bad. In esxtop, the value is normalized as a percentage, giving you more context about the performance of your VMs. Learn more about CPU summation and CPU % ready.
When connected to the ESXi Shell, you can use esxtop to check performance metrics. World IDs (WID) match your virtual machines’ VMIDs. Press Shift+V to view only virtual machines. Press L to limit the display to a single group. Here is an example output:
4:01:21pm up 2 days 11:22, 102 worlds; CPU load average: 0.15, 0.12, 0.13
PCPU USED(%): 3.7 1.2 1.7 1.8 2.0 17 2.8 1.4 13 10 1.9 38 15 10 12 19 AVG: 10
PCPU UTIL(%): 5.5 3.1 2.7 4.1 2.8 16 2.4 2.8 16 9.2 3.5 50 17 9.5 12 19 AVG: 11
CORE UTIL(%): 7.5 4.8 18 4.1 29 50 25 28 AVG: 20
ID GID NAME NWLD %USED %RUN %SYS %WAIT %VMWAIT %RDY %IDLE %OVRLP %CSTP %MLMTD %SWPWT
2358810 2358810 vSphere Dat 5 23.94 23.78 0.11 488.59 0.00 3.86 172.91 0.14 0.13 0.00 0.00
2871 2871 el1 6 7.15 8.29 0.02 343.25 0.16 1.13 41.21 0.04 0.00 0.00 0.00
12344 2223 vmx 1 0.10 0.03 0.07 99.12 - 0.07 0.00 0.00 0.00 0.00 0.00
To analyze metrics in tools such as perfmon, export results to .csv using the batch (-b) option. Use the -n option to indicate the number of lines to save. The default interval is 5 seconds:
esxtop -b -n <LINES> > <FILE NAME>.csv
For example:
esxtop -b -n 20 > output.csv
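The batch export produces one wide CSV row per sampling interval, with perfmon-style column headers. As a rough sketch (the exact header text can vary between esxtop versions), you can locate the ready-time columns before importing the file into another tool:
# Print the header fields that mention "% Ready", along with their column numbers
head -1 output.csv | tr ',' '\n' | grep -n '% Ready'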
CPU performance metrics
This is an overview of CPU performance metrics. For a more comprehensive guide, see these VMware resources:
- Troubleshooting ESX/ESXi virtual machine performance issues
- Troubleshooting a virtual machine that has stopped responding: VMM and Guest CPU usage comparison.
%RDY (ready percent)
%RDY is the percentage of time that the VM’s virtual cores were ready to be dispatched into a running state, but were not. Contention for resources on the host can cause this value to increase. Keep %RDY under 5%. Values over 10% cause slow VM performance.
Check this metric first.
- To reduce high %RDY, reduce the number of virtual cores in the VM, or reduce the number of other VM virtual cores.
- If you’re having trouble reducing %RDY, determine whether contention with other VMs is the issue by setting the CPU reservation to 100%.
- If you’re willing to trade off higher ready time (scheduling contention) for a lower amount of stolen time (hardware resource contention), you can improve performance by enabling simultaneous multithreading (SMT).
%CSTP (co-stop)
%CSTP is the percentage of time that the VM’s virtual cores are stopped to compensate for skew, which occurs when cores move ahead of or behind their siblings due to CPU resource contention.
It is normal for %CSTP to spike, so measure a rolling average. Reducing %CSTP can improve performance, but remember that co-stop is a scheduler feature that keeps virtual cores in sync so the guest OS doesn’t have to compensate for skew, so some co-stop time is expected.
Keep %CSTP under 3% average. If %CSTP is high, you can reduce the amount of CPU resource contention by reducing the number of virtual cores on your VM. Having fewer virtual cores reduces the number of scheduling issues.
%VMWAIT
%VMWAIT is the percentage of time a VM’s virtual CPUs spend in a blocked state, typically when trying to access non-CPU hardware. This differs from %WAIT, which includes both %VMWAIT and %IDLE (percentage of time spent in idle cycles). Since it’s normal to have idle cycles while waiting for access, %VMWAIT is a more useful performance metric.
Depending on your version of vSphere/ESXi, this metric might not be in your default view. To toggle and save this field under CPU metrics, press F, C, Shift+F, and then Shift+W.
High %VMWAIT is typically a result of network or storage resource contention. An average value of less than 2% is ideal. If %VMWAIT is consistently over 5%, performance is affected. Two common ways to improve %VMWAIT are using a SAN for storage and, if you’re using iSCSI, enabling jumbo frames. For more causes of high %VMWAIT, read the Storage heading in VMware’s Troubleshooting ESX/ESXi virtual machine performance issues article.
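If you enable jumbo frames for iSCSI, the larger MTU must be set end to end: on the physical switch ports, on the vSwitch, and on the VMkernel interface that carries iSCSI traffic. As a sketch, with vSwitch1 and vmk1 standing in for your own iSCSI vSwitch and VMkernel interface:
# Raise the MTU on the standard vSwitch that carries iSCSI traffic
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
# Raise the MTU on the iSCSI VMkernel interface
esxcli network ip interface set --interface-name=vmk1 --mtu=9000
# Verify the new MTU values
esxcli network ip interface list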
Other performance fixes
- Disable any unnecessary affinities you have, such as DRS rules or CPU affinities.
- If your service’s sleep/wake rate is high, set Latency Sensitivity to High.
- If a workload is latency sensitive, you might want to use CPU reservation to improve performance.
- Total memory usage on your physical host has a huge impact on performance. If memory use is above 80%, VMware starts reclaiming memory from your VMs. Any value above 94% significantly degrades performance.
If all else fails, you might need to add more physical resources by expanding your Puppet infrastructure.