My Puppet Enterprise infrastructure is performing poorly. I get gateway timeouts in the console, nodes go hours without catalogs due to timeouts, and overall CPU usage is extremely high.
These issues might be caused by frequent garbage collection. You can identify and fix the issue by gathering performance metrics and reducing memory pressure in PE services.
Error messages and logs
Frequent garbage collection almost always causes one or more Puppet services to use a large amount of CPU on the primary server. The primary server becomes CPU bound, slowing down all services.
The issue is most frequently caused by pe-puppetserver but might be caused by any Java-based PE service process: Puppet Server, PE Console, Orchestration services, or PuppetDB.
The issue can be difficult to diagnose since:
There are no obvious errors in the logs other than timeouts due to extreme load.
Issues are temporarily resolved by restarting services, but a few hours after the restart, performance degrades.
Performance metrics show degradation for all services (rather than just one service).
Version and installation information
PE version: All supported versions
Diagnose the issue
To diagnose the issue, upload the garbage collection (GC) log for the most affected service to GCeasy's Universal GC Log Analyzer.
Note: We cannot troubleshoot third-party software. Before uploading any information, it is your responsibility to ensure that you are comfortable with the contents of the log file being given to a third party.
For example, if the pe-puppetserver service is using the most CPU, upload the Puppet Server GC log file. That log file has a name similar to /var/log/puppetlabs/puppetserver/puppetserver_gc.log, since the name changes as files are rotated.
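Because the file name changes with rotation, one way to choose the file to upload is to pick the most recently modified file whose name starts with the base name. The following is a minimal Python sketch of that idea. It assumes that rotated files keep the puppetserver_gc.log prefix shown above. The function name newest_gc_log is hypothetical and is not part of PE.

```python
from pathlib import Path


def newest_gc_log(log_dir, prefix="puppetserver_gc.log"):
    """Return the most recently modified file in log_dir whose name starts
    with prefix, or None if no such file exists.

    Assumption: rotated GC logs share the same prefix as the base name.
    """
    candidates = [p for p in Path(log_dir).glob(prefix + "*") if p.is_file()]
    if not candidates:
        return None
    # The file with the latest modification time is the one being written now.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, newest_gc_log("/var/log/puppetlabs/puppetserver") returns the path of the current Puppet Server GC log if one is present. Review the file's contents before uploading it, as described in the note above.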
When analysis is complete, check the following metrics for signs of frequent garbage collection:
Throughput: The percentage of time that Java is performing useful work. If the throughput is below 95%, garbage collection is happening too frequently. If you have intermittent issues, gather metrics when performance is affected or gather metrics for long enough that you can ensure the issue occurs. Otherwise, it will be difficult to determine whether GC is the cause of your performance issues.
Interactive Graphs - GC Duration: This graph shows the duration of garbage collection runs. Young GC runs (represented by teal squares in the graph) are the fastest form of garbage collection, where only the newest objects in memory are scanned. Full GC runs (represented by red triangles) scan all memory. It's normal for full GC to take longer and occur less often than young GC runs. However, if full GC runs are running close together, there's an issue. It's usually obvious when looking at this graph whether the performance of the primary server is affected by garbage collection.
(Figure: GC Duration graph from an affected primary server.)
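As a rough cross-check of both metrics, the arithmetic can be sketched in Python. This sketch does not parse GC logs. It assumes you already have the GC pause durations and the full GC timestamps, in seconds, for example from the analyzer's report. Throughput is approximated as the share of elapsed time not spent in GC pauses, which might differ slightly from the analyzer's own calculation. The function names and the min_gap_seconds threshold are hypothetical, since this article does not define how close is too close.

```python
def gc_throughput(pause_seconds, elapsed_seconds):
    """Approximate throughput: percentage of elapsed time not spent in GC pauses."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return 100.0 * (1.0 - sum(pause_seconds) / elapsed_seconds)


def close_full_gcs(full_gc_times, min_gap_seconds):
    """Return pairs of consecutive full GC timestamps that are closer together
    than min_gap_seconds (a threshold you choose; it is an assumption)."""
    ordered = sorted(full_gc_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier < min_gap_seconds
    ]


def gc_too_frequent(pause_seconds, elapsed_seconds, threshold_percent=95.0):
    """Apply the 95% guideline from this article to the approximate throughput."""
    return gc_throughput(pause_seconds, elapsed_seconds) < threshold_percent
```

For instance, 6 seconds of total pause time over a 100-second window gives a throughput of about 94%, which falls below the 95% guideline.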
Resolve the issue
To resolve this issue, reduce memory pressure on the most affected service.
Use one or more of the following methods:
Increase the heap size for the service. This is usually the best method, since memory is inexpensive and the change doesn't take long to implement. Find instructions to increase heap size in our documentation.
The other methods listed below are likely to take a significant amount of time to implement and might have other disadvantages.
Reduce the amount of Puppet code.
Reduce the number of Puppet environments.
Reduce the maximum number of JRuby instances allowed on the Puppet Server. Reducing the number of JRuby instances lowers the primary server's maximum catalog compilation bandwidth. If the instances are already close to fully utilized, the reduction might cause other issues. Instructions to tune JRuby instances are in our documentation.
For other services:
Memory usage for other services is difficult to control. Increasing heap for those services is the best and sometimes only option. Instructions to increase heap size are in our documentation.