After you’ve added hundreds of nodes to your deployment, you might notice that your agents run slowly or time out. When hundreds of nodes check in simultaneously to request a catalog, the resulting thundering herd of processes can degrade CPU and memory performance. Use the steps in this article to determine whether you have a thundering herd.
PE version: All supported versions
OS: Any *nix
Installation type: Any
Solution
To check whether you have a thundering herd condition, run a query on the PuppetDB node to show how many nodes check in per minute. In extra-large architectures, PuppetDB is on an external PostgreSQL node. In all other architectures, it’s on the primary server.
- Log into the PuppetDB node.
- Open the PostgreSQL command line interface as the pe-postgres user by running:
  sudo su - pe-postgres -s /bin/bash -c "/opt/puppetlabs/server/bin/psql -d pe-puppetdb"
- Find out how many nodes are checking in per minute for the past 7 days by running the following query:
  select date_part('month', start_time) as month,
         date_part('day', start_time) as day,
         date_part('hour', start_time) as hour,
         date_part('minute', start_time) as minute,
         count(*)
  from reports
  where start_time between now() - interval '7 days' and now()
  GROUP BY date_part('month', start_time), date_part('day', start_time), date_part('hour', start_time), date_part('minute', start_time)
  ORDER BY date_part('month', start_time) DESC, date_part('day', start_time) DESC, date_part('hour', start_time) DESC, date_part('minute', start_time) DESC;
Your output will look similar to the following:
 month | day | hour | minute | count
-------+-----+------+--------+-------
    10 |  11 |    8 |     11 |   150
    10 |  11 |    8 |     10 |   140
    10 |  11 |    8 |      9 |   152
    10 |  11 |    8 |      8 |   155
    10 |  11 |    8 |      7 |   150
    10 |  11 |    8 |      6 |   149
    10 |  11 |    8 |      5 |   120
    10 |  11 |    8 |      4 |   155
    10 |  11 |    8 |      3 |   160
    10 |  11 |    8 |      2 |   152
    10 |  11 |    8 |      1 |   147
    10 |  11 |    8 |      0 |   151
- Check the results to see if you have a pattern of many nodes checking in simultaneously during some minutes and few nodes checking in at other times (a filtered variant of the query appears after these steps). For example, the following output shows a spike at minute 8:
 month | day | hour | minute | count
-------+-----+------+--------+-------
    10 |  11 |    8 |     11 |     2
    10 |  11 |    8 |     10 |     9
    10 |  11 |    8 |      9 |   115
    10 |  11 |    8 |      8 |   858
    10 |  11 |    8 |      7 |    33
    10 |  11 |    8 |      6 |    80
    10 |  11 |    8 |      5 |   182
    10 |  11 |    8 |      4 |   155
    10 |  11 |    8 |      3 |    92
    10 |  11 |    8 |      2 |    29
    10 |  11 |    8 |      1 |    24
    10 |  11 |    8 |      0 |    21
- Exit the PostgreSQL command line by typing \q.
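If the output spans many days and is hard to scan, one optional variation (not part of the original steps) is to add a HAVING clause so that only minutes with an unusually high number of check-ins are listed. This is a sketch of the same query; the threshold of 300 is an arbitrary example and should sit well above your normal per-minute check-in rate:

select date_part('month', start_time) as month,
       date_part('day', start_time) as day,
       date_part('hour', start_time) as hour,
       date_part('minute', start_time) as minute,
       count(*)
from reports
where start_time between now() - interval '7 days' and now()
GROUP BY date_part('month', start_time), date_part('day', start_time), date_part('hour', start_time), date_part('minute', start_time)
HAVING count(*) > 300  -- arbitrary example threshold
ORDER BY date_part('month', start_time) DESC, date_part('day', start_time) DESC, date_part('hour', start_time) DESC, date_part('minute', start_time) DESC;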
If you have a thundering herd, use the following articles to stop it and prevent it from happening again. Use one or more of these approaches to prevent a thundering herd in the long term:
- Prevent a thundering herd: Use max-queued-requests
- Prevent a thundering herd: Run Puppet with cron or the reidmv-puppet_run_scheduler module
- Spread out agent catalog requests using splay (see the example after this list)
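As a concrete illustration of the splay option, here is a minimal sketch that uses the puppet config set command on an agent node. The splay and splaylimit settings are standard agent settings; the 30m value and the cron minutes are only examples and should be tuned to your run interval:

# Make the agent wait a random period, up to splaylimit, before each scheduled run
puppet config set splay true --section agent
puppet config set splaylimit 30m --section agent

# Alternative (cron): run the agent from cron at node-specific minutes instead of as a
# daemon, so runs do not all start at once. The minutes below are placeholders; the
# reidmv-puppet_run_scheduler module can manage entries like this for you.
#   17,47 * * * * /opt/puppetlabs/bin/puppet agent --onetime --no-daemonize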
Confirm that any changes you make are effective by running the query in these steps again.
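For a quick re-check, one option is to save the query from the steps above to a file (the path /tmp/checkins_per_minute.sql below is a hypothetical example) and run it non-interactively with psql's -f option:

sudo su - pe-postgres -s /bin/bash -c "/opt/puppetlabs/server/bin/psql -d pe-puppetdb -f /tmp/checkins_per_minute.sql"

Make sure the file is readable by the pe-postgres user.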