After you’ve added hundreds of nodes to your deployment, your agents might run slowly or time out. When hundreds of nodes check in simultaneously to request a catalog, the resulting thundering herd of processes can cause CPU and memory performance to suffer. Use the steps in this article to determine whether you have a thundering herd.
PE version: All supported versions
OS: Any *nix
Installation type: All supported
Solution
To check if you have a thundering herd, run a query on the PuppetDB node to show how many nodes check in per minute. In extra large architectures, PuppetDB is on an external PostgreSQL node. For all other architectures, it’s on the primary server.
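If you are not sure which node hosts the pe-puppetdb database, one quick check (a sketch, not part of the original steps) is to list the local PostgreSQL databases on the node you suspect; pe-puppetdb appears in the list on the correct node:
# Sketch only: list the databases served by the local PE PostgreSQL instance
runuser -u pe-postgres -- /opt/puppetlabs/server/bin/psql -l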
- Log into the PuppetDB node.
- Open the PostgreSQL command line interface as the pe-postgres user by running:
sudo su - pe-postgres -s /bin/bash -c "/opt/puppetlabs/server/bin/psql -d pe-puppetdb"
- Find out how many nodes are checking in per minute for the past 7 days by running one of the following queries:
  - If you are using PE 2021.6 or later, use this query, which also includes compiler information. Because the time range is filled in by shell date substitution, run it from a root shell on the PuppetDB node rather than from the psql prompt; the results are also written to /tmp/thundering_herd_by_compiler.txt:
cat <<EOF | runuser -u pe-postgres -- /opt/puppetlabs/server/bin/psql -d pe-puppetdb -f- | tee /tmp/thundering_herd_by_compiler.txt
SET statement_timeout = 600000;
SELECT producers.name AS compiler,
       date_bin(INTERVAL '1 minute', producer_timestamp, (now() - '7 days'::INTERVAL)::timestamptz) AS date_bucket,
       count(*)
FROM reports
INNER JOIN producers ON reports.producer_id = producers.id
WHERE producer_timestamp >= '$(date +"%Y-%m-%dT00:00:00%z" -d "7 days ago")'::timestamptz
  AND producer_timestamp < '$(date +"%Y-%m-%dT00:00:00%z")'::timestamptz
GROUP BY compiler, date_bucket
ORDER BY compiler, date_bucket;
EOF
  - If you are using a version of PE earlier than 2021.6, run this query at the psql prompt:
SELECT date_part('month', start_time) AS month,
       date_part('day', start_time) AS day,
       date_part('hour', start_time) AS hour,
       date_part('minute', start_time) AS minute,
       count(*)
FROM reports
WHERE start_time BETWEEN now() - interval '7 days' AND now()
GROUP BY date_part('month', start_time), date_part('day', start_time), date_part('hour', start_time), date_part('minute', start_time)
ORDER BY date_part('month', start_time) DESC, date_part('day', start_time) DESC, date_part('hour', start_time) DESC, date_part('minute', start_time) DESC;
- Your output shows the number of check-ins counted for each minute. For example, output from the pre-2021.6 query looks similar to the following (the PE 2021.6 and later query returns a compiler column and a single date_bucket timestamp instead of the separate date parts):
 month | day | hour | minute | count
-------+-----+------+--------+-------
    10 |  11 |    8 |     11 |   150
    10 |  11 |    8 |     10 |   140
    10 |  11 |    8 |      9 |   152
    10 |  11 |    8 |      8 |   155
    10 |  11 |    8 |      7 |   150
    10 |  11 |    8 |      6 |   149
    10 |  11 |    8 |      5 |   120
    10 |  11 |    8 |      4 |   155
    10 |  11 |    8 |      3 |   160
    10 |  11 |    8 |      2 |   152
    10 |  11 |    8 |      1 |   147
    10 |  11 |    8 |      0 |   151
- Check the results to see if you have a pattern of many nodes checking in simultaneously during some minutes, and few nodes checking in at other times. For example, in the following output, 858 nodes check in during a single minute while nearby minutes see far fewer. (A variant query that ranks the busiest minutes follows these steps.)
 month | day | hour | minute | count
-------+-----+------+--------+-------
    10 |  11 |    8 |     11 |     2
    10 |  11 |    8 |     10 |     9
    10 |  11 |    8 |      9 |   115
    10 |  11 |    8 |      8 |   858
    10 |  11 |    8 |      7 |    33
    10 |  11 |    8 |      6 |    80
    10 |  11 |    8 |      5 |   182
    10 |  11 |    8 |      4 |   155
    10 |  11 |    8 |      3 |    92
    10 |  11 |    8 |      2 |    29
    10 |  11 |    8 |      1 |    24
    10 |  11 |    8 |      0 |    21
- Exit the PostgreSQL command line by typing \q.
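If the per-minute table is long, a variant of the pre-2021.6 query can rank the minutes by check-in count so the busiest minutes appear first. This is a sketch, not part of the article's steps; run it from the same psql prompt used above:
-- Sketch only: show the 20 busiest minutes of the past 7 days
SET statement_timeout = 600000;  -- allow up to 10 minutes, as in the command above
SELECT date_trunc('minute', start_time) AS minute, count(*)
FROM reports
WHERE start_time BETWEEN now() - interval '7 days' AND now()
GROUP BY date_trunc('minute', start_time)
ORDER BY count(*) DESC
LIMIT 20;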
If you have a thundering herd, use one or more of the following to stop it and prevent it from happening again in the long term:
- Prevent a thundering herd: Run Puppet with cron or the reidmv-puppet_run_scheduler module (a minimal cron sketch follows this list)
- Spread out agent catalog requests using splay (a minimal splay configuration sketch follows this list)
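For the cron approach, the linked article and the reidmv-puppet_run_scheduler module are the supported paths. As a rough sketch only (not taken from either), a cron resource that uses fqdn_rand gives each node a stable but different pair of run minutes, so runs are spread across the hour instead of arriving together:
# Sketch only: schedule two agent runs per hour from cron at a per-node offset.
# fqdn_rand(30) returns a stable pseudo-random integer (0-29) per node.
cron { 'puppet-agent-run':
  ensure  => present,
  user    => 'root',
  command => '/opt/puppetlabs/bin/puppet agent --onetime --no-daemonize',
  minute  => [fqdn_rand(30), fqdn_rand(30) + 30],
}
If you schedule runs with cron, you would typically also stop and disable the puppet agent service so runs are not doubled up; see the linked article for the full procedure.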
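For the splay option, the linked article has the full procedure. As a minimal sketch, the two agent settings involved are splay and splaylimit, which can be set with puppet config set on each agent node (the 10-minute splaylimit here is only an example value):
# Sketch only: make each agent wait a random delay (up to splaylimit)
# before requesting its catalog, so check-ins are spread out.
/opt/puppetlabs/bin/puppet config set splay true --section agent
/opt/puppetlabs/bin/puppet config set splaylimit 10m --section agent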
Confirm that any changes you make are effective by running the query in these steps again.