[openstack-ansible] strange execution delays
Hello,

I don't think this is an OSA issue per se, but it's happening on infrastructure deployed with OSA.

Using both the Rocky and Stein releases of OSA on Ubuntu 18.04, we're running into a strange issue where Ansible run times gradually get longer and longer. Digging into the issue more, we find that the delay happens when Ansible executes a command within a container and the call to "su" triggers it. Over time, the delay keeps increasing. The same behavior is seen if we SSH into containers.

The only workaround we've been able to identify is to stop and start the containers.

We have not been able to identify a root cause, nor any consistency in which containers this happens on.

Has anyone run into this?

Thanks,
Joe
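A minimal way to reproduce and time the delay described above, assuming the containers are driven with the classic lxc-* tools that OSA deploys; the container name is just a placeholder taken from the logs later in this thread:

  # From the LXC host: time a single "su" inside the affected container.
  # Repeat over a few hours to see whether the delay is actually growing.
  time lxc-attach -n infra1-repo-container-a0f194b3 -- su root -c true

  # Compare with a command that skips PAM/session setup entirely;
  # if this stays fast while "su" gets slower, the delay is in session setup, not the container itself.
  time lxc-attach -n infra1-repo-container-a0f194b3 -- true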
Hi Joe,

On Mon, Dec 30, 2019 at 12:00 AM Joe Topjian <joe@topjian.net> wrote:
Hello,
I don't think this is an OSA issue per se, but it's happening on infrastructure deployed with OSA.

Using both the Rocky and Stein releases of OSA on Ubuntu 18.04, we're running into a strange issue where Ansible run times gradually get longer and longer. Digging into the issue more, we find that the delay happens when Ansible executes a command within a container and the call to "su" triggers it. Over time, the delay keeps increasing. The same behavior is seen if we SSH into containers.

The only workaround we've been able to identify is to stop and start the containers.

We have not been able to identify a root cause, nor any consistency in which containers this happens on.
Has anyone run into this?
No, this is pretty strange. Do you have any PAM modules that might be hitting some sort of external API for auditing purposes that may be throttling you? How is systemd-logind feeling? Anything odd in your system logs?
Thanks, Joe
Hi Mohammed,

Do you have any PAM modules that might be hitting some sort of external API for auditing purposes that may be throttling you?
Not unless OSA would have configured something. The deployment is *very* standard, heavily leveraging default values.

DNS in each container is configured to use the LXC host for resolution. The host is using the systemd-based resolver, but is pointing to a local, dedicated upstream resolver. I want to pin the problem there, but we've run into this issue in two different locations, one of which has an upstream DNS resolver that I'm confident does not throttle requests. But, hey, it's DNS - maybe it's still the cause.
How is systemd-logind feeling? Anything odd in your system logs?
Yes. We have a feeling it's *something* with systemd, but aren't exactly sure what. Affected containers' logs end up with a lot of the following entries:

Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: Successful su for root by root
Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: + ??? root:root
Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: pam_unix(su:session): session opened for user root by (uid=0)
Dec 3 20:30:27 infra1-repo-container-a0f194b3 dbus-daemon[47]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Dec 3 20:30:42 infra1-repo-container-a0f194b3 su[4170]: pam_systemd(su:session): Failed to create session: Connection timed out
Dec 3 20:30:43 infra1-repo-container-a0f194b3 su[4170]: pam_unix(su:session): session closed for user root

But we aren't sure if those timeouts are a symptom or the cause.

Thanks for your help!
Joe
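A few checks that might help narrow down where the time goes inside an affected container; this is just a sketch, run from inside the container (e.g. via lxc-attach):

  # Rule out the DNS speculation above: name resolution should return quickly.
  time getent hosts archive.ubuntu.com

  # Is logind itself responsive, and how many sessions is it tracking?
  systemctl status systemd-logind dbus
  loginctl list-sessions

  # Does a direct D-Bus call to logind time out the same way the PAM session setup does?
  busctl call org.freedesktop.login1 /org/freedesktop/login1 org.freedesktop.DBus.Peer Ping

  # Recent logind/dbus messages around the slow "su" calls.
  journalctl -u systemd-logind -u dbus --since "1 hour ago" --no-pager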
Hi Joe,

Those timeouts are almost certainly the reason behind this issue. I'd suggest restarting systemd-logind and seeing how that fares:

systemctl restart systemd-logind

If the issue persists or happens again I'm not sure what else to suggest, but those timeouts are definitely part of the issue here.

Thanks,
Mohammed

On Mon, Dec 30, 2019 at 2:51 PM Joe Topjian <joe@topjian.net> wrote:
Hi Mohammed,
Do you have any PAM modules that might be hitting some sort of external API for auditing purposes that may be throttling you?
Not unless OSA would have configured something. The deployment is *very* standard, heavily leveraging default values.
DNS in each container is configured to use the LXC host for resolution. The host is using the systemd-based resolver, but is pointing to a local, dedicated upstream resolver. I want to pin the problem there, but we've run into this issue in two different locations, one of which has an upstream DNS resolver that I'm confident does not throttle requests. But, hey, it's DNS - maybe it's still the cause.
How is systemd-logind feeling? Anything odd in your system logs?
Yes. We have a feeling it's *something* with systemd, but aren't exactly sure what. Affected containers' logs end up with a lot of the following entries:
Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: Successful su for root by root
Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: + ??? root:root
Dec 3 20:30:17 infra1-repo-container-a0f194b3 su[4170]: pam_unix(su:session): session opened for user root by (uid=0)
Dec 3 20:30:27 infra1-repo-container-a0f194b3 dbus-daemon[47]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Dec 3 20:30:42 infra1-repo-container-a0f194b3 su[4170]: pam_systemd(su:session): Failed to create session: Connection timed out
Dec 3 20:30:43 infra1-repo-container-a0f194b3 su[4170]: pam_unix(su:session): session closed for user root
But we aren't sure if those timeouts are a symptom or the cause.
Thanks for your help!
Joe
--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. https://vexxhost.com
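As a quick follow-up check on the suggestion above: a sketch for confirming whether the logind restart actually clears the delay, run inside an affected container:

  systemctl restart systemd-logind
  # If session setup is healthy again, this should complete almost instantly
  # rather than stalling for the 25-second dbus timeout seen in the logs.
  time su root -c true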
Hi Mohammed,

Restarting systemd-logind would sometimes hang indefinitely, which is why we've defaulted to just a hard stop/start of the container. The problem then slowly begins to creep up again.

If you haven't seen this behavior, then that's still helpful. We'll scour the environment trying to find *something* that might be causing it.

Thanks,
Joe

On Thu, Jan 2, 2020 at 1:03 PM Mohammed Naser <mnaser@vexxhost.com> wrote:
Hi Joe,
Those timeouts are almost certainly the reason behind this issue. I'd suggest restarting systemd-logind and seeing how that fares:
systemctl restart systemd-logind
If the issue persists or happens again I'm not sure what else to suggest, but those timeouts are definitely part of the issue here.
Thanks, Mohammed
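Since restarting systemd-logind can itself hang when logind is wedged, a hedged sketch of the fallback Joe describes, using the classic LXC tooling and a placeholder container name:

  # Inside the container: give the restart a bounded amount of time instead of letting it hang.
  timeout 30 systemctl restart systemd-logind || echo "logind restart hung"

  # From the LXC host: hard stop/start of the affected container (the workaround described above).
  lxc-stop -n infra1-repo-container-a0f194b3
  lxc-start -n infra1-repo-container-a0f194b3 -d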
I'd suggest looking at the dbus logs too, which can include some interesting things, but yeah, this is certainly the source of your issues, so I would dig around dbus/systemd-logind.

Good luck and keep us updated!

On Thu, Jan 2, 2020 at 3:54 PM Joe Topjian <joe@topjian.net> wrote:
Hi Mohammed,

Restarting systemd-logind would sometimes hang indefinitely, which is why we've defaulted to just a hard stop/start of the container. The problem then slowly begins to creep up again.
If you haven't seen this behavior, then that's still helpful. We'll scour the environment trying to find *something* that might be causing it.
Thanks, Joe
--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. https://vexxhost.com
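Following up on the suggestion to dig around dbus: a small sketch of ways to watch the system bus inside an affected container, assuming the stock systemd/dbus tooling on Ubuntu 18.04:

  # Stream dbus-daemon's own log messages while reproducing a slow "su".
  journalctl -u dbus -f

  # Watch live traffic on the system bus; look for calls to org.freedesktop.systemd1
  # or org.freedesktop.login1 that never receive a reply.
  dbus-monitor --system

  # busctl from systemd can also monitor bus traffic.
  busctl monitor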
participants (2)
- Joe Topjian
- Mohammed Naser