Message-ID: <87msc6dmbz.fsf@li-0ccc18cc-2c67-11b2-a85c-a193851e4c5d.ibm.com>
Date: Thu, 24 Apr 2025 09:56:32 +0200
From: Alexander Egorenkov <egorenar@...ux.ibm.com>
To: Doug Smythies <dsmythies@...us.net>, tip-bot2@...utronix.de
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        mingo@...nel.org, peterz@...radead.org, x86@...nel.org,
        Doug Smythies
 <dsmythies@...us.net>
Subject: Re: ll"RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity
 placement bug causing scheduling lag

Hi all,

> That is a very very stressful test. It crashes within a few seconds on my test computer,
> with a " Segmentation fault (core dumped)" message.

Yes, this is an artificial test I came up with to demonstrate the
problem we have with another, realistic test which I can hardly
use here for the sake of demonstration. But it reveals exactly the
same problem we have with our CI tests on s390x test systems.

Let me briefly explain how it happens.

Basically, we have a test system on which we execute a test suite.
Simultaneously, another system monitors it via simple SSH logins
(invoked approximately every 15 seconds) to check whether the test
system is still online, and automatically triggers a dump if it remains
unresponsive for 5 minutes straight. We limit every such SSH login to
10 seconds because we had situations where SSH sometimes hung for a
long time due to various problems with networking, the test system
itself, etc.; the limit just makes our monitoring robust.
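
For illustration, the monitoring described above could be sketched
roughly as follows in Python. This is not the actual CI tooling: the
names ssh_probe, monitor and trigger_dump are hypothetical, the ssh
command line is an assumption, and only the 15-second interval, the
10-second login limit and the 5-minute threshold come from the
description.

```python
import subprocess
import time

PROBE_INTERVAL_S = 15        # one SSH login roughly every 15 seconds
LOGIN_TIMEOUT_S = 10         # each SSH login is limited to 10 seconds
UNRESPONSIVE_LIMIT_S = 300   # dump after 5 minutes without a successful login


def ssh_probe(host, timeout=LOGIN_TIMEOUT_S):
    """Return True if a trivial SSH command completes within the timeout.

    The command line is an assumption; the real monitoring may differ.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if the login takes too long
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def monitor(host, probe=ssh_probe, trigger_dump=print,
            clock=time.monotonic, sleep=time.sleep):
    """Probe the host until it has been unresponsive for the limit, then dump.

    trigger_dump is a hypothetical hook standing in for whatever
    actually initiates the dump of the test system.
    """
    last_success = clock()
    while True:
        if probe(host):
            last_success = clock()
        elif clock() - last_success >= UNRESPONSIVE_LIMIT_S:
            trigger_dump(host)
            return
        sleep(PROBE_INTERVAL_S)
```

The probe, clock and sleep functions are passed in as parameters only
so that the loop can be exercised without a real host or real waiting.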

And since the commit "sched/fair: Fix EEVDF entity placement bug causing
scheduling lag" we regularly see SSH logins (limited to 10s) failing for
5 minutes straight; not a single SSH login succeeds. This happens regularly
with test suites which compile software with GCC and use all CPUs
at 100%. Before the commit, an SSH login took under 1 second.
I cannot judge whether the problem is really in this commit or is just an
accumulated effect of multiple commits.

FYI:
One such system where it happens regularly has 7 cores (5.2 GHz, SMT 2x, 14 CPUs)
and 8G of main memory with 20G of swap.

Thanks
Regards
Alex
