linux-kernel - Re: ll"RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag

Message-ID: <87msc6dmbz.fsf@li-0ccc18cc-2c67-11b2-a85c-a193851e4c5d.ibm.com>
Date: Thu, 24 Apr 2025 09:56:32 +0200
From: Alexander Egorenkov <egorenar@...ux.ibm.com>
To: Doug Smythies <dsmythies@...us.net>, tip-bot2@...utronix.de
Cc: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
        mingo@...nel.org, peterz@...radead.org, x86@...nel.org,
        Doug Smythies
 <dsmythies@...us.net>
Subject: Re: ll"RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity
 placement bug causing scheduling lag

Hi all,

> That is a very very stressful test. It crashes within a few seconds on my test computer,
> with a " Segmentation fault (core dumped)" message.

Yes, this is an artificial test I came up with to demonstrate the
problem. We hit it with a more realistic test, which I can hardly
use here for the sake of demonstration, but the artificial test reveals
the exact same problem we see in our CI tests on s390x test systems.

Let me briefly explain how it happens.

Basically, we have a test system where we execute a test suite. At the
same time, another system monitors it via simple SSH logins (invoked
approximately every 15 seconds) to check whether the test system is
still online, and automatically triggers a dump if it remains
unresponsive for 5 minutes straight. We limit every such SSH login to
10 seconds because we had situations where SSH sometimes hung for a long
time due to various problems with networking, the test system itself,
etc. The limit is there just to make our monitoring robust.
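
A minimal sketch of a monitoring loop like the one described above, for illustration only. It is not the actual CI tooling: the function names (`ssh_probe`, `watch`), the constants, the use of `ssh -o BatchMode=yes <host> true` as the login check, and the `trigger_dump` callback are all assumptions. Only the 15 second interval, the 10 second limit, and the 5 minute threshold come from the description.

```python
import subprocess
import time

PROBE_INTERVAL_S = 15   # roughly one SSH login every 15 seconds
PROBE_TIMEOUT_S = 10    # each login is capped at 10 seconds
DUMP_AFTER_S = 5 * 60   # dump once unresponsive for 5 minutes straight


def ssh_probe(host):
    """Return True if a trivial SSH command finishes within the timeout.

    Hypothetical check: runs `true` on the remote host, non-interactively.
    """
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=PROBE_TIMEOUT_S,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unstartable login counts as a failed probe.
        return False
    return result.returncode == 0


def watch(probe, trigger_dump, clock=time.monotonic, sleep=time.sleep):
    """Probe repeatedly; call trigger_dump once no probe has succeeded
    for DUMP_AFTER_S seconds, then return."""
    last_success = clock()
    while True:
        if probe():
            last_success = clock()
        elif clock() - last_success >= DUMP_AFTER_S:
            trigger_dump()
            return
        sleep(PROBE_INTERVAL_S)
```

With such a loop, a call like `watch(lambda: ssh_probe("testhost"), trigger_dump=...)` (hostname and dump action are placeholders) would fire the dump only when every login in a 5 minute window either fails or exceeds the 10 second limit.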

And since the commit "sched/fair: Fix EEVDF entity placement bug causing
scheduling lag" we regularly see SSH logins (limited to 10s) failing for
5 minutes straight; not a single SSH login succeeds. This happens
regularly with test suites which compile software with GCC and use all
CPUs at 100%. Before the commit, an SSH login took under 1 second.
I cannot judge whether the problem is really in this commit or whether it
is just an accumulated effect of multiple commits.

FYI:
One such system where it happens regularly has 7 cores (5.2 GHz, SMT 2x, 14 CPUs)
and 8G of main memory with 20G of swap.

Thanks
Regards
Alex