Message-ID: <002401dbb6bd$4527ec00$cf77c400$@telus.net>
Date: Sat, 26 Apr 2025 08:09:55 -0700
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Alexander Egorenkov'" <egorenar@...ux.ibm.com>,
<peterz@...radead.org>
Cc: <linux-kernel@...r.kernel.org>,
<mingo@...nel.org>,
<x86@...nel.org>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [tip: sched/urgent] sched/fair: Fix EEVDF entity placement bug causing scheduling lag
Hi Alexander,
Thank you for your reply.
Note that I have adjusted the address list for this email, because I don't know if bots can get emails, and Peter was not on the
"To" line, and might not have noticed this thread.
@Peter : Off-list I will forward you the other emails, in case you missed them. I apologise if you did see them but haven't had time
to get to them or whatever.
Also note that I know nothing about the scheduler and was only on the original email because I had a "Reported-by" tag.
On 2025.04.24 00:57 Alexander Egorenkov wrote:
> Hi all,
[Doug wrote]
>> That is a very very stressful test. It crashes within a few seconds on my test computer,
>> with a " Segmentation fault (core dumped)" message.
>
> Yes, this is an artificial test I came up with to demonstrate the
> problem we have with another, realistic test which I can hardly
> use here for the sake of demonstration. But it reveals the exact
> same problem we have with our CI test on s390x test systems.
>
> Let me explain shortly how it happens.
>
> Basically, we have a test system where we execute a test suite and
> simultaneously monitor this system on another system via simple SSH
> logins (approximately invoked every 15 seconds) whether the test system
> is still online and dump automatically if it remains unresponsive for
> 5m straight. We limit every such SSH login to 10 seconds because
> we had situations where SSH sometimes hung for a long time due to
> various problems with networking, the test system itself etc., just to make
> our monitoring robust.
>
> And since the commit "sched/fair: Fix EEVDF entity placement bug causing
> scheduling lag" we regularly see SSH logins (limited to 10s) failing for
> 5m straight, not a single SSH login succeeds. This happens regularly
> with test suites which compile software with GCC and use all CPUs
> at 100%. Before the commit, an SSH login required under 1 second.
> I cannot judge whether the problem is really in this commit, or is just an
> accumulated effect of multiple ones.
>
> FYI:
> One such system where it happens regularly has 7 cores (5.2 GHz, SMT 2x, 14 CPUs)
> and 8G of main memory with 20G of swap.
>
> Thanks
> Regards
> Alex
Thanks for the explanation.
I have recreated your situation with a workflow that, while it stresses the CPUs,
doesn't make any entries in /var/log/kern.log and /var/log/syslog.
Under the same conditions, I have confirmed that the ssh login lag doesn't occur
with kernel 6.12, but does with kernel 6.13.
My workflow is stuff I have used for many years and wrote myself.
Basically, I create a huge queue of running tasks, with each doing a little work
and then sleeping for a short period. I have 2 methods to achieve similar overall
workflow, and one shows the issue and one does not. I can also create a huge
queue by just increasing the number of "yes" tasks to a ridiculous number, but
that does not show your ssh login lag issue.
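For readers who want a feel for the shape of such a workload, here is a minimal sketch in Python. It is not the tooling used for the results in this email; the function names, task count, iteration count, and sleep time are all made up for illustration. Each task does a small burst of CPU work, sleeps briefly, and repeats.

```python
import multiprocessing as mp
import time


def burst_then_sleep(work_iters: int, sleep_s: float, cycles: int) -> int:
    """Do a little CPU work, sleep for a short period, and repeat.

    Returns the number of completed work/sleep cycles.
    """
    done = 0
    for _ in range(cycles):
        acc = 0
        for i in range(work_iters):  # the "little work"
            acc += i * i
        time.sleep(sleep_s)          # the short sleep
        done += 1
    return done


def spawn(n_tasks: int, work_iters: int, sleep_s: float, cycles: int) -> list:
    """Start n_tasks independent processes, each running burst_then_sleep."""
    procs = [
        mp.Process(target=burst_then_sleep, args=(work_iters, sleep_s, cycles))
        for _ in range(n_tasks)
    ]
    for p in procs:
        p.start()
    return procs


if __name__ == "__main__":
    # Illustrative values only; a real stress run would use far more tasks.
    procs = spawn(n_tasks=4, work_iters=10_000, sleep_s=0.001, cycles=200)
    for p in procs:
        p.join()
```

Scaling n_tasks up well beyond the number of CPUs is what builds the long queue of runnable tasks; whether a Python version of this reproduces the login lag has not been checked.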
Anyway, for the workflow that does show your issue, I had a load average of
about 19,500 (20,000 tasks) and ssh login times ranged from 10 to 38 seconds,
with an average of about 13 seconds. ssh login times using kernel 6.12 were
negligible.
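The monitoring Alexander describes (a login attempt roughly every 15 seconds, each limited to 10 seconds, with a dump after 5 minutes of continuous failure) can be sketched roughly as below, and the same timing helper is one way to collect login times like those above. This is only an illustration under assumptions: the host name is hypothetical, the ssh options shown are one plausible choice, and the dump step is left as a placeholder because the real mechanism is not described in this thread.

```python
import subprocess
import sys
import time


def timed_run(cmd, timeout_s=10.0):
    """Run cmd and return (succeeded, elapsed_seconds).

    A timeout, a non-zero exit status, or a failure to start the
    command all count as a failed attempt.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def should_dump(failing_since, now, limit_s=300.0):
    """True once logins have failed continuously for at least limit_s seconds."""
    return failing_since is not None and now - failing_since >= limit_s


def monitor(host, interval_s=15.0, timeout_s=10.0, limit_s=300.0):
    """Poll host via ssh until it has been unresponsive for limit_s seconds."""
    failing_since = None
    while True:
        ok, elapsed = timed_run(
            ["ssh", "-o", "BatchMode=yes", host, "true"], timeout_s
        )
        now = time.monotonic()
        print(f"login ok={ok} elapsed={elapsed:.1f}s", file=sys.stderr)
        if ok:
            failing_since = None
        elif failing_since is None:
            # Count the failure window from the start of this attempt.
            failing_since = now - elapsed
        if should_dump(failing_since, now, limit_s):
            return  # placeholder: trigger the dump here
        time.sleep(interval_s)
```

Calling monitor("testhost") (a hypothetical name) would run the loop. With login times of 10 seconds or more, every attempt under a 10 second limit fails, which matches the 5 minutes of continuous failures reported.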
... Doug