Message-ID: <004a01dc952b$471c94a0$d555bde0$@telus.net>
Date: Tue, 3 Feb 2026 08:36:41 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'K Prateek Nayak'" <kprateek.nayak@....com>,
"'Peter Zijlstra'" <peterz@...radead.org>
Cc: <mingo@...nel.org>,
<juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>,
<rostedt@...dmis.org>,
<bsegall@...gle.com>,
<mgorman@...e.de>,
<vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>,
<wangtao554@...wei.com>,
<quzicheng@...wei.com>,
<wuyun.abel@...edance.com>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [PATCH 0/4] sched: Various reweight_entity() fixes
Hi All,
On 2026.02.03 04:19 K Prateek Nayak wrote:
> On 2/3/2026 4:41 PM, Peter Zijlstra wrote:
>> On Tue, Feb 03, 2026 at 12:15:56PM +0530, K Prateek Nayak wrote:
>>> On 1/30/2026 3:04 PM, Peter Zijlstra wrote:
>>>> Two issues related to reweight_entity() were raised; poking at all that got me
>>>> these patches.
>>>>
>>>> They're in queue.git/sched/core and I spend most of yesterday staring at traces
>>>> trying to find anything wrong. So far, so good.
>>>>
>>>> Please test.
>>>
>>> I put this on top of tip:sched/urgent + tip:sched/core which contains Ingo's
>>> cleanup of removing the union and at some point in the benchmark run I hit:
>>>
>>> BUG: kernel NULL pointer dereference, address: 0000000000000051
... snip ...
> This trips when I'm running a (very) old version of schbench at commit
> e4aa540 ("Make sure rps isn't zero in auto_rps mode.")
>
> I'm running the following on a 512 CPU server:
>
> #!/bin/bash
>
> DIR=$1
> MESSENGERS=1
> MAX_ITERS=2
> SCHBENCH=./schbench
>
> for i in 1 2 4 8 16 32 64 128 256 512 768 1024;
> do
> THISDIR=$DIR/$i-workers
> if [ ! -d $THISDIR ]
> then
> mkdir -p $THISDIR
> fi
> for j in `seq 0 $MAX_ITERS`
> do
> echo "===== Worker $i : Iter $j ======";
> $SCHBENCH -m $MESSENGERS -t $i |& tee $THISDIR/iter-$j.log;
> sleep 2
> done
> done
>
> Fails when it is running with 768 workers. Standalone runs didn't
> fail - have to run a cumulative runner that runs sched-messaging,
> stream, tbench, netperf, first before running schbench :-(
Further to my email from the other day, where all was good [1],
I have continued to test, in particular the severe overload conditions
from [2].
Under heavy overload my test computer just hangs. My multiple
ssh sessions eventually terminate. I have left it for many hours, but
have to reset it in the end.
The first time there were no log entries at all, at least that I could
find.
The second time:
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000051
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: Oops: 0000 [#1] SMP NOPTI
kernel: CPU: 11 UID: 1000 PID: 3597 Comm: yes Not tainted 6.19.0-rc1-pz #1 PREEMPT(full)
...
The entire relevant part is attached.
Conditions:
Greater than 12,500 X (yes > /dev/null) tasks
But less than 15,000 X (yes > /dev/null) tasks
I have tested up to 20,000 X (yes > /dev/null) tasks
with previous kernels, including mainline 6.19-rc1.
I would not disagree if you say my operating conditions
are ridiculous.
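For anyone wanting to approximate this kind of load, a minimal sketch
follows. It is an assumption about how such a test could be driven, not
the script actually used for the results above; the function names
spawn_yes_tasks and stop_yes_tasks are hypothetical.

```bash
#!/bin/bash
# Hypothetical sketch: start N background "yes > /dev/null" tasks and
# stop them again. Not the original test harness from this report.

PIDS=()

# Start $1 background copies of "yes > /dev/null", recording each PID.
spawn_yes_tasks() {
    local n=$1
    local i
    for ((i = 0; i < n; i++)); do
        yes > /dev/null &
        PIDS+=("$!")
    done
}

# Kill every recorded task, reap them, and clear the PID list.
stop_yes_tasks() {
    local pid
    for pid in "${PIDS[@]}"; do
        kill "$pid" 2>/dev/null || true
    done
    wait 2>/dev/null
    PIDS=()
}
```

Something like "spawn_yes_tasks 12500; sleep 600; stop_yes_tasks" would
then hold the load for ten minutes. At these task counts the per-user
process limit (ulimit -u) and kernel.pid_max may need to be raised first.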
System:
Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz, 6 cores 12 CPUs.
CPU frequency scaling driver: intel_pstate; Governor powersave.
HWP: Enabled
[1] https://lore.kernel.org/lkml/000d01dc939e$0fc99fe0$2f5cdfa0$@telus.net/
[2] https://lore.kernel.org/lkml/002401dbb6bd$4527ec00$cf77c400$@telus.net/
... Doug
Download attachment "kern.log" of type "application/octet-stream" (66512 bytes)