Message-ID: <9470dab6-dee5-4505-95a2-f6782b648726@paulmck-laptop>
Date: Sun, 8 Oct 2023 18:20:53 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Joel Fernandes <joel@...lfernandes.org>
Cc: "Liam R. Howlett" <Liam.Howlett@...cle.com>,
Naresh Kamboju <naresh.kamboju@...aro.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
stable@...r.kernel.org, patches@...ts.linux.dev,
linux-kernel@...r.kernel.org, torvalds@...ux-foundation.org,
akpm@...ux-foundation.org, linux@...ck-us.net, shuah@...nel.org,
patches@...nelci.org, lkft-triage@...ts.linaro.org, pavel@...x.de,
jonathanh@...dia.com, f.fainelli@...il.com,
sudipm.mukherjee@...il.com, srw@...dewatkins.net, rwarsow@....de,
conor@...nel.org, Chengming Zhou <zhouchengming@...edance.com>,
Peter Zijlstra <peterz@...radead.org>,
Ovidiu Panait <ovidiu.panait@...driver.com>,
Ingo Molnar <mingo@...nel.org>, rcu <rcu@...r.kernel.org>
Subject: Re: [PATCH 5.15 000/183] 5.15.134-rc1 review

On Sat, Oct 07, 2023 at 09:22:55PM -0400, Joel Fernandes wrote:
> On Fri, Oct 6, 2023 at 2:20 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> >
> > On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney <paulmck@...nel.org> [231006 12:47]:
> > > > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > > > * Naresh Kamboju <naresh.kamboju@...aro.org> [231005 13:49]:
> > > > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > > > <gregkh@...uxfoundation.org> wrote:
> > > > > > >
> > > > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > > > There are 183 patches in this series, all will be posted as a response
> > > > > > > to this one. If anyone has any issues with these being applied, please
> > > > > > > let me know.
> > > > > > >
> > > > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > > > Anything received after that time might be too late.
> > > > > > >
> > > > > > > The whole patch series can be found in one patch at:
> > > > > > > https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > > > or in the git tree and branch at:
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > and the diffstat can be found below.
> > > > > > >
> > > > > > > thanks,
> > > > > > >
> > > > > > > greg k-h
> > > > > >
> > > > > > Results from Linaro’s test farm.
> > > > > > Regressions on x86.
> > > > > >
> > > > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > > > with selftest merge config built kernel.
> > > > > >
> > > > > > Reported-by: Linux Kernel Functional Testing <lkft@...aro.org>
> > > > > >
> > > > > > > Has anyone else noticed this kernel warning?
> > > > > >
> > > > > > This is always reproducible while booting x86 with a given config.
> > > > >
> > > > > > From that config:
> > > > > #
> > > > > # RCU Subsystem
> > > > > #
> > > > > CONFIG_TREE_RCU=y
> > > > > # CONFIG_RCU_EXPERT is not set
> > > > > CONFIG_SRCU=y
> > > > > CONFIG_TREE_SRCU=y
> > > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > > CONFIG_TASKS_RUDE_RCU=y
> > > > > CONFIG_TASKS_TRACE_RCU=y
> > > > > CONFIG_RCU_STALL_COMMON=y
> > > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > > # end of RCU Subsystem
> > > > >
> > > > > #
> > > > > # RCU Debugging
> > > > > #
> > > > > CONFIG_PROVE_RCU=y
> > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > > CONFIG_RCU_TRACE=y
> > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > # end of RCU Debugging
> > > > >
> > > > >
> > > > > >
> > > > > > x86 boot log:
> > > > > > -----
> > > > > > [ 0.000000] Linux version 5.15.134-rc1 (tuxmake@...make)
> > > > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > > > for Debian) 2.40) #1 SMP @1696443178
> > > > > > ...
> > > > > > [ 1.480701] ------------[ cut here ]------------
> > > > > > [ 1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > > > trc_inspect_reader+0x80/0xb0
> > > > > > [ 1.481296] Modules linked in:
> > > > > > [ 1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > > > [ 1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > > 2.5 11/26/2020
> > > > > > [ 1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > > >
> > > > > This function has changed a lot, including the dropping of this
> > > > > WARN_ON_ONCE(). The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > > >
> > > > > Obviously the behaviour I changed was the test for the task being idle.
> > > > > I am not sure how best to short-circuit that test from happening during
> > > > > boot as I am not familiar with the RCU code.
> > > >
> > > > The usual test for RCU's notion of early boot being completed is
> > > > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > > >
> > > > Except that "ofl" should always be false that early in boot, at least
> > > > in mainline.
> > >
> > > Is this still true in the final version of the patch where we set the
> > > boot task as !idle until just before the early boot is finished? I
> > > wouldn't think of this as 'early in boot' anymore as much as the entire
> > > kernel setup. Maybe we need to shorten the time we stay in !idle mode
> > > for earlier kernels?
> >
> > In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
> > during boot, the boot CPU is guaranteed to be online. (As opposed to
> > the boot CPU's idle-task state.)
> >
> > > How frequently is this function called? We could check something for
> > > early boot... or track down where the cpu is put online and restore idle
> > > before that happens?
> >
> > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > that grace period. Its performance is an issue, but not to anywhere
> > near the same extent as (say) rcu_read_lock_trace().
> >
> > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > v6.5).
> > > >
> > > > Lots of latent bugs, to be sure, even with rcutorture. :-/
> > >
> > > The Right Thing is to fix the bug all the way back to the introduction,
> > > but what fallout makes the backport less desirable than living with the
> > > unexposed bug?
> >
> > You are quite right that it is possible for the risk of a backport to
> > exceed the risk of the original bug.
> >
> > I defer to Joel (CCed) on how best to resolve this in -stable.
>
> Maybe I am missing something but this issue should also be happening
> in mainline right?
>
> Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> for recently offlined CPUs"), the warning should still be happening
> due to Liam's "kernel/sched: Modify initial boot task idle setup"
> because the warning is just rearranged a bit but essentially the same.
>
> IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> fix it in mainline (using the ideas described in this thread), then
> backport both that new fix and Liam's patch to 5.15.
>
> Or is there a reason this warning does not show up on the mainline?
>
> My impression is that dropping Liam's patch for the stable release and
> revisiting it later is a better approach since tiny RCU is used way
> less in the wild than tree/tasks RCU. Thoughts?

I think that this one is strange enough that we need to write down the
situation in detail, make sure we have all the corner cases covered in
both mainline and -stable, and decide what to do from there.

Yes, I know, this email thread contains much of this information, but
a little organizing of it would be good.

Would you like to put that together, or should I? If me, I will get
a draft out by the end of this coming Tuesday, Pacific Time.

							Thanx, Paul
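
As a starting point for writing the situation down, here is a small
userspace C model of the two checks mentioned in this thread: the
early-boot test (rcu_scheduler_active != RCU_SCHEDULER_INIT) and ofl
taken as cpu_is_offline(cpu). It is a sketch, not kernel code. Every
model_* name and MODEL_NR_CPUS is hypothetical, and the warning
predicate (ofl && !idle) is an assumption inferred from the discussion
of the "ofl" variable and the idle test, not copied from
kernel/rcu/tasks.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Boot phases, named after the kernel's rcu_scheduler_active states. */
enum model_rcu_phase {
	RCU_SCHEDULER_INACTIVE,
	RCU_SCHEDULER_INIT,
	RCU_SCHEDULER_RUNNING,
};

#define MODEL_NR_CPUS 4	/* hypothetical, for illustration only */

/* Hypothetical stand-in for the task fields the check cares about. */
struct model_task {
	int cpu;	/* CPU the task is (nominally) running on */
	bool is_idle;	/* whether the task is currently marked idle */
};

/* Hypothetical stand-in for global system state. */
struct model_system {
	enum model_rcu_phase rcu_scheduler_active;
	bool cpu_online[MODEL_NR_CPUS];
};

/* Models cpu_is_offline(cpu). */
static bool model_cpu_is_offline(const struct model_system *sys, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return !sys->cpu_online[cpu];
}

/* The early-boot-completed test quoted in the thread. */
static bool model_early_boot_done(const struct model_system *sys)
{
	return sys->rcu_scheduler_active != RCU_SCHEDULER_INIT;
}

/*
 * ASSUMPTION: the warning fires for a task on an offline CPU that is
 * not marked idle. The real condition may differ.
 */
static bool model_would_warn(const struct model_system *sys,
			     const struct model_task *t)
{
	bool ofl = model_cpu_is_offline(sys, t->cpu);

	return ofl && !t->is_idle;
}
```

In this model a non-idle task on the online boot CPU never warns, while
a task that is not marked idle on a not-yet-online CPU does. Whether
that second case is what happens in 5.15 with the boot-task idle change
applied is exactly the kind of corner case that needs to be confirmed
against the real code.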