[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <dba1c62b-add6-45fc-9fc1-cc8755fb7a14@roeck-us.net>
Date: Mon, 5 Aug 2024 18:16:28 -0700
From: Guenter Roeck <linux@...ck-us.net>
To: Thomas Gleixner <tglx@...utronix.de>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>, stable@...r.kernel.org
Cc: patches@...ts.linux.dev, linux-kernel@...r.kernel.org,
torvalds@...ux-foundation.org, akpm@...ux-foundation.org, shuah@...nel.org,
patches@...nelci.org, lkft-triage@...ts.linaro.org, pavel@...x.de,
jonathanh@...dia.com, f.fainelli@...il.com, sudipm.mukherjee@...il.com,
srw@...dewatkins.net, rwarsow@....de, conor@...nel.org,
allen.lkml@...il.com, broonie@...nel.org,
"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
Helge Deller <deller@....de>, Parisc List <linux-parisc@...r.kernel.org>
Subject: Re: [PATCH 6.10 000/809] 6.10.3-rc3 review
On 8/5/24 14:49, Thomas Gleixner wrote:
> On Mon, Aug 05 2024 at 08:02, Guenter Roeck wrote:
>> On 8/5/24 05:51, Thomas Gleixner wrote:
>>> IRQF_COND_ONESHOT has only an effect when
>>>
>>> 1) Interrupt is shared
>>> 2) First interrupt request has IRQF_ONESHOT set
>>>
>>> Neither #1 nor #2 are true, but maybe your current config enables some moar
>>> devices than the one on your website.
>>>
>>
>> No, it is pretty much the same, except for a more recent C compiler, and it
>> requires qemu v9.0. See http://server.roeck-us.net/qemu/parisc64-6.10.3/.
>>
>> Debugging shows pretty much the same for me, and any log message added
>> to request_irq() makes the problem go away (or be different), and if the problem
>> is seen it doesn't even get to the third interrupt request. I copied a more complete
>> log to bad.log.gz in above page.
>
> At the point where the problem happens is way before the first interrupt
> is requested, so that definitely excludes any side effect of
> COND_ONESHOT.
>
> It happens right after
>
> [ 0.000000] Memory: 991812K/1048576K available (16752K kernel code, 5220K rwdata, 3044K rodata, 760K init, 1424K bss, 56764K reserved, 0K cma-reserved)
> SNIP
> [ 0.000000] ** This system shows unhashed kernel memory addresses **
> SNIP
>
> I.e. the big fat warning about unhashed kernel addresses)
>
> In the good case the first interrupt is requested here:
>
> [ 0.000000] Memory: 992804K/1048576K available (16532K kernel code, 5096K rwdata, 2984K rodata, 744K init, 1412K bss, 55772K reserved, 0K cma-reserved)
> SNIP
> [ 0.000000] ** This system shows unhashed kernel memory addresses **
> SNIP
> [ 0.000000] SLUB: HWalign=16, Order=0-3, MinObjects=0, CPUs=4, Nodes=1
> [ 0.000000] ODEBUG: selftest passed
> [ 0.000000] rcu: Hierarchical RCU implementation.
> [ 0.000000] rcu: RCU event tracing is enabled.
> [ 0.000000] rcu: RCU callback double-/use-after-free debug is enabled.
> [ 0.000000] rcu: RCU debug extended QS entry/exit.
> [ 0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
>
> [ 0.000000] NR_IRQS: 128 <- This is where the interrupt
> subsystem is initialized
> [ 0.000000] genirq: 64: 00215600 <- This is probably the timer interrupt
>
> Looking at the backtrace the fail happens from within start_kernel(),
> i.e. mm_core_init():
>
> slab_err+0x13c/0x190
> check_slab+0xf4/0x140
> alloc_debug_processing+0x58/0x248
> ___slab_alloc+0x22c/0xa38
> __slab_alloc.isra.0+0x60/0x88
> kmem_cache_alloc_node_noprof+0x148/0x3c0
> __kmem_cache_create+0x5d4/0x680
> create_boot_cache+0xc4/0x128
> new_kmalloc_cache+0x1ac/0x1d8
> create_kmalloc_caches+0xac/0x108
> kmem_cache_init+0x1cc/0x340
> mm_core_init+0x458/0x560
> start_kernel+0x9ac/0x11e0
> start_parisc+0x188/0x1b0
>
> The interrupt subsystem is initialized way later and request_irq() only
> works after that point.
>
> I'm 100% sure by now that this commit has absolutely nothing to do with
> the underlying problem. All it does is changing the code size and
> placement of text, data and ....
>
> So I finally managed to reproduce with gcc-13.3.0 from the k.org cross
> tools (the debian 12.2.2 does not).
>
> If I add:
>
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1513,6 +1513,8 @@ static int
> unsigned long flags, thread_mask = 0;
> int ret, nested, shared = 0;
>
> + pr_info("%u: %08x\n", irq, new->flags);
> +
> if (!desc)
> return -EINVAL;
>
> it still reproduces. If I change that to:
>
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -1513,6 +1513,8 @@ static int
> unsigned long flags, thread_mask = 0;
> int ret, nested, shared = 0;
>
> + new->flags &= ~IRQF_COND_ONESHOT;
> +
> if (!desc)
> return -EINVAL;
>
> that does neither cure it (unsurprisingly).
>
> Reverting the "offending" commit cures it.
>
> So I tried your
>
> + pr_info_once("####### First timer interrupt\n");
>
> which cures it too.
>
> So now where is the difference between the printk() in __setup_irq(),
> the new->flag mangling, the revert and the printk() in timer interrupt?
>
> There is ZERO functional change. There is no race either because at that
> point everything is single threaded and interrupts are disabled.
>
> The only thing which changes is the code and data layout as I can
> observe when looking at System.map of the builds. I stared at the diffs
> for a bit, but nothing stood out.
>
> Maybe the PARISC people can shed some light on it.
>
> This reminds me of the x86 32-bit disaster we debugged a few days ago...
>
Looks like it :-(
Thanks for checking !
Guenter
Powered by blists - more mailing lists