linux-kernel - Re: sched/fair: Kernel panics in pick_next

Message-ID: <Zvr2bLBEYyu1gtNz@linux.ibm.com>
Date: Tue, 1 Oct 2024 00:35:16 +0530
From: Vishal Chourasia <vishalc@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        luis.machado@....com
Subject: Re: sched/fair: Kernel panics in pick_next_entity

On Mon, Sep 30, 2024 at 04:41:57PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote:
> > I've noticed a kernel panic consistently occurring on the mainline v6.11
> > kernel (see attached dmesg log below). 
> > 
> > The panic occurs almost every time I build the Linux kernel from source.
> > 
> > Steps to Reproduce:
> > 
> > make clean
> > ./scripts/config -e LOCALVERSION_AUTO
> > ./scripts/config --set-str LOCALVERSION -master-with-print
> > make localmodconfig
> > make -j8 -s vmlinux modules
> > 
> > From my investigation, it seems that the function pick_eevdf() can return NULL.
> > Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> > introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> > it can lead to a null pointer dereference. 
> 
> Even before that commit we relied on that thing not being NULL, notably
> f12e1488^1 has:
> 
>                 se = pick_next_entity(cfs_rq);
>                 cfs_rq = group_cfs_rq(se);
> 
> Which will similarly explode when pick_eevdf() goes wobbly.
> 
> > To determine why pick_eevdf() would return NULL, I added a few printk statements.
> > Based on one of the printk logs in the shared dmesg log, it appears that if
> > pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> > are no eligible entities on that 'cfs_rq', it will return NULL. 
> 
> Right, that is not a valid state. Which seems to suggest something went
> sideways with the eligibility thing -- as Luis suggested.
> 
> > I have not been able to think of a quick reproducer to trigger a panic
> > for this case. Hoping if someone can guide me on this.
> > 
> > Note: The following dmesg log also contains a warning reported too. Panic
> > happens later.
> > 
> > ------------[ cut here ]------------
> > !se->on_rq
> > WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> > Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> > CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> > Tainted: [W]=WARN
> > Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> > NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> > REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> > MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> > CFAR: c000000000156a10 IRQMASK: 1
> > GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> > GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> > GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> > GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> > GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> > GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> > GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> > GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> > NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> > LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> > Call Trace:
> > [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> > [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> > [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> > [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> > [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> > [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> > [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> > [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> > [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> > [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> > [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> > [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> > [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> > [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> > [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> > [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> > [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec
> 
> So that is a 'fun' one, I don't remember seeing that before. It says
> we're trying to dequeue a task that is not on the runqueue.
> 
> The big new thing this merge window -- I'm assuming v6.11 is good -- is
> DEQUEUE_DELAYED. Does this error go away if you flip that in
> kernel/sched/features.h ?

Yes. With the diff below applied, I saw no warnings or kernel panics
while running the workload.

# git diff
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..38bf8df813d1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -46,7 +46,7 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  *
  * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
  */
-SCHED_FEAT(DELAY_DEQUEUE, true)
+SCHED_FEAT(DELAY_DEQUEUE, false)
 SCHED_FEAT(DELAY_ZERO, true)

 /*
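
For reference, the NULL-return condition described earlier in the thread
(no 'cfs_rq->curr' and no eligible entity on the 'cfs_rq') can be
illustrated with a small userspace toy model. This is only a sketch and
not the kernel's code: the toy_* types and functions, the flat array
standing in for the rbtree, and the precomputed 'avg_vruntime' field are
all assumptions made for illustration. Eligibility is reduced to
"vruntime <= average", so a bogus average plays the role of the
eligibility accounting going sideways.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_se {
	int64_t vruntime;
	int64_t deadline;
	int on_rq;
};

/* Hypothetical, simplified stand-in for a cfs_rq. */
struct toy_cfs_rq {
	struct toy_se *curr;      /* may be NULL */
	struct toy_se **queued;   /* flat array instead of an rbtree */
	size_t nr_queued;
	int64_t avg_vruntime;     /* precomputed here; assumed, not derived */
};

/* Toy eligibility test: an entity is eligible if it is not ahead of the average. */
static int toy_eligible(const struct toy_cfs_rq *rq, const struct toy_se *se)
{
	return se->vruntime <= rq->avg_vruntime;
}

/*
 * Toy pick: among eligible entities (including curr, if it is set and
 * still queued), return the one with the earliest deadline.
 * Returns NULL when nothing is eligible.
 */
static struct toy_se *toy_pick(const struct toy_cfs_rq *rq)
{
	struct toy_se *best = NULL;
	size_t i;

	if (rq->curr && rq->curr->on_rq && toy_eligible(rq, rq->curr))
		best = rq->curr;

	for (i = 0; i < rq->nr_queued; i++) {
		struct toy_se *se = rq->queued[i];

		if (!toy_eligible(rq, se))
			continue;
		if (!best || se->deadline < best->deadline)
			best = se;
	}
	return best;
}

/* Returns 1 if the toy model behaves as described in the text. */
static int toy_self_check(void)
{
	struct toy_se a = { .vruntime = 100, .deadline = 130, .on_rq = 1 };
	struct toy_se b = { .vruntime = 120, .deadline = 125, .on_rq = 1 };
	struct toy_se *queued[] = { &a, &b };
	struct toy_cfs_rq rq = {
		.curr = NULL, .queued = queued, .nr_queued = 2,
		.avg_vruntime = 110,
	};

	/* Consistent average: only 'a' is eligible, so it is picked. */
	if (toy_pick(&rq) != &a)
		return 0;

	/* Inconsistent average below every vruntime: nothing is eligible. */
	rq.avg_vruntime = 50;
	return toy_pick(&rq) == NULL;
}
```

In the model, a caller that dereferences the result of toy_pick()
without a NULL check would fault in the second case, which is analogous
to the crash reported in pick_next_entity(). With a consistent average
at least one queued entity is always at or below it, which matches the
remark that a NULL return is not a valid state.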