linux-kernel - Re: sched/fair: Kernel panics in pick_next

Message-ID: <20240930144157.GH5594@noisy.programming.kicks-ass.net>
Date: Mon, 30 Sep 2024 16:41:57 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Vishal Chourasia <vishalc@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>, luis.machado@....com
Subject: Re: sched/fair: Kernel panics in pick_next_entity

On Thu, Sep 26, 2024 at 06:12:19PM +0530, Vishal Chourasia wrote:
> I've noticed a kernel panic consistently occurring on the mainline v6.11
> kernel (see attached dmesg log below). 
> 
> The panic occurs almost every time I build the Linux kernel from source.
> 
> Steps to Reproduce:
> 
> make clean
> ./scripts/config -e LOCALVERSION_AUTO
> ./scripts/config --set-str LOCALVERSION -master-with-print
> make localmodconfig
> make -j8 -s vmlinux modules
> 
> From my investigation, it seems that the function pick_eevdf() can return NULL.
> Commit f12e1488 ("sched/fair: Prepare pick_next_task() for delayed dequeue") 
> introduces an access on the return value of pick_eevdf(). If 'se' was NULL, 
> it can lead to a null pointer dereference. 

Even before that commit we relied on that thing not being NULL, notably
f12e1488^1 has:

                se = pick_next_entity(cfs_rq);
                cfs_rq = group_cfs_rq(se);

Which will similarly explode when pick_eevdf() goes wobbly.
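
To make the failure mode concrete, here is a small userspace sketch of that hierarchy walk. It is not kernel code: `toy_entity`, `toy_cfs_rq`, `toy_pick_next_entity()` and `toy_pick_task()` are hypothetical stand-ins, and the NULL check exists only so the toy can run to completion; it is not a proposed fix.

```c
#include <stddef.h>

struct toy_cfs_rq;

/* Hypothetical stand-in for a sched_entity. */
struct toy_entity {
	struct toy_cfs_rq *my_q;	/* non-NULL for a group entity, NULL for a task */
};

/* Hypothetical stand-in for a cfs_rq holding at most one entity. */
struct toy_cfs_rq {
	struct toy_entity *only;	/* what the toy picker returns; may be NULL */
};

/* Stand-in for pick_next_entity(): may return NULL, as the report describes. */
static struct toy_entity *toy_pick_next_entity(struct toy_cfs_rq *cfs_rq)
{
	return cfs_rq->only;
}

/* Walk down the group hierarchy until a task-level entity is reached. */
static struct toy_entity *toy_pick_task(struct toy_cfs_rq *cfs_rq)
{
	struct toy_entity *se;

	do {
		se = toy_pick_next_entity(cfs_rq);
		if (!se)
			return NULL;	/* without this check, se->my_q below dereferences NULL */
		cfs_rq = se->my_q;	/* toy equivalent of group_cfs_rq(se) */
	} while (cfs_rq);

	return se;
}
```

Any level of the hierarchy that fails to pick is enough to hit the dereference, which is why the older code was just as exposed.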

> To determine why pick_eevdf() would return NULL, I added a few printk statements.
> Based on one of the printk logs in the shared dmesg log, it appears that if
> pick_eevdf() is called for a 'cfs_rq' whose 'cfs_rq->curr' is NULL and there
> are no eligible entities on that 'cfs_rq', it will return NULL. 

Right, that is not a valid state. Which seems to suggest something went
sideways with the eligibility thing -- as Luis suggested.
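
As a rough illustration of why this state should not occur, the sketch below models the eligibility test in userspace: an entity is eligible when its vruntime is no greater than the load-weighted average vruntime, so on a non-empty queue the entity with the smallest vruntime always passes. All names (`toy_se`, `toy_eligible()`, `toy_pick()`) are made up for the example, and the linear scan is a simplification, not how the kernel walks its tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a sched_entity; only the fields the test needs. */
struct toy_se {
	int64_t vruntime;	/* virtual runtime */
	int64_t deadline;	/* virtual deadline */
	int64_t weight;		/* load weight, assumed > 0 */
};

/*
 * Eligible iff vruntime <= weighted average vruntime. Written without a
 * division, relative to a reference point v0:
 *   sum(w_i * (v_i - v0)) >= (v - v0) * sum(w_i)
 */
static int toy_eligible(const struct toy_se *q, size_t n, int64_t v0,
			const struct toy_se *se)
{
	int64_t avg = 0, load = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		avg += (q[i].vruntime - v0) * q[i].weight;
		load += q[i].weight;
	}
	return avg >= (se->vruntime - v0) * load;
}

/* Earliest virtual deadline among eligible entities; NULL if none is eligible. */
static const struct toy_se *toy_pick(const struct toy_se *q, size_t n, int64_t v0)
{
	const struct toy_se *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!toy_eligible(q, n, v0, &q[i]))
			continue;
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	}
	return best;
}
```

With consistent sums, `toy_pick()` only returns NULL for an empty queue. A NULL pick on a queue that still holds entities therefore points at the accounting behind the average being off, not at the pick itself.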

> I have not been able to think of a quick reproducer to trigger a panic
> for this case. I'm hoping someone can guide me on this.
> 
> Note: The following dmesg log also contains a warning. The panic
> happens later.
> 
> ------------[ cut here ]------------
> !se->on_rq
> WARNING: CPU: 1 PID: 92333 at kernel/sched/fair.c:705 update_entity_lag+0xcc/0xf0
> Modules linked in: binfmt_misc bonding tls rfkill ibmveth pseries_rng vmx_crypto nd_pmem nd_btt dax_pmem loop nfnetlink xfs sd_mod papr_scm libnvdimm ibmvscsi scsi_transport_srp pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse
> CPU: 1 UID: 0 PID: 92333 Comm: genksyms Tainted: G        W          6.11.0-master-with-print-10547-g684a64bf32b6-dirty #64
> Tainted: [W]=WARN
> Hardware name: IBM,9080-HEX POWER10 (architected) hv:phyp pSeries
> NIP:  c0000000001cdfcc LR: c0000000001cdfc8 CTR: 0000000000000000
> REGS: c00000005c62ee50 TRAP: 0700   Tainted: G        W           (6.11.0-master-with-print-10547-g684a64bf32b6-dirty)
> MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 00000005
> CFAR: c000000000156a10 IRQMASK: 1
> GPR00: c0000000001cdfc8 c00000005c62f0f0 c000000001b57400 000000000000000a
> GPR04: 00000000ffff7fff c00000005c62eee0 c00000005c62eed8 00000007fb050000
> GPR08: 0000000000000027 0000000000000000 0000000000000000 c000000002758de0
> GPR12: c000000002a18d88 c0000007fffef480 0000000000000000 0000000000000000
> GPR16: c000000002c56d40 0000000000000000 c00000005c62f5b4 0000000000000000
> GPR20: fffffffffffffdef 0000000000000000 0000000000000002 c000000003cd7300
> GPR24: 0000000000000000 0000000000000008 c0000007fd1d3f80 0000000000000000
> GPR28: 0000000000000001 0000000000000009 c0000007fd1d4080 c0000000656a0000
> NIP [c0000000001cdfcc] update_entity_lag+0xcc/0xf0
> LR [c0000000001cdfc8] update_entity_lag+0xc8/0xf0
> Call Trace:
> [c00000005c62f0f0] [c0000000001cdfc8] update_entity_lag+0xc8/0xf0 (unreliable)
> [c00000005c62f160] [c0000000001cea80] dequeue_entity+0xb0/0x6d0
> [c00000005c62f1f0] [c0000000001cf8b0] dequeue_entities+0x150/0x600
> [c00000005c62f2c0] [c0000000001d02a8] dequeue_task_fair+0x158/0x2e0
> [c00000005c62f300] [c0000000001b5ea4] dequeue_task+0x64/0x200
> [c00000005c62f380] [c0000000001cc950] detach_tasks+0x140/0x420
> [c00000005c62f3f0] [c0000000001d6044] sched_balance_rq+0x214/0x7c0
> [c00000005c62f550] [c0000000001d6830] sched_balance_newidle+0x240/0x630
> [c00000005c62f640] [c0000000001d6d0c] pick_next_task_fair+0x7c/0x4a0
> [c00000005c62f6d0] [c0000000001afc50] __pick_next_task+0x60/0x2d0
> [c00000005c62f730] [c0000000010e8ce8] __schedule+0x198/0x840
> [c00000005c62f810] [c0000000010e93d0] schedule+0x40/0x110
> [c00000005c62f880] [c00000000064c574] pipe_read+0x424/0x6a0
> [c00000005c62f960] [c00000000063a0fc] vfs_read+0x30c/0x3d0
> [c00000005c62fa10] [c00000000063adf4] ksys_read+0x104/0x160
> [c00000005c62fa60] [c000000000031678] system_call_exception+0x138/0x2d0
> [c00000005c62fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

So that is a 'fun' one, I don't remember seeing that before. It says
we're trying to dequeue a task that is not on the runqueue.

The big new thing this merge window -- I'm assuming v6.11 is good -- is
DEQUEUE_DELAYED. Does this error go away if you flip that in
kernel/sched/features.h?