[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87v7ngpa43.ffs@tglx>
Date: Fri, 25 Jul 2025 17:23:56 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Waiman Long <llong@...hat.com>, Pavel Machek <pavel@....cz>
Cc: kernel list <linux-kernel@...r.kernel.org>, mingo@...hat.com,
bp@...en8.de, dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
peterz@...radead.org, will@...nel.org, miriam.rachel.korenblit@...el.com,
linux-wireless@...r.kernel.org, Petr Mladek <pmladek@...e.com>, John
Ogness <jogness@...utronix.de>, Greg Kroah-Hartman
<gregkh@...uxfoundation.org>, Jiri Slaby <jirislaby@...nel.org>, Tejun Heo
<tj@...nel.org>, Simona Vetter <simona@...ll.ch>, Thierry Reding
<thierry.reding@...il.com>
Subject: Re: locking problems in iwlwifi? was Re: 6.16-rcX: crashing way too
often on thinkpad X220
On Thu, Jul 24 2025 at 23:24, Thomas Gleixner wrote:
> On Thu, Jul 24 2025 at 13:55, Waiman Long wrote:
> So lets look at that call chain further
>
> vt_console_print()
> hide_cursor()
> vc->vc_sw->con_cursor(vc, false); --> fbcon_cursor()
>
> fbcon_cursor()
> if (vc->vc_cursor_type & CUR_SW)
> fbcon_del_cursor_work(info)
> cancel_delayed_work_sync(&ops->cursor_work);
>
> Here we are.
It's actually very simple to reproduce in a VM. No wireless, no
suspend/resume, no special hardware required.
Enable framebuffer and framebuffer console. Add console=tty0 to the
kernel command line.
Log into FB console and do
# echo -e '\033[?17;0;64c'
which enables the software cursor, which in turn enables the above
conditional invocation of fbcon_del_cursor_work(). Then force a printk
# echo h >/proc/sysrq-trigger
and watch the show.
[ 1406.245343] BUG: sleeping function called from invalid context at kernel/workqueue.c:4359
[ 1406.245345] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1819, name: bash
[ 1406.245346] preempt_count: 2, expected: 0
[ 1406.245347] RCU nest depth: 1, expected: 0
[ 1406.245348] 7 locks held by bash/1819:
[ 1406.245349] #0: ffff9a730c4ba428 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x73/0xf0
[ 1406.245444] #1: ffffffffa7ade120 (rcu_read_lock){....}-{1:3}, at: __handle_sysrq+0x3d/0x270
[ 1406.245457] #2: ffffffffa7adb500 (console_lock){+.+.}-{0:0}, at: _printk+0x63/0x80
[ 1406.245464] #3: ffffffffa7adb550 (console_srcu){....}-{0:0}, at: console_flush_all+0x3c/0x530
[ 1406.245470] #4: ffffffffa7a6b140 (console_owner){....}-{0:0}, at: console_lock_spinning_enable+0x42/0x70
[ 1406.245475] #5: ffffffffa7a6b0c0 (printk_legacy_map-wait-type-override){....}-{4:4}, at: console_flush_all+0x2fe/0x530
[ 1406.245480] #6: ffffffffa7b8e4d8 (printing_lock){....}-{3:3}, at: vt_console_print+0x5b/0x490
[ 1406.245487] irq event stamp: 92690
[ 1406.245488] hardirqs last enabled at (92689): [<ffffffffa656edef>] console_flush_all+0x49f/0x530
[ 1406.245491] hardirqs last disabled at (92690): [<ffffffffa656edff>] console_flush_all+0x4af/0x530
[ 1406.245493] softirqs last enabled at (91822): [<ffffffffa6455166>] __fpu_restore_sig+0x206/0x6a0
[ 1406.245497] softirqs last disabled at (91820): [<ffffffffa6455088>] __fpu_restore_sig+0x128/0x6a0
[ 1406.245499] Preemption disabled at:
[ 1406.245499] [<ffffffffa6570211>] vprintk_emit+0x1f1/0x430
[ 1406.245505] CPU: 46 UID: 0 PID: 1819 Comm: bash Not tainted 6.16.0-rc7+ #446 PREEMPT(voluntary)
[ 1406.245508] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 1406.245510] Call Trace:
[ 1406.245514] <TASK>
[ 1406.245519] dump_stack_lvl+0x77/0xb0
[ 1406.245526] __might_resched+0x1b2/0x2d0
[ 1406.245539] cancel_delayed_work_sync+0x57/0x90
[ 1406.245548] fbcon_cursor+0xb2/0x140
[ 1406.245558] hide_cursor+0x24/0xb0
[ 1406.245562] vt_console_print+0x471/0x490
[ 1406.245565] ? __pfx_vt_console_print+0x10/0x10
[ 1406.245575] console_flush_all+0x32b/0x530
As I've enabled CONFIG_DEBUG_ATOMIC_SLEEP I get the splat immediately
even when there is no work queued. (I misread that code yesterday with
my tried brain and thought the might_sleep() would be too late).
And clearly this is invoked with interrupts disabled:
[ 1406.245345] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1819, name: bash
Now if there would be work executed at that moment then the lockdep
splat _AND_ the subsequent 'irqs already enabled' splat would
materialize.
I can't be bothered to figure out how to make the work executed because
the above is already a clear prove of the theory :)
It requires that the software cursor is enabled. That's probably rare,
so that explains why this went unnoticed for so long.
Contrary to all the halluzinations about the printk locking and lockdep
correctness, this is a crystal clear bug in the framebuffer console code
and has absolutely nothing to do with the printk() core code.
The workqueue issue got introduced with commit 3b0fb6ab25dd ("fbcon: Use
delayed work for cursor") in 5.19, but that replaced a del_timer_sync(),
which had a similar problem because del_timer_sync() cannot be invoked
in hard interrupt context, but printk() can ...
That del_timer_sync() was conditional up to 4.2, where it became
unconditional in commit 2a17d7e80f1d ("fbcon: unconditionally initialize
cursor blink interval") just to be "fixed" shortly before 4.2-rc1 by
moving the conditional back in commit a5edce421848 ("fbcon: Avoid
deleting a timer in IRQ context").
That conditional did not fix anything as it just pushed the delete into
the rarely executed path, therefore hiding the bug, which existed from
the very beginning when this cursor blinking muck was introduced with
commit acba9cd01974 ("fbcon: cursor blink control") in 2.6.23.
So this software cursor code path was never tested with printk() and
actually the splat which led to commit a5edce421848 ("fbcon: Avoid
deleting a timer in IRQ context") should have made it crystal clear that
this code path is broken:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at ../kernel/time/timer.c:1098 del_timer_sync+0x4c/0x54()
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.0-rc4-next-20150519 #1
Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[] (warn_slowpath_null) from [] (del_timer_sync+0x4c/0x54)
[] (del_timer_sync) from [] (fbcon_del_cursor_timer+0x2c/0x40)
[] (fbcon_del_cursor_timer) from [] (fbcon_cursor+0x9c/0x180)
[] (fbcon_cursor) from [] (hide_cursor+0x30/0x98)
[] (hide_cursor) from [] (vt_console_print+0x2a8/0x340)
[] (vt_console_print) from [] (call_console_drivers.constprop.23+0xc8/0xec)
[] (call_console_drivers.constprop.23) from [] (console_unlock+0x498/0x4f0)
[] (console_unlock) from [] (vprintk_emit+0x1f0/0x508)
[] (vprintk_emit) from [] (vprintk_default+0x24/0x2c)
[] (vprintk_default) from [] (printk+0x70/0x88)
But no, hiding it behind a rarely true conditional without actually
testing it, is way better. Seriously?
The changelog should have been:
This commit papers over the underlying problem by restoring the
condition...
I leave it as an exercise to the framebuffer people how to fix this
trainwreck for real.
Thanks,
tglx
Powered by blists - more mailing lists