[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aXzcG27SxN9q1veS@pathway.suse.cz>
Date: Fri, 30 Jan 2026 17:28:11 +0100
From: Petr Mladek <pmladek@...e.com>
To: ysard <ysard_git@....fr>
Cc: John Ogness <john.ogness@...utronix.de>, linux-kernel@...r.kernel.org,
senozhatsky@...omium.org
Subject: Re: Regression: system freeze on resume from suspend introduced by
printk per-console suspended state
On Fri 2026-01-30 16:56:56, Petr Mladek wrote:
> On Thu 2026-01-29 10:34:53, ysard wrote:
> > Summary
> > =======
> >
> > The patch works only when I *uncomment* the 2 `synchronize_srcu(&console_srcu);` lines!
>
> > Works now!:
> > $ sudo sh -c "
> > mkdir -p /var/run/nvidia-sleep \
> > && echo 2 > /var/run/nvidia-sleep/Xorg.vt_number \
> > && chvt 63 \
> > && echo suspend >/proc/driver/nvidia/suspend \
> > && systemctl suspend"
> >
> > Logs:
> > [ 338.901995] [ T3134] printk: Suspending console(s) (use no_console_suspend to debug)
> > [ 338.901997] [ T3134] printk: console_suspend_all
> > [ 338.932763] [ T2672] printk: console_trylock
> > [ 338.948664] [ T2659] printk: console_trylock
> > [ 338.948685] [ T2671] printk: console_trylock
> > [ 338.950716] [ T272] printk: console_trylock
> > [ 338.950747] [ T270] printk: console_trylock
> > [ 338.982194] [ T3134] printk: console_trylock
> > [ 339.020910] [ T3134] printk: console_trylock
> > [ 339.044613] [ T158] printk: console_trylock
> > [ 339.132842] [ T3134] printk: console_trylock
[...]
> > [ 339.614444] [ T9] printk: console_trylock
> > [ 339.634304] [ T3149] printk: console_trylock
> > [ 339.942244] [ T2673] printk: console_trylock
> > [ 340.123772] [ T3134] printk: console_resume_all
> >
> > Proc comm values:
> > 1127: kworker/3:2-cgroup_offline
> > 158: kworker/5:1-mm_percpu_wq
> > 2659: kworker/u32:17-async
> > 2661: kworker/u32:19-async
> > 2671: kworker/u32:29-async
> > 2672: kworker/u32:30-async
> > 2673: kworker/u32:31-kvfree_rcu_reclaim
> > 270: scsi_eh_1
> > 272: scsi_eh_2
> > 310: kworker/2:2-events
> > 3134: not exists or not readable
> > 3149: kworker/u32:38-flush-253:3
> > 65: kworker/u32:2-async
> > 66: kworker/u32:3-async
> > 67: kworker/u32:4-async
> > 9: kworker/0:0-events
>
> It is hard to know what is going there. I guess that many
> console_trylock() calls are from printk(). But they might also
> be from tty or from the nvidia driver code.
>
> I have tried to create a patch which would print backtraces
> of the callers. The output might be interesting. I am going
> to send it in a separate mail.
Please, find the patch below. It can be applied on top of
the preovius one which reverted the problematic commit,
see https://lore.kernel.org/all/aXoWiJhcOaGGlcmk@pathway.suse.cz/
Important: You still need to explicitely uncomment the
sychronize_srcu() calls.
It would be nice to provide three logs with this patch:
1. console_lock() API called by
echo suspend >/proc/driver/nvidia/suspend
2. console_lock() API called by the suspend test
No freeze (expected):
$ sudo sh -c "
mkdir -p /var/run/nvidia-sleep \
&& echo 2 > /var/run/nvidia-sleep/Xorg.vt_number \
&& chvt 63 \
&& systemctl suspend"
3. No freeze with the 1st patch and uncommented synchronize_srcu() calls.:
$ sudo sh -c "
mkdir -p /var/run/nvidia-sleep \
&& echo 2 > /var/run/nvidia-sleep/Xorg.vt_number \
&& chvt 63 \
&& echo suspend >/proc/driver/nvidia/suspend \
&& systemctl suspend"
Feel free to do your own tests. But please do not spend too much
time on it.
It would be nice to find the culprit. Especially I would like to
know whether the problem is in the core kernel code (printk,
suspend, tty) or in the nvidia driver.
But we might also just need to restore the original behavior
of the console lock API during suspend. I mean the short cuts.
They are a kind of optimization anyway.
But as I said, it would be nice to understand the problem. There
might be a race or infinite loop somewhere and the original
console_lock API behavior just hides the problem.
OK, here is the patch which printed backtraces of console_lock
API callers around the suspend for me.
>From ab9e67fc50425eb6574c016c5809865a085c4b44 Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek@...e.com>
Date: Fri, 30 Jan 2026 16:57:25 +0100
Subject: [PATCH] printk: debug: Allow to print backtraces of console_lock API
callers.
This patch allows to print backtraces of console_lock API callers.
The logging must be explicitely enabled by:
echo 1 >/sys/module/printk/parameters/debug_console_lock
I suggest to increase the internal log buffer size, for example,
use on the command line:
log_buf_len=1M
The logging might easily go wild because the console API is
called by printk() and tty code. I have explicitely disabled
the logging in:
+ vprintk_emit()
+ fb_flashcursor()
, see no_debug_console_lock_enter()/exit() calls. Otherwise,
there was an endless loop of printed entries.
Note that I run the tests in a virtualized test system (QEMU)
via ssh. Some more tty API might need to get disabled when
doing the tests using the tty console. But it might hide
the code called during suspend. So, doing the tests via
ssh sounds better to me.
Also the number of printed backtraces is limited to "100".
It helps to keep the system running when printing the report
triggers more callers of the API and causes an infinite loop.
Signed-off-by: Petr Mladek <pmladek@...e.com>
---
drivers/video/fbdev/core/fbcon.c | 9 +++-
include/linux/printk.h | 2 +
kernel/printk/printk.c | 91 ++++++++++++++++++++++++++++++--
3 files changed, 98 insertions(+), 4 deletions(-)
diff --git a/drivers/video/fbdev/core/fbcon.c b/drivers/video/fbdev/core/fbcon.c
index 7be9e865325d..1d5d7af516ed 100644
--- a/drivers/video/fbdev/core/fbcon.c
+++ b/drivers/video/fbdev/core/fbcon.c
@@ -397,14 +397,19 @@ static void fb_flashcursor(struct work_struct *work)
struct vc_data *vc = NULL;
int c;
bool enable;
+ int debug;
int ret;
+ debug = no_debug_console_lock_enter();
+
/* FIXME: we should sort out the unbind locking instead */
/* instead we just fail to flash the cursor if we can't get
* the lock instead of blocking fbcon deinit */
ret = console_trylock();
- if (ret == 0)
+ if (ret == 0) {
+ no_debug_console_lock_exit(debug);
return;
+ }
/* protected by console_lock */
info = par->info;
@@ -416,6 +421,7 @@ static void fb_flashcursor(struct work_struct *work)
fbcon_info_from_console(vc->vc_num) != info ||
vc->vc_deccm != 1) {
console_unlock();
+ no_debug_console_lock_exit(debug);
return;
}
@@ -425,6 +431,7 @@ static void fb_flashcursor(struct work_struct *work)
get_fg_color(vc, info, c),
get_bg_color(vc, info, c));
console_unlock();
+ no_debug_console_lock_exit(debug);
queue_delayed_work(system_power_efficient_wq, &par->cursor_work,
par->cur_blink_jiffies);
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 45c663124c9b..f266b20b7584 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -201,6 +201,8 @@ void __init setup_log_buf(int early);
__printf(1, 2) void dump_stack_set_arch_desc(const char *fmt, ...);
void dump_stack_print_info(const char *log_lvl);
void show_regs_print_info(const char *log_lvl);
+int no_debug_console_lock_enter(void);
+void no_debug_console_lock_exit(int debug);
extern asmlinkage void dump_stack_lvl(const char *log_lvl) __cold;
extern asmlinkage void dump_stack(void) __cold;
void printk_trigger_flush(void);
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 6e0ad926f409..c54c70d7a074 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -2383,11 +2383,52 @@ MODULE_PARM_DESC(debug_non_panic_cpus,
"allow messages from non-panic CPUs in panic()");
#endif
+int debug_console_lock;
+module_param(debug_console_lock, int, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(debug_console_lock,
+ "Print some caller info when console_lock() API is called.");
+
+static DEFINE_PER_CPU(int, no_debug_console_lock_nesting);
+
+int no_debug_console_lock_enter(void)
+{
+ int debug;
+
+ debug = READ_ONCE(debug_console_lock);
+ if (debug) {
+ if (this_cpu_read(no_debug_console_lock_nesting) > 256) {
+ debug_console_lock = 0;
+ pr_err("Overflow of no_debug_console_lock_nesting! Disabled console_lock debuginng.\n");
+ return 0;
+ }
+
+ migrate_disable();
+ this_cpu_inc(no_debug_console_lock_nesting);
+ }
+
+ return debug;
+}
+
+void no_debug_console_lock_exit(int debug)
+{
+ if (debug) {
+ if (this_cpu_read(no_debug_console_lock_nesting) <= 0) {
+ debug_console_lock = 0;
+ pr_err("Underflow of no_debug_console_lock_nesting! Disabled console lock debuginng.\n");
+ migrate_enable();
+ return;
+ }
+ this_cpu_dec(no_debug_console_lock_nesting);
+ migrate_enable();
+ }
+}
+
asmlinkage int vprintk_emit(int facility, int level,
const struct dev_printk_info *dev_info,
const char *fmt, va_list args)
{
struct console_flush_type ft;
+ int debug_caller;
int printed_len;
/* Suppress unimportant messages after panic happens */
@@ -2404,6 +2445,8 @@ asmlinkage int vprintk_emit(int facility, int level,
!panic_triggering_all_cpu_backtrace)
return 0;
+ debug_caller = no_debug_console_lock_enter();
+
printk_get_console_flush_type(&ft);
/* If called from the scheduler, we can not call up(). */
@@ -2448,6 +2491,8 @@ asmlinkage int vprintk_emit(int facility, int level,
else if (!console_irqwork_blocked)
wake_up_klogd();
+ no_debug_console_lock_exit(debug_caller);
+
return printed_len;
}
EXPORT_SYMBOL(vprintk_emit);
@@ -2856,6 +2901,41 @@ static int console_cpu_notify(unsigned int cpu)
return 0;
}
+static void print_console_lock_caller(const char *func_name)
+{
+ static int reports = 0;
+ char tcomm[64];
+ u8 nested;
+
+ if (!debug_console_lock)
+ return;
+
+ nested = raw_cpu_read(no_debug_console_lock_nesting);
+ if (nested)
+ return;
+
+ if (!func_name) {
+ pr_err_once("%s: called without func_name\n", __func__);
+ return;
+ }
+
+ if (reports++ > 100) {
+ debug_console_lock = 0;
+ pr_warn("Too many printed console lock callers. Disabled the debugging.\n");
+ }
+
+ if (current->flags & PF_WQ_WORKER)
+ wq_worker_comm(tcomm, sizeof(tcomm), current);
+ else if (current->flags & PF_KTHREAD)
+ get_kthread_comm(tcomm, sizeof(tcomm), current);
+ else
+ get_task_comm(tcomm, current);
+
+ pr_info("console lock API call [%d]: %s() by pid=%d, comm=%s\n",
+ reports, func_name, current->pid, tcomm);
+ show_stack(current, NULL, KERN_INFO);
+}
+
/**
* console_lock - block the console subsystem from printing
*
@@ -2868,13 +2948,14 @@ void console_lock(void)
{
might_sleep();
+ print_console_lock_caller(__func__);
+
/* On panic, the console_lock must be left to the panic cpu. */
while (panic_on_other_cpu())
msleep(1000);
down_console_sem();
if (console_suspended) {
- printk_store(KERN_INFO "printk: %s\n", __func__);
/*
* Keep console locked, but do not touch
* @console_locked or @console_may_schedule.
@@ -2897,13 +2978,14 @@ EXPORT_SYMBOL(console_lock);
*/
int console_trylock(void)
{
+ print_console_lock_caller(__func__);
+
/* On panic, the console_lock must be left to the panic cpu. */
if (panic_on_other_cpu())
return 0;
if (down_trylock_console_sem())
return 0;
if (console_suspended) {
- printk_store(KERN_INFO "printk: %s\n", __func__);
/*
* The lock was acquired, but unlock directly and report
* failure. Here console_locked=1 and console_may_schedule=1.
@@ -2919,6 +3001,8 @@ EXPORT_SYMBOL(console_trylock);
int is_console_locked(void)
{
+// /* This might produce too many entries. */
+// print_console_lock_caller(__func__);
return console_locked;
}
EXPORT_SYMBOL(is_console_locked);
@@ -3404,8 +3488,9 @@ void console_unlock(void)
{
struct console_flush_type ft;
+ print_console_lock_caller(__func__);
+
if (console_suspended) {
- printk_store(KERN_INFO "printk: %s\n", __func__);
/*
* Simply unlock directly.
* Here console_locked=1 and console_may_schedule=1.
--
2.52.0
Powered by blists - more mailing lists