linux-kernel - Re: [BUG] printk/nbcon.c: watchdog BUG: softlockup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87o77xa584.fsf@jogness.linutronix.de>
Date: Wed, 19 Jun 2024 07:15:31 +0206
From: John Ogness <john.ogness@...utronix.de>
To: Andrew Halaney <ahalaney@...hat.com>, tglx@...utronix.de
Cc: Derek Barbosa <debarbos@...hat.com>, pmladek@...e.com,
 rostedt@...dmis.org, senozhatsky@...omium.org,
 linux-rt-users@...r.kernel.org, linux-kernel@...r.kernel.org,
 williams@...hat.com, jlelli@...hat.com, lgoncalv@...hat.com,
 jwyatt@...hat.com, aubaker@...hat.com
Subject: Re: [BUG] printk/nbcon.c: watchdog BUG: softlockup - CPU#x stuck
 for 78s

[ Explicitly added tglx, hoping he can chime in here. ]

On 2024-06-18, Andrew Halaney <ahalaney@...hat.com> wrote:
>> Shouldn't the scheduler eventually kick the task off the CPU after
>> its timeslice is up?
>
> I trust you better than myself about this, but this is being
> reproduced with a CONFIG_PREEMPT_DYNAMIC=y +
> CONFIG_PREEMPT_VOLUNTARY=y setup (so essentially the current mode is
> VOLUNTARY). Does that actually work that way for a kthread in that
> mode?

It would be good not to trust me better than yourself. I actually have
very little experience with the non-RT preemption models. I will need to
investigate this further.

> Just in case I did something dumb, here's the module I wrote up:
>
> ahalaney@...en2nano ~/git/linux-rt-devel (git)-[tags/v6.10-rc4-rt6-rebase] % cat kernel/printk/test_thread.c                         :(
> /*
>  * Test making a kthread similar to nbcon's (under load)
>  * to see if it also has issues with migrate_swap()
>  */
> #include "linux/nmi.h"
> #include <asm-generic/delay.h>
> #include <linux/kthread.h>
> #include <linux/module.h>
> #include <linux/sched.h>
>
> DEFINE_STATIC_SRCU(test_srcu);
> static DEFINE_SPINLOCK(test_lock);
> static struct task_struct *kt;
> static bool dont_stop = true;
>
> static int test_thread_func(void *unused) {
> 	unsigned long flags;
>
> 	pr_info("Starting the while true loop\n");
> 	do {
> 		int cookie = srcu_read_lock_nmisafe(&test_srcu);
> 		spin_lock_irqsave(&test_lock, flags);
> 		touch_nmi_watchdog();
> 		udelay(5000);  // print a line to serial
> 		spin_unlock_irqrestore(&test_lock, flags);
> 		srcu_read_unlock_nmisafe(&test_srcu, cookie);
> 	} while (dont_stop);
>
> 	return 0;
> }
>
> static int __init test_thread_init(void) {
>
> 	pr_info("Creating test_thread at -20 nice level\n");
> 	kt = kthread_run(test_thread_func, NULL, "test_thread");
> 	if (IS_ERR(kt)) {
> 		pr_err("Failed to make test_thread\n");
> 		return PTR_ERR(kt);
> 	}
> 	sched_set_normal(kt, -20);
>
> 	return 0;
> }
>
> static void __exit test_thread_exit(void) {
> 	dont_stop = false;
> 	kthread_stop(kt);
> }
>
> module_init(test_thread_init);
> module_exit(test_thread_exit);
> MODULE_LICENSE("GPL");

Thanks for the functional test! This should quite accurately reproduce
the situation when the printing thread is unable to catch up to the
amount of incoming messages.

Some function to explicitly trigger the scheduler may be needed. Such as
adding cond_resched() outside the critical section, before repeating the
loop. We would like to remove such explicit preemption points from the
kernel code, but perhaps it is necessary for the VOLUNTARY preemption
scheme.

John