linux-kernel - Re: Query about timer wheel API


Message-ID: <801e569c-4f86-4bb5-a255-b861b86cb773@oracle.com>
Date: Tue, 24 Dec 2024 01:20:48 +1100
From: imran.f.khan@...cle.com
To: Hillf Danton <hdanton@...a.com>
Cc: Thomas Gleixner <tglx@...utronix.de>, Tejun Heo <tj@...nel.org>,
        john.stultz@...aro.org, sboyd@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: Query about timer wheel API

Hello Hillf,
On 23/12/2024 11:51 pm, Hillf Danton wrote:
> On Mon, 23 Dec 2024 11:14:21 +1100 imran.f.khan@...cle.com
>>
>> Recently we have come across some bugs in the RDS code, where a delayed
>> work was being queued on an offlined CPU and as a result of that the
> 
> Such a queue could not happen given irq disabled in queue_delayed_work_on().
> Did you see it upstream?
> 
Do you mean upstream RDS or upstream workqueue? For RDS I need to check, but
with an upstream v6.6 kernel I was able to submit a delayed work to an offlined
CPU. The delayed work never runs, and I can see the corresponding timer in the
timer list of the offlined CPU (using crash).
Once the CPU is brought back online, the work handler gets executed, depending
on the workload.

I used the following test module:

===============

#include <linux/module.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/cpumask.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>
#include <linux/completion.h>
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/jiffies.h>

#define TIMEOUT 1 /* test timeout in secs */
#define NUM_WORK_ITEMS 1 /* number of work items to submit */


static DEFINE_MUTEX(mutex);

static DEFINE_MUTEX(dwork_func_mutex);

static void delayed_work_func(struct work_struct *data)
{
	int cpu;

	mutex_lock(&dwork_func_mutex);
	cpu = get_cpu();
	pr_err("%s invoked for work: 0x%px on cpu#%d\n", __func__, data, cpu);
	put_cpu();
	mutex_unlock(&dwork_func_mutex);

	/* Allocated in param_set_queue_work_on_cpu(); free it once it has run. */
	kfree(to_delayed_work(data));
}

static int param_set_queue_work_on_cpu(const char *val, const struct kernel_param *kp)
{
	unsigned int cpu;
	int i, ret;
	struct delayed_work *dwork;

	/* Allow only one writer at a time. */
	if (!mutex_trylock(&mutex))
		return -EBUSY;

	ret = kstrtouint(val, 0, &cpu);
	if (ret)
		goto out;

	/*
	 * Reject out-of-range CPU numbers only. There is deliberately no
	 * cpu_online() check: the point of the test is to queue on an
	 * offlined CPU.
	 */
	if (cpu >= nr_cpu_ids) {
		ret = -EINVAL;
		goto out;
	}

	for (i = 0; i < NUM_WORK_ITEMS; i++) {
		dwork = kzalloc(sizeof(*dwork), GFP_KERNEL);
		if (!dwork) {
			ret = -ENOMEM;
			break;
		}
		INIT_DELAYED_WORK(dwork, delayed_work_func);
		queue_delayed_work_on(cpu, system_wq, dwork, msecs_to_jiffies(10000));
		pr_err("Submitted dwork 0x%px on %s cpu#%u\n", dwork,
		       cpu_online(cpu) ? "online" : "offline", cpu);
	}
out:
	mutex_unlock(&mutex);
	return ret;
}

module_param_call(queue_work_on_cpu, param_set_queue_work_on_cpu, NULL, NULL, 0600);

static int __init workqueue_study_init(void)
{
	pr_err("module_init\n");

	return 0;
}

static void __exit workqueue_study_exit(void)
{
	pr_err("module_exit\n");
}

MODULE_AUTHOR("Imran Khan <imran.eie.85@...il.com>");
MODULE_DESCRIPTION("Workqueue study");
MODULE_LICENSE("GPL");

module_init(workqueue_study_init);
module_exit(workqueue_study_exit);

===========

This module provides an interface at:

/sys/module/<module name>/parameters/queue_work_on_cpu

Writing X there submits a delayed_work (with a 10 second delay)
to CPU X.

If CPU X is online, the submitted work gets executed after around
10 seconds. But if CPU X is offline, the submitted work handler
does not fire until the CPU has been brought back online.
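
For anyone who wants to script the reproduction, the steps above could be
driven from userspace roughly as follows. This is only a sketch, not part of
the test module: the helper names (write_str, cpu_online_path, param_path,
queue_on_offlined_cpu) are made up for illustration, the module name is
whatever the module was built as, and it assumes the standard CPU hotplug
file /sys/devices/system/cpu/cpuN/online and the module's parameter file
under /sys/module/<module name>/parameters/. It needs root, and some CPUs
(often CPU 0) may not be offlinable.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Write a short string to a sysfs-style file; returns 0 or a negative errno. */
int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -errno;
	if (fputs(val, f) == EOF)
		ret = -EIO;
	/* sysfs write errors may only show up when the buffer is flushed */
	if (fclose(f) == EOF && !ret)
		ret = -errno;
	return ret;
}

/* Build the hotplug control path for a CPU. */
int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/* Build the path of the test module's parameter (module name is caller-supplied). */
int param_path(char *buf, size_t len, const char *module)
{
	int n = snprintf(buf, len,
			 "/sys/module/%s/parameters/queue_work_on_cpu", module);

	return (n < 0 || (size_t)n >= len) ? -ENAMETOOLONG : 0;
}

/* Offline the CPU, then ask the module to queue a delayed work on it. */
int queue_on_offlined_cpu(const char *module, unsigned int cpu)
{
	char path[256], num[16];
	int ret;

	ret = cpu_online_path(path, sizeof(path), cpu);
	if (ret)
		return ret;
	ret = write_str(path, "0");
	if (ret)
		return ret;

	ret = param_path(path, sizeof(path), module);
	if (ret)
		return ret;
	snprintf(num, sizeof(num), "%u", cpu);
	return write_str(path, num);
}
```

Calling queue_on_offlined_cpu() with the module name and a CPU number, then
watching dmesg, should show the "Submitted dwork" line straight away, with
the handler's message appearing only after the CPU is onlined again.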

Thanks,
Imran
>> underlying timer was not firing, which in turn meant that the work was
>> never able to make it to the intended worker_pool.