linux-kernel - Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20071024133818.GA82@tv-sign.ru>
Date:	Wed, 24 Oct 2007 17:38:18 +0400
From:	Oleg Nesterov <oleg@...sign.ru>
To:	Gautham R Shenoy <ego@...ibm.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, Rusty Russel <rusty@...tcorp.com.au>,
	Srivatsa Vaddagiri <vatsa@...ibm.com>,
	Dipankar Sarma <dipankar@...ibm.com>,
	Ingo Molnar <mingo@...e.hu>
Subject: Re: [RFC PATCH 4/5] Remove CPU_DEAD/CPU_UP_CANCELLED handling from workqueue.c

On 10/24, Gautham R Shenoy wrote:
>

(reordered)

> With get_online_cpus()/put_online_cpus(), we can eliminate
> the workqueue_mutex and reintroduce the workqueue_lock,
> which is a spinlock which serializes the accesses to the 
> workqueues list.

This change is obviously good, can't it go into the previous patch?

Because,

> Solution is not to cleanup the worker thread. Instead let it remain
> even after the cpu goes offline. Since no one can queue any work
> on an offlined cpu, this thread will be forever sleeping, untill
> someone onlines the cpu.

I still think this patch is questionable. Please look at my previous
response http://marc.info/?l=linux-kernel&m=119262203729543

In short: with this patch it is not possible to guarantee that work->fun()
will run on the correct CPU.

>  static void cleanup_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
>  {
>  	/*
> -	 * Our caller is either destroy_workqueue() or CPU_DEAD,
> -	 * workqueue_mutex protects cwq->thread
> +	 * Our caller is destroy_workqueue().  So warn on a double
> +	 * destroy.
>  	 */
> -	if (cwq->thread == NULL)
> +	if (cwq->thread == NULL) {
> +		WARN_ON(1);

Looks wrong. It is possible that cwq->thread == NULL, because currently we
never "shrink" cpu_populated_map.

> cleanup_workqueue_thread() in the CPU_DEAD and CPU_UP_CANCELLED path
> will cause a deadlock if the worker thread is executing a work item
> which is blocked on get_online_cpus(). This will lead to a irrecoverable
> hang.

Yes. But there is nothing new. Currently, work->func() can't share the locks
with cpu_down's patch. Not only only it can't take workqueue_mutex, it can't
take any other lock which could be taken by notifier callbacks, etc.

Can't we ignore this problem, at least for now? I believe we need intrusive
changes to solve this problem correctly. Perhaps I am wrong, of course, but
I don't see a simple solution.

Another option. Note that get_online_cpus() does more than just pinning
cpu maps, actually it blocks hotplug entirely. Now let's look at
schedule_on_each_cpu(), for example. It doesn't need to block hotplug,
it only needs a stable cpu_online_map.

Suppose for a moment that _cpu_down() does cpu_hotplug_done() earlier,
right after __cpu_die(cpu) which removes CPU from the map (yes, this
is wrong, I know). Now, we don't need to change workqueue_cpu_callback(),
work->func() can use get_online_cpus() without fear of deadlock.

So, can't we introduce 2 nested rw locks? The first one blocks cpu hotplug
(like get_online_cpus does currently), the second one just pins cpu maps.
I think most users needs only this, not more.

What do you think?

(Gautham, I apologize in advance, can't be responsive till weekend).

Oleg.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/