lists.openwall.net | Open Source and information security mailing list archives
Date: Fri, 6 Mar 2015 09:09:58 -0800
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc: Huang Ying <ying.huang@...el.com>, Lai Jiangshan <laijs@...fujitsu.com>,
	Lai Jiangshan <eag0628@...il.com>, Peter Zijlstra <peterz@...radead.org>,
	LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...nel.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: Possible lock-less list race in scheduler_ipi()

On Thu, Mar 05, 2015 at 11:48:51PM +0000, Mathieu Desnoyers wrote:
> Hi,
>
> I recently reviewed the scheduler_ipi() code by
> coincidence, and noticed that the use of llist.h
> in there seemed a bit odd. So I'd like to have
> more eyes to help me look into this.
>
> I've been told that there have been issues with
> IPIs lately, so here is one scenario I think
> might be possible:
>
> - scheduler_ipi()
>   - sched_ttwu_pending()
>     - llist_del_all()
>       (grabs the wake_list with xchg())
>     - raw_spin_lock_irqsave(&rq->lock, flags);
>     - iteration on the llist (owned locally by the interrupt handler),
>       for each item:
>         p = llist_entry(llist, struct task_struct, wake_entry);
>         llist = llist_next(llist);
>         ttwu_do_activate(rq, p, 0);
>     - raw_spin_unlock_irqrestore(&rq->lock, flags);
>
> ttwu_do_activate() ends up calling
> ttwu_activate() and ttwu_do_wakeup().
>
> I'm worried that ttwu_do_wakeup() might end up
> indirectly handing off the task to the following
> functions (within another execution context):
>
> - try_to_wake_up()
>   - ttwu_queue()
>     - ttwu_queue_remote(), which adds the process to another
>       wake_list with a cmpxchg()
>
> llist_next() is pretty simple:
>
> static inline struct llist_node *llist_next(struct llist_node *node)
> {
> 	return node->next;
> }
>
> It is so simple that I wonder if the compiler would be
> within its rights to reorder the load of node->next
> after some operations within ttwu_do_activate(), thus
> causing corruption of this linked list due to a
> concurrent try_to_wake_up() performed by another core.

Well, it clearly cannot leak this load out of the critical section. That
said, there is nothing in llist_next() that tells the compiler that it
needs to be careful with this load. And all three functions,
sched_ttwu_pending(), ttwu_do_activate(), and llist_next(), are in the
same compilation unit. The compiler therefore has full knowledge, and
full ability to rearrange.

It can slide this load down past the CONFIG_SMP #ifdef in
ttwu_do_activate() and into ttwu_activate(). However, it cannot push it
past the call to sched_clock_cpu(), which is defined in the separate
compilation unit kernel/sched/clock.c. And sched_clock_cpu() is invoked
from update_rq_clock(), which is invoked from enqueue_task(), which is
invoked from activate_task(), which is invoked from ttwu_activate(). And
even if sched_clock_cpu() were in the same compilation unit, its
preempt_disable_notrace() keeps the compiler from reordering. At least
it does if the compiler cannot prove that sched_clock_cpu() takes one of
its two early exits. ;-)

So I am not seeing anything important that the compiler could reorder
this load past. The CPU might do additional reordering, of course.

> Am I too paranoid about the possible compiler mishaps
> there, or are my concerns justified?

Now, try_to_wake_up() does acquire p->pi_lock, but that of course does
not exclude sched_ttwu_pending(), which instead acquires rq->lock. And
in a virtualized environment, sched_ttwu_pending() can be preempted and
delayed pretty much arbitrarily, pretty much anywhere, even when its
interrupts are disabled.

So let's look at the llist code. One question is of course "How does
llist_add_batch() avoid the ABA problem?" In this case, the answer
appears to be that llist_del_all() is used, and never llist_del_first().
(Mixing use of llist_del_all() and llist_del_first() by multiple threads
could cause this ABA problem, even if only one thread at a time used
llist_del_first().)
The code does appear to assume that if a given task is on the local
llist, it cannot be awakened any other way without acquiring the same
runqueue lock that sched_ttwu_pending() holds. If this rule were
violated, then of course the list could be corrupted. However, the only
accesses to the list appear to be sched_ttwu_pending()'s
llist_del_all(), ttwu_queue_remote()'s llist_add(), and a couple of
llist_empty() calls. This combination appears to me to be safe.

The scheduler_ipi() code assumes that an IPI cannot outrun data. If this
assumption is violated, scheduler_ipi() might still see an empty
->wake_list even though the sender of the IPI just filled it. The last
time I worked with a machine that violated this assumption was about
twenty years ago, and I very much hope that such hardware designs have
not come back from the dead. But that would cause a hang, not a corrupt
list.

So, does this trigger any helpful thoughts?

							Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/