Date:	Mon, 6 Jul 2020 16:59:52 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Dave Jones <davej@...emonkey.org.uk>,
	Mel Gorman <mgorman@...hsingularity.net>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	mingo@...nel.org,
	Linus Torvalds <torvalds@...ux-foundation.org>
Cc:	paul.gortmaker@...driver.com, valentin.schneider@....com
Subject: Re: weird loadavg on idle machine post 5.7

On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
>
> > So ARM/Power/etc.. can speculate the load such that the
> > task_contributes_to_load() value is from before ->on_rq.
> >
> > The compiler might similarly re-order things -- although I've not found
> > it doing so with the few builds I looked at.
> >
> > So I think at the very least we should do something like this. But I've
> > no idea how to reproduce this problem.
> >
> > Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
> > this, and by not resetting p->sched_contributes_to_load it would skew
> > accounting even worse.
>
> looked promising the first few hours, but as soon as it hit four hours
> of uptime, loadavg spiked and is now pinned to at least 1.00

OK, lots of cursing later, I now have the below...

The TL;DR is that while schedule() doesn't change p->state once it
starts, it does read it quite a bit, and ttwu() will actually change it
to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
reads it to do loadavg accounting, things go sideways.

The below is extra complicated by the fact that I've had to scrounge up
a bunch of load-store ordering without actually adding barriers. It adds
yet another control dependency to ttwu(), so take that, C standard :-)

I've booted it, and built a few kernels with it and checked loadavg
drops to 0 after each build, so from that pov all is well. But since I'm
not confident I can reproduce the issue, I can't tell whether this
actually fixes anything, except maybe the phantoms of my imagination.

---
 include/linux/sched.h |  4 ---
 kernel/sched/core.c   | 67 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bd073a10224..e26c8bbeda00 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-					 (task->flags & PF_FROZEN) == 0 && \
-					 (task->state & TASK_NOLOAD) == 0)
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..f245444b4b15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
	 */
	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule(). See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about it's own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(bool preempt)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	prev_state = prev->state;
+
 	/*
-	 * Make sure that signal_pending_state()->signal_pending() below
-	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * __set_current_state(@state)
+	 * schedule()                           signal_wake_up()
+	 *   prev_state = p->state                set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *                                        wake_up_state()
+	 *   LOCK rq->lock                          LOCK p->pi_state
+	 *   smp_mb__after_spinlock()               smp_mb__after_spinlock()
+	 *   if (signal_pending_state()             if (p->state & @state)
+	 *
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+	/*
+	 * We must re-load p->state in case ttwu_runnable() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)) == TASK_UNINTERRUPTIBLE &&
+				(prev->flags & PF_FROZEN) == 0;
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()                 ttwu()
+			 *   prev_state = prev->state;    if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock                  goto out;
+			 *   smp_mb__after_spinlock();    smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;                p->state = TASK_WAKING;
+			 *
+			 * After this, schedule() must not care about p->state any more.
+			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {
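The interleaving Peter describes can be replayed step by step in userspace. The sketch below is a deliberately single-threaded simulation, not kernel code: the state constants match include/linux/sched.h, but the variables and control flow are invented for illustration, and the old task_contributes_to_load() test is reduced to its TASK_UNINTERRUPTIBLE bit. It shows how a p->state that ttwu() has already overwritten with TASK_WAKING makes schedule()'s accounting skip the nr_uninterruptible increment while the waker still performs the paired decrement:

	/* loadavg-race.c -- illustrative replay of the bad interleaving.
	 * Single-threaded on purpose: the ordering is forced by hand. */
	#include <stdio.h>

	#define TASK_UNINTERRUPTIBLE	0x0002
	#define TASK_WAKING		0x0200

	int main(void)
	{
		unsigned long state;		/* stands in for p->state */
		long nr_uninterruptible = 0;	/* stands in for rq->nr_uninterruptible */

		/* sleeper: __set_current_state(TASK_UNINTERRUPTIBLE); schedule(); */
		state = TASK_UNINTERRUPTIBLE;

		/* a remote ttwu() samples the old state for its later
		 * decrement, then overwrites it... */
		int sched_contributes_to_load = (state & TASK_UNINTERRUPTIBLE) != 0;
		state = TASK_WAKING;

		/* ...so when schedule()'s deactivate_task() re-reads p->state
		 * for the accounting, the TASK_UNINTERRUPTIBLE bit is gone and
		 * the increment is silently skipped */
		if (state & TASK_UNINTERRUPTIBLE)
			nr_uninterruptible++;

		/* ttwu_do_activate() still performs the paired decrement */
		if (sched_contributes_to_load)
			nr_uninterruptible--;

		printf("nr_uninterruptible = %ld\n", nr_uninterruptible); /* -1 */
		return 0;
	}

One lost update like this leaves a per-rq skew that the periodic loadavg folding never recovers from, which is consistent with the counter-style symptom Dave reported.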
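Note also what replaces the removed task_contributes_to_load() macro: the inline computation in __schedule() folds the two state tests into a single mask comparison, so a task contributes to load only when TASK_UNINTERRUPTIBLE is set and TASK_NOLOAD is clear, which keeps TASK_IDLE (defined as the OR of the two) out of the count. A quick standalone check of that equivalence, using the usual bit values from include/linux/sched.h (the PF_FROZEN clause is left out here):

	/* load-mask.c -- check the mask test against the old macro's intent */
	#include <assert.h>
	#include <stdio.h>

	#define TASK_UNINTERRUPTIBLE	0x0002
	#define TASK_WAKING		0x0200
	#define TASK_NOLOAD		0x0400
	#define TASK_IDLE		(TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

	/* the computation from the patch, minus the PF_FROZEN clause */
	static int contributes(unsigned long state)
	{
		return (state & (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)) == TASK_UNINTERRUPTIBLE;
	}

	int main(void)
	{
		assert(contributes(TASK_UNINTERRUPTIBLE));	/* D state counts */
		assert(!contributes(TASK_IDLE));		/* idle kthreads don't */
		assert(!contributes(TASK_WAKING));		/* the racy read above */
		assert(!contributes(0));			/* TASK_RUNNING doesn't */
		puts("mask test matches task_contributes_to_load()");
		return 0;
	}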
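As for the "load-store ordering without actually adding barriers": the fix only lets ttwu() write TASK_WAKING after its p->on_rq load has observed the dequeue, with smp_acquire__after_ctrl_dep() upgrading that control dependency to an acquire. Below is a rough userspace analogue of the handshake in C11 atomics, all names invented; the release store of on_rq on the schedule() side pairs with the acquire fence on the waker side, so the TASK_WAKING store cannot be observed before the schedule side is done reading state (build with cc -pthread):

	/* wakeup-order.c -- C11 analogue of the fixed handshake */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	#define TASK_UNINTERRUPTIBLE	0x0002L
	#define TASK_WAKING		0x0200L

	static atomic_long state = TASK_UNINTERRUPTIBLE;	/* p->state */
	static atomic_int on_rq = 1;				/* p->on_rq */

	static void *schedule_side(void *arg)
	{
		/* read p->state for the accounting decision, then publish the
		 * dequeue; the release pairs with the waker's acquire below */
		long prev_state = atomic_load_explicit(&state, memory_order_relaxed);
		if (prev_state)
			atomic_store_explicit(&on_rq, 0, memory_order_release);
		return arg;
	}

	static void *ttwu_side(void *arg)
	{
		/* spin until the dequeue is visible... */
		while (atomic_load_explicit(&on_rq, memory_order_relaxed))
			;
		/* ...and only then overwrite p->state; the fence plays the
		 * role of smp_acquire__after_ctrl_dep(), so this store can't
		 * be ordered before the schedule side's reads of 'state' */
		atomic_thread_fence(memory_order_acquire);
		atomic_store_explicit(&state, TASK_WAKING, memory_order_relaxed);
		return arg;
	}

	int main(void)
	{
		pthread_t a, b;
		pthread_create(&a, NULL, schedule_side, NULL);
		pthread_create(&b, NULL, ttwu_side, NULL);
		pthread_join(a, NULL);
		pthread_join(b, NULL);
		printf("state = %#lx, on_rq = %d\n",
		       (unsigned long)atomic_load(&state), atomic_load(&on_rq));
		return 0;
	}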