linux-kernel - Re: Loadavg accounting error on arm64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201116142005.GE3121392@hirez.programming.kicks-ass.net>
Date:   Mon, 16 Nov 2020 15:20:05 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Will Deacon <will@...nel.org>, Davidlohr Bueso <dave@...olabs.net>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: Loadavg accounting error on arm64

On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:

> > Anyway, setting all that aside, I do agree with you that the bitfield usage
> > in task_struct looks pretty suspicious. For example, in __schedule() we
> > have:
> > 
> > 	rq_lock(rq, &rf);
> > 	smp_mb__after_spinlock();
> > 	...
> > 	prev_state = prev->state;
> > 
> > 	if (!preempt && prev_state) {
> > 		if (signal_pending_state(prev_state, prev)) {
> > 			prev->state = TASK_RUNNING;
> > 		} else {
> > 			prev->sched_contributes_to_load =
> > 				(prev_state & TASK_UNINTERRUPTIBLE) &&
> > 				!(prev_state & TASK_NOLOAD) &&
> > 				!(prev->flags & PF_FROZEN);
> > 			...
> > 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
> > 
> > where deactivate_task() updates p->on_rq directly:
> > 
> > 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> > 
> 
> It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> document ttwu()") which changed it. Not sure why that is and didn't
> think about it too deep as it didn't appear to be directly related to
> the problem and didn't have ordering consequences.

I'm confused; that commit didn't change deactivate_task(). Anyway,
->on_rq should be strictly under rq->lock. That said, since there is a
READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
WRITE_ONCE().

> > __ttwu_queue_wakelist() we have:
> > 
> > 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > 
> > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > updates can race and cause the flags to be corrupted?
> > 
> 
> I think this is at least one possibility. I think at least that one
> should only be explicitly set on WF_MIGRATED and explicitly cleared in
> sched_ttwu_pending. While I haven't audited it fully, it might be enough
> to avoid a double write outside of the rq lock on the bitfield but I
> still need to think more about the ordering of sched_contributes_to_load
> and whether it's ordered by p->on_cpu or not.

The scenario you're worried about is something like:

	CPU0							CPU1

	schedule()
		prev->sched_contributes_to_load = X;
		deactivate_task(prev);

								try_to_wake_up()
									if (p->on_rq &&) // false
									if (smp_load_acquire(&p->on_cpu) && // true
									    ttwu_queue_wakelist())
										p->sched_remote_wakeup = Y;

		smp_store_release(prev->on_cpu, 0);



And then the stores of X and Y clobber one another.. Hummph, seems
reasonable. One quick thing to test would be something like this:


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..9844e541c94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,9 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
+	unsigned			:0;
 	unsigned			sched_remote_wakeup:1;
+	unsigned			:0;
 #ifdef CONFIG_PSI
 	unsigned			sched_psi_wake_requeue:1;
 #endif