linux-kernel - Re: Loadavg accounting error on arm64

Message-ID: <20201116165415.GG3121392@hirez.programming.kicks-ass.net>
Date:   Mon, 16 Nov 2020 17:54:15 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Mel Gorman <mgorman@...hsingularity.net>
Cc:     Will Deacon <will@...nel.org>, Davidlohr Bueso <dave@...olabs.net>,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: Loadavg accounting error on arm64

On Mon, Nov 16, 2020 at 03:52:32PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 03:20:05PM +0100, Peter Zijlstra wrote:
> > > It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> > > document ttwu()") which changed it. Not sure why that is and didn't
> > > think about it too deep as it didn't appear to be directly related to
> > > the problem and didn't have ordering consequences.
> > 
> > I'm confused; that commit didn't change deactivate_task(). Anyway,
> > ->on_rq should be strictly under rq->lock. That said, since there is a
> > READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
> > WRITE_ONCE().
> > 
> 
> It didn't change deactivate_task but it did this
> 
> -       WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING);
> -       dequeue_task(rq, p, DEQUEUE_NOCLOCK);
> +       deactivate_task(rq, p, DEQUEUE_NOCLOCK);
> 
> which makes that write a
> 
> p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> 
> As activate_task is also a plain store and I didn't spot a relevant
> ordering problem that would impact loadavg, I concluded it was not
> immediately relevant, just a curiosity.

That's the move_queued_task() case, which is irrelevant for the issue at
hand.

> > > > __ttwu_queue_wakelist() we have:
> > > > 
> > > > 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > > > 
> > > > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > > > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > > > updates can race and cause the flags to be corrupted?
> > > > 
> > > 
> > > I think this is at least one possibility. I think at least that one
> > > should only be explicitly set on WF_MIGRATED and explicitly cleared in
> > > sched_ttwu_pending. While I haven't audited it fully, it might be enough
> > > to avoid a double write outside of the rq lock on the bitfield but I
> > > still need to think more about the ordering of sched_contributes_to_load
> > > and whether it's ordered by p->on_cpu or not.
> > 
> > The scenario you're worried about is something like:
> > 
> > 	CPU0							CPU1
> > 
> > 	schedule()
> > 		prev->sched_contributes_to_load = X;
> > 		deactivate_task(prev);
> > 
> > 								try_to_wake_up()
> > 									if (p->on_rq &&) // false
> > 									if (smp_load_acquire(&p->on_cpu) && // true
> > 									    ttwu_queue_wakelist())
> > 										p->sched_remote_wakeup = Y;
> > 
> > 		smp_store_release(&prev->on_cpu, 0);
> > 
> 
> Yes, mostly because of what memory-barriers.txt warns about for bitfields
> if they are not protected by the same lock.

I'm not sure memory-barriers.txt is relevant; that's simply two racing
stores and 'obviously' buggered.
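
For illustration only, here is a minimal single-threaded model of how two
such stores can lose an update. It assumes the compiler implements each
bit-field store as load word / modify / store word, and it replays one
unlucky interleaving by hand. The bit positions and the lost_update()
helper are hypothetical and are not taken from the kernel.

```c
#include <assert.h>

/* Hypothetical bit positions inside one shared storage unit. */
#define CONTRIBUTES_TO_LOAD	(1u << 1)
#define REMOTE_WAKEUP		(1u << 3)

/*
 * Replay one interleaving of two non-atomic bit-field stores.
 * Each "CPU" loads the whole word, sets its own bit, and writes
 * the whole word back. Returns the final value of the word.
 */
static unsigned lost_update(void)
{
	unsigned word = 0;		/* storage unit shared by both flags */

	unsigned cpu0 = word;		/* CPU0: load */
	unsigned cpu1 = word;		/* CPU1: load, before CPU0 has stored */

	cpu0 |= CONTRIBUTES_TO_LOAD;	/* CPU0: sched_contributes_to_load = 1 */
	cpu1 |= REMOTE_WAKEUP;		/* CPU1: sched_remote_wakeup = 1 */

	word = cpu0;			/* CPU0: store */
	word = cpu1;			/* CPU1: store, overwrites CPU0's bit */

	return word;
}
```

In this replay only REMOTE_WAKEUP survives; the CONTRIBUTES_TO_LOAD bit
written by "CPU0" is gone, which is the kind of clobber being discussed.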

> > And then the stores of X and Y clobber one another.. Hummph, seems
> > reasonable. One quick thing to test would be something like this:
> > 
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 7abbdd7f3884..9844e541c94c 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -775,7 +775,9 @@ struct task_struct {
> >  	unsigned			sched_reset_on_fork:1;
> >  	unsigned			sched_contributes_to_load:1;
> >  	unsigned			sched_migrated:1;
> > +	unsigned			:0;
> >  	unsigned			sched_remote_wakeup:1;
> > +	unsigned			:0;
> >  #ifdef CONFIG_PSI
> >  	unsigned			sched_psi_wake_requeue:1;
> >  #endif
> 
> I'll test this after the smp_wmb() test completes. While a clobbering may
> be the issue, I also think the gap between the rq->nr_uninterruptible++
> and smp_store_release(prev->on_cpu, 0) is relevant and a better candidate.

I really don't understand what you wrote in that email...
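
As a side note on the test patch quoted above: a zero-width bit-field
tells the compiler not to pack any further bit-fields into the current
storage unit, so the flag that follows it starts a new unit (and, in
C11 terms, a separate memory location). The sketch below checks that
on the compiler at hand. The two structs are cut-down, hypothetical
stand-ins for the task_struct flags, and the check assumes that storing
to one bit-field leaves the other bytes of a zeroed struct untouched,
which holds for GCC and Clang in practice but is not promised by the
standard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All flags packed into one storage unit (layout before the patch). */
struct packed_flags {
	unsigned reset_on_fork:1;
	unsigned contributes_to_load:1;
	unsigned migrated:1;
	unsigned remote_wakeup:1;
	unsigned psi_wake_requeue:1;
};

/* remote_wakeup fenced off by zero-width bit-fields (as in the patch). */
struct split_flags {
	unsigned reset_on_fork:1;
	unsigned contributes_to_load:1;
	unsigned migrated:1;
	unsigned :0;
	unsigned remote_wakeup:1;
	unsigned :0;
	unsigned psi_wake_requeue:1;
};

/* Index of the unsigned-sized unit holding the first non-zero byte. */
static size_t unit_of_set_bit(const void *obj, size_t len)
{
	const unsigned char *p = obj;

	for (size_t i = 0; i < len; i++)
		if (p[i])
			return i / sizeof(unsigned);
	return (size_t)-1;
}

/* 1 if contributes_to_load and remote_wakeup land in the same unit. */
static int packed_fields_share_unit(void)
{
	struct packed_flags a, b;

	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.contributes_to_load = 1;
	b.remote_wakeup = 1;
	return unit_of_set_bit(&a, sizeof(a)) == unit_of_set_bit(&b, sizeof(b));
}

/* Same check for the layout with the zero-width separators. */
static int split_fields_share_unit(void)
{
	struct split_flags a, b;

	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.contributes_to_load = 1;
	b.remote_wakeup = 1;
	return unit_of_set_bit(&a, sizeof(a)) == unit_of_set_bit(&b, sizeof(b));
}
```

Calling the two helpers from a small main() should report a shared unit
for packed_flags and separate units for split_flags on common ABIs.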