Date:   Tue, 07 Jul 2020 00:56:04 +0100
From:   Valentin Schneider <valentin.schneider@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Dave Jones <davej@...emonkey.org.uk>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Linux Kernel <linux-kernel@...r.kernel.org>, mingo@...nel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        paul.gortmaker@...driver.com
Subject: Re: weird loadavg on idle machine post 5.7


On 06/07/20 15:59, Peter Zijlstra wrote:
> OK, lots of cursing later, I now have the below...
>
> The TL;DR is that while schedule() doesn't change p->state once it
> starts, it does read it quite a bit, and ttwu() will actually change it
> to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
> reads it to do loadavg accounting, things go sideways.
>
> The below is extra complicated by the fact that I've had to scrounge up
> a bunch of load-store ordering without actually adding barriers. It adds
> yet another control dependency to ttwu(), so take that C standard :-)
>
> I've booted it, and built a few kernels with it and checked loadavg
> drops to 0 after each build, so from that pov all is well, but since
> I'm not confident I can reproduce the issue, I can't tell this actually
> fixes anything, except maybe phantoms of my imagination.
>

As you said on IRC, the one apparent race would lead to "negative"
rq->nr_uninterruptible accounting, i.e. we'd skip some increments and so
would end up with more decrements. As for the reported issue, I think we
were both expecting "positive" accounting, i.e. more increments than
decrements, leading to an artificially inflated loadavg.
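
To spell out the interleaving I'm thinking of (only a sketch; I'm glossing
over where exactly the nr_uninterruptible increment lives):

	CPU0 (__schedule(), prev in TASK_UNINTERRUPTIBLE)
		/* early reads of prev->state see TASK_UNINTERRUPTIBLE */

	CPU1 (ttwu(prev))
		prev->state = TASK_WAKING;

	CPU0
		/*
		 * Loadavg accounting now samples TASK_WAKING, which doesn't
		 * count as uninterruptible, so the nr_uninterruptible
		 * increment for this sleep is skipped.
		 */
		deactivate_task(rq, prev, DEQUEUE_SLEEP);

The wakeup side still does its decrement, hence more decrements than
increments.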

In any case, this should get rid of any existing race. I'll need some more
time (and more Aperol spritz) to go through it all, though.

> @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>        *
>        * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
>        * __schedule().  See the comment for smp_mb__after_spinlock().
> +	 *
> +	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> +	 * schedule()'s deactivate_task() has 'happened' and p will no longer
> +	 * care about it's own p->state. See the comment in __schedule().
>        */
> -	smp_rmb();
> +	smp_acquire__after_ctrl_dep();

Apologies for asking again, but I'm foolishly hopeful I'll someday be able
to grok those things without half a dozen tabs open with documentation and
Paul McKenney papers.

Do I get it right that the 'acquire' part hints this is equivalent to
issuing a load-acquire on whatever was needed to figure out whether or not
to take the branch (in this case, p->on_rq, amongst other things); IOW
ensures any memory access appearing later in program order has to happen
after the load?

That at least explains to me the load->{load,store} wording in
smp_acquire__after_ctrl_dep().
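
Put differently, my mental model is something like the below (a sketch built
around the names in the hunk, not the actual ttwu() flow):

	if (!READ_ONCE(p->on_rq)) {
		/*
		 * The branch on its own only gives load->store ordering
		 * (later stores can't be ordered before the p->on_rq load);
		 * this adds load->load, so together we get
		 * load->{load,store}, i.e. roughly as if the branch had
		 * been taken on smp_load_acquire(&p->on_rq).
		 */
		smp_acquire__after_ctrl_dep();

		/* Neither this store nor any later load of p's fields can
		 * now be reordered before the p->on_rq load. */
		p->state = TASK_WAKING;
	}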

> +
> +	/*
> +	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
> +	 * == 0), which means we need to do an enqueue, change p->state to
> +	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
> +	 * enqueue, such as ttwu_queue_wakelist().
> +	 */
> +	p->state = TASK_WAKING;
>
>       /*
>        * If the owning (remote) CPU is still in the middle of schedule() with
