Message-ID: <20180822172825.GA1317@cmpxchg.org>
Date:   Wed, 22 Aug 2018 13:28:25 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Tejun Heo <tj@...nel.org>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Daniel Drake <drake@...lessm.com>,
        Vinayak Menon <vinmenon@...eaurora.org>,
        Christopher Lameter <cl@...ux.com>,
        Mike Galbraith <efault@....de>,
        Shakeel Butt <shakeelb@...gle.com>,
        Peter Enderborg <peter.enderborg@...y.com>, linux-mm@...ck.org,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
        kernel-team@...com
Subject: Re: [PATCH 8/9] psi: pressure stall information for CPU, memory, and
 IO

On Wed, Aug 22, 2018 at 11:10:24AM +0200, Peter Zijlstra wrote:
> On Tue, Aug 21, 2018 at 04:11:15PM -0400, Johannes Weiner wrote:
> > On Fri, Aug 03, 2018 at 07:21:39PM +0200, Peter Zijlstra wrote:
> > > On Wed, Aug 01, 2018 at 11:19:57AM -0400, Johannes Weiner wrote:
> > > > +			time = READ_ONCE(groupc->times[s]);
> > > > +			/*
> > > > +			 * In addition to already concluded states, we
> > > > +			 * also incorporate currently active states on
> > > > +			 * the CPU, since states may last for many
> > > > +			 * sampling periods.
> > > > +			 *
> > > > +			 * This way we keep our delta sampling buckets
> > > > +			 * small (u32) and our reported pressure close
> > > > +			 * to what's actually happening.
> > > > +			 */
> > > > +			if (test_state(groupc->tasks, cpu, s)) {
> > > > +				/*
> > > > +				 * We can race with a state change and
> > > > +				 * need to make sure the state_start
> > > > +				 * update is ordered against the
> > > > +				 * updates to the live state and the
> > > > +				 * time buckets (groupc->times).
> > > > +				 *
> > > > +				 * 1. If we observe task state that
> > > > +				 * needs to be recorded, make sure we
> > > > +				 * see state_start from when that
> > > > +				 * state went into effect or we'll
> > > > +				 * count time from the previous state.
> > > > +				 *
> > > > +				 * 2. If the time delta has already
> > > > +				 * been added to the bucket, make sure
> > > > +				 * we don't see it in state_start or
> > > > +				 * we'll count it twice.
> > > > +				 *
> > > > +				 * If the time delta is out of
> > > > +				 * state_start but not in the time
> > > > +				 * bucket yet, we'll miss it entirely
> > > > +				 * and handle it in the next period.
> > > > +				 */
> > > > +				smp_rmb();
> > > > +				time += cpu_clock(cpu) - groupc->state_start;
> > > > +			}
> > > 
> > > As is, groupc->state_start needs a READ_ONCE() above and a WRITE_ONCE()
> > > below. But like stated earlier, doing an update in scheduler_tick() is
> > > probably easier.
> > 
> > I've wrapped these in READ_ONCE/WRITE_ONCE.
> 
> I just realized, these are u64, so READ_ONCE/WRITE_ONCE will not work
> correctly on 32bit.

Ah, right.
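
To illustrate the 32bit problem: a u64 access there is typically done
as two separate 32-bit loads or stores, so a reader can combine one
half from before an update with the other half from after it. The
following is only a userspace simulation of that interleaving, not
kernel code; struct split_u64, split_store() and torn_read() are
hypothetical names made up for this sketch.

```c
#include <stdint.h>

/*
 * Hypothetical model of a u64 as a 32-bit CPU would access it:
 * two independent 32-bit words.
 */
struct split_u64 {
	uint32_t lo;
	uint32_t hi;
};

/* Store a 64-bit value as two 32-bit halves. */
static void split_store(struct split_u64 *p, uint64_t v)
{
	p->lo = (uint32_t)v;
	p->hi = (uint32_t)(v >> 32);
}

/*
 * Simulate a reader whose two 32-bit loads straddle a writer's
 * update: the low half is read before the update, the high half
 * after it.
 */
static uint64_t torn_read(uint64_t old_val, uint64_t new_val)
{
	struct split_u64 mem;
	uint32_t lo, hi;

	split_store(&mem, old_val);
	lo = mem.lo;			/* first load sees the old value */
	split_store(&mem, new_val);	/* writer runs in between */
	hi = mem.hi;			/* second load sees the new value */

	return ((uint64_t)hi << 32) | lo;
}
```

With old_val = 0xffffffff and new_val = 0x100000000, torn_read()
returns 0x1ffffffff, which is neither the old nor the new value.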

Actually, that race described in the comment above - "If the time
delta is out of state_start but not in the time bucket yet, we'll miss
it entirely and handle it in the next period" - can cause bogus time
samples if a state persists for more than 2s. If we observed a live
state and included it in our private copy of the time bucket
(times_prev), and the next aggregation then misses the delta while it
is in transit to the time bucket, times_prev ends up ahead of 'time'
and the delta underflows into a bogusly large sample.
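
The underflow can be shown with a small userspace sketch. It assumes
u32 buckets as in the patch comment; struct bucket and sample_delta()
are hypothetical stand-ins for the aggregation step, not the actual
patch code.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for one time bucket and the aggregator's
 * private copy of it.
 */
struct bucket {
	uint32_t times;		/* concluded state time */
	uint32_t times_prev;	/* value seen by the previous aggregation */
};

/*
 * One aggregation pass: sample concluded plus live time, return the
 * delta since the previous pass, and remember the new sample.
 */
static uint32_t sample_delta(struct bucket *b, uint32_t live)
{
	uint32_t time = b->times + live;
	/* Unsigned arithmetic: wraps around if time < times_prev. */
	uint32_t delta = time - b->times_prev;

	b->times_prev = time;
	return delta;
}
```

If the first pass sees times = 0 with 1000 units of live state,
times_prev becomes 1000. If the state then ends and the second pass
runs after state_start was reset but before times was updated, it
might see times = 0 and only 10 units of live state, and the delta
wraps to 4294966306.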

Memory barriers alone cannot guarantee full coherency here (that is,
neither seeing the delta twice nor missing it entirely), so I'm
switching this over to seqcount to make sure the aggregator sees
something sensible.

And then I don't need the READ_ONCE/WRITE_ONCE.
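
Roughly, the seqcount idea looks like the sketch below. This is a
userspace approximation with C11 atomics, not the kernel's seqcount_t
API and not the code from the next revision. struct group_cpu,
record_state_change() and read_total() are hypothetical names; it
assumes a single writer per structure and, for brevity, that a state
is always live.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-CPU group state. */
struct group_cpu {
	atomic_uint seq;		/* odd while an update is in progress */
	_Atomic uint64_t times;		/* concluded state time */
	_Atomic uint64_t state_start;	/* start of the live state */
};

/*
 * Writer (single writer assumed): move the elapsed time out of
 * state_start and into the bucket inside one write section.
 */
static void record_state_change(struct group_cpu *g, uint64_t now)
{
	unsigned int seq = atomic_load_explicit(&g->seq, memory_order_relaxed);
	uint64_t start = atomic_load_explicit(&g->state_start, memory_order_relaxed);
	uint64_t times = atomic_load_explicit(&g->times, memory_order_relaxed);

	/* Begin: make the sequence odd before touching the data. */
	atomic_store_explicit(&g->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&g->times, times + (now - start), memory_order_relaxed);
	atomic_store_explicit(&g->state_start, now, memory_order_relaxed);

	/* End: make the sequence even again, publishing the update. */
	atomic_store_explicit(&g->seq, seq + 2, memory_order_release);
}

/*
 * Reader: retry until both values were read with no write section
 * active or completed in between.
 */
static uint64_t read_total(struct group_cpu *g, uint64_t now)
{
	unsigned int s0, s1;
	uint64_t times, start;

	do {
		s0 = atomic_load_explicit(&g->seq, memory_order_acquire);
		times = atomic_load_explicit(&g->times, memory_order_relaxed);
		start = atomic_load_explicit(&g->state_start, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&g->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1);

	return times + (now - start);
}
```

The reader either sees the delta in state_start or in the bucket, but
never in both and never in neither, because any snapshot that overlaps
a write section is thrown away and retried.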
