Date:   Mon, 30 Jan 2017 16:38:08 +0100
From:   Frederic Weisbecker <fweisbec@...il.com>
To:     Stanislaw Gruszka <sgruszka@...hat.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Tony Luck <tony.luck@...el.com>,
        Wanpeng Li <wanpeng.li@...mail.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Mackerras <paulus@...ba.org>,
        Ingo Molnar <mingo@...nel.org>,
        Fenghua Yu <fenghua.yu@...el.com>,
        Rik van Riel <riel@...hat.com>,
        Martin Schwidefsky <schwidefsky@...ibm.com>
Subject: Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs

On Mon, Jan 30, 2017 at 02:54:51PM +0100, Stanislaw Gruszka wrote:
> On Sat, Jan 28, 2017 at 04:28:13PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> > > On 32-bit architectures a 64-bit store/load is not atomic and, if not
> > > protected, 64-bit variables can be mangled. I do not see any protection
> > > (lock) between the utime/stime store and load in the patch, so it seems
> > > the {u,s}time store and load can be performed at the same time. Though
> > > the problem is very improbable, it can still happen, at least
> > > theoretically, when the lower and upper 32 bits change at the same time,
> > > i.e. the process {u,s}time is near a multiple of 2**32 nsec (approx.
> > > 4 sec) and the 64-bit {u,s}time is stored and loaded at the same time
> > > on different cpus. As said, this is a very improbable situation, but it
> > > could eventually happen on long-lived processes.
> > 
> > "Improbable situation" doesn't appply to Linux. With millions (billion?)
> > of machines using it, a rare issue in the core turns into likely to happen
> > somewhere in the planet every second.
> > 
> > So it's definetly a race we want to consider. Note it goes beyond the scope
> > of this patchset as the issue was already there before since cputime_t can already
> > map to u64 on 32 bits systems upstream. But this patchset definetly extends
> > the issue on all 32 bits configs.
> > 
> > kcpustat has the same issue upstream. It's is made of u64 on all configs.
> 
> I would like to add what the possible consequences are if a value gets
> mangled. For sum_exec_runtime, utime and stime we could get wrong values
> from cpu-clock related syscalls like clock_gettime() or clock_nanosleep(),
> and cpu-clock timers like timer_create(CLOCK_PROCESS_CPUTIME_ID) could be
> triggered earlier or long after the expected time. For kcpustat this seems
> to mean wrong values read by procfs and 3 drivers (cpufreq, appldata,
> macintosh).

Yep, all agreed.
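
To make the mangling scenario concrete (numbers are mine, purely for
illustration): suppose sum_exec_runtime crosses a 2**32 ns boundary, going
from 0x00000000ffffffff to 0x0000000100000000. A 32-bit reader racing with
the two halves of that store can observe 0x0000000000000000 (~0 sec) or
0x00000001ffffffff (~8.6 sec) instead of ~4.3 sec, and that bogus value is
what the syscalls and timers mentioned above would then act on.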

> 
> > > I am considering fixing the possible mangling of sum_exec_runtime by
> > > using prev_sum_exec_runtime:
> > > 
> > > u64 read_sum_exec_runtime(struct task_struct *t)
> > > {
> > >        u64 ns, prev_ns;
> > >  
> > >        do {
> > >                prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> > >                ns = READ_ONCE(t->se.sum_exec_runtime);
> > >        } while (ns < prev_ns || ns > (prev_ns + U32_MAX));
> > >  
> > >        return ns;
> > > }
> > > 
> > > This should work based on the fact that prev_sum_exec_runtime and
> > > sum_exec_runtime are not modified and stored at the same time, so only
> > > one of those variables can be mangled. Though I need to think about
> > > the correctness of that a bit more.
> > 
> > I'm not sure that would be enough. READ_ONCE() prevents reordering by
> > the compiler but not by the CPU. You'd need memory barriers between the
> > reads and writes of prev_ns and ns.
> 
> It will not be enough; this is _supposed_ to work based on the fact that
> sum_exec_runtime and prev_sum_exec_runtime are not written at the same
> time, i.e. only one variable can be mangled while the other one already
> sits in memory. However, "not written at the same time" is the weak part
> of the reasoning. Even if those variables are stored in different parts of
> the code (sum_exec_runtime in update_curr() and prev_sum_exec_runtime in
> set_next_entity()), we cannot assume the store of one variable has finished
> before the other one starts.

Right.

> 
> >    WRITE ns                READ prev_ns
> >    smp_wmb()               smp_rmb()
> >    WRITE prev_ns           READ ns
> >    smp_wmb()               smp_rmb()
> >
> > It seems to be the only way to make sure that at least one of the reads
> > (prev_ns or ns) is correct.

Well, reading that again, I'm not 100% sure about the correctness guarantee.
But it might work.

> 
> I think you are right, but it seems that on many code paths we have this
> scenario:
> 
> 	WRITE ns		READ prev_ns
> 	smp_wmb()		smp_rmb()
> 	WRITE prev_ns		READ ns
> 
> and we already have an smp_wmb() after the write of sum_exec_runtime in
> update_min_vruntime().

You still need a second barrier after the second write and read (or before
the first write and read, which is equivalent) to ensure that if you read a
mangled version of ns, prev_ns is ok.
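
As a rough sketch of what that pairing could look like (assumptions: both
stores folded into one place for readability, whereas in the real code
sum_exec_runtime is written in update_curr() and prev_sum_exec_runtime in
set_next_entity(); and as said above I'm not 100% sure of the guarantee):

	/* writer side (sketch only, not the actual patch) */
	WRITE_ONCE(se->sum_exec_runtime, ns);
	smp_wmb();			/* order ns store before prev_ns store */
	WRITE_ONCE(se->prev_sum_exec_runtime, ns);
	smp_wmb();			/* second barrier, as discussed above */

	/* reader side, reusing the retry window from your proposal */
	do {
		prev_ns = READ_ONCE(se->prev_sum_exec_runtime);
		smp_rmb();		/* order prev_ns read before ns read */
		ns = READ_ONCE(se->sum_exec_runtime);
		smp_rmb();
	} while (ns < prev_ns || ns > (prev_ns + U32_MAX));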

Still I think u64_stats_sync is less trouble and more reliable.
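
For reference, a minimal sketch of the u64_stats_sync variant (the struct
and field names here are made up, not the actual cputime structures; on
64-bit the sync object and the begin/end calls compile away, on 32-bit SMP
they are a seqcount):

	#include <linux/u64_stats_sync.h>

	struct cputime_example {		/* hypothetical */
		u64			utime;
		u64			stime;
		struct u64_stats_sync	sync;	/* u64_stats_init() at setup */
	};

	/* writer (accounting path), ct is a struct cputime_example * */
	u64_stats_update_begin(&ct->sync);
	ct->utime += delta;
	ct->stime += delta;
	u64_stats_update_end(&ct->sync);

	/* reader: retry if a writer was in progress */
	unsigned int seq;
	u64 utime, stime;

	do {
		seq   = u64_stats_fetch_begin(&ct->sync);
		utime = ct->utime;
		stime = ct->stime;
	} while (u64_stats_fetch_retry(&ct->sync, seq));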

Thanks.
