linux-kernel - Re: [PATCH] sched_clock: Prevent 64bit inatomicity on 32bit systems

Date:	Mon, 08 Apr 2013 20:31:04 -0400
From:	Steven Rostedt <rostedt@...dmis.org>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...nel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	"Wulsch, Siegfried" <Siegfried.Wulsch@...ema.de>
Subject: Re: [PATCH] sched_clock: Prevent 64bit inatomicity on 32bit systems

On Sat, 2013-04-06 at 10:10 +0200, Thomas Gleixner wrote:
> The sched_clock_remote() implementation has the following inatomicity
> problem on 32bit systems when accessing the remote scd->clock, which
> is a 64bit value.
> 
> CPU0			CPU1
> 
> sched_clock_local()	sched_clock_remote(CPU0)
> ...
> 			remote_clock = scd[CPU0]->clock
> 			    read_low32bit(scd[CPU0]->clock)
> cmpxchg64(scd->clock,...)
> 			    read_high32bit(scd[CPU0]->clock)
> 
> While the update of scd->clock is using an atomic64 mechanism, the
> readout on the remote cpu is not, which can cause completely bogus
> readouts.
> 
> It is quite a rare problem, because the update has to hit the narrow
> race window between the low and high readout and also has to carry
> across the 32bit boundary.
> 
> The resulting misbehaviour is that CPU1 will see the sched_clock on
> CPU0 ~4 seconds ahead of its own and update CPU1's sched_clock value
> to this bogus timestamp. This stays that way due to the clamping
> implementation for about 4 seconds until the synchronization with
> CLOCK_MONOTONIC undoes the problem.
> 
> The issue is hard to observe, because it might only result in a less
> accurate SCHED_OTHER timeslicing behaviour. To create observable
> damage on realtime scheduling classes, it is necessary that the bogus
> update of CPU1 sched_clock happens in the context of a realtime
> thread, which then gets charged 4 seconds of RT runtime, which causes
> the RT throttler mechanism to trigger and prevent scheduling of RT
> tasks for a little less than 4 seconds. So this is quite unlikely as
> well.
> 
> The issue was quite hard to decode as the reproduction time is between
> 2 days and 3 weeks and intrusive tracing makes it less likely, but the
> following trace recorded with trace_clock=global, which uses
> sched_clock_local(), gave the final hint:
> 
>   <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
>   <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
> irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
>   <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
> 
> What happens is that CPU0 goes idle and invokes
> sched_clock_idle_sleep_event() which invokes sched_clock_local() and
> CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
> sched_clock_remote(). The time jump gets propagated to CPU0 via
> sched_clock_remote() and stays stale on both cores for ~4 seconds.
> 
> There are only two other possibilities, which could cause a stale
> sched clock:
> 
> 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
>    wrong value.
> 
> 2) sched_clock() which reads the TSC returns a sporadic wrong value.
> 
> #1 can be excluded because sched_clock would continue to increase for
>    one jiffy and then go stale.
> 
> #2 can be excluded because it would not make the clock jump
>    forward. It would just result in a stale sched_clock for one jiffy.
> 
> After quite some brain twisting and finding the same pattern on other
> traces, sched_clock_remote() remained the only place which could cause
> such a problem and as explained above it's indeed racy on 32bit
> systems.
> 
> So while on 64bit systems the readout is atomic, we need to verify the
> remote readout on 32bit machines. We need to protect the local->clock
> readout in sched_clock_remote() on 32bit as well because an NMI could
> hit between the low and the high readout, call sched_clock_local() and
> modify local->clock.
> 
> Thanks to Siegfried Wulsch for bearing with my debug requests and
> going through the tedious tasks of running a bunch of reproducer
> systems to generate the debug information which let me decode the
> issue.

Ugh. That looks painful.

Nice catch!

-- Steve

