linux-kernel - Re: Extreme time jitter with suspend/resume cycles

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.20.1710052302440.2398@nanos>
Date:   Thu, 5 Oct 2017 23:04:42 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Gabriel Beddingfield <gabe@...tlabs.com>
cc:     LKML <linux-kernel@...r.kernel.org>,
        Stephen Boyd <sboyd@...eaurora.org>,
        John Stultz <john.stultz@...aro.org>,
        Alessandro Zummo <a.zummo@...ertech.it>,
        Alexandre Belloni <alexandre.belloni@...e-electrons.com>,
        linux-rtc@...r.kernel.org, Guy Erb <guy@...tlabs.com>,
        Howard Harte <hharte@...tlabs.com>
Subject: Re: Extreme time jitter with suspend/resume cycles

On Thu, 5 Oct 2017, Gabriel Beddingfield wrote:

> Hi Thomas,
> 
> On Thu, Oct 5, 2017 at 11:01 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
> >> > Which SoC/clocksource driver are you talking about?
> >>
> >> NXP i.MX 6SoloX
> >> drivers/clocksource/timer-imx-gpt.c
> >
> > So that clocksource driver looks correct. Do you have an idea in which
> > context this time jump happens? Does it happen when you exercise your high
> > frequency suspend/resume dance or is that happening just when you let the
> > machine run forever as well?
> 
> We couldn't devise any reproduction steps. We observed it happening at
> unexpected times in a fleet of devices -- and we couldn't find any
> patterns to clue us in.

Ok. Did you talk to NXP about that? Or did you try to exercise reads in a
loop to detect the wreckage and maybe a pattern in there?

> > The timekeeping_resume() path definitely has an issue:
> >
> >         cycle_now = tk_clock_read(&tk->tkr_mono);
> >         if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
> >                 cycle_now > tk->tkr_mono.cycle_last) {
> >
> > This works nice for clocksources which wont wrap across suspend/resume but
> > not for those which can. That cycle_now -> cycle_last check should take
> > cs-mask into account ...
> >
> > Of course for clocksources which can wrap within realistic suspend times,
> > which 36 hours might be accounted for, this would need an extra sanity
> > check against a RTC whether wrap time has been exceeded.
> >
> > I haven't thought it through whether that buggered check fully explains
> > what you are observing, but it's wrong nevertheless. John?
> 
> Nah. It looks like the consequence is that you'll either fail to inject
> the sleep time or you'll fall back to having the RTC inject the sleep
> time. In our case, we never sleep for more than a couple of minutes so
> the error would be seconds rather than hours.

Fair enough. It's still wrong though and wants to be fixed.

Thanks,

	tglx