lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.20.1611302355070.3619@nanos>
Date:   Thu, 1 Dec 2016 00:21:02 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     David Gibson <david@...son.dropbear.id.au>
cc:     John Stultz <john.stultz@...aro.org>,
        lkml <linux-kernel@...r.kernel.org>,
        Liav Rehana <liavr@...lanox.com>,
        Chris Metcalf <cmetcalf@...lanox.com>,
        Richard Cochran <richardcochran@...il.com>,
        Ingo Molnar <mingo@...nel.org>,
        Prarit Bhargava <prarit@...hat.com>,
        Laurent Vivier <lvivier@...hat.com>,
        "Christopher S . Hall" <christopher.s.hall@...el.com>,
        "4.6+" <stable@...r.kernel.org>
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned
 in its calculation.

On Wed, 30 Nov 2016, David Gibson wrote:
> On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > If we have legitimate use cases with a negative delta, then this patch
> > breaks them no matter what. See the basic C course section in the second
> > link.
> 
> So, fwiw, when I first wrote a variant on this, I wasn't trying to fix
> every case - just to make the consequences less bad if something goes
> wrong.  An overflow here can still mess up timekeeping, it's true, but
> time going backwards tends to cause things to go horribly, horribly
> wrong - which was why I spotted this in the first place.

I completely understand the intention.

We _cannot_ make that whole thing unsigned when it is not 100% clear
that there is no legitimate caller which hands in a negative delta and
rightfully expects to get a negative nanoseconds value handed back.

If someone sits down and proves that this cannot happen there is no reason
to hold that off.

But that still does not solve the underlying root cause. Assume the
following:

  T1 = base + to_nsec(delta1)

       where delta1 is big, but the multiplication does not overflow 64bit

  Now wait a bit and do:
      
  T2 = base + to_nsec(delta2)

       now delta2 is big enough, so the  multiplication does overflow 64bit
       now delta2 is big enough to overflow 64bit with the multiplication.

  The result is T2 < T1, i.e. time goes backwards.

All what the unsigned conversion does is to procrastinate the problem by a
factor of 2. So instead of failing after 10 seconds we fail after 20
seconds. And just because you never observed the 20 seconds problem it does
not go away magically.

The proper solution is to figure out WHY we are running into that situation
at all. So far all I have seen are symptom reports and fairy tales about
ftp connections, but no real root cause analysis.

The only reason for this to happen is that 'base' does not get updated for
a too long time, so the delta grows into the overflow range.

We already have protection against idle sleeping too long for this to
happen. If the idle protection is not working then it needs to be fixed.

if some other situation can cause the base not to be updated for a long
time, then this needs to be fixed.

Curing the symptom is a guarantee that the root cause will show another
symptom sooner than later.

Thanks,

	tglx

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ