linux-kernel - Re: [PATCH] FS: timerfd: Fix unexpected return value of timerfd

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.21.1908191943280.1796@nanos.tec.linutronix.de>
Date:   Mon, 19 Aug 2019 20:29:47 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Arul Jeniston <arul.jeniston@...il.com>
cc:     viro@...iv.linux.org.uk, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, arul_mc@...l.com
Subject: Re: [PATCH] FS: timerfd: Fix unexpected return value of timerfd_read
 function.

Arul,

On Mon, 19 Aug 2019, Arul Jeniston wrote:

> > hits ktime_get() or whether it happens concurrent on a different CPU.
> > ktime_get() can never use inconsistent tk data for calculating the time.
> 
> Agreed. I think, I am not making my point clear here.
> Do you mean to say ktime_get() would always return incremental time
> irrespective of isr and multi-processors?

Yes. The only exception is when the TSC is either jumping or not fully in
sync between cores.

> If yes, this is where, I have difference of understanding.

And your understanding is still wrong. I explain it to you _once_ more:

The side which updates the timekeeper:

    - increments the sequence count _BEFORE_ it changes any data. After that
      increment the sequence count is odd, i.e. bit 0 is set.

    - updates data (base, last, mult, shift ....)

    - increments it again _AFTER_ it updated data.  After that increment
      the sequence count is even, i.e. bit 0 is cleared.

The read out side:

start:
    - reads the sequence count
      - if sequence count is odd (update in progress) go back to start

    - reads base from timekeeper data
    - reads TSC and calculates the delta with timekeeper data
      (last, mult, shift ...), i.e. timekeeping_get_ns().

    - reads the sequence count again.

      If it is even and the same as read above, the data is valid and
      consistent and the result is returned.

      If the sequence count is different to the original value it goes back
      to start.

It does not matter at all if timekeeping_get_ns() returns occasionally a
wrong value due to timekeeper data being updated concurrently. The result
is discarded and never returned to the caller. It tries again.

All places which update the timer keeper issue the sequence count increment
protection and are properly serialized against each other. So there is no
occacional point where ktime_get() would return random crap due to being
interrupted by an update or due to a concurrent update on a different CPU.

This is a protection mechanism which is well understood in computer science
(seqlock with lockless readers) and it works in kernel timekeeping for way
more than a decade without any issue except when the underlying hardware
clocksource (TSC in that case) misbehaves. There is no way to protect the
code against this and we are not going to do anything about it simply
because we can't.

The fact that you can observe the (cycles < last) condition is not proving
anything. Just looking at the (cycles < last) condition is wrong. You need
to proof that the result is returned from ktime_get() without a retry
despite the sequence counter being changed. I doubt you can.

If you can prove that the condition is met _AND_ the sequence counter has
NOT changed, then you have proven that the TSC on your machine is not
correctly synchronized or otherwise returning crap values.

You can make up further weird theories about the incorrectness of that
code, but these theories wont become magically true.

Thanks,

	tglx