linux-kernel - Re: [PATCH] time.c::timespec_trunc: fix nanosecond file time rounding

Message-ID: <CALAqxLVSKSgf2+c1oHdSv9aH7gUjE9=q-E-j81DGsPNM1LoH9g@mail.gmail.com>
Date:	Tue, 16 Jun 2015 16:08:12 -0700
From:	John Stultz <john.stultz@...aro.org>
To:	Karsten Blees <karsten.blees@...il.com>
Cc:	lkml <linux-kernel@...r.kernel.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH] time.c::timespec_trunc: fix nanosecond file time rounding

On Tue, Jun 16, 2015 at 3:39 PM, Karsten Blees <karsten.blees@...il.com> wrote:
> On 16.06.2015 at 19:07, John Stultz wrote:
>> On Tue, Jun 9, 2015 at 10:36 AM, Karsten Blees <karsten.blees@...il.com> wrote:
>>> From: Karsten Blees <blees@...n.de>
>>> Date: Tue, 9 Jun 2015 10:50:28 +0200
>>>
>>> The rounding optimization in timespec_trunc() is based on the incorrect
>>> assumptions that current_kernel_time() is rounded to jiffies resolution,
>>> and that jiffies resolution is a multiple of all potential file time
>>> granularities.
>>
>> Sorry, this is a little opaque on the first read. You're saying that
>> there are filesystems where the on-disk granularity is smaller than a
>> tick/jiffy, but larger than a nanosecond, right?
>>
>
> Yes, examples include CIFS, NTFS (100 ns) and CEPH, UDF (1000 ns).

Thanks. Adding these concrete examples to the commit message would be good.


> The current code assumes that rounding can be avoided if (gran <= ns_per_tick).
>
> However, this optimization is only valid if:
>
> 1. current_kernel_time().tv_nsec is already rounded to tick resolution.
>    E.g. with HZ=1000 you would get tv_nsec = 1000000, 2000000, 3000000, but
>    never 1000001. AFAICT this is not true; current_kernel_time() may be
>    incremented only once per tick, but it's not rounded to tick resolution.
>
> 2. ns_per_tick is evenly divisible by gran, for all potential HZ and
>    granularity values. IOW "(ns_per_tick % gran) == 0". This may have been
>    true for HZ=100, 250, 1000, but not for HZ=300. E.g. if assumption 1
>    above was true, HZ=300 would give you tv_nsec = 3333333, 6666666,
>    9999999... This would definitely need to be rounded to e.g. UDF
>    resolution, even though (1000 <= 3333333) is clearly true.
>
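
To make the second condition concrete, here is a minimal userspace sketch
(not kernel code; the helper names are made up for illustration, and
ns_per_tick is approximated with plain integer division). It compares the
old "gran <= ns_per_tick" shortcut with the divisibility check that the
shortcut silently relies on:

```c
#include <stdbool.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper: nanoseconds per tick for a given HZ (integer division). */
static long ns_per_tick(long hz)
{
	return NSEC_PER_SEC / hz;
}

/* The old shortcut: skip rounding whenever gran <= ns_per_tick. */
static bool old_code_skips_rounding(long hz, long gran)
{
	return gran <= ns_per_tick(hz);
}

/*
 * What the shortcut actually needs (on top of assumption 1): every
 * tick-aligned tv_nsec must already be a multiple of gran.
 */
static bool tick_is_multiple_of_gran(long hz, long gran)
{
	return ns_per_tick(hz) % gran == 0;
}
```

With HZ=300 and UDF's 1000 ns granularity, old_code_skips_rounding(300, 1000)
is true while tick_is_multiple_of_gran(300, 1000) is false, since
3333333 % 1000 is 333. With HZ=1000 both are true.
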
>>> Thus, sub-second portions of in-core file times are not rounded to on-disk
>>> granularity. I.e. file times may change when the inode is re-read from disk
>>> or when the file system is remounted.
>>>
>>> File systems with on-disk resolutions of exactly 1 ns or 1 s are not
>>> affected by this.
>>>
>>> Steps to reproduce with e.g. UDF:
>>>
>>>   $ dd if=/dev/zero of=udfdisk count=10000 && mkudffs udfdisk
>>>   $ mkdir udf && mount udfdisk udf
>>>   $ touch udf/test && stat -c %y udf/test
>>>   2015-06-09 10:22:56.130006767 +0200
>>>   $ umount udf && mount udfdisk udf
>>>   $ stat -c %y udf/test
>>>   2015-06-09 10:22:56.130006000 +0200
>>>
>>> Remounting rounds the mtime to 1µs.
>>>
>>> Fix the rounding in timespec_trunc() and update the documentation.
>>>
>>> Note: This does _not_ fix the issue for FAT's 2 second mtime resolution,
>>> as struct super_block.s_time_gran isn't prepared to handle different
>>> ctime / mtime / atime resolutions nor resolutions > 1 second.
>>>
>>> Signed-off-by: Karsten Blees <blees@...n.de>
>>> ---
>>>
>>> This issue came up in a recent discussion on the git ML about enabling
>>> nanosecond file times on Windows, see
>>>
>>> http://thread.gmane.org/gmane.comp.version-control.msysgit/21290/focus=21315
>>>
>>>
>>>  kernel/time/time.c | 17 ++++-------------
>>>  1 file changed, 4 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/kernel/time/time.c b/kernel/time/time.c
>>> index 972e3bb..362ee06 100644
>>> --- a/kernel/time/time.c
>>> +++ b/kernel/time/time.c
>>> @@ -287,23 +287,14 @@ EXPORT_SYMBOL(jiffies_to_usecs);
>>>   * @t: Timespec
>>>   * @gran: Granularity in ns.
>>>   *
>>> - * Truncate a timespec to a granularity. gran must be smaller than a second.
>>> - * Always rounds down.
>>> - *
>>> - * This function should be only used for timestamps returned by
>>> - * current_kernel_time() or CURRENT_TIME, not with do_gettimeofday() because
>>> - * it doesn't handle the better resolution of the latter.
>>> + * Truncate a timespec to a granularity. gran must not be greater than a
>>> + * second (10^9 ns). Always rounds down.
>>>   */
>>>  struct timespec timespec_trunc(struct timespec t, unsigned gran)
>>>  {
>>> -       /*
>>> -        * Division is pretty slow so avoid it for common cases.
>>> -        * Currently current_kernel_time() never returns better than
>>> -        * jiffies resolution. Exploit that.
>>> -        */
>>> -       if (gran <= jiffies_to_usecs(1) * 1000) {
>>> +       if (gran <= 1) {
>>>                 /* nothing */
>>
>> So this change will, in effect, cause us to truncate where granularity
>> was less than one tick, where before we didn't do anything. Have you
>> reviewed all users to ensure this is safe (I assume you have, but it
>> might be good to describe which users are affected in the commit
>> message)?
>>
>>
>
> timespec_trunc() is exclusively used to calculate inode's [acm]time.
> It is mostly called through current_fs_time(), only a handful of fs
> drivers use it directly (but always with super_block.s_time_gran as
> second argument).
>
> So I think changing the function to do what the documentation says it
> does should be safe...

Yeah, though existing behavior is often more "expected" than documented
behavior. :)


>
>>> -       } else if (gran == 1000000000) {
>>> +       } else if (gran >= 1000000000) {
>>>                 t.tv_nsec = 0;
>>
>> While the code (which is quite old) wasn't super intuitive, this looks
>> to be making it more subtle instead of more clear. So if the
>> granularity is larger than a second, we just truncate to a second?
>> That seems surprising. If handling granularity larger than a second
>> isn't supported, we should probably make that explicit and add a
>> WARN_ON to catch problematic users of the function.
>
> Indeed, I changed this to catch invalid arguments (similar to how
> "gran <= 1" catches 0 and thus prevents division by zero).
>
> What about this instead?
>
>         if (gran == 1) {
>                 /* nothing */
>         } else if (gran == 1000000000) {
>                 t.tv_nsec = 0;
>         } else if (gran < 1 || gran > 1000000000) {
>                 WARN_ON(1);
>         } else {
>                 t.tv_nsec -= t.tv_nsec % gran;
>         }
>         return t;

Logically it's OK. I might suggest cleaning it up as:

        if ((gran < 1) || (gran > NSEC_PER_SEC))
                WARN_ON(1);     /* catch invalid granularity values */
        else if (gran == NSEC_PER_SEC)
                t.tv_nsec = 0;  /* special case to avoid div */
        else if ((gran > 1) && (gran < NSEC_PER_SEC))
                t.tv_nsec -= t.tv_nsec % gran;
        return t;

Also, it would be good to make it clear in the function comment that
gran values greater than NSEC_PER_SEC are invalid.
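
For anyone who wants to try the logic outside the kernel, a rough userspace
sketch follows. It is an illustration only, not the patch: the name
timespec_trunc_sketch is made up, NSEC_PER_SEC is defined locally, and an
fprintf() stands in for WARN_ON(1), which is not available in userspace.

```c
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000U

/* Userspace stand-in for the proposed timespec_trunc(); not the kernel code. */
static struct timespec timespec_trunc_sketch(struct timespec t, unsigned gran)
{
	if (gran < 1 || gran > NSEC_PER_SEC)
		/* stands in for WARN_ON(1): flag invalid granularity, leave t alone */
		fprintf(stderr, "invalid granularity %u\n", gran);
	else if (gran == NSEC_PER_SEC)
		t.tv_nsec = 0;	/* special case to avoid the division */
	else if (gran > 1)
		t.tv_nsec -= t.tv_nsec % (long)gran;	/* round down to a multiple of gran */
	/* gran == 1: nothing to do */
	return t;
}
```

Fed the tv_nsec from the UDF reproduction (130006767) with gran = 1000, this
gives 130006000, which matches what stat reported after the remount.
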

thanks
-john