Date:   Tue, 13 Mar 2018 23:45:45 +0000
From:   Jason Vas Dias <jason.vas.dias@...il.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     x86@...nel.org, LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        andi <andi@...stfloor.org>
Subject: Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW

On 12/03/2018, Peter Zijlstra <peterz@...radead.org> wrote:
> On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote:
>>   Sometimes, particularly when correlating elapsed time to performance
>>   counter values,
>
> So what actual problem are you trying to solve here? Perf can already
> give you sample time in various clocks, including MONOTONIC_RAW.
>
>

Yes, I am sampling perf counters, including CPU_CYCLES, INSTRUCTIONS,
CPU_CLOCK, TASK_CLOCK, etc., in a group FD I open with
perf_event_open(), for the current thread on the current CPU -
I am doing this for 4 threads, on Intel & ARM CPUs.
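
Roughly, the group FD setup looks like the sketch below (not my exact
code - error handling is omitted and open_counter() / open_group() are
just illustrative helper names):

/* Sketch of the group-FD setup described above.
 * perf_event_open() has no glibc wrapper, so it is called via syscall().
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int open_counter(__u32 type, __u64 config, int group_fd)
{
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;
        attr.disabled = (group_fd == -1);   /* leader starts disabled */
        attr.read_format = PERF_FORMAT_GROUP
                         | PERF_FORMAT_TOTAL_TIME_ENABLED
                         | PERF_FORMAT_TOTAL_TIME_RUNNING;
        /* pid = 0, cpu = -1 : current thread, any CPU
         * (could pass cpu = sched_getcpu() to restrict to the current CPU)
         */
        return (int) syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int open_group(void)
{
        int grp_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES, -1);
        open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS, grp_fd);
        open_counter(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_CLOCK,    grp_fd);
        open_counter(PERF_TYPE_SOFTWARE, PERF_COUNT_SW_TASK_CLOCK,   grp_fd);
        return grp_fd;
}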

Reading the performance counters does involve 2 ioctls and a read(),
which together take time that already far exceeds the time required to
read the TSC or CNTPCT in the VDSO.

The CPU_CLOCK software counter should give the converted TSC cycles
seen between the ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, ...) and the
ioctl(grp_fd, PERF_EVENT_IOC_DISABLE, ...) calls, and the difference
between the event->time_running and time_enabled values should also
measure elapsed time.
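
That enable / measure / disable sequence and the group read that
returns time_enabled / time_running look roughly like this (a sketch
following the read_format layout documented in perf_event_open(2);
struct and function names are only illustrative):

/* Enable the group, run the measured section, disable, then read the
 * group: nr, time_enabled, time_running, then one value per member.
 */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

struct group_read {
        uint64_t nr;            /* number of events in the group      */
        uint64_t time_enabled;  /* PERF_FORMAT_TOTAL_TIME_ENABLED     */
        uint64_t time_running;  /* PERF_FORMAT_TOTAL_TIME_RUNNING     */
        uint64_t values[8];     /* one counter value per group member */
};

void measure(int grp_fd)
{
        struct group_read gr;

        ioctl(grp_fd, PERF_EVENT_IOC_RESET,  PERF_IOC_FLAG_GROUP);
        ioctl(grp_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

        /* ... measured code section ... */

        ioctl(grp_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
        read(grp_fd, &gr, sizeof(gr));

        /* "inner" elapsed time, from the kernel's perspective */
        printf("enabled: %llu ns, running: %llu ns\n",
               (unsigned long long) gr.time_enabled,
               (unsigned long long) gr.time_running);
}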

This gives the "inner" elapsed time, from the perspective of the kernel,
while the measured code section had the counters enabled.

But unless the user-space program also has a way of measuring elapsed
time from the CPU's perspective, i.e. without being subject to operator
or NTP / PTP adjustment, it has no way of correlating this inner
elapsed time with any "outer" elapsed time measurement it may have made
- I also measure the time taken by I/O operations between threads, for
instance.

So that is my primary motivation - for each thread's main run loop, I
enable performance counters and count several PMU counters as well as
the CPU_CLOCK & TASK_CLOCK. I want to determine with maximal accuracy
how much elapsed time was spent actually executing the task's
instructions on the CPU, and how long they took to execute. I want to
exclude the time spent gathering, making and analysing the performance
measurements from the time spent running the threads' main loop.
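
The correlation I am after looks roughly like the sketch below, which
wraps the measured section in CLOCK_MONOTONIC_RAW reads; measure() is
the hypothetical helper from the previous fragment:

/* Correlate "outer" raw elapsed time with the "inner" time_enabled /
 * time_running reported by the kernel for the same section.
 */
#define _GNU_SOURCE             /* CLOCK_MONOTONIC_RAW on older glibc */
#include <time.h>
#include <stdint.h>

static uint64_t raw_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

void run_loop(int grp_fd)
{
        uint64_t outer_start = raw_ns();    /* not subject to NTP/PTP */

        measure(grp_fd);        /* enable, run section, disable, read */

        uint64_t outer_ns = raw_ns() - outer_start;
        /* outer_ns - time_enabled ~= measurement + scheduling overhead */
        (void) outer_ns;
}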

To do this accurately, it is best to exclude variations in time
that occur because of operator or NTP / PTP adjustments.

The CLOCK_MONOTONIC_RAW clock is the ONLY
clock that is MEANT to be immune from any adjustment.

It is meant to be a high-resolution clock with 1ns resolution
that should be subject to no adjustment, and hence one would expect
it to have the lowest latency.

But the way Linux has implemented it up to now, CLOCK_MONOTONIC_RAW
has a resolution (minimum time that can be measured)
that varies from 300 to 1000ns.

I can read the TSC and store a 16-byte timespec value in about 8ns
on the same CPU.
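
Those latency figures come from a loop along the lines of the sketch
below - a much simplified version of what the attached timer_latency.c
measures, using the __rdtsc() compiler intrinsic for the raw TSC read:

/* Time N back-to-back clock_gettime(CLOCK_MONOTONIC_RAW) calls and
 * N raw TSC reads; cycle-to-ns conversion of the TSC is omitted.
 */
#define _GNU_SOURCE             /* CLOCK_MONOTONIC_RAW on older glibc */
#include <time.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>          /* __rdtsc() - x86 only */

#define N 1000000

int main(void)
{
        struct timespec ts, t0, t1;
        volatile uint64_t tsc;
        long i;

        clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
        for (i = 0; i < N; i++)
                clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
        clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
        printf("clock_gettime: %.1f ns/call\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N);

        clock_gettime(CLOCK_MONOTONIC_RAW, &t0);
        for (i = 0; i < N; i++)
                tsc = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
        printf("rdtsc:         %.1f ns/read\n",
               ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N);

        (void) tsc;
        return 0;
}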

I understand that Linux must conform to the POSIX interface, which
means it cannot provide sub-nanosecond resolution timers, but it could
allow user-space programs to easily discover the timer calibration so
that they can read the timers themselves.

Currently, users must parse the log file or use gdb / objdump to
inspect /proc/kcore to get the TSC calibration and exact
mult+shift values for the TSC value conversion.

Intel does not publish the actual precise TSC frequency, nor does the
CPU provide it in ROM or firmware - it must be calibrated against the
other clocks, according to a complicated procedure in section 18.2 of
the SDM. My TSC has a "rated" / nominal frequency, which one can
compute from CPUID leaves, of 2.3GHz, but the "Refined TSC frequency"
is 2.8333GHz.
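
The nominal figure can be derived from CPUID leaf 0x15 (the TSC to
core-crystal-clock ratio), roughly as in the sketch below; on parts
where ECX is zero the crystal frequency has to be assumed, which is
one reason the nominal and refined values can differ:

/* Sketch: nominal (not refined) TSC frequency from CPUID leaf 0x15.
 * EBX/EAX is the TSC : core-crystal-clock ratio, ECX the crystal Hz
 * (0 on many parts, in which case a model-specific value is assumed).
 */
#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || !eax || !ebx) {
                fprintf(stderr, "CPUID leaf 0x15 not available\n");
                return 1;
        }
        if (ecx == 0)
                ecx = 24000000;    /* assumed 24 MHz crystal - model-specific */

        uint64_t tsc_hz = (uint64_t) ecx * ebx / eax;
        printf("nominal TSC frequency: %llu Hz\n", (unsigned long long) tsc_hz);
        return 0;
}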

Hence I think Linux should export this calibrated frequency somehow;
its "calibration" is expressed as the raw clocksource 'mult' and
'shift' values, and is exported to the VDSO.

I think the VDSO should read the TSC and use the calibration
to render the raw, unadjusted time from the CPU's perspective.
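
The conversion itself is just the clocksource arithmetic the kernel
already uses internally (cf. clocksource_cyc2ns()):

/* Nanoseconds from raw cycles, using the calibrated mult/shift pair. */
#include <stdint.h>

static inline uint64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
        return (cycles * mult) >> shift;
}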

Hence, the patch I am preparing, which is again attached.

I will submit it properly via email once I figure out
how to obtain the 'git send-email' tool, and how to
use it to send multiple patches, which seems
to be the only way to submit acceptable patches.

Also, the attached timer program measures a latency
of about 20ns with my patched 4.15.9 kernel, where it
measured a latency of 300-1000ns without the patch.

Thanks & Regards,

Jason

Download attachment "vdso_clock_monotonic_raw_1.patch" of type "application/octet-stream" (3834 bytes)

View attachment "timer_latency.c" of type "text/x-csrc" (2443 bytes)
