linux-kernel - Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <87twd30xwk.fsf@gmail.com>
Date:   Mon, 26 Sep 2016 12:15:55 +0200
From:   Nicolai Stange <nicstange@...il.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Nicolai Stange <nicstange@...il.com>,
        John Stultz <john.stultz@...aro.org>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock

Nicolai Stange <nicstange@...il.com> writes:

> Thomas Gleixner <tglx@...utronix.de> writes:
>
>> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> Thomas Gleixner <tglx@...utronix.de> writes:
>>> 
>>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> >> Thomas Gleixner <tglx@...utronix.de> writes:
>>> >> > Have you ever measured the overhead of the extra work which has to be done
>>> >> > in clockevents_adjust_all_freqs() ?
>>> >> 
>>> >> Not exactly, I had a look at its invocation frequency which seems to
>>> >> decay exponentially with uptime, presumably because the NTP error
>>> >> approaches zero.
>>> >> 
>>> >> However, I've just gathered a function_graph ftrace on my Intel
>>> >> i7-4800MQ (Haswell, 8HTs):
>>> >> 
>>> >> #     TIME        CPU  DURATION                  FUNCTION CALLS
>>> >> #      |          |     |   |                     |   |   |   |
>>> >>    85.287027 |   0)   0.899 us    |  clockevents_adjust_all_freqs();
>>> >>    85.288026 |   0)   0.759 us    |  clockevents_adjust_all_freqs();
>>> >>    85.289026 |   0)   0.735 us    |  clockevents_adjust_all_freqs();
>>> >>    85.290026 |   0)   0.671 us    |  clockevents_adjust_all_freqs();
>>> >>   149.503656 |   2)   2.477 us    |  clockevents_adjust_all_freqs();
>>> >
>>> > That's not that bad. Though I'd like to see numbers for ARM (especially the
>>> > less powerful SoCs) as well.
>>> 
>>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>>
>> So why is the variance that high?
>
> I think this is because the histogram has got two peaks, c.f. [1]
>
> Also, the 11us maximum is not isolated but a flat tail is reaching to
> this point which I admittedly can't explain.

It turned out that the linux-next kernel always ran the RPi2B at what
apparently is its minimum speed.

lmbench3's mhz gave me 560MHz and lat_mem_rd reports a memory latency of
120ns on linux-next.

Compare this to an "official" kernel from the Raspberry Pi Foundation
obtained from [2]: mhz says that the CPU runs at 900MHz and according to
lat_mem_rd, the memory latency is at 50ns.

Especially the high memory latency killed my benchmark: both, the second
peak and the long tail stemmed from cache misses.

In order to verify this, I separated the tracing data from linux-next
into those samples that do not have any other calls to
clockevents_adjust_all_freqs() within a time span of 100ms before them
("first of run") and those that do ("not first of run"). The result can
be seen at [3]: the second peak as well as the long tail is generated
exclusively by the "first of run"'s.

Unfortunately I was not able to get this RPi2B running at its full
capabilities with linux-next. So I applied this series on top of the RPi
Foundation's kernel and did further benchmarking there. The results can
be found at [4]: no second peak, no particularly long tail.
Some statistics:
     0%   25%   50%   75%  100% 
  1.250 1.511 1.667 1.927 7.031

  Mean: 1.89
  sd: 0.69 

Much better IMHO. Good enough?

A random note: during tracing, I recognized that the adjustment should
better skip those CLOCK_EVT_FEAT_DUMMY devices. v8 will do this. Both
measurements include that change already.

>> You have an outlier on that intel as well which might be caused by
>> NMI, but it might also be a systematic issue depending on the input
>> parameters.
>
> AFACIT, the "algorithmic" runtime should be constant per CED, so it
> should not be dependent on any input parameters.

Well, this is not exactly true: the __do_div64() on ARM is implemented in
software. Basically this algorithm's runtime depends on the position of
the dividend's MSB. However, the range of the "adj" dividend as given by
__clockevents_calc_adjust_freq() should be relatively narrow.
I traced __do_div64() and there haven't been any apparent abnormalities.

>> 11 us on that ARM worries me.

These are 7us now. Also, this max value isn't nearly as connected to the
rest of the histogram as that 11us sample before. So it *might* be an
outlier now. I can't tell for sure though.

Thanks,

Nicolai

> [1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png

[2] https://github.com/raspberrypi/linux
[3] https://nicst.de/cev-freqadjust/hist-adjust-smp.pdf
[4] https://nicst.de/cev-freqadjust/hist-adjust-official-smp.pdf