[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <87twd30xwk.fsf@gmail.com>
Date: Mon, 26 Sep 2016 12:15:55 +0200
From: Nicolai Stange <nicstange@...il.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Nicolai Stange <nicstange@...il.com>,
John Stultz <john.stultz@...aro.org>,
linux-kernel@...r.kernel.org
Subject: Re: [RFC v7 00/23] adapt clockevents frequencies to mono clock
Nicolai Stange <nicstange@...il.com> writes:
> Thomas Gleixner <tglx@...utronix.de> writes:
>
>> On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> Thomas Gleixner <tglx@...utronix.de> writes:
>>>
>>> > On Wed, 21 Sep 2016, Nicolai Stange wrote:
>>> >> Thomas Gleixner <tglx@...utronix.de> writes:
>>> >> > Have you ever measured the overhead of the extra work which has to be done
>>> >> > in clockevents_adjust_all_freqs() ?
>>> >>
>>> >> Not exactly, I had a look at its invocation frequency which seems to
>>> >> decay exponentially with uptime, presumably because the NTP error
>>> >> approaches zero.
>>> >>
>>> >> However, I've just gathered a function_graph ftrace on my Intel
>>> >> i7-4800MQ (Haswell, 8HTs):
>>> >>
>>> >> # TIME CPU DURATION FUNCTION CALLS
>>> >> # | | | | | | | |
>>> >> 85.287027 | 0) 0.899 us | clockevents_adjust_all_freqs();
>>> >> 85.288026 | 0) 0.759 us | clockevents_adjust_all_freqs();
>>> >> 85.289026 | 0) 0.735 us | clockevents_adjust_all_freqs();
>>> >> 85.290026 | 0) 0.671 us | clockevents_adjust_all_freqs();
>>> >> 149.503656 | 2) 2.477 us | clockevents_adjust_all_freqs();
>>> >
>>> > That's not that bad. Though I'd like to see numbers for ARM (especially the
>>> > less powerful SoCs) as well.
>>>
>>> On a Raspberry Pi 2B (bcm2836, ARMv7) with CONFIG_SMP=y, the mean over
>>> ~5300 samples is 5.14+/-1.04us with a max of 11.15us.
>>
>> So why is the variance that high?
>
> I think this is because the histogram has got two peaks, c.f. [1]
>
> Also, the 11us maximum is not isolated but a flat tail is reaching to
> this point which I admittedly can't explain.
It turned out that the linux-next kernel always ran the RPi2B at what
apparently is its minimum speed.
lmbench3's mhz gave me 560MHz and lat_mem_rd reports a memory latency of
120ns on linux-next.
Compare this to an "official" kernel from the Raspberry Pi Foundation
obtained from [2]: mhz says that the CPU runs at 900MHz and according to
lat_mem_rd, the memory latency is at 50ns.
Especially the high memory latency killed my benchmark: both, the second
peak and the long tail stemmed from cache misses.
In order to verify this, I separated the tracing data from linux-next
into those samples that do not have any other calls to
clockevents_adjust_all_freqs() within a time span of 100ms before them
("first of run") and those that do ("not first of run"). The result can
be seen at [3]: the second peak as well as the long tail is generated
exclusively by the "first of run"'s.
Unfortunately I was not able to get this RPi2B running at its full
capabilities with linux-next. So I applied this series on top of the RPi
Foundation's kernel and did further benchmarking there. The results can
be found at [4]: no second peak, no particularly long tail.
Some statistics:
0% 25% 50% 75% 100%
1.250 1.511 1.667 1.927 7.031
Mean: 1.89
sd: 0.69
Much better IMHO. Good enough?
A random note: during tracing, I recognized that the adjustment should
better skip those CLOCK_EVT_FEAT_DUMMY devices. v8 will do this. Both
measurements include that change already.
>> You have an outlier on that intel as well which might be caused by
>> NMI, but it might also be a systematic issue depending on the input
>> parameters.
>
> AFACIT, the "algorithmic" runtime should be constant per CED, so it
> should not be dependent on any input parameters.
Well, this is not exactly true: the __do_div64() on ARM is implemented in
software. Basically this algorithm's runtime depends on the position of
the dividend's MSB. However, the range of the "adj" dividend as given by
__clockevents_calc_adjust_freq() should be relatively narrow.
I traced __do_div64() and there haven't been any apparent abnormalities.
>> 11 us on that ARM worries me.
These are 7us now. Also, this max value isn't nearly as connected to the
rest of the histogram as that 11us sample before. So it *might* be an
outlier now. I can't tell for sure though.
Thanks,
Nicolai
> [1] https://nicst.de/cev-freqadjust/adjust_all_freqs-function_graph_hist.png
[2] https://github.com/raspberrypi/linux
[3] https://nicst.de/cev-freqadjust/hist-adjust-smp.pdf
[4] https://nicst.de/cev-freqadjust/hist-adjust-official-smp.pdf
Powered by blists - more mailing lists