linux-kernel - Re: [PATCH] [RFC] timekeeping: Rework frequency adjustments to work better w/ nohz

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <53589195.3060605@linaro.org>
Date:	Wed, 23 Apr 2014 21:22:45 -0700
From:	John Stultz <john.stultz@...aro.org>
To:	Miroslav Lichvar <mlichvar@...hat.com>
CC:	LKML <linux-kernel@...r.kernel.org>,
	Richard Cochran <richardcochran@...il.com>,
	Prarit Bhargava <prarit@...hat.com>
Subject: Re: [PATCH] [RFC] timekeeping: Rework frequency adjustments to work
 better w/ nohz

Hey Miroslav!

  Once again, a few months pass and I finally get some more time to look
at this. :( Sorry for how slow this has been going.


On 02/12/2014 07:42 AM, Miroslav Lichvar wrote:
> On Mon, Jan 06, 2014 at 07:57:03PM -0800, John Stultz wrote:
>> Got a few cycles to take another look at this, and tried to address
>> Miroslav's latest comments. Please let me know if you have further
>> thoughts!
> In the simulations this version of the patch does indeed work better
> than the previous one. I tested it with the ntp_error correction added
> back as per your previous mail. It converges, the measured frequency
> matches the value set by adjtimex and the ntp error stays close to
> zero. I'm concerned about two things now and I'm not sure if they can
> be fixed easily.
>
> One is that the convergence still seems too slow to me.
>
> ./tk_test -t 500 -n 4000
> samples: 1-500 reg: 1-500 slope: 1.00 dev: 1947.5 max: 36940.9 freq: -100.08081
> samples: 501-1000 reg: 501-1000 slope: 1.00 dev: 430.7 max: 1669.6 freq: -100.07168
> samples: 1001-1500 reg: 1001-1500 slope: 1.00 dev: 675.3 max: 3172.9 freq: -100.07393
> samples: 1501-2000 reg: 1501-2000 slope: 1.00 dev: 453.9 max: 2223.4 freq: -100.07177
> samples: 2001-2500 reg: 2001-2500 slope: 1.00 dev: 2601.2 max: 10875.4 freq: -100.03978
> samples: 2501-3000 reg: 2501-3000 slope: 1.00 dev: 185.6 max: 1251.6 freq: -99.99987
> samples: 3001-3500 reg: 3001-3500 slope: 1.00 dev: 160.1 max: 1181.7 freq: -99.99996
> samples: 3501-4000 reg: 3501-4000 slope: 1.00 dev: 150.7 max: 1103.2 freq: -99.99990
>
> You can see in this test it takes about 2500 updates to correct the
> initial ntp error and settle down. That's with 1GHz clocksource. In
> some tests I did with smaller clock frequencies or different frequency
> offsets it took much longer than that.

So I started to look into this slow update issue. I was a little
confused, as the logarithmic approximation done in the frequency
correction shouldn't let *that* much initial error accumulate before get
to the +1/0 adjustments.

It ends up this is more a reflection of a different part of your patch.
Particularly the tk->ntp_tick storage. Bascially the ntp_tick variable
is a cache of the ntp_tick_length() value, however it doesn't get set to
ntp_tick_length() until *after* you do the first frequency correction.
Basically this avoids accumulating any error until after the first
correction is made.

My main concern is that this seems like an accounting error. By
basically avoiding accumulating the initial error it seems it would
never be corrected, no?

If I add a similar ntp_tick caching to my current implementation, it
converges practically as fast (with the only initial error coming from
the logarithmic approximation of the divide being quite small). 
Similarly, if you remove the usage of tk->ntp_tick in the error
accumulation loop, using ntp_tick_length() your patch behaves quite
similarly to mine.

Does this align with your view of the code? Or am I misunderstanding
something?

> $ ./tk_test -s 2500 -n 5000
> samples: 1-5000 reg: 2501-5000 slope: 1.00 dev: 135942.4 max: 390761.8 freq: -100.00000
>
> Here the regression line is calculated only through the latter half of
> the samples (where it was already settled down) and we can see the
> total ntp error corrected since the first update is around 390
> microseconds.
>
> I'm not saying ntpd or linuxptp can't work with this. The PLL and PI
> controllers seem to handle this slowness in the frequency adjustment
> well. I'm more worried about designs that may change the frequency
> quickly and rely on accurate prediction of the clock in their feedback
> loop.
>
> The other thing I'm concerned about is that the multiplier doesn't
> change only between two closest values when the loop has settled down,
> but in a larger range. With the 1GHz clock I see the multiplier is
> moving in range of 8387767-8387772, which makes the ntp error several
> times larger than it would be if the multiplier switched just between
> 8387769 and 8387770.

So this point is valid. I spent some time reworking things and I'm not
totally happy with it currently (there's a little mess to it), but its
much closer to your implementation logically, but again just avoids the
divide and keeps the accounting closer to what we already have.

I'll hopefully clean up my current work and send it out tomorrow.

Also I do really appreciate your submissions here and apologize I've not
been able to put more time towards it recently. I also know having me
rewrite your patch over and over with various mistakes is probably
frustrating, but I do really want to make sure I both understand all the
subtleties of your changes (which resynthesizing helps) as well as make
sure the accounting is really correct (or at least not changed subtlety
without clear reason).

Thanks so much again!
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/