Date:   Sat, 26 Mar 2022 22:13:52 +0100
From:   Thomas Gleixner <>
To:     Artem Savkov <>
Cc:     Artem Savkov <>,
        Anna-Maria Gleixner <>
Subject: Re: [PATCH 1/2] timer: introduce upper bound timers


On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote:
> On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote:
>>  	 * Round up with level granularity to prevent this.
>> +	 * Do not perform round up in case of upper bound timer.
>>  	 */
>> -	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>> +	if (upper_bound)
>> +		expires = expires >> LVL_SHIFT(lvl);
>> +	else
>> +		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
> While this "works", I fundamentally hate this because it adds an extra

Actually it cannot work. At least not in a reliable and predictable way.

The timer wheel fundamentally relies on moving the timer one bucket
out. Let's look at how this all works.

The wheel has N levels of bucket arrays. Each level has 64 buckets and
the granularity increases by a factor of 8 per level, i.e. the worst
case deviation is 12.5% per level.

The original timer wheel was able to fire the timer at expiry time + one
tick for the price of cascading timers every so often from one level to
the next lower level. Here are a few pointers:

The accuracy of the original wheel implementation was already weakened
in the course of the NOHZ development, where the concept of slack
(algorithmic batching at enqueue time for the price of overhead) was
introduced. It had the same 12.5% worst case deviation of the resulting
granularity level, though the batching was not enforced and only worked
most of the time. So in theory the same issue could have been seen back
then already.

The enqueue placement and the expiry are driven by base::clk, which
is nothing else than jiffies. When a timer is queued, base::clk is
the time at which the next tick and expiry check happen.

Now let's look at how the expiry mechanism works. The first level (0) is
obviously expired every tick. On every eighth tick the second level (1) is
expired, on every 64th tick the third level (2)...

IOW, the expiry events of a level happen at 8^index(level) intervals.

Let's assume that base::clk is 0. That means at the next tick (which
could be imminent) in _all_ levels bucket[0] is due for expiry.

Now let's enqueue a timer with expiry value of 64:

    delta = 64 - 0 = 64

That means the timer ends up in the second level in bucket[0], which
makes it eligible for expiry at the next tick. The same is true for any
expiry value of 8^N.

Not what you are trying to achieve, right? You try to enforce an upper
bound, but you expect that the timer does not fire more than 12.5% (the
granularity of that upper bound's level) early, right?

IOW, you created an expiry lottery and there is no way to prevent that
except with more conditionals and heuristics, which are hardly welcome.

You've also seen the outcome of a timer firing unexpectedly due to the
bit abuse, right?

Now let's take a step back and look at the problem at hand.

The TCP keepalive timer, which is affected by the batching and the
resulting inaccuracy, is (re)armed with a relative timeout, which is known
upfront, right?

The important part is *relative* timeout. Why?

Because the timeout is relative, it's trivial to calculate from the given
accurate relative timeout a value which is guaranteed not to expire late
and stays within the timer wheel's error margin, and to use that for the
actual rearming.

I'm pretty sure that you can come up with a conversion function for that
and use this function in the TCP code at the point where the TCP
keepalive timer is configured.

Hint 1: The output of calc_wheel_index() should be good enough to figure
        that out.

Hint 2: If you get frustrated, try

        git grep johnstul arch/x86/ | awk '{ $1=$2=$3=""; print $0 }'

Hint 3: Feel free to ask questions anytime


