Date:   Sat, 26 Mar 2022 22:13:52 +0100
From:   Thomas Gleixner <>
To:     Artem Savkov <>
Cc:     Artem Savkov <>,
        Anna-Maria Gleixner <>
Subject: Re: [PATCH 1/2] timer: introduce upper bound timers


On Thu, Mar 24 2022 at 13:28, Thomas Gleixner wrote:
> On Wed, Mar 23 2022 at 12:16, Artem Savkov wrote:
>>  	 * Round up with level granularity to prevent this.
>> +	 * Do not perform round up in case of upper bound timer.
>>  	 */
>> -	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
>> +	if (upper_bound)
>> +		expires = expires >> LVL_SHIFT(lvl);
>> +	else
>> +		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
> While this "works", I fundamentally hate this because it adds an extra

Actually it cannot work. At least not in a reliable and predictable way.

The timer wheel fundamentally relies on moving the timer one bucket
out. Let's look at how this all works.

The wheel has N levels of bucket arrays. Each level has 64 buckets and
the granularity increases by a factor of 8 per level, i.e. the worst
case deviation is 12.5% per level.

The original timer wheel was able to fire the timer at expiry time + one
tick for the price of cascading timers every so often from one level to
the next lower level. Here are a few pointers:

The accuracy of the original wheel implementation was already weakened
in the course of the NOHZ development, where the concept of slack
(algorithmic batching at enqueue time for the price of overhead) was
introduced. It had the same 12.5% worst case deviation of the resulting
granularity level, though the batching was not enforced and only worked
most of the time. So in theory the same issue could have been seen back
then already.

The enqueue placement and the expiry are driven by base::clk, which
is nothing else than jiffies. When a timer is queued, base::clk is
the time at which the next tick and expiry check happen.

Now let's look at how the expiry mechanism works. The first level (0) is
obviously expired every tick. On every eighth tick the second level (1) is
expired, on every 64th tick the third level (2)...

IOW, the expiry events of a level happen at 8^index(level) intervals.

Let's assume that base::clk is 0. That means at the next tick (which
could be imminent) in _all_ levels bucket[0] is due for expiry.

Now let's enqueue a timer with expiry value of 64:

    delta = 64 - 0 = 64

That means the timer ends up in the second level in bucket[0], which
makes it eligible for expiry at the next tick. The same is true for any
expiry value of 8^N.

Not what you are trying to achieve, right? You try to enforce an upper
bound, but you expect that the timer does not fire more than 12.5% (the
granularity of that upper bound's level) early, right?

IOW, you created an expiry lottery and there is no way to prevent that
except with more conditionals and heuristics, which are hardly welcome.

You've also seen the outcome of a timer firing unexpectedly due to the
bit abuse, right?

Now let's take a step back and look at the problem at hand.

The TCP keepalive timer, which is affected by the batching and the
resulting inaccuracy, is (re)armed with a relative timeout, which is known
upfront, right?

The important part is *relative* timeout. Why?

Because the timeout is relative, it's trivial to calculate from the given
accurate relative timeout a value which is guaranteed not to expire late
and stays within the timer wheel's error margin, and to use that for the
actual rearming.

I'm pretty sure that you can come up with a conversion function for that
and use this function in the TCP code at the point where the TCP
keepalive timer is configured.

Hint 1: The output of calc_wheel_index() should be good enough to figure
        that out.

Hint 2: If you get frustrated, try

        git grep johnstul arch/x86/ | awk '{ $1=$2=$3=""; print $0 }'

Hint 3: Feel free to ask questions anytime


