Message-ID: <CAGudoHF5H0NhCu-mCjtd1SGRc5P=8X7jmTaP9k12zZixX1-9LA@mail.gmail.com>
Date: Sat, 5 Apr 2025 11:17:00 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com, 
	Andrew Morton <akpm@...ux-foundation.org>, Kuniyuki Iwashima <kuniyu@...zon.com>, 
	Petr Mladek <pmladek@...e.com>, Steven Rostedt <rostedt@...dmis.org>, 
	John Ogness <john.ogness@...utronix.de>, Sergey Senozhatsky <senozhatsky@...omium.org>, 
	Jon Pan-Doh <pandoh@...gle.com>, Bjorn Helgaas <bhelgaas@...gle.com>, 
	Karolina Stolarek <karolina.stolarek@...cle.com>
Subject: Re: [PATCH RFC 8/9] ratelimit: Reduce ratelimit's false-positive misses

On Thu, Apr 3, 2025 at 11:15 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> The current ratelimit implementation can suffer from false-positive
> misses.  That is, ___ratelimit() might return zero (causing the caller
> to invoke rate limiting, for example, by dropping printk()s) even when
> the current burst has not yet been consumed.  This happens when one CPU
> holds a given ratelimit structure's lock and some other CPU concurrently
> invokes ___ratelimit().  The fact that the lock is a raw irq-disabled
> spinlock might make low-contention trylock failure seem unlikely, but
> vCPU preemption, NMIs, and firmware interrupts can greatly extend the
> trylock-failure window.
>
> Avoiding these false-positive misses is especially important when
> correlating console logged hardware failures with other information.
>
> Therefore, instead of attempting to acquire the lock on each call to
> ___ratelimit(), construct a lockless fastpath and only acquire the lock
> when retriggering (for the next burst) or when resynchronizing (due to
> either a long idle period or due to ratelimiting having been disabled).
> This reduces the number of lock-hold periods that can be extended
> by vCPU preemption, NMIs and firmware interrupts, but also means that
> these extensions must be of much longer durations (generally moving from
> milliseconds to seconds) before they can result in false-positive drops.
>
> In addition, the lockless fastpath gets a 10-20% speedup compared to
> the old fully locked code on my x86 laptop.  Your mileage will of course
> vary depending on your hardware, workload, and configuration.
>

First a nit: the function returns an int that is only ever 1 or 0;
perhaps one extra patch converting it to bool can be squeezed in here?
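
For illustration only, a minimal sketch of that conversion, assuming
the current declaration in include/linux/ratelimit_types.h and the
__ratelimit() wrapper in include/linux/ratelimit.h (not something from
the series):

/*
 * Today the declaration is roughly:
 *
 *	extern int ___ratelimit(struct ratelimit_state *rs, const char *func);
 *
 * and callers already treat the result as a boolean, so the cleanup
 * would just make that explicit in the type:
 */
extern bool ___ratelimit(struct ratelimit_state *rs, const char *func);
#define __ratelimit(state)	___ratelimit(state, __func__)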

One of the previous patches fixes a bug on 32-bit archs.

This may sound silly, but my suggestion below hinges on the answer: is
this patchset written with 32-bit kernels in mind?

If not, I wonder if the 32-bit builds could stay with the locked
variant while 64-bit gets a lockless fast path which issues an 8-byte
cmpxchg on the event count plus a (to be introduced) sequence counter.

I think that would be significantly easier to reason about, as it would
guarantee no changes are made while someone is reconfiguring the
struct, while providing the same win from a single-threaded standpoint.

I think you know what I mean, but just in case here is a pseudocode
draft of the fast path:

#define RATELIMIT_NEED_INIT BIT(31)
#define RATELIMIT_IN_FLUX BIT(0)

struct ratelimit_state_change {
        int             events_left;
        unsigned int    seq;
};

struct ratelimit_state {
        raw_spinlock_t  lock;

        int             interval;
        int             burst;
        int             missed;
        struct ratelimit_state_change rsc;
        unsigned long   begin;
};

seq = READ_ONCE(rs->rsc.seq);
smp_rmb();
if (seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
        goto bad;
begin = READ_ONCE(rs->begin);
burst = READ_ONCE(rs->burst);
interval = READ_ONCE(rs->interval);
events_left = READ_ONCE(rs->rsc.events_left);
smp_rmb();
/* checks if we can cmpxchg go here */
....
/* now the work */
struct ratelimit_state_change new = {
        .events_left = events_left - 1,
        .seq = seq,
};
if (try_cmpxchg64_relaxed(&rs->rsc, ......)) {
        return true; /* succeeded */
}
/* ... retry based on what we got, most likely only ->events_left has changed */
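
For concreteness, here is a slightly fuller (still hedged) sketch of
how that fast path could be spelled on 64-bit, assuming the rsc member
is 8-byte aligned so both fields can be updated with one cmpxchg; the
u64-punning union and the function name are mine, and the
begin/burst/interval checks from the draft above are omitted for
brevity:

union rsc_u64 {
        struct ratelimit_state_change rsc;
        u64 val;
};

static bool ratelimit_try_fastpath(struct ratelimit_state *rs)
{
        union rsc_u64 old, new;

        /* One 8-byte load picks up events_left and seq together. */
        old.val = READ_ONCE(*(u64 *)&rs->rsc);
        do {
                /* Any special state punts to the locked slow path. */
                if (old.rsc.seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
                        return false;
                if (old.rsc.events_left <= 0)
                        return false;

                new = old;
                new.rsc.events_left--;
                /*
                 * seq is left unchanged; if the slow path bumped it while
                 * reconfiguring, the cmpxchg fails.  On failure
                 * try_cmpxchg64_relaxed() refreshes old.val and we retry.
                 */
        } while (!try_cmpxchg64_relaxed((u64 *)&rs->rsc, &old.val, new.val));

        return true;
}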

On the stock kernel the struct is 32 bytes. I'm combining flags and
the new seq field to avoid growing it.

This does cut down on the available seq size, but it should be plenty
as is. It also means the slowpath will have to be careful not to
blindly ++ the counter so that it does not walk into the flag bits, but
I think that's easier to handle than races. ;)
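
Purely as an illustration of that masking, a hypothetical helper (the
name is mine):

/*
 * With IN_FLUX in bit 0 and NEED_INIT in bit 31, the sequence proper
 * lives in bits 1..30: bump by 2 and make sure a carry out of bit 30
 * cannot leak into the flag bits.
 */
static inline unsigned int rsc_seq_bump(unsigned int seq)
{
        unsigned int flags = seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX);

        seq = (seq + 2) & ~(RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX);
        return seq | flags;
}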

That said, this is merely a suggestion; I'm not going to push for it.

I recognize this swaps the atomic_dec for a cmpxchg loop, which in
principle will have worse throughput in the face of multiple CPUs
messing with it. However, the fast path in both your variant and mine
issues loads prior to the atomic op, and those already do most of the
damage, so I don't think this bit matters that much.
-- 
Mateusz Guzik <mjguzik gmail.com>
