Message-ID: <276d81e0-3867-471a-8e99-b7582378dd64@paulmck-laptop>
Date: Sun, 6 Apr 2025 10:41:52 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com,
	Andrew Morton <akpm@...ux-foundation.org>,
	Kuniyuki Iwashima <kuniyu@...zon.com>,
	Petr Mladek <pmladek@...e.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	John Ogness <john.ogness@...utronix.de>,
	Sergey Senozhatsky <senozhatsky@...omium.org>,
	Jon Pan-Doh <pandoh@...gle.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Karolina Stolarek <karolina.stolarek@...cle.com>
Subject: Re: [PATCH RFC 8/9] ratelimit: Reduce ratelimit's false-positive
 misses

On Sat, Apr 05, 2025 at 11:17:00AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 3, 2025 at 11:15 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> >
> > The current ratelimit implementation can suffer from false-positive
> > misses.  That is, ___ratelimit() might return zero (causing the caller
> > to invoke rate limiting, for example, by dropping printk()s) even when
> > the current burst has not yet been consumed.  This happens when one CPU
> > holds a given ratelimit structure's lock and some other CPU concurrently
> > invokes ___ratelimit().  The fact that the lock is a raw irq-disabled
> > spinlock might make low-contention trylock failure seem unlikely, but
> > vCPU preemption, NMIs, and firmware interrupts can greatly extend the
> > trylock-failure window.
> >
> > Avoiding these false-positive misses is especially important when
> > correlating console logged hardware failures with other information.
> >
> > Therefore, instead of attempting to acquire the lock on each call to
> > ___ratelimit(), construct a lockless fastpath and only acquire the lock
> > when retriggering (for the next burst) or when resynchronizing (due to
> > either a long idle period or due to ratelimiting having been disabled).
> > This reduces the number of lock-hold periods that can be extended
> > by vCPU preemption, NMIs and firmware interrupts, but also means that
> > these extensions must be of much longer durations (generally moving from
> > milliseconds to seconds) before they can result in false-positive drops.
> >
> > In addition, the lockless fastpath gets a 10-20% speedup compared to
> > the old fully locked code on my x86 laptop.  Your mileage will of course
> > vary depending on your hardware, workload, and configuration.

Thank you for digging into this!!!
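
To recap the problem being solved, the false positives in question come
from the early trylock in today's ___ratelimit(), which looks roughly
like this (simplified from memory, not the patched code):

	if (!raw_spin_trylock_irqsave(&rs->lock, flags))
		return 0;	/* Caller throttled even if the burst is not consumed. */

If the lock holder is delayed by vCPU preemption, an NMI, or a firmware
interrupt, every concurrent caller hits that early return for the
duration.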

> First a nit: the func returns an int with 1 or 0, perhaps one extra
> patch to make it bool can be squeezed in here?

I can do that.  Patch below.

> One of the previous patches fixes a bug on 32-bit archs.
> 
> Maybe it will sound silly, but my suggestion below hinges on it: is
> this patchset written with 32-bit kernels in mind?

Yes, that bug fix is reflected in the lockless-fastpath patch.  It no
longer treats ->begin==0 as special.  The reason that this is 32-bit
specific is that at 1000HZ, a 32-bit counter wraps every 50 days or so,
which is well within the range of possible uptimes.  Wrapping a 64-bit
counter takes way longer.
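(Concretely, 2^32 jiffies at HZ=1000 is about 4.29 million seconds, or
a bit under 50 days, while 2^64 jiffies at that rate works out to more
than 500 million years.)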

> If not, I wonder if the 32-bit stuff can stay with the locked variant
> and the 64-bit can get a lockless fast path which issues an 8-byte
> cmpxchg on the event count + (to be introduced) sequence counter.
> 
> I think that would be significantly easier to reason about as it would
> guarantee no changes are made if someone is reconfiguring the struct,
> while providing the same win from a single-threaded standpoint.
> 
> I think you know what I mean, but just in case here is a pseudocode
> draft of the fast path:
> 
> #define RATELIMIT_NEED_INIT BIT(31)
> #define RATELIMIT_IN_FLUX BIT(0)
> 
> struct ratelimit_state_change {
>         int             events_left;
>         unsigned int    seq;
> };
> 
> struct ratelimit_state {
>         raw_spinlock_t  lock;
> 
>         int             interval;
>         int             burst;
>         int             missed;
>         struct ratelimit_state_change rsc;
>         unsigned long   begin;
> };
> 
> seq = READ_ONCE(rs->rsc.seq);
> smp_rmb();
> if (seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
>         goto bad;
> begin = READ_ONCE(rs->begin);
> burst = READ_ONCE(rs->burst);
> interval = READ_ONCE(rs->interval);
> events_left = READ_ONCE(rs->rsc.events_left);
> smp_rmb();
> /* checks if we can cmpxchg go here */
> ....
> /* now the work */
> struct ratelimit_state_change new = {
>         .events_left = events_left - 1,
>         .seq = seq,
> };
> if (try_cmpxchg64_relaxed(&rs->rsc, ......)) {
>         return true; /* succeeded */
> }
> /* ... retry based on what we got, most likely only ->events_left has changed */
> 
> On the stock kernel the struct is 32 bytes. I'm combining flags and
> the new seq field to avoid growing it.
> 
> This does cut down on available seq size, but it should be plenty as
> is. This also means the slowpath will have to be careful not to
> blindly ++ it and walk into the flag bits, but I think that's easier
> to handle than the races. ;)
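
Spelling out the elided retry, I read the fast path of that draft as
something like the following, wrapped in a hypothetical helper (same
layout as above, for discussion only, completely untested):

	/*
	 * Returns true if the event was admitted locklessly, false if the
	 * caller must fall back to the locked slow path (or throttle).
	 */
	static bool ratelimit_fastpath(struct ratelimit_state *rs)
	{
		union {
			struct ratelimit_state_change rsc;
			u64 val;
		} old, new;

		/* Assumes rs->rsc is 8-byte aligned so one 64-bit cmpxchg covers it. */
		old.val = READ_ONCE(*(u64 *)&rs->rsc);
		do {
			if (old.rsc.seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
				return false;	/* Reconfiguration in flight. */
			if (old.rsc.events_left <= 0)
				return false;	/* Burst consumed. */
			/* The begin/interval checks from the draft would go here. */
			new.rsc.events_left = old.rsc.events_left - 1;
			new.rsc.seq = old.rsc.seq;	/* Fast path leaves seq alone. */
		} while (!try_cmpxchg64_relaxed((u64 *)&rs->rsc, &old.val, new.val));
		return true;	/* Event admitted without taking the lock. */
	}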

In theory, something sort of like this that used a 16-byte cmpxchg
and packed the ->begin, ->rs_n_left, and ->flags fields together could
simplify this quite a bit.  But not every system has a 16-byte cmpxchg
on hand, and packing into 8 bytes (let alone a 32-bit system's 4
bytes) would require painful tradeoffs.  But in practice...

> That said this is merely a suggestion, I'm not going to push for it.
> 
> I recognize this swaps atomic_dec into a cmpxchg loop which in
> principle will have worse throughput in face of multiple CPUs messing
> with it. However, the fast path in both your and my variant issues
> loads prior to the atomic op which already do most of the damage, so I
> don't think this bit matters that much.

...as you say, the full-load throughput of cmpxchg() is lacking compared
to that of atomic_dec_return().  And large systems might have serious
___ratelimit() call rates.  Worse yet, the forward-progress properties
of cmpxchg() are lacking compared to those of atomic_dec_return(), so I
am not at all sold on this packing approach, even for systems providing
16-byte cmpxchg operations.

Yes, if a given ratelimit_state structure is mostly throttling, the
load-only fastpath is there, but the quadratic overload behavior of
cmpxchg() would apply during the non-throttling phases.

Never say never, of course, but we would need to see real issues
with the atomic_dec_return() approach before it would make sense
to take on the packing approach.
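
To make the comparison concrete, a decrement-style fast path has
roughly the following shape (a simplified sketch for discussion only,
not the actual patch; it assumes an atomic_t ->rs_n_left, uses
atomic_add_unless() rather than a bare atomic_dec_return() to avoid
underflow in the sketch, and omits the resynchronization and disable
cases):

	static bool ratelimit_dec_sketch(struct ratelimit_state *rs)
	{
		unsigned long flags;

		/* Lockless fast path: consume one event from the current burst. */
		if (atomic_add_unless(&rs->rs_n_left, -1, 0))
			return true;

		/* Burst consumed.  Do nothing more until the interval expires. */
		if (!time_is_before_jiffies(READ_ONCE(rs->begin) + READ_ONCE(rs->interval)))
			return false;

		/* Retrigger for the next burst.  Trylock failure here is rare. */
		if (!raw_spin_trylock_irqsave(&rs->lock, flags))
			return false;
		WRITE_ONCE(rs->begin, jiffies);
		atomic_set(&rs->rs_n_left, READ_ONCE(rs->burst) - 1);
		raw_spin_unlock_irqrestore(&rs->lock, flags);
		return true;
	}

Either way, the common case is a single atomic RMW with no
compare-and-exchange retry loop, which is where the throughput and
forward-progress points above come from.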

						Thanx, Paul

> -- 
> Mateusz Guzik <mjguzik gmail.com>

------------------------------------------------------------------------

diff --git a/include/linux/ratelimit_types.h b/include/linux/ratelimit_types.h
index 2f38345ffc1a5..633f154a1e2eb 100644
--- a/include/linux/ratelimit_types.h
+++ b/include/linux/ratelimit_types.h
@@ -43,7 +43,7 @@ struct ratelimit_state {
 	struct ratelimit_state name =					\
 		RATELIMIT_STATE_INIT(name, interval_init, burst_init)	\
 
-extern int ___ratelimit(struct ratelimit_state *rs, const char *func);
+extern bool ___ratelimit(struct ratelimit_state *rs, const char *func);
 #define __ratelimit(state) ___ratelimit(state, __func__)
 
 #endif /* _LINUX_RATELIMIT_TYPES_H */
diff --git a/lib/ratelimit.c b/lib/ratelimit.c
index 03704c6f8899e..a1d69c0cdcb39 100644
--- a/lib/ratelimit.c
+++ b/lib/ratelimit.c
@@ -24,7 +24,7 @@
  * 0 means callbacks will be suppressed.
  * 1 means go ahead and do it.
  */
-int ___ratelimit(struct ratelimit_state *rs, const char *func)
+bool ___ratelimit(struct ratelimit_state *rs, const char *func)
 {
 	unsigned long begin;
 	int burst = READ_ONCE(rs->burst);
