Message-ID: <CAGudoHFH=U4eb=t50nr55kTaamsaKHdwPeZZCtJ7JXtYYy7-KQ@mail.gmail.com>
Date: Mon, 7 Apr 2025 02:07:38 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: paulmck@...nel.org
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com,
Andrew Morton <akpm@...ux-foundation.org>, Kuniyuki Iwashima <kuniyu@...zon.com>,
Petr Mladek <pmladek@...e.com>, Steven Rostedt <rostedt@...dmis.org>,
John Ogness <john.ogness@...utronix.de>, Sergey Senozhatsky <senozhatsky@...omium.org>,
Jon Pan-Doh <pandoh@...gle.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
Karolina Stolarek <karolina.stolarek@...cle.com>
Subject: Re: [PATCH RFC 8/9] ratelimit: Reduce ratelimit's false-positive misses
On Sun, Apr 6, 2025 at 7:41 PM Paul E. McKenney <paulmck@...nel.org> wrote:
>
> On Sat, Apr 05, 2025 at 11:17:00AM +0200, Mateusz Guzik wrote:
> > On Thu, Apr 3, 2025 at 11:15 PM Paul E. McKenney <paulmck@...nel.org> wrote:
> > >
> > > The current ratelimit implementation can suffer from false-positive
> > > misses. That is, ___ratelimit() might return zero (causing the caller
> > > to invoke rate limiting, for example, by dropping printk()s) even when
> > > the current burst has not yet been consumed. This happens when one CPU
> > > holds a given ratelimit structure's lock and some other CPU concurrently
> > > invokes ___ratelimit(). The fact that the lock is a raw irq-disabled
> > > spinlock might make low-contention trylock failure seem unlikely, but
> > > vCPU preemption, NMIs, and firmware interrupts can greatly extend the
> > > trylock-failure window.
> > >
> > > Avoiding these false-positive misses is especially important when
> > > correlating console logged hardware failures with other information.
> > >
> > > Therefore, instead of attempting to acquire the lock on each call to
> > > ___ratelimit(), construct a lockless fastpath and only acquire the lock
> > > when retriggering (for the next burst) or when resynchronizing (due to
> > > either a long idle period or due to ratelimiting having been disabled).
> > > This reduces the number of lock-hold periods that can be extended
> > > by vCPU preemption, NMIs and firmware interrupts, but also means that
> > > these extensions must be of much longer durations (generally moving from
> > > milliseconds to seconds) before they can result in false-positive drops.
> > >
> > > In addition, the lockless fastpath gets a 10-20% speedup compared to
> > > the old fully locked code on my x86 laptop. Your mileage will of course
> > > vary depending on your hardware, workload, and configuration.
>
> Thank you for digging into this!!!
>
> > First, a nit: the function returns an int that is always 1 or 0; perhaps
> > one extra patch to make it bool can be squeezed in here?
>
> I can do that. Patch below.
>
Thanks.
> > One of the previous patches fixes a bug on 32-bit archs.
> >
> > Maybe it will sound silly, but my suggestion below hinges on it: is
> > this patchset written with 32-bit kernels in mind?
>
> Yes, that bug fix is reflected in the lockless-fastpath patch. It no
> longer treats ->begin==0 as special. The reason that this is 32-bit
> specific is that at HZ=1000, a 32-bit jiffies counter wraps every 50 days
> or so, which is well within the range of possible uptimes. Wrapping a
> 64-bit counter takes way longer.
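(For the record, the 50-day figure is just 2^32 jiffies at HZ=1000; a
quick userspace back-of-the-envelope check, nothing kernel-specific:

#include <stdio.h>

int main(void)
{
	/* 2^32 jiffies at 1000 ticks per second, expressed in days. */
	printf("%.1f\n", 4294967296.0 / 1000 / 60 / 60 / 24);	/* ~49.7 */
	return 0;
}

so any machine that stays up for a couple of months crosses the wrap.)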
>
> > If not, I wonder if the 32-bit stuff can stay with the locked variant
> > and the 64-bit side can get a lockless fast path which issues an 8-byte
> > cmpxchg on the event count + a (to be introduced) sequence counter.
> >
> > I think that would be significantly easier to reason about, as it would
> > guarantee no changes are made while someone is reconfiguring the struct,
> > while providing the same win from a single-threaded standpoint.
> >
> > I think you know what I mean, but just in case here is a pseudocode
> > draft of the fast path:
> >
> > #define RATELIMIT_NEED_INIT BIT(31)
> > #define RATELIMIT_IN_FLUX BIT(0)
> >
> > struct ratelimit_state_change {
> >         int events_left;
> >         unsigned int seq;
> > };
> >
> > struct ratelimit_state {
> >         raw_spinlock_t lock;
> >
> >         int interval;
> >         int burst;
> >         int missed;
> >         struct ratelimit_state_change rsc;
> >         unsigned long begin;
> > };
> >
> > seq = READ_ONCE(rs->rsc.seq);
> > smp_rmb();
> > if (seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
> >         goto bad;
> > begin = READ_ONCE(rs->begin);
> > burst = READ_ONCE(rs->burst);
> > interval = READ_ONCE(rs->interval);
> > events_left = READ_ONCE(rs->rsc.events_left);
> > smp_rmb();
> > /* the checks for whether we can cmpxchg go here */
> > ....
> > /* now the work */
> > struct ratelimit_state_change new = {
> >         .events_left = events_left - 1,
> >         .seq = seq,
> > };
> > if (try_cmpxchg64_relaxed(&rs->rsc, ......)) {
> >         return true; /* succeeded */
> > }
> > /* ... retry based on what we got, most likely only ->events_left has changed */
> >
> > On the stock kernel the struct is 32 bytes. I'm combining flags and
> > the new seq field to avoid growing it.
> >
> > This does cut down on the available seq size, but it should be plenty as
> > is. This also means the slowpath will have to be careful not to blindly
> > ++ it into the flag bits, but I think that's easier to handle than
> > races. ;)
>
> In theory, something sort of like this that used a 16-byte cmpxchg
> and packed the ->begin, ->rs_n_left, and ->flags fields together could
> simplify this quite a bit. But not every system has a 16-byte cmpxchg
> on hand, and packing into 8 bytes (let alone a 32-bit system's 4
> bytes) would require painful tradeoffs. But in practice...
>
Well, cmpxchg16b has atrocious performance and I would not recommend it. ;)
> > That said, this is merely a suggestion; I'm not going to push for it.
> >
> > I recognize this swaps atomic_dec for a cmpxchg loop, which in
> > principle will have worse throughput in the face of multiple CPUs
> > messing with it. However, the fast path in both your and my variant
> > issues loads prior to the atomic op, which already do most of the
> > damage, so I don't think this bit matters that much.
>
> ...as you say, the full-load throughput of cmpxchg() is lacking compared
> to that of atomic_dec_return(). And large systems might have serious
> ___ratelimit() call rates. Worse yet, the forward-progress properties
> of cmpxchg() are lacking compared to those of atomic_dec_return(), so I
> am not at all sold on this packing approach, even for systems providing
> 16-byte cmpxchg operations.
>
Well, in my proposal this is an 8-byte cmpxchg, not a 16-byte one, with the
sequence counter validating that the rest of the state has not changed.
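To make the shape concrete, here is an untested sketch of that fastpath
with the elided checks filled in. The function name, the return
convention, and the fact that ->missed accounting is skipped entirely
are illustration on my part, not something from the patchset. It
returns 1 if the event is allowed, 0 if it is ratelimited, and -1 if
the caller must fall back to the locked slowpath:

static int ratelimit_try_fastpath(struct ratelimit_state *rs)
{
	union {
		struct ratelimit_state_change rsc;
		u64 val;
	} old, new;

	BUILD_BUG_ON(sizeof(struct ratelimit_state_change) != sizeof(u64));

	old.rsc.seq = READ_ONCE(rs->rsc.seq);
	/* Pairs with an smp_wmb() in the slowpath publishing new state. */
	smp_rmb();
	if (old.rsc.seq & (RATELIMIT_NEED_INIT | RATELIMIT_IN_FLUX))
		return -1;	/* being reconfigured; take the lock */

	/* Interval expired: retriggering the burst is slowpath work. */
	if (time_after(jiffies, READ_ONCE(rs->begin) + READ_ONCE(rs->interval)))
		return -1;

	old.rsc.events_left = READ_ONCE(rs->rsc.events_left);
	/* Burst consumed: throttle without ever touching the lock. */
	if (old.rsc.events_left <= 0)
		return 0;

	new = old;
	new.rsc.events_left--;

	/*
	 * The cmpxchg covers both fields, so it fails whenever anyone
	 * changed the count or bumped the sequence number since we
	 * sampled them.  A real version would retry on failure; one
	 * shot keeps the sketch short.  Assumes ->rsc is 8-byte aligned.
	 */
	if (try_cmpxchg64_relaxed((u64 *)&rs->rsc, &old.val, new.val))
		return 1;	/* event allowed, count decremented */
	return -1;
}

The point of the layout is the last step: the single 8-byte cmpxchg both
commits the decrement and revalidates the sequence word, so a
reconfiguration in flight (which bumps ->seq under rs->lock) pushes the
fastpath onto the lock instead of letting it operate on half-updated
state.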
> Yes, if a given ratelimit_state structure is mostly throttling, the
> load-only fastpath is there, but the quadratic overload behavior of
> cmpxchg() would apply during the non-throttling phases.
>
It is indeed non-ideal, but if you really need good perf here, then I
would argue that even a single shared instance of the counter is already bad.
> Never say never, of course, but we would need to see real issues
> with the atomic_dec_return() approach before it would make sense
> to take on the packing approach.
>
I claim my proposal is simpler to reason about, as you get an invariant:
nobody changes the event count from under you, and everyone always
operates on a fully populated state.
All that said, this was a side suggestion which would require work
to implement.
On the other hand, your variant is already written and I'm by no means
trying to block it. I am not in a position to ACK it either, and AFAICS
ratelimit is virtually unmaintained anyway. I guess it's your call
what to do with it.