Message-ID: <Z_VRo63o2UsVoxLG@pathway.suse.cz>
Date: Tue, 8 Apr 2025 18:41:07 +0200
From: Petr Mladek <pmladek@...e.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: linux-kernel@...r.kernel.org, kernel-team@...a.com,
	Andrew Morton <akpm@...ux-foundation.org>,
	Kuniyuki Iwashima <kuniyu@...zon.com>,
	Mateusz Guzik <mjguzik@...il.com>,
	Steven Rostedt <rostedt@...dmis.org>,
	John Ogness <john.ogness@...utronix.de>,
	Sergey Senozhatsky <senozhatsky@...omium.org>,
	Jon Pan-Doh <pandoh@...gle.com>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Karolina Stolarek <karolina.stolarek@...cle.com>
Subject: Re: [PATCH RFC 8/9] ratelimit: Reduce ratelimit's false-positive
 misses

On Thu 2025-04-03 14:15:13, Paul E. McKenney wrote:
> The current ratelimit implementation can suffer from false-positive
> misses.  That is, ___ratelimit() might return zero (causing the caller
> to invoke rate limiting, for example, by dropping printk()s) even when
> the current burst has not yet been consumed.  This happens when one CPU
> holds a given ratelimit structure's lock and some other CPU concurrently
> invokes ___ratelimit().  The fact that the lock is a raw irq-disabled
> spinlock might make low-contention trylock failure seem unlikely, but
> vCPU preemption, NMIs, and firmware interrupts can greatly extend the
> trylock-failure window.
> 
> Avoiding these false-positive misses is especially important when
> correlating console logged hardware failures with other information.
> 
> Therefore, instead of attempting to acquire the lock on each call to
> ___ratelimit(), construct a lockless fastpath and only acquire the lock
> when retriggering (for the next burst) or when resynchronizing (due to
> either a long idle period or due to ratelimiting having been disabled).
> This reduces the number of lock-hold periods that can be extended
> by vCPU preemption, NMIs and firmware interrupts, but also means that
> these extensions must be of much longer durations (generally moving from
> milliseconds to seconds) before they can result in false-positive drops.
> 
> In addition, the lockless fastpath gets a 10-20% speedup compared to
> the old fully locked code on my x86 laptop.  Your mileage will of course
> vary depending on your hardware, workload, and configuration.
> 
> [ paulmck: Apply feedback from Dan Carpenter. ]
> 
> diff --git a/lib/ratelimit.c b/lib/ratelimit.c
> index bd6e3b429e333..03704c6f8899e 100644
> --- a/lib/ratelimit.c
> +++ b/lib/ratelimit.c
> @@ -26,55 +26,161 @@
>   */
>  int ___ratelimit(struct ratelimit_state *rs, const char *func)
>  {
> -	/* Paired with WRITE_ONCE() in .proc_handler().
> -	 * Changing two values seperately could be inconsistent
> -	 * and some message could be lost.  (See: net_ratelimit_state).
> -	 */
> -	int interval = READ_ONCE(rs->interval);
> +	unsigned long begin;
>  	int burst = READ_ONCE(rs->burst);
> +	int delta = 0;
>  	unsigned long flags;
> -	int ret;
> -
> -	if (!interval)
> -		return 1;
> +	bool gotlock = false;
> +	bool initialized;
> +	int interval = READ_ONCE(rs->interval);
> +	unsigned long j;
> +	int n_left;
>  
>  	/*
> -	 * If we contend on this state's lock then almost
> -	 * by definition we are too busy to print a message,
> -	 * in addition to the one that will be printed by
> -	 * the entity that is holding the lock already:
> +	 * If the burst or interval settings mark this ratelimit_state
> +	 * structure as disabled, then clear the RATELIMIT_INITIALIZED bit
> +	 * in ->flags to force resetting of the ratelimiting interval when
> +	 * this ratelimit_state structure is next re-enabled.
>  	 */
> -	if (!raw_spin_trylock_irqsave(&rs->lock, flags)) {
> -		ratelimit_state_inc_miss(rs);
> -		return 0;
> +	if (burst <= 0 || interval <= 0) {
> +		if ((READ_ONCE(rs->flags) & RATELIMIT_INITIALIZED) &&
> +		    raw_spin_trylock_irqsave(&rs->lock, flags)) {
> +			if (READ_ONCE(rs->flags) & RATELIMIT_INITIALIZED)
> +				smp_store_release(&rs->flags, rs->flags & ~RATELIMIT_INITIALIZED);
> +			raw_spin_unlock_irqrestore(&rs->lock, flags);
> +		}
> +		return true;
>  	}
>  
> -	if (!(rs->flags & RATELIMIT_INITIALIZED)) {
> -		rs->begin = jiffies;
> -		rs->flags |= RATELIMIT_INITIALIZED;
> +	/*
> +	 * If this structure has just now been ratelimited, but not yet
> +	 * reset for the next rate-limiting interval, take an early and
> +	 * low-cost exit.
> +	 */
> +	if (atomic_read_acquire(&rs->rs_n_left) <= 0) /* Pair with release. */
> +		goto limited;
> +
> +	/*
> +	 * If this structure is marked as initialized and has been
> +	 * recently used, pick up its ->begin field.  Otherwise, pick up
> +	 * the current time and attempt to re-initialize the structure.
> +	 */
> +	j = jiffies;
> +	initialized = smp_load_acquire(&rs->flags) & RATELIMIT_INITIALIZED; /* Pair with release. */
> +	if (initialized) {
> +		begin = READ_ONCE(rs->begin);
> +	} else {
> +		/*
> +		 * Uninitialized or long idle, so reset ->begin and
> +		 * mark initialized.  If we fail to acquire the lock,
> +		 * let the lock holder do the work.
> +		 */
> +		begin = j;
> +		if (raw_spin_trylock_irqsave(&rs->lock, flags)) {
> +			if (!(READ_ONCE(rs->flags) & RATELIMIT_INITIALIZED)) {
> +				begin = jiffies;
> +				j = begin;
> +				WRITE_ONCE(rs->begin, begin);
> +				smp_store_release(&rs->flags, /* Pair with acquire. */
> +						  rs->flags | RATELIMIT_INITIALIZED);
> +				initialized = true;
> +			}
> +			raw_spin_unlock_irqrestore(&rs->lock, flags);
> +		}
>  	}
>  
> -	if (time_is_before_jiffies(rs->begin + interval)) {
> -		int m = ratelimit_state_reset_miss(rs);
> +	/*
> +	 * If this structure is still in the interval in which it has
> +	 * already hit the rate limit, take an early and low-cost exit.
> +	 */
> +	if (initialized && time_before(begin - 2 * interval, j) && time_before(j, begin))

How do we know that the previous interval was ratelimited, please?
I have to admit that I still do not fully get the new logic ;-)

> +		goto limited;
>  
> -		if (m) {
> -			if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE)) {
> -				printk_deferred(KERN_WARNING
> -						"%s: %d callbacks suppressed\n", func, m);
> -			}
> +	/*
> +	 * Register another request, and take an early (but not low-cost)
> +	 * exit if rate-limiting just now came into effect.
> +	 */
> +	n_left = atomic_dec_return(&rs->rs_n_left);
> +	if (n_left < 0)
> +		goto limited; /* Just now started ratelimiting. */
> +	if (n_left > 0) {
> +		/*
> +		 * Otherwise, there is not yet any rate limiting for the
> +		 * current interval, and furthermore there is at least one
> +		 * last count remaining.  But check to see if initialization
> +		 * is required or if we have run off the end of the interval
> +		 * without rate limiting having been imposed.  Either way,
> +		 * we eventually return @true to tell our caller to go ahead.
> +		 */
> +		if (initialized &&
> +		    time_before(begin - interval, j) && time_before(j, begin + interval))
> +			return true;  /* Nothing special to do. */
> +		if (!raw_spin_trylock_irqsave(&rs->lock, flags))
> +			return true; /* Let lock holder do special work. */
> +		interval = READ_ONCE(rs->interval);
> +		begin = rs->begin;
> +		initialized = smp_load_acquire(&rs->flags) & RATELIMIT_INITIALIZED;
> +		if (interval <= 0 ||
> +		    (initialized &&
> +		     time_before(begin - interval, j) && time_before(j, begin + interval))) {
> +			/*
> +			 * Someone else beat us to the special work,
> +			 * so release the lock and return.
> +			 */

I am confused by the comment. IMHO, we are here when we do not need
to reset the interval. It happens either when the interval is
infinite (interval < 0) or when we are still in the current interval
and more calls are allowed (n_left > 0).

> +			raw_spin_unlock_irqrestore(&rs->lock, flags);
> +			return true;
>  		}
> -		rs->begin   = jiffies;
> -		rs->printed = 0;
> +
> +		/* We have the lock and will do initialization. */
> +		gotlock = true;
> +		delta = -1;
>  	}
> -	if (burst && burst > rs->printed) {
> -		rs->printed++;
> -		ret = 1;
> -	} else {
> -		ratelimit_state_inc_miss(rs);
> -		ret = 0;
> +	if (!gotlock) {
> +		/*
> +		 * We get here if we got the last count (n_left == 0),
> +		 * so that rate limiting is in effect for the next caller.
> +		 * We will return @true to tell our caller to go ahead,
> +		 * but first we acquire the lock and set things up for
> +		 * the next rate-limiting interval.
> +		 */
> +		raw_spin_lock_irqsave(&rs->lock, flags);

I do not have a good feeling about this. It could cause a deadlock
in panic() when other CPUs are stopped or in NMI. And the ratelimit
used to be "safe" in this context.

Honestly, the new code is a bit hard to follow for me. The original
logic was more straightforward. I wonder if we could reduce the problem
"just" by checking the remaining allowed count.

I mean introducing the atomic rs_n_left and checking it when
it is not possible to take the lock, while keeping most of
the current logic. Something like:

From 1df6d1a79645dd5c0ad7cef7c94b802de202057e Mon Sep 17 00:00:00 2001
From: Petr Mladek <pmladek@...e.com>
Date: Tue, 8 Apr 2025 16:00:22 +0200
Subject: [PATCH 1/2] ratelimit: Reduce ratelimit's false-positive
 (alternative)

---
 include/linux/ratelimit.h       |  2 +-
 include/linux/ratelimit_types.h |  2 +-
 lib/ratelimit.c                 | 51 ++++++++++++++++++++++++---------
 3 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/include/linux/ratelimit.h b/include/linux/ratelimit.h
index adfec24061d1..7aaad158ee37 100644
--- a/include/linux/ratelimit.h
+++ b/include/linux/ratelimit.h
@@ -44,7 +44,7 @@ static inline void ratelimit_state_reset_interval(struct ratelimit_state *rs, in
 	raw_spin_lock_irqsave(&rs->lock, flags);
 	rs->interval = interval_init;
 	rs->flags &= ~RATELIMIT_INITIALIZED;
-	rs->printed = 0;
+	atomic_set(&rs->rs_n_left, rs->burst);
 	ratelimit_state_reset_miss(rs);
 	raw_spin_unlock_irqrestore(&rs->lock, flags);
 }
diff --git a/include/linux/ratelimit_types.h b/include/linux/ratelimit_types.h
index ef6711b6b229..b19c4354540a 100644
--- a/include/linux/ratelimit_types.h
+++ b/include/linux/ratelimit_types.h
@@ -18,7 +18,7 @@ struct ratelimit_state {
 
 	int		interval;
 	int		burst;
-	int		printed;
+	atomic_t	rs_n_left;
 	atomic_t	missed;
 	unsigned int	flags;
 	unsigned long	begin;
diff --git a/lib/ratelimit.c b/lib/ratelimit.c
index bd6e3b429e33..90c9fe57eb42 100644
--- a/lib/ratelimit.c
+++ b/lib/ratelimit.c
@@ -39,12 +39,22 @@ int ___ratelimit(struct ratelimit_state *rs, const char *func)
 		return 1;
 
 	/*
-	 * If we contend on this state's lock then almost
-	 * by definition we are too busy to print a message,
-	 * in addition to the one that will be printed by
-	 * the entity that is holding the lock already:
+	 * If we contend on this state's lock then just check if
+	 * the current burst is used or not. It might cause
+	 * false positive when we are past the interval and
+	 * the current lock owner is just about to reset it.
 	 */
 	if (!raw_spin_trylock_irqsave(&rs->lock, flags)) {
+		unsigned int rs_flags = READ_ONCE(rs->flags);
+
+		if (rs_flags & RATELIMIT_INITIALIZED && burst) {
+			int n_left;
+
+			n_left = atomic_dec_return(&rs->rs_n_left);
+			if (n_left >= 0)
+				return 1;
+		}
+
 		ratelimit_state_inc_miss(rs);
 		return 0;
 	}
@@ -52,27 +62,42 @@ int ___ratelimit(struct ratelimit_state *rs, const char *func)
 	if (!(rs->flags & RATELIMIT_INITIALIZED)) {
 		rs->begin = jiffies;
 		rs->flags |= RATELIMIT_INITIALIZED;
+		atomic_set(&rs->rs_n_left, rs->burst);
 	}
 
 	if (time_is_before_jiffies(rs->begin + interval)) {
-		int m = ratelimit_state_reset_miss(rs);
+		int m;
 
+		/*
+		 * Reset rs_n_left ASAP to reduce false positives
+		 * in parallel calls, see above.
+		 */
+		atomic_set(&rs->rs_n_left, rs->burst);
+		rs->begin = jiffies;
+
+		m = ratelimit_state_reset_miss(rs);
 		if (m) {
 			if (!(rs->flags & RATELIMIT_MSG_ON_RELEASE)) {
 				printk_deferred(KERN_WARNING
 						"%s: %d callbacks suppressed\n", func, m);
 			}
 		}
-		rs->begin   = jiffies;
-		rs->printed = 0;
 	}
-	if (burst && burst > rs->printed) {
-		rs->printed++;
-		ret = 1;
-	} else {
-		ratelimit_state_inc_miss(rs);
-		ret = 0;
+	if (burst) {
+		int n_left;
+
+		/* The burst might have been taken by a parallel call. */
+		n_left = atomic_dec_return(&rs->rs_n_left);
+		if (n_left >= 0) {
+			ret = 1;
+			goto unlock_ret;
+		}
 	}
+
+	ratelimit_state_inc_miss(rs);
+	ret = 0;
+
+unlock_ret:
 	raw_spin_unlock_irqrestore(&rs->lock, flags);
 
 	return ret;
-- 
2.49.0

It is still racy. For example, a caller that fails the trylock can still
do the atomic_dec_return() on the already exhausted counter and record
a miss right before the lock owner resets rs_n_left and rs->begin for
the next interval. But I think that all versions were somehow racy.

BTW: The new selftest fails with it. There seem to be two kinds of
     problems:

1. Timing. The test seems to be a bit sensitive to timing.
   It helped me not to go that close to the "interval":

diff --git a/lib/test_ratelimit.c b/lib/test_ratelimit.c
index 3d6db9be6be2..539aa968e858 100644
--- a/lib/test_ratelimit.c
+++ b/lib/test_ratelimit.c
@@ -24,19 +24,19 @@ static void test_ratelimit_smoke(struct kunit *test)
 	test_ratelimited(test, true);
 	test_ratelimited(test, false);
 
-	schedule_timeout_idle(TESTRL_INTERVAL - 20);
+	schedule_timeout_idle(TESTRL_INTERVAL - 40);
 	test_ratelimited(test, false);
 
-	schedule_timeout_idle(30);
+	schedule_timeout_idle(50);
 	test_ratelimited(test, true);
 
 	schedule_timeout_idle(2 * TESTRL_INTERVAL);
 	test_ratelimited(test, true);
 	test_ratelimited(test, true);
 
-	schedule_timeout_idle(TESTRL_INTERVAL - 20);
+	schedule_timeout_idle(TESTRL_INTERVAL - 40);
 	test_ratelimited(test, true);
-	schedule_timeout_idle(30);
+	schedule_timeout_idle(50);
 	test_ratelimited(test, true);
 	test_ratelimited(test, true);
 	test_ratelimited(test, true);
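
   For the record, the sums stay the same; only the margin before the
   interval boundary grows:

	original: (TESTRL_INTERVAL - 20) + 30 = TESTRL_INTERVAL + 10
	patched:  (TESTRL_INTERVAL - 40) + 50 = TESTRL_INTERVAL + 10

   so the "true" check still runs roughly 10 jiffies after the interval
   ends, while the earlier "false" check keeps a 40-jiffy margin
   instead of 20.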


2. It still failed on the burst = 0 test:

	// Test disabling.
	testrl.burst = 0;
	test_ratelimited(test, true);

   I haven't tried the selftest with the original code, but it looks
   to me like it always returned "false" here as well. I agree, though,
   that it would make more sense to return "true" here.
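
   For reference, this is the return-value branch of the original code
   (as quoted in the diff above):

	if (burst && burst > rs->printed) {
		rs->printed++;
		ret = 1;
	} else {
		ratelimit_state_inc_miss(rs);
		ret = 0;
	}

   With burst == 0 the first condition can never be true, so the
   original ___ratelimit() always takes the miss branch and returns 0
   here.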

Best Regards,
Petr
