Message-ID: <qyuogq43hmgqguiswjt2r5hyxlk6epuc6gs3n6votoc4nwk5lf@o2alaj7nrgy2>
Date: Sun, 26 May 2024 14:45:16 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Dennis Zhou <dennis@...nel.org>
Cc: tj@...nel.org, hughd@...gle.com, akpm@...ux-foundation.org, 
	vbabka@...e.cz, linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v3] percpu_counter: add a cmpxchg-based _add_batch variant

On Wed, May 22, 2024 at 04:52:00PM -0700, Dennis Zhou wrote:
> On Wed, May 22, 2024 at 06:59:02AM +0200, Mateusz Guzik wrote:
> +     do {
> +             if (unlikely(abs(count + amount) >= batch)) {
> +                     raw_spin_lock_irqsave(&fbc->lock, flags);
> +                     /*
> +                      * Note: by now we might have migrated to another CPU
> +                      * or the value might have changed.
> +                      */
> +                     count = __this_cpu_read(*fbc->counters);
> +                     fbc->count += count + amount;
> +                     __this_cpu_sub(*fbc->counters, count);
> +                     raw_spin_unlock_irqrestore(&fbc->lock, flags);
> +                     return;
> +             }
> +     } while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
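
For context, the usual entry point is percpu_counter_add(), which is
just a thin wrapper feeding the default batch size into the routine
above, roughly:

/* include/linux/percpu_counter.h, roughly: the default-batch wrapper */
static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
}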

For completeness, I'll add that Vlastimil suggested creating an inline
variant of the fast path of the routine, which does look reasonable now
that it is down to two branches. Should it land, the reworked routine I
posted would serve as the slowpath.

I hacked it up (inlined below for reference).

This of course creates a tradeoff between function call cost and
i-cache footprint. bloat-o-meter on a Debian-based kernel config says:
Total: Before=24584638, After=24592309, chg +0.03%

I don't know if that automatically disqualifies the thing.
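
(For reference, the delta comes from comparing the vmlinux binaries
built without and with the inline variant, along the lines of:

	./scripts/bloat-o-meter vmlinux.old vmlinux.new

where vmlinux.old/vmlinux.new are just placeholder names for the two
builds.)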

At the moment, the negative lookup test I used to bench the cmpxchg
change fails to register a win due to numerous artificial slowdowns in
that codepath (most notably, memory alloc/free is incredibly slow).

That is to say, at the moment I cannot justify submitting the inline
variant for inclusion. If someone has a test which bounces off of the
routine, they are welcome to bench it and share whether it helps.
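
For illustration, the negative lookup test boils down to hammering a
lookup of a path which does not exist. A minimal single-threaded
user-space sketch (file name and iteration count picked arbitrarily; a
real run would do this from many threads) looks something like:

/*
 * Negative-lookup loop: every access() call fails with ENOENT, so only
 * the lookup path (and whatever it allocates/frees along the way) gets
 * exercised.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const long iters = 10 * 1000 * 1000;
	struct timespec beg, end;
	double secs;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &beg);
	for (i = 0; i < iters; i++)
		access("/tmp/does-not-exist", R_OK);
	clock_gettime(CLOCK_MONOTONIC, &end);

	secs = (end.tv_sec - beg.tv_sec) + (end.tv_nsec - beg.tv_nsec) / 1e9;
	printf("%.0f lookups/s\n", iters / secs);
	return 0;
}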

I do, however, intend to whack out some of the problems in the test I
was running; by the end of that process the inline variant may turn out
to provide a measurable win, in which case I'll send a proper patch.

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index 3a44dd1e33d2..b0038811b69f 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -52,9 +52,21 @@ static inline void percpu_counter_destroy(struct percpu_counter *fbc)
 	percpu_counter_destroy_many(fbc, 1);
 }
 
-void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
+#ifdef CONFIG_HAVE_CMPXCHG_LOCAL
+void percpu_counter_add_batch_slow(struct percpu_counter *fbc, s64 amount, s32 batch);
+static inline void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
+{
+	s64 count = this_cpu_read(*fbc->counters);
+	if (unlikely(abs(count + amount) >= batch ||
+		     !this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount)))
+		percpu_counter_add_batch_slow(fbc, amount, batch);
+}
+#else
 void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
+#endif
+
+void percpu_counter_set(struct percpu_counter *fbc, s64 amount);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
 bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index c3140276bb36..80fa1aa7a1bb 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -90,7 +90,7 @@ EXPORT_SYMBOL(percpu_counter_set);
  * 1. the fast path uses local cmpxchg (note: no lock prefix)
  * 2. the slow path operates with interrupts disabled
  */
-void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
+void percpu_counter_add_batch_slow(struct percpu_counter *fbc, s64 amount, s32 batch)
 {
 	s64 count;
 	unsigned long flags;
@@ -111,6 +111,7 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 		}
 	} while (!this_cpu_try_cmpxchg(*fbc->counters, &count, count + amount));
 }
+EXPORT_SYMBOL(percpu_counter_add_batch_slow);
 #else
 /*
  * local_irq_save() is used to make the function irq safe:
@@ -134,8 +135,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount, s32 batch)
 	}
 	local_irq_restore(flags);
 }
-#endif
 EXPORT_SYMBOL(percpu_counter_add_batch);
+#endif
 
 /*
 * For percpu_counter with a big batch, the deviation of its count could
