Message-ID: <2vznaaotzkgkrfoi2qitiwdjinpl7ozhpz7w6n7577kaa2hpki@okh2mkqqhbkq>
Date: Thu, 27 Mar 2025 15:38:50 +0100
From: Mateusz Guzik <mjguzik@...il.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Greg Thelen <gthelen@...gle.com>, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Michal Koutný <mkoutny@...e.com>,
Andrew Morton <akpm@...ux-foundation.org>, Eric Dumazet <edumzaet@...gle.com>, cgroups@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>
Subject: Re: [PATCH] cgroup/rstat: avoid disabling irqs for O(num_cpu)
On Wed, Mar 19, 2025 at 05:18:05PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 19, 2025 at 11:47:32AM +0100, Mateusz Guzik wrote:
> > Is not this going a little too far?
> >
> > the lock + irq trip is quite expensive in its own right and now is
> > going to be paid for each cpu, as in the total time spent executing
> > cgroup_rstat_flush_locked is going to go up.
> >
> > Would your problem go away toggling this every -- say -- 8 cpus?
>
> I was concerned about this too, and about more lock bouncing, but the
> testing suggests that this actually overall improves the latency of
> cgroup_rstat_flush_locked() (at least on tested HW).
>
> So I don't think we need to do something like this unless a regression
> is observed.
>
To my reading it reduces the max time spent with irqs disabled, which of
course it does -- after all it now drops and re-takes the lock for every CPU.

Per my other e-mail in the thread, the irq + lock trips remain not cheap,
at least on Sapphire Rapids.

In my testing outlined below I see an 11% increase in total execution time
when the irq + lock trip is taken for every CPU in a 24-way vm.

So I stand by instead doing this every n CPUs, call it 8 or whatever.
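
For illustration, a rough sketch of what I have in mind (untested;
FLUSH_RELOCK_BATCH and nr_flushed are made-up names, the rest mirrors
the existing relock block in cgroup_rstat_flush_locked()):

#define FLUSH_RELOCK_BATCH      8

        unsigned int nr_flushed = 0;

        for_each_possible_cpu(cpu) {
                /* ... per-cpu flush of the updated cgroups, unchanged ... */

                /*
                 * Pay the unlock + irq-enable round trip once per batch, or
                 * earlier if someone actually wants the CPU or the lock.
                 */
                if (!(++nr_flushed % FLUSH_RELOCK_BATCH) ||
                    need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                        __cgroup_rstat_unlock(cgrp, cpu);
                        if (!cond_resched())
                                cpu_relax();
                        __cgroup_rstat_lock(cgrp, cpu);
                }
        }

That bounds the irq-off window to at most 8 CPUs' worth of flushing while
only paying the relock cost on every 8th iteration.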
How to repro:
I employed a poor-man's profiler like so:
bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @start[tid] = nsecs; } kretprobe:cgroup_rstat_flush_locked /@start[tid]/ { print(nsecs - @start[tid]); delete(@start[tid]); } interval:s:60 { exit(); }'
This patch or not, execution time varies wildly even while the box is idle.

The above runs for a minute, collecting 23 samples (you may get "lucky"
and get one extra; in that case remove it for comparison).

A sysctl was added to toggle the new behavior vs the old one; the patch
is at the end. "enabled" (1) means the new behavior, "disabled" (0) means
the old one.
Sum of nsecs (results piped to: awk '{ sum += $1 } END { print sum }'):
disabled: 903610
enabled: 1006833 (+11.4%)
Toggle at runtime with:
sysctl fs.magic_tunable=0 # disabled, no mandatory relocks
sysctl fs.magic_tunable=1 # enabled, relock for every CPU
I attached the stats I got for reference.
I patched v6.14 with the following:
diff --git a/fs/file_table.c b/fs/file_table.c
index c04ed94cdc4b..441f89421413 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -106,6 +106,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
         return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+unsigned long magic_tunable;
+
 static const struct ctl_table fs_stat_sysctls[] = {
         {
                 .procname       = "file-nr",
@@ -123,6 +125,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
                 .extra1         = SYSCTL_LONG_ZERO,
                 .extra2         = SYSCTL_LONG_MAX,
         },
+        {
+                .procname       = "magic_tunable",
+                .data           = &magic_tunable,
+                .maxlen         = sizeof(magic_tunable),
+                .mode           = 0644,
+                .proc_handler   = proc_doulongvec_minmax,
+                .extra1         = SYSCTL_LONG_ZERO,
+                .extra2         = SYSCTL_LONG_MAX,
+        },
+
         {
                 .procname       = "nr_open",
                 .data           = &sysctl_nr_open,
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 3e01781aeb7b..f6444bf25b2f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -299,6 +299,8 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
         spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+extern unsigned long magic_tunable;
+
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
         __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -323,12 +325,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
                         rcu_read_unlock();
                 }
 
-                /* play nice and yield if necessary */
-                if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+                if (READ_ONCE(magic_tunable)) {
                         __cgroup_rstat_unlock(cgrp, cpu);
                         if (!cond_resched())
                                 cpu_relax();
                         __cgroup_rstat_lock(cgrp, cpu);
+                } else {
+                        if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+                                __cgroup_rstat_unlock(cgrp, cpu);
+                                if (!cond_resched())
+                                        cpu_relax();
+                                __cgroup_rstat_lock(cgrp, cpu);
+                        }
                 }
         }
 }