[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <44407892-b7bc-4d6c-8e4a-6452f0ee88b9@paulmck-laptop>
Date: Sat, 27 May 2023 04:14:05 -0700
From: "Paul E. McKenney" <paulmck@...nel.org>
To: Qi Zheng <qi.zheng@...ux.dev>
Cc: RCU <rcu@...r.kernel.org>, Yujie Liu <yujie.liu@...el.com>,
oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Vlastimil Babka <vbabka@...e.cz>, Kirill Tkhai <tkhai@...ru>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Christian König <christian.koenig@....com>,
David Hildenbrand <david@...hat.com>,
Davidlohr Bueso <dave@...olabs.net>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Muchun Song <muchun.song@...ux.dev>,
Shakeel Butt <shakeelb@...gle.com>,
Yang Shi <shy828301@...il.com>, linux-mm@...ck.org,
ying.huang@...el.com, feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec
-88.8% regression
On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> On 2023/5/24 19:56, Qi Zheng wrote:
> > On 2023/5/24 19:08, Qi Zheng wrote:
> >
> > [...]
> >
> > >
> > > Well, I just ran the following command and reproduced the result:
> > >
> > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > >
> > > 1) with commit 42c9db3970483:
> > >
> > > stress-ng: info: [11023] setting to a 60 second run per stressor
> > > stress-ng: info: [11023] dispatching hogs: 9 ramfs
> > > stress-ng: info: [11023] stressor bogo ops real time usr
> > > time sys time bogo ops/s bogo ops/s
> > > stress-ng: info: [11023] (secs) (secs)
> > > (secs) (real time) (usr+sys time)
> > > stress-ng: info: [11023] ramfs 774966 60.00
> > > 10.18 169.45 12915.89 4314.26
> > > stress-ng: info: [11023] for a 60.00s run time:
> > > stress-ng: info: [11023] 1920.11s available CPU time
> > > stress-ng: info: [11023] 10.18s user time ( 0.53%)
> > > stress-ng: info: [11023] 169.44s system time ( 8.82%)
> > > stress-ng: info: [11023] 179.62s total time ( 9.35%)
> > > stress-ng: info: [11023] load average: 8.99 2.69 0.93
> > > stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> > > 0.00 secs)
> > >
> > > 2) with commit f95bdb700bc6b:
> > >
> > > stress-ng: info: [37676] dispatching hogs: 9 ramfs
> > > stress-ng: info: [37676] stressor bogo ops real time usr
> > > time sys time bogo ops/s bogo ops/s
> > > stress-ng: info: [37676] (secs) (secs)
> > > (secs) (real time) (usr+sys time)
> > > stress-ng: info: [37676] ramfs 168673 60.00
> > > 1.61 39.66 2811.08 4087.47
> > > stress-ng: info: [37676] for a 60.10s run time:
> > > stress-ng: info: [37676] 1923.36s available CPU time
> > > stress-ng: info: [37676] 1.60s user time ( 0.08%)
> > > stress-ng: info: [37676] 39.66s system time ( 2.06%)
> > > stress-ng: info: [37676] 41.26s total time ( 2.15%)
> > > stress-ng: info: [37676] load average: 7.69 3.63 2.36
> > > stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> > > 0.10 secs)
> > >
> > > The bogo ops/s (real time) did drop significantly.
> > >
> > > And the memory reclaimation was not triggered in the whole process. so
> > > theoretically no one is in the read critical section of shrinker_srcu.
> > >
> > > Then I found that some stress-ng-ramfs processes were in
> > > TASK_UNINTERRUPTIBLE state for a long time:
> > >
> > > root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > >
> > > Their call stack is as follows:
> > >
> > > cat /proc/42330/stack
> > >
> > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > [<0>] unregister_shrinker+0x85/0xb0
> > > [<0>] deactivate_locked_super+0x27/0x70
> > > [<0>] cleanup_mnt+0xb8/0x140
> > > [<0>] task_work_run+0x65/0x90
> > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > [<0>] do_syscall_64+0x44/0x80
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > >
> > > + RCU folks, Is this result as expected? I would have thought that
> > > synchronize_srcu() should return quickly if no one is in the read
> > > critical section. :(
In theory, it would indeed be nice if synchronize_srcu() would do that.
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.
So SRCU has to be careful to not check too frequently.
However, if SRCU has been idle for some time, normal synchronize_srcu()
will do an immediate check. And this will of course mark SRCU as
non-idle.
> > With the following changes, ops/s can return to previous levels:
>
> Or just set rcu_expedited to 1:
> echo 1 > /sys/kernel/rcu_expedited
This does cause SRCU to be much more aggressive. This can be a good
choice for small systems, but please keep in mind that this affects normal
RCU as well as SRCU. It will cause RCU to also be much more aggressive,
sending IPIs to CPUs that are (or might be) in RCU read-side critical
sections. Depending on your workload, this might or might not be what
you want RCU to be doing. For example, if you are running aggressive
real-time workloads, it most definitely is not what you want.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index db2ed6e08f67..90f541b07cd1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> > debugfs_entry = shrinker_debugfs_remove(shrinker);
> > up_write(&shrinker_rwsem);
> >
> > - synchronize_srcu(&shrinker_srcu);
> > + synchronize_srcu_expedited(&shrinker_srcu);
If shrinkers are unregistered only occasionally, this is an entirely
reasonable change.
> > debugfs_remove_recursive(debugfs_entry);
> >
> > stress-ng: info: [13159] dispatching hogs: 9 ramfs
> > stress-ng: info: [13159] stressor bogo ops real time usr time
> > sys time bogo ops/s bogo ops/s
> > stress-ng: info: [13159] (secs) (secs)
> > (secs) (real time) (usr+sys time)
> > stress-ng: info: [13159] ramfs 710062 60.00 9.63
> > 157.26 11834.18 4254.75
> > stress-ng: info: [13159] for a 60.00s run time:
> > stress-ng: info: [13159] 1920.14s available CPU time
> > stress-ng: info: [13159] 9.62s user time ( 0.50%)
> > stress-ng: info: [13159] 157.26s system time ( 8.19%)
> > stress-ng: info: [13159] 166.88s total time ( 8.69%)
> > stress-ng: info: [13159] load average: 9.49 4.02 1.65
> > stress-ng: info: [13159] successful run completed in 60.00s (1 min,
> > 0.00 secs)
> >
> > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > one is in the read critical section?
Yes, in theory we could, but this would be a bad thing in practice.
After all, the point of having synchronize_srcu() be separate from
synchronize_srcu_expedited() is to allow uses that are OK with longer
latency avoid consuming too much CPU. In addition, that longer
SRCU grace-period latency allows the next grace period to handle more
synchronize_srcu() and call_srcu() requests. This amortizes the
overhead of that next grace period over a larger number of updates.
However, your use of synchronize_srcu_expedited() does have that effect,
but only for this call point. Which has the advantage of avoiding
burning excessive quantities of CPU for the other 50+ call points.
Thanx, Paul
Powered by blists - more mailing lists