Message-ID: <ZTlNogQ_nWUzVJ9M@boqun-archlinux>
Date: Wed, 25 Oct 2023 10:17:22 -0700
From: Boqun Feng <boqun.feng@...il.com>
To: "Uladzislau Rezki (Sony)" <urezki@...il.com>
Cc: "Paul E . McKenney" <paulmck@...nel.org>,
RCU <rcu@...r.kernel.org>,
Neeraj Upadhyay <Neeraj.Upadhyay@....com>,
Hillf Danton <hdanton@...a.com>,
Joel Fernandes <joel@...lfernandes.org>,
LKML <linux-kernel@...r.kernel.org>,
Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH 1/3] rcu: Reduce synchronize_rcu() waiting time
On Wed, Oct 25, 2023 at 04:09:13PM +0200, Uladzislau Rezki (Sony) wrote:
> A call to synchronize_rcu() can be optimized from a time (latency)
> point of view. Different workloads can be affected by this, especially
> the ones which use this API in their time-critical sections.
>
> For example, if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> callback can be delayed, and the delay depends on:
>
> - where in a nocb list it is located;
> - how fast previous callbacks completed.
>
> 1. On our Android devices I can easily trigger a scenario in which
> it is the last in a list of ~3600 callbacks:
>
I wonder how many of the callbacks are queued via call_rcu_hurry()? If
not a lot, I wonder whether we can resolve the problem differently; see
below.
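(For context: synchronize_rcu() blocks by queueing its own callback via
call_rcu_hurry() and sleeping on a completion. Roughly, from
kernel/rcu/update.c, modulo details:

struct rcu_synchronize {
	struct rcu_head head;
	struct completion completion;
};

void wakeme_after_rcu(struct rcu_head *head)
{
	struct rcu_synchronize *rcu;

	rcu = container_of(head, struct rcu_synchronize, head);
	complete(&rcu->completion);
}

The waiter is only woken once its rcu_head is reached in the callback
list, which is why its position among ~3600 callbacks matters, as the
trace below shows.)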
> <snip>
> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
> <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them to
> different cgroups. For example, the "background" group binds tasks
> only to little CPUs, while the "foreground" group makes use of all
> CPUs. Tasks can be migrated between groups on request if
> acceleration is needed.
>
> See below an example of how the "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup, which
> allows it to run only on little cores. In order to speed it up, it
> can be temporarily moved into the "foreground" cgroup, which allows
> it to use big/all CPUs:
>
> cgroup_attach_task():
> -> cgroup_migrate_execute()
> -> cpuset_can_attach()
> -> percpu_down_write()
> -> rcu_sync_enter()
> -> synchronize_rcu()
> -> now move tasks to the new cgroup.
> -> cgroup_migrate_finish()
>
> <snip>
> rcuop/1-29 [000] ..... 7030.528570: rcu_invoke_callback: rcu_preempt rhp=00000000461605e0 func=wakeme_after_rcu.cfi_jt
> PERFD-SERVER-1855 [000] d..1. 7030.530293: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
> PERFD-SERVER-1855 [000] d..1. 7030.530383: cgroup_attach_task: dst_root=3 dst_id=22 dst_level=1 dst_path=/foreground pid=1900 comm=surfaceflinger
> TimerDispatch-2768 [002] d..5. 7030.537542: sched_migrate_task: comm=surfaceflinger pid=1900 prio=98 orig_cpu=0 dest_cpu=4
> <snip>
>
> "A moving time" depends on how fast synchronize_rcu() completes. See
> the first trace line. The migration has not occurred until the sync
> was done first. Please note, number of different callbacks to be
> invoked can be thousands.
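[ Side note: the grace-period wait enters the picture via the write side
of the percpu rwsem. A sketch only (cf. kernel/locking/percpu-rwsem.c,
not verbatim):

void percpu_down_write(struct percpu_rw_semaphore *sem)
{
	/* Flip readers to the slow path; from the idle state this
	 * waits for a full grace period, which is the
	 * synchronize_rcu() latency described above. */
	rcu_sync_enter(&sem->rss);
	/* ... then wait for active fast-path readers to drain ... */
}
]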
>
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. The GP kthread that drives a GP
> either wakes up a worker to drain the whole list, or directly wakes up
> the end user if it is the only one in the drain list.
>
Late to the party, but I kinda wonder whether we can resolve it by:
1) either introducing a separate seglist that only contains callbacks
queued by call_rcu_hurry(), so that whenever a GP ends and callbacks
become ready, the call_rcu_hurry() callbacks are invoked first;
2) or making call_rcu_hurry() callbacks always be inserted at the head
of the NEXT list instead of the tail, e.g. (untested code):
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index f71fac422c8f..89a875f8ecc7 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -338,13 +338,21 @@ bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp)
* absolutely not OK for it to ever miss posting a callback.
*/
void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
- struct rcu_head *rhp)
+ struct rcu_head *rhp,
+ bool is_lazy)
{
rcu_segcblist_inc_len(rsclp);
rcu_segcblist_inc_seglen(rsclp, RCU_NEXT_TAIL);
- rhp->next = NULL;
- WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
- WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next);
+	/* Hurry, and NEXT non-empty (seglen already counts rhp, hence > 1): queue at the head of NEXT. */
+ if (!is_lazy && rcu_segcblist_get_seglen(rsclp, RCU_NEXT_TAIL) > 1) {
+ // hurry callback, queued at front
+ rhp->next = READ_ONCE(*rsclp->tails[RCU_NEXT_READY_TAIL]);
+ WRITE_ONCE(*rsclp->tails[RCU_NEXT_READY_TAIL], rhp);
+ } else {
+ rhp->next = NULL;
+ WRITE_ONCE(*rsclp->tails[RCU_NEXT_TAIL], rhp);
+ WRITE_ONCE(rsclp->tails[RCU_NEXT_TAIL], &rhp->next);
+ }
}
/*
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 4fe877f5f654..459475bb8df9 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -136,7 +136,8 @@ struct rcu_head *rcu_segcblist_first_cb(struct rcu_segcblist *rsclp);
struct rcu_head *rcu_segcblist_first_pend_cb(struct rcu_segcblist *rsclp);
bool rcu_segcblist_nextgp(struct rcu_segcblist *rsclp, unsigned long *lp);
void rcu_segcblist_enqueue(struct rcu_segcblist *rsclp,
- struct rcu_head *rhp);
+ struct rcu_head *rhp,
+ bool is_lazy);
bool rcu_segcblist_entrain(struct rcu_segcblist *rsclp,
struct rcu_head *rhp);
void rcu_segcblist_extract_done_cbs(struct rcu_segcblist *rsclp,
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 20d7a238d675..53adf5ab9c9f 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1241,7 +1241,7 @@ static unsigned long srcu_gp_start_if_needed(struct srcu_struct *ssp,
sdp = raw_cpu_ptr(ssp->sda);
spin_lock_irqsave_sdp_contention(sdp, &flags);
if (rhp)
- rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp);
+ rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp, true);
rcu_segcblist_advance(&sdp->srcu_cblist,
rcu_seq_current(&ssp->srcu_sup->srcu_gp_seq));
s = rcu_seq_snap(&ssp->srcu_sup->srcu_gp_seq);
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 8d65f7d576a3..7dec7c68f88f 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -362,7 +362,7 @@ static void call_rcu_tasks_generic(struct rcu_head *rhp, rcu_callback_t func,
}
if (needwake)
rtpcp->urgent_gp = 3;
- rcu_segcblist_enqueue(&rtpcp->cblist, rhp);
+ rcu_segcblist_enqueue(&rtpcp->cblist, rhp, true);
raw_spin_unlock_irqrestore_rcu_node(rtpcp, flags);
if (unlikely(needadjust)) {
raw_spin_lock_irqsave(&rtp->cbs_gbl_lock, flags);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index cb1caefa8bd0..e05cbff40dc7 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2670,7 +2670,7 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy_in)
if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
return; // Enqueued onto ->nocb_bypass, so just leave.
// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
- rcu_segcblist_enqueue(&rdp->cblist, head);
+ rcu_segcblist_enqueue(&rdp->cblist, head, lazy_in);
if (__is_kvfree_rcu_offset((unsigned long)func))
trace_rcu_kvfree_callback(rcu_state.name, head,
(unsigned long)func,
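For reference, the lazy_in flag in the last hunk comes from the
call_rcu()/call_rcu_hurry() wrappers. A sketch (not verbatim
kernel/rcu/tree.c):

void call_rcu_hurry(struct rcu_head *head, rcu_callback_t func)
{
	__call_rcu_common(head, func, false);	/* never lazy */
}

void call_rcu(struct rcu_head *head, rcu_callback_t func)
{
	/* Lazy only when CONFIG_RCU_LAZY is enabled. */
	__call_rcu_common(head, func, IS_ENABLED(CONFIG_RCU_LAZY));
}

So with !CONFIG_RCU_LAZY every call_rcu() callback would take the
front-insertion path too, not just the _hurry() ones; the flag probably
needs to be derived differently, which is among the corner cases below.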
Sure, there may be some corner cases I'm missing, but I think overall
this is better than (sorta) duplicating the logic of seglist (the llist
in sr_normal_state) or the logic of wake_rcu_gp()
(synchronize_rcu_normal).
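To be a bit more concrete about option 1), something along these lines;
just a sketch of the idea, the hurry_cbs field and the helper name are
made up, not existing API:

/* Keep hurry callbacks on a dedicated list so that, once a grace
 * period ends, they can be invoked ahead of the regular ones. */
static void rcu_enqueue_hurry(struct rcu_data *rdp, struct rcu_head *rhp)
{
	rcu_cblist_enqueue(&rdp->hurry_cbs, rhp); /* hypothetical field */
}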
Anyway, these are just if-you-have-time-to-try options ;-)
Regards,
Boqun
> 4. This patch improves the performance of synchronize_rcu() by
> approximately 30% in synthetic tests. The real test case, camera
> launch time, shows the figures below (default vs. patched, time in
> milliseconds):
>
> 542 vs 489 diff: 9%
> 540 vs 466 diff: 13%
> 518 vs 468 diff: 9%
> 531 vs 457 diff: 13%
> 548 vs 475 diff: 13%
> 509 vs 484 diff: 4%
>
> Synthetic test:
>
> Hardware: x86_64 64 CPUs, 64GB of memory
>
> - 60,000 tasks (simultaneous);
> - each task does (1000 loops):
> synchronize_rcu();
> kfree(p);
>
> default: CONFIG_RCU_NOCB_CPU: takes 323 seconds to complete all users;
> patch: CONFIG_RCU_NOCB_CPU: takes 240 seconds to complete all users.
>
> Please note, by default this functionality is OFF and the old way is
> still used instead. In order to activate it, please do:
>
> echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
>
> Signed-off-by: Uladzislau Rezki (Sony) <urezki@...il.com>
> ---
[...]