[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c8182a5a-e804-4fcc-a6a5-bb121260e6a6@joelfernandes.org>
Date: Tue, 27 Feb 2024 15:51:03 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: "Uladzislau Rezki (Sony)" <urezki@...il.com>,
"Paul E . McKenney" <paulmck@...nel.org>
Cc: RCU <rcu@...r.kernel.org>, Neeraj upadhyay <Neeraj.Upadhyay@....com>,
Boqun Feng <boqun.feng@...il.com>, Hillf Danton <hdanton@...a.com>,
LKML <linux-kernel@...r.kernel.org>,
Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency
On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> A call to a synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit of it.
>
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
>
> - how fast a process of offloading is started. Combination of:
> - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
> - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
> - other.
> - when started, invoking path is interrupted due to:
> - time limit;
> - need_resched();
> - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
>
> Example:
>
> 1. On our embedded devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
>
> <snip>
> <...>-29 [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
> <...>-29 [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
> <...>-29 [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
> <...>-29 [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
>
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
>
> cgroup_attach_task():
> -> cgroup_migrate_execute()
> -> cpuset_can_attach()
> -> percpu_down_write()
> -> rcu_sync_enter()
> -> synchronize_rcu()
We should do this patch but I wonder also if cgroup_attach_task() usage of
synchronize_rcu() should actually be using the _expedited() variant (via some
possible flag to the percpu rwsem / rcu_sync).
If the user assumes it a slow path, then usage of _expedited() should probably
be OK. If it is assumed a fast path, then it is probably hurting latency anyway
without the enablement of this patch's rcu_normal_wake_from_gp.
Thoughts?
Then it becomes a matter of how to plumb the expeditedness down the stack.
Also, speaking of percpu rwsem, I noticed that percpu refcounts don't use
rcu_sync. I haven't looked closely why, but something I hope to get time to look
into is if it can be converted over and what benefits would that entail if any.
Also will continue reviewing the patch. Thanks.
- Joel
Powered by blists - more mailing lists