linux-kernel - Re: [PATCH v5 2/4] rcu: Reduce synchronize


Message-ID: <c8182a5a-e804-4fcc-a6a5-bb121260e6a6@joelfernandes.org>
Date: Tue, 27 Feb 2024 15:51:03 -0500
From: Joel Fernandes <joel@...lfernandes.org>
To: "Uladzislau Rezki (Sony)" <urezki@...il.com>,
 "Paul E . McKenney" <paulmck@...nel.org>
Cc: RCU <rcu@...r.kernel.org>, Neeraj upadhyay <Neeraj.Upadhyay@....com>,
 Boqun Feng <boqun.feng@...il.com>, Hillf Danton <hdanton@...a.com>,
 LKML <linux-kernel@...r.kernel.org>,
 Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
 Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH v5 2/4] rcu: Reduce synchronize_rcu() latency

On 2/20/2024 1:31 PM, Uladzislau Rezki (Sony) wrote:
> A call to synchronize_rcu() can be optimized from a latency
> point of view. Workloads which depend on this can benefit from it.
> 
> The delay of wakeme_after_rcu() callback, which unblocks a waiter,
> depends on several factors:
> 
> - how fast a process of offloading is started. Combination of:
>     - !CONFIG_RCU_NOCB_CPU/CONFIG_RCU_NOCB_CPU;
>     - !CONFIG_RCU_LAZY/CONFIG_RCU_LAZY;
>     - other.
> - when started, invoking path is interrupted due to:
>     - time limit;
>     - need_resched();
>     - if limit is reached.
> - where in a nocb list it is located;
> - how fast previous callbacks completed;
> 
> Example:
> 
> 1. On our embedded devices I can easily trigger the scenario where
> it is the last in the list out of ~3600 callbacks:
> 
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
> 
> 2. We use cpuset/cgroup to classify tasks and assign them to
> different cgroups. For example, the "background" group binds tasks
> only to little CPUs, while "foreground" makes use of all CPUs.
> Tasks can be migrated between groups on request if an acceleration
> is needed.
> 
> Below is an example of how the "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup, which
> allows it to run only on little cores. In order to speed it up, it
> can be temporarily moved into the "foreground" cgroup, which allows
> it to use big/all CPUs:
> 
> cgroup_attach_task():
>  -> cgroup_migrate_execute()
>    -> cpuset_can_attach()
>      -> percpu_down_write()
>        -> rcu_sync_enter()
>          -> synchronize_rcu()

We should take this patch, but I also wonder whether cgroup_attach_task()'s use
of synchronize_rcu() should actually be the _expedited() variant (via some
possible flag to the percpu rwsem / rcu_sync).

If the user treats it as a slow path, then using _expedited() should probably
be OK. If it is treated as a fast path, then it is probably hurting latency
anyway unless this patch's rcu_normal_wake_from_gp is enabled.

Thoughts?

Then it becomes a matter of how to plumb the expeditedness down the stack.
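
To make the plumbing question concrete, here is a rough userspace sketch of
one possible shape: a per-instance flag, set at init time, that the write-side
entry path consults when choosing which grace-period primitive to wait on.
Everything in it is hypothetical. struct rcu_sync has no such flag today, the
*_model and *_stub names are invented for illustration, the stubs only count
calls, and the real rcu_sync state machine is not modelled at all.

```c
/* Userspace model only; none of this is kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters standing in for the real grace-period waits. */
static int normal_gp_calls;
static int expedited_gp_calls;

/* Stand-in for synchronize_rcu(): just records the call. */
static void synchronize_rcu_stub(void)
{
	normal_gp_calls++;
}

/* Stand-in for synchronize_rcu_expedited(): just records the call. */
static void synchronize_rcu_expedited_stub(void)
{
	expedited_gp_calls++;
}

/* HYPOTHETICAL: an rcu_sync-like object carrying an "expedited" flag. */
struct rcu_sync_model {
	bool expedited;
};

/* HYPOTHETICAL: a percpu-rwsem-like object embedding the above. */
struct percpu_rwsem_model {
	struct rcu_sync_model rss;
};

/* The owner of the semaphore decides once, at init time. */
static void percpu_rwsem_model_init(struct percpu_rwsem_model *sem,
				    bool expedited)
{
	assert(sem != NULL);
	sem->rss.expedited = expedited;
}

/* Write-side entry: choose the grace-period flavor from the flag. */
static void rcu_sync_model_enter(struct rcu_sync_model *rsp)
{
	assert(rsp != NULL);
	if (rsp->expedited)
		synchronize_rcu_expedited_stub();
	else
		synchronize_rcu_stub();
}

/* Callers of the write lock do not need to know which flavor is used. */
static void percpu_down_write_model(struct percpu_rwsem_model *sem)
{
	assert(sem != NULL);
	rcu_sync_model_enter(&sem->rss);
}
```

With this shape the choice stays with whoever initializes the semaphore, so
nothing in between (the cgroup_attach_task() chain quoted above, for example)
would need an extra argument. Whether a per-instance flag is the right
granularity is exactly the open question.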

Also, speaking of percpu rwsem, I noticed that percpu refcounts don't use
rcu_sync. I haven't looked closely at why, but something I hope to get time to
look into is whether they can be converted over and what benefits, if any, that
would bring.

I will also continue reviewing the patch. Thanks.

 - Joel