linux-kernel - Re: [PATCH v3 1/1] rcu: Reduce synchronize

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAEXW_YRfuXqnBFN=DpOLio74j8fX3eEDSFCH8LXyavuHDdYysA@mail.gmail.com>
Date:   Mon, 16 Oct 2023 21:28:22 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     "Uladzislau Rezki (Sony)" <urezki@...il.com>
Cc:     "Paul E . McKenney" <paulmck@...nel.org>,
        RCU <rcu@...r.kernel.org>,
        Neeraj upadhyay <Neeraj.Upadhyay@....com>,
        Boqun Feng <boqun.feng@...il.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Oleksiy Avramchenko <oleksiy.avramchenko@...y.com>,
        Frederic Weisbecker <frederic@...nel.org>
Subject: Re: [PATCH v3 1/1] rcu: Reduce synchronize_rcu() waiting time

Hello, Vlad,

Looks like a nice patch, I am still looking at this more and just
started, but a few small comments:

On Mon, Oct 16, 2023 at 1:30 PM Uladzislau Rezki (Sony)
<urezki@...il.com> wrote:
>
> A call to a synchronize_rcu() can be optimized from time point of
> view. Different workloads can be affected by this especially the
> ones which use this API in its time critical sections.
>
> For example if CONFIG_RCU_NOCB_CPU is set, the wakeme_after_rcu()
> callback can be delayed and such delay depends on where in a nocb
> list it is located.
>
> 1. On our Android devices i can easily trigger the scenario when
> it is a last in the list out of ~3600 callbacks:
>
> <snip>
>   <...>-29      [001] d..1. 21950.145313: rcu_batch_start: rcu_preempt CBs=3613 bl=28
> ...
>   <...>-29      [001] ..... 21950.152578: rcu_invoke_callback: rcu_preempt rhp=00000000b2d6dee8 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152579: rcu_invoke_callback: rcu_preempt rhp=00000000a446f607 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152580: rcu_invoke_callback: rcu_preempt rhp=00000000a5cab03b func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152581: rcu_invoke_callback: rcu_preempt rhp=0000000013b7e5ee func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152582: rcu_invoke_callback: rcu_preempt rhp=000000000a8ca6f9 func=__free_vm_area_struct.cfi_jt
>   <...>-29      [001] ..... 21950.152583: rcu_invoke_callback: rcu_preempt rhp=000000008f162ca8 func=wakeme_after_rcu.cfi_jt
>   <...>-29      [001] d..1. 21950.152625: rcu_batch_end: rcu_preempt CBs-invoked=3612 idle=....
> <snip>
>
> 2. We use cpuset/cgroup to classify tasks and assign them into
> different cgroups. For example "backgrond" group which binds tasks
> only to little CPUs or "foreground" which makes use of all CPUs.
> Tasks can be migrated between groups by a request if an acceleration
> is needed.
>
> See below an example how "surfaceflinger" task gets migrated.
> Initially it is located in the "system-background" cgroup which
> allows to run only on little cores. In order to speed it up it
> can be temporary moved into "foreground" cgroup which allows
> to use big/all CPUs:
>
[...]
> 3. To address this drawback, maintain a separate track that consists
> of synchronize_rcu() callers only. The GP-kthread, that drivers a GP
> either wake-ups a worker to drain all list or directly wakes-up end
> user if it is one in the drain list.

I wonder if there is a cut-off N, where waking up N ("a few") inline
instead of just 1, yields similar results. For small values of N, that
keeps the total number of wakeups lower than pushing off to a kworker.
For instance, if 2 users, then you get 3 wakeups instead of just 2 (1
for the kworker and another 2 for the synchronize). But if you had a
cut off like N=5, then 2 users results in just 2 wakeups.
I don't feel too strongly about it, but for small values of N, I am
interested in a slightly better optimization if we can squeeze it.

[...]
> + * There are three lists for handling synchronize_rcu() users.
> + * A first list corresponds to new coming users, second for users
> + * which wait for a grace period and third is for which a grace
> + * period is passed.
> + */
> +static struct sr_normal_state {
> +       struct llist_head curr; /* request a GP users. */
> +       struct llist_head wait; /* wait for GP users. */
> +       struct llist_head done; /* ready for GP users. */
> +       struct llist_node *curr_tail;
> +       struct llist_node *wait_tail;

Just for clarification, the only reason you need 'tail' is because you
can use llist_add_batch() to quickly splice the list, instead of
finding the tail each time (expensive for a large list), correct? It
would be good to mention that in a comment here.

> +       atomic_t active;
> +} sr;
> +
> +/* Disable it by default. */
> +static int rcu_normal_wake_from_gp = 0;
> +module_param(rcu_normal_wake_from_gp, int, 0644);
> +
> +static void rcu_sr_normal_complete(struct llist_node *node)
> +{
> +       struct rcu_synchronize *rs = container_of(
> +               (struct rcu_head *) node, struct rcu_synchronize, head);
> +       unsigned long oldstate = (unsigned long) rs->head.func;
> +
> +       if (!poll_state_synchronize_rcu(oldstate))
> +               WARN_ONCE(1, "A full grace period is not passed yet: %lu",
> +                       rcu_seq_diff(get_state_synchronize_rcu(), oldstate));

nit: WARN_ONCE(!poll_state_synchronize_rcu(oldstate)), ...) and get
rid of if() ?

> +
> +       /* Finally. */
> +       complete(&rs->completion);
> +}
> +
> +static void rcu_sr_normal_gp_cleanup_work(struct work_struct *work)
> +{
> +       struct llist_node *done, *rcu, *next;
> +
> +       done = llist_del_all(&sr.done);
> +       if (!done)
> +               return;
> +
> +       llist_for_each_safe(rcu, next, done)
> +               rcu_sr_normal_complete(rcu);
> +}
[...]
> +static void rcu_sr_normal_add_req(struct rcu_synchronize *rs)
> +{
> +       atomic_inc(&sr.active);
> +       if (llist_add((struct llist_node *) &rs->head, &sr.curr))
> +               /* Set the tail. Only first and one user can do that. */
> +               WRITE_ONCE(sr.curr_tail, (struct llist_node *) &rs->head);
> +       atomic_dec(&sr.active);

Here there is no memory ordering provided by the atomic ops. Is that really Ok?

And same question for other usages of .active.

Per: https://www.kernel.org/doc/Documentation/atomic_t.txt
"RMW operations that have no return value are unordered;"
--
Note to self to ping my Android friends as well about this improvement
:-). Especially the -mm Android folks are interested in app startup
time.

thanks,

 - Joel



> +}
> +
>  /*
>   * Initialize a new grace period.  Return false if no grace period required.
>   */
> @@ -1418,6 +1534,7 @@ static noinline_for_stack bool rcu_gp_init(void)
>         /* Record GP times before starting GP, hence rcu_seq_start(). */
>         rcu_seq_start(&rcu_state.gp_seq);
>         ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> +       rcu_sr_normal_gp_init();
>         trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
>         rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
>         raw_spin_unlock_irq_rcu_node(rnp);
> @@ -1787,6 +1904,9 @@ static noinline void rcu_gp_cleanup(void)
>         }
>         raw_spin_unlock_irq_rcu_node(rnp);
>
> +       // Make synchronize_rcu() users aware of the end of old grace period.
> +       rcu_sr_normal_gp_cleanup();
> +
>         // If strict, make all CPUs aware of the end of the old grace period.
>         if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD))
>                 on_each_cpu(rcu_strict_gp_boundary, NULL, 0);
> @@ -3500,6 +3620,35 @@ static int rcu_blocking_is_gp(void)
>         return true;
>  }
>
> +/*
> + * Helper function for the synchronize_rcu() API.
> + */
> +static void synchronize_rcu_normal(void)
> +{
> +       struct rcu_synchronize rs;
> +
> +       if (READ_ONCE(rcu_normal_wake_from_gp)) {
> +               init_rcu_head_on_stack(&rs.head);
> +               init_completion(&rs.completion);
> +
> +               /*
> +                * This code might be preempted, therefore take a GP
> +                * snapshot before adding a request.
> +                */
> +               rs.head.func = (void *) get_state_synchronize_rcu();
> +               rcu_sr_normal_add_req(&rs);
> +
> +               /* Kick a GP and start waiting. */
> +               (void) start_poll_synchronize_rcu();
> +
> +               /* Now we can wait. */
> +               wait_for_completion(&rs.completion);
> +               destroy_rcu_head_on_stack(&rs.head);
> +       } else {
> +               wait_rcu_gp(call_rcu_hurry);
> +       }
> +}
> +
>  /**
>   * synchronize_rcu - wait until a grace period has elapsed.
>   *
> @@ -3551,7 +3700,7 @@ void synchronize_rcu(void)
>                 if (rcu_gp_is_expedited())
>                         synchronize_rcu_expedited();
>                 else
> -                       wait_rcu_gp(call_rcu_hurry);
> +                       synchronize_rcu_normal();
>                 return;
>         }
>
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..279a37beb05a 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -987,7 +987,7 @@ void synchronize_rcu_expedited(void)
>
>         /* If expedited grace periods are prohibited, fall back to normal. */
>         if (rcu_gp_is_normal()) {
> -               wait_rcu_gp(call_rcu_hurry);
> +               synchronize_rcu_normal();
>                 return;
>         }
>
> --
> 2.30.2
>