lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191211234648.GO2889@paulmck-ThinkPad-P72>
Date:   Wed, 11 Dec 2019 15:46:48 -0800
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Uladzislau Rezki <urezki@...il.com>
Cc:     "Joel Fernandes (Google)" <joel@...lfernandes.org>,
        linux-kernel@...r.kernel.org, kernel-team@...roid.com,
        kernel-team@....com, Byungchul Park <byungchul.park@....com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Josh Triplett <josh@...htriplett.org>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        max.byungchul.park@...il.com, Rao Shoaib <rao.shoaib@...cle.com>,
        rcu@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu()
 batching

On Tue, Dec 10, 2019 at 10:53:48AM +0100, Uladzislau Rezki wrote:
> On Wed, Sep 18, 2019 at 11:58:11AM +0200, Uladzislau Rezki wrote:
> > > Recently a discussion about stability and performance of a system
> > > involving a high rate of kfree_rcu() calls surfaced on the list [1]
> > > which led to another discussion how to prepare for this situation.
> > > 
> > > This patch adds basic batching support for kfree_rcu(). It is "basic"
> > > because we do none of the slab management, dynamic allocation, code
> > > moving or any of the other things, some of which previous attempts did
> > > [2]. These fancier improvements can be follow-up patches and there are
> > > different ideas being discussed in those regards. This is an effort to
> > > start simple, and build up from there. In the future, an extension to
> > > use kfree_bulk and possibly per-slab batching could be done to further
> > > improve performance due to cache-locality and slab-specific bulk free
> > > optimizations. By using an array of pointers, the worker thread
> > > processing the work would need to read lesser data since it does not
> > > need to deal with large rcu_head(s) any longer.
> > > 
> According to https://lkml.org/lkml/2017/12/19/706 there was an attempt
> to make use of kfree_bulk() interface. I have done some tests based on
> your patch and enhanced kfree_bulk() logic. Basically storing pointers 
> in an array with a specific size makes sense to me and seems to others
> as well. I mean in comparison with "pointer chasing" way, when there is
> probably a cache misses each time the access is done to next element:

Something like this would be good!

The other thing to track besides CPU time savings (which does look good!)
is memory footprint.

And there will also need to be something visible to RCU counting the
number of outstanding kfree()s.  But on a per-CPU basis, for example,
as an atomic_long_t field in the rcu_data structure or similar.  This
is needed to help RCU work out when it needs to work harder to bring
grace periods to an end.  But that can be a separate issue.

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 1fe0418a5901..4f68662c1568 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2595,6 +2595,13 @@ EXPORT_SYMBOL_GPL(call_rcu);
> 
>  /* Maximum number of jiffies to wait before draining a batch. */
>  #define KFREE_DRAIN_JIFFIES (HZ / 50)
> +#define KFREE_BULK_MAX_SIZE 64

My guess is that performance does not depend all that much on the
exact number.  Does that match your testing?

> +
> +struct kfree_rcu_bulk_data {
> +       int nr_records;
> +       void *records[KFREE_BULK_MAX_SIZE];
> +       struct kfree_rcu_bulk_data *next;
> +};
> 
>  /*
>   * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
> @@ -2607,15 +2614,24 @@ struct kfree_rcu_cpu {
>         struct rcu_work rcu_work;
> 
>         /* The list of objects being queued in a batch but are not yet
> -        * scheduled to be freed.
> +        * scheduled to be freed. For emergency path only.
>          */
>         struct rcu_head *head;
> 
>         /* The list of objects that have now left ->head and are queued for
> -        * freeing after a grace period.
> +        * freeing after a grace period. For emergency path only.
>          */
>         struct rcu_head *head_free;
> 
> +       /*
> +        * It is a block list that keeps pointers in the array of specific
> +        * size which are freed by the kfree_bulk() logic. Intends to improve
> +        * drain throughput.
> +        */
> +       struct kfree_rcu_bulk_data *bhead;
> +       struct kfree_rcu_bulk_data *bhead_free;
> +       struct kfree_rcu_bulk_data *bcached;

So ->bcached keeps at most one kfree_rcu_bulk_data around for later use,
correct?  And ->bhead is where new memory is placed, while ->bhead_free
contains those waiting for a grace period, right?  (It would be good
to make the comment explicit about this.)

> +
>         /* Protect concurrent access to this structure. */
>         spinlock_t lock;
> @@ -2637,23 +2653,39 @@ static void kfree_rcu_work(struct work_struct *work)
>  {
>         unsigned long flags;
>         struct rcu_head *head, *next;
> +       struct kfree_rcu_bulk_data *bhead, *bnext;
>         struct kfree_rcu_cpu *krcp = container_of(to_rcu_work(work),
>                                         struct kfree_rcu_cpu, rcu_work);
>  
>         spin_lock_irqsave(&krcp->lock, flags);
>         head = krcp->head_free;
>         krcp->head_free = NULL;
> +       bhead = krcp->bhead_free;
> +       krcp->bhead_free = NULL;
>         spin_unlock_irqrestore(&krcp->lock, flags);
>  
>         /*
>          * The head is detached and not referenced from anywhere, so lockless
>          * access is Ok.
>          */
> +       for (; bhead; bhead = bnext) {
> +               bnext = bhead->next;
> +               kfree_bulk(bhead->nr_records, bhead->records);
> +
> +               if (cmpxchg(&krcp->bcached, NULL, bhead))
> +                       kfree(bhead);
> +
> +               cond_resched_tasks_rcu_qs();
> +       }
> +
> +       /*
> +        * Emergency case only. It can happen under low
> +        * memory condition when kmalloc gets failed, so
> +        * the "bulk" path can not be temporary maintained.
> +        */
>         for (; head; head = next) {
>                 next = head->next;
> -               /* Could be possible to optimize with kfree_bulk in future */
>                 __rcu_reclaim(rcu_state.name, head);
> -               cond_resched_tasks_rcu_qs();
>         }
>  }
> 
> @@ -2671,11 +2703,15 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
>          * another one, just refuse the optimization and it will be retried
>          * again in KFREE_DRAIN_JIFFIES time.
>          */
> -       if (krcp->head_free)
> +       if (krcp->bhead_free || krcp->head_free)
>                 return false;
> 
>         krcp->head_free = krcp->head;
>         krcp->head = NULL;
> +
> +       krcp->bhead_free = krcp->bhead;
> +       krcp->bhead = NULL;
> +
>         INIT_RCU_WORK(&krcp->rcu_work, kfree_rcu_work);
>         queue_rcu_work(system_wq, &krcp->rcu_work);
> 
> @@ -2747,6 +2783,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  {
>         unsigned long flags;
>         struct kfree_rcu_cpu *krcp;
> +       struct kfree_rcu_bulk_data *bnode;
> 
>         /* kfree_call_rcu() batching requires timers to be up. If the scheduler
>          * is not yet up, just skip batching and do the non-batched version.
> @@ -2754,16 +2791,35 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>         if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING)
>                 return kfree_call_rcu_nobatch(head, func);
> 
> -       head->func = func;
> -
>         local_irq_save(flags);  /* For safely calling this_cpu_ptr(). */
>         krcp = this_cpu_ptr(&krc);
>         spin_lock(&krcp->lock);
> 
> +       if (!krcp->bhead ||
> +                       krcp->bhead->nr_records == KFREE_BULK_MAX_SIZE) {
> +               /* Need a new block. */
> +               if (!(bnode = xchg(&krcp->bcached, NULL)))
> +                       bnode = kmalloc(sizeof(struct kfree_rcu_bulk_data),
> +                               GFP_ATOMIC | __GFP_NOWARN);
> +
> +               /* If gets failed, maintain the list instead. */
> +               if (unlikely(!bnode)) {
> +                       head->func = func;
> +                       head->next = krcp->head;
> +                       krcp->head = head;
> +                       goto check_and_schedule;

It should be possible to move this code out to follow the "Queue the
next" comment, thus avoiding the goto.  Setting krcp->bhead to NULL
here should set up for the check below, right?

> +               }
> +
> +               bnode->nr_records = 0;
> +               bnode->next = krcp->bhead;
> +               krcp->bhead = bnode;
> +       }
> +
>         /* Queue the kfree but don't yet schedule the batch. */
> -       head->next = krcp->head;
> -       krcp->head = head;
> +       krcp->bhead->records[krcp->bhead->nr_records++] =
> +               (void *) head - (unsigned long) func;
>  
> +check_and_schedule:
>         /* Schedule monitor for timely drain after KFREE_DRAIN_JIFFIES. */
>         if (!xchg(&krcp->monitor_todo, true))
>                 schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> 
> See below some test results with/without this patch:
> 
> # HiKey 960 8xCPUs
> rcuperf.ko kfree_loops=200000 kfree_alloc_num=1000 kfree_rcu_test=1
> [  159.017771] Total time taken by all kfree'ers: 92783584881 ns, loops: 200000, batches: 5117
> [  126.862573] Total time taken by all kfree'ers: 70935580718 ns, loops: 200000, batches: 3953
> 
> Running the "rcuperf" shows approximately ~23% better throughput in case of using
> "bulk" interface, so we have 92783584881 vs 70935580718 as total time. The "drain logic"
> or its RCU callback does the work faster that leads to better throughput.
> 
> I can upload the RFC/PATCH of that change providing the test details and so on. 
> 
> Any thoughts about it?

Again nice improvement!  Please also check memory footprint.  I would
not expect much difference, but...

							Thanx, Paul

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ