Message-ID: <20190807144305.v55fohssujsqtegb@willie-the-truck>
Date: Wed, 7 Aug 2019 15:45:34 +0100
From: Will Deacon <will@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Oleg Nesterov <oleg@...hat.com>, Will Deacon <will@...nel.org>,
Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
linux-kernel@...r.kernel.org, bigeasy@...utronix.de,
juri.lelli@...hat.com, williams@...hat.com, bristot@...hat.com,
longman@...hat.com, dave@...olabs.net, jack@...e.com
Subject: Re: [PATCH] locking/percpu_rwsem: Rewrite to not use rwsem
Hi Peter,
I've mostly been spared the joys of pcpu rwsem, but I took a look anyway.
Comments of questionable quality below.
On Mon, Aug 05, 2019 at 04:02:41PM +0200, Peter Zijlstra wrote:
> The filesystem freezer uses percpu_rwsem in a way that is effectively
> write_non_owner() and achieves this with a few horrible hacks that
> rely on the rwsem (!percpu) implementation.
>
> When -RT re-implements rwsem this house of cards comes undone.
>
> Re-implement percpu_rwsem without relying on rwsem.
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
> Reported-by: Juri Lelli <juri.lelli@...hat.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Tested-by: Juri Lelli <juri.lelli@...hat.com>
> Cc: Clark Williams <williams@...hat.com>
> Cc: Daniel Bristot de Oliveira <bristot@...hat.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Cc: Waiman Long <longman@...hat.com>
> Cc: Davidlohr Bueso <dave@...olabs.net>
> Cc: Oleg Nesterov <oleg@...hat.com>
> Cc: jack@...e.com
> ---
> include/linux/percpu-rwsem.h | 72 +++++++++++++-------------
> include/linux/wait.h | 3 +
> kernel/cpu.c | 4 -
> kernel/locking/percpu-rwsem.c | 116 +++++++++++++++++++++++++-----------------
> 4 files changed, 112 insertions(+), 83 deletions(-)
[...]
> +/*
> + * Called with preemption disabled, and returns with preemption disabled,
> + * except when it fails the trylock.
> + */
> +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
> {
> /*
> * Due to having preemption disabled the decrement happens on
> * the same CPU as the increment, avoiding the
> * increment-on-one-CPU-and-decrement-on-another problem.
> *
> - * If the reader misses the writer's assignment of readers_block, then
> - * the writer is guaranteed to see the reader's increment.
> + * If the reader misses the writer's assignment of sem->block, then the
> + * writer is guaranteed to see the reader's increment.
> *
> * Conversely, any readers that increment their sem->read_count after
> - * the writer looks are guaranteed to see the readers_block value,
> - * which in turn means that they are guaranteed to immediately
> - * decrement their sem->read_count, so that it doesn't matter that the
> - * writer missed them.
> + * the writer looks are guaranteed to see the sem->block value, which
> + * in turn means that they are guaranteed to immediately decrement
> + * their sem->read_count, so that it doesn't matter that the writer
> + * missed them.
> */
>
> +again:
> smp_mb(); /* A matches D */
>
> /*
> - * If !readers_block the critical section starts here, matched by the
> + * If !sem->block the critical section starts here, matched by the
> * release in percpu_up_write().
> */
> - if (likely(!smp_load_acquire(&sem->readers_block)))
> - return 1;
> + if (likely(!atomic_read_acquire(&sem->block)))
> + return true;
>
> /*
> * Per the above comment; we still have preemption disabled and
> * will thus decrement on the same CPU as we incremented.
> */
> - __percpu_up_read(sem);
> + __percpu_up_read(sem); /* implies preempt_enable() */
Irritatingly, it also implies an smp_mb() which I don't think we need here.
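To illustrate what I mean, a rough sketch (assuming sem->read_count stays a
per-cpu counter as per the comments above; __percpu_down_read_undo() is a
made-up name, not an existing helper):

	static void __percpu_down_read_undo(struct percpu_rw_semaphore *sem)
	{
		/*
		 * No smp_mb() here: this path never entered a read-side
		 * critical section, so there is nothing to order before
		 * the decrement.
		 */
		__this_cpu_dec(*sem->read_count);
		preempt_enable();
		/* plus whatever writer wakeup the real __percpu_up_read() does */
	}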
> if (try)
> - return 0;
> + return false;
>
> - /*
> - * We either call schedule() in the wait, or we'll fall through
> - * and reschedule on the preempt_enable() in percpu_down_read().
> - */
> - preempt_enable_no_resched();
> + wait_event(sem->waiters, !atomic_read_acquire(&sem->block));
Why do you need acquire semantics here? Is the control dependency to the
increment not enough?
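i.e. a relaxed read looks sufficient to me, something like (sketch only;
assumes the retry path goes back through the smp_mb() at the 'again' label
before re-checking sem->block):

	wait_event(sem->waiters, !atomic_read(&sem->block));

with the ctrl dependency from the load of sem->block to the subsequent
read_count increment ordering that store.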
Talking of control dependencies, could we replace the smp_mb() in
readers_active_check() with smp_acquire__after_ctrl_dep()? In fact, perhaps
we could remove it altogether, given that our writes will be ordered by
the dependency and I don't think we care about ordering our reads wrt
previous readers. Hmm. Either way, clearly not for this patch.
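For concreteness, the first variant would be something like (sketch only,
assuming readers_active_check() keeps its current shape of summing the
per-cpu read_count and then ordering against the readers):

	static bool readers_active_check(struct percpu_rw_semaphore *sem)
	{
		if (per_cpu_sum(*sem->read_count) != 0)
			return false;

		/* promote the sum == 0 ctrl dependency rather than a full barrier */
		smp_acquire__after_ctrl_dep();	/* was: smp_mb(); C matches B */

		return true;
	}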
Anyway, general shape of the patch looks good to me.
Will