[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 4 May 2022 21:11:49 -0400
From: Waiman Long <longman@...hat.com>
To: Matthew Wilcox <willy@...radead.org>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"Liam R. Howlett" <Liam.Howlett@...cle.com>,
linux-kernel@...r.kernel.org
Subject: Re: Wait for mutex to become unlocked
On 5/4/22 17:44, Matthew Wilcox wrote:
> Paul, Liam and I were talking about some code we intend to write soon
> and realised there's a missing function in the mutex & rwsem API.
> We're intending to use it for an rwsem, but I think it applies equally
> to mutexes.
>
> The customer has a low priority task which wants to read /proc/pid/smaps
> of a higher priority task. Today, everything is awful; smaps acquires
> mmap_sem read-only, is preempted, then the high-pri task calls mmap()
> and the down_write(mmap_sem) blocks on the low-pri task. Then all the
> other threads in the high-pri task block on the mmap_sem as they take
> page faults because we don't want writers to starve.
>
> The approach we're looking at is to allow RCU lookup of VMAs, and then
> take a per-VMA rwsem for read. Because we're under RCU protection,
> that looks a bit like this:
>
> rcu_read_lock();
> vma = vma_lookup();
> if (down_read_trylock(&vma->sem)) {
> rcu_read_unlock();
> } else {
> rcu_read_unlock();
> down_read(&mm->mmap_sem);
> vma = vma_lookup();
> down_read(&vma->sem);
> up_read(&mm->mmap_sem);
> }
>
> (for clarity, I've skipped the !vma checks; don't take this too literally)
>
> So this is Good. For the vast majority of cases, we avoid taking the
> mmap read lock and the problem will appear much less often. But we can
> do Better with a new API. You see, for this case, we don't actually
> want to acquire the mmap_sem; we're happy to spin a bit, but there's no
> point in spinning waiting for the writer to finish when we can sleep.
> I'd like to write this code:
>
> again:
> rcu_read_lock();
> vma = vma_lookup();
> if (down_read_trylock(&vma->sem)) {
> rcu_read_unlock();
> } else {
> rcu_read_unlock();
> rwsem_wait_read(&mm->mmap_sem);
> goto again;
> }
>
> That is, rwsem_wait_read() puts the thread on the rwsem's wait queue,
> and wakes it up without giving it the lock. Now this thread will never
> be able to block any thread that tries to acquire mmap_sem for write.
I suppose that a writer that needs to take a write lock on vma->sem will
have to take a write lock on mmap_sem first, then it makes sense to me
that you want to wait for all the vma->sem writers to finish by waiting
on the wait queue of mmap_sem. By the time the waiting task is being
woken up, there is no active write lock on the vma->sem and hopefully by
the time the waiting process wakes up and do a down_read_trylock(), it
will succeed. However, the time gap in the wakeup process may have
another writer coming in taking the vma->sem write lock. It improves the
chance of a successful trylock but it is not guaranteed. So you will
need a retry count and revert back to a direct down_read() when there
are too many retries.
Since the waiting process isn't taking any lock, the name
rwsem_wait_read() may be somewhat misleading. I think a better name may
be rwsem_flush_waiters(). So do you want to flush the waiters at the
point this API is called or you want to wait until the wait queue is empty?
Cheers,
Longman
Powered by blists - more mailing lists