Message-ID: <CAGsJ_4y-s25N94b2GnxypFhb-5bv53wOcJBt14Dx83M6AJD=7Q@mail.gmail.com>
Date: Wed, 4 Sep 2024 09:10:20 +1200
From: Barry Song <21cnbao@...il.com>
To: Carlos Llamas <cmllamas@...gle.com>
Cc: Hillf Danton <hdanton@...a.com>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>,
Suren Baghdasaryan <surenb@...gle.com>, Michal Hocko <mhocko@...e.com>,
Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to
reduce the lock duration
On Wed, Sep 4, 2024 at 1:55 AM Carlos Llamas <cmllamas@...gle.com> wrote:
>
> On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@...a.com> wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@...o.com>
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > > faults. Performing memory allocation while holding this lock may trigger
> > > > direct reclamation, leading to others being queued in the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
>
> Thanks for your patch, Barry. So, we are aware of this contention and I've
> been working on a fix for it. See more about this below.
Cool, Carlos!
>
> > >
> > > I suspect concurrent allocators make things better wrt response, cutting
> > > alloc latency down to 10ms for instance in your scenario. Feel free to
> > > show figures given Tangquan's 48-hour profiling.
> >
> > Likely.
> >
> > Concurrent allocators are quite common in PFs that hit the same PTE:
> > whoever gets the PTL sets the PTE, and the others free the pages they
> > allocated.
> >
> > >
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
>
> This is not negligible. In fact, it is the exact reason for the page
> allocation to be done with the mmap sem. If the first thread sleeps on
> vm_insert_page(), then binder gets into a bad state of multiple threads
> trying to reclaim pages that won't really be used. Memory pressure goes
> from bad to worse pretty quickly.
>
> FWIW, I believe this was first talked about here:
> https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@google.com/
However, I'm not entirely convinced that this is a problem :-) Concurrent
allocations like this can occur in many places, especially in PFs. The
reclamation is not wasted either, because the memory it frees becomes
available to others.
I also don't believe binder is one of the largest users executing concurrent
allocations.
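To make the pattern concrete, here is a minimal userspace sketch of
allocating outside the lock and letting the loser of the race free its
page. It is not binder code: `struct slot`, `install_page()` and
`race_install()` are hypothetical names, a pthread mutex stands in for the
PTL, and `calloc()` stands in for `alloc_page()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 16

/* Hypothetical stand-in for one page slot; this is not kernel code. */
struct slot {
	pthread_mutex_t lock;	/* stands in for the PTL */
	void *page;		/* NULL until a thread installs a page */
	int wasted;		/* allocations that lost the race */
};

/*
 * Allocate before taking the lock so a slow allocation never extends
 * the critical section. Returns 0 if this thread installed the page,
 * 1 if another thread already had, -1 if nothing could be installed.
 */
static int install_page(struct slot *s)
{
	void *page = calloc(1, 4096);	/* stands in for alloc_page() */
	int ret;

	pthread_mutex_lock(&s->lock);
	if (s->page) {
		/* Lost the race: success even if our own allocation failed. */
		if (page)
			s->wasted++;
		ret = 1;
	} else if (page) {
		s->page = page;
		page = NULL;	/* ownership moved to the slot */
		ret = 0;
	} else {
		ret = -1;
	}
	pthread_mutex_unlock(&s->lock);

	free(page);	/* no-op for the winner, frees the loser's page */
	return ret;
}

static void *worker(void *arg)
{
	install_page(arg);
	return NULL;
}

/* Race nthreads installers on one slot; returns the wasted allocations. */
static int race_install(int nthreads)
{
	struct slot s = { .page = NULL, .wasted = 0 };
	pthread_t tids[MAX_THREADS];
	int i, started = 0, wasted;

	assert(nthreads >= 1 && nthreads <= MAX_THREADS);
	pthread_mutex_init(&s.lock, NULL);
	for (i = 0; i < nthreads; i++)
		if (pthread_create(&tids[started], NULL, worker, &s) == 0)
			started++;
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);

	wasted = s.wasted;
	free(s.page);
	pthread_mutex_destroy(&s.lock);
	return wasted;
}
```

Assuming every thread starts and every allocation succeeds, four racing
installers leave one page installed and three allocations discarded. The
sketch forces the worst case; the profiling quoted earlier in the thread
puts the real ratio at about 1 in 2400.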
>
>
> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > >
> > > If spin, better not before taking a look at vm_insert_page().
> >
> > I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> > currently in the testing queue. At the moment, alloc->spin is in place,
> > but I'm not entirely convinced it's the best replacement for the write
> > lock. Let's wait for Tangquan's test results.
> >
> > Patch 2 is detailed below, but it has only passed the build-test phase
> > so far, so its result is uncertain. I'm sharing it early in case you find
> > it interesting. And I am not convinced Commit d1d8875c8c13 ("binder: fix
> > UAF of alloc->vma in race with munmap()") is a correct fix to really
> > avoid all UAF of alloc->vma.
> >
> > [PATCH] binder_alloc: Don't use mmap_write_lock for installing page
> >
> > Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> > munmap()") uses the mmap_rwsem write lock to protect against a race
> > condition with munmap, where the vma is detached by the write lock,
> > but pages are zapped by the read lock. This approach is extremely
> > expensive for the system, though perhaps less so for binder itself,
> > as the write lock can block all other operations.
> >
> > As an alternative, we could hold only the read lock and re-check
> > that the vma hasn't been detached. To protect simultaneous page
> > installation, we could use alloc->lock instead.
> >
> > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > ---
> > drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
> > 1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index f20074e23a7c..a2281dfacbbc 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -228,24 +228,17 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > return -ESRCH;
> >
> > /*
> > - * Don't allocate page in mmap_write_lock, this can block
> > - * mmap_rwsem for a long time; Meanwhile, allocation failure
> > - * doesn't necessarily need to return -ENOMEM, if lru_page
> > - * has been installed, we can still return 0(success).
> > + * Allocation failure doesn't necessarily need to return -ENOMEM,
> > + * if lru_page has been installed, we can still return 0(success).
> > + * So, defer the !page check until after binder_get_installed_page()
> > + * is completed.
> > */
> > page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > - /*
> > - * Protected with mmap_sem in write mode as multiple tasks
> > - * might race to install the same page.
> > - */
> > - mmap_write_lock(alloc->mm);
> > - if (binder_get_installed_page(lru_page)) {
> > - ret = 1;
> > - goto out;
> > - }
> > + mmap_read_lock(alloc->mm);
> >
> > - if (!alloc->vma) {
> > + /* vma might have been dropped or deattached */
> > + if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> > pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> > ret = -ESRCH;
> > goto out;
> > @@ -257,18 +250,27 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > goto out;
> > }
> >
> > + spin_lock(&alloc->lock);
>
> You can't hold a spinlock and then call vm_insert_page().
Thanks! This patch has only passed the build test so far. It seems like
we can hold off on further testing for now.
>
> > + if (binder_get_installed_page(lru_page)) {
> > + spin_unlock(&alloc->lock);
> > + ret = 1;
> > + goto out;
> > + }
> > +
> > ret = vm_insert_page(alloc->vma, addr, page);
> > if (ret) {
> > pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> > alloc->pid, __func__, addr - alloc->buffer, ret);
> > + spin_unlock(&alloc->lock);
> > ret = -ENOMEM;
> > goto out;
> > }
> >
> > /* Mark page installation complete and safe to use */
> > binder_set_installed_page(lru_page, page);
> > + spin_unlock(&alloc->lock);
> > out:
> > - mmap_write_unlock(alloc->mm);
> > + mmap_read_unlock(alloc->mm);
> > mmput_async(alloc->mm);
> > if (ret && page)
> > __free_page(page);
> > --
> > 2.39.3 (Apple Git-146)
>
>
> Sorry, but as I mentioned, I've been working on fixing this contention
> by supporting concurrent "faults" in binder_install_single_page(). This
> is the appropriate fix. I should be sending a patch soon after working
> out the conflicts with the shrinker's callback.
Awesome! I’m eager to see your patch, and we’re ready to help with testing.
I strongly recommend dropping the write lock entirely. Using
`mmap_write_lock()` isn’t just a binder-specific concern; it has the
potential to affect the entire Android system.
In patch 3, I experimented with using `per_vma_lock` as well. I’m _not_
proposing it for merging since you’re already working on it, but I wanted
to share the idea. (Just like patch 2, it has only passed the build test.)
[PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock
To further reduce the read lock duration, let's try using per_vma_lock
first. If that fails, we can take the read lock, similar to how page
fault handlers operate.
Signed-off-by: Barry Song <v-songbaohua@...o.com>
---
drivers/android/binder_alloc.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index a2281dfacbbc..b40a5dd650c8 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,6 +221,8 @@ static int binder_install_single_page(struct binder_alloc *alloc,
struct binder_lru_page *lru_page,
unsigned long addr)
{
+ struct vm_area_struct *vma;
+ bool per_vma_lock = true;
struct page *page;
int ret = 0;
@@ -235,10 +237,15 @@ static int binder_install_single_page(struct binder_alloc *alloc,
*/
page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
- mmap_read_lock(alloc->mm);
+ vma = lock_vma_under_rcu(alloc->mm, addr);
+ if (!vma) {
+ per_vma_lock = false;
+ mmap_read_lock(alloc->mm);
+ vma = find_vma(alloc->mm, addr);
+ }
- /* vma might have been dropped or deattached */
- if (!alloc->vma || !find_vma(alloc->mm, addr)) {
+ /* vma might have been dropped, deattached or changed to new one */
+ if (!alloc->vma || !vma || vma != alloc->vma) {
pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
ret = -ESRCH;
goto out;
@@ -270,7 +277,10 @@ static int binder_install_single_page(struct binder_alloc *alloc,
binder_set_installed_page(lru_page, page);
spin_unlock(&alloc->lock);
out:
- mmap_read_unlock(alloc->mm);
+ if (per_vma_lock)
+ vma_end_read(vma);
+ else
+ mmap_read_unlock(alloc->mm);
mmput_async(alloc->mm);
if (ret && page)
__free_page(page);
--
2.39.3 (Apple Git-146)
>
> Thanks,
> --
> Carlos Llamas
Thanks
Barry