lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAGsJ_4zkz6fxkRFm190C89=p3=keJCYvrYtmjapTw=BbU_xUSQ@mail.gmail.com>
Date: Tue, 3 Sep 2024 19:01:08 +0800
From: Barry Song <21cnbao@...il.com>
To: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	Arve Hjønnevåg <arve@...roid.com>, 
	Todd Kjos <tkjos@...roid.com>, Martijn Coenen <maco@...roid.com>, 
	Joel Fernandes <joel@...lfernandes.org>, Christian Brauner <brauner@...nel.org>, 
	Carlos Llamas <cmllamas@...gle.com>, Suren Baghdasaryan <surenb@...gle.com>, 
	Tangquan Zheng <zhengtangquan@...o.com>
Subject: Re: [PATCH] binder_alloc: Move alloc_page() out of mmap_rwsem to
 reduce the lock duration

On Tue, Sep 3, 2024 at 6:37 PM Greg Kroah-Hartman
<gregkh@...uxfoundation.org> wrote:
>
> On Tue, Sep 03, 2024 at 05:07:23PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 4:57 PM Greg Kroah-Hartman
> > <gregkh@...uxfoundation.org> wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@...o.com>
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > > faults. Performing memory allocation while holding this lock may trigger
> > > > direct reclamation, leading to others being queued in the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > > >
> > > > Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>
> > > > Cc: "Arve Hjønnevåg" <arve@...roid.com>
> > > > Cc: Todd Kjos <tkjos@...roid.com>
> > > > Cc: Martijn Coenen <maco@...roid.com>
> > > > Cc: Joel Fernandes <joel@...lfernandes.org>
> > > > Cc: Christian Brauner <brauner@...nel.org>
> > > > Cc: Carlos Llamas <cmllamas@...gle.com>
> > > > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > > > Tested-by: Tangquan Zheng <zhengtangquan@...o.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > > > ---
> > > >  drivers/android/binder_alloc.c | 18 ++++++++++++++----
> > > >  1 file changed, 14 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > > > index b3acbc4174fb..f20074e23a7c 100644
> > > > --- a/drivers/android/binder_alloc.c
> > > > +++ b/drivers/android/binder_alloc.c
> > > > @@ -227,13 +227,23 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >       if (!mmget_not_zero(alloc->mm))
> > > >               return -ESRCH;
> > > >
> > > > +     /*
> > > > +      * Don't allocate page in mmap_write_lock, this can block
> > > > +      * mmap_rwsem for a long time; Meanwhile, allocation failure
> > > > +      * doesn't necessarily need to return -ENOMEM, if lru_page
> > > > +      * has been installed, we can still return 0(success).
> > > > +      */
> > > > +     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> > >
> > > But now you are allocating new pages even if binder_get_installed_page()
> > > is an error, right?  Doesn't that slow things down?
> >
> > very very unlikely, as the ratio is only 1/2400 while write lock 100% blocks
> > everyone.
>
> Ok, but:
>
> > > How was this benchmarked?
> >
> > i actually put Tangquan's profiling result:
> > "
> > However, according to Tangquan's 48-hour profiling
> >  using monkey, the likelihood of this occurring is minimal, with a ratio
> >  of only 1 in 2400. Compared to the significantly costly rwsem, this is
> >  negligible."
>
> That's not a benchmark, or any real numbers of how this overall saves
> any time.

My point is that mmap_rw is one of the most notorious locks impacting
system performance. Our profiling indicates that this binder's rwsem lock
is among the top two write locks, blocking page faults and other accesses
to the VMA.

>From the memory management perspective, I believe the improvement will
be quite evident. However, I agree that we should include more data to show
how reducing lock contention can enhance overall user experience and
minimize UI lag.

>
> > > >       /*
> > > >        * Protected with mmap_sem in write mode as multiple tasks
> > > >        * might race to install the same page.
> > > >        */
> > > >       mmap_write_lock(alloc->mm);
> > > > -     if (binder_get_installed_page(lru_page))
> > > > +     if (binder_get_installed_page(lru_page)) {
> > > > +             ret = 1;
> > >
> > > That is not a valid error value :(
> > >
> > > >               goto out;
> > > > +     }
> > > >
> > > >       if (!alloc->vma) {
> > > >               pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> > > > @@ -241,7 +251,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >               goto out;
> > > >       }
> > > >
> > > > -     page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> > > >       if (!page) {
> > > >               pr_err("%d: failed to allocate page\n", alloc->pid);
> > > >               ret = -ENOMEM;
> > > > @@ -252,7 +261,6 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >       if (ret) {
> > > >               pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> > > >                      alloc->pid, __func__, addr - alloc->buffer, ret);
> > > > -             __free_page(page);
> > > >               ret = -ENOMEM;
> > > >               goto out;
> > > >       }
> > > > @@ -262,7 +270,9 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> > > >  out:
> > > >       mmap_write_unlock(alloc->mm);
> > > >       mmput_async(alloc->mm);
> > > > -     return ret;
> > > > +     if (ret && page)
> > > > +             __free_page(page);
> > > > +     return ret < 0 ? ret : 0;
> > >
> > > Please only use ? : for when you have to, otherwise please spell it out
> > > with a normal if statement:
> > >         if (ret < 0)
> > >                 return ret;
> > >         return 0;
> > >
> > > But, this is abusing the fact that you set "ret = 1" above, which is
> > > going to trip someone up in the future as that is NOT a normal coding
> > > pattern we have in the kernel, sorry.
> > >
> > > If you insist on this change, please rework it to not have that type of
> > > "positive means one thing, 0 means another, and negative means yet
> > > something else" please.
> >
> > I was trying to consolidate all free_page() calls into one place. Otherwise,
> > we would need multiple free_page() calls. I'm perfectly fine with having more
> > free_page() calls in both the ret = 1 and ret < 0 paths. In that case,
> > the ret = 1
> > path can be removed if you prefer.
>
> Remember, we write code for people first, and compilers second.  You
> have to maintain this code for the next 10+ years, make it _VERY_
> obvious what is happening and how it works as you will be coming back to
> it and not remembering what was made for what reason at all.

no objection to this.

>
> thanks,
>
> greg k-h

Thanks
Barry

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ