Message-ID: <CAGsJ_4xb+h7EVG8WQxt9BpAz6EYC4V+M9+ijw47Pt0-6iOZtog@mail.gmail.com>
Date: Fri, 4 Oct 2024 23:55:28 +0800
From: Barry Song <21cnbao@...il.com>
To: Chris Li <chrisl@...nel.org>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, Barry Song <v-songbaohua@...o.com>, 
	Kairui Song <kasong@...cent.com>, "Huang, Ying" <ying.huang@...el.com>, Yu Zhao <yuzhao@...gle.com>, 
	David Hildenbrand <david@...hat.com>, Hugh Dickins <hughd@...gle.com>, 
	Johannes Weiner <hannes@...xchg.org>, Matthew Wilcox <willy@...radead.org>, Michal Hocko <mhocko@...e.com>, 
	Minchan Kim <minchan@...nel.org>, Yosry Ahmed <yosryahmed@...gle.com>, 
	SeongJae Park <sj@...nel.org>, Kalesh Singh <kaleshsingh@...gle.com>, 
	Suren Baghdasaryan <surenb@...gle.com>, stable@...r.kernel.org, 
	Oven Liyang <liyangouwen1@...o.com>
Subject: Re: [PATCH] mm: avoid unconditional one-tick sleep when
 swapcache_prepare fails

On Fri, Oct 4, 2024 at 6:22 AM Chris Li <chrisl@...nel.org> wrote:
>
> On Thu, Sep 26, 2024 at 2:20 PM Barry Song <21cnbao@...il.com> wrote:
> >
> > From: Barry Song <v-songbaohua@...o.com>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > fails, which has led to reports of UI stuttering on latency-sensitive
> > Android devices. To address this, we can use a waitqueue to wake up
> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > sleeping for a full tick. While tasks may occasionally be woken by an
> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > rapid re-entry into page faults, which can cause livelocks, and
> > multiple millisecond sleeps, which visibly degrade user experience.
> >
> > Oven's testing shows that a single waitqueue resolves the UI
> > stuttering issue. If a 'thundering herd' problem becomes apparent
> > later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]`
> > for page bit locks can be introduced.
> >
> > Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > Cc: Kairui Song <kasong@...cent.com>
> > Cc: "Huang, Ying" <ying.huang@...el.com>
> > Cc: Yu Zhao <yuzhao@...gle.com>
> > Cc: David Hildenbrand <david@...hat.com>
> > Cc: Chris Li <chrisl@...nel.org>
> > Cc: Hugh Dickins <hughd@...gle.com>
> > Cc: Johannes Weiner <hannes@...xchg.org>
> > Cc: Matthew Wilcox (Oracle) <willy@...radead.org>
> > Cc: Michal Hocko <mhocko@...e.com>
> > Cc: Minchan Kim <minchan@...nel.org>
> > Cc: Yosry Ahmed <yosryahmed@...gle.com>
> > Cc: SeongJae Park <sj@...nel.org>
> > Cc: Kalesh Singh <kaleshsingh@...gle.com>
> > Cc: Suren Baghdasaryan <surenb@...gle.com>
> > Cc: <stable@...r.kernel.org>
> > Reported-by: Oven Liyang <liyangouwen1@...o.com>
> > Tested-by: Oven Liyang <liyangouwen1@...o.com>
> > Signed-off-by: Barry Song <v-songbaohua@...o.com>
> > ---
> >  mm/memory.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..6913174f7f41 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4206,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >         struct vm_area_struct *vma = vmf->vma;
> >         struct folio *swapcache, *folio = NULL;
> > +       DECLARE_WAITQUEUE(wait, current);
> >         struct page *page;
> >         struct swap_info_struct *si = NULL;
> >         rmap_t rmap_flags = RMAP_NONE;
> > @@ -4302,7 +4305,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                                          * Relax a bit to prevent rapid
> >                                          * repeated page faults.
> >                                          */
> > +                                       add_wait_queue(&swapcache_wq, &wait);
> >                                         schedule_timeout_uninterruptible(1);
> > +                                       remove_wait_queue(&swapcache_wq, &wait);
>
> There is only one "swapcache_wq". If we don't care about the memory
> overhead, ideally there should be one wait queue per swap entry that
> fails to grab the HAS_CACHE bit. Currently, with all swap entries sharing
> one wait queue, waiters on other swap entries (if any) will likely get
> woken up, only to find that the swap entry they care about hasn't been
> served yet.
>

Even page bit locks don't have a waitqueue per page, and I believe that
case has much more serious contention than swap-in. Page bit locks rely
on a waitqueue hash to reduce unrelated wake-ups.
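
For illustration, a hashed wait table could pick a bucket from the swap
entry value roughly as follows. This is a standalone userspace sketch, not
kernel code: SWAP_WAIT_TABLE_BITS, SWAP_WAIT_TABLE_SIZE, hash_bits() and
swap_wait_bucket() are hypothetical names, and the multiplier is a
golden-ratio constant of the kind the kernel's hash_64() uses.

```c
#include <assert.h>
#include <stdint.h>

#define SWAP_WAIT_TABLE_BITS 8	/* hypothetical: 256 buckets */
#define SWAP_WAIT_TABLE_SIZE (1u << SWAP_WAIT_TABLE_BITS)

/* Multiplicative hash of a 64-bit value down to `bits` bits. */
static unsigned int hash_bits(uint64_t val, unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (unsigned int)((val * 0x61C8864680B583EBull) >> (64 - bits));
}

/* Map a swap entry value to the index of its wait-queue bucket. */
static unsigned int swap_wait_bucket(uint64_t entry_val)
{
	return hash_bits(entry_val, SWAP_WAIT_TABLE_BITS);
}
```

Waiters and wakers for the same entry always land in the same bucket, so a
wake-up only disturbs waiters whose entries happen to hash to that bucket.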

If a process is woken up by an unrelated do_swap_page() and its swapcache
is still not released, it will sleep again after re-checking
swapcache_prepare().

Too many unrelated wake-ups would be just a 'thundering herd' but not
a livelock.
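
A toy model of that behaviour, assuming one waiter per swap entry and
entries released one at a time, can count how many wake-ups are wasted on
a single shared queue. Everything here (herd_stats, simulate_single_queue,
MAX_WAITERS) is hypothetical and single-threaded; it only models the
re-check-and-sleep-again logic, not the scheduler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 64

struct herd_stats {
	size_t useful;		/* wake-ups after which the waiter proceeds */
	size_t unrelated;	/* wake-ups after which the waiter sleeps again */
};

static struct herd_stats simulate_single_queue(size_t n)
{
	bool pinned[MAX_WAITERS];
	bool asleep[MAX_WAITERS];
	struct herd_stats st = { 0, 0 };

	assert(n <= MAX_WAITERS);
	for (size_t i = 0; i < n; i++)
		pinned[i] = asleep[i] = true;

	for (size_t winner = 0; winner < n; winner++) {
		pinned[winner] = false;		/* models swapcache_clear() */
		/* models wake_up(): every sleeper on the shared queue runs */
		for (size_t w = 0; w < n; w++) {
			if (!asleep[w])
				continue;
			if (pinned[w]) {
				st.unrelated++;	/* re-check fails, sleep again */
			} else {
				asleep[w] = false;
				st.useful++;
			}
		}
	}
	return st;
}
```

In this model every waiter eventually proceeds, so the cost of a shared
queue is extra wake-ups (quadratic in the number of waiters in the worst
case) rather than a lack of forward progress.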

> Another thing to consider is that, if we are using a wait queue, the
> 1ms is not relevant any more. It can be longer than 1ms, since the task
> gets woken up by the wait queue anyway. Here you might use an
> indefinite sleep to reduce the unnecessary wake-ups and the
> complexity of the timer.

I'm not quite sure what you mean by 1ms. In embedded systems we never
use HZ=1000; the typical (and maximum) HZ is 250. I'm also not quite sure
what you mean by "indefinitely sleep". My understanding is that we can't
poll the result of swapcache_prepare(), as the winner process which
calls swapcache_prepare() successfully will drop the swap slots.
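
For reference, the argument to schedule_timeout_uninterruptible() is in
ticks (jiffies), so the nominal length of the sleep depends on HZ. The
helper below is a hypothetical userspace illustration of that arithmetic
only; tick_ms() is not a kernel function.

```c
#include <assert.h>

/* Nominal length of one scheduler tick in milliseconds for a given HZ. */
static unsigned int tick_ms(unsigned int hz)
{
	assert(hz > 0);
	return 1000u / hz;
}
```

With HZ=250 a one-tick sleep is nominally 4 ms, with HZ=100 it is 10 ms,
and only with HZ=1000 is it 1 ms.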

>
> >                                         goto out_page;
> >                                 }
> >                                 need_clear_cache = true;
> > @@ -4609,8 +4614,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >         /* Clear the swap cache pin for direct swapin after PTL unlock */
> > -       if (need_clear_cache)
> > +       if (need_clear_cache) {
> >                 swapcache_clear(si, entry, nr_pages);
> > +               wake_up(&swapcache_wq);
>
> Agree with Ying that here the common path will need to take a lock to
> wake up the wait queue.

Checking waitqueue_active() before calling wake_up() might be a good
candidate for avoiding that.
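
The idea can be sketched in userspace as a cheap "any sleepers?" check in
front of the locked wake-up path. The toy_* names below are hypothetical
stand-ins and not the kernel API. In the kernel, a bare waitqueue_active()
check needs a memory barrier paired with the waiter side (wq_has_sleeper()
includes one); the sequentially consistent atomic load here only gestures
at that requirement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_waitqueue {
	atomic_int nr_waiters;	/* stands in for "waiter list is non-empty" */
	int locked_wakeups;	/* how often the slow, locked path ran */
};

/* Lockless check, analogous in spirit to waitqueue_active(). */
static bool toy_waitqueue_active(struct toy_waitqueue *wq)
{
	assert(wq != NULL);
	return atomic_load(&wq->nr_waiters) > 0;
}

/* A real wake_up() would take the queue spinlock here. */
static void toy_wake_up(struct toy_waitqueue *wq)
{
	wq->locked_wakeups++;
}

/* Release path: skip the locked wake-up when nobody is sleeping. */
static void toy_release(struct toy_waitqueue *wq)
{
	if (toy_waitqueue_active(wq))
		toy_wake_up(wq);
}
```

In the common case where no task has failed swapcache_prepare(), the
release path would then only pay for the check and not for the lock.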

>
> Chris
>
>
> > +       }
> >         if (si)
> >                 put_swap_device(si);
> >         return ret;
> > @@ -4625,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >                 folio_unlock(swapcache);
> >                 folio_put(swapcache);
> >         }
> > -       if (need_clear_cache)
> > +       if (need_clear_cache) {
> >                 swapcache_clear(si, entry, nr_pages);
> > +               wake_up(&swapcache_wq);
> > +       }
> >         if (si)
> >                 put_swap_device(si);
> >         return ret;
> > --
> > 2.34.1
> >

Thanks
Barry
