linux-kernel - Re: [PATCH 1/1] mm: handle swap page faults if the faulting page can be locked

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJuCfpEV1OiM423bykYQTxDC1=bQAqhAwd5fiKYifsk=seP6yw@mail.gmail.com>
Date:   Mon, 17 Apr 2023 11:13:29 -0700
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Alistair Popple <apopple@...dia.com>
Cc:     Matthew Wilcox <willy@...radead.org>, akpm@...ux-foundation.org,
        hannes@...xchg.org, mhocko@...e.com, josef@...icpanda.com,
        jack@...e.cz, ldufour@...ux.ibm.com, laurent.dufour@...ibm.com,
        michel@...pinasse.org, liam.howlett@...cle.com, jglisse@...gle.com,
        vbabka@...e.cz, minchan@...gle.com, dave@...olabs.net,
        punit.agrawal@...edance.com, lstoakes@...il.com,
        linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH 1/1] mm: handle swap page faults if the faulting page can
 be locked

On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <apopple@...dia.com> wrote:
>
>
> Matthew Wilcox <willy@...radead.org> writes:
>
> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
> >> When page fault is handled under VMA lock protection, all swap page
> >> faults are retried with mmap_lock because folio_lock_or_retry
> >> implementation has to drop and reacquire mmap_lock if folio could
> >> not be immediately locked.
> >> Instead of retrying all swapped page faults, retry only when folio
> >> locking fails.
> >
> > Reviewed-by: Matthew Wilcox (Oracle) <willy@...radead.org>
> >
> > Let's just review what can now be handled under the VMA lock instead of
> > the mmap_lock, in case somebody knows better than me that it's not safe.
> >
> >  - We can call migration_entry_wait().  This will wait for PG_locked to
> >    become clear (in migration_entry_wait_on_locked()).  As previously
> >    discussed offline, I think this is safe to do while holding the VMA
> >    locked.
>
> Do we even need to be holding the VMA locked while in
> migration_entry_wait()? My understanding is we're just waiting for
> PG_locked to be cleared so we can return with a reasonable chance the
> migration entry is gone. If for example it has been unmapped or
> protections downgraded we will simply refault.

If we drop VMA lock before migration_entry_wait() then we would need
to lock_vma_under_rcu again after the wait. In which case it might be
simpler to retry the fault with some special return code to indicate
that VMA lock is not held anymore and we want to retry without taking
mmap_lock. I think it's similar to the last options Matthew suggested
earlier. In which case we can reuse the same retry mechanism for both
cases, here and in __folio_lock_or_retry.

>
> >  - We can call remove_device_exclusive_entry().  That calls
> >    folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>
> Looks ok to me.
>
> >  - We can call pgmap->ops->migrate_to_ram().  Perhaps somebody familiar
> >    with Nouveau and amdkfd could comment on how safe this is?
>
> Currently this won't work because drives assume mmap_lock is held during
> pgmap->ops->migrate_to_ram(). Primarily this is because
> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
> that asserts mmap_lock is taken in walk_page_range() and also
> migrate_vma_insert_page().
>
> So I don't think we can call that case without mmap_lock.
>
> At a glance it seems it should be relatively easy to move to using
> lock_vma_under_rcu(). Drivers will need updating as well though because
> migrate_vma_setup() is called outside of fault handling paths so drivers
> will currently take mmap_lock rather than vma lock when looking up the
> vma. See for example nouveau_svmm_bind().

Thanks for the pointers, Alistair! It does look like we need to be
more careful with the migrate_to_ram() path. For now I can fallback to
retrying with mmap_lock for this case, like with do with all cases
today. Afterwards this path can be made ready for working under VMA
lock and we can remove that retry. Does that sound good?

>
> >  - I believe we can't call handle_pte_marker() because we exclude UFFD
> >    VMAs earlier.
> >  - We can call swap_readpage() if we allocate a new folio.  I haven't
> >    traced through all this code to tell if it's OK.
> >
> > So ... I believe this is all OK, but we're definitely now willing to
> > wait for I/O from the swap device while holding the VMA lock when we
> > weren't before.  And maybe we should make a bigger deal of it in the
> > changelog.
> >
> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > maybe we should be waiting for the folio lock with the VMA locked.
>