Message-ID: <20220623142139.462a0841.alex.williamson@redhat.com>
Date: Thu, 23 Jun 2022 14:21:39 -0600
From: Alex Williamson <alex.williamson@...hat.com>
To: David Hildenbrand <david@...hat.com>
Cc: Jason Gunthorpe <jgg@...dia.com>, akpm@...ux-foundation.org,
minchan@...nel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, paulmck@...nel.org,
jhubbard@...dia.com, joaodias@...gle.com
Subject: Re: [PATCH] mm: Re-allow pinning of zero pfns
On Thu, 23 Jun 2022 20:07:14 +0200
David Hildenbrand <david@...hat.com> wrote:
> On 15.06.22 17:56, Jason Gunthorpe wrote:
> > On Sat, Jun 11, 2022 at 08:29:47PM +0200, David Hildenbrand wrote:
> >> On 11.06.22 00:35, Alex Williamson wrote:
> >>> The commit referenced below subtly and inadvertently changed the logic
> >>> to disallow pinning of zero pfns. This breaks device assignment with
> >>> vfio and potentially various other users of gup. Exclude the zero page
> >>> test from the negation.
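
For reference, a simplified sketch of the check in question, not the exact
upstream code (which also has to deal with CMA pages), with hypothetical
function names used only for illustration:

	/*
	 * broken: the zero-pfn test ends up inside the negation, so the
	 * zero page is reported as not pinnable
	 */
	static inline bool is_pinnable_page_broken(struct page *page)
	{
		return !(is_zone_movable_page(page) ||
			 is_zero_pfn(page_to_pfn(page)));
	}

	/*
	 * fixed: the zero-pfn test is excluded from the negation, so the
	 * zero page stays pinnable even though it is not migratable
	 */
	static inline bool is_pinnable_page_fixed(struct page *page)
	{
		return !is_zone_movable_page(page) ||
		       is_zero_pfn(page_to_pfn(page));
	}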
> >>
> >> I wonder which setups can reliably work with a long-term pin on a shared
> >> zeropage. In a MAP_PRIVATE mapping, any write access via the page tables
> >> will end up replacing the shared zeropage with an anonymous page.
> >> Something similar should apply in MAP_SHARED mappings, when lazily
> >> allocating disk blocks.
>
> ^ correction, the shared zeropage is never used in MAP_SHARED mappings
> (fortunately).
>
> >>
> >> In the future, we might trigger unsharing when taking a R/O pin for the
> >> shared zeropage, just like we do as of now upstream for shared anonymous
> >> pages (!PageAnonExclusive). And something similar could then be done
> >> when finding a !anon page in a MAP_SHARED mapping.
> >
> > I'm also confused how qemu is hitting this and it isn't already a bug?
> >
>
> I assume it's just some random thingy mapped into the guest physical
> address space (by the BIOS? R/O?) that actually never ends up getting
> used by a device.
>
> So vfio simply only needs this to keep working ... but we won't actually
> ever use that data.
>
> But this is just my best guess after thinking about it.
Good guess.
> > It is arising because vfio doesn't use FOLL_FORCE|FOLL_WRITE to move
> > away the zero page in most cases.
> >
> > And why does Yishai say it causes an infinite loop in the kernel?
>
>
> Good question. Maybe $something keeps retrying if pinning fails, either
> in the kernel (which would be bad) or in user space. At least QEMU seems
> to just fail if pinning fails, but maybe it's a different user space?
The loop is in __gup_longterm_locked():
	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     vmas, NULL, gup_flags);
		if (rc <= 0)
			break;
		rc = check_and_migrate_movable_pages(rc, pages, gup_flags);
	} while (!rc);
It appears we're pinning a 32-page (128K) range;
__get_user_pages_locked() returns 32, but
check_and_migrate_movable_pages() perpetually returns zero. I believe
this is because folio_is_pinnable() previously returned true, and now
returns false. Therefore we drop down to fail at folio_isolate_lru(),
incrementing isolation_error_count. From there we do nothing more than
unpin the pages, return zero, and hope for better luck next time, which
obviously doesn't happen.
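
Roughly, the path we keep hitting in check_and_migrate_movable_pages()
looks like the following (a paraphrased sketch, not the exact upstream
code): the zero page now fails the pinnable check, it is not on any LRU
list so isolation fails, we only bump the error count, unpin everything,
and return zero so the outer loop retries indefinitely.

	/* paraphrased sketch of check_and_migrate_movable_pages() */
	if (!folio_is_pinnable(folio)) {
		/* the zero page is not on an LRU, so isolation fails */
		if (folio_isolate_lru(folio)) {
			isolation_error_count++;
			continue;
		}
		list_add_tail(&folio->lru, &movable_page_list);
	}

	/*
	 * Later: nothing was queued for migration, but an isolation error
	 * was recorded, so all the pins are dropped, 0 is returned, and
	 * __gup_longterm_locked() goes around again with no better luck.
	 */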
If I generate an errno here, QEMU reports failing on the pc.rom memory
region at 0xc0000. Thanks,
Alex