[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87sfn4cj8u.fsf@nvdebian.thelocal>
Date: Thu, 14 Jul 2022 15:39:49 +1000
From: Alistair Popple <apopple@...dia.com>
To: David Hildenbrand <david@...hat.com>
Cc: Alex Sierra <alex.sierra@....com>, jgg@...dia.com,
Felix.Kuehling@....com, linux-mm@...ck.org, rcampbell@...dia.com,
linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org,
amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
hch@....de, jglisse@...hat.com, willy@...radead.org,
akpm@...ux-foundation.org
Subject: Re: [PATCH v8 07/15] mm/gup: migrate device coherent pages when
pinning instead of failing
David Hildenbrand <david@...hat.com> writes:
> On 07.07.22 21:03, Alex Sierra wrote:
>> From: Alistair Popple <apopple@...dia.com>
>>
>> Currently any attempts to pin a device coherent page will fail. This is
>> because device coherent pages need to be managed by a device driver, and
>> pinning them would prevent a driver from migrating them off the device.
>>
>> However this is no reason to fail pinning of these pages. These are
>> coherent and accessible from the CPU so can be migrated just like
>> pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
>> them first try migrating them out of ZONE_DEVICE.
>>
>> Signed-off-by: Alistair Popple <apopple@...dia.com>
>> Acked-by: Felix Kuehling <Felix.Kuehling@....com>
>> [hch: rebased to the split device memory checks,
>> moved migrate_device_page to migrate_device.c]
>> Signed-off-by: Christoph Hellwig <hch@....de>
>> ---
>> mm/gup.c | 47 +++++++++++++++++++++++++++++++++++-----
>> mm/internal.h | 1 +
>> mm/migrate_device.c | 53 +++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 96 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index b65fe8bf5af4..9b6b9923d22d 100644
>> --- a/mm/gup.c
>> +++ b/mm/gup.c
>> @@ -1891,9 +1891,43 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
>> continue;
>> prev_folio = folio;
>>
>> - if (folio_is_longterm_pinnable(folio))
>> + /*
>> + * Device private pages will get faulted in during gup so it
>> + * shouldn't be possible to see one here.
>> + */
>> + if (WARN_ON_ONCE(folio_is_device_private(folio))) {
>> + ret = -EFAULT;
>> + goto unpin_pages;
>> + }
>
> I'd just drop that. Device private pages are never part of a present PTE. So if we
> could actually get a grab of one via GUP we would be in bigger trouble ...
> already before this patch.
Fair.
>> +
>> + /*
>> + * Device coherent pages are managed by a driver and should not
>> + * be pinned indefinitely as it prevents the driver moving the
>> + * page. So when trying to pin with FOLL_LONGTERM instead try
>> + * to migrate the page out of device memory.
>> + */
>> + if (folio_is_device_coherent(folio)) {
>> + WARN_ON_ONCE(PageCompound(&folio->page));
>
> Maybe that belongs into migrate_device_page()?
Ok (noting Matthew's comment there as well).
>> +
>> + /*
>> + * Migration will fail if the page is pinned, so convert
>
> [...]
>
>> /*
>> * mm/gup.c
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index cf9668376c5a..5decd26dd551 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -794,3 +794,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
>> }
>> }
>> EXPORT_SYMBOL(migrate_vma_finalize);
>> +
>> +/*
>> + * Migrate a device coherent page back to normal memory. The caller should have
>> + * a reference on page which will be copied to the new page if migration is
>> + * successful or dropped on failure.
>> + */
>> +struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
>
> Function name should most probably indicate that we're dealing with coherent pages here?
Ok.
>> +{
>> + unsigned long src_pfn, dst_pfn = 0;
>> + struct migrate_vma args;
>> + struct page *dpage;
>> +
>> + lock_page(page);
>> + src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
>> + args.src = &src_pfn;
>> + args.dst = &dst_pfn;
>> + args.cpages = 1;
>> + args.npages = 1;
>> + args.vma = NULL;
>> + migrate_vma_setup(&args);
>> + if (!(src_pfn & MIGRATE_PFN_MIGRATE))
>> + return NULL;
>
> Wow, these refcount and page locking/unlocking rules with this migrate_* api are
> confusing now. And the usage here of sometimes returning and sometimes falling
> trough don't make it particularly easier to understand here.
>
> I'm not 100% happy about reusing migrate_vma_setup() usage if there *is no VMA*.
> That's just absolutely confusing, because usually migrate_vma_setup() itself
> would do the collection step and ref+lock pages. :/
>
> In general, I can see why/how we're reusing the migrate_vma_* API here, but there
> is absolutely no VMA ... not sure what to improve besides providing a second API
> that does a simple single-page migration. But that can be changed later ...
Yeah, as noted in your other response I think it should be ok to just
call migrate_vma_unmap() directly from migrate_device_page() so I assume
that would adequately deal with this.
>> +
>> + dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
>> +
>
> alloc_page()
>
>> + /*
>> + * get/pin the new page now so we don't have to retry gup after
>> + * migrating. We already have a reference so this should never fail.
>> + */
>> + if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
>> + __free_pages(dpage, 0);
>
> __free_page()
>
>> + dpage = NULL;
>> + }
>
> Hm, this means that we are not pinning via the PTE at hand, but via something
> we expect migration to put into the PTE. I'm not really happy about this.
>
> Ideally, we'd make the pinning decision only on the actual GUP path, not in here.
> Just like in the migrate_pages() case, where we end up dropping all refs/pins
> and looking up again via GUP from the PTE.
>
> For example, I wonder if something nasty could happen if the PTE got mapped
> R/O in the meantime and you're pinning R/W here ...
>
> TBH, all this special casing on gup_flags here is nasty. Please, let's just do
> it like migrate_pages() and do another GUP walk. Absolutely no need to optimize.
The only reason to pass gup_flags is to check FOLL_PIN vs. FOLL_GET so
that we can do the right reference on the destination page. I did the
optimisation because we already have the destination page with a
reference and GUP/PUP does not make any guarantees about the current PTE
state anyway.
However I noticed there might be a race here - during migration we
replace present PTEs with migration entries. On fork these get copied
via copy_nonpresent_pte() and made read-only. However we don't check if
the page a migration entry points to is pinned or not. For an ordinary
PTE copy_present_pte() would copy the page for a COW mapping, but this
won't happen if the page happens to be undergoing migration (even though
the migration will ultimately fail due to the pin).
Anyway I don't think this patch currently makes that any worse, but if
we fix the above it will because there is a brief period during which
the page we're pinning won't look like a pinned page.
So I will go with the suggestion to do another GUP walk.
> [...]
>
>
>
> I'd go with something like the following on top (which does not touch on the
> general semantic issue with migrate_vma_* ). Note that I most probably messed
> up some refcount/lock handling and that it's broken.
> Just to give you an idea what I think could be cleaner.
Thanks! At a glance it looks roughly right but I will check and respin
it to incorporate the comments.
> diff --git a/mm/gup.c b/mm/gup.c
> index 9b6b9923d22d..17041b3e605e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1881,7 +1881,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> unsigned long isolation_error_count = 0, i;
> struct folio *prev_folio = NULL;
> LIST_HEAD(movable_page_list);
> - bool drain_allow = true;
> + bool drain_allow = true, any_device_coherent = false;
> int ret = 0;
>
> for (i = 0; i < nr_pages; i++) {
> @@ -1891,15 +1891,6 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> continue;
> prev_folio = folio;
>
> - /*
> - * Device private pages will get faulted in during gup so it
> - * shouldn't be possible to see one here.
> - */
> - if (WARN_ON_ONCE(folio_is_device_private(folio))) {
> - ret = -EFAULT;
> - goto unpin_pages;
> - }
> -
> /*
> * Device coherent pages are managed by a driver and should not
> * be pinned indefinitely as it prevents the driver moving the
> @@ -1907,7 +1898,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> * to migrate the page out of device memory.
> */
> if (folio_is_device_coherent(folio)) {
> - WARN_ON_ONCE(PageCompound(&folio->page));
> + /*
> + * We always want a new GUP lookup with device coherent
> + * pages.
> + */
> + any_device_coherent = true;
> + pages[i] = 0;
>
> /*
> * Migration will fail if the page is pinned, so convert
> @@ -1918,11 +1914,12 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> unpin_user_page(&folio->page);
> }
>
> - pages[i] = migrate_device_page(&folio->page, gup_flags);
> - if (!pages[i]) {
> - ret = -EBUSY;
> + ret = migrate_device_coherent_page(&folio->page);
> + if (ret)
> goto unpin_pages;
> - }
> + /* The reference to our folio is stale now. */
> + prev_folio = NULL;
> + folio = NULL;
> continue;
> }
>
> @@ -1953,7 +1950,8 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> folio_nr_pages(folio));
> }
>
> - if (!list_empty(&movable_page_list) || isolation_error_count)
> + if (!list_empty(&movable_page_list) || isolation_error_count ||
> + any_device_coherent)
> goto unpin_pages;
>
> /*
> @@ -1963,14 +1961,19 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages,
> return nr_pages;
>
> unpin_pages:
> - for (i = 0; i < nr_pages; i++) {
> - if (!pages[i])
> - continue;
> + /* We have to be careful if we stumbled over device coherent pages. */
> + if (unlikely(any_device_coherent || !(gup_flags & FOLL_PIN))) {
> + for (i = 0; i < nr_pages; i++) {
> + if (!pages[i])
> + continue;
>
> - if (gup_flags & FOLL_PIN)
> - unpin_user_page(pages[i]);
> - else
> - put_page(pages[i]);
> + if (gup_flags & FOLL_PIN)
> + unpin_user_page(pages[i]);
> + else
> + put_page(pages[i]);
> + }
> + } else {
> + unpin_user_pages(pages, nr_pages);
> }
>
> if (!list_empty(&movable_page_list)) {
> diff --git a/mm/internal.h b/mm/internal.h
> index eeab4ee7a4a3..899dab512c5a 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -853,7 +853,7 @@ int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> unsigned long addr, int page_nid, int *flags);
>
> void free_zone_device_page(struct page *page);
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags);
> +int migrate_device_coherent_page(struct page *page);
>
> /*
> * mm/gup.c
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 5decd26dd551..dfb78ea3d326 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -797,53 +797,40 @@ EXPORT_SYMBOL(migrate_vma_finalize);
>
> /*
> * Migrate a device coherent page back to normal memory. The caller should have
> - * a reference on page which will be copied to the new page if migration is
> - * successful or dropped on failure.
> + * a reference on page, which will be dropped on return.
> */
> -struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
> +int migrate_device_coherent_page(struct page *page)
> {
> unsigned long src_pfn, dst_pfn = 0;
> - struct migrate_vma args;
> + struct migrate_vma args = {
> + .src = &src_pfn,
> + .dst = &dst_pfn,
> + .cpages = 1,
> + .npages = 1,
> + .vma = NULL,
> + };
> struct page *dpage;
>
> + VM_WARN_ON_ONCE(PageCompound(page));
> +
> lock_page(page);
> src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
> - args.src = &src_pfn;
> - args.dst = &dst_pfn;
> - args.cpages = 1;
> - args.npages = 1;
> - args.vma = NULL;
> - migrate_vma_setup(&args);
> - if (!(src_pfn & MIGRATE_PFN_MIGRATE))
> - return NULL;
> -
> - dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
> -
> - /*
> - * get/pin the new page now so we don't have to retry gup after
> - * migrating. We already have a reference so this should never fail.
> - */
> - if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
> - __free_pages(dpage, 0);
> - dpage = NULL;
> - }
>
> - if (dpage) {
> - lock_page(dpage);
> - dst_pfn = migrate_pfn(page_to_pfn(dpage));
> + migrate_vma_setup(&args);
> + if (src_pfn & MIGRATE_PFN_MIGRATE) {
> + dpage = alloc_page(GFP_USER | __GFP_NOWARN);
> + if (dpage) {
> + dst_pfn = migrate_pfn(page_to_pfn(dpage));
> + lock_page(dpage);
> + }
> }
>
> migrate_vma_pages(&args);
> if (src_pfn & MIGRATE_PFN_MIGRATE)
> copy_highpage(dpage, page);
> migrate_vma_finalize(&args);
> - if (dpage && !(src_pfn & MIGRATE_PFN_MIGRATE)) {
> - if (gup_flags & FOLL_PIN)
> - unpin_user_page(dpage);
> - else
> - put_page(dpage);
> - dpage = NULL;
> - }
>
> - return dpage;
> + if (src_pfn & MIGRATE_PFN_MIGRATE)
> + return 0;
> + return -EBUSY;
> }
> --
> 2.35.3
Powered by blists - more mailing lists