lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Fri, 24 Jun 2022 11:13:38 -0500 From: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@....com> To: David Hildenbrand <david@...hat.com>, Alistair Popple <apopple@...dia.com>, akpm@...ux-foundation.org Cc: Felix Kuehling <felix.kuehling@....com>, jgg@...dia.com, linux-mm@...ck.org, rcampbell@...dia.com, linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org, amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org, hch@....de, jglisse@...hat.com, willy@...radead.org Subject: Re: [PATCH v5 01/13] mm: add zone device coherent type memory support On 6/23/2022 1:21 PM, David Hildenbrand wrote: > On 23.06.22 20:20, Sierra Guiza, Alejandro (Alex) wrote: >> On 6/23/2022 2:57 AM, David Hildenbrand wrote: >>> On 23.06.22 01:16, Sierra Guiza, Alejandro (Alex) wrote: >>>> On 6/21/2022 11:16 AM, David Hildenbrand wrote: >>>>> On 21.06.22 18:08, Sierra Guiza, Alejandro (Alex) wrote: >>>>>> On 6/21/2022 7:25 AM, David Hildenbrand wrote: >>>>>>> On 21.06.22 13:55, Alistair Popple wrote: >>>>>>>> David Hildenbrand<david@...hat.com> writes: >>>>>>>> >>>>>>>>> On 21.06.22 13:25, Felix Kuehling wrote: >>>>>>>>>> Am 6/17/22 um 23:19 schrieb David Hildenbrand: >>>>>>>>>>> On 17.06.22 21:27, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>>>>> On 6/17/2022 12:33 PM, David Hildenbrand wrote: >>>>>>>>>>>>> On 17.06.22 19:20, Sierra Guiza, Alejandro (Alex) wrote: >>>>>>>>>>>>>> On 6/17/2022 4:40 AM, David Hildenbrand wrote: >>>>>>>>>>>>>>> On 31.05.22 22:00, Alex Sierra wrote: >>>>>>>>>>>>>>>> Device memory that is cache coherent from device and CPU point of view. >>>>>>>>>>>>>>>> This is used on platforms that have an advanced system bus (like CAPI >>>>>>>>>>>>>>>> or CXL). Any page of a process can be migrated to such memory. However, >>>>>>>>>>>>>>>> no one should be allowed to pin such memory so that it can always be >>>>>>>>>>>>>>>> evicted. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Signed-off-by: Alex Sierra<alex.sierra@....com> >>>>>>>>>>>>>>>> Acked-by: Felix Kuehling<Felix.Kuehling@....com> >>>>>>>>>>>>>>>> Reviewed-by: Alistair Popple<apopple@...dia.com> >>>>>>>>>>>>>>>> [hch: rebased ontop of the refcount changes, >>>>>>>>>>>>>>>> removed is_dev_private_or_coherent_page] >>>>>>>>>>>>>>>> Signed-off-by: Christoph Hellwig<hch@....de> >>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>> include/linux/memremap.h | 19 +++++++++++++++++++ >>>>>>>>>>>>>>>> mm/memcontrol.c | 7 ++++--- >>>>>>>>>>>>>>>> mm/memory-failure.c | 8 ++++++-- >>>>>>>>>>>>>>>> mm/memremap.c | 10 ++++++++++ >>>>>>>>>>>>>>>> mm/migrate_device.c | 16 +++++++--------- >>>>>>>>>>>>>>>> mm/rmap.c | 5 +++-- >>>>>>>>>>>>>>>> 6 files changed, 49 insertions(+), 16 deletions(-) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >>>>>>>>>>>>>>>> index 8af304f6b504..9f752ebed613 100644 >>>>>>>>>>>>>>>> --- a/include/linux/memremap.h >>>>>>>>>>>>>>>> +++ b/include/linux/memremap.h >>>>>>>>>>>>>>>> @@ -41,6 +41,13 @@ struct vmem_altmap { >>>>>>>>>>>>>>>> * A more complete discussion of unaddressable memory may be found in >>>>>>>>>>>>>>>> * include/linux/hmm.h and Documentation/vm/hmm.rst. >>>>>>>>>>>>>>>> * >>>>>>>>>>>>>>>> + * MEMORY_DEVICE_COHERENT: >>>>>>>>>>>>>>>> + * Device memory that is cache coherent from device and CPU point of view. This >>>>>>>>>>>>>>>> + * is used on platforms that have an advanced system bus (like CAPI or CXL). A >>>>>>>>>>>>>>>> + * driver can hotplug the device memory using ZONE_DEVICE and with that memory >>>>>>>>>>>>>>>> + * type. Any page of a process can be migrated to such memory. However no one >>>>>>>>>>>>>>> Any page might not be right, I'm pretty sure. ... just thinking about special pages >>>>>>>>>>>>>>> like vdso, shared zeropage, ... pinned pages ... >>>>>>>>>>>>> Well, you cannot migrate long term pages, that's what I meant :) >>>>>>>>>>>>> >>>>>>>>>>>>>>>> + * should be allowed to pin such memory so that it can always be evicted. >>>>>>>>>>>>>>>> + * >>>>>>>>>>>>>>>> * MEMORY_DEVICE_FS_DAX: >>>>>>>>>>>>>>>> * Host memory that has similar access semantics as System RAM i.e. DMA >>>>>>>>>>>>>>>> * coherent and supports page pinning. In support of coordinating page >>>>>>>>>>>>>>>> @@ -61,6 +68,7 @@ struct vmem_altmap { >>>>>>>>>>>>>>>> enum memory_type { >>>>>>>>>>>>>>>> /* 0 is reserved to catch uninitialized type fields */ >>>>>>>>>>>>>>>> MEMORY_DEVICE_PRIVATE = 1, >>>>>>>>>>>>>>>> + MEMORY_DEVICE_COHERENT, >>>>>>>>>>>>>>>> MEMORY_DEVICE_FS_DAX, >>>>>>>>>>>>>>>> MEMORY_DEVICE_GENERIC, >>>>>>>>>>>>>>>> MEMORY_DEVICE_PCI_P2PDMA, >>>>>>>>>>>>>>>> @@ -143,6 +151,17 @@ static inline bool folio_is_device_private(const struct folio *folio) >>>>>>>>>>>>>>> In general, this LGTM, and it should be correct with PageAnonExclusive I think. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> However, where exactly is pinning forbidden? >>>>>>>>>>>>>> Long-term pinning is forbidden since it would interfere with the device >>>>>>>>>>>>>> memory manager owning the >>>>>>>>>>>>>> device-coherent pages (e.g. evictions in TTM). However, normal pinning >>>>>>>>>>>>>> is allowed on this device type. >>>>>>>>>>>>> I don't see updates to folio_is_pinnable() in this patch. >>>>>>>>>>>> Device coherent type pages should return true here, as they are pinnable >>>>>>>>>>>> pages. >>>>>>>>>>> That function is only called for long-term pinnings in try_grab_folio(). >>>>>>>>>>> >>>>>>>>>>>>> So wouldn't try_grab_folio() simply pin these pages? What am I missing? >>>>>>>>>>>> As far as I understand this return NULL for long term pin pages. >>>>>>>>>>>> Otherwise they get refcount incremented. >>>>>>>>>>> I don't follow. >>>>>>>>>>> >>>>>>>>>>> You're saying >>>>>>>>>>> >>>>>>>>>>> a) folio_is_pinnable() returns true for device coherent pages >>>>>>>>>>> >>>>>>>>>>> and that >>>>>>>>>>> >>>>>>>>>>> b) device coherent pages don't get long-term pinned >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Yet, the code says >>>>>>>>>>> >>>>>>>>>>> struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags) >>>>>>>>>>> { >>>>>>>>>>> if (flags & FOLL_GET) >>>>>>>>>>> return try_get_folio(page, refs); >>>>>>>>>>> else if (flags & FOLL_PIN) { >>>>>>>>>>> struct folio *folio; >>>>>>>>>>> >>>>>>>>>>> /* >>>>>>>>>>> * Can't do FOLL_LONGTERM + FOLL_PIN gup fast path if not in a >>>>>>>>>>> * right zone, so fail and let the caller fall back to the slow >>>>>>>>>>> * path. >>>>>>>>>>> */ >>>>>>>>>>> if (unlikely((flags & FOLL_LONGTERM) && >>>>>>>>>>> !is_pinnable_page(page))) >>>>>>>>>>> return NULL; >>>>>>>>>>> ... >>>>>>>>>>> return folio; >>>>>>>>>>> } >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> What prevents these pages from getting long-term pinned as stated in this patch? >>>>>>>>>> Long-term pinning is handled by __gup_longterm_locked, which migrates >>>>>>>>>> pages returned by __get_user_pages_locked that cannot be long-term >>>>>>>>>> pinned. try_grab_folio is OK to grab the pages. Anything that can't be >>>>>>>>>> long-term pinned will be migrated afterwards, and >>>>>>>>>> __get_user_pages_locked will be retried. The migration of >>>>>>>>>> DEVICE_COHERENT pages was implemented by Alistair in patch 5/13 >>>>>>>>>> ("mm/gup: migrate device coherent pages when pinning instead of failing"). >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> __gup_longterm_locked()->check_and_migrate_movable_pages() >>>>>>>>> >>>>>>>>> Which checks folio_is_pinnable() and doesn't do anything if set. >>>>>>>>> >>>>>>>>> Sorry to be dense here, but I don't see how what's stated in this patch >>>>>>>>> works without adjusting folio_is_pinnable(). >>>>>>>> Ugh, I think you might be right about try_grab_folio(). >>>>>>>> >>>>>>>> We didn't update folio_is_pinnable() to include device coherent pages >>>>>>>> because device coherent pages are pinnable. It is really just >>>>>>>> FOLL_LONGTERM that we want to prevent here. >>>>>>>> >>>>>>>> For normal PUP that is done by my change in >>>>>>>> check_and_migrate_movable_pages() which migrates pages being pinned with >>>>>>>> FOLL_LONGTERM. But I think I incorrectly assumed we would take the >>>>>>>> pte_devmap() path in gup_pte_range(), which we don't for coherent pages. >>>>>>>> So I think the check in try_grab_folio() needs to be: >>>>>>> I think I said it already (and I might be wrong without reading the >>>>>>> code), but folio_is_pinnable() is *only* called for long-term pinnings. >>>>>>> >>>>>>> It should actually be called folio_is_longterm_pinnable(). >>>>>>> >>>>>>> That's where that check should go, no? >>>>>> David, I think you're right. We didn't catch this since the LONGTERM gup >>>>>> test we added to hmm-test only calls to pin_user_pages. Apparently >>>>>> try_grab_folio is called only from fast callers (ex. >>>>>> pin_user_pages_fast/get_user_pages_fast). I have added a conditional >>>>>> similar to what Alistair has proposed to return null on LONGTERM && >>>>>> (coherent_pages || folio_is_pinnable) at try_grab_folio. Also a new gup >>>>>> test was added with LONGTERM set that calls pin_user_pages_fast. >>>>>> Returning null under this condition it does causes the migration from >>>>>> dev to system memory. >>>>>> >>>>> Why can't coherent memory simply put its checks into >>>>> folio_is_pinnable()? I don't get it why we have to do things differently >>>>> here. >>>>> >>>>>> Actually, Im having different problems with a call to PageAnonExclusive >>>>>> from try_to_migrate_one during page fault from a HMM test that first >>>>>> migrate pages to device private and forks to mark as COW these pages. >>>>>> Apparently is catching the first BUG VM_BUG_ON_PGFLAGS(!PageAnon(page), >>>>>> page) >>>>> With or without this series? A backtrace would be great. >>>> Here's the back trace. This happens in a hmm-test added in this patch >>>> series. However, I have tried to isolate this BUG by just adding the COW >>>> test with private device memory only. This is only present as follows. >>>> Allocate anonymous mem->Migrate to private device memory->fork->try to >>>> access to parent's anonymous memory (which will suppose to trigger a >>>> page fault and migration to system mem). Just for the record, if the >>>> child is terminated before the parent's memory is accessed, this problem >>>> is not present. >>> The only usage of PageAnonExclusive() in try_to_migrate_one() is: >>> >>> anon_exclusive = folio_test_anon(folio) && >>> PageAnonExclusive(subpage); >>> >>> Which can only possibly fail if subpage is not actually part of the folio. >>> >>> >>> I see some controversial code in the the if (folio_is_zone_device(folio)) case later: >>> >>> * The assignment to subpage above was computed from a >>> * swap PTE which results in an invalid pointer. >>> * Since only PAGE_SIZE pages can currently be >>> * migrated, just set it to page. This will need to be >>> * changed when hugepage migrations to device private >>> * memory are supported. >>> */ >>> subpage = &folio->page; >>> >>> There we have our invalid pointer hint. >>> >>> I don't see how it could have worked if the child quit, though? Maybe >>> just pure luck? >>> >>> >>> Does the following fix your issue: >> Yes, it fixed the issue. Thanks. Should we include this patch in this >> patch series or separated? >> >> Regards, >> Alex Sierra > I'll send it right away "officially" so we can get it into 5.19. Can I > add your tested-by? Of course. Alex Sierra > >
Powered by blists - more mailing lists