linux-kernel - Re: [HMM 03/15] mm/unaddressable-memory: new type of ZONE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAPcyv4jr=CNuaGQt80SwR5dpiXy_pDr8aD-w0EtLNE4oGC8WcQ@mail.gmail.com>
Date:   Sun, 23 Apr 2017 06:13:27 -0700
From:   Dan Williams <dan.j.williams@...el.com>
To:     Jerome Glisse <jglisse@...hat.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linux MM <linux-mm@...ck.org>,
        John Hubbard <jhubbard@...dia.com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        David Nellans <dnellans@...dia.com>,
        Ross Zwisler <ross.zwisler@...ux.intel.com>
Subject: Re: [HMM 03/15] mm/unaddressable-memory: new type of ZONE_DEVICE for
 unaddressable memory

On Sat, Apr 22, 2017 at 11:11 AM, Jerome Glisse <jglisse@...hat.com> wrote:
> On Fri, Apr 21, 2017 at 10:30:01PM -0700, Dan Williams wrote:
>> On Fri, Apr 21, 2017 at 8:30 PM, Jérôme Glisse <jglisse@...hat.com> wrote:
>
> [...]
>
>> > +/*
>> > + * Specialize ZONE_DEVICE memory into multiple types each having differents
>> > + * usage.
>> > + *
>> > + * MEMORY_DEVICE_PERSISTENT:
>> > + * Persistent device memory (pmem): struct page might be allocated in different
>> > + * memory and architecture might want to perform special actions. It is similar
>> > + * to regular memory, in that the CPU can access it transparently. However,
>> > + * it is likely to have different bandwidth and latency than regular memory.
>> > + * See Documentation/nvdimm/nvdimm.txt for more information.
>> > + *
>> > + * MEMORY_DEVICE_UNADDRESSABLE:
>> > + * Device memory that is not directly addressable by the CPU: CPU can neither
>> > + * read nor write _UNADDRESSABLE memory. In this case, we do still have struct
>> > + * pages backing the device memory. Doing so simplifies the implementation, but
>> > + * it is important to remember that there are certain points at which the struct
>> > + * page must be treated as an opaque object, rather than a "normal" struct page.
>> > + * A more complete discussion of unaddressable memory may be found in
>> > + * include/linux/hmm.h and Documentation/vm/hmm.txt.
>> > + */
>> > +enum memory_type {
>> > +       MEMORY_DEVICE_PERSISTENT = 0,
>> > +       MEMORY_DEVICE_UNADDRESSABLE,
>> > +};
>>
>> Ok, this is a bikeshed, but I think it is important. I think these
>> should be called MEMORY_DEVICE_PUBLIC and MEMORY_DEVICE_PRIVATE. The
>> reason is that persistence has nothing to do with the code paths that
>> deal with the pmem use case of ZONE_DEVICE. The only property the mm
>> cares about is that the address range behaves the same as host memory
>> for dma and cpu accesses. The "unaddressable" designation always
>> confuses me because a memory range isn't memory if it's
>> "unaddressable". It is addressable, it's just "private" to the device.
>
> I can change the name but the memory is truely unaddressable, the CPU
> can not access it whatsoever (well it can access a small window but
> even that is not guaranteed).
>

Understood, but that's still "addressable only by certain agents or
through a proxy" which seems closer to "private" to me.

>
>> > +/*
>> > + * For MEMORY_DEVICE_UNADDRESSABLE we use ZONE_DEVICE and extend it with two
>> > + * callbacks:
>> > + *   page_fault()
>> > + *   page_free()
>> > + *
>> > + * Additional notes about MEMORY_DEVICE_UNADDRESSABLE may be found in
>> > + * include/linux/hmm.h and Documentation/vm/hmm.txt. There is also a brief
>> > + * explanation in include/linux/memory_hotplug.h.
>> > + *
>> > + * The page_fault() callback must migrate page back, from device memory to
>> > + * system memory, so that the CPU can access it. This might fail for various
>> > + * reasons (device issues,  device have been unplugged, ...). When such error
>> > + * conditions happen, the page_fault() callback must return VM_FAULT_SIGBUS and
>> > + * set the CPU page table entry to "poisoned".
>> > + *
>> > + * Note that because memory cgroup charges are transferred to the device memory,
>> > + * this should never fail due to memory restrictions. However, allocation
>> > + * of a regular system page might still fail because we are out of memory. If
>> > + * that happens, the page_fault() callback must return VM_FAULT_OOM.
>> > + *
>> > + * The page_fault() callback can also try to migrate back multiple pages in one
>> > + * chunk, as an optimization. It must, however, prioritize the faulting address
>> > + * over all the others.
>> > + *
>> > + *
>> > + * The page_free() callback is called once the page refcount reaches 1
>> > + * (ZONE_DEVICE pages never reach 0 refcount unless there is a refcount bug.
>> > + * This allows the device driver to implement its own memory management.)
>> > + */
>> > +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
>> > +                               unsigned long addr,
>> > +                               struct page *page,
>> > +                               unsigned int flags,
>> > +                               pmd_t *pmdp);
>> > +typedef void (*dev_page_free_t)(struct page *page, void *data);
>> > +
>> >  /**
>> >   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
>> > + * @page_fault: callback when CPU fault on an unaddressable device page
>> > + * @page_free: free page callback when page refcount reaches 1
>> >   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>> >   * @res: physical address range covered by @ref
>> >   * @ref: reference count that pins the devm_memremap_pages() mapping
>> >   * @dev: host device of the mapping for debug
>> > + * @data: private data pointer for page_free()
>> > + * @type: memory type: see MEMORY_* in memory_hotplug.h
>> >   */
>> >  struct dev_pagemap {
>> > +       dev_page_fault_t page_fault;
>> > +       dev_page_free_t page_free;
>> >         struct vmem_altmap *altmap;
>> >         const struct resource *res;
>> >         struct percpu_ref *ref;
>> >         struct device *dev;
>> > +       void *data;
>> > +       enum memory_type type;
>> >  };
>>
>> I think the only attribute that belongs here is @type, the rest are
>> specific to the device_private case.
>>
>> struct dev_private_pagemap {
>>     dev_page_fault_t page_fault;
>>     dev_page_free_t page_free;
>>     void *data;
>>     struct dev_pagemap pgmap;
>> };
>>
>> As Logan pointed out we can kill the internal struct page_map in
>> kernel/memremap.c and let devm_memremap_pages() take a "struct
>> dev_pagemap *" pointer. That way when future dev_pagemap users come
>> along they can wrap the core structure with whatever specific data
>> they need for their use case.
>
> I don't want to make that change as part of HMM, this can be done
> as a patchset on top. Also it is easier to add callback to dev_pagemap
> especially as we want p2p and HMM side by side not as something
> exclusive.

Fair enough, I'm ok to do the conversion as a post HMM merge cleanup.
Does the p2p case actually need ->fault() and ->free()? I'd rather we
start with dev_private_pagemap and if p2p needs a subset of the same
fields we can think about a rename or shuffling things around.

>
> [...]
>
>> > +#if IS_ENABLED(CONFIG_DEVICE_UNADDRESSABLE)
>> > +static inline swp_entry_t make_device_entry(struct page *page, bool write)
>> > +{
>> > +       return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
>> > +                        page_to_pfn(page));
>> > +}
>> > +
>> > +static inline bool is_device_entry(swp_entry_t entry)
>> > +{
>> > +       int type = swp_type(entry);
>> > +       return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
>> > +}
>> > +
>> > +static inline void make_device_entry_read(swp_entry_t *entry)
>> > +{
>> > +       *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
>> > +}
>> > +
>> > +static inline bool is_write_device_entry(swp_entry_t entry)
>> > +{
>> > +       return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
>> > +}
>> > +
>> > +static inline struct page *device_entry_to_page(swp_entry_t entry)
>> > +{
>> > +       return pfn_to_page(swp_offset(entry));
>> > +}
>> > +
>> > +int device_entry_fault(struct vm_area_struct *vma,
>> > +                      unsigned long addr,
>> > +                      swp_entry_t entry,
>> > +                      unsigned int flags,
>> > +                      pmd_t *pmdp);
>>
>>
>> The use of the "device" term is ambiguous. Since these changes are
>> specific to the MEMORY_DEVICE_PRIVATE can we reflect "private" in the
>> name? Change "device" to "devpriv" or "device_private"?
>
> Yes we can.
>
> [...]
>
>> > @@ -194,12 +196,47 @@ void put_zone_device_page(struct page *page)
>> >          * ZONE_DEVICE page refcount should never reach 0 and never be freed
>> >          * to kernel memory allocator.
>> >          */
>> > -       page_ref_dec(page);
>> > +       int count = page_ref_dec_return(page);
>> > +
>> > +       /*
>> > +        * If refcount is 1 then page is freed and refcount is stable as nobody
>> > +        * holds a reference on the page.
>> > +        */
>> > +       if (page->pgmap->page_free && count == 1)
>> > +               page->pgmap->page_free(page, page->pgmap->data);
>>
>>
>> Reading this it isn't clear whether we are dealing with the public or
>> private case. I think this should be:
>>
>> if (count == 1)
>>         zone_device_page_free(page);
>>
>> ...where zone_device_page_free() is a helper that is explicit about
>> the expectations of which ZONE_DEVICE types need to take action on a
>> 'free' event.
>>
>> static void zone_device_page_free(struct page *page)
>> {
>>         struct dev_private_pagemap *priv_pgmap;
>>
>>         if (page->pgmap->type == MEMORY_DEVICE_PUBLIC)
>>                 return;
>>
>>         priv_pgmap = container_of(page->pgmap, typeof(*priv_pgmap), pgmap);
>>         priv_pgmap->page_free(page);
>> }
>
> Ok.
>
>> > @@ -332,6 +369,10 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
>> >         }
>> >         pgmap->ref = ref;
>> >         pgmap->res = &page_map->res;
>> > +       pgmap->type = MEMORY_DEVICE_PERSISTENT;
>> > +       pgmap->page_fault = NULL;
>> > +       pgmap->page_free = NULL;
>> > +       pgmap->data = NULL;
>>
>>
>> This goes away if we convert to passing in the dev_pagemap, and I
>> think that is a useful cleanup.
>
> Yes but i rather do it as a patchset on top as i am sure p2p folks
> will want to give input and i don't want to delay until we can agree
> on exact fields we want/need.

Ok.

>
> [...]
>
>> > diff --git a/mm/mprotect.c b/mm/mprotect.c
>> > index 8edd0d5..dadb020 100644
>> > --- a/mm/mprotect.c
>> > +++ b/mm/mprotect.c
>> > @@ -126,6 +126,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>> >
>> >                                 pages++;
>> >                         }
>> > +
>> > +                       if (is_write_device_entry(entry)) {
>> > +                               pte_t newpte;
>> > +
>> > +                               /*
>> > +                                * We do not preserve soft-dirtiness. See
>> > +                                * copy_one_pte() for explanation.
>> > +                                */
>> > +                               make_device_entry_read(&entry);
>> > +                               newpte = swp_entry_to_pte(entry);
>> > +                               set_pte_at(mm, addr, pte, newpte);
>> > +
>> > +                               pages++;
>> > +                       }
>> >                 }
>> >         } while (pte++, addr += PAGE_SIZE, addr != end);
>> >         arch_leave_lazy_mmu_mode();
>>
>> I think this is a good example of how the usage of the generic
>> "device" term makes the new "private" code paths harder to spot /
>> discern from the current "public" requirements.
>
> I will change to devpriv
>
> Thank for reviewing.

Thanks for working through this late breaking feedback.