[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2bab28c7-08c0-7ff0-c70e-9bf94da05ce1@redhat.com>
Date: Tue, 30 Mar 2021 17:01:08 +0200
From: David Hildenbrand <david@...hat.com>
To: Jann Horn <jannh@...gle.com>
Cc: kernel list <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Arnd Bergmann <arnd@...db.de>, Michal Hocko <mhocko@...e.com>,
Oscar Salvador <osalvador@...e.de>,
Matthew Wilcox <willy@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Minchan Kim <minchan@...nel.org>,
Jason Gunthorpe <jgg@...pe.ca>,
Dave Hansen <dave.hansen@...el.com>,
Hugh Dickins <hughd@...gle.com>,
Rik van Riel <riel@...riel.com>,
"Michael S . Tsirkin" <mst@...hat.com>,
"Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
Vlastimil Babka <vbabka@...e.cz>,
Richard Henderson <rth@...ddle.net>,
Ivan Kokshaysky <ink@...assic.park.msu.ru>,
Matt Turner <mattst88@...il.com>,
Thomas Bogendoerfer <tsbogend@...ha.franken.de>,
"James E.J. Bottomley" <James.Bottomley@...senpartnership.com>,
Helge Deller <deller@....de>, Chris Zankel <chris@...kel.net>,
Max Filippov <jcmvbkbc@...il.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Peter Xu <peterx@...hat.com>,
Rolf Eike Beer <eike-kernel@...tec.de>,
linux-alpha@...r.kernel.org, linux-mips@...r.kernel.org,
linux-parisc@...r.kernel.org, linux-xtensa@...ux-xtensa.org,
linux-arch <linux-arch@...r.kernel.org>,
Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH v1 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE)
to prefault/prealloc memory
[...]
>>
>> Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE with the
>> following semantics:
>> 1. MADV_POPULATE_READ can be used to preallocate backend memory and
>> prefault page tables just like manually reading each individual page.
>> This will not break any COW mappings -- e.g., it will populate the
>> shared zeropage when applicable.
>
> Please clarify what is meant by "backend memory". As far as I can tell
> from looking at the code, MADV_POPULATE_READ on file mappings will
> allocate zeroed memory in the page cache, and map it as readonly pages
> into userspace, but any attempt to actually write to that memory will
> trigger the filesystem's ->page_mkwrite handler; and e.g. ext4 will
> only try to allocate disk blocks at that point, which may fail. So as
> far as I can tell, for files on filesystems like ext4, the current
> implementation of MADV_POPULATE_READ does not replace fallocate(). Am
> I missing something?
Thanks for pointing that out, I guess I was blinded by tmpfs/hugetlbfs
behavior. There might be cases (!tmpfs, !hugetlbfs) where we indeed need
fallocate()+MADV_POPULATE_READ on file mappings.
The logic is essentially what mlock()/MAP_POPULATE does via
populate_vma_page_range() on shared mappings, so I assumed it would
always properly allocate backend memory.
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
* and we would not want to dirty them for nothing.
*/
My tests with MADV_POPULATE_READ:
1. MAP_SHARED on tmpfs: memory in the file is allocated
2. MAP_PRIVATE on tmpfs: memory in the file is allocated
3. MAP_SHARED on hugetlbfs: memory in the file is allocated
4. MAP_PRIVATE on hugetlbfs: memory in the file is *not* allocated
5. MAP_SHARED on ext4: memory in the file is *not* allocated
6. MAP_PRIVATE on ext4: memory in the file is *not* allocated
1..4 are also the reason why it works with memfd as expected.
For 4 and 6 it's not bad: writing to the private mapping will not result
in backend storage/blocks having to get allocated. So the backend
storage is actually RAM (although we don't allocate backend storage here
but use the shared zero page, but that's a different story).
For 5. we indeed need fallocate() before MADV_POPULATE_READ in case we
could have holes.
Thanks for pointing that out.
>
> If the desired semantics are that disk blocks should be preallocated,
> I think you may have to look up the ->vm_file and then internally call
> vfs_fallocate() to address this, kinda like in madvise_remove()?
Does not sound too complicated, but devil might be in the details. At
least for MAP_SHARED this might be the right thing to do. As discussed
above, for MAP_PRIVATE we usually don't want to do that (and SHMEM is
just weird).
I honestly do wonder if breaking with MAP_POPULATE semantics is
beneficial. For my use cases, doing fallocate() plus MADV_POPULATE_READ
on shared, file-backed mappings would certainly be sufficient. But
having a simple, consistent behavior would be much nicer.
I'll give it a thought!
>> 2. If MADV_POPULATE_READ succeeds, all page tables have been populated
>> (prefaulted) readable once.
>> 3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
>> prefault page tables just like manually writing (or
>> reading+writing) each individual page. This will break any COW
>> mappings -- e.g., the shared zeropage is never populated.
>> 4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
>> (prefaulted) writable once.
>> 5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
>> mappings marked with VM_PFNMAP and VM_IO. Also, proper access
>> permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
>> mapping is encountered, madvise() fails with -EINVAL.
>> 6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
>> might have been populated. In that case, madvise() fails with
>> -ENOMEM.
>
> AFAICS that's not true (or misphrased). If MADV_POPULATE_*
> successfully populates a bunch of pages, then fails because of an
> error (e.g. EHWPOISON), it will return EHWPOISON, not ENOMEM, right?
Indeed, leftover from previous version. It's clearer in the man page I
prepared, will fix it up.
>
>> 7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
>> when encountering a HW poisoned page in the range.
>> 8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
>> cannot protect from the OOM (Out Of Memory) handler killing the
>> process.
>>
>> While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
>> preallocate memory and prefault page tables for VMs), there are valid use
>> cases for MADV_POPULATE_READ:
>> 1. Efficiently populate page tables with zero pages (i.e., shared
>> zeropage). This is necessary when using userfaultfd() WP (Write-Protect
>> to properly catch all modifications within a mapping: for
>> write-protection to be effective for a virtual address, there has to be
>> a page already mapped -- even if it's the shared zeropage.
>
> This sounds like a hack to work around issues that would be better
> addressed by improving userfaultfd?
There are plans to do that, indeed.
>
>> 2. Pre-read a whole mapping from backend storage without marking it
>> dirty, such that eviction won't have to write it back. If no backend
>> memory has been allocated yet, allocate the backend memory. Helpful
>> when preallocating/prefaulting a file stored on disk without having
>> to writeback each and every page on eviction.
>
> This sounds reasonable to me.
Yes, the case with holes / backend memory has to be clarified.
>
>> Although sparse memory mappings are the primary use case, this will
>> also be useful for ordinary preallocations where MAP_POPULATE is not
>> desired especially in QEMU, where users can trigger preallocation of
>> guest RAM after the mapping was created.
>>
>> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
>> however, the main motivation back than was performance improvements
>> (which should also still be the case, but it is a secondary concern).
>>
>> V. Single-threaded performance comparison
>>
>> There is a performance benefit when using POPULATE_READ / POPULATE_WRITE
>> already when only using a single thread to do prefaulting/preallocation. As
>> we have less pagefaults for huge pages, the performance benefit is
>> negligible with small mappings.
> [...]
>> diff --git a/mm/gup.c b/mm/gup.c
> [...]
>> +long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
>> + unsigned long end, bool write, int *locked)
>> +{
>> + struct mm_struct *mm = vma->vm_mm;
>> + unsigned long nr_pages = (end - start) / PAGE_SIZE;
>> + int gup_flags;
>> +
>> + VM_BUG_ON(!PAGE_ALIGNED(start));
>> + VM_BUG_ON(!PAGE_ALIGNED(end));
>> + VM_BUG_ON_VMA(start < vma->vm_start, vma);
>> + VM_BUG_ON_VMA(end > vma->vm_end, vma);
>> + mmap_assert_locked(mm);
>> +
>> + /*
>> + * FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
>> + * a poisoned page.
>> + * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
>> + * !FOLL_FORCE: Require proper access permissions.
>> + */
>> + gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
>> + if (write)
>> + gup_flags |= FOLL_WRITE;
>> +
>> + /*
>> + * See check_vma_flags(): Will return -EFAULT on incompatible mappings
>> + * or with insufficient permissions.
>> + */
>> + return __get_user_pages(mm, start, nr_pages, gup_flags,
>> + NULL, NULL, locked);
>
> You mentioned in the commit message that you don't want to actually
> dirty all the file pages and force writeback; but doesn't
> POPULATE_WRITE still do exactly that? In follow_page_pte(), if
> FOLL_TOUCH and FOLL_WRITE are set, we mark the page as dirty:
Well, I mention that POPULATE_READ explicitly doesn't do that. I
primarily set it because populate_vma_page_range() also sets it.
Is it safe to *not* set it? IOW, fault something writable into a page
table (where the CPU could dirty it without additional page faults)
without marking it accessed? For me, this made logically sense. Thus I
also understood why populate_vma_page_range() set it.
>
> if (flags & FOLL_TOUCH) {
> if ((flags & FOLL_WRITE) &&
> !pte_dirty(pte) && !PageDirty(page))
> set_page_dirty(page);
> /*
> * pte_mkyoung() would be more correct here, but atomic care
> * is needed to avoid losing the dirty bit: it is easier to use
> * mark_page_accessed().
> */
> mark_page_accessed(page);
> }
>
>
>> +}
>> +
>> /*
>> * __mm_populate - populate and/or mlock pages within a range of address space.
>> *
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 3f22c4ceb7b5..ee398696380f 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -335,6 +335,9 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
>> #ifdef CONFIG_MMU
>> extern long populate_vma_page_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end, int *locked);
>> +extern long faultin_vma_page_range(struct vm_area_struct *vma,
>> + unsigned long start, unsigned long end,
>> + bool write, int *locked);
>> extern void munlock_vma_pages_range(struct vm_area_struct *vma,
>> unsigned long start, unsigned long end);
>> static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 01fef79ac761..857460873f7a 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -53,6 +53,8 @@ static int madvise_need_mmap_write(int behavior)
>> case MADV_COLD:
>> case MADV_PAGEOUT:
>> case MADV_FREE:
>> + case MADV_POPULATE_READ:
>> + case MADV_POPULATE_WRITE:
>> return 0;
>> default:
>> /* be safe, default to 1. list exceptions explicitly */
>> @@ -822,6 +824,64 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
>> return -EINVAL;
>> }
>>
>> +static long madvise_populate(struct vm_area_struct *vma,
>> + struct vm_area_struct **prev,
>> + unsigned long start, unsigned long end,
>> + int behavior)
>> +{
>> + const bool write = behavior == MADV_POPULATE_WRITE;
>> + struct mm_struct *mm = vma->vm_mm;
>> + unsigned long tmp_end;
>> + int locked = 1;
>> + long pages;
>> +
>> + *prev = vma;
>> +
>> + while (start < end) {
>> + /*
>> + * We might have temporarily dropped the lock. For example,
>> + * our VMA might have been split.
>> + */
>> + if (!vma || start >= vma->vm_end) {
>> + vma = find_vma(mm, start);
>> + if (!vma || start < vma->vm_start)
>> + return -ENOMEM;
>> + }
>> +
>> + tmp_end = min_t(unsigned long, end, vma->vm_end);
>> + /* Populate (prefault) page tables readable/writable. */
>> + pages = faultin_vma_page_range(vma, start, tmp_end, write,
>> + &locked);
>> + if (!locked) {
>> + mmap_read_lock(mm);
>> + locked = 1;
>> + *prev = NULL;
>> + vma = NULL;
>> + }
>> + if (pages < 0) {
>> + switch (pages) {
>> + case -EINTR:
>> + return -EINTR;
>> + case -EFAULT: /* Incompatible mappings / permissions. */
>> + return -EINVAL;
>> + case -EHWPOISON:
>> + return -EHWPOISON;
>> + case -EBUSY:
>
> What is -EBUSY doing here? __get_user_pages() fixes up -EBUSY from
> faultin_page() to 0, right?
>
>> + case -EAGAIN:
>
> Where can -EAGAIN come from?
On both points: the lack of documentation on return values made me add
these. The faultin_page() path is indeed fine. If the other paths don't
yield any such return values, we can drop both.
Thanks for the review!
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists