[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b476e461-aba9-e92b-d392-270029ab6b18@redhat.com>
Date: Thu, 27 Jan 2022 12:57:46 +0100
From: David Hildenbrand <david@...hat.com>
To: Mike Kravetz <mike.kravetz@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Michal Hocko <mhocko@...e.com>,
Naoya Horiguchi <naoya.horiguchi@...ux.dev>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Peter Xu <peterx@...hat.com>,
Andrea Arcangeli <aarcange@...hat.com>,
Mina Almasry <almasrymina@...gle.com>,
Shuah Khan <shuah@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
On 13.01.22 19:03, Mike Kravetz wrote:
> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
> testing. However, mremap support was recently added in commit
> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
> vma"). While attempting to enable mremap support in the test, it was
> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>
> hugetlb does not support MADV_DONTNEED. However, the only thing
> preventing support is a check in can_madv_lru_vma(). Simply removing
> the check will enable support.
>
> This is sent as a RFC because there is no existing use case calling
> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
> However, adding support makes sense as it is fairly trivial and brings
> hugetlb functionality more in line with 'normal' memory.
>
Just a note:
QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
but instead always goes either via hugetlbfs or via memfd.
For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
to get the job done (IOW: also discards private anon pages). See the
comments in the QEMU code below. I remember that that is somewhat
inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
to make sure
a) All file pages are removed
b) All private anon pages are removed
IIRC hugetlbfs really is different in that regard, but maybe other fs
behave similarly.
That's why QEMU was able to live for now without MADV_DONTNEED support
for hugetlbfs and most probably won't ever need it.
...
/* The logic here is messy;
* madvise DONTNEED fails for hugepages
* fallocate works on hugepages and shmem
* shared anonymous memory requires madvise REMOVE
*/
need_madvise = (rb->page_size == qemu_host_page_size);
need_fallocate = rb->fd != -1;
if (need_fallocate) {
/* For a file, this causes the area of the file to be zero'd
* if read, and for hugetlbfs also causes it to be unmapped
* so a userfault will trigger.
*/
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
start, length);
if (ret) {
ret = -errno;
error_report("ram_block_discard_range: Failed to fallocate "
"%s:%" PRIx64 " +%zx (%d)",
rb->idstr, start, length, ret);
goto err;
}
#else
ret = -ENOSYS;
error_report("ram_block_discard_range: fallocate not available/file"
"%s:%" PRIx64 " +%zx (%d)",
rb->idstr, start, length, ret);
goto err;
#endif
}
if (need_madvise) {
/* For normal RAM this causes it to be unmapped,
* for shared memory it causes the local mapping to disappear
* and to fall back on the file contents (which we just
* fallocate'd away).
*/
#if defined(CONFIG_MADVISE)
if (qemu_ram_is_shared(rb) && rb->fd < 0) {
ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
} else {
ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
}
if (ret) {
ret = -errno;
error_report("ram_block_discard_range: Failed to discard range "
"%s:%" PRIx64 " +%zx (%d)",
rb->idstr, start, length, ret);
goto err;
}
#else
...
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists