Message-ID: <X8AAMyvmLUJJSYsf@redhat.com>
Date: Thu, 26 Nov 2020 14:21:23 -0500
From: Andrea Arcangeli <aarcange@...hat.com>
To: Mike Rapoport <rppt@...ux.ibm.com>
Cc: David Hildenbrand <david@...hat.com>,
Vlastimil Babka <vbabka@...e.cz>, Mel Gorman <mgorman@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
Qian Cai <cai@....pw>, Michal Hocko <mhocko@...nel.org>,
linux-kernel@...r.kernel.org, Baoquan He <bhe@...hat.com>
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set
pageblock_skip on reserved pages
On Thu, Nov 26, 2020 at 11:36:02AM +0200, Mike Rapoport wrote:
> I think it's inveneted by your BIOS vendor :)
BTW, all systems I use on a daily basis have that type 20... Only two
of them reproduce the VM_BUG_ON on a weekly basis on v5.9.
If you search 'E820 "type 20"' you'll get plenty of hits, so it's not
just me at least :).. In fact my guess is there are probably more
workstations/laptops with that type 20 than without. Maybe it only
shows up when booting with EFI?
Easy to check with `dmesg | grep "type 20"` after boot.
One guess why this wasn't reproduced more frequently is that some
desktop distros make the mistake of keeping THP enabled=madvise by
default, which reduces the overall compaction testing? Or maybe
they're not all setting DEBUG_VM=y (but I'm sure some other distro
ships v5.9 with DEBUG_VM=y). I often hit this bug in kcompactd0 for
example, which wouldn't happen with THP enabled=madvise.
The two bpf tracing tools below can show how the current
defrag=madvise default only increases the allocation latency from a
few usec to a dozen usec. Only with defrag=always does the latency go
up to single digit milliseconds, because of the cost of direct
compaction, which is only worth paying for MADV_HUGEPAGE ranges
doing long-lived allocations (we know by now that defrag=always was
a suboptimal default).
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp-comm.bp
https://www.kernel.org/pub/linux/kernel/people/andrea/ebpf/thp.bp
Since 3917c80280c93a7123f1a3a6dcdb10a3ea19737d even apps like Redis
using fork for snapshotting purposes should prefer THP
enabled (besides, it would be better if they used uffd-wp as an
alternative to fork).
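Just to illustrate what I mean by the uffd-wp alternative (a rough
sketch against the uffd-wp ABI, written here only for illustration and
not taken from Redis), the snapshot side would do something like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* write-protect an anon range with uffd-wp instead of relying on fork() COW */
static int wp_region(void *addr, unsigned long len)
{
	int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
	};
	struct uffdio_register reg = {
		.range = { .start = (unsigned long) addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_WP,
	};
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long) addr, .len = len },
		.mode = UFFDIO_WRITEPROTECT_MODE_WP,
	};

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg) ||
	    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
		return -1;
	/* write faults are now reported on uffd: the handler copies the
	   old page contents into the snapshot, then clears the protection
	   on that range with another UFFDIO_WRITEPROTECT with mode = 0 */
	return uffd;
}
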
3917c80280c93a7123f1a3a6dcdb10a3ea19737d also resolved another
concern: the decade-old "fork() vs gup/O_DIRECT vs thread" race was
supposed to be unnoticeable from userland as long as the O_DIRECT min
I/O granularity was enforced to be >=PAGE_SIZE. However with THP
backed anon memory, that minimum granularity requirement increases to
HPAGE_PMD_SIZE. Recent kernels are going in the direction of solving
that race by doing COW during fork, as was originally proposed a long
time ago
(https://lkml.kernel.org/r/20090311165833.GI27823@random.random) which
I suppose will solve the race with sub-PAGE_SIZE granularity too, but
3917c80280c93a7123f1a3a6dcdb10a3ea19737d alone is enough to reduce the
minimum I/O granularity requirement from HPAGE_PMD_SIZE back to
PAGE_SIZE, as some userland may have expected. The best option of
course is to fully prevent that race condition by setting
MADV_DONTFORK on the regions under O_DIRECT (as qemu does for example).
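For reference, since qemu was mentioned, closing the race from the app
side is basically a one-liner on the buffers used for O_DIRECT; a
minimal sketch (not qemu's actual code):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* allocate an I/O buffer that won't be mapped in children created by
   fork(), so gup/O_DIRECT on it can't race with COW */
static void *alloc_dio_buffer(size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	if (madvise(buf, len, MADV_DONTFORK)) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
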
Overall the only tangible concern left is potentially higher memory
usage for servers handling tiny object storage freed at PAGE_SIZE
granularity with MADV_DONTNEED (instead of having a way to copy and
defrag the virtual space of small objects through a callback that
updates the pointer to the object...).
Small object storage relying on jemalloc/tcmalloc for tiny object
management simply needs to selectively disable THP to avoid wasting
memory, either with MADV_NOHUGEPAGE or with the prctl
PR_SET_THP_DISABLE. Flipping a switch in the OCI schema also allows
disabling THP for those object storage apps making heavy use of
MADV_DONTNEED; not even a single line of code needs changing in the
app if it's deployed through the OCI container runtime.
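To show what I mean with the prctl (a hypothetical wrapper, not code
from any real container runtime): whatever execs the app can flip the
switch on its behalf, since the setting is preserved across fork/exec.

#define _GNU_SOURCE
#include <sys/prctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	/* disable THP for this mm; the flag is inherited across
	   fork/exec, so the object storage app started below never gets
	   THP backed memory, without changing a line of its code */
	if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
		return 1;
	execv(argv[1], &argv[1]);
	return 1;
}
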
Recent jemalloc uses MADV_NOHUGEPAGE. I didn't check exactly how it's
being used, but I hope it already does the right thing and separates a
small object arena, zapped with MADV_DONTNEED at PAGE_SIZE
granularity, from a large object arena where THP shall remain
enabled. glibc should also learn to separate small objects and big
objects into different arenas. This has to be handled by the app, as
it is already handled optimally in scylladb, which in fact invokes
MADV_HUGEPAGE; plenty of other databases use not just THP but also
hugetlbfs, which certainly won't fly if MADV_DONTNEED is attempted at
PAGE_SIZE granularity.. or elasticsearch, which also gets a
significant boost from THP etc..
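In other words something along these lines inside the allocator (just
a sketch of the idea, not jemalloc's or scylladb's actual code):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* keep THP only on the arena that can actually take advantage of it */
static void *map_arena(size_t len, int small_objects)
{
	void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (arena == MAP_FAILED)
		return NULL;
	if (small_objects)
		/* zapped with MADV_DONTNEED at PAGE_SIZE granularity:
		   THP would only waste memory here */
		madvise(arena, len, MADV_NOHUGEPAGE);
	else
		/* long-lived large objects: let THP/khugepaged back
		   this range with huge pages */
		madvise(arena, len, MADV_HUGEPAGE);
	return arena;
}
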
Thanks,
Andrea