linux-kernel - Re: [PATCH v2 3/3] x86: Support huge vmalloc mappings

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <1642472965.lgfksp6krp.astroid@bobo.none>
Date:   Tue, 18 Jan 2022 12:46:01 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Andrew Morton <akpm@...ux-foundation.org>,
        Jonathan Corbet <corbet@....net>,
        Dave Hansen <dave.hansen@...el.com>,
        linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linuxppc-dev@...ts.ozlabs.org,
        Kefeng Wang <wangkefeng.wang@...wei.com>, x86@...nel.org
Cc:     Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Borislav Petkov <bp@...en8.de>,
        Catalin Marinas <catalin.marinas@....com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Paul Mackerras <paulus@...ba.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Will Deacon <will@...nel.org>,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v2 3/3] x86: Support huge vmalloc mappings

Excerpts from Dave Hansen's message of December 29, 2021 2:14 am:
> On 12/28/21 2:26 AM, Kefeng Wang wrote:
>>>> There are some disadvantages about this feature[2], one of the main
>>>> concerns is the possible memory fragmentation/waste in some scenarios,
>>>> also archs must ensure that any arch specific vmalloc allocations that
>>>> require PAGE_SIZE mappings(eg, module alloc with STRICT_MODULE_RWX)
>>>> use the VM_NO_HUGE_VMAP flag to inhibit larger mappings.
>>> That just says that x86 *needs* PAGE_SIZE allocations.  But, what
>>> happens if VM_NO_HUGE_VMAP is not passed (like it was in v1)?  Will the
>>> subsequent permission changes just fragment the 2M mapping?
>> 
>> Yes, without VM_NO_HUGE_VMAP, it could fragment the 2M mapping.
>> 
>> When module alloc with STRICT_MODULE_RWX on x86, it calls
>> __change_page_attr()
>> 
>> from set_memory_ro/rw/nx which will split large page, so there is no
>> need to make
>> 
>> module alloc with HUGE_VMALLOC.
> 
> This all sounds very fragile to me.  Every time a new architecture would
> get added for huge vmalloc() support, the developer needs to know to go
> find that architecture's module_alloc() and add this flag.

This is documented in the Kconfig.

 #
 #  Archs that select this would be capable of PMD-sized vmaps (i.e.,
 #  arch_vmap_pmd_supported() returns true), and they must make no assumptions
 #  that vmalloc memory is mapped with PAGE_SIZE ptes. The VM_NO_HUGE_VMAP flag
 #  can be used to prohibit arch-specific allocations from using hugepages to
 #  help with this (e.g., modules may require it).
 #
 config HAVE_ARCH_HUGE_VMALLOC
         depends on HAVE_ARCH_HUGE_VMAP
         bool

Is it really fair to say it's *very* fragile? Surely it's reasonable to 
read the (not very long) documentation ad understand the consequences for
the arch code before enabling it.

> They next
> guy is going to forget, just like you did.

The miss here could just be a simple oversight or thinko, and caught by 
review, as happens to a lot of things.

> 
> Considering that this is not a hot path, a weak function would be a nice
> choice:
> 
> /* vmalloc() flags used for all module allocations. */
> unsigned long __weak arch_module_vm_flags()
> {
> 	/*
> 	 * Modules use a single, large vmalloc().  Different
> 	 * permissions are applied later and will fragment
> 	 * huge mappings.  Avoid using huge pages for modules.
> 	 */
> 	return VM_NO_HUGE_VMAP;
> }
> 
> Stick that in some the common module code, next to:

Then they have to think about it even less, so I don't know if that's an 
improvement. I don't know what else an arch might be doing with these
allocations, at least modules will blow up pretty quickly, who knows 
what other rare code relies on 4k vmalloc mappings?

The huge vmalloc option is not supposed to be easy to enable. This is 
the same problem Andy was having with the TLB shootdown patches, he 
didn't read the documentation and thought it was supposed to be a 
trivial thing anybody could enable without thinking about it, and was
dutifully pointing out the the nasty "bugs" the feature has in it if
x86 were to enable it improperly.

Thanks,
Nick

> 
>> void * __weak module_alloc(unsigned long size)
>> {
>>         return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> ...
> 
> Then, put arch_module_vm_flags() in *all* of the module_alloc()
> implementations, including the generic one.  That way (even with a new
> architecture) whoever copies-and-pastes their module_alloc()
> implementation is likely to get it right.  The next guy who just does a
> "select HAVE_ARCH_HUGE_VMALLOC" will hopefully just work.
> 
> VM_FLUSH_RESET_PERMS could probably be dealt with in the same way.
>