Message-ID: <5bd16e2c06a2df357400556c6ae01bb5d3c5c32a.camel@intel.com>
Date: Sat, 26 Mar 2022 00:06:45 +0000
From: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
To: "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"song@...nel.org" <song@...nel.org>,
"bpf@...r.kernel.org" <bpf@...r.kernel.org>
CC: "songliubraving@...com" <songliubraving@...com>,
"daniel@...earbox.net" <daniel@...earbox.net>,
"peterz@...radead.org" <peterz@...radead.org>,
"ast@...nel.org" <ast@...nel.org>,
"kernel-team@...com" <kernel-team@...com>,
"andrii@...nel.org" <andrii@...nel.org>,
"x86@...nel.org" <x86@...nel.org>,
"iii@...ux.ibm.com" <iii@...ux.ibm.com>
Subject: Re: [PATCH v9 bpf-next 1/9] x86/Kconfig: select
HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP
On Fri, 2022-02-04 at 10:57 -0800, Song Liu wrote:
> From: Song Liu <songliubraving@...com>
>
> This enables module_alloc() to allocate huge pages for 2MB+ requests.
> To check the difference of this change, we need to enable
> CONFIG_PTDUMP_DEBUGFS and call module_alloc(2MB). Before the change,
> /sys/kernel/debug/page_tables/kernel shows pte for this map. With the
> change, /sys/kernel/debug/page_tables/ shows pmd for this map.
>
> Signed-off-by: Song Liu <songliubraving@...com>
> ---
> arch/x86/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
Hi,
I just saw this upstream today. Glad to see this functionality, but I
think turning on huge vmalloc pages for x86 needs a bit more. I'll
describe a couple of possible failure modes; I haven't actually tested
them.

One problem is that the direct map permission reset part of vmalloc
assumes any special-permissioned pages are mapped 4k on the direct map.
Otherwise the operation could fail to reset a page RW if a PTE page
allocation fails while splitting the huge page in order to toggle a
4k-sized region NP/P. If you are not familiar, x86 CPA generally leaves
the direct map page sizes mirroring the primary alias (vmalloc). So
once vmalloc has huge pages, the special-permissioned direct map
aliases will have them too. This limitation of HAVE_ARCH_HUGE_VMALLOC
is actually hinted at in the Kconfig comments, but I guess it wasn't
made explicit that x86 has these properties.
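To spell out where the failure comes from: on x86 the
set_direct_map_invalid/default() helpers the reset path uses are thin
wrappers around CPA, and CPA has to split a 2MB direct map entry before
it can change the protections of a single 4k page inside it, which
needs a PTE page allocation. Abridged from
arch/x86/mm/pat/set_memory.c (from memory, so double check me):

int set_direct_map_invalid_noflush(struct page *page)
{
	return __set_pages_np(page, 1);	/* clear _PAGE_PRESENT|_PAGE_RW on one 4k page */
}

int set_direct_map_default_noflush(struct page *page)
{
	return __set_pages_p(page, 1);	/* restore _PAGE_PRESENT|_PAGE_RW */
}

The vmalloc reset loop ignores the return values of these, so a failed
split would currently go unnoticed.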
I think to make the vmalloc resetting part safe:
1. set_direct_map_invalid/default() needs to support multiple pages
   like this[0] (rough prototypes sketched just below).
2. vm_remove_mappings() needs to call them with the correct page size
   in the hpage case so they don't cause a split[1].
3. Then hibernate needs to be blocked during this operation so it
   doesn't encounter the now sometimes huge NP pages, which it can't
   handle. Not sure what the right way to do this is, but potentially
   like in the diff below[1].
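For context, the interface change in [0] basically has the direct map
helpers take a page count, so a whole 2MB range can be flipped without
splitting it. Paraphrasing the shape of it (see the link for the real
thing):

int set_direct_map_invalid_noflush(struct page *page, int numpages);
int set_direct_map_default_noflush(struct page *page, int numpages);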
Another problem is that CPA will sometimes now split pages of vmalloc
mappings in cases where it sets a region of an allocation to a
different permission than the rest (for example, regular modules
calling set_memory_x() on the text section). Before this change, those
operations couldn't fail, since the module space mapping would never
require a split. The module code doesn't check for failure there, so
I'm thinking it would now proceed to try to execute NX memory if the
split failed. It could only happen on allocation of especially large
modules. Maybe it should just be avoided for now by having regular
module allocations pass VM_NO_HUGE_VMAP on x86, and BPF could call
__vmalloc_node_range() directly to get 2MB vmallocs.
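Roughly what I'm picturing, completely untested and with made-up helper
names just for illustration (the flag and argument list follow the
current x86 module_alloc()):

/* Regular module text: force 4k pages so later CPA calls never split. */
static void *module_alloc_no_huge(unsigned long size)
{
	return __vmalloc_node_range(size, MODULE_ALIGN,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL,
				    VM_NO_HUGE_VMAP,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}

/* BPF side: leave huge vmalloc enabled to keep the 2MB mappings. */
static void *bpf_alloc_2m(unsigned long size)
{
	return __vmalloc_node_range(size, PMD_SIZE,
				    MODULES_VADDR, MODULES_END,
				    GFP_KERNEL, PAGE_KERNEL,
				    0,
				    NUMA_NO_NODE,
				    __builtin_return_address(0));
}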
[0]
https://lore.kernel.org/lkml/20210208084920.2884-5-rppt@kernel.org/#t
[1] Untested, but something like this possibly:
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 99e0f3e8d1a5..97c4ca3a29b1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -42,6 +42,7 @@
 #include <linux/sched/mm.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
+#include <linux/suspend.h>
 
 #include "internal.h"
 #include "pgalloc-track.h"
@@ -2241,7 +2242,7 @@ EXPORT_SYMBOL(vm_map_ram);
 
 static struct vm_struct *vmlist __initdata;
 
-static inline unsigned int vm_area_page_order(struct vm_struct *vm)
+static inline unsigned int vm_area_page_order(const struct vm_struct *vm)
 {
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
 	return vm->page_order;
@@ -2560,12 +2561,12 @@ struct vm_struct *remove_vm_area(const void *addr)
 static inline void set_area_direct_map(const struct vm_struct *area,
 				       int (*set_direct_map)(struct page *page))
 {
+	unsigned int page_order = vm_area_page_order(area);
 	int i;
 
-	/* HUGE_VMALLOC passes small pages to set_direct_map */
-	for (i = 0; i < area->nr_pages; i++)
+	for (i = 0; i < area->nr_pages; i += 1U << page_order)
 		if (page_address(area->pages[i]))
-			set_direct_map(area->pages[i]);
+			set_direct_map(area->pages[i], 1U << page_order);
 }
 
 /* Handle removing and resetting vm mappings related to the vm_struct. */
@@ -2592,6 +2593,10 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 		return;
 	}
 
+	/* Hibernate can't handle large NP pages */
+	if (page_order)
+		lock_system_sleep();
+
 	/*
 	 * If execution gets here, flush the vm mapping and reset the direct
 	 * map. Find the start and end range of the direct mappings to make sure
@@ -2617,6 +2622,9 @@ static void vm_remove_mappings(struct vm_struct *area, int deallocate_pages)
 	set_area_direct_map(area, set_direct_map_invalid_noflush);
 	_vm_unmap_aliases(start, end, flush_dmap);
 	set_area_direct_map(area, set_direct_map_default_noflush);
+
+	if (page_order)
+		unlock_system_sleep();
 }
 
 static void __vunmap(const void *addr, int deallocate_pages)