linux-kernel - Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE

Message-ID: <Yl8olpqvZxY8KoNf@bombadil.infradead.org>
Date:   Tue, 19 Apr 2022 14:24:38 -0700
From:   Luis Chamberlain <mcgrof@...nel.org>
To:     "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
Cc:     "rppt@...nel.org" <rppt@...nel.org>,
        "songliubraving@...com" <songliubraving@...com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "hch@...radead.org" <hch@...radead.org>,
        "ast@...nel.org" <ast@...nel.org>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "Torvalds, Linus" <torvalds@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "song@...nel.org" <song@...nel.org>,
        "Kernel-team@...com" <Kernel-team@...com>,
        "pmladek@...e.com" <pmladek@...e.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "hpa@...or.com" <hpa@...or.com>,
        "dborkman@...hat.com" <dborkman@...hat.com>,
        "edumazet@...gle.com" <edumazet@...gle.com>,
        "bp@...en8.de" <bp@...en8.de>, "mbenes@...e.cz" <mbenes@...e.cz>,
        "imbrenda@...ux.ibm.com" <imbrenda@...ux.ibm.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Adam Manzanares <a.manzanares@...sung.com>,
        Christophe Leroy <christophe.leroy@...roup.eu>
Subject: Re: [PATCH v4 bpf 0/4] vmalloc: bpf: introduce VM_ALLOW_HUGE_VMAP

On Tue, Apr 19, 2022 at 01:56:03AM +0000, Edgecombe, Rick P wrote:
> Yea, that was my understanding. X86 modules have to be linked within
> 2GB of the kernel text, also eBPF x86 JIT generates code that expects
> to be within 2GB of the kernel text.

And kprobes / live patching / ftrace.
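
To make the 2GB constraint concrete: a near call or jmp on x86-64 encodes a
signed 32-bit displacement relative to the end of the instruction, so the
target has to sit within roughly +/-2GB of the call site. Below is a minimal
user-space sketch of that check. rel32_reachable() is a hypothetical helper
for illustration, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: can a 5-byte near call/jmp (opcode + rel32) placed
 * at insn_addr reach target? The displacement is measured from the end of
 * the instruction and must fit in a signed 32-bit value.
 */
static bool rel32_reachable(uint64_t insn_addr, uint64_t target)
{
    /* Unsigned subtraction wraps; the cast recovers the signed distance. */
    int64_t disp = (int64_t)(target - (insn_addr + 5));

    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

With illustrative addresses, a call site at 0xffffffff81000000 can reach
0xffffffffa0000000 (about 496MB away), but not an address such as
0xffffc90000000000, which is far outside the 32-bit displacement range.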

Another architectural fun fact: powerpc book3s/32 requires executability
to be set per 256 MB segment. Some architectures like this one
will also want to optimize how they use the module alloc area.

Even though today the use cases might be limited, we don't exactly know
how much memory a target device has as well, and so treating memory
failures for "special memory" requests as regular memory failures seems
a bit odd, and users could get confused. For instance, slapping
extra memory on a system won't resolve any issues if the limit for a
special type of memory is already hit. Very likely not a problem at all today,
given how small modules / eBPF JIT programs / etc. are, but conceptually it
would seem wrong to just say -ENOMEM when in fact it's a special type of
required memory which cannot be allocated and the issue cannot possibly be
fixed. I don't think we have an option but to use -ENOMEM, but at least
hinting at the special failure would seem desirable.
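
As a rough illustration of what such a hint could look like, here is a
user-space sketch. Everything in it is hypothetical: struct special_region
and special_alloc() are made-up names, and the model is a simple bump
allocator over a fixed-size region, not how module_alloc() or vmalloc
actually work. The point is only that the error code stays -ENOMEM while the
message says that the limited region, not system RAM, is what ran out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of an address-range-limited "special memory" area. */
struct special_region {
    const char *name;
    size_t limit;   /* fixed size of the region, independent of RAM */
    size_t used;    /* bytes handed out so far */
};

/* Returns 0 and an offset into the region, or -ENOMEM plus a hint. */
static int special_alloc(struct special_region *r, size_t size, size_t *offset)
{
    if (size > r->limit - r->used) {
        fprintf(stderr,
                "%s: region exhausted (%zu of %zu bytes used), "
                "adding RAM will not help\n",
                r->name, r->used, r->limit);
        return -ENOMEM;
    }

    *offset = r->used;
    r->used += size;
    return 0;
}
```

A caller still only sees -ENOMEM, so existing error handling keeps working;
the extra log line is what tells the user the failure is not ordinary
memory pressure.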

Do we have other types of architectural limitations for "special memory"
other than executable? Do we have *new* types of special memory we
should consider which might be similar / limited in nature? And can / could /
should these architectural limitations hopefully disappear in newer CPUs?
I see vmalloc_pks() as you pointed out [0]. Anything else?

> I think of two types of caches we could have: caches of unmapped pages
> on the direct map and caches of virtual memory mappings. Caches of
> pages on the direct map reduce breakage of the large pages (a
> somewhat x86-specific problem). Caches of virtual memory mappings
> reduce shootdowns, and are also required to share huge pages. I'll plug
> my old RFC, where I tried to work towards enabling both:
> 
> https://lore.kernel.org/lkml/20201120202426.18009-1-rick.p.edgecombe@intel.com/
> 
> Since then Mike has taken the direct map cache piece a lot further.
> 
> Yea, probably a lot of JIT's are way smaller than a page, but there is
> also hopefully some performance benefit of reduced ITLB pressure and
> TLB shootdowns. I think kprobes/ftrace (or at least one of them) keeps
> its own cache of a page for putting very small trampolines.

The reason I looked into *why* module_alloc() was used was particularly
because it seemed a bit odd to have such ITLB enhancements for such
a niche use case when nobody had wanted this elsewhere before.

> > Then, since it seems the vmalloc area was not initialized,
> > wouldn't that break the old JIT spray fixes, refer to commit
> > 314beb9bcabfd ("x86: bpf_jit_comp: secure bpf jit against spraying
> > attacks")?
> 
> Hmm, yea it might be a way to get around the ebpf jit rlimit. The
> allocator could just text_poke() invalid instructions on "free" of the
> jit.
> 
> > 
> > Is that sort of work not needed anymore? If in doubt I at least made the
> > old proof of concept JIT spray stuff compile on recent kernels [0], but
> > I haven't tried out your patches yet. If this is not needed anymore,
> > why not?
> 
> IIRC this got addressed in two ways, randomizing of the jit offset
> inside the vmalloc allocation, and "constant blinding", such that the
> specific attack of inserting unaligned instructions as immediate
> instruction data did not work. Neither of those mitigations seems
> unworkable with a large page caching allocator.
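
For readers unfamiliar with the term, the idea behind "constant blinding"
can be sketched as follows. This is a simplified user-space model, not the
kernel's implementation: struct blinded_imm, blind_imm() and
unblind_at_runtime() are hypothetical names. Instead of emitting an
attacker-chosen immediate verbatim, the JIT emits the immediate XORed with a
random value, followed by an XOR with that same random value, so the
original byte pattern never appears in the generated code.

```c
#include <stdint.h>

/*
 * Hypothetical model of a blinded "load immediate": these two values are
 * what ends up in the instruction stream, not the original constant.
 */
struct blinded_imm {
    uint32_t masked; /* imm ^ rnd, immediate of the first instruction */
    uint32_t rnd;    /* random value, immediate of the following xor */
};

/* JIT time: hide the attacker-controlled immediate behind a random mask. */
static struct blinded_imm blind_imm(uint32_t imm, uint32_t rnd)
{
    struct blinded_imm b = { .masked = imm ^ rnd, .rnd = rnd };

    return b;
}

/* Run time: the emitted xor restores the value the program asked for. */
static uint32_t unblind_at_runtime(struct blinded_imm b)
{
    return b.masked ^ b.rnd;
}
```

Because the mask is chosen by the JIT rather than by the program being
compiled, neither emitted immediate is under the attacker's control, which
is independent of how the backing pages are allocated.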

Got it, but was it *also* considered in the fixes posted recently?

> > The collection of tribal knowledge around these sorts of things would be
> > good to not lose and if we can share, even better.
> 
> Totally agree here. I think the abstraction I was exploring in that RFC
> could remove some of the special permission memory tribal knowledge
> that is lurking in the cross-arch module.c. I wonder if you have any
> thoughts on something like that? The normal modules proved the hardest.

Yeah modules will be harder now with the new ARCH_WANTS_MODULES_DATA_IN_VMALLOC
which Christophe Leroy added (queued in my modules-next). At a quick
glance it seems like an API in the right direction, but you just need
more architecture folks other than the usual x86 suspects to review.

Perhaps time for a new spin?

[0] https://lore.kernel.org/lkml/20201009201410.3209180-2-ira.weiny@intel.com/

  Luis