linux-kernel - Re: [PATCH v3 bpf-next 5/8] bpf: use module_alloc_huge for bpf_prog

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YolGU5JGE9NVrrrc@bombadil.infradead.org>
Date:   Sat, 21 May 2022 13:06:43 -0700
From:   Luis Chamberlain <mcgrof@...nel.org>
To:     "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
        Christoph Hellwig <hch@....de>,
        Davidlohr Bueso <dave@...olabs.net>
Cc:     "song@...nel.org" <song@...nel.org>,
        "arnd@...db.de" <arnd@...db.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "peterz@...radead.org" <peterz@...radead.org>,
        "ast@...nel.org" <ast@...nel.org>,
        "bpf@...r.kernel.org" <bpf@...r.kernel.org>,
        "kernel-team@...com" <kernel-team@...com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "Torvalds, Linus" <torvalds@...ux-foundation.org>
Subject: Re: [PATCH v3 bpf-next 5/8] bpf: use module_alloc_huge for
 bpf_prog_pack

On Sat, May 21, 2022 at 03:20:28AM +0000, Edgecombe, Rick P wrote:
> On Fri, 2022-05-20 at 18:00 -0700, Luis Chamberlain wrote:
> > although VM_FLUSH_RESET_PERMS is rather new my concern here is we're
> > essentially enabling sloppy users to grow without also addressing
> > what if we have to take the leash back to support
> > VM_FLUSH_RESET_PERMS
> > properly? If the hack to support this on other architectures other
> > than
> > x86 is as simple as the one you in vm_remove_mappings() today:
> > 
> >         if (flush_reset &&
> > !IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
> >                 set_memory_nx(addr, area->nr_pages);
> >                 set_memory_rw(addr, area->nr_pages);
> >         }
> > 
> > then I suppose this isn't a big deal. I'm just concerned here this
> > being
> > a slippery slope of sloppiness leading to something which we will
> > regret later.
> > 
> > My intution tells me this shouldn't be a big issue, but I just want
> > to
> > confirm.
> 
> Yea, I commented the same concern on the last thread:
> 
> https://lore.kernel.org/lkml/83a69976cb93e69c5ad7a9511b5e57c402eee19d.camel@intel.com/
> 
> Song said he plans to make kprobes and ftrace work with this new
> allocator. If that happens VM_FLUSH_RESET_PERMS would only have one
> user - modules. Care to chime in with your plans for modules?

My plans are to not break things and to slowly tidy things up. If
you see linux-next, things are at least starting to be split in
nice pieces. With time, clean that further so to not break things.
You were the one who added VM_FLUSH_RESET_PERMS, wasn't that to deal
with secmem stuff? So wouldn't you know better what you recommend for it?

Seeing all this, given module_alloc() users are growing and seeing
the tiny bit of growth of use in this space, I'd think we should
rename module_alloc() to vmalloc_exec(), and likewise the same for
module_memfree() to vmalloc_exec_free(). But it would be our first
__weak vmalloc, and not sure if that's looked down upon.

> If there
> are actual near term plans to keep working on this,
> VM_FLUSH_RESET_PERMS might be changed again or turn into something
> else. Like if we are about to re-think everything, then it doesn't
> matter as much to fix what would then be old.

I think it's up to you as you added it and I'm not looking to add
any bells or wistles, just tidy things up *slowly*.

> Besides not fixing VM_FLUSH_RESET_PERMS/hibernate though, I think this
> allocator still feels a little rough. For example I don't think we
> actually know how much the huge mappings are helping.

Right, 100% agreed. The performance numbers provided are nice but
they are not anything folks can reproduce at all. I hinted towards
perf stuff which could be used and enable other users later to also
use similar stats to showcase its value if they want to move to
huge pages.

It is a side note, and perhaps a stupid question, as I don't grok mm,
but I'm perplexed about the fact that if the value is seen so high towards
huge pages for exec stuff in kernel, wouldn't there be a few folks who
might want to try this for regular exec stuff? Wouldn't there be much
more gains there?

> It is also
> allocating memory in a big chunk from a single node and reusing it,
> where before we were allocating based on numa node for each jit. Would
> some user's suffer from that? Maybe it's obvious to others, but I would
> have expected to see more discussion of MM things like that.

Curious, why was it moved to use a single node?

> But I like general direction of caching and using text_poke() to write
> the jits a lot. However it works, it seems to make a big impact in at
> least some workloads.
> 
> So yea, seems sloppy, but probably (...I guess?) more good for users
> then sloppy for us.

The impact of sloppiness lies in possible odd bugs later and trying to
decipher what was being done. So I do have concerns with the immediate
tribal knowlege incurred by the current implementation. What is your
own roadmap for VM_FLUSH_RESET_PERMS? Sounds like a future possibly
maybe re-do?

  Luis