netdev - Re: [PATCH bpf-next 2/7] set_memory: introduce set_memory_[ro|x]


Message-ID: <YZfLb/AEoA5UBAnY@cmpxchg.org>
Date:   Fri, 19 Nov 2021 11:06:07 -0500
From:   Johannes Weiner <hannes@...xchg.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Song Liu <songliubraving@...com>,
        the arch/x86 maintainers <x86@...nel.org>,
        bpf <bpf@...r.kernel.org>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "ast@...nel.org" <ast@...nel.org>,
        "daniel@...earbox.net" <daniel@...earbox.net>,
        "andrii@...nel.org" <andrii@...nel.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH bpf-next 2/7] set_memory: introduce
 set_memory_[ro|x]_noalias

On Fri, Nov 19, 2021 at 10:35:56AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 19, 2021 at 04:14:46AM +0000, Song Liu wrote:
> > 
> > 
> > > On Nov 18, 2021, at 10:58 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> > > 
> > > On Thu, Nov 18, 2021 at 06:39:49PM +0000, Song Liu wrote:
> > > 
> > >>> You're going to have to do that anyway if you're going to write to the
> > >>> directmap while executing from the alias.
> > >> 
> > >> Not really. If you look at current version 7/7, the logic is mostly 
> > > >> straightforward. We just make all the writes to the directmap, while 
> > > >> calculating the offset from the alias. 
> > > 
> > > Then you can do the exact same thing but do the writes to a temp buffer,
> > > no different.
> > 
> > There will be some extra work, but I guess I will give it a try. 
> > 
> > > 
> > >>>> The BPF program could have up to 1000000 (BPF_COMPLEXITY_LIMIT_INSNS)
> > >>>> instructions (BPF instructions). So it could easily go beyond a few 
> > >>>> pages. Mapping the 2MB page all together should make the logic simpler. 
> > >>> 
> > >>> Then copy it in smaller chunks I suppose.
> > >> 
> > >> How fast/slow is the __text_poke routine? I guess we cannot do it thousands
> > >> of times per BPF program (in chunks of a few bytes)? 
> > > 
> > > You can copy in at least 4k chunks since any 4k will at most use 2
> > > pages, which is what it does. If that's not fast enough we can look at
> > > doing bigger chunks.
> > 
> > If we do JIT in a buffer first, 4kB chunks should be fast enough. 
> > 
> > Another side of this issue is the split of linear mapping (1GB => 
> > many 4kB). If we only split to PMD, but not PTE, we can probably 
> > recover most of the regression. I will check this with Johannes. 
> 
> __text_poke() shouldn't affect the fragmentation of the kernel
> mapping, it's a user-space alias into the same physical memory. For all
> it cares we're poking into GB pages.

Right, __text_poke won't, it's the initial set_memory_ro/x against the
vmap space that does it, since the linear mapping is updated as an
alias with the same granularity.

However, my guess would also be that once we stop doing that with 4k
pages for every single bpf program, and batch into shared 2M pages
instead, the problem will be much smaller. Maybe negligible.*

So ro linear mapping + __text_poke() sounds like a good idea to me.

[ If not, we could consider putting the linear mapping updates behind
  a hardening option similar to RETPOLINE, PAGE_TABLE_ISOLATION,
  STRICT_MODULE_RWX. The __text_poke() method would make us independent
  of the linear mapping, so we could then do with it what we want.
  Many machines aren't exposed enough to necessarily care about
  W^X if the price is too high - and the impact of losing the GB
  mappings is actually significant on central workloads right now.

  But yeah, hopefully we won't have to go there. ]
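
For illustration only, here is a rough user-space sketch of the chunking
argument quoted above: any range of at most 4k bytes touches at most two
4k pages, so a poke primitive limited to two pages per call can still copy
a JIT image of any size one chunk at a time. This is not kernel code.
SKETCH_PAGE_SIZE, pages_spanned(), poke_chunk() and copy_in_chunks() are
hypothetical names, and poke_chunk() is a plain memcpy() standing in for
the real poke routine, whose interface is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 4 KiB base pages, as on x86-64. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of pages touched by the byte range [addr, addr + len). */
static size_t pages_spanned(uintptr_t addr, size_t len)
{
	if (len == 0)
		return 0;

	uintptr_t first = addr / SKETCH_PAGE_SIZE;
	uintptr_t last = (addr + len - 1) / SKETCH_PAGE_SIZE;

	return (size_t)(last - first + 1);
}

/*
 * Hypothetical stand-in for a poke primitive that can map at most two
 * pages per call. In this sketch it is just a checked memcpy().
 */
static void poke_chunk(unsigned char *dst, const unsigned char *src, size_t len)
{
	assert(pages_spanned((uintptr_t)dst, len) <= 2);
	memcpy(dst, src, len);
}

/*
 * Copy len bytes from a temporary buffer to dst in chunks of at most one
 * page. Returns the number of poke_chunk() calls that were needed.
 */
static size_t copy_in_chunks(unsigned char *dst, const unsigned char *src,
			     size_t len)
{
	size_t calls = 0;

	while (len > 0) {
		size_t n = len < SKETCH_PAGE_SIZE ? len : SKETCH_PAGE_SIZE;

		poke_chunk(dst, src, n);
		dst += n;
		src += n;
		len -= n;
		calls++;
	}

	return calls;
}
```

With these definitions, copying a 10000-byte image takes three calls
(4096 + 4096 + 1808 bytes), and the assert in poke_chunk() holds whatever
the alignment of dst.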