Date:   Tue, 16 Jul 2019 10:49:30 -0700
From:   Andrii Nakryiko <andrii.nakryiko@...il.com>
To:     Jiong Wang <jiong.wang@...ronome.com>
Cc:     Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Edward Cree <ecree@...arflare.com>,
        "Naveen N. Rao" <naveen.n.rao@...ux.vnet.ibm.com>,
        Andrii Nakryiko <andriin@...com>,
        Jakub Kicinski <jakub.kicinski@...ronome.com>,
        bpf <bpf@...r.kernel.org>, Networking <netdev@...r.kernel.org>,
        oss-drivers@...ronome.com, Yonghong Song <yhs@...com>
Subject: Re: [RFC bpf-next 0/8] bpf: accelerate insn patching speed

On Tue, Jul 16, 2019 at 1:50 AM Jiong Wang <jiong.wang@...ronome.com> wrote:
>
>
> Andrii Nakryiko writes:
>
> > On Mon, Jul 15, 2019 at 2:21 AM Jiong Wang <jiong.wang@...ronome.com> wrote:
> >>
> >>
> >> Andrii Nakryiko writes:
> >>
> >> > On Thu, Jul 11, 2019 at 4:22 AM Jiong Wang <jiong.wang@...ronome.com> wrote:
> >> >>
> >> >>
> >> >> Andrii Nakryiko writes:
> >> >>
> >> >> > On Thu, Jul 4, 2019 at 2:31 PM Jiong Wang <jiong.wang@...ronome.com> wrote:
> >> >> >>
> >> >> >> This is an RFC based on latest bpf-next about accelerating insn patching
> >> >> >> speed. It is now near the shape of the final PATCH set, and we can see
> >> >> >> the changes that migrating to list patching would bring, so I'm sending
> >> >> >> it out for comments. Most of the info is in the cover letter. I split
> >> >> >> the code in a way that shows the API migration more easily.
> >> >> >
> >> >> >
> >> >> > Hey Jiong,
> >> >> >
> >> >> >
> >> >> > Sorry, took me a while to get to this and learn more about instruction
> >> >> > patching. Overall this looks good and I think it is a good direction.
> >> >> > I'll post high-level feedback here, and some more
> >> >> > implementation-specific ones in corresponding patches.
> >> >>
> >> >> Great, thanks very much for the feedback. Most of your points hit
> >> >> exactly the pain points I had run into. For some of them, I considered
> >> >> solutions similar to yours but failed for various reasons. Let's go
> >> >> through them again; I could have missed some important things.
> >> >>
> >> >> Please see my replies below.
> >> >
> >> > Thanks for the thoughtful reply :)
> >> >
> >> >>
> >> >> >>
> >> >> >> Test Results
> >> >> >> ===
> >> >> >>   - Full pass on test_verifier/test_prog/test_prog_32 under all three
> >> >> >>     modes (interpreter, JIT, JIT with blinding).
> >> >> >>
> >> >> >>   - Benchmarking shows a 10 ~ 15x speedup on medium-sized progs, and
> >> >> >>     patching time reduced from 5100s (nearly one and a half hours) to
> >> >> >>     less than 0.5s for 1M insn patching.
> >> >> >>
> >> >> >> Known Issues
> >> >> >> ===
> >> >> >>   - The following warning is triggered when running the scale test,
> >> >> >>     which contains 1M insns and patching:
> >> >> >>       warning at mm/page_alloc.c:4639 __alloc_pages_nodemask+0x29e/0x330
> >> >> >>
> >> >> >>     This is caused by existing code and can be reproduced on bpf-next
> >> >> >>     master with JIT blinding enabled: run the scale unit test and it
> >> >> >>     will show up after half an hour. After this set, patching is very
> >> >> >>     fast, so it shows up quickly.
> >> >> >>
> >> >> >>   - No line info adjustment support when doing insn delete, and subprog
> >> >> >>     adjustment has a bug when doing insn delete as well. Generally,
> >> >> >>     removal of insns could remove an entire line or subprog, so entries
> >> >> >>     of prog->aux->linfo or env->subprog need to be deleted. I don't have
> >> >> >>     a good idea or clean code for integrating this into the
> >> >> >>     linearization code at the moment; I will do more experimenting, and
> >> >> >>     would appreciate ideas and suggestions on this.
> >> >> >
> >> >> > Is there any specific problem in detecting which line info to delete?
> >> >> > Or what am I missing, besides it needing a careful implementation?
> >> >>
> >> >> Mostly that line info and subprog info are range info covering a range
> >> >> of insns. Deleting insns could require adjusting a range or removing one
> >> >> range entirely. Subprog info can be fully recalculated during
> >> >> linearization, while line info needs some careful implementation; I
> >> >> failed to get clean code for this during linearization, and as said
> >> >> there are no unit tests to help me check whether the code is correct.
> >> >>
> >> >
> >> > Ok, it's good that it's just about a clean implementation. Try to
> >> > implement it as clearly as possible. Then post it here, and if it can
> >> > be improved, someone (me?) will try to help clean it up further.
> >> >
> >> > Not a big expert on line info, so can't comment on that,
> >> > unfortunately. Maybe Yonghong can chime in (cc'ed)
> >> >
> >> >
> >> >> I will describe this later; I spent too much time writing the following
> >> >> reply. It might be worth a separate discussion thread.
> >> >>
> >> >> >>
> >> >> >>     Insn delete doesn't happen on normal programs, for example Cilium
> >> >> >>     benchmarks, and happens rarely in test_progs, so the test coverage
> >> >> >>     is not good. That's also why this RFC has a full pass on selftests
> >> >> >>     despite this known issue.
> >> >> >
> >> >> > I hope you'll add tests for deletion (and with corresponding line
> >> >> > info) in the final patch set :)
> >> >>
> >> >> Will try. Need to spend some time on BTF format.
> >> >> >
> >> >> >>
> >> >> >>   - Could further use a mem pool to accelerate the speed; the changes
> >> >> >>     are trivial on top of this RFC and could give an extra 2x speedup.
> >> >> >>     Not included in this RFC, as reducing the algo complexity from
> >> >> >>     quadratic to linear in insn number is the first step.
> >> >> >
> >> >> > Honestly, I think that would add more complexity than necessary, and I
> >> >> > think we can further speed up performance without that, see below.
> >> >> >
> >> >> >>
> >> >> >> Background
> >> >> >> ===
> >> >> >> This RFC aims to accelerate BPF insn patching speed. Patching means
> >> >> >> expanding one bpf insn at any offset inside a bpf prog into a set of
> >> >> >> new insns, or removing insns.
> >> >> >>
> >> >> >> At the moment, insn patching is quadratic in insn number. This is
> >> >> >> because branch targets of jump insns need to be adjusted, and the algo
> >> >> >> used is:
> >> >> >>
> >> >> >>   for insn inside prog
> >> >> >>     patch insn + regenerate bpf prog
> >> >> >>     for insn inside new prog
> >> >> >>       adjust jump target
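> >> >> >>
> >> >> >> As a rough sketch of where the quadratic cost comes from (here
> >> >> >> needs_patch() and adjust_branch_target() are hypothetical names and
> >> >> >> the patch buffer is assumed given; bpf_patch_insn_single() is the
> >> >> >> existing core helper):
> >> >> >>
> >> >> >>   int i, j;
> >> >> >>
> >> >> >>   for (i = 0; i < prog->len; i++) {
> >> >> >>           if (!needs_patch(&prog->insnsi[i]))
> >> >> >>                   continue;
> >> >> >>           /* reallocates and copies the whole prog: O(N) */
> >> >> >>           prog = bpf_patch_insn_single(prog, i, patch, patch_len);
> >> >> >>           /* then every insn is re-examined for jump fixup: O(N) */
> >> >> >>           for (j = 0; j < prog->len; j++)
> >> >> >>                   adjust_branch_target(&prog->insnsi[j], i,
> >> >> >>                                        patch_len - 1);
> >> >> >>   }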
> >> >> >>
> >> >> >> This causes significant time to be spent when a bpf prog requires a
> >> >> >> large amount of patching on different insns. Benchmarking shows it
> >> >> >> could take more than half a minute to finish patching when the
> >> >> >> patching count is above 50K, and more than one hour when the patching
> >> >> >> count is around 1M.
> >> >> >>
> >> >> >>   15000   :    3s
> >> >> >>   45000   :   29s
> >> >> >>   95000   :  125s
> >> >> >>   195000  :  712s
> >> >> >>   1000000 : 5100s
> >> >> >>
> >> >> >> This RFC introduces new patching infrastructure. Before doing insn
> >> >> >> patching, the insns in a bpf prog are turned into a singly linked
> >> >> >> list: inserting new insns just inserts new list nodes, and deleting
> >> >> >> insns just sets a delete flag. Finally, the list is linearized back
> >> >> >> into an array, and branch target adjustment is done for all jump insns
> >> >> >> during linearization. This algo brings the time complexity from
> >> >> >> quadratic to linear in insn number.
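> >> >> >>
> >> >> >> A minimal sketch of what a list node could look like (field and flag
> >> >> >> names here are illustrative, not necessarily what the patches use):
> >> >> >>
> >> >> >>   struct bpf_list_insn {
> >> >> >>           struct bpf_insn insn;        /* the insn itself */
> >> >> >>           struct bpf_list_insn *next;  /* singly linked */
> >> >> >>           s32 orig_insn_idx;           /* index before patching, or -1
> >> >> >>                                         * for newly inserted insns */
> >> >> >>           u32 flag;                    /* e.g. a REMOVED bit */
> >> >> >>   };
> >> >> >>
> >> >> >> Insertion links new nodes in O(1), deletion sets the flag, and
> >> >> >> linearization walks the list once, skipping flagged nodes and fixing
> >> >> >> jump offsets through an old-index -> new-index map.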
> >> >> >>
> >> >> >> Benchmarking shows the new patching infrastructure is 10 ~ 15x faster
> >> >> >> on medium-sized progs, and for 1M patches it reduces the time from
> >> >> >> 5100s to less than 0.5s.
> >> >> >>
> >> >> >> Patching API
> >> >> >> ===
> >> >> >> Insn patching can happen at two layers inside BPF. One is the "core
> >> >> >> layer", where only BPF insns are patched. The other is the
> >> >> >> "verification layer", where insns have corresponding aux info as well
> >> >> >> as high-level subprog info, so insn patching means the aux info needs
> >> >> >> to be patched too and the subprog info needs to be adjusted. A BPF
> >> >> >> prog also has debug info associated, so line info should always be
> >> >> >> updated after insn patching.
> >> >> >>
> >> >> >> So, list creation, destruction, insertion and deletion are the same
> >> >> >> for both layers, but linearization is different: "verification layer"
> >> >> >> patching requires extra work. Therefore the patch APIs are:
> >> >> >>
> >> >> >>    list creation:                bpf_create_list_insn
> >> >> >>    list patch:                   bpf_patch_list_insn
> >> >> >>    list pre-patch:               bpf_prepatch_list_insn
> >> >> >
> >> >> > I think the pre-patch name is very confusing; until I read the full
> >> >> > description I couldn't understand what it's supposed to be used for.
> >> >> > Speaking of bpf_patch_list_insn, "patch" is also generic enough to
> >> >> > leave me wondering whether the instruction buffer is inserted after
> >> >> > the instruction, or the instruction is replaced with a bunch of
> >> >> > instructions.
> >> >> >
> >> >> > So how about two more specific names:
> >> >> > bpf_patch_list_insn -> bpf_list_insn_replace (meaning replace given
> >> >> > instruction with a list of patch instructions)
> >> >> > bpf_prepatch_list_insn -> bpf_list_insn_prepend (well, I think this
> >> >> > one is pretty clear).
> >> >>
> >> >> My sense of English words is not great; I will switch to the above,
> >> >> which indeed reads more clearly.
> >> >>
> >> >> >>    list linearization (core layer): prog = bpf_linearize_list_insn(prog, list)
> >> >> >>    list linearization (veri layer): env = verifier_linearize_list_insn(env, list)
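> >> >> >>
> >> >> >> A typical core-layer pass would then look roughly like this (a sketch
> >> >> >> only; needs_patch(), patch_buf/patch_len and the exact
> >> >> >> bpf_patch_list_insn() signature are assumptions, and error handling
> >> >> >> is omitted):
> >> >> >>
> >> >> >>   struct bpf_list_insn *list, *node;
> >> >> >>
> >> >> >>   list = bpf_create_list_insn(prog);
> >> >> >>   for (node = list; node; node = node->next) {
> >> >> >>           if (!needs_patch(&node->insn))
> >> >> >>                   continue;
> >> >> >>           /* replace node with patch_len new insns */
> >> >> >>           node = bpf_patch_list_insn(node, patch_buf, patch_len);
> >> >> >>   }
> >> >> >>   prog = bpf_linearize_list_insn(prog, list);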
> >> >> >
> >> >> > These two functions are both quite involved and share a lot of
> >> >> > common code. I'd rather have one linearize function that takes env
> >> >> > as an optional parameter. If env is specified (which is the case
> >> >> > everywhere except the constant blinding pass), then adjust aux_data
> >> >> > and subprogs along the way.
> >> >>
> >> >> The two versions of linearization and how to unify them were a pain
> >> >> point for me. I thought about factoring out some of the common code,
> >> >> but it actually doesn't amount to much: the final size counting +
> >> >> insnsi resize parts are the same, then things start to diverge from the
> >> >> "Copy over insn" loop onward.
> >> >>
> >> >> The verifier layer needs to copy and initialize aux data etc. And jump
> >> >> relocation is different. At the core layer, the use case is JIT blinding,
> >> >> which could expand a jump_imm insn into an and/or/jump_reg sequence, and the
> >> >
> >> > Sorry, I didn't get what "could expand a jump_imm insn into an
> >> > and/or/jump_reg sequence" means; maybe you can clarify if I'm missing
> >> > something.
> >> >
> >> > But from your cover letter description, the core layer has no jumps at
> >> > all, while the verifier layer has jumps inside the patch buffer. So, if
> >> > you support jumps inside the patch buffer, it will automatically work
> >> > for the core layer. Or what am I missing?
> >>
> >> I meant that in the core layer (JIT blinding), there is the following
> >> patching:
> >>
> >> input:
> >>   insn 0             insn 0
> >>   insn 1             insn 1
> >>   jmp_imm   >>       mov_imm  \
> >>   insn 2             xor_imm    insn seq expanded from jmp_imm
> >>   insn 3             jmp_reg  /
> >>                      insn 2
> >>                      insn 3
> >>
> >>
> >> jmp_imm is the insn that will be patched, and the actual transformation
> >> is to expand it into a mov_imm/xor_imm/jmp_reg sequence. "jmp_reg",
> >> sitting at the end of the patch buffer, must jump to the same
> >> destination as the original jmp_imm, so "jmp_reg" is an insn inside the
> >> patch buffer that must be relocated, and its jump destination is outside
> >> the patch buffer.
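> >>
> >> Concretely, the blinded sequence is roughly the following (cf.
> >> bpf_jit_blind_insn() in kernel/bpf/core.c; rnd is a per-insn random
> >> value, and the exact macros/registers here are a sketch from memory):
> >>
> >>   BPF_ALU64_IMM(BPF_MOV, BPF_REG_AX, rnd ^ imm)  /* mov_imm */
> >>   BPF_ALU64_IMM(BPF_XOR, BPF_REG_AX, rnd)        /* xor_imm: AX = imm */
> >>   BPF_JMP_REG(op, dst_reg, BPF_REG_AX, off)      /* jmp_reg: off still
> >>                                                     names the original
> >>                                                     jump destination */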
> >
> >
> > Ok, great, thanks for explaining, yeah, it's definitely something that
> > we should be able to support. BUT. It got me thinking a bit more, and I
> > think I have a simpler and more elegant solution now, again supporting
> > both core-layer and verifier-layer operations.
> >
> > struct bpf_patchable_insn {
> >    struct bpf_patchable_insn *next;
> >    struct bpf_insn insn;
> >    int orig_idx; /* original non-patched index */
> >    int new_idx;  /* new index, will be filled only during linearization */
> > };
> >
> > struct bpf_patcher {
> >     /* dummy head node of a chain of patchable instructions */
> >     struct bpf_patchable_insn insn_head;
> >     /* dynamic array of size(original instruction count)
> >      * this is a map from original instruction index to a first
> >      * patchable instruction that replaced that instruction (or
> >      * just original instruction as bpf_patchable_insn).
> >      */
> >     struct bpf_patchable_insn **orig_idx_to_patchable_insn;
> >     int cnt;
> > };
> >
> > A few points, but it should be pretty clear just from the comments and
> > definitions:
> > 1. When you create the bpf_patcher, you create the patchable_insn list
> > and fill the orig_idx_to_patchable_insn map with the proper pointers.
> > This array is NEVER changed after that.
> > 2. When replacing an instruction, you re-use struct bpf_patchable_insn
> > for the first patched instruction, then append after that (not prepend
> > to the next instruction, so as not to disrupt the orig_idx ->
> > patchable_insn mapping).
> > 3. During linearization, you first traverse the chain of instructions
> > and trivially assign new_idxs.
> > 4. No need for patchable_insn->target anymore. All jumps use relative
> > instruction offsets, right?
>
> Yes, all jumps are pc-relative.
>
> > So when you need to determine the new
> > instruction index during linearization, you just do (after you have
> > calculated the new instruction indices):
> >
> > void adjust_jmp(struct bpf_patcher *patcher, struct bpf_patchable_insn *insn) {
> >    int old_jmp_idx = insn->orig_idx + jmp_offset_of(&insn->insn);
> >    int new_jmp_idx = patcher->orig_idx_to_patchable_insn[old_jmp_idx]->new_idx;
> >    adjust_jmp_offset(&insn->insn, new_jmp_idx - insn->new_idx);
> > }
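> >
> > And linearization itself then becomes two linear passes, roughly (a
> > sketch; is_jmp_insn() is a hypothetical predicate):
> >
> >   struct bpf_patchable_insn *insn;
> >   int idx = 0;
> >
> >   /* pass 1: number the surviving instructions */
> >   for (insn = patcher->insn_head.next; insn; insn = insn->next)
> >           insn->new_idx = idx++;
> >
> >   /* pass 2: copy out and fix up jump offsets */
> >   for (insn = patcher->insn_head.next; insn; insn = insn->next) {
> >           if (is_jmp_insn(&insn->insn))
> >                   adjust_jmp(patcher, insn);
> >           prog->insnsi[insn->new_idx] = insn->insn;
> >   }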
>
> Hmm, this algo is kind of the same as this RFC's, just that we have
> organized "new_index" as an "idx_map". And in this RFC, only the new_idx
> of an original insn matters; no space is allocated for patched insns. (As
> mentioned, JIT blinding

It's not really about saving space. It's about having a mapping from
the original index to the new one (in this case, through a struct
bpf_patchable_insn *), which stays correct at all times, thus allowing
us not to linearize between patching passes.


> requires the last insn inside the patch buffer to be relocated to the
> original jump offset, so there was a little special handling in the
> relocation loop in the core layer linearization code.)
>
> > The idea is that we want to support quick look-up by original
> > instruction index. That's what orig_idx_to_patchable_insn provides. On
> > the other hand, no existing instruction ever references a newly
> > patched instruction by its new offset, so with a careful implementation
> > you can transparently support all the cases, regardless of whether it's
> > the core layer or the verifier layer (so, e.g., verifier-layer patched
> > instructions will now be able to jump out of the patch buffer, if
> > necessary; neat, right?).
> >
> > It is cleaner than everything we've discussed so far. Unless I missed
> > something critical (it's all quite convoluted, so I might have
> > forgotten some parts already). Let me know what you think.
>
> Let me digest a little bit and do some coding, then I will come back. Some

Sure, give it some thought and give it a go at coding; I bet overall
it will turn out more succinct and simpler. Please post an updated
version when you are done. Thanks!

> issues only show up during in-depth coding. I kind of feel that handling
> aux references in the verifier layer is the part that will still
> introduce some unclean code.
>
> <snip>
> >> If there is no dead insn elimination opt, then we could just adjust
> >> offsets. When there is insn deletion, I feel the logic becomes more
> >> complex. One subprog could be completely or partially deleted, so I
> >> feel just recalculating the whole subprog info as a side product is
> >> much simpler.
> >
> > What's a situation where the entirety of a subprog can be deleted?
>
> Suppose you have a conditional jmp_imm where the true path calls one
> subprog and the false path calls the other. If the insn walker later
> finds the condition is always true, then the subprog on the false path
> won't be marked as "seen", so it is entirely deleted.
>
> I actually think that, in theory, one subprog could be deleted entirely,
> so if we support insn deletion inside the verifier, then range info like
> line_info/subprog_info needs to handle a range being deleted.

Seems like this is not a problem, according to Alexei. But in the
worst case, it's now simple to re-calculate all this, given that we
have this simple operation to get the new insn idx from the old insn idx.

>
> Thanks.
> Regards,
> Jiong
