netdev - Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset relocation algorithm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9EE75932-5AED-49D3-86BF-D1FC2A139BF8@fb.com>
Date:   Sat, 27 Jul 2019 18:59:11 +0000
From:   Song Liu <songliubraving@...com>
To:     Andrii Nakryiko <andrii.nakryiko@...il.com>
CC:     Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Yonghong Song <yhs@...com>, Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset
 relocation algorithm



> On Jul 26, 2019, at 11:11 PM, Andrii Nakryiko <andrii.nakryiko@...il.com> wrote:
> 
> On Thu, Jul 25, 2019 at 12:32 PM Song Liu <songliubraving@...com> wrote:
>> 
>> 
>> 
>>> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@...com> wrote:
>>> 
>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>> All the details are described in code comments.
>> 
>> Some description in the change log is still useful. Please at least
>> copy-paste key comments here.
> 
> OK, will add some more.
> 
>> 
>> And, this is looooong. I think it is totally possible to split it into
>> multiple smaller patches.
> 
> I don't really know how to split it further without hurting reviewing
> by artificially splitting related code into separate patches. Remove
> any single function and algorithm will be incomplete.
> 
> Let me give you some high-level overview of how pieces are put
> together. There are 9 non-trivial functions, let's go over their
> purpose in the orderd in which they are defined in file:
> 
> 1. bpf_core_spec_parse()
> 
> This one take bpf_offset_reloc's type_id and accessor string
> ("0:1:2:3") and parses it into more convenient bpf_core_spec
> datastructure, which has calculated offset and high-level spec
> "steps": either named field or array access.
> 
> 2. bpf_core_find_cands()
> 
> Given local type name, finds all possible target BTF types with same
> name (modulo "flavor" differences, ___flavor suffix is just ignored).
> 
> 3. bpf_core_fields_are_compat()
> 
> Given local and target field match, checks that their types are
> compatible (so that we don't accidentally match, e.g., int against
> struct).
> 
> 4. bpf_core_match_member()
> 
> Given named local field, find corresponding field in target struct. To
> understand why it's not trivial, here's an example:
> 
> Local type:
> 
> struct s___local {
>  int a;
> };
> 
> Target type:
> 
> struct s___target {
>  struct {
>    union {
>      int a;
>    };
>  };
> };
> 
> For both cases you can access a as s.a, but in local case, field a is
> immediately inside s___local, while for s___target, you have to
> traverse two levels deeper into anonymous fields to get to an `a`
> inside anonymous union.
> 
> So this function find that `a` by doing exhaustive search across all
> named field and anonymous struct/unions. But otherwise it's pretty
> straightforward recursive function.
> 
> bpf_core_spec_match()
> 
> Just goes over high-level spec steps in local spec and tries to figure
> out both high-level and low-level steps for targe type. Consider the
> above example. For both structs accessing s.a is one high-level step,
> but for s___local it's single low-level step (just another :0 in spec
> string), while for s___target it's three low-level steps: ":0:0:0",
> one step for each BTF type we need to traverse.
> 
> Array access is simpler, it's always one high-level and one low-level step.
> 
> bpf_core_reloc_insn()
> 
> Once we match local and target specs and have local and target
> offsets, do the relocations - check that instruction has expected
> local offset and replace it with target offset.
> 
> bpf_core_find_kernel_btf()
> 
> This is the only function that can be moved into separate patch, but
> it's also very simple. It just iterates over few known possible
> locations for vmlinux image and once found, tries to parse .BTF out of
> it, to be used as target BTF.
> 
> bpf_core_reloc_offset()
> 
> It combines all the above functions to perform single relocation.
> Parse spec, get candidates, for each candidate try to find matching
> target spec. All candidates that matched are cached for given local
> root type.

Thanks for these explanation. They are really helpful. 

I think some example explaining each step of bpf_core_reloc_offset()
will be very helpful. Something like:

Example:

struct s {
	int a;
	struct {
		int b;
		bool c;
	};
};

To get offset for c, we do:

bpf_core_reloc_offset() {
	
	/* input data: xxx */

	/* first step: bpf_core_spec_parse() */

	/* data after first step */

	/* second step: bpf_core_find_cands() */

	/* candidate A and B after second step */

	...
}

Well, it requires quite some work to document this way. Please let me 
know if you feel this is an overkill. 

> 
> bpf_core_reloc_offsets()
> 
> High-level coordination. Iterate over all per-program .BTF.ext offset
> reloc sections, each relocation within them. Find corresponding
> program and try to apply relocations one by one.
> 
> 
> I think the only non-obvious part here is to understand that
> relocation records local raw spec with every single anonymous type
> traversal, which is not that useful when we try to match it against
> target type, which can have very different composition, but still the
> same field access pattern, from C language standpoint (which hides all
> those anonymous type traversals from programmer).
> 
> But it should be pretty clear now, plus also check tests, they have
> lots of cases showing what's compatible and what's not.

I see. I will review the tests. 

>>> 
>>> static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> -                                                  __u32 id)
>>> +                                                  __u32 id,
>>> +                                                  __u32 *res_id)
>>> {
>>>      const struct btf_type *t = btf__type_by_id(btf, id);
>> 
>> Maybe have a local "__u32 rid;"
>> 
>>> 
>>> +     if (res_id)
>>> +             *res_id = id;
>>> +
>> 
>> and do "rid = id;" here
>> 
>>>      while (true) {
>>>              switch (BTF_INFO_KIND(t->info)) {
>>>              case BTF_KIND_VOLATILE:
>>>              case BTF_KIND_CONST:
>>>              case BTF_KIND_RESTRICT:
>>>              case BTF_KIND_TYPEDEF:
>>> +                     if (res_id)
>>> +                             *res_id = t->type;
>> and here
>> 
>>>                      t = btf__type_by_id(btf, t->type);
>>>                      break;
>>>              default:
>> and "*res_id = rid;" right before return?
> 
> Sure, but why?

I think it is cleaner that way. But feel free to ignore if you
think otherwise. 

> 
>> 
>>> @@ -1041,7 +1049,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>> static bool get_map_field_int(const char *map_name, const struct btf *btf,
>>>                            const struct btf_type *def,
>>>                            const struct btf_member *m, __u32 *res) {
> 
> [...]
> 
>>> +struct bpf_core_spec {
>>> +     const struct btf *btf;
>>> +     /* high-level spec: named fields and array indicies only */
>> 
>> typo: indices
> 
> thanks!
> 
>> 
>>> +     struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
>>> +     /* high-level spec length */
>>> +     int len;
>>> +     /* raw, low-level spec: 1-to-1 with accessor spec string */
>>> +     int raw_spec[BPF_CORE_SPEC_MAX_LEN];
>>> +     /* raw spec length */
>>> +     int raw_len;
>>> +     /* field byte offset represented by spec */
>>> +     __u32 offset;
>>> +};
> 
> [...]
> 
>>> + *
>>> + *   int x = &s->a[3]; // access string = '0:1:2:3'
>>> + *
>>> + * Low-level spec has 1:1 mapping with each element of access string (it's
>>> + * just a parsed access string representation): [0, 1, 2, 3].
>>> + *
>>> + * High-level spec will capture only 3 points:
>>> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
>>> + *   - field 'a' access (corresponds to '2' in low-level spec);
>>> + *   - array element #3 access (corresponds to '3' in low-level spec).
>>> + *
>>> + */
>> 
>> IIUC, high-level points are subset of low-level points. How about we introduce
>> "anonymous" high-level points, so that high-level points and low-level points
>> are 1:1 mapping?
> 
> No, that will just hurt and complicate things. See above explanation
> about why we need high-level points (it's what you as C programmer try
> to achieve vs low-level spec is what C-language does in reality, with
> all the anonymous struct/union traversal).
> 
> What's wrong with this separation? Think about it as recording
> "intent" (high-level spec) vs "mechanics" (low-level spec, how exactly
> to achieve that intent, in excruciating details).

There is nothing wrong with separation. I just personally think it is 
cleaner the other way. That's why I raised the question. 

I will go with your assessment, as you looked into this much more than 
I did. :-)

[...]

>>> +
>>> +     memset(spec, 0, sizeof(*spec));
>>> +     spec->btf = btf;
>>> +
>>> +     /* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */
>>> +     while (*spec_str) {
>>> +             if (*spec_str == ':')
>>> +                     ++spec_str;
>>> +             if (sscanf(spec_str, "%d%n", &access_idx, &parsed_len) != 1)
>>> +                     return -EINVAL;
>>> +             if (spec->raw_len == BPF_CORE_SPEC_MAX_LEN)
>>> +                     return -E2BIG;
>>> +             spec_str += parsed_len;
>>> +             spec->raw_spec[spec->raw_len++] = access_idx;
>>> +     }
>>> +
>>> +     if (spec->raw_len == 0)
>>> +             return -EINVAL;
>>> +
>>> +     for (i = 0; i < spec->raw_len; i++) {
>>> +             t = skip_mods_and_typedefs(btf, id, &id);
>>> +             if (!t)
>>> +                     return -EINVAL;
>>> +
>>> +             access_idx = spec->raw_spec[i];
>>> +
>>> +             if (i == 0) {
>>> +                     /* first spec value is always reloc type array index */
>>> +                     spec->spec[spec->len].type_id = id;
>>> +                     spec->spec[spec->len].idx = access_idx;
>>> +                     spec->len++;
>>> +
>>> +                     sz = btf__resolve_size(btf, id);
>>> +                     if (sz < 0)
>>> +                             return sz;
>>> +                     spec->offset += access_idx * sz;
>>          spec->offset = access_idx * sz;  should be enough
> 
> No. spec->offset is carefully maintained across multiple low-level
> steps, as we traverse down embedded structs/unions.
> 
> Think about, e.g.:
> 
> struct s {
>    int a;
>    struct {
>        int b;
>    };
> };
> 
> Imagine you are trying to match s.b access. With what you propose
> you'll end up with offset 0, but it should be 4.

Hmm... this is just for i == 0, right? Which line updated spec->offset
after "memset(spec, 0, sizeof(*spec));"?

> 
>> 
>>> +                     continue;
>>> +             }
>> 
>> Maybe pull i == 0 case out of the for loop?
>> 
>>> +
>>> +             if (btf_is_composite(t)) {
> 
> [...]
> 
>>> +
>>> +     if (spec->len == 0)
>>> +             return -EINVAL;
>> 
>> Can this ever happen?
> 
> Not really, because I already check raw_len == 0 and exit with error.
> I'll remove.
> 
>> 
>>> +
>>> +     return 0;
>>> +}
>>> +
> 
> [...]
> 
>>> +
>>> +/*
>>> + * Given single high-level accessor (either named field or array index) in
>>> + * local type, find corresponding high-level accessor for a target type. Along
>>> + * the way, maintain low-level spec for target as well. Also keep updating
>>> + * target offset.
>>> + */
>> 
>> Please describe the recursive algorithm here. I am kinda lost.
> 
> Explained above. I'll extend description a bit. But it's just
> recursive exhaustive search:
> 1. if struct field is anonymous and is struct/union, go one level
> deeper and try to find field with given name inside those.
> 2. if field has name and it matched what we are searching - check type
> compatibility. It has to be compatible, so if it's not, then it's not
> a match.
> 
>> Also, please document the meaning of zero, positive, negative return values.
> 
> Ok. It's standard <0 - error, 0 - false, 1 - true.
> 
>> 
>>> +static int bpf_core_match_member(const struct btf *local_btf,
>>> +                              const struct bpf_core_accessor *local_acc,
>>> +                              const struct btf *targ_btf,
>>> +                              __u32 targ_id,
>>> +                              struct bpf_core_spec *spec,
>>> +                              __u32 *next_targ_id)
>>> +{
> 
> [...]
> 
>>> +             if (local_acc->name) {
>>> +                     if (!btf_is_composite(targ_type))
>>> +                             return 0;
>>> +
>>> +                     matched = bpf_core_match_member(local_spec->btf,
>>> +                                                     local_acc,
>>> +                                                     targ_btf, targ_id,
>>> +                                                     targ_spec, &targ_id);
>>> +                     if (matched <= 0)
>>> +                             return matched;
>>> +             } else {
>>> +                     /* for i=0, targ_id is already treated as array element
>>> +                      * type (because it's the original struct), for others
>>> +                      * we should find array element type first
>>> +                      */
>>> +                     if (i > 0) {
>> 
>> i == 0 case would go into "if (local_acc->name)" branch, no?
> 
> No, i == 0 is always an array access. s->a.b.c is the same as
> s[0].a.b.c, so relocation's first spec element is always either zero
> for pointer access or any non-negative index for array access. But it
> is always array access.

I see. Thanks for the explanation.

Song