netdev - Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset relocation algorithm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <99D7A5F9-48FD-4C64-A5F1-E0CD300F7316@fb.com>
Date:   Sun, 28 Jul 2019 00:24:28 +0000
From:   Song Liu <songliubraving@...com>
To:     Andrii Nakryiko <andrii.nakryiko@...il.com>
CC:     Andrii Nakryiko <andriin@...com>, bpf <bpf@...r.kernel.org>,
        Networking <netdev@...r.kernel.org>,
        Alexei Starovoitov <ast@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Yonghong Song <yhs@...com>, Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH bpf-next 02/10] libbpf: implement BPF CO-RE offset
 relocation algorithm



> On Jul 27, 2019, at 12:09 PM, Andrii Nakryiko <andrii.nakryiko@...il.com> wrote:
> 
> On Sat, Jul 27, 2019 at 11:59 AM Song Liu <songliubraving@...com> wrote:
>> 
>> 
>> 
>>> On Jul 26, 2019, at 11:11 PM, Andrii Nakryiko <andrii.nakryiko@...il.com> wrote:
>>> 
>>> On Thu, Jul 25, 2019 at 12:32 PM Song Liu <songliubraving@...com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Jul 24, 2019, at 12:27 PM, Andrii Nakryiko <andriin@...com> wrote:
>>>>> 
>>>>> This patch implements the core logic for BPF CO-RE offsets relocations.
>>>>> All the details are described in code comments.
>>>> 
>>>> Some description in the change log is still useful. Please at least
>>>> copy-paste key comments here.
>>> 
>>> OK, will add some more.
>>> 
>>>> 
>>>> And, this is looooong. I think it is totally possible to split it into
>>>> multiple smaller patches.
>>> 
>>> I don't really know how to split it further without hurting reviewing
>>> by artificially splitting related code into separate patches. Remove
>>> any single function and algorithm will be incomplete.
>>> 
>>> Let me give you some high-level overview of how pieces are put
>>> together. There are 9 non-trivial functions, let's go over their
>>> purpose in the orderd in which they are defined in file:
>>> 
>>> 1. bpf_core_spec_parse()
>>> 
>>> This one take bpf_offset_reloc's type_id and accessor string
>>> ("0:1:2:3") and parses it into more convenient bpf_core_spec
>>> datastructure, which has calculated offset and high-level spec
>>> "steps": either named field or array access.
>>> 
>>> 2. bpf_core_find_cands()
>>> 
>>> Given local type name, finds all possible target BTF types with same
>>> name (modulo "flavor" differences, ___flavor suffix is just ignored).
>>> 
>>> 3. bpf_core_fields_are_compat()
>>> 
>>> Given local and target field match, checks that their types are
>>> compatible (so that we don't accidentally match, e.g., int against
>>> struct).
>>> 
>>> 4. bpf_core_match_member()
>>> 
>>> Given named local field, find corresponding field in target struct. To
>>> understand why it's not trivial, here's an example:
>>> 
>>> Local type:
>>> 
>>> struct s___local {
>>> int a;
>>> };
>>> 
>>> Target type:
>>> 
>>> struct s___target {
>>> struct {
>>>   union {
>>>     int a;
>>>   };
>>> };
>>> };
>>> 
>>> For both cases you can access a as s.a, but in local case, field a is
>>> immediately inside s___local, while for s___target, you have to
>>> traverse two levels deeper into anonymous fields to get to an `a`
>>> inside anonymous union.
>>> 
>>> So this function find that `a` by doing exhaustive search across all
>>> named field and anonymous struct/unions. But otherwise it's pretty
>>> straightforward recursive function.
>>> 
>>> bpf_core_spec_match()
>>> 
>>> Just goes over high-level spec steps in local spec and tries to figure
>>> out both high-level and low-level steps for targe type. Consider the
>>> above example. For both structs accessing s.a is one high-level step,
>>> but for s___local it's single low-level step (just another :0 in spec
>>> string), while for s___target it's three low-level steps: ":0:0:0",
>>> one step for each BTF type we need to traverse.
>>> 
>>> Array access is simpler, it's always one high-level and one low-level step.
>>> 
>>> bpf_core_reloc_insn()
>>> 
>>> Once we match local and target specs and have local and target
>>> offsets, do the relocations - check that instruction has expected
>>> local offset and replace it with target offset.
>>> 
>>> bpf_core_find_kernel_btf()
>>> 
>>> This is the only function that can be moved into separate patch, but
>>> it's also very simple. It just iterates over few known possible
>>> locations for vmlinux image and once found, tries to parse .BTF out of
>>> it, to be used as target BTF.
>>> 
>>> bpf_core_reloc_offset()
>>> 
>>> It combines all the above functions to perform single relocation.
>>> Parse spec, get candidates, for each candidate try to find matching
>>> target spec. All candidates that matched are cached for given local
>>> root type.
>> 
>> Thanks for these explanation. They are really helpful.
>> 
>> I think some example explaining each step of bpf_core_reloc_offset()
>> will be very helpful. Something like:
>> 
>> Example:
>> 
>> struct s {
>>        int a;
>>        struct {
>>                int b;
>>                bool c;
>>        };
>> };
>> 
>> To get offset for c, we do:
>> 
>> bpf_core_reloc_offset() {
>> 
>>        /* input data: xxx */
>> 
>>        /* first step: bpf_core_spec_parse() */
>> 
>>        /* data after first step */
>> 
>>        /* second step: bpf_core_find_cands() */
>> 
>>        /* candidate A and B after second step */
>> 
>>        ...
>> }
>> 
>> Well, it requires quite some work to document this way. Please let me
>> know if you feel this is an overkill.
> 
> Yeah :) And it's not just work, but I think it's bad if comments
> become too specific and document very low-level steps, because code
> might evolve and comments can quickly get out of sync and just add to
> confusion. Which is why I tried to document high-level ideas, leaving
> it up to the source code to be the ultimate reference of minutia
> details.

Fair enough. 

> 
>> 
>>> 
>>> bpf_core_reloc_offsets()
>>> 
>>> High-level coordination. Iterate over all per-program .BTF.ext offset
>>> reloc sections, each relocation within them. Find corresponding
>>> program and try to apply relocations one by one.
>>> 
>>> 
>>> I think the only non-obvious part here is to understand that
>>> relocation records local raw spec with every single anonymous type
>>> traversal, which is not that useful when we try to match it against
>>> target type, which can have very different composition, but still the
>>> same field access pattern, from C language standpoint (which hides all
>>> those anonymous type traversals from programmer).
>>> 
>>> But it should be pretty clear now, plus also check tests, they have
>>> lots of cases showing what's compatible and what's not.
>> 
>> I see. I will review the tests.
>> 
>>>>> 
>>>>> static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>>>> -                                                  __u32 id)
>>>>> +                                                  __u32 id,
>>>>> +                                                  __u32 *res_id)
>>>>> {
>>>>>     const struct btf_type *t = btf__type_by_id(btf, id);
>>>> 
>>>> Maybe have a local "__u32 rid;"
>>>> 
>>>>> 
>>>>> +     if (res_id)
>>>>> +             *res_id = id;
>>>>> +
>>>> 
>>>> and do "rid = id;" here
>>>> 
>>>>>     while (true) {
>>>>>             switch (BTF_INFO_KIND(t->info)) {
>>>>>             case BTF_KIND_VOLATILE:
>>>>>             case BTF_KIND_CONST:
>>>>>             case BTF_KIND_RESTRICT:
>>>>>             case BTF_KIND_TYPEDEF:
>>>>> +                     if (res_id)
>>>>> +                             *res_id = t->type;
>>>> and here
>>>> 
>>>>>                     t = btf__type_by_id(btf, t->type);
>>>>>                     break;
>>>>>             default:
>>>> and "*res_id = rid;" right before return?
>>> 
>>> Sure, but why?
>> 
>> I think it is cleaner that way. But feel free to ignore if you
>> think otherwise.
>> 
>>> 
>>>> 
>>>>> @@ -1041,7 +1049,7 @@ static const struct btf_type *skip_mods_and_typedefs(const struct btf *btf,
>>>>> static bool get_map_field_int(const char *map_name, const struct btf *btf,
>>>>>                           const struct btf_type *def,
>>>>>                           const struct btf_member *m, __u32 *res) {
>>> 
>>> [...]
>>> 
>>>>> +struct bpf_core_spec {
>>>>> +     const struct btf *btf;
>>>>> +     /* high-level spec: named fields and array indicies only */
>>>> 
>>>> typo: indices
>>> 
>>> thanks!
>>> 
>>>> 
>>>>> +     struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN];
>>>>> +     /* high-level spec length */
>>>>> +     int len;
>>>>> +     /* raw, low-level spec: 1-to-1 with accessor spec string */
>>>>> +     int raw_spec[BPF_CORE_SPEC_MAX_LEN];
>>>>> +     /* raw spec length */
>>>>> +     int raw_len;
>>>>> +     /* field byte offset represented by spec */
>>>>> +     __u32 offset;
>>>>> +};
>>> 
>>> [...]
>>> 
>>>>> + *
>>>>> + *   int x = &s->a[3]; // access string = '0:1:2:3'
>>>>> + *
>>>>> + * Low-level spec has 1:1 mapping with each element of access string (it's
>>>>> + * just a parsed access string representation): [0, 1, 2, 3].
>>>>> + *
>>>>> + * High-level spec will capture only 3 points:
>>>>> + *   - intial zero-index access by pointer (&s->... is the same as &s[0]...);
>>>>> + *   - field 'a' access (corresponds to '2' in low-level spec);
>>>>> + *   - array element #3 access (corresponds to '3' in low-level spec).
>>>>> + *
>>>>> + */
>>>> 
>>>> IIUC, high-level points are subset of low-level points. How about we introduce
>>>> "anonymous" high-level points, so that high-level points and low-level points
>>>> are 1:1 mapping?
>>> 
>>> No, that will just hurt and complicate things. See above explanation
>>> about why we need high-level points (it's what you as C programmer try
>>> to achieve vs low-level spec is what C-language does in reality, with
>>> all the anonymous struct/union traversal).
>>> 
>>> What's wrong with this separation? Think about it as recording
>>> "intent" (high-level spec) vs "mechanics" (low-level spec, how exactly
>>> to achieve that intent, in excruciating details).
>> 
>> There is nothing wrong with separation. I just personally think it is
>> cleaner the other way. That's why I raised the question.
>> 
>> I will go with your assessment, as you looked into this much more than
>> I did. :-)
> 
> For me it's a machine view of the problem (raw spec) vs human view of
> the problem (high-level spec, which resembles how you think about this
> in C code). I'll keep it separate unless it proves to be problematic
> going forward.

[...]

>>> 
>>> No. spec->offset is carefully maintained across multiple low-level
>>> steps, as we traverse down embedded structs/unions.
>>> 
>>> Think about, e.g.:
>>> 
>>> struct s {
>>>   int a;
>>>   struct {
>>>       int b;
>>>   };
>>> };
>>> 
>>> Imagine you are trying to match s.b access. With what you propose
>>> you'll end up with offset 0, but it should be 4.
>> 
>> Hmm... this is just for i == 0, right? Which line updated spec->offset
>> after "memset(spec, 0, sizeof(*spec));"?
> 
> Ah, I missed that you are referring to the special i == 0 case. I can
> do assignment, yes, you are right. I'll probably also extract it out
> of the loop to make it less confusing.

Yes, please. 

> 
>>>>> +                     continue;
>>>>> +             }
>>>> 
>>>> Maybe pull i == 0 case out of the for loop?

As I mentioned earlier. ;-)

>>>> 
>>>>> +
>>>>> +             if (btf_is_composite(t)) {
>>> 
>>> [...]
>>> 
>>>>> +
>>>>> +     if (spec->len == 0)
>>>>> +             return -EINVAL;
>>>> 
>>>> Can this ever happen?
>>> 
>>> Not really, because I already check raw_len == 0 and exit with error.
>>> I'll remove.
>>> 
>>>> 
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>> 

[...]