linux-kernel - Re: [PATCH] bpf: Unregister fentry when bpf_trampoline_unlink

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9e36407a-22fb-0fc1-b147-ebdc12ea0823@meta.com>
Date:   Thu, 27 Apr 2023 11:54:16 -0700
From:   Yonghong Song <yhs@...a.com>
To:     Chen Zhongjin <chenzhongjin@...wei.com>, bpf@...r.kernel.org,
        linux-kernel@...r.kernel.org
Cc:     ast@...nel.org, daniel@...earbox.net, john.fastabend@...il.com,
        andrii@...nel.org, martin.lau@...ux.dev, song@...nel.org,
        yhs@...com, kpsingh@...nel.org, sdf@...gle.com, haoluo@...gle.com,
        jolsa@...nel.org
Subject: Re: [PATCH] bpf: Unregister fentry when bpf_trampoline_unlink_prog
 fails to update image



On 4/26/23 6:40 PM, Chen Zhongjin wrote:
> On 2023/4/27 2:17, Yonghong Song wrote:
>>
>>
>> On 4/26/23 2:55 AM, Chen Zhongjin wrote:
>>> In bpf_link_free, bpf trampoline will update the image and remove the
>>> unlinked prog.
>>>
>>> bpf_trampoline_unlink_prog is committed as 'never fail', however it 
>>> depends
>>> on the result of image update. It is possible to fail if memory 
>>> allocation
>>> fail in bpf_trampoline_update.
>>
>> Could you give more details which memory allocation fail here?
>> bpf_tramp_image_alloc()? Are you using some error injection or
>> this happens in your production workload?
>>
> I guess it's an error injection because syzkaller reported this.

Okay, this is what I suspected as well.

> 
>>>
>>> The error result of bpf_trampoline_update can't be passed to 
>>> bpf_link_free
>>> because link release callback returns void. Then it will free the prog
>>> whether image updating is successful or not.
>>> If the old image tries to call a freed prog, it makes kernel panic.
>>>
>>>      BUG: unable to handle page fault for address: ffffffffc04a8d20
>>>      #PF: supervisor instruction fetch in kernel mode
>>>      #PF: error_code(0x0010) - not-present page
>>>      RIP: 0010:0xffffffffc04a8d20
>>>      Code: Unable to access opcode bytes at RIP 0xffffffffc04a8cf6.
>>>      ...
>>>      Call Trace:
>>>      ? bpf_trampoline_78223_0
>>>      bpf_traced_function
>>>      ...
>>
>> Could you explain how 'the old image tries to call a freed prog'?
>> IIUC, the previous bpf_link_free() should not be available to
>> call the bpf prog, right?
>>
> What I mean here is, if failed to update the image, the image keeps 
> unchanged but the unlinked prog will be freed later.
> Next time when it enter the trampoline the image will call freed prog.

Okay. This description is better.

> 
>>>
>>> Fix this when bpf_trampoline_update failed in 
>>> bpf_trampoline_unlink_prog,
>>> unregister fentry to disable the trampoline. Then other progs on the
>>> trampoline can be unlinked safely and finally the trampoline will be
>>> released.
>>
>> Do we still leak tr->cur_image here?
>>
> No, bpf_tramp_image_put() will free everything when all progs_cnt 
> decline to zero in bpf_trampoline_update(). It is a release function, 
> but called 'put'.

Okay, I see. But it depends on where memory allocation failure
happens. For example, see below:

     static int bpf_trampoline_update(struct bpf_trampoline *tr,
     bool lock_direct_mutex)
     {
         struct bpf_tramp_image *im;
         struct bpf_tramp_links *tlinks;
         u32 orig_flags = tr->flags;
         bool ip_arg = false;
         int err, total;

         tlinks = bpf_trampoline_get_progs(tr, &total, &ip_arg);
         if (IS_ERR(tlinks))
                 return PTR_ERR(tlinks);

         if (total == 0) {
                 err = unregister_fentry(tr, tr->cur_image->image);
                 bpf_tramp_image_put(tr->cur_image);
                 tr->cur_image = NULL;
                 tr->selector = 0;
                 goto out;
         }

         im = bpf_tramp_image_alloc(tr->key, tr->selector);
         if (IS_ERR(im)) {
                 err = PTR_ERR(im);
                 goto out;
         }

	...

In bpf_trampoline_get_progs(),

     static struct bpf_tramp_links *
     bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int 
*total, bool *ip_arg)
     {
         struct bpf_tramp_link *link;
         struct bpf_tramp_links *tlinks;
         struct bpf_tramp_link **links;
         int kind;

         *total = 0;
         tlinks = kcalloc(BPF_TRAMP_MAX, sizeof(*tlinks), GFP_KERNEL);

If we have memory allocation failure, PTR_ERR(tlinks) will be returned
and there is no chance to reach 'total == 0' so tr->cur_image will not
be freed.

But if the memory allocation failure happens in 
bpf_tramp_image_alloc(...), yes, it is possible eventually when
all progs are unlinked, 'total' could be 0, unregister_fentry()
will be called again (hopefully no side effect) and the image
will be freed.

> 
>>>
>>> Fixes: 88fd9e5352fe ("bpf: Refactor trampoline update code")
>>
>> If the above is a refactoring patch, you should not use that
>> as 'Fixes' patch, you should pick one truely introduced the issue.
>>
>>> Signed-off-by: Chen Zhongjin <chenzhongjin@...wei.com>
>>> ---
>>>   kernel/bpf/trampoline.c | 5 ++++-
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
>>> index d0ed7d6f5eec..6daa93b30e81 100644
>>> --- a/kernel/bpf/trampoline.c
>>> +++ b/kernel/bpf/trampoline.c
>>> @@ -604,7 +604,10 @@ static int __bpf_trampoline_unlink_prog(struct 
>>> bpf_tramp_link *link, struct bpf_
>>>       }
>>>       hlist_del_init(&link->tramp_hlist);
>>>       tr->progs_cnt[kind]--;
>>> -    return bpf_trampoline_update(tr, true /* lock_direct_mutex */);
>>> +    err =  bpf_trampoline_update(tr, true /* lock_direct_mutex */);
>>> +    if (err && tr->cur_image)
>>> +        unregister_fentry(tr, tr->cur_image->image);
>>
>> If error happens for the all subsequent programs,
>> unregister_fentry() will be called multiple times. Any side effect?
> It will fail with no side effect. Actually if there is no error, 
> modify_fentry() will fail in update() as well. The fentry is available 
> until all progs are unlinked and the broken image is freed by 
> bpf_tramp_image_put().

This will cause a sudden behavior change if there are other active
programs. For example, prog1 and prog2, prog1 unlink triggered
memory allocation failure and caused unregister_fentry which restored
the original ip. Then prog2 even does not have a chance to run any
more even if memory suddenly not an issue any more.

> 
> However with an extra state to record this happens, it's possible to 
> re-register the fentry with new image when the next link/unlink calls 
> update(). It will generate a new image and replace/free the error one.
>>
>> Overall, I think this is an extreme corner case which happens
>> when kernel memory is extreme tight. If this is the case, not
>> sure whether it is worthwhile to fix it or not.
>>
> Yes, it's a really rare case. I'm just not sure whether it needs some 
> best-effort to avoid kernel panic at this point.
> 
> If you think it's not necessary. Just let it go.

At least with current patch, even we could avoid kernel panic, the
user expected behavior might change. To handle all error conditions
in bpf_trampoline_update() might be a big task. So I suggest to just
let it go.

> 
> Thanks for your time!
> 
> Best,
> Chen
>>> +    return err;
>>>   }
>>>   /* bpf_trampoline_unlink_prog() should never fail. */