netdev - Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

Message-ID: <f30d24e5-5578-4804-d425-37f26f819972@gmail.com>
Date:   Tue, 6 Mar 2018 10:18:12 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     David Miller <davem@...emloft.net>
Cc:     ast@...nel.org, daniel@...earbox.net, netdev@...r.kernel.org,
        davejwatson@...com
Subject: Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to
 monitor socket TX/RX data

On 03/06/2018 07:47 AM, David Miller wrote:
> From: John Fastabend <john.fastabend@...il.com>
> Date: Mon, 5 Mar 2018 23:06:01 -0800
> 
>> On 03/05/2018 10:42 PM, David Miller wrote:
>>> From: John Fastabend <john.fastabend@...il.com>
>>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>>>
>>>> All I meant by this is if an application uses sendfile() call
>>>> there is no good way to know when/if the kernel side will copy or
>>>> xmit the data. So a reliable user space application will need to
>>>> only modify the data if it "knows" there are no outstanding sends
>>>> in-flight. So if we assume applications follow this then it
>>>> is OK to avoid the copy. Of course this is not good enough for
>>>> security, but for monitoring/statistics (my use case 1) it works.
>>>
>>> For an application implementing a networking file system, it's pretty
>>> legitimate for file contents to change before the page gets DMA'd to
>>> the networking card.
>>>
>>
>> Still there are useful BPF programs that can tolerate this. So I
>> would prefer to allow BPF programs to operate in the no-copy mode
>> if wanted. It doesn't have to be the default though as it currently
>> is. An L7 load balancer is a good example of this.
> 
> Maybe I'd be ok if it were not the default.  But do you really want to
> expose a potential attack vector, even if the app gets to choose and
> say "I'm ok"?
> 

Yes, because I have use cases where I don't need to read the data, but
have already "approved" the data. One example: applications like
nginx can serve static HTTP data. From reading over the code, what they
do when sendfile is enabled is a sendmsg call with the header. We want
to enforce the policy on the header. Then we know the next N bytes are
OK. Nginx will then send the payload over the sendfile syscall. We already
know the data is good from the initial sendmsg call, so the next N bytes
can get the verdict SK_PASS without even touching the data. If we do a
copy in this case we see significant performance degradation.
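
To make the header-then-payload flow concrete, here is a minimal user-space
sketch in C. It is only a model of the idea, not code from the patch series:
the struct, the function names, the VERDICT_* values and the "HTTP/1.1 200"
header check are all hypothetical. The sendmsg path inspects the header and
pre-approves N payload bytes; the sendfile path then only consumes that byte
budget and never reads the page contents.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; none of these names come from the series. */
enum verdict { VERDICT_DROP = 0, VERDICT_PASS = 1 };

struct sock_policy {
	size_t approved_bytes;	/* bytes already covered by an earlier verdict */
};

/* Hypothetical policy: accept only headers starting with "HTTP/1.1 200". */
static int header_ok(const char *hdr, size_t len)
{
	static const char want[] = "HTTP/1.1 200";
	size_t n = sizeof(want) - 1;

	return len >= n && memcmp(hdr, want, n) == 0;
}

/* sendmsg path: read the header, then pre-approve payload_len bytes. */
static enum verdict on_sendmsg(struct sock_policy *p, const char *hdr,
			       size_t len, size_t payload_len)
{
	assert(p != NULL && hdr != NULL);
	if (!header_ok(hdr, len))
		return VERDICT_DROP;
	p->approved_bytes = payload_len;
	return VERDICT_PASS;
}

/* sendfile path: no data access at all, only the byte budget is checked. */
static enum verdict on_sendpage(struct sock_policy *p, size_t len)
{
	assert(p != NULL);
	if (len > p->approved_bytes)
		return VERDICT_DROP;
	p->approved_bytes -= len;
	return VERDICT_PASS;
}
```

In this model anything sent past the approved N bytes is dropped until a new
header has been checked, which is the reason no copy is needed on the
sendfile side.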

The other use case is the L7 load balancer mentioned above. If we are
using RR policies or some other heuristic, it is also fine if the user
modifies the payload after the BPF verdict. A malicious user
could rewrite the header and try to game the load balancer, but the
BPF program can always just /dev/null (SK_DROP) the application when
it detects this. This also assumes the load balancer is using the
header for its heuristic; some interesting heuristics may not use
the header at all.

>>> And that's perfectly fine, and we everything such that this will work
>>> properly.
>>>
>>> The card checksums what ends up being DMA'd so nothing from the
>>> networking side is broken.
>>
>> Assuming the card has checksum support correct? Which is why we have
>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>> helpers called by the drivers when they do not support the protocol
>> being used. So probably OK assumption if using supported protocols and
>> hardware? Perhaps in general folks just use normal protocols and
>> hardware so it works.
> 
> If the hardware doesn't support the checksums, we linearize the SKB
> (therefore obtain a snapshot of the data), and checksum.  Exactly what
> would happen if the hardware did the checksum.
> 
> So OK in that case too.
> 
> We always guarantee that you will always get a correct checksum on
> outgoing packets, even if you modify the page contents meanwhile.
> 

Agreed, the checksum is correct, but the user doesn't know if the linearize
happened while it was modifying the data, potentially creating data with
a partial update. Because the user modifying the data doesn't block the
linearize operation in the kernel, and vice versa, the linearize operation
can happen in parallel with the user-side data modification. So maybe
I'm still missing something, but it seems the data can be in some unknown
state on the wire.
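
A small user-space sketch in C of the interleaving I have in mind. This is
not kernel code: toy_csum() is a plain byte sum standing in for a real
checksum, and torn_snapshot() is a hypothetical helper that replays one bad
ordering deterministically instead of racing two threads. The copy takes the
first part of the page, the "user" rewrites the page, then the copy finishes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy 16-bit additive sum; an assumption, not the real Internet checksum. */
static uint16_t toy_csum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];
	return (uint16_t)(sum & 0xffff);
}

/*
 * Replay one interleaving: copy the first `split` bytes, let the user
 * overwrite the whole page with `new_byte`, then copy the remainder.
 * `snap` must have room for `len` bytes.
 */
static void torn_snapshot(unsigned char *page, size_t len, size_t split,
			  unsigned char new_byte, unsigned char *snap)
{
	assert(split <= len);
	memcpy(snap, page, split);			  /* linearize, part 1 */
	memset(page, new_byte, len);			  /* user update lands */
	memcpy(snap + split, page + split, len - split); /* linearize, part 2 */
}
```

The snapshot ends up holding old bytes followed by new bytes. A checksum
computed over the snapshot is correct for what goes on the wire, but the
contents match neither the old nor the new version of the page.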

Either way, though, I think it's fine to make the default sendpage hook do
the copy. A flag to avoid the copy can be added later to resolve my use
cases above. I'll code this up in a v2 today/tomorrow.

>> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
>> 7. I could handle the sendpage case the same as I handle the sendmsg
>> case and copy the data into the buffer until N bytes are received. I
>> had planned to add this mode in a follow up series but could add it in
>> this series so we have all the pieces in one submission.
>>
>> Although I used a scatterlist instead of a linear buffer. I was
>> planning to add a helper to pull in next sg list item if needed
>> rather than try to allocate a large linear block up front.
> 
> For non-deep packet inspection cases this re-running of the parser case
> will probably not trigger at all.
> 

Agreed, it's mostly there to handle cases where the sendmsg call
only sent part of an application (kafka, http, etc.) header. This can
happen if the user is sending multiple messages in a single sendmsg/sendfile
call. But, yeah, I see it rarely in practice; it's mostly there for
completeness and to handle these edge cases.
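
For illustration, the "wait until at least N bytes are available" behaviour
can be modelled in user space roughly as below. This is a sketch under
simplifying assumptions: struct cork, cork_push() and CORK_BUF_MAX are
hypothetical names, and it uses a fixed linear buffer for brevity, whereas
the series uses a scatterlist.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CORK_BUF_MAX 256	/* arbitrary size chosen for the sketch */

struct cork {
	unsigned char buf[CORK_BUF_MAX];
	size_t have;	/* bytes buffered so far */
	size_t want;	/* bytes needed before the parser is run again */
};

/*
 * Append one send's worth of data.
 * Returns 1 once enough bytes are buffered to re-run the parser,
 * 0 while still waiting, and -1 if the data would not fit.
 */
static int cork_push(struct cork *c, const void *data, size_t len)
{
	assert(c != NULL && c->have <= CORK_BUF_MAX);
	if (len > CORK_BUF_MAX - c->have)
		return -1;
	memcpy(c->buf + c->have, data, len);
	c->have += len;
	return c->have >= c->want ? 1 : 0;
}
```

With want set to the full header length, a header split across two sends
returns 0 on the first call and 1 on the second, at which point the parser
sees the whole header.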