netdev - Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3b75e280-7297-904a-a140-17ea04c8d968@gmail.com>
Date:   Tue, 6 Mar 2018 19:25:01 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     David Miller <davem@...emloft.net>
Cc:     ast@...nel.org, daniel@...earbox.net, netdev@...r.kernel.org,
        davejwatson@...com
Subject: Re: [bpf-next PATCH 05/16] bpf: create tcp_bpf_ulp allowing BPF to
 monitor socket TX/RX data

On 03/06/2018 10:18 AM, John Fastabend wrote:
> On 03/06/2018 07:47 AM, David Miller wrote:
>> From: John Fastabend <john.fastabend@...il.com>
>> Date: Mon, 5 Mar 2018 23:06:01 -0800
>>
>>> On 03/05/2018 10:42 PM, David Miller wrote:
>>>> From: John Fastabend <john.fastabend@...il.com>
>>>> Date: Mon, 5 Mar 2018 22:22:21 -0800
>>>>
>>>>> All I meant by this is if an application uses sendfile() call
>>>>> there is no good way to know when/if the kernel side will copy or
>>>>> xmit the  data. So a reliable user space application will need to
>>>>> only modify the data if it "knows" there are no outstanding sends
>>>>> in-flight. So if we assume applications follow this then it
>>>>> is OK to avoid the copy. Of course this is not good enough for
>>>>> security, but for monitoring/statistics (my use case 1 it works).
>>>>
>>>> For an application implementing a networking file system, it's pretty
>>>> legitimate for file contents to change before the page gets DMA's to
>>>> the networking card.
>>>>
>>>
>>> Still there are useful BPF programs that can tolerate this. So I
>>> would prefer to allow BPF programs to operate in the no-copy mode
>>> if wanted. It doesn't have to be the default though as it currently
>>> is. A l7 load balancer is a good example of this.
>>
>> Maybe I'd be ok if it were not the default.  But do you really want to
>> expose a potential attack vector, even if the app gets to choose and
>> say "I'm ok"?
>>
> 
> Yes, because I have use cases where I don't need to read the data, but
> have already "approved" the data. One example applications like
> nginx can serve static http data. Just reading over the code what they
> do, when sendfile is enabled, is a sendmsg call with the header. We want
> to enforce the policy on the header. Then we know the next N bytes are
> OK. Nginx will then send the payload over sendfile syscall. We already
> know the data is good from initial sendmsg call the next N bytes can
> get the verdict SK_PASS without even touching the data. If we do a
> copy in this case we see significant performance degradation.
> 
> The other use case is the L7 load balancer mentioned above. If we are
> using RR policies or some other heuristic if the user modifies the
> payload after the BPF verdict that is also fine. A malicious user
> could rewrite the header and try to game the load balancer but the
> BPF program can always just dev/null (SK_DROP) the application when
> it detects this. This also assumes the load balancer is using the
> header for its heuristic some interesting heuristics may not use
> the header at all.
> 
>>>> And that's perfectly fine, and we everything such that this will work
>>>> properly.
>>>>
>>>> The card checksums what ends up being DMA'd so nothing from the
>>>> networking side is broken.
>>>
>>> Assuming the card has checksum support correct? Which is why we have
>>> the SKBTX_SHARED_FRAG checked in skb_has_shared_frag() and the checksum
>>> helpers called by the drivers when they do not support the protocol
>>> being used. So probably OK assumption if using supported protocols and
>>> hardware? Perhaps in general folks just use normal protocols and
>>> hardware so it works.
>>
>> If the hardware doesn't support the checksums, we linearize the SKB
>> (therefore obtain a snapshot of the data), and checksum.  Exactly what
>> would happen if the hardware did the checksum.
>>
>> So OK in that case too.
>>
>> We always guarantee that you will always get a correct checksum on
>> outgoing packets, even if you modify the page contents meanwhile.
>>
> 
> Agreed the checksum is correct, but the user doesn't know if the linearize
> happened while it was modifying the data, potentially creating data with
> a partial update. Because the user modifying the data doesn't block the
> linearize operation in the kernel and vice versa the linearize operation
> can happen in parallel with the user side data modification. So maybe
> I'm still missing something but it seems the data can be in some unknown
> state on the wire.
> 
> Either way though I think its fine to make the default sendpage hook do
> the copy. A flag to avoid the copy can be added later to resolve my use
> cases above. I'll code this up in a v2 today/tomorrow.

Hi,

Thought about this a bit more and chatted with Daniel a bit. I think
a better solution is to set data_start = data_end = 0 by default in the
sendpage case. This will disallow any read/writes into the sendpage
data. Then if the user needs to read/write data we can use a helper
bpf_sk_msg_pull_data(start_byte, end_byte) which can pull the data into a
linear buffer as needed. This will ensure any user writes will not
change data after the BPF verdict (your concern). Also it will minimize
the amount of data that needs to be copied (my concern). In some of my
use cases where no data is needed we can simple not use the helper. Then
on the sendmsg side we can continue to set the (data_start, data_end)
pointers to the first scatterlist element. But, also use this helper to
set the data pointers past the first scatterlist element if needed. So
if someone wants to read past the first 4k bytes on a large send for
example this can be done with the above helper. BPF programs just
need to check (start,end) data pointers and can be oblivious to
if the program is being invoked by a call from sendpage or sendmsg.

I think this is a fairly elegant solution. Finally we can further
optimize later with a flag if needed to cover the case where we
want to read lots of bytes but _not_ do the copy. We can debate
the usefulness of this later with actual perf data.

All this has the added bonus that all I need is another patch on
top to add the helper. Pseudo code might look like this,

my_bpf_prog(struct sk_msg_md *msg) {
	void *data_end = msg->data_end;
	void *data_start = msg->data_start;

	need = PARSE_BYTES;

	// ensure user actually sent full header
	if (msg->size < PARSE_BYTES) {
		bpf_msg_cork(PARSE_BYTES);
		return SK_DROP;
	}

	/* ensure we can read full header, if this is a
	 * sendmsg system call AND PARSE_BYTES are all in
	 * the first scatterlist elem this is a no-op.
	 * If this is a sendpage call will put PARSE_BYTES
	 * in a psock buffer to avoid user modifications.
	 */
	if (data_end - data_start < PARSE_BYTES) {
		err = bpf_sk_msg_pull_data(0, PARSE_BYTES, flags);
		if (err)
			return SK_DROP;
	}

	// we have the full header parse it now
	verdict = my_bpf_header_parser(msg);
	return verdict;
}

Future optimization can work with prologue to pull in bytes
more efficiently. And for what its worth I found a couple bugs
in the error path of the sendpage hook I can fix in the v2 as well.

What do you think? 

@Daniel, sound more or less like what you were thinking?

> 
>>> So the "I need at least X more bytes" is the msg_cork_bytes() in patch
>>> 7. I could handle the sendpage case the same as I handle the sendmsg
>>> case and copy the data into the buffer until N bytes are received. I
>>> had planned to add this mode in a follow up series but could add it in
>>> this series so we have all the pieces in one submission.
>>>
>>> Although I used a scatterlist instead of a linear buffer. I was
>>> planning to add a helper to pull in next sg list item if needed
>>> rather than try to allocate a large linear block up front.
>>
>> For non-deep packet inspection cases this re-running of the parser case
>> will probably not trigger at all.
>>
> 
> Agreed, its mostly there to handle cases where the sendmsg call
> only sent part of a application (kafka, http, etc) header. This can
> happen if user is sending multiple messages in a single sendmsg/sendfile
> call. But, yeah I see it rarely in practice its mostly there for
> completeness and to handle these edge cases.
>