Date:   Wed, 14 Mar 2018 00:18:39 +0100
From:   Daniel Borkmann <daniel@...earbox.net>
To:     Alexei Starovoitov <ast@...com>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Yonghong Song <yhs@...com>,
        Steffen Klassert <steffen.klassert@...unet.com>
Cc:     netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>,
        saeedm@...lanox.com, diptanu@...com
Subject: Re: BUG_ON triggered in skb_segment

On 03/14/2018 12:09 AM, Alexei Starovoitov wrote:
> On 3/13/18 3:47 PM, Eric Dumazet wrote:
>>
>>
>> On 03/13/2018 03:37 PM, Yonghong Song wrote:
>>> Adding additional cc's:
>>>    Saeed Mahameed as this is most likely mlx5 driver related.
>>>    Diptanu Gon Choudhury who initially reported the issue.
>>>
>>>
>>> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>>>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>>>
>>>>>>
>>>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>>>> ...
>>>>>>>> Setup:
>>>>>>>> =====
>>>>>>>>
>>>>>>>> The test will involve three machines:
>>>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>>>
>>>>>>>> The M_nat box will do ipv4<->ipv6 address translation and then
>>>>>>>> forward the packet to the proper destination. The control plane
>>>>>>>> will configure M_nat so that it understands the virtual ipv4
>>>>>>>> address for machine M_ipv6, and the virtual ipv6 address for
>>>>>>>> machine M_ipv4.
>>>>>>>>
>>>>>>>> M_nat runs a bpf program, which is attached to clsact (ingress)
>>>>>>>> qdisc.
>>>>>>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>>>>>>> bpf_skb_change_proto will adjust the skb header_len and len
>>>>>>>> properly based on the protocol change.
>>>>>>>> After the conversion, the program will make the proper changes to
>>>>>>>> the ethhdr and ipv4/ipv6 headers, recalculate the checksum, and
>>>>>>>> send the packet out through bpf_redirect.
>>>>>>>>
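For reference, a minimal sketch of what such a clsact/tc program might look
like (illustrative only; the section name, ifindex and the elided header
rewriting are placeholders, not the reporter's actual program):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Placeholder egress ifindex towards M_ipv4. */
#define NAT64_OUT_IFINDEX 4

SEC("classifier")
int nat64_ingress(struct __sk_buff *skb)
{
	/* Switch the skb from IPv6 to IPv4 framing; the helper adjusts
	 * skb len/header room and, for GSO skbs, the gso metadata.
	 */
	if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IP), 0) < 0)
		return TC_ACT_SHOT;

	/* ... rewrite ethhdr, build the IPv4 header, fix checksums ... */

	/* Send the translated packet out through the chosen device. */
	return bpf_redirect(NAT64_OUT_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";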
>>>>>>>> Experiment:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> MTU: 1500B for all three machines.
>>>>>>>>
>>>>>>>> The tso/lro/gro are enabled on the M_nat box.
>>>>>>>>
>>>>>>>> ping works both ways between M_ipv6 and M_ipv4.
>>>>>>>> It works for transferring a small file (4KB) between M_ipv6 and
>>>>>>>> M_ipv4 (both ways).
>>>>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
>>>>>>>> fails with the above BUG_ON very quickly.
>>>>>>>> We did not really test from M_ipv4 to M_ipv6 with a large file.
>>>>>>>>
>>>>>>>> The likely error path (also from the above call stack) is:
>>>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>>
>>>> Just out of curiosity, are these packets created with LRO or GRO?
>>>> Usually LRO is disabled if forwarding is enabled on a machine,
>>>> because segmented LRO packets are likely corrupt.
>>>
>>> In our experiments, LRO is disabled.
>>> On mlx5, when GRO is on, the BUG_ON will happen, and
>>> when GRO is off, the BUG_ON will not happen.
>>>
>>>>
>>>> These packets take an alternative redirect path, so I'm not sure
>>>> what happens here.
>>>>
>>>>>>>>
>>>>>>>> In one of the experiments, I explicitly printed the skb->len and
>>>>>>>> skb->data_len. The values are below:
>>>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>>>> They should be equal to avoid the BUG.
>>>>>>>>
>>>>>>>> In another experiment, I got:
>>>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>>>
>>>>>>>> In both cases, the difference is 170 bytes. Not sure whether
>>>>>>>> this is just a coincidence or not.
>>>>>>>>
>>>>>>>> Workaround:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> A workaround to avoid the BUG_ON is to disable lro/gro. This way,
>>>>>>>> the kernel will not receive big packets and hence gso is not
>>>>>>>> really called.
>>>>>>>>
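For example, assuming the receiving interface on the M_nat box is eth0,
something like:

# ethtool -K eth0 gro off lro off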
>>>>>>>> I am not familiar with the gso code. Has anybody hit this BUG_ON
>>>>>>>> before?
>>>>>>>> Any suggestions on how to debug this?
>>>>>>>>
>>>>>>>
>>>>>>> skb_segment() works if the incoming GRO packet is not modified in
>>>>>>> its geometry.
>>>>>>>
>>>>>>> In your case it seems you had to adjust gso_size (calling
>>>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>>>> skb_segment() badly, because the geometry changes, unless you had
>>>>>>> specific MTU/MSS restrictions.
>>>>>>>
>>>>>>> You will have to make skb_segment() more generic if you really
>>>>>>> want this.
>>>>>>
>>>>>> In net/core/filter.c, the function bpf_skb_change_proto, which is
>>>>>> called by the bpf program, does some GSO adjustment. Could you help
>>>>>> check whether it satisfies my use case above or not? Thanks!
>>>>>
>>>>> As I said, this helper ends up modifying gso_size by +/- 20
>>>>> (sizeof(ipv6 header) - sizeof(ipv4 header)).
>>>>>
>>>>> So it won't work if skb_segment() is called after this change.
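For context, a simplified sketch of the GSO metadata adjustment that the
6-to-4 path of the helper performs (not the verbatim in-tree code; the
header shrinking and checksum handling are elided):

#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/skbuff.h>

/* Sketch only: gso fixup after shrinking the network header by 20 bytes. */
static void sketch_proto_6_to_4_gso_fixup(struct sk_buff *skb)
{
	const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr);
	struct skb_shared_info *shinfo;

	if (!skb_is_gso(skb))
		return;

	shinfo = skb_shinfo(skb);

	/* TCP-over-IPv6 segments become TCP-over-IPv4 segments. */
	if (shinfo->gso_type & SKB_GSO_TCPV6) {
		shinfo->gso_type &= ~SKB_GSO_TCPV6;
		shinfo->gso_type |= SKB_GSO_TCPV4;
	}

	/* The IPv4 header is 20 bytes shorter, so the MSS can grow by 20;
	 * this is the gso_size change discussed above.
	 */
	skb_increase_gso_size(shinfo, len_diff);

	/* Mark the skb as coming from an untrusted source so that headers
	 * are re-checked and gso_segs recomputed on the segmentation path.
	 */
	shinfo->gso_type |= SKB_GSO_DODGY;
	shinfo->gso_segs = 0;
}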
>>>>
>>>> Even HW TSO uses gso_size to segment the packets. Wouldn't this
>>>> result in broken packets too, if gso_size is modified on a
>>>> forwarding path?
>>>>
>>>>>
>>>>> Not clear why the GRO packet is not sent as is (as a TSO packet) since
>>>>> mlx4/mlx5 NICs certainly support TSO.
>>>
>>> This is a very good observation.
>>> We did the same experiment on mlx4, the same kernel, the same
>>> userspace apps, the same bpf program. The only difference is mlx4 vs.
>>> mlx5.
>>> mlx4 works fine with LRO=off and GRO=on, while
>>> mlx5 fails with the same LRO/GRO configuration.
>>>
>>> On the mlx4 box, we can see that the TSO packet count increases as
>>> the large file scp is in progress.
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45495
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45865
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46337
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46724
>>>
>>> And we used a bcc tool to track the function call count for
>>> skb_segment and found it is called 0 times. Clearly, mlx4 is able
>>> to take the packet as TSO and hence the packet will not go to
>>> the stack for software segmentation.
>>>
>>> # funccount.py -i 3 'skb_segment'
>>> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
>>>
>>> FUNC                                    COUNT
>>>
>>> FUNC                                    COUNT
>>> ...
>>>
>>> CC Saeed Mahameed, who may help debug and provide some insight
>>> into what the problem is.
>>
>> There are many reasons why a GRO packet might need to be segmented by
>> software (skb_segment()).
>>
>> This is the step where you see the crashes, so really the mlx4/mlx5
>> difference does not matter.
>>
>> gso_size should not be dynamically changed. This needs to be fixed in
>> eBPF helpers eventually.
> 
> We have bpf_skb_proto_6_to_4(), which has been used by Cilium for a
> long time. It's not clear why it's not crashing there, but we cannot
> just reject changing the proto in bpf programs now.

Right, it has been used in Cilium since the helper was added. Back then I
recall I tested this back to back over ixgbe with the various options toggled
on/off, with a BPF prog implementing nat64 on each side, running various
netperf streams; later I integrated the logic into Cilium itself. Perhaps
something changed with more recent kernels as well. I'll look more closely
into this issue tomorrow.

> We have to fix whatever needs to be fixed in skb_segment
> (if the bug is there) or fix whatever is necessary on the mlx5 side.
> In the bpf helper we mark the skb as SKB_GSO_DODGY, just like packets
> coming through virtio would be, so if skb_segment() needs to do
> something special with the skb, the SKB_GSO_DODGY flag is already there.
