netdev - Re: BUG_ON triggered in skb

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4c0d599e-95a3-be7d-bfeb-e51a16648cc8@gmail.com>
Date:   Tue, 13 Mar 2018 15:47:47 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Yonghong Song <yhs@...com>,
        Steffen Klassert <steffen.klassert@...unet.com>,
        Eric Dumazet <eric.dumazet@...il.com>
Cc:     Daniel Borkmann <daniel@...earbox.net>,
        Alexei Starovoitov <ast@...com>,
        netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>,
        saeedm@...lanox.com, diptanu@...com
Subject: Re: BUG_ON triggered in skb_segment



On 03/13/2018 03:37 PM, Yonghong Song wrote:
> Adding additional cc's:
>    Saeed Mahameed as this is most likely mlx5 driver related.
>    Diptanu Gon Choudhury who initially reported the issue.
> 
> 
> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>
>>>
>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>
>>>>
>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>> ...
>>>>>> Setup:
>>>>>> =====
>>>>>>
>>>>>> The test will involve three machines:
>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>
>>>>>> The M_nat will do ipv4<->ipv6 address translation and then
>>>>>> forward packet
>>>>>> to proper destination. The control plane will configure M_nat 
>>>>>> properly
>>>>>> will understand virtual ipv4 address for machine M_ipv6, and
>>>>>> virtual ipv6 address for machine M_ipv4.
>>>>>>
>>>>>> M_nat runs a bpf program, which is attached to clsact (ingress) 
>>>>>> qdisc.
>>>>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>>>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>>>>> based on protocol change.
>>>>>> After the conversion, the program will make proper change on
>>>>>> ethhdr and ip4/6 header, recalculate checksum, and send the packet 
>>>>>> out
>>>>>> through bpf_redirect.
>>>>>>
>>>>>> Experiment:
>>>>>> ===========
>>>>>>
>>>>>> MTU: 1500B for all three machines.
>>>>>>
>>>>>> The tso/lro/gro are enabled on the M_nat box.
>>>>>>
>>>>>> ping works on both ways of M_ipv6 <-> M_ipv4.
>>>>>> It works for transfering a small file (4KB) between M_ipv6 and
>>>>>> M_ipv4 (both ways).
>>>>>> Transfering a large file (e.g., 4MB) from M_ipv6 to M_ipv4,
>>>>>> failed with the above BUG_ON, really fast.
>>>>>> Did not really test from M_ipv4 to M_ipv6 with large file.
>>>>>>
>>>>>> The error path likely to be (also from the above call stack):
>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>
>> Just out of curiosity, are these packets created with LRO or GRO?
>> Usually LRO is disabled if forwarding is enabled on a machine,
>> because segmented LGO packets are likely corrupt.
> 
> In our experiments, LRO is disabled.
> On mlx5, when GRO is on, the BUG_ON will happen, and
> when GRO is off, the BUG_ON will not happen.
> 
>>
>> These packets take an alternative redirect path, so not sure what
>> happens here.
>>
>>>>>>
>>>>>> In one of experiments, I explicitly printed the skb->len and
>>>>>> skb->data_len. The values are below:
>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>> They should be equal to avoid BUG.
>>>>>>
>>>>>> In another experiment, I got:
>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>
>>>>>> In both cases, the difference is 170 bytes. Not sure whether
>>>>>> this is just a coincidence or not.
>>>>>>
>>>>>> Workaround:
>>>>>> ===========
>>>>>>
>>>>>> A workaround to avoid BUG_ON is to disable lro/gro. This way,
>>>>>> kernel will not receive big packets and hence gso is not really 
>>>>>> called.
>>>>>>
>>>>>> I am not familiar with gso code. Does anybody hit this BUG_ON before?
>>>>>> Any suggestion on how to debug this?
>>>>>>
>>>>>
>>>>> skb_segment() works if incoming GRO packet is not modified in its
>>>>> geometry.
>>>>>
>>>>> In your case it seems you had to adjust gso_size (calling
>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>> skb_segment() badly, because geometry changes, unless you had
>>>>> specific MTU/MSS restrictions.
>>>>>
>>>>> You will have to make skb_segment() more generic if you really want 
>>>>> this.
>>>>
>>>> In net/core/filter.c function bpf_skb_change_proto, which is called
>>>> in the bpf program, does some GSO adjustment. Could you help check
>>>> whether it satisfies my above use case or not? Thanks!
>>>
>>> As I said this  helper ends up modifying gso_size by +/- 20 (sizeof(ipv6
>>> header) - sizeof(ipv4 header))
>>>
>>> So it wont work if skb_segment() is called after this change.
>>
>> Even HW TSO use gso_size to segment the packets. Would'nt this
>> result in broken packets too, if gso_size is modified on a
>> forwarding path?
>>
>>>
>>> Not clear why the GRO packet is not sent as is (as a TSO packet) since
>>> mlx4/mlx5 NICs certainly support TSO.
> 
> This is a very good observation.
> We did the same experiment on mlx4, the same kernel, the same userspace 
> apps, the same bpf program. The only difference is mlx4 vs. mlx5.
> The mlx4 works fine with LRO=off and GRO=on, while
> mlx5 failed with the same LRO/GRO configuration.
> 
> On mlx4 box, we are able to see TSO packets are increasing as the
> large file scp is in progress.
> # ethtool -S eth0 | grep tso
>       tso_packets: 45495
> # ethtool -S eth0 | grep tso
>       tso_packets: 45865
> # ethtool -S eth0 | grep tso
>       tso_packets: 46337
> # ethtool -S eth0 | grep tso
>       tso_packets: 46724
> 
> And use bcc tool to track to func call count for skb_segment
> and find it is called 0 times. Clearly, mlx4 is able to take
> the packet as TSO and hence the packet will not go to
> the stack.
> 
> # funccount.py -i 3 'skb_segment'
> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
> 
> FUNC                                    COUNT
> 
> FUNC                                    COUNT
> ...
> 
> CC Saeed Mahameed who may help debug and provide some insights
> what is the problem.

There are many reasons why a GRO packet might need to be segmented by 
software (skb_segment())

This is the step where you have crashes, so really mlx4/mlx5 difference 
do not matter.

gso_size should not be dynamically changed. This needs to be fixed in 
eBPF helpers eventually.

Traditionally, NAT might involve changing MSS option in TCP SYN and 
SYNACK ...

(This to avoid dropping too big packets, when the total size of a 
segment is increased by 20 bytes when converting IPV4 header to IPV6 one)

In the opposite direction, packet size is reduced by 20 bytes, so MSS 
can be left as is.