lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 14 Mar 2018 00:18:39 +0100
From:   Daniel Borkmann <>
To:     Alexei Starovoitov <>,
        Eric Dumazet <>,
        Yonghong Song <>,
        Steffen Klassert <>
Cc:     netdev <>, Martin Lau <>,,
Subject: Re: BUG_ON triggered in skb_segment

On 03/14/2018 12:09 AM, Alexei Starovoitov wrote:
> On 3/13/18 3:47 PM, Eric Dumazet wrote:
>> On 03/13/2018 03:37 PM, Yonghong Song wrote:
>>> Adding additional cc's:
>>>    Saeed Mahameed as this is most likely mlx5 driver related.
>>>    Diptanu Gon Choudhury who initially reported the issue.
>>> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>>>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>>>> ...
>>>>>>>> Setup:
>>>>>>>> =====
>>>>>>>> The test will involve three machines:
>>>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>>> The M_nat will do ipv4<->ipv6 address translation and then
>>>>>>>> forward packet
>>>>>>>> to proper destination. The control plane will configure M_nat
>>>>>>>> properly
>>>>>>>> will understand virtual ipv4 address for machine M_ipv6, and
>>>>>>>> virtual ipv6 address for machine M_ipv4.
>>>>>>>> M_nat runs a bpf program, which is attached to clsact (ingress)
>>>>>>>> qdisc.
>>>>>>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>>>>>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>>>>>>> based on protocol change.
>>>>>>>> After the conversion, the program will make proper change on
>>>>>>>> ethhdr and ip4/6 header, recalculate checksum, and send the
>>>>>>>> packet out
>>>>>>>> through bpf_redirect.
>>>>>>>> Experiment:
>>>>>>>> ===========
>>>>>>>> MTU: 1500B for all three machines.
>>>>>>>> The tso/lro/gro are enabled on the M_nat box.
>>>>>>>> ping works on both ways of M_ipv6 <-> M_ipv4.
>>>>>>>> It works for transfering a small file (4KB) between M_ipv6 and
>>>>>>>> M_ipv4 (both ways).
>>>>>>>> Transfering a large file (e.g., 4MB) from M_ipv6 to M_ipv4,
>>>>>>>> failed with the above BUG_ON, really fast.
>>>>>>>> Did not really test from M_ipv4 to M_ipv6 with large file.
>>>>>>>> The error path likely to be (also from the above call stack):
>>>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>> Just out of curiosity, are these packets created with LRO or GRO?
>>>> Usually LRO is disabled if forwarding is enabled on a machine,
>>>> because segmented LGO packets are likely corrupt.
>>> In our experiments, LRO is disabled.
>>> On mlx5, when GRO is on, the BUG_ON will happen, and
>>> when GRO is off, the BUG_ON will not happen.
>>>> These packets take an alternative redirect path, so not sure what
>>>> happens here.
>>>>>>>> In one of experiments, I explicitly printed the skb->len and
>>>>>>>> skb->data_len. The values are below:
>>>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>>>> They should be equal to avoid BUG.
>>>>>>>> In another experiment, I got:
>>>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>>> In both cases, the difference is 170 bytes. Not sure whether
>>>>>>>> this is just a coincidence or not.
>>>>>>>> Workaround:
>>>>>>>> ===========
>>>>>>>> A workaround to avoid BUG_ON is to disable lro/gro. This way,
>>>>>>>> kernel will not receive big packets and hence gso is not really
>>>>>>>> called.
>>>>>>>> I am not familiar with gso code. Does anybody hit this BUG_ON
>>>>>>>> before?
>>>>>>>> Any suggestion on how to debug this?
>>>>>>> skb_segment() works if incoming GRO packet is not modified in its
>>>>>>> geometry.
>>>>>>> In your case it seems you had to adjust gso_size (calling
>>>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
>>>>>>> skb_segment() badly, because geometry changes, unless you had
>>>>>>> specific MTU/MSS restrictions.
>>>>>>> You will have to make skb_segment() more generic if you really
>>>>>>> want this.
>>>>>> In net/core/filter.c function bpf_skb_change_proto, which is called
>>>>>> in the bpf program, does some GSO adjustment. Could you help check
>>>>>> whether it satisfies my above use case or not? Thanks!
>>>>> As I said this  helper ends up modifying gso_size by +/- 20
>>>>> (sizeof(ipv6
>>>>> header) - sizeof(ipv4 header))
>>>>> So it wont work if skb_segment() is called after this change.
>>>> Even HW TSO use gso_size to segment the packets. Would'nt this
>>>> result in broken packets too, if gso_size is modified on a
>>>> forwarding path?
>>>>> Not clear why the GRO packet is not sent as is (as a TSO packet) since
>>>>> mlx4/mlx5 NICs certainly support TSO.
>>> This is a very good observation.
>>> We did the same experiment on mlx4, the same kernel, the same
>>> userspace apps, the same bpf program. The only difference is mlx4 vs.
>>> mlx5.
>>> The mlx4 works fine with LRO=off and GRO=on, while
>>> mlx5 failed with the same LRO/GRO configuration.
>>> On mlx4 box, we are able to see TSO packets are increasing as the
>>> large file scp is in progress.
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45495
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 45865
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46337
>>> # ethtool -S eth0 | grep tso
>>>       tso_packets: 46724
>>> And use bcc tool to track to func call count for skb_segment
>>> and find it is called 0 times. Clearly, mlx4 is able to take
>>> the packet as TSO and hence the packet will not go to
>>> the stack.
>>> # -i 3 'skb_segment'
>>> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
>>> FUNC                                    COUNT
>>> FUNC                                    COUNT
>>> ...
>>> CC Saeed Mahameed who may help debug and provide some insights
>>> what is the problem.
>> There are many reasons why a GRO packet might need to be segmented by
>> software (skb_segment())
>> This is the step where you have crashes, so really mlx4/mlx5 difference
>> do not matter.
>> gso_size should not be dynamically changed. This needs to be fixed in
>> eBPF helpers eventually.
> we have bpf_skb_proto_6_to_4() that was used by cilium for long time.
> It's not clear why it's not crashing there, but we cannot just
> reject changing proto in bpf programs now.

Right, used in Cilium since the helper was added. Back then I recall I tested
this back to back over ixgbe w/ the various options mixed on/off with BPF prog
implementing nat64 on each side, running over various netperf streams; and later
integrated the logic into Cilium itself. Perhaps something changed with more
recent kernels as well. I'll look closer into this issue tomorrow.

> We have to fix whatever needs to be fixed in skb_segment
> (if bug is there) or fix whatever necessary on mlx5 side.
> In bpf helper we mark it as SKB_GSO_DODGY just like packets coming
> through virtio would do, so if skb_segment() needs to do something
> special with skb the SKB_GSO_DODGY flag is already there.

Powered by blists - more mailing lists