Date:   Mon, 12 Mar 2018 23:25:09 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Yonghong Song <yhs@...com>, Eric Dumazet <eric.dumazet@...il.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Alexei Starovoitov <ast@...com>,
        netdev <netdev@...r.kernel.org>, Martin Lau <kafai@...com>
Subject: Re: BUG_ON triggered in skb_segment



On 03/12/2018 11:08 PM, Yonghong Song wrote:
> 
> 
> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>
>>
>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>> Hi,
>>>
>>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>>> net-next function skb_segment, line 3667.
>>>
>>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>> 3473                             netdev_features_t features)
>>> 3474 {
>>> 3475         struct sk_buff *segs = NULL;
>>> 3476         struct sk_buff *tail = NULL;
>>> ...
>>> 3665                 while (pos < offset + len) {
>>> 3666                         if (i >= nfrags) {
>>> 3667                                 BUG_ON(skb_headlen(list_skb));
>>> 3668
>>> 3669                                 i = 0;
>>> 3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
>>> 3671                                 frag = skb_shinfo(list_skb)->frags;
>>> 3672                                 frag_skb = list_skb;
>>> ...
>>>
>>> call stack:
>>> ...
>>> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
>>>   #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>>>   #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>>>   #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>>>   #4 [ffff883ffef03668] die at ffffffff8101deb2
>>>   #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>>>   #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>>>   #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>>>   #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>>>      [exception RIP: skb_segment+3044]
>>>      RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>>>      RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>>>      RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>>>      RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>>>      R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>>>      R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>>>      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>   #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>>> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
>>> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
>>> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
>>> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
>>> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
>>> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
>>> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
>>> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
>>> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
>>> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
>>> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
>>> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
>>> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
>>> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
>>> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
>>> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
>>> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
>>> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
>>> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
>>> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
>>> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
>>> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
>>> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
>>> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
>>> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
>>> --- <IRQ stack> ---
>>> bt: cannot transition from IRQ stack to current process stack:
>>>          IRQ stack pointer: ffff883ffef034f8
>>>      process stack pointer: ffffffff81a01ae9
>>>         current stack base: ffffc9000c5c4000
>>> ...
>>> Setup:
>>> =====
>>>
>>> The test will involve three machines:
>>>    M_ipv6 <-> M_nat <-> M_ipv4
>>>
>>> M_nat does ipv4<->ipv6 address translation and then forwards each packet
>>> to the proper destination. The control plane configures M_nat so that it
>>> understands the virtual ipv4 address for machine M_ipv6 and the
>>> virtual ipv6 address for machine M_ipv4.
>>>
>>> M_nat runs a bpf program, which is attached to clsact (ingress) qdisc.
>>> The program uses bpf_skb_change_proto to do protocol conversion.
>>> bpf_skb_change_proto will adjust skb header_len and len properly
>>> based on protocol change.
>>> After the conversion, the program will make proper change on
>>> ethhdr and ip4/6 header, recalculate checksum, and send the packet out
>>> through bpf_redirect.
>>>
>>> Experiment:
>>> ===========
>>>
>>> MTU: 1500B for all three machines.
>>>
>>> The tso/lro/gro are enabled on the M_nat box.
>>>
>>> ping works on both ways of M_ipv6 <-> M_ipv4.
>>> It works for transferring a small file (4KB) between M_ipv6 and M_ipv4
>>> (both ways).
>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails
>>> with the above BUG_ON almost immediately.
>>> I did not test a large file from M_ipv4 to M_ipv6.
>>>
>>> The error path is likely (also from the above call stack):
>>>    nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>
>>> In one of experiments, I explicitly printed the skb->len and 
>>> skb->data_len. The values are below:
>>>    skb_segment: len 2856, data_len 2686
>>> They should be equal to avoid BUG.
>>>
>>> In another experiment, I got:
>>>    skb_segment: len 1428, data_len 1258
>>>
>>> In both cases, the difference is 170 bytes. Not sure whether
>>> this is just a coincidence or not.
>>>
>>> Workaround:
>>> ===========
>>>
>>> A workaround to avoid the BUG_ON is to disable lro/gro. This way, the
>>> kernel does not receive big packets and hence gso is not really called.
>>>
>>> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
>>> Any suggestions on how to debug this?
>>>
>>
>> skb_segment() works if the geometry of the incoming GRO packet is not
>> modified.
>>
>> In your case it seems you had to adjust gso_size (calling 
>> skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks 
>> skb_segment() badly, because geometry changes, unless you had specific 
>> MTU/MSS restrictions.
>>
>> You will have to make skb_segment() more generic if you really want this.
> 
> In net/core/filter.c function bpf_skb_change_proto, which is called
> in the bpf program, does some GSO adjustment. Could you help check
> whether it satisfies my above use case or not? Thanks!

As I said, this helper ends up modifying gso_size by +/- 20
(sizeof(ipv6 header) - sizeof(ipv4 header)).

So it won't work if skb_segment() is called after this change.
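
To make the +/- 20 concrete, here is a minimal userspace sketch of the adjustment. It is not the kernel code: the names model_gso_size_4_to_6() and model_gso_size_6_to_4() are hypothetical, and the header sizes assume an IPv4 header without options and a fixed IPv6 header without extension headers.

```c
#include <assert.h>

/* Assumed fixed header sizes: IPv4 without options, IPv6 without
 * extension headers. */
#define IPV4_HDR_LEN 20u
#define IPV6_HDR_LEN 40u

/* sizeof(ipv6 header) - sizeof(ipv4 header), i.e. the 20 from the mail. */
#define LEN_DIFF (IPV6_HDR_LEN - IPV4_HDR_LEN)

/* Hypothetical model of the 4->6 direction: the network header grows by
 * LEN_DIFF, so the per-segment payload size has to shrink by LEN_DIFF. */
static unsigned int model_gso_size_4_to_6(unsigned int gso_size)
{
    assert(gso_size > LEN_DIFF); /* otherwise the result would wrap */
    return gso_size - LEN_DIFF;
}

/* Hypothetical model of the 6->4 direction: the network header shrinks,
 * so the per-segment payload size grows by LEN_DIFF. */
static unsigned int model_gso_size_6_to_4(unsigned int gso_size)
{
    return gso_size + LEN_DIFF;
}
```

With an illustrative value of 1428, model_gso_size_6_to_4(1428) gives 1448. The data that GRO aggregated was still laid out in units of the old size, which is the geometry change that skb_segment() does not handle.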

It is not clear why the GRO packet is not sent as is (as a TSO packet), since
mlx4/mlx5 NICs certainly support TSO.
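
The len/data_len pairs quoted earlier in the thread can be checked against the BUG_ON condition. In the kernel, skb_headlen() is len minus data_len, so the BUG_ON fires whenever the linear part is non-empty. The sketch below is a userspace model only: struct model_skb and model_skb_headlen() are hypothetical stand-ins, and it assumes the printed values belong to the skb that the BUG_ON tests.

```c
#include <assert.h>

/* Hypothetical stand-in for the two sk_buff fields that matter here. */
struct model_skb {
    unsigned int len;      /* total bytes: linear part plus paged data */
    unsigned int data_len; /* bytes held in frags / frag_list only */
};

/* Same arithmetic as the kernel's skb_headlen(): size of the linear part. */
static unsigned int model_skb_headlen(const struct model_skb *skb)
{
    assert(skb->len >= skb->data_len);
    return skb->len - skb->data_len;
}

/* Models the check at line 3667: non-zero means the BUG_ON would trigger. */
static int model_would_hit_bug_on(const struct model_skb *skb)
{
    return model_skb_headlen(skb) != 0;
}
```

Both reported pairs (2856/2686 and 1428/1258) give a head length of 170 bytes, so both satisfy the BUG_ON condition under this model.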
