[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cd4bcd39-5f11-757e-8e43-5e494a0cb551@fb.com>
Date: Tue, 20 Mar 2018 15:53:39 -0700
From: Yonghong Song <yhs@...com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: Eric Dumazet <edumazet@...gle.com>, <ast@...com>,
Daniel Borkmann <daniel@...earbox.net>, <diptanu@...com>,
Netdev <netdev@...r.kernel.org>, Kernel Team <kernel-team@...com>
Subject: Re: [PATCH net-next v2 1/2] net: permit skb_segment on head_frag
frag_list skb
On 3/20/18 11:08 AM, Alexander Duyck wrote:
> On Tue, Mar 20, 2018 at 8:55 AM, Yonghong Song <yhs@...com> wrote:
>> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
>> function skb_segment(), line 3667. The bpf program attaches to
>> clsact ingress, calls bpf_skb_change_proto to change protocol
>> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
>> to send the changed packet out.
>>
>> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> 3473 netdev_features_t features)
>> 3474 {
>> 3475 struct sk_buff *segs = NULL;
>> 3476 struct sk_buff *tail = NULL;
>> ...
>> 3665 while (pos < offset + len) {
>> 3666 if (i >= nfrags) {
>> 3667 BUG_ON(skb_headlen(list_skb));
>> 3668
>> 3669 i = 0;
>> 3670 nfrags = skb_shinfo(list_skb)->nr_frags;
>> 3671 frag = skb_shinfo(list_skb)->frags;
>> 3672 frag_skb = list_skb;
>> ...
>>
>> call stack:
>> ...
>> #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
>> #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
>> #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
>> #4 [ffff883ffef03668] die at ffffffff8101deb2
>> #5 [ffff883ffef03698] do_trap at ffffffff8101a700
>> #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
>> #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
>> #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>> [exception RIP: skb_segment+3044]
>> RIP: ffffffff817e4dd4 RSP: ffff883ffef03860 RFLAGS: 00010216
>> RAX: 0000000000002bf6 RBX: ffff883feb7aaa00 RCX: 0000000000000011
>> RDX: ffff883fb87910c0 RSI: 0000000000000011 RDI: ffff883feb7ab500
>> RBP: ffff883ffef03928 R8: 0000000000002ce2 R9: 00000000000027da
>> R10: 000001ea00000000 R11: 0000000000002d82 R12: ffff883f90a1ee80
>> R13: ffff883fb8791120 R14: ffff883feb7abc00 R15: 0000000000002ce2
>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>> #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
>> --- <IRQ stack> ---
>> ...
>>
>> The triggering input skb has the following properties:
>> list_skb = skb->frag_list;
>> skb->nfrags != NULL && skb_headlen(list_skb) != 0
>> and skb_segment() is not able to handle a frag_list skb
>> if its headlen (list_skb->len - list_skb->data_len) is not 0.
>>
>> This patch addressed the issue by handling skb_headlen(list_skb) != 0
>> case properly if list_skb->head_frag is true, which is expected in
>> most cases. A one-element frag array is created for the list_skb head
>> and processed before list_skb->frags are processed.
>>
>> Reported-by: Diptanu Gon Choudhury <diptanu@...com>
>> Signed-off-by: Yonghong Song <yhs@...com>
>> ---
>> net/core/skbuff.c | 42 +++++++++++++++++++++++++++++-------------
>> 1 file changed, 29 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 715c134..0ad4cda 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -3475,9 +3475,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>> struct sk_buff *segs = NULL;
>> struct sk_buff *tail = NULL;
>> struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
>> - skb_frag_t *frag = skb_shinfo(head_skb)->frags;
>> + skb_frag_t *frag = skb_shinfo(head_skb)->frags, head_frag;
>
> I would move head_frag down into the while loop. No point in making it
> global to this function and eating up the extra stack space if you
> don't need it.
We need head_frag declared outside the main segmentation loop.
Its value may survive across different main loop iterations.
However, I see your point to save the stack space.
Will use a pointer here and do allocation on demand instead.
>
>> unsigned int mss = skb_shinfo(head_skb)->gso_size;
>> unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
>> + struct sk_buff *check_list_skb = list_skb;
>
> This seems like a waste of a pointer. You can probably just repurpose
> nfrags to achieve the same purpose you are achieving below. Note that
> nfrags is a signed int and only needing to store up to only 18 or so
> total frags so we can probably just reserve a nfrags value of -1 to
> indicate that we are using the header frag.
Just did some prototyping. Indeed, we could reserve the i = -1 to
indicate the special head_frag.
>
>> struct sk_buff *frag_skb = head_skb;
>> unsigned int offset = doffset;
>> unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
>> @@ -3590,6 +3591,7 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>> nskb = skb_clone(list_skb, GFP_ATOMIC);
>> list_skb = list_skb->next;
>> + check_list_skb = list_skb;
>>
>> if (unlikely(!nskb))
>> goto err;
>
> If my understanding is correct then this is unneeded if you just
> change how you use i and nfrags.
Right.
>
>> @@ -3664,21 +3666,35 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
>>
>> while (pos < offset + len) {
>
> You could probably just declare here since this would be more local to
> where it is actually used and the fact is it can be overwritten with
> each iteration of the loop anyway so there is no need to reserve the
> space before you get here.
The actual space can be allocated once and reused later. But the space
needs to be preserved across main segmentation loop.
>
>> if (i >= nfrags) {
>> - BUG_ON(skb_headlen(list_skb));
>> -
>> - i = 0;
>> - nfrags = skb_shinfo(list_skb)->nr_frags;
>> - frag = skb_shinfo(list_skb)->frags;
>> - frag_skb = list_skb;
>> + if (skb_headlen(list_skb) && check_list_skb == list_skb) {
>> + struct page *page;
>> +
>> + BUG_ON(!list_skb->head_frag);
>> +
>> + i = 0;
>> + nfrags = 1;
>> + page = virt_to_head_page(list_skb->head);
>> + head_frag.page.p = page;
>> + head_frag.page_offset = list_skb->data -
>> + (unsigned char *)page_address(page);
>> + head_frag.size = skb_headlen(list_skb);
>> + frag = &head_frag;
>> + check_list_skb = list_skb->next;
>
> The whole need for check_list_skb can be worked around if we take
> advantage of the fact that i and nfrags are both integers so we could
> use -1 for both to indicate we are processing the head frag. The only
> bit that gets messy is the fact that we have to add special handling
> for the case where skb_shinfo(list_skb)->nr_frags is 0. If that is the
> case we would have to set nfrags to 0 and bump the list_skb =
> list_skb->next so that we avoid trying to pull frags from an otherwise
> empty buffer.
Right. if nr_frags = 0, we do need to avoid pulling frags.
Thanks for the review, will send out v3 shortly.
>
>> + } else {
>> + i = 0;
>> + nfrags = skb_shinfo(list_skb)->nr_frags;
>> + frag = skb_shinfo(list_skb)->frags;
>> + frag_skb = list_skb;
>>
>> - BUG_ON(!nfrags);
>> + BUG_ON(!nfrags);
>>
>> - if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> - skb_zerocopy_clone(nskb, frag_skb,
>> - GFP_ATOMIC))
>> - goto err;
>> + if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
>> + skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
>> + goto err;
>>
>> - list_skb = list_skb->next;
>> + list_skb = list_skb->next;
>> + check_list_skb = list_skb;
>> + }
>> }
>>
>> if (unlikely(skb_shinfo(nskb)->nr_frags >=
>> --
>> 2.9.5
>>
Powered by blists - more mailing lists