[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1060942f-d479-7399-df13-d312f963a823@virtuozzo.com>
Date: Sat, 28 Aug 2021 11:01:00 +0300
From: Vasily Averin <vvs@...tuozzo.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
Christoph Paasch <christoph.paasch@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>,
Hideaki YOSHIFUJI <yoshfuji@...ux-ipv6.org>,
David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>,
netdev <netdev@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, kernel@...nvz.org,
Julian Wiedmann <jwi@...ux.ibm.com>,
Alexey Kuznetsov <kuznet@....inr.ac.ru>
Subject: Re: [PATCH NET-NEXT] ipv6: skb_expand_head() adjust skb->truesize
incorrectly
On 8/27/21 7:47 PM, Eric Dumazet wrote:
>
>
> On 8/27/21 8:23 AM, Vasily Averin wrote:
>
>> I asked Alexey Kuznetsov to look at this problem. Below is his answer:
>> "I think the current scheme is obsolete. It was created
>> when we had only two kinds of skb accounting (rmem & wmem)
>> and with more kinds of accounting it just does not work.
>> Even there we had ignored problems with adjusting accounting.
>>
>> Logically the best solution would be replacing ->destructor,
>> set_owner* etc with skb_ops. Something like:
>>
>> struct skb_ops
>> {
>> void init(struct sk_buff * skb, struct skb_ops * ops, struct
>> sock * owner);
>> void fini(struct sk_buff * skb);
>> void update(struct sk_buff * skb, int adjust);
>> void inherit(struct sk_buff * skb2, struct sk_buff * skb);
>> };
>>
>> init - is replacement for skb_set_owner_r|w
>> fini - is replacement for skb_orphan
>> update - is new operation to be used in places where skb->truesize changes,
>> instead of awful constructions like:
>>
>> if (!skb->sk || skb->destructor == sock_edemux)
>> skb->truesize += size - osize;
>>
>> Now it will look like:
>>
>> if (skb->ops)
>> skb->ops->update(skb, size - osize);
>>
>> inherit - is replacement for also awful constructs like:
>>
>> if (skb->sk)
>> skb_set_owner_w(skb2, skb->sk);
>>
>> Now it will be:
>>
>> if (skb->ops)
>> skb->ops->inherit(skb2, skb);
>>
>> The implementation looks mostly obvious.
>> Some troubles can be only with new functionality:
>> update of accounting was never done before.
>>
>>
>> More efficient, functionally equivalent, but uglier and less flexible
>> alternative would be removal of ->destructor, replaced with
>> a small numeric indicator of ownership:
>>
>> enum
>> {
>> SKB_OWNER_NONE, /* aka destructor == NULL */
>> SKB_OWNER_WMEM, /* aka destructor == sk_wfree */
>> SKB_OWNER_RMEM, /* aka destructor == sk_rfree */
>> SKB_OWNER_SK, /* aka destructor == sk_edemux */
>> SKB_OWNER_TCP, /* aka destructor == tcp_wfree */
>> }
>>
>> And the same init,fini,inherit,update become functions
>> w/o any inidirect calls. Not sure it is really more efficient though."
>>
>
> Well, this does not look as stable material, and would add a bunch
> of indirect calls which are quite expensive these days (CONFIG_RETPOLINE=y)
>
> I suggest we work on a fix, using existing infra, then eventually later
> try to refactor if this is really bringing improvements.
>
> A fix could simply be a revert of 0c9f227bee119 ("ipv6: use skb_expand_head in ip6_xmit")
> since only IPv6 has the problem (because of arbitrary headers size)
I think it is not enough.
Root of the problem is that skb_expand_head() works incorrectly with non-shared skb.
In this case it do not call skb_clone before pskb_expand_head() execution,
and as result pskb_expand_head() and does not adjust skb->truesize.
I think non-shared skb is more frequent case,
so all skb_expand_head() are affected.
Therefore we need to revert all my patch set in net-next:
f1260ff skbuff: introduce skb_expand_head()
e415ed3 ipv6: use skb_expand_head in ip6_finish_output2
0c9f227 ipv6: use skb_expand_head in ip6_xmit
5678a59 ipv4: use skb_expand_head in ip_finish_output2
14ee70c vrf: use skb_expand_head in vrf_finish_output
53744a4 ax25: use skb_expand_head
a1e975e bpf: use skb_expand_head in bpf_out_neigh_v4/6
07e1d6b Merge branch 'skb_expand_head'
with fixup
06669e6 vrf: fix NULL dereference in vrf_finish_output()
And then rework ip6_finish_output2() in upstream,
to call skb_realloc_headroom() like it was done in first patch version:
https://lkml.org/lkml/2021/7/7/469.
Thank you,
Vasily Averin
Powered by blists - more mailing lists