netdev - Re: [PATCH net] veth: Avoid drop packets when xdp

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <567d3635f6e7969c4e1a0e4bc759556c472d1dff.camel@redhat.com>
Date:   Thu, 29 Sep 2022 08:57:36 +0200
From:   Paolo Abeni <pabeni@...hat.com>
To:     Heng Qi <hengqi@...ux.alibaba.com>,
        Toke Høiland-Jørgensen <toke@...hat.com>,
        netdev@...r.kernel.org
Cc:     "David S. Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH net] veth: Avoid drop packets when xdp_redirect performs

Hello,

On Thu, 2022-09-29 at 10:50 +0800, Heng Qi wrote:
> 在 2022/9/28 下午10:58, Toke Høiland-Jørgensen 写道:
> > Heng Qi <hengqi@...ux.alibaba.com> writes:
> > 
> > > 在 2022/9/27 下午8:20, Toke Høiland-Jørgensen 写道:
> > > > Heng Qi <hengqi@...ux.alibaba.com> writes:
> > > > 
> > > > > In the current processing logic, when xdp_redirect occurs, it transmits
> > > > > the xdp frame based on napi.
> > > > > 
> > > > > If napi of the peer veth is not ready, the veth will drop the packets.
> > > > > This doesn't meet our expectations.
> > > > Erm, why don't you just enable NAPI? Loading an XDP program is not
> > > > needed these days, you can just enable GRO on both peers...
> > > In general, we don't expect veth to drop packets when it doesn't mount
> > > the xdp program or otherwise, because this is not as expected.
> > Well, did you consider that maybe your expectation is wrong? ;)
> 
> For users who don't know what other conditions are required for the readiness of napi,
> all they can observe is why the packets cannot be sent to the peer veth, which is also
> the problem we encountered in the actual case scenarios.
> 
> 
> > 
> > > > > In this context, if napi is not ready, we convert the xdp frame to a skb,
> > > > > and then use veth_xmit() to deliver it to the peer veth.
> > > > > 
> > > > > Like the following case:
> > > > > Even if veth1's napi cannot be used, the packet redirected from the NIC
> > > > > will be transmitted to veth1 successfully:
> > > > > 
> > > > > NIC   ->   veth0----veth1
> > > > >    |                   |
> > > > > (XDP)             (no XDP)
> > > > > 
> > > > > Signed-off-by: Heng Qi <hengqi@...ux.alibaba.com>
> > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
> > > > > ---
> > > > >    drivers/net/veth.c | 36 +++++++++++++++++++++++++++++++++++-
> > > > >    1 file changed, 35 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > > > > index 466da01..e1f5561 100644
> > > > > --- a/drivers/net/veth.c
> > > > > +++ b/drivers/net/veth.c
> > > > > @@ -469,8 +469,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
> > > > >    	/* The napi pointer is set if NAPI is enabled, which ensures that
> > > > >    	 * xdp_ring is initialized on receive side and the peer device is up.
> > > > >    	 */
> > > > > -	if (!rcu_access_pointer(rq->napi))
> > > > > +	if (!rcu_access_pointer(rq->napi)) {
> > > > > +		for (i = 0; i < n; i++) {
> > > > > +			struct xdp_frame *xdpf = frames[i];
> > > > > +			struct netdev_queue *txq = NULL;
> > > > > +			struct sk_buff *skb;
> > > > > +			int queue_mapping;
> > > > > +			u16 mac_len;
> > > > > +
> > > > > +			skb = xdp_build_skb_from_frame(xdpf, dev);
> > > > > +			if (unlikely(!skb)) {
> > > > > +				ret = nxmit;
> > > > > +				goto out;
> > > > > +			}
> > > > > +
> > > > > +			/* We need to restore ETH header, because it is pulled
> > > > > +			 * in eth_type_trans.
> > > > > +			 */
> > > > > +			mac_len = skb->data - skb_mac_header(skb);
> > > > > +			skb_push(skb, mac_len);
> > > > > +
> > > > > +			nxmit++;
> > > > > +
> > > > > +			queue_mapping = skb_get_queue_mapping(skb);
> > > > > +			txq = netdev_get_tx_queue(dev, netdev_cap_txqueue(dev, queue_mapping));
> > > > > +			__netif_tx_lock(txq, smp_processor_id());
> > > > > +			if (unlikely(veth_xmit(skb, dev) != NETDEV_TX_OK)) {
> > > > > +				__netif_tx_unlock(txq);
> > > > > +				ret = nxmit;
> > > > > +				goto out;
> > > > > +			}
> > > > > +			__netif_tx_unlock(txq);
> > > > Locking and unlocking the txq repeatedly for each packet? Yikes! Did you
> > > > measure the performance overhead of this?
> > > Yes, there are indeed some optimizations that can be done here,
> > > like putting the lock outside the loop.
> > > But in __dev_queue_xmit(), where each packet sent is also protected by a lock.
> > ...which is another reason why this is a bad idea: it's going to perform
> > terribly, which means we'll just end up with users wondering why their
> > XDP performance is terrible and we're going to have to tell them to turn
> > on GRO anyway. So why not do this from the beginning?
> > 
> > If you want to change the default, flipping GRO to be on by default is a
> > better solution IMO. I don't actually recall why we didn't do that when
> > the support was added, but maybe Paolo remembers?

I preferred to avoid changing the default behavior. Additionally, the
veth GRO stage needs some tuning to really be able to  aggregate the
packets (e.g. napi thread or gro_flush_timeout > 0)

> As I said above in the real case, the user's concern is not why the performance
> of xdp becomes bad, but why the data packets are not received.

Well, that arguably tells the end-user there is something wrong in
their setup. On the flip side, having a functionally working setup with
horrible performances would likely lead the users (perhaps not yours,
surely others) in very wrong directions (from "XDP is slow" to "the
problem is in the application")...

> The default number of veth queues is not num_possible_cpus(). When GRO is enabled
> by default, if there is only one veth queue, but multiple CPUs read and write at the
> same time, the efficiency of napi is actually very low due to the existence of
> production locks and races. On the contrary, the default veth_xmit() each cpu has
> its own unique queue, and this way of sending and receiving packets is also efficient.
> 
This patch adds a bit of complexity and it looks completely avoidable
with some configuration - you could enable GRO and set the number of
queues to num_possible_cpus().

I agree with Toke, you should explain the end-users that their
expecations are wrong, and guide them towards a better setup.

Thanks!

Paolo