lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <65845456cad2a_50168294ba@willemb.c.googlers.com.notmuch>
Date: Thu, 21 Dec 2023 10:05:58 -0500
From: Willem de Bruijn <willemdebruijn.kernel@...il.com>
To: Heng Qi <hengqi@...ux.alibaba.com>, 
 Willem de Bruijn <willemdebruijn.kernel@...il.com>, 
 netdev@...r.kernel.org, 
 virtualization@...ts.linux-foundation.org
Cc: Jason Wang <jasowang@...hat.com>, 
 "Michael S. Tsirkin" <mst@...hat.com>, 
 "David S. Miller" <davem@...emloft.net>, 
 Eric Dumazet <edumazet@...gle.com>, 
 Jakub Kicinski <kuba@...nel.org>, 
 Paolo Abeni <pabeni@...hat.com>, 
 Alexei Starovoitov <ast@...nel.org>, 
 Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
Subject: Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

Heng Qi wrote:
> 
> 
> 在 2023/12/20 下午10:45, Willem de Bruijn 写道:
> > Heng Qi wrote:
> >> virtio-net has two ways to switch napi_tx: one is through the
> >> module parameter, and the other is through coalescing parameter
> >> settings (provided that the nic status is down).
> >>
> >> Sometimes we face performance regression caused by napi_tx,
> >> then we need to switch napi_tx when debugging. However, the
> >> existing methods are a bit troublesome, such as needing to
> >> reload the driver or turn off the network card. So try to make
> >> this update.
> >>
> >> Signed-off-by: Heng Qi <hengqi@...ux.alibaba.com>
> >> Reviewed-by: Xuan Zhuo <xuanzhuo@...ux.alibaba.com>
> > The commit does not explain why it is safe to do so.
> 
> virtnet_napi_tx_disable ensures that already scheduled tx napi ends and 
> no new tx napi will be scheduled.
> 
> Afterwards, if the __netif_tx_lock_bh lock is held, the stack cannot 
> send the packet.
> 
> Then we can safely toggle the weight to indicate where to clear the buffers.
> 
> >
> > The tx-napi weights are not really weights: it is a boolean whether
> > napi is used for transmit cleaning, or whether packets are cleaned
> > in ndo_start_xmit.
> 
> Right.
> 
> >
> > There certainly are some subtle issues with regard to pausing/waking
> > queues when switching between modes.
> 
> What are "subtle issues" and if there are any, we find them.

A single runtime test is not sufficient to exercise all edge cases.

Please don't leave it to reviewers to establish the correctness of a
patch.

The napi_tx and non-napi code paths differ in how they handle at least
the following structures:

1. skb: non-napi orphans these in ndo_start_xmit. Without napi this is
needed as delay until the next ndo_start_xmit and thus completion is
unbounded.

When switching to napi mode, orphaned skbs may now be cleaned by the
napi handler. This is indeed safe.

When switching from napi to non-napi, the unbound latency resurfaces.
It is a small edge case, and I think a potentially acceptable risk, if
the user of this knob is aware of the risk.

2. virtqueue callback ("interrupt" masking). The non-napi path enables
the interrupt (disables the mask) when available descriptors falls
beneath a low watermark, and reenables when it recovers above a high
watermark. Napi disables when napi is scheduled, and reenables on
napi complete.

3. dev_queue->state (QUEUE_STATE_DRV_XOFF). if the ring falls below
a low watermark, the driver stops the stack for queuing more packets.
In napi mode, it schedules napi to clean packets. See the calls to
netif_xmit_stopped, netif_stop_subqueue, netif_start_subqueue and
netif_tx_wake_queue.

Some if this can be assumed safe by looking at existing analogous
code, such as the queue stop/start in virtnet_tx_resize.

But that all virtqueue callback and dev_queue->state transitions are
correct when switching between modes at runtime is not trivial to
establish, deserves some thought and explanation in the commit
message.

> So far my test results show it's working fine.
> 
> >
> > Calling napi_enable/napi_disable without bringing down the device is
> > allowed. The actually napi.weight field is only updated when neither
> > napi nor ndo_start_xmit is running.
> 
> YES.
> 
> > So I don't see an immediate issue.
> 
> Switching napi_tx requires reloading the driver or downing the nic, 
> which is troublesome.
> I think it would be better if we could find a better way.
> 
> Thanks!
> 
> >
> >
> >> ---
> >>   drivers/net/virtio_net.c | 81 ++++++++++++++++++----------------------
> >>   1 file changed, 37 insertions(+), 44 deletions(-)
> >>
> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> index 10614e9f7cad..12f8e1f9971c 100644
> >> --- a/drivers/net/virtio_net.c
> >> +++ b/drivers/net/virtio_net.c
> >> @@ -3559,16 +3559,37 @@ static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)
> >>   	return 0;
> >>   }
> >>   
> >> -static int virtnet_should_update_vq_weight(int dev_flags, int weight,
> >> -					   int vq_weight, bool *should_update)
> >> +static void virtnet_switch_napi_tx(struct virtnet_info *vi, u32 qstart,
> >> +				   u32 qend, u32 tx_frames)
> >>   {
> >> -	if (weight ^ vq_weight) {
> >> -		if (dev_flags & IFF_UP)
> >> -			return -EBUSY;
> >> -		*should_update = true;
> >> -	}
> >> +	struct net_device *dev = vi->dev;
> >> +	int new_weight, cur_weight;
> >> +	struct netdev_queue *txq;
> >> +	struct send_queue *sq;
> >>   
> >> -	return 0;
> >> +	new_weight = tx_frames ? NAPI_POLL_WEIGHT : 0;
> >> +	for (; qstart < qend; qstart++) {
> >> +		sq = &vi->sq[qstart];
> >> +		cur_weight = sq->napi.weight;
> >> +		if (!(new_weight ^ cur_weight))
> >> +			continue;
> >> +
> >> +		if (!(dev->flags & IFF_UP)) {
> >> +			sq->napi.weight = new_weight;
> >> +			continue;
> >> +		}
> >> +
> >> +		if (cur_weight)
> >> +			virtnet_napi_tx_disable(&sq->napi);
> >> +
> >> +		txq = netdev_get_tx_queue(dev, qstart);
> >> +		__netif_tx_lock_bh(txq);
> >> +		sq->napi.weight = new_weight;
> >> +		__netif_tx_unlock_bh(txq);
> >> +
> >> +		if (!cur_weight)
> >> +			virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
> >> +	}
> >>   }
> >>   
> >>   static int virtnet_set_coalesce(struct net_device *dev,
> >> @@ -3577,25 +3598,11 @@ static int virtnet_set_coalesce(struct net_device *dev,
> >>   				struct netlink_ext_ack *extack)
> >>   {
> >>   	struct virtnet_info *vi = netdev_priv(dev);
> >> -	int ret, queue_number, napi_weight;
> >> -	bool update_napi = false;
> >> -
> >> -	/* Can't change NAPI weight if the link is up */
> >> -	napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
> >> -	for (queue_number = 0; queue_number < vi->max_queue_pairs; queue_number++) {
> >> -		ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
> >> -						      vi->sq[queue_number].napi.weight,
> >> -						      &update_napi);
> >> -		if (ret)
> >> -			return ret;
> >> -
> >> -		if (update_napi) {
> >> -			/* All queues that belong to [queue_number, vi->max_queue_pairs] will be
> >> -			 * updated for the sake of simplicity, which might not be necessary
> >> -			 */
> >> -			break;
> >> -		}
> >> -	}
> >> +	int ret;
> >> +
> >> +	/* Param tx_frames can be used to switch napi_tx */
> >> +	virtnet_switch_napi_tx(vi, 0, vi->max_queue_pairs,
> >> +			       ec->tx_max_coalesced_frames);
> >>   
> >>   	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_NOTF_COAL))
> >>   		ret = virtnet_send_notf_coal_cmds(vi, ec);
> >> @@ -3605,11 +3612,6 @@ static int virtnet_set_coalesce(struct net_device *dev,
> >>   	if (ret)
> >>   		return ret;
> >>   
> >> -	if (update_napi) {
> >> -		for (; queue_number < vi->max_queue_pairs; queue_number++)
> >> -			vi->sq[queue_number].napi.weight = napi_weight;
> >> -	}
> >> -
> >>   	return ret;
> >>   }
> >>   
> >> @@ -3641,19 +3643,13 @@ static int virtnet_set_per_queue_coalesce(struct net_device *dev,
> >>   					  struct ethtool_coalesce *ec)
> >>   {
> >>   	struct virtnet_info *vi = netdev_priv(dev);
> >> -	int ret, napi_weight;
> >> -	bool update_napi = false;
> >> +	int ret;
> >>   
> >>   	if (queue >= vi->max_queue_pairs)
> >>   		return -EINVAL;
> >>   
> >> -	/* Can't change NAPI weight if the link is up */
> >> -	napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
> >> -	ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
> >> -					      vi->sq[queue].napi.weight,
> >> -					      &update_napi);
> >> -	if (ret)
> >> -		return ret;
> >> +	/* Param tx_frames can be used to switch napi_tx */
> >> +	virtnet_switch_napi_tx(vi, queue, queue, ec->tx_max_coalesced_frames);
> >>   
> >>   	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> >>   		ret = virtnet_send_notf_coal_vq_cmds(vi, ec, queue);
> >> @@ -3663,9 +3659,6 @@ static int virtnet_set_per_queue_coalesce(struct net_device *dev,
> >>   	if (ret)
> >>   		return ret;
> >>   
> >> -	if (update_napi)
> >> -		vi->sq[queue].napi.weight = napi_weight;
> >> -
> >>   	return 0;
> >>   }
> >>   
> >> -- 
> >> 2.19.1.6.gb485710b
> >>
> 



Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ