Message-ID: <1408681896.8268.7.camel@prashant>
Date: Thu, 21 Aug 2014 21:31:36 -0700
From: Prashant Sreedharan <prashant@...adcom.com>
To: Benjamin Poirier <bpoirier@...e.de>
CC: Michael Chan <mchan@...adcom.com>, <netdev@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 3/3] tg3: Fix tx_pending checks for tg3_tso_bug
On Thu, 2014-08-21 at 14:57 -0700, Benjamin Poirier wrote:
> In tg3_set_ringparam(), the tx_pending test meant to cover the cases where
> tg3_tso_bug() is entered has two problems:
> 1) the check is only done for certain hardware whereas the workaround
> is now used more broadly. IOW, the check may not be performed when it
> is needed.
> 2) the check is too optimistic.
>
> For example, with a 5761 (SHORT_DMA_BUG), tg3_set_ringparam() skips over the
> "tx_pending <= (MAX_SKB_FRAGS * 3)" check because TSO_BUG is false. Even if it
> did do the check, with a full sized skb, frag_cnt_est = 135 but the check is
> for <= MAX_SKB_FRAGS * 3 (= 17 * 3 = 51). So the check is insufficient. This
> leads to the following situation: by setting, ex. tx_pending = 100, there can
> be an skb that triggers tg3_tso_bug() and that is large enough to cause
> tg3_tso_bug() to stop the queue even when it is empty. We then end up with a
> netdev watchdog transmit timeout.
>
> Given that 1) some of the conditions tested for in tg3_tx_frag_set() apply
> regardless of the chipset flags and that 2) it is difficult to estimate ahead
> of time the max possible number of frames that a large skb may be split into
> by gso, we instead take the approach of adjusting dev->gso_max_segs according
> to the requested tx_pending size.
>
> This puts us in the exceptional situation that a single skb that triggers
> tg3_tso_bug() may require the entire tx ring. Usually the tx queue is woken up
> when at least a quarter of it is available (TG3_TX_WAKEUP_THRESH) but that
> would be insufficient now. To avoid useless wakeups, the tx queue wake up
> threshold is made dynamic. Likewise, usually the tx queue is stopped as soon
> as an skb with max frags may overrun it. Since the skbs submitted from
> tg3_tso_bug() use a controlled number of descriptors, the tx queue stop
> threshold may be lowered.
>
> Signed-off-by: Benjamin Poirier <bpoirier@...e.de>
> ---
> Changes v1->v2
> * in tg3_set_ringparam(), reduce gso_max_segs further to budget 3 descriptors
> per gso seg instead of only 1 as in v1
> * in tg3_tso_bug(), check that this estimation (3 desc/seg) holds, otherwise
> linearize some skbs as needed
> * in tg3_start_xmit(), make the queue stop threshold a parameter, for the
> reason explained in the commit description
>
> I was concerned that this last change, because of the extra call in the
> default xmit path, might impact performance, so I ran an rr latency test,
> but I did not measure a significant impact. That test was with the default
> mtu and ring size.
>
> # perf stat -r10 -ad netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr
>
> * without patches
> rr values: 7039.63 6865.03 6939.21 6919.31 6931.88 6932.74 6925.1 6953.33 6868.43 6935.65
> sample size: 10
> mean: 6931.031
> standard deviation: 48.10918
> quantiles: 6865.03 6920.757 6932.31 6938.32 7039.63
> 6930±50
>
> Performance counter stats for 'netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr' (10 runs):
>
> 480643.024723 task-clock # 8.001 CPUs utilized ( +- 0.00% ) [100.00%]
> 855,136 context-switches # 0.002 M/sec ( +- 0.23% ) [100.00%]
> 521 CPU-migrations # 0.000 M/sec ( +- 6.49% ) [100.00%]
> 104 page-faults # 0.000 M/sec ( +- 2.73% )
> 298,416,906,437 cycles # 0.621 GHz ( +- 4.08% ) [15.01%]
> 812,072,320,370 stalled-cycles-frontend # 272.13% frontend cycles idle ( +- 1.89% ) [25.01%]
> 685,633,562,247 stalled-cycles-backend # 229.76% backend cycles idle ( +- 2.50% ) [35.00%]
> 117,665,891,888 instructions # 0.39 insns per cycle
> # 6.90 stalled cycles per insn ( +- 2.22% ) [45.00%]
> 26,158,399,505 branches # 54.424 M/sec ( +- 2.10% ) [50.00%]
> 205,688,614 branch-misses # 0.79% of all branches ( +- 0.78% ) [50.00%]
> 27,882,474,171 L1-dcache-loads # 58.011 M/sec ( +- 1.98% ) [50.00%]
> 369,911,372 L1-dcache-load-misses # 1.33% of all L1-dcache hits ( +- 0.62% ) [50.00%]
> 76,240,847 LLC-loads # 0.159 M/sec ( +- 1.04% ) [40.00%]
> 3,220 LLC-load-misses # 0.00% of all LL-cache hits ( +- 19.49% ) [ 5.00%]
>
> 60.074059340 seconds time elapsed ( +- 0.00% )
>
> * with patches
> rr values: 6732.65 6920.1 6909.46 7032.41 6864.43 6897.6 6815.19 6967.83 6849.23 6929.52
> sample size: 10
> mean: 6891.842
> standard deviation: 82.91901
> quantiles: 6732.65 6853.03 6903.53 6927.165 7032.41
> 6890±80
>
> Performance counter stats for 'netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr' (10 runs):
>
> 480675.949728 task-clock # 8.001 CPUs utilized ( +- 0.01% ) [100.00%]
> 850,461 context-switches # 0.002 M/sec ( +- 0.37% ) [100.00%]
> 564 CPU-migrations # 0.000 M/sec ( +- 5.67% ) [100.00%]
> 417 page-faults # 0.000 M/sec ( +- 76.04% )
> 287,019,442,295 cycles # 0.597 GHz ( +- 7.16% ) [15.01%]
> 828,198,830,689 stalled-cycles-frontend # 288.55% frontend cycles idle ( +- 3.01% ) [25.01%]
> 718,230,307,166 stalled-cycles-backend # 250.24% backend cycles idle ( +- 3.53% ) [35.00%]
> 117,976,598,188 instructions # 0.41 insns per cycle
> # 7.02 stalled cycles per insn ( +- 4.06% ) [45.00%]
> 26,715,853,108 branches # 55.580 M/sec ( +- 3.77% ) [50.00%]
> 198,787,673 branch-misses # 0.74% of all branches ( +- 0.86% ) [50.00%]
> 28,416,922,166 L1-dcache-loads # 59.119 M/sec ( +- 3.54% ) [50.00%]
> 367,613,007 L1-dcache-load-misses # 1.29% of all L1-dcache hits ( +- 0.47% ) [50.00%]
> 75,260,575 LLC-loads # 0.157 M/sec ( +- 2.24% ) [40.00%]
> 5,777 LLC-load-misses # 0.01% of all LL-cache hits ( +- 36.03% ) [ 5.00%]
>
> 60.077898757 seconds time elapsed ( +- 0.01% )
>
> I reproduced this bug using the same approach explained in patch 1.
> The bug reproduces with tx_pending <= 135
>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 67 +++++++++++++++++++++++++++++--------
> drivers/net/ethernet/broadcom/tg3.h | 1 +
> 2 files changed, 54 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 0cecd6d..c29f2e3 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -204,6 +204,10 @@ static inline void _tg3_flag_clear(enum TG3_FLAGS flag, unsigned long *bits)
> /* minimum number of free TX descriptors required to wake up TX process */
> #define TG3_TX_WAKEUP_THRESH(tnapi) max_t(u32, (tnapi)->tx_pending / 4, \
> MAX_SKB_FRAGS + 1)
> +/* estimate a certain number of descriptors per gso segment */
> +#define TG3_TX_DESC_PER_SEG(seg_nb) ((seg_nb) * 3)
> +#define TG3_TX_SEG_PER_DESC(desc_nb) ((desc_nb) / 3)
> +
> #define TG3_TX_BD_DMA_MAX_2K 2048
> #define TG3_TX_BD_DMA_MAX_4K 4096
>
> @@ -6609,10 +6613,10 @@ static void tg3_tx(struct tg3_napi *tnapi)
> smp_mb();
>
> if (unlikely(netif_tx_queue_stopped(txq) &&
> - (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi)))) {
> + (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh))) {
> __netif_tx_lock(txq, smp_processor_id());
> if (netif_tx_queue_stopped(txq) &&
> - (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi)))
> + (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh))
> netif_tx_wake_queue(txq);
> __netif_tx_unlock(txq);
> }
> @@ -7830,6 +7834,8 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
> }
>
> static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
> +static netdev_tx_t __tg3_start_xmit(struct sk_buff *, struct net_device *,
> + u32);
>
> /* Use GSO to workaround all TSO packets that meet HW bug conditions
> * indicated in tg3_tx_frag_set()
> @@ -7838,11 +7844,13 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
> struct netdev_queue *txq, struct sk_buff *skb)
> {
> struct sk_buff *segs, *nskb;
> - u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
> + unsigned int segs_remaining = skb_shinfo(skb)->gso_segs;
> + u32 desc_cnt_est = TG3_TX_DESC_PER_SEG(segs_remaining);
>
> - /* Estimate the number of fragments in the worst case */
> - if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
> + if (unlikely(tg3_tx_avail(tnapi) <= desc_cnt_est)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = desc_cnt_est;
> + BUG_ON(tnapi->wakeup_thresh >= tnapi->tx_pending);
>
> /* netif_tx_stop_queue() must be done before checking
> * checking tx index in tg3_tx_avail() below, because in
> @@ -7850,7 +7858,7 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
> * netif_tx_queue_stopped().
> */
> smp_mb();
> - if (tg3_tx_avail(tnapi) <= frag_cnt_est)
> + if (tg3_tx_avail(tnapi) <= tnapi->wakeup_thresh)
> return NETDEV_TX_BUSY;
>
> netif_tx_wake_queue(txq);
> @@ -7858,14 +7866,33 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
>
> segs = skb_gso_segment(skb, tp->dev->features &
> ~(NETIF_F_TSO | NETIF_F_TSO6));
> - if (IS_ERR(segs) || !segs)
> + if (IS_ERR_OR_NULL(segs))
> goto tg3_tso_bug_end;
>
> do {
> + unsigned int desc_cnt = skb_shinfo(segs)->nr_frags + 1;
> +
> nskb = segs;
> segs = segs->next;
> nskb->next = NULL;
> - tg3_start_xmit(nskb, tp->dev);
> +
> + if (tg3_tx_avail(tnapi) <= segs_remaining - 1 + desc_cnt &&
> + skb_linearize(nskb)) {
> + nskb->next = segs;
> + segs = nskb;
> + do {
> + nskb = segs->next;
> +
> + dev_kfree_skb_any(segs);
> + segs = nskb;
> + } while (segs);
If skb_linearize() fails, we need to increment the tp->tx_dropped count.
> + goto tg3_tso_bug_end;
> + }
> + segs_remaining--;
> + if (segs_remaining)
> + __tg3_start_xmit(nskb, tp->dev, segs_remaining);
To clarify: passing segs_remaining as the stop threshold ensures the queue is
never stopped here, correct?
> + else
> + tg3_start_xmit(nskb, tp->dev);
> } while (segs);
>
> tg3_tso_bug_end:
> @@ -7877,6 +7904,12 @@ tg3_tso_bug_end:
> /* hard_start_xmit for all devices */
> static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> + return __tg3_start_xmit(skb, dev, MAX_SKB_FRAGS + 1);
> +}
> +
> +static netdev_tx_t __tg3_start_xmit(struct sk_buff *skb,
> + struct net_device *dev, u32 stop_thresh)
> +{
> struct tg3 *tp = netdev_priv(dev);
> u32 len, entry, base_flags, mss, vlan = 0;
> u32 budget;
> @@ -7905,12 +7938,17 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> if (unlikely(budget <= (skb_shinfo(skb)->nr_frags + 1))) {
> if (!netif_tx_queue_stopped(txq)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = TG3_TX_WAKEUP_THRESH(tnapi);
>
> /* This is a hard error, log it. */
> netdev_err(dev,
> "BUG! Tx Ring full when queue awake!\n");
> }
> - return NETDEV_TX_BUSY;
> + smp_mb();
> + if (tg3_tx_avail(tnapi) <= tnapi->wakeup_thresh)
> + return NETDEV_TX_BUSY;
> +
> + netif_tx_wake_queue(txq);
> }
>
> entry = tnapi->tx_prod;
> @@ -8087,8 +8125,9 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> tw32_tx_mbox(tnapi->prodmbox, entry);
>
> tnapi->tx_prod = entry;
> - if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
> + if (unlikely(tg3_tx_avail(tnapi) <= stop_thresh)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = TG3_TX_WAKEUP_THRESH(tnapi);
>
> /* netif_tx_stop_queue() must be done before checking
> * checking tx index in tg3_tx_avail() below, because in
> @@ -8096,7 +8135,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> * netif_tx_queue_stopped().
> */
> smp_mb();
> - if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
> + if (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh)
> netif_tx_wake_queue(txq);
> }
>
> @@ -12319,9 +12358,7 @@ static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam *e
> if ((ering->rx_pending > tp->rx_std_ring_mask) ||
> (ering->rx_jumbo_pending > tp->rx_jmb_ring_mask) ||
> (ering->tx_pending > TG3_TX_RING_SIZE - 1) ||
> - (ering->tx_pending <= MAX_SKB_FRAGS + 1) ||
> - (tg3_flag(tp, TSO_BUG) &&
> - (ering->tx_pending <= (MAX_SKB_FRAGS * 3))))
> + (ering->tx_pending <= MAX_SKB_FRAGS + 1))
> return -EINVAL;
>
> if (netif_running(dev)) {
> @@ -12341,6 +12378,7 @@ static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam *e
> if (tg3_flag(tp, JUMBO_RING_ENABLE))
> tp->rx_jumbo_pending = ering->rx_jumbo_pending;
>
> + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(ering->tx_pending - 1);
> for (i = 0; i < tp->irq_max; i++)
> tp->napi[i].tx_pending = ering->tx_pending;
>
> @@ -17817,6 +17855,7 @@ static int tg3_init_one(struct pci_dev *pdev,
> else
> sndmbx += 0xc;
> }
> + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(TG3_DEF_TX_RING_PENDING - 1);
>
> tg3_init_coal(tp);
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
> index 461acca..6a7e13d 100644
> --- a/drivers/net/ethernet/broadcom/tg3.h
> +++ b/drivers/net/ethernet/broadcom/tg3.h
> @@ -3006,6 +3006,7 @@ struct tg3_napi {
> u32 tx_pending;
> u32 last_tx_cons;
> u32 prodmbox;
> + u32 wakeup_thresh;
> struct tg3_tx_buffer_desc *tx_ring;
> struct tg3_tx_ring_info *tx_buffers;
>