Message-ID: <1457706771.2663.37.camel@edumazet-ThinkPad-T530>
Date:	Fri, 11 Mar 2016 06:32:51 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	Saeed Mahameed <saeedm@...lanox.com>
Cc:	"David S. Miller" <davem@...emloft.net>, netdev@...r.kernel.org,
	Or Gerlitz <ogerlitz@...lanox.com>,
	Eran Ben Elisha <eranbe@...lanox.com>,
	Tal Alon <talal@...lanox.com>,
	Tariq Toukan <tariqt@...lanox.com>,
	Jesper Dangaard Brouer <brouer@...hat.com>
Subject: Re: [PATCH net-next 08/13] net/mlx5e: Add fragmented memory support
 for RX multi packet WQE

On Fri, 2016-03-11 at 15:39 +0200, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt@...lanox.com>
> 
> If the allocation of a linear (physically contiguous) MPWQE fails,
> we allocate a fragmented MPWQE.
> 
> This is implemented via the device's UMR (User Memory Registration),
> which allows registering multiple memory fragments into ConnectX
> hardware as a contiguous buffer.
> UMR registration is an asynchronous operation and is done via
> ICO SQs.
> 
...

> +static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> +				    struct mlx5e_mpw_info *wi,
> +				    int i)
> +{
> +	struct page *page;
> +
> +	page = alloc_page(GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
> +	if (!page)
> +		return -ENOMEM;
> +
> +	wi->umr.dma_info[i].page = page;
> +	wi->umr.dma_info[i].addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
> +						PCI_DMA_FROMDEVICE);
> +	if (dma_mapping_error(rq->pdev, wi->umr.dma_info[i].addr)) {
> +		put_page(page);
> +		return -ENOMEM;
> +	}
> +	wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
> +
> +	return 0;
> +}
> +
> +static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> +					   struct mlx5e_rx_wqe *wqe,
> +					   u16 ix)
> +{
> +	struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> +	int mtt_sz = mlx5e_get_wqe_mtt_sz();
> +	u32 dma_offset = rq->ix * MLX5_CHANNEL_MAX_NUM_PAGES * PAGE_SIZE +
> +		ix * rq->wqe_sz;
> +	int i;
> +
> +	wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> +				   MLX5_MPWRQ_WQE_NUM_PAGES,
> +				   GFP_ATOMIC | __GFP_COMP | __GFP_COLD);
> +	if (!wi->umr.dma_info)
> +		goto err_out;
> +
> +	 /* To avoid copying garbage after the mtt array, we allocate
> +	  * a little more.
> +	  */
> +	wi->umr.mtt = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> +			  GFP_ATOMIC | __GFP_COMP | __GFP_COLD);

__GFP_COLD right before a memset(0) (kzalloc) makes little sense.

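As a sketch of what I mean (keeping the other flags exactly as in the
patch, untested):

	/* kzalloc() zeroes the buffer and therefore touches every byte
	 * right away, so hinting for a cache-cold page buys nothing.
	 */
	wi->umr.mtt = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
			      GFP_ATOMIC | __GFP_COMP);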

> +	if (!wi->umr.mtt)
> +		goto err_free_umr;
> +
> +	wi->umr.mtt = PTR_ALIGN(wi->umr.mtt, MLX5_UMR_ALIGN);
> +	wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> +				      PCI_DMA_TODEVICE);
> +	if (dma_mapping_error(rq->pdev, wi->umr.mtt_addr))
> +		goto err_free_mtt;
> +
...

>  
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +static void mlx5e_add_skb_frag(struct sk_buff *skb, int len, struct page *page,
> +			       int page_offset)
> +{
> +	int f = skb_shinfo(skb)->nr_frags++;
> +	skb_frag_t *fr = &skb_shinfo(skb)->frags[f];
> +
> +	skb->len += len;
> +	skb->data_len += len;
> +	get_page(page);
> +	skb_frag_set_page(skb, f, page);
> +	skb_frag_size_set(fr, len);
> +	fr->page_offset = page_offset;
> +	skb->truesize  = SKB_TRUESIZE(skb->len);
> +}

Really I am speechless.

It is hard to believe how much effort some driver authors spend trying
to fool the linux stack and risk OOMing a host under stress.

SKB_TRUESIZE() is absolutely not something a driver is allowed to use.

Here you want instead:

skb->truesize += PAGE_SIZE;

This assumes you allocate and use an order-0 page per fragment. The fact
that you receive, say, a 100-byte datagram is irrelevant to truesize.

truesize is the real memory usage of one skb, not the minimal size of an
optimally allocated skb for a given payload.
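
To make it concrete, here is a sketch of the helper with that accounting
(still assuming one order-0 page backs each fragment, not tested):

	static void mlx5e_add_skb_frag(struct sk_buff *skb, int len,
				       struct page *page, int page_offset)
	{
		int f = skb_shinfo(skb)->nr_frags++;
		skb_frag_t *fr = &skb_shinfo(skb)->frags[f];

		skb->len += len;
		skb->data_len += len;
		get_page(page);
		skb_frag_set_page(skb, f, page);
		skb_frag_size_set(fr, len);
		fr->page_offset = page_offset;
		/* Charge the whole page we now hold a reference on, not
		 * the payload length: truesize tracks real memory usage.
		 */
		skb->truesize += PAGE_SIZE;
	}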


Better RX speed should not come at the risk of system stability.

Now if for some reason you need to increase the maximum TCP receive
window (RWIN), that would be a TCP stack change, not some obscure lie in
a driver trying to be faster than competitors.

Thanks.

