Message-ID: <20250514093435.GE3339421@horms.kernel.org>
Date: Wed, 14 May 2025 10:34:35 +0100
From: Simon Horman <horms@...nel.org>
To: mhklinux@...look.com
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, andrew+netdev@...n.ch, davem@...emloft.net,
edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
James.Bottomley@...senpartnership.com, martin.petersen@...cle.com,
linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org, linux-scsi@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH net 3/5] hv_netvsc: Preserve contiguous PFN grouping in
the page buffer array
On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelley58@...il.com wrote:
> From: Michael Kelley <mhklinux@...look.com>
>
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet so there is
> no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
>
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs are
> required to describe the location of the rndis header, even though
> they are contiguous in guest physical address (GPA) space. Hyper-V
> requires that two rndis header PFNs be in a single "GPA range" data
> structure, but current netvsc code puts each PFN in its own GPA range,
> which Hyper-V rejects as an error.
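
As a concrete example (the numbers are made up): a 136-byte rndis
header starting at offset 0xfa0 of the page at PFN N occupies bytes
0xfa0-0xfff of PFN N and 0x000-0x027 of PFN N+1. Hyper-V expects a
single GPA range with offset 0xfa0, length 136 and a two-entry PFN
array { N, N + 1 }; the current code instead emits two one-PFN GPA
ranges, which is what Hyper-V rejects.
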
>
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot because the pre-allocated slots don't have
> page crossing issues.
>
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs a
> lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory is
> allocated, making it much more likely that the rndis header will cross
> a page boundary when the vCPU count is 16 or more. The changes in
> commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
>
> Current code in init_page_array() creates a separate page buffer array
> entry for each PFN required to identify the data to be transmitted.
> Contiguous PFNs get separate entries in the page buffer array, and any
> information about contiguity is lost.
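
Continuing the made-up example above: the current code describes the
straddling rndis header as pb[0] = { offset 0xfa0, len 96, pfn N } and
pb[1] = { offset 0, len 40, pfn N + 1 }, so the fact that the two pages
are contiguous never reaches netvsc_build_mpb_array().
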
>
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to construct
> a contiguous range, since it has its own page buffer array.
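
To make the multi-PFN GPA range concrete, here is a rough sketch of the
per-range expansion (the helper name and signature are made up, not the
driver's actual netvsc_build_mpb_array()):

#include <linux/hyperv.h>

/*
 * Illustrative only: expand one contiguous range described by a
 * struct hv_page_buffer into a single GPA range (struct hv_mpb_array)
 * listing every PFN the range touches.  The PFNs are consecutive
 * because the range is contiguous in guest physical address space.
 * The caller must size gpa->pfn_array for
 * HVPFN_UP(range->offset + range->len) entries.
 */
static void range_to_gpa(const struct hv_page_buffer *range,
                         struct hv_mpb_array *gpa)
{
        u32 pfn_count = HVPFN_UP(range->offset + range->len);
        u32 i;

        gpa->offset = range->offset;
        gpa->len = range->len;
        for (i = 0; i < pfn_count; i++)
                gpa->pfn_array[i] = range->pfn + i;
}
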
>
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single() but on larger
> chunks of memory, which should reduce contention on the swiotlb.
>
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKB_FRAGS has the default value of 17.
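
For reference, the 208 bytes works out from sizeof(struct
hv_page_buffer) being 16 bytes: (MAX_PAGE_BUFFER_COUNT -
(MAX_SKB_FRAGS + 2)) * 16 = (32 - 19) * 16 = 208.
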
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <stable@...r.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@...look.com>
> ---
> drivers/net/hyperv/hyperv_net.h | 12 ++++++
> drivers/net/hyperv/netvsc_drv.c | 63 ++++++++-----------------------
> drivers/net/hyperv/rndis_filter.c | 24 +++---------
> 3 files changed, 32 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
> sizeof(struct nvsp_message))
> #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>
> +/* Maximum # of contiguous data ranges that can make up a transmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
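
For reference, with the default MAX_SKB_FRAGS of 17 this evaluates to
19. Since MAX_PAGE_BUFFER_COUNT is 32, the #else branch is only taken
once MAX_SKB_FRAGS reaches 30.
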
> +
> /* Estimated requestor size:
> * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
> */
...
> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
> * 2. skb linear data
> * 3. skb fragment data
> */
> - slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> - offset_in_hvpage(hdr),
> - len,
> - &pb[slots_used]);
>
> + pb[0].offset = offset_in_hvpage(hdr);
> + pb[0].len = len;
> + pb[0].pfn = virt_to_hvpfn(hdr);
> packet->rmsg_size = len;
> - packet->rmsg_pgcnt = slots_used;
> + packet->rmsg_pgcnt = 1;
>
> - slots_used += fill_pg_buf(virt_to_hvpfn(data),
> - offset_in_hvpage(data),
> - skb_headlen(skb),
> - &pb[slots_used]);
> + pb[1].offset = offset_in_hvpage(skb->data);
> + pb[1].len = skb_headlen(skb);
> + pb[1].pfn = virt_to_hvpfn(skb->data);
>
> for (i = 0; i < frags; i++) {
> skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> + struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I've got things right, pb is allocated on the stack in netvsc_xmit()
and has MAX_DATA_RANGES elements. If MAX_SKB_FRAGS is large, so that
MAX_DATA_RANGES has been limited to MAX_PAGE_BUFFER_COUNT, and frags is
large, is it possible to overrun pb here?
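
For example (if I'm reading the Kconfig right), CONFIG_MAX_SKB_FRAGS
can be as large as 45. In that case MAX_DATA_RANGES is capped at
MAX_PAGE_BUFFER_COUNT (32), yet an skb with 45 fragments would make
this loop write as far as pb[46].
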
> + u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> + u32 offset = skb_frag_off(frag);
>
> - slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> - skb_frag_off(frag),
> - skb_frag_size(frag),
> - &pb[slots_used]);
> + cur_pb->offset = offset_in_hvpage(offset);
> + cur_pb->len = skb_frag_size(frag);
> + cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
> }
> - return slots_used;
> + return frags + 2;
> }
>
> static int count_skb_frag_slots(struct sk_buff *skb)
> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
> struct net_device *vf_netdev;
> u32 rndis_msg_size;
> u32 hash;
> - struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> + struct hv_page_buffer pb[MAX_DATA_RANGES];
>
> /* If VF is present and up then redirect packets to it.
> * Skip the VF if it is marked down or has no carrier.
...