Message-ID: <20250514093435.GE3339421@horms.kernel.org>
Date: Wed, 14 May 2025 10:34:35 +0100
From: Simon Horman <horms@...nel.org>
To: mhklinux@...look.com
Cc: kys@...rosoft.com, haiyangz@...rosoft.com, wei.liu@...nel.org,
decui@...rosoft.com, andrew+netdev@...n.ch, davem@...emloft.net,
edumazet@...gle.com, kuba@...nel.org, pabeni@...hat.com,
James.Bottomley@...senpartnership.com, martin.petersen@...cle.com,
linux-hyperv@...r.kernel.org, linux-kernel@...r.kernel.org,
netdev@...r.kernel.org, linux-scsi@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH net 3/5] hv_netvsc: Preserve contiguous PFN grouping in
the page buffer array
On Mon, May 12, 2025 at 05:06:02PM -0700, mhkelley58@...il.com wrote:
> From: Michael Kelley <mhklinux@...look.com>
>
> Starting with commit dca5161f9bd0 ("hv_netvsc: Check status in
> SEND_RNDIS_PKT completion message") in the 6.3 kernel, the Linux
> driver for Hyper-V synthetic networking (netvsc) occasionally reports
> "nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
> that Hyper-V has rejected a network packet transmit request from the
> guest, and the outgoing network packet is dropped. Higher level
> network protocols presumably recover and resend the packet so there is
> no functional error, but performance is slightly impacted. Commit
> dca5161f9bd0 is not the cause of the error -- it only added reporting
> of an error that was already happening without any notice. The error
> has presumably been present since the netvsc driver was originally
> introduced into Linux.
>
> The root cause of the problem is that the netvsc driver in Linux may
> send an incorrectly formatted VMBus message to Hyper-V when
> transmitting the network packet. The incorrect formatting occurs when
> the rndis header of the VMBus message crosses a page boundary due to
> how the Linux skb head memory is aligned. In such a case, two PFNs are
> required to describe the location of the rndis header, even though
> they are contiguous in guest physical address (GPA) space. Hyper-V
> requires that two rndis header PFNs be in a single "GPA range" data
> structure, but current netvsc code puts each PFN in its own GPA range,
> which Hyper-V rejects as an error.
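
As a concrete example (the numbers are made up): a 136-byte rndis
header starting at offset 0xfa0 of the page at PFN N occupies bytes
0xfa0-0xfff of PFN N and 0x000-0x027 of PFN N+1. Hyper-V expects a
single GPA range with offset 0xfa0, length 136 and a two-entry PFN
array { N, N + 1 }; the current code instead emits two one-PFN GPA
ranges, which is what Hyper-V rejects.
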
>
> The incorrect formatting occurs only for larger packets that netvsc
> must transmit via a VMBus "GPA Direct" message. There's no problem
> when netvsc transmits a smaller packet by copying it into a pre-
> allocated send buffer slot because the pre-allocated slots don't have
> page crossing issues.
>
> After commit 14ad6ed30a10 ("net: allow small head cache usage with
> large MAX_SKB_FRAGS values") in the 6.14-rc4 kernel, the error occurs
> much more frequently in VMs with 16 or more vCPUs. It may occur every
> few seconds, or even more frequently, in an ssh session that outputs a
> lot of text. Commit 14ad6ed30a10 subtly changes how skb head memory is
> allocated, making it much more likely that the rndis header will cross
> a page boundary when the vCPU count is 16 or more. The changes in
> commit 14ad6ed30a10 are perfectly valid -- they just had the side
> effect of making the netvsc bug more prominent.
>
> Current code in init_page_array() creates a separate page buffer array
> entry for each PFN required to identify the data to be transmitted.
> Contiguous PFNs get separate entries in the page buffer array, and any
> information about contiguity is lost.
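
Continuing the made-up example above: the current code describes the
straddling rndis header as pb[0] = { offset 0xfa0, len 96, pfn N } and
pb[1] = { offset 0, len 40, pfn N + 1 }, so the fact that the two pages
are contiguous never reaches netvsc_build_mpb_array().
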
>
> Fix the core issue by having init_page_array() construct the page
> buffer array to represent contiguous ranges rather than individual
> pages. When these ranges are subsequently passed to
> netvsc_build_mpb_array(), it can build GPA ranges that contain
> multiple PFNs, as required to avoid the error "nvsp_rndis_pkt_complete
> error status: 2". If instead the network packet is sent by copying
> into a pre-allocated send buffer slot, the copy proceeds using the
> contiguous ranges rather than individual pages, but the result of the
> copying is the same. Also fix rndis_filter_send_request() to construct
> a contiguous range, since it has its own page buffer array.
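
To make the multi-PFN GPA range concrete, here is a rough sketch of the
per-range expansion (the helper name and signature are made up, not the
driver's actual netvsc_build_mpb_array()):

#include <linux/hyperv.h>

/*
 * Illustrative only: expand one contiguous range described by a
 * struct hv_page_buffer into a single GPA range (struct hv_mpb_array)
 * listing every PFN the range touches.  The PFNs are consecutive
 * because the range is contiguous in guest physical address space.
 * The caller must size gpa->pfn_array for
 * HVPFN_UP(range->offset + range->len) entries.
 */
static void range_to_gpa(const struct hv_page_buffer *range,
                         struct hv_mpb_array *gpa)
{
        u32 pfn_count = HVPFN_UP(range->offset + range->len);
        u32 i;

        gpa->offset = range->offset;
        gpa->len = range->len;
        for (i = 0; i < pfn_count; i++)
                gpa->pfn_array[i] = range->pfn + i;
}
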
>
> This change has a side benefit in CoCo VMs in that netvsc_dma_map()
> calls dma_map_single() on each contiguous range instead of on each
> page. This results in fewer calls to dma_map_single() but on larger
> chunks of memory, which should reduce contention on the swiotlb.
>
> Since the page buffer array now contains one entry for each contiguous
> range instead of for each individual page, the number of entries in
> the array can be reduced, saving 208 bytes of stack space in
> netvsc_xmit() when MAX_SKB_FRAGS has the default value of 17.
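
For reference, the 208 bytes works out from sizeof(struct
hv_page_buffer) being 16 bytes: (MAX_PAGE_BUFFER_COUNT -
(MAX_SKB_FRAGS + 2)) * 16 = (32 - 19) * 16 = 208.
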
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=217503
>
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217503
> Cc: <stable@...r.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@...look.com>
> ---
> drivers/net/hyperv/hyperv_net.h | 12 ++++++
> drivers/net/hyperv/netvsc_drv.c | 63 ++++++++-----------------------
> drivers/net/hyperv/rndis_filter.c | 24 +++---------
> 3 files changed, 32 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
> index 70f7cb383228..76725f25abd5 100644
> --- a/drivers/net/hyperv/hyperv_net.h
> +++ b/drivers/net/hyperv/hyperv_net.h
> @@ -893,6 +893,18 @@ struct nvsp_message {
> sizeof(struct nvsp_message))
> #define NETVSC_MIN_IN_MSG_SIZE sizeof(struct vmpacket_descriptor)
>
> +/* Maximum # of contiguous data ranges that can make up a transmitted packet.
> + * Typically it's the max SKB fragments plus 2 for the rndis packet and the
> + * linear portion of the SKB. But if MAX_SKB_FRAGS is large, the value may
> + * need to be limited to MAX_PAGE_BUFFER_COUNT, which is the max # of entries
> + * in a GPA direct packet sent to netvsp over VMBus.
> + */
> +#if MAX_SKB_FRAGS + 2 < MAX_PAGE_BUFFER_COUNT
> +#define MAX_DATA_RANGES (MAX_SKB_FRAGS + 2)
> +#else
> +#define MAX_DATA_RANGES MAX_PAGE_BUFFER_COUNT
> +#endif
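
For reference, with the default MAX_SKB_FRAGS of 17 this evaluates to
19. Since MAX_PAGE_BUFFER_COUNT is 32, the #else branch is only taken
once MAX_SKB_FRAGS reaches 30.
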
> +
> /* Estimated requestor size:
> * out_ring_size/min_out_msg_size + in_ring_size/min_in_msg_size
> */
...
> @@ -371,28 +338,28 @@ static u32 init_page_array(void *hdr, u32 len, struct sk_buff *skb,
> * 2. skb linear data
> * 3. skb fragment data
> */
> - slots_used += fill_pg_buf(virt_to_hvpfn(hdr),
> - offset_in_hvpage(hdr),
> - len,
> - &pb[slots_used]);
>
> + pb[0].offset = offset_in_hvpage(hdr);
> + pb[0].len = len;
> + pb[0].pfn = virt_to_hvpfn(hdr);
> packet->rmsg_size = len;
> - packet->rmsg_pgcnt = slots_used;
> + packet->rmsg_pgcnt = 1;
>
> - slots_used += fill_pg_buf(virt_to_hvpfn(data),
> - offset_in_hvpage(data),
> - skb_headlen(skb),
> - &pb[slots_used]);
> + pb[1].offset = offset_in_hvpage(skb->data);
> + pb[1].len = skb_headlen(skb);
> + pb[1].pfn = virt_to_hvpfn(skb->data);
>
> for (i = 0; i < frags; i++) {
> skb_frag_t *frag = skb_shinfo(skb)->frags + i;
> + struct hv_page_buffer *cur_pb = &pb[i + 2];

Hi Michael,

If I've got things right, pb is allocated on the stack in netvsc_xmit()
and has MAX_DATA_RANGES elements. If MAX_SKB_FRAGS is large, so that
MAX_DATA_RANGES has been limited to MAX_PAGE_BUFFER_COUNT, and frags is
large, is it possible to overrun pb here?
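
For example (if I'm reading the Kconfig right), CONFIG_MAX_SKB_FRAGS
can be as large as 45. In that case MAX_DATA_RANGES is capped at
MAX_PAGE_BUFFER_COUNT (32), yet an skb with 45 fragments would make
this loop write as far as pb[46].
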
> + u64 pfn = page_to_hvpfn(skb_frag_page(frag));
> + u32 offset = skb_frag_off(frag);
>
> - slots_used += fill_pg_buf(page_to_hvpfn(skb_frag_page(frag)),
> - skb_frag_off(frag),
> - skb_frag_size(frag),
> - &pb[slots_used]);
> + cur_pb->offset = offset_in_hvpage(offset);
> + cur_pb->len = skb_frag_size(frag);
> + cur_pb->pfn = pfn + (offset >> HV_HYP_PAGE_SHIFT);
> }
> - return slots_used;
> + return frags + 2;
> }
>
> static int count_skb_frag_slots(struct sk_buff *skb)
> @@ -483,7 +450,7 @@ static int netvsc_xmit(struct sk_buff *skb, struct net_device *net, bool xdp_tx)
> struct net_device *vf_netdev;
> u32 rndis_msg_size;
> u32 hash;
> - struct hv_page_buffer pb[MAX_PAGE_BUFFER_COUNT];
> + struct hv_page_buffer pb[MAX_DATA_RANGES];
>
> /* If VF is present and up then redirect packets to it.
> * Skip the VF if it is marked down or has no carrier.
...