linux-kernel - Re: [PATCH v2] xen-netfront: Fix Rx stall during network stress and OOM

Message-ID: <38ccfaea-0a65-a6f3-c19a-e6f9c0d4ef76@oracle.com>
Date:   Sun, 29 Jan 2017 18:09:21 -0500
From:   Boris Ostrovsky <boris.ostrovsky@...cle.com>
To:     Vineeth Remanan Pillai <vineethp@...zon.com>,
        David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, Wei Liu <wei.liu2@...rix.com>,
        Paul Durrant <paul.durrant@...rix.com>,
        xen-devel <xen-devel@...ts.xen.org>
Subject: Re: [PATCH v2] xen-netfront: Fix Rx stall during network stress and
 OOM



On 01/19/2017 11:35 AM, Vineeth Remanan Pillai wrote:
> From: Vineeth Remanan Pillai <vineethp@...zon.com>
>
> During an OOM scenario, request slots could not be created as skb
> allocation fails. So the netback cannot pass in packets and netfront
> wrongly assumes that there is no more work to be done and it disables
> polling. This causes Rx to stall.
>
> The issue is with the retry logic which schedules the timer if the
> created slots are less than NET_RX_SLOTS_MIN. The count of new request
> slots to be pushed is calculated as the difference between the new
> req_prod and rsp_cons, which can exceed the actual number of slots if
> there are unconsumed responses.
>
> The fix is to calculate the count of newly created slots as the
> difference between new req_prod and old req_prod.
>
> Signed-off-by: Vineeth Remanan Pillai <vineethp@...zon.com>
> Reviewed-by: Juergen Gross <jgross@...e.com>
> ---
> Changes in v2:
>     - Removed the old implementation of enabling polling on
>       skb allocation error.
>     - Corrected the refill timer logic to schedule when newly
>       created slots since last push is less than NET_RX_SLOTS_MIN.
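
To make the two checks in the description concrete, here is a minimal userspace sketch of the index arithmetic. It is not the driver code: the helper names retry_old() and retry_new() are hypothetical, shared_req_prod merely stands in for queue->rx.sring->req_prod, and the value used for NET_RX_SLOTS_MIN is an assumption for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the driver defines its own constant. */
#define NET_RX_SLOTS_MIN 19u

/* Old check: distance from the response consumer to the request producer.
 * This also counts slots whose responses have not been consumed yet. */
static bool retry_old(uint32_t req_prod, uint32_t rsp_cons)
{
    /* Unsigned subtraction stays correct when the indices wrap. */
    return req_prod - rsp_cons < NET_RX_SLOTS_MIN;
}

/* Patched check: slots created since the last push to the shared ring
 * (shared_req_prod stands in for queue->rx.sring->req_prod). */
static bool retry_new(uint32_t req_prod, uint32_t shared_req_prod)
{
    return req_prod - shared_req_prod < NET_RX_SLOTS_MIN;
}
```

Take rsp_cons = 100 and a shared req_prod of 130, with every allocation failing so that the new req_prod is still 130. The old check computes 30, which is not below the minimum, so no retry is scheduled even though no new slot was created. The patched check computes 0 and does schedule the retry.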


There are a couple of problems with this patch.
1. The 'if' clause now evaluates to true on pretty much every call to 
xennet_alloc_rx_buffers().
2. It tickles a latent bug during resume where the timer triggers before 
we re-connect. The trouble is that we now try to dereference 
queue->rx.sring which is NULL since we disconnect in netfront_resume(). 
(Curiously, I only observe it with 32-bit guests)
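
One way to picture the first problem is the rough userspace model below. It rests on assumptions that are not stated above: each call refills only a few slots, slots that were not pushed carry over to the next call, and NET_RX_SLOTS_MIN is 19. The function count_deferrals() and its parameters are hypothetical, and the model ignores the ring-size limit.

```c
#include <stdint.h>

/* Assumed value for illustration; the driver defines its own constant. */
#define NET_RX_SLOTS_MIN 19u

/* Count how many of `calls` refill attempts take the timer path under the
 * patched condition, when each attempt creates `per_call` new slots. */
static unsigned count_deferrals(unsigned calls, uint32_t per_call)
{
    uint32_t pushed = 0;    /* models sring->req_prod (last pushed index) */
    uint32_t req_prod = 0;  /* models the private producer index */
    unsigned deferred = 0;

    for (unsigned i = 0; i < calls; i++) {
        req_prod += per_call;                 /* slots created by this call */
        if (req_prod - pushed < NET_RX_SLOTS_MIN) {
            deferred++;                       /* timer scheduled, nothing pushed */
            continue;
        }
        pushed = req_prod;                    /* enough new slots: push them */
    }
    return deferred;
}
```

Under these assumptions, refilling 4 slots per call defers 8 of 10 calls, while refilling 19 per call defers none.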

I'll send a patch later that will delete the timer, since that looks 
like a bug to me in any case. But the first problem seems to be more 
serious than the problem that this patch addresses.

-boris

>
>  drivers/net/xen-netfront.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 40f26b6..2c7c29f 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -321,7 +321,7 @@ static void xennet_alloc_rx_buffers(struct netfront_queue *queue)
>      queue->rx.req_prod_pvt = req_prod;
>
>      /* Not enough requests? Try again later. */
> -    if (req_prod - queue->rx.rsp_cons < NET_RX_SLOTS_MIN) {
> +    if (req_prod - queue->rx.sring->req_prod < NET_RX_SLOTS_MIN) {
>          mod_timer(&queue->rx_refill_timer, jiffies + (HZ/10));
>          return;
>      }