netdev - Re: [PATCH net] net: Fix packet reordering caused by GRO and listified RX cooperation

Message-ID: <da13831f11d0141728a96954685fdf40@dlink.ru>
Date:   Sat, 18 Jan 2020 13:05:19 +0300
From:   Alexander Lobakin <alobakin@...nk.ru>
To:     Saeed Mahameed <saeedm@...lanox.com>
Cc:     ecree@...arflare.com, Maxim Mikityanskiy <maximmi@...lanox.com>,
        Jiri Pirko <jiri@...lanox.com>, edumazet@...gle.com,
        netdev@...r.kernel.org, davem@...emloft.net,
        Tariq Toukan <tariqt@...lanox.com>
Subject: Re: [PATCH net] net: Fix packet reordering caused by GRO and
 listified RX cooperation

Hi Saeed,

Saeed Mahameed wrote 18.01.2020 01:47:
> On Fri, 2020-01-17 at 15:09 +0000, Maxim Mikityanskiy wrote:
>> Commit 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
>> napi_gro_receive()") introduces batching of GRO_NORMAL packets in
>> napi_skb_finish. However, dev_gro_receive, that is called just before
>> napi_skb_finish, can also pass skbs to the networking stack: e.g., when
>> the GRO session is flushed, napi_gro_complete is called, which passes pp
>> directly to netif_receive_skb_internal, skipping napi->rx_list. It means
>> that the packet stored in pp will be handled by the stack earlier than
>> the packets that arrived before, but are still waiting in napi->rx_list.
>> It leads to TCP reorderings that can be observed in the TCPOFOQueue
>> counter in netstat.
>> 
>> This commit fixes the reordering issue by making napi_gro_complete also
>> use napi->rx_list, so that all packets going through GRO will keep their
>> order.
>> 
>> Fixes: 6570bc79c0df ("net: core: use listified Rx for GRO_NORMAL in
>> napi_gro_receive()")
>> Signed-off-by: Maxim Mikityanskiy <maximmi@...lanox.com>
>> Cc: Alexander Lobakin <alobakin@...nk.ru>
>> Cc: Edward Cree <ecree@...arflare.com>
>> ---
>> Alexander and Edward, please verify the correctness of this patch. If
>> it's necessary to pass that SKB to the networking stack right away, I
>> can change this patch to flush napi->rx_list by calling gro_normal_list
>> first, instead of putting the SKB in the list.
>> 
> 
> Actually, this will break performance for traffic that needs to skip
> GRO, and we will lose bulking, so don't do it :)
> 
> But your point is valid when napi_gro_complete() is called outside of
> napi_gro_receive() path.
> 
> see below..
> 
>>  net/core/dev.c | 55 +++++++++++++++++++++++++-----------------------
>> --
>>  1 file changed, 28 insertions(+), 27 deletions(-)
>> 
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 0ad39c87b7fd..db7a105bbc77 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5491,9 +5491,29 @@ static void flush_all_backlogs(void)
>>  	put_online_cpus();
>>  }
>> 
>> +/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
>> +static void gro_normal_list(struct napi_struct *napi)
>> +{
>> +	if (!napi->rx_count)
>> +		return;
>> +	netif_receive_skb_list_internal(&napi->rx_list);
>> +	INIT_LIST_HEAD(&napi->rx_list);
>> +	napi->rx_count = 0;
>> +}
>> +
>> +/* Queue one GRO_NORMAL SKB up for list processing. If batch size
>> exceeded,
>> + * pass the whole batch up to the stack.
>> + */
>> +static void gro_normal_one(struct napi_struct *napi, struct sk_buff
>> *skb)
>> +{
>> +	list_add_tail(&skb->list, &napi->rx_list);
>> +	if (++napi->rx_count >= gro_normal_batch)
>> +		gro_normal_list(napi);
>> +}
>> +
>>  INDIRECT_CALLABLE_DECLARE(int inet_gro_complete(struct sk_buff *,
>> int));
>>  INDIRECT_CALLABLE_DECLARE(int ipv6_gro_complete(struct sk_buff *,
>> int));
>> -static int napi_gro_complete(struct sk_buff *skb)
>> +static int napi_gro_complete(struct napi_struct *napi, struct
>> sk_buff *skb)
>>  {
>>  	struct packet_offload *ptype;
>>  	__be16 type = skb->protocol;
>> @@ -5526,7 +5546,8 @@ static int napi_gro_complete(struct sk_buff
>> *skb)
>>  	}
>> 
>>  out:
>> -	return netif_receive_skb_internal(skb);
>> +	gro_normal_one(napi, skb);
>> +	return NET_RX_SUCCESS;
>>  }
>> 
> 
> The patch looks fine when napi_gro_complete() is called from
> napi_gro_receive() path.
> 
> But napi_gro_complete() is also used by napi_gro_flush() which is
> called in other contexts, which might break, if they really meant to
> flush to the stack..
> 
> examples:
> 1. Drivers that use napi_gro_flush() which is not "eventually" followed
> by napi_complete_done() might break. That is possibly a bug in those
> drivers, though: drivers must always return with napi_complete_done();

Drivers *should not* call napi_gro_flush() themselves. This has been
discussed here several times, and at the moment Edward and I are
waiting for the iwlwifi driver to use NAPI properly so that we can
unexport this function and make it static.

> 2. the following code in napi_complete_done()
> 
> /* When the NAPI instance uses a timeout and keeps postponing
>  * it, we need to bound somehow the time packets are kept in
>  * the GRO layer
>  */
>   napi_gro_flush(n, !!timeout);
> 
> with the new implementation we won't really flush to the stack unless

Oh, I see this one. This really is an issue. gro_normal_list() is
called earlier than napi_gro_flush() in napi_complete_done(), so with
this patch several skbs might get stuck in napi->rx_list until the next
NAPI session. Thanks for pointing this out, I missed it.
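
To make the stranding concrete, here is a minimal userspace sketch, not
kernel code. Everything prefixed with toy_ is a hypothetical stand-in
that models packets as plain ints; it only assumes the call order
described above (gro_normal_list() first, then napi_gro_flush() feeding
napi->rx_list as in the patch).

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKTS 16

/* Hypothetical stand-in for the parts of struct napi_struct discussed here. */
struct toy_napi {
	int gro_held[MAX_PKTS];	/* packets still held in the GRO layer */
	size_t gro_count;
	int rx_list[MAX_PKTS];	/* models napi->rx_list */
	size_t rx_count;	/* models napi->rx_count */
	int stack[MAX_PKTS];	/* packets already delivered to the stack */
	size_t stack_count;
};

/* Models gro_normal_list(): pass the whole batch up to the stack. */
static void toy_gro_normal_list(struct toy_napi *n)
{
	for (size_t i = 0; i < n->rx_count; i++) {
		assert(n->stack_count < MAX_PKTS);
		n->stack[n->stack_count++] = n->rx_list[i];
	}
	n->rx_count = 0;
}

/* Models napi_gro_flush() with the patch applied: completed packets are
 * queued on rx_list instead of going straight to the stack.
 */
static void toy_napi_gro_flush(struct toy_napi *n)
{
	for (size_t i = 0; i < n->gro_count; i++) {
		assert(n->rx_count < MAX_PKTS);
		n->rx_list[n->rx_count++] = n->gro_held[i];
	}
	n->gro_count = 0;
}

/* Models the current order in napi_complete_done(): the list is flushed
 * before GRO. Returns how many packets are left waiting on rx_list.
 */
static size_t toy_complete_done_current(struct toy_napi *n)
{
	toy_gro_normal_list(n);
	toy_napi_gro_flush(n);
	return n->rx_count;
}
```

With two packets already on rx_list and one held in GRO, the two batched
packets reach the stack, while the flushed one stays on rx_list after
the call returns.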

> one possible solution: is to call gro_normal_list(napi); inside:
> napi_gro_flush() ?
> 
> another possible solution:
> always make sure to follow napi_gro_flush(); with gro_normal_list(n);
> 
> since i see two places in dev.c where we do:
> 
> gro_normal_list(n);
> if (cond) {
>    napi_gro_flush();
> }
> 
> instead, we can change them to:
> 
> if (cond) {
>    /* flush gro to napi->rx_list, with your implementation  */
>    napi_gro_flush();
> }
> gro_normal_list(n); /* Now flush to the stack */
> 
> And your implementation will be correct for such use cases.

I think this one would be more straightforward and correct, but it
definitely needs testing. Unfortunately, I can only run the tests on
Monday the 20th.

Or we can call gro_normal_list() directly in napi_gro_complete(),
as Maxim proposed as an alternative solution.
I'd like to hear what Edward thinks about it. Either way, this issue
really needs to be handled.
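
For comparison, the two options can be sketched in the same toy
userspace model (again not kernel code: the toy_ names are hypothetical
and packets are plain ints). Option A flushes GRO into rx_list and then
calls the gro_normal_list() stand-in; option B flushes rx_list before
each completed packet is handed directly to the stack. Under these
assumptions neither leaves anything behind on rx_list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PKTS 16

/* Hypothetical stand-in for the parts of struct napi_struct discussed here. */
struct toy_napi {
	int gro_held[MAX_PKTS];	/* packets still held in the GRO layer */
	size_t gro_count;
	int rx_list[MAX_PKTS];	/* models napi->rx_list */
	size_t rx_count;
	int stack[MAX_PKTS];	/* packets already delivered to the stack */
	size_t stack_count;
};

/* Models gro_normal_list(): pass the whole batch up to the stack. */
static void toy_gro_normal_list(struct toy_napi *n)
{
	for (size_t i = 0; i < n->rx_count; i++) {
		assert(n->stack_count < MAX_PKTS);
		n->stack[n->stack_count++] = n->rx_list[i];
	}
	n->rx_count = 0;
}

/* Option A: flush GRO into rx_list first, then flush rx_list to the stack. */
static void toy_flush_then_list(struct toy_napi *n)
{
	for (size_t i = 0; i < n->gro_count; i++) {
		assert(n->rx_count < MAX_PKTS);
		n->rx_list[n->rx_count++] = n->gro_held[i];
	}
	n->gro_count = 0;
	toy_gro_normal_list(n);
}

/* Option B: for every completed packet, flush rx_list first and then
 * deliver the packet directly, bypassing rx_list.
 */
static void toy_list_then_direct(struct toy_napi *n)
{
	for (size_t i = 0; i < n->gro_count; i++) {
		toy_gro_normal_list(n);
		assert(n->stack_count < MAX_PKTS);
		n->stack[n->stack_count++] = n->gro_held[i];
	}
	n->gro_count = 0;
}
```

The model says nothing about performance, which is where the two options
differ: option B gives up batching for the packets it delivers directly.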

>>  static void __napi_gro_flush_chain(struct napi_struct *napi, u32
>> index,
>> @@ -5539,7 +5560,7 @@ static void __napi_gro_flush_chain(struct
>> napi_struct *napi, u32 index,
>>  		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
>>  			return;
>>  		skb_list_del_init(skb);
>> -		napi_gro_complete(skb);
>> +		napi_gro_complete(napi, skb);
>>  		napi->gro_hash[index].count--;
>>  	}
>> 
>> @@ -5641,7 +5662,7 @@ static void gro_pull_from_frag0(struct sk_buff
>> *skb, int grow)
>>  	}
>>  }
>> 
>> -static void gro_flush_oldest(struct list_head *head)
>> +static void gro_flush_oldest(struct napi_struct *napi, struct
>> list_head *head)
>>  {
>>  	struct sk_buff *oldest;
>> 
>> @@ -5657,7 +5678,7 @@ static void gro_flush_oldest(struct list_head
>> *head)
>>  	 * SKB to the chain.
>>  	 */
>>  	skb_list_del_init(oldest);
>> -	napi_gro_complete(oldest);
>> +	napi_gro_complete(napi, oldest);
>>  }
>> 
>>  INDIRECT_CALLABLE_DECLARE(struct sk_buff *inet_gro_receive(struct
>> list_head *,
>> @@ -5733,7 +5754,7 @@ static enum gro_result dev_gro_receive(struct
>> napi_struct *napi, struct sk_buff
>> 
>>  	if (pp) {
>>  		skb_list_del_init(pp);
>> -		napi_gro_complete(pp);
>> +		napi_gro_complete(napi, pp);
>>  		napi->gro_hash[hash].count--;
>>  	}
>> 
>> @@ -5744,7 +5765,7 @@ static enum gro_result dev_gro_receive(struct
>> napi_struct *napi, struct sk_buff
>>  		goto normal;
>> 
>>  	if (unlikely(napi->gro_hash[hash].count >= MAX_GRO_SKBS)) {
>> -		gro_flush_oldest(gro_head);
>> +		gro_flush_oldest(napi, gro_head);
>>  	} else {
>>  		napi->gro_hash[hash].count++;
>>  	}
>> @@ -5802,26 +5823,6 @@ struct packet_offload
>> *gro_find_complete_by_type(__be16 type)
>>  }
>>  EXPORT_SYMBOL(gro_find_complete_by_type);
>> 
>> -/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
>> -static void gro_normal_list(struct napi_struct *napi)
>> -{
>> -	if (!napi->rx_count)
>> -		return;
>> -	netif_receive_skb_list_internal(&napi->rx_list);
>> -	INIT_LIST_HEAD(&napi->rx_list);
>> -	napi->rx_count = 0;
>> -}
>> -
>> -/* Queue one GRO_NORMAL SKB up for list processing. If batch size
>> exceeded,
>> - * pass the whole batch up to the stack.
>> - */
>> -static void gro_normal_one(struct napi_struct *napi, struct sk_buff
>> *skb)
>> -{
>> -	list_add_tail(&skb->list, &napi->rx_list);
>> -	if (++napi->rx_count >= gro_normal_batch)
>> -		gro_normal_list(napi);
>> -}
>> -
>>  static void napi_skb_free_stolen_head(struct sk_buff *skb)
>>  {
>>  	skb_dst_drop(skb);

Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ