Message-ID: <e7e3e30374b1b95f7f9a945c765ff523dfda3016.camel@redhat.com>
Date: Thu, 04 Oct 2018 11:08:11 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: David Howells <dhowells@...hat.com>, netdev@...r.kernel.org
Cc: linux-afs@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] udp: Allow kernel service to avoid udp socket rx queue
Hi,
On Wed, 2018-10-03 at 19:48 +0100, David Howells wrote:
> There's a problem somewhere in skb_recv_udp() that doesn't decrease the
> sk_rmem_alloc counter when a packet is extracted from the receive queue by
> a kernel service.
If this is the case, it's really bad and needs an explicit fix. However,
it looks like sk_rmem_alloc is reclaimed by skb_recv_udp(), as it ends
up calling udp_rmem_release() on successful dequeue.

udp_rmem_release() can delay sk_rmem_alloc reclaiming if the rx queue
is almost empty, due to commit
6b229cf77d683f634f0edd876c6d1015402303ad. Anyhow, I don't see how that
may affect this scenario: if only 1/4 of the maximum receive buffer
size is used, the next packet should always be able to land in the rx
queue, with the default sk_rcvbuf and every reasonable truesize.
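
For context, the deferral added by that commit is roughly the following
(a simplified paraphrase of udp_rmem_release(), not the exact upstream
code):

/* Simplified paraphrase -- not the exact upstream code.  Partial
 * releases are accumulated in a per-socket deficit and sk_rmem_alloc
 * is only decreased once the deficit grows beyond a fraction of the
 * receive buffer size. */
static void udp_rmem_release(struct sock *sk, int size, int partial)
{
	struct udp_sock *up = udp_sk(sk);

	if (likely(partial)) {
		up->forward_deficit += size;
		size = up->forward_deficit;
		if (size < (sk->sk_rcvbuf >> 2))
			return;		/* defer the reclaim */
	} else {
		size += up->forward_deficit;
	}
	up->forward_deficit = 0;

	/* ... forward allocated memory is released here as well ... */
	atomic_sub(size, &sk->sk_rmem_alloc);
}

So the deferred amount is bounded by sk_rcvbuf/4, which is why the next
packet should still fit in the rx queue.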
> Further, there doesn't seem to be any point in having the socket buffer
> added to the UDP socket's rx queue, since rxrpc's data_ready handler takes
> it straight back out again (more or less; there seem to be occasional
> hold-ups there).
I would really try to avoid adding another indirect call in the
data-path, unless strictly needed (to avoid more RETPOLINE overhead for
all the other use-cases). If skipping the enqueuing altogether makes
sense (I guess so, mostly for performance reasons), I *think* you can
use the already existing encap_rcv hook, initializing it to the rxrpc
input function and updating such function to pull the udp header and
possibly initialize the pktinfo, if needed. Please see e.g. the l2tp
usage.
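
To make this more concrete, here is a rough sketch of what I mean, along
the lines of what l2tp does (completely untested; rxrpc_encap_rcv(),
rxrpc_hook_socket() and UDP_ENCAP_RXRPC are made-up names used only for
illustration):

#include <linux/udp.h>
#include <net/udp.h>
#include <net/udp_tunnel.h>

/* Illustrative sketch only: packets are handed to the service from
 * softirq context via encap_rcv, so they are never charged to
 * sk_rmem_alloc nor enqueued on the UDP socket rx queue. */
static int rxrpc_encap_rcv(struct sock *sk, struct sk_buff *skb)
{
	/* The UDP header is still present here; pull it before handing
	 * the packet to the service's own input path. */
	__skb_pull(skb, sizeof(struct udphdr));

	/* ... feed skb into the rxrpc input machinery ... */
	kfree_skb(skb);		/* placeholder: the real handler owns the skb */

	return 0;		/* 0: consumed, >0: fall back to normal UDP rx */
}

static void rxrpc_hook_socket(struct socket *sock, void *user_data)
{
	struct udp_tunnel_sock_cfg cfg = {
		.sk_user_data	= user_data,
		.encap_type	= UDP_ENCAP_RXRPC,	/* new value to be defined */
		.encap_rcv	= rxrpc_encap_rcv,
	};

	setup_udp_tunnel_sock(sock_net(sock->sk), sock, &cfg);
}

With something like that in place the data_ready handler and the whole
rx queue handling on the UDP socket should not be needed anymore for
this use case.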
> Putting in some tracepoints shows a significant delay occurring between
> packets coming in and thence being delivered to rxrpc:
>
> <idle>-0 [001] ..s2 67.631844: net_rtl8169_napi_rx: enp3s0 skb=07db0a32
> ...
> <idle>-0 [001] ..s4 68.292778: rxrpc_rx_packet: d5ce8d37:bdb93c60:00000002:09c7 00000006 00000000 02 20 ACK 660967981 skb=07db0a32
>
> The "660967981" is the time difference in nanoseconds between the sk_buff
> timestamp and the current time. It seems to match the time elapsed between
> the two trace lines reasonably well. I've seen anything up to about 4s.
Can you please provide more data? Specifically, can you please add:
* a perf probe in rxrpc_data_ready(), just after skb_recv_udp(),
  reporting sk->sk_rmem_alloc and skb->truesize
* a perf probe in __udp_enqueue_schedule_skb(), just before the 'if
  (rmem > sk->sk_rcvbuf)' test, again reporting sk->sk_rmem_alloc,
  skb->truesize and sk->sk_rcvbuf
and then provide the 'perf record -g -e ...' / 'perf script' output?
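
Something like the following should do; the probe definitions are only
an illustration, the exact probe lines and available variable names
depend on your kernel and debuginfo (<line> is a placeholder to be
picked from the 'perf probe -L' output):

# list probe-able lines, pick the one right after skb_recv_udp()
perf probe -L rxrpc_data_ready
# check which variables are visible there, adjust the names below
perf probe -V rxrpc_data_ready:<line>
perf probe -a 'rxrpc_data_ready:<line> rmem=sk->sk_rmem_alloc.counter truesize=skb->truesize'
# function entry is close enough to the sk_rcvbuf test
perf probe -a '__udp_enqueue_schedule_skb rmem=sk->sk_rmem_alloc.counter truesize=skb->truesize rcvbuf=sk->sk_rcvbuf'
perf record -g -e 'probe:*' -a -- sleep 30
perf script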
Thanks,
Paolo