Message-ID: <26468.1538645863@warthog.procyon.org.uk>
Date:   Thu, 04 Oct 2018 10:37:43 +0100
From:   David Howells <dhowells@...hat.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     dhowells@...hat.com, netdev@...r.kernel.org,
        linux-afs@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net] udp: Allow kernel service to avoid udp socket rx queue

Paolo Abeni <pabeni@...hat.com> wrote:

> > There's a problem somewhere in skb_recv_udp(): it doesn't decrease the
> > sk_rmem_alloc counter when a packet is extracted from the receive queue by
> > a kernel service.
> 
> If this is the case, it's really bad and needs an explicit fix. However,
> it looks like sk_rmem_alloc is reclaimed by skb_recv_udp(), as it ends
> up calling udp_rmem_release() on successful dequeue.

It certainly *looks* like it should do that, but nonetheless, the tracepoint I
put in shows it going up and up.  I can try putting a tracepoint next to the
subtraction and see what that shows.
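To be concrete, I mean something quick like a trace_printk() next to the
subtraction in udp_rmem_release() in net/ipv4/udp.c (a sketch only; the
surrounding code is elided and the variable names are from memory):

```c
/* Sketch: instrument the rmem subtraction in udp_rmem_release()
 * (net/ipv4/udp.c).  Elided context marked with comments; names
 * may not match the current tree exactly.
 */
static void udp_rmem_release(struct sock *sk, int size, int partial)
{
	/* ... existing forward-alloc bookkeeping ... */

	atomic_sub(size, &sk->sk_rmem_alloc);
	trace_printk("udp rmem release: sk=%p size=%d rmem now=%d\n",
		     sk, size, atomic_read(&sk->sk_rmem_alloc));

	/* ... existing sk_forward_alloc adjustment ... */
}
```

That would show directly whether the subtraction is being reached for the
kernel-service dequeues.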

> > Further, there doesn't seem any point in having the socket buffer being
> > added to the UDP socket's rx queue since the rxrpc's data_ready handler
> > takes it straight back out again (more or less, there seem to be occasional
> > hold ups there).
> 
> I would really try to avoid adding another indirect call in the
> data path unless strictly needed (to avoid more RETPOLINE overhead for
> all other use cases). If skipping the enqueuing altogether makes sense
> (I guess so, mostly for performance reasons), I *think* you can use
> the already existing encap_rcv hook, initializing it to the rxrpc input
> function and updating that function to pull the UDP header and, if
> needed, initialize the pktinfo. Please see e.g. the l2tp usage.

I looked at that, but it seems that a global conditional is required to enable
it - presumably for performance reasons.  It looks like I would need to:

 (1) Allocate a new UDP_ENCAP_* flag.

 (2) Replicate at least some of the work that gets done between the check in
     udp_queue_rcv_skb() and the call of __udp_enqueue_schedule_skb(), such as
     running the packet filter.

     I'm not sure whether I need to call things like ipv4_pktinfo_prepare(),
     sock_rps_save_rxhash(), sk_mark_napi_id() or sk_incoming_cpu_update() -
     are they necessary for the UDP socket?
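For what it's worth, the encap_rcv wiring would presumably look roughly like
the l2tp setup (see l2tp_core.c) - a sketch, with UDP_ENCAP_RXRPC and
rxrpc_input_packet() being hypothetical names for the new encap type and the
rxrpc entry point:

```c
/* Sketch of encap_rcv wiring, modelled on l2tp's use of
 * setup_udp_tunnel_sock().  UDP_ENCAP_RXRPC and rxrpc_input_packet()
 * are hypothetical and would need to be added.
 */
#include <net/udp_tunnel.h>

static int rxrpc_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
{
	/* Pull the UDP header and hand the packet straight to rxrpc
	 * without queuing it on the UDP socket.  Returning 0 tells
	 * the UDP layer the packet has been consumed.
	 */
	__skb_pull(skb, sizeof(struct udphdr));
	rxrpc_input_packet(sk, skb);	/* hypothetical entry point */
	return 0;
}

static void rxrpc_setup_encap(struct socket *sock)
{
	struct udp_tunnel_sock_cfg cfg = {
		.encap_type	= UDP_ENCAP_RXRPC,	/* hypothetical */
		.encap_rcv	= rxrpc_udp_encap_recv,
	};

	setup_udp_tunnel_sock(sock_net(sock->sk), sock, &cfg);
}
```

But that still leaves the question above about which of the per-packet
bookkeeping calls the hook would need to replicate.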

> > Putting in some tracepoints show a significant delay occurring between packets
> > coming in and thence being delivered to rxrpc:
> > 
> > <idle>-0  [001] ..s2    67.631844: net_rtl8169_napi_rx: enp3s0 skb=07db0a32
> > ...
> > <idle>-0  [001] ..s4    68.292778: rxrpc_rx_packet: d5ce8d37:bdb93c60:00000002:09c7 00000006 00000000 02 20 ACK 660967981 skb=07db0a32
> > 
> > The "660967981" is the time difference in nanoseconds between the sk_buff
> > timestamp and the current time.  It seems to match the time elapsed between
> > the two trace lines reasonably well.  I've seen anything up to about 4s.
> 
> Can you please provide more data?

I can give you a whole trace if you like.

> specifically can you please add:
> * a perf probe in rxrpc_data_ready(), just after skb_recv_udp(),
> reporting sk->sk_rmem_alloc and skb->truesize
> * a perf probe in __udp_enqueue_schedule_skb(), just before the 'if
> (rmem > sk->sk_rcvbuf)' test, again reporting sk->sk_rmem_alloc,
> skb->truesize and sk->sk_rcvbuf
> And then provide the perf record -g -e ... / perf script output?

Can't this be done by putting tracepoints there instead?  I don't know how to
do the perf stuff.  What can perf get that can't be obtained with a
tracepoint?
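E.g. for the second probe point, a trace_printk() at the spot you name would
give me the same three values via /sys/kernel/debug/tracing/trace (sketch of
the relevant fragment of __udp_enqueue_schedule_skb() in net/ipv4/udp.c):

```c
/* Sketch: trace_printk() instead of a perf probe, just before the
 * rcvbuf check in __udp_enqueue_schedule_skb().  Surrounding code
 * elided; local variable names may differ in the current tree.
 */
	rmem = atomic_read(&sk->sk_rmem_alloc);
	trace_printk("udp enqueue: sk=%p rmem=%d truesize=%u rcvbuf=%d\n",
		     sk, rmem, skb->truesize, sk->sk_rcvbuf);
	if (rmem > sk->sk_rcvbuf)
		goto drop;
```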

Thanks,
David
