netdev - Re: [RFC] udp: some improvements on RX path.

Date:   Mon, 05 Dec 2016 06:28:53 -0800
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Paolo Abeni <pabeni@...hat.com>
Cc:     netdev <netdev@...r.kernel.org>
Subject: Re: [RFC] udp: some improvements on RX path.

On Mon, 2016-12-05 at 14:22 +0100, Paolo Abeni wrote:
> Hi Eric,
> 
> On Sun, 2016-12-04 at 18:43 -0800, Eric Dumazet wrote:
> > We currently access 3 cache lines from an skb in receive queue while
> > holding receive queue lock :
> > 
> > First cache line (contains ->next / prev pointers )
> > 2nd cache line (skb->peeked)
> > 3rd cache line (skb->truesize)
> > 
> > I believe we could get rid of skb->peeked completely.
> > 
> > I will cook a patch, but basically the idea is that the last owner of a
> > skb (right before skb->users becomes 0) can have the 'ownership' and
> > thus increase stats.
> 
> Agreed.
> 
> > The 3rd cache line miss is easily avoided by the following patch.
> 
> I ran some performance tests on top of your patch "net: reorganize
> struct sock for better data locality", and I see an additional ~7%
> improvement on top of that, in the udp flood scenario. 
> 
> In my tests, the topmost perf offenders for the u/s process are now:
> 
>    9.98%  udp_sink  [kernel.kallsyms]  [k] udp_rmem_release
>    8.76%  udp_sink  [kernel.kallsyms]  [k] inet_recvmsg
>    6.71%  udp_sink  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>    5.40%  udp_sink  [kernel.kallsyms]  [k] __skb_try_recv_datagram
>    5.19%  udp_sink  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
> 
> udp_rmem_release() spends most of its time doing:
> 
>     atomic_sub(size, &sk->sk_rmem_alloc);
> 
> I see a cacheline miss while accessing sk_rmem_alloc; most probably due
> to sk_rmem_alloc being updated by the writer outside the rx queue lock.
> 
> Moving such update inside the lock should remove this cache miss but will
> increase the pressure on the rx lock. What do you think?

I want to accumulate the sk_rmem_alloc deficit in another variable in
the same cache line as the secondary list, updated from process
context at udp recvmsg() time.

And only transfer the accumulated deficit in one go when the second
queue is emptied, or if the accumulated deficit is > (rcvbuf/2).

If done right, we should interfere between the softirq flooder(s) and
the process thread only once per batch.
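
A minimal user-space sketch of the batching idea described above, not
kernel code. The struct, its field names and both helpers are
hypothetical stand-ins: fwd_deficit plays the role of the reader-private
variable, and rmem_alloc is a C11 atomic standing in for sk_rmem_alloc.
The flushes counter exists only to show how often the shared counter is
touched.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the socket fields discussed above. */
struct sock_sketch {
    atomic_int rmem_alloc;  /* shared with the producers (softirq side) */
    int rcvbuf;             /* receive buffer limit */
    int fwd_deficit;        /* reader-private, assumed to sit next to the second queue */
    int flushes;            /* number of shared-counter updates, illustration only */
};

/* Transfer the whole accumulated deficit to the shared counter in one go. */
static void deficit_flush(struct sock_sketch *sk)
{
    if (sk->fwd_deficit == 0)
        return;
    atomic_fetch_sub(&sk->rmem_alloc, sk->fwd_deficit);
    sk->fwd_deficit = 0;
    sk->flushes++;
}

/* Called by the reader after it has consumed a packet of `size` bytes.
 * `queue_empty` says whether the reader's queue is now empty. */
static void rmem_release_batched(struct sock_sketch *sk, int size, int queue_empty)
{
    assert(size >= 0);
    sk->fwd_deficit += size;
    /* Touch the contended cache line only when the queue drained or the
     * deficit grew past half of the receive buffer. */
    if (queue_empty || sk->fwd_deficit > sk->rcvbuf / 2)
        deficit_flush(sk);
}
```

With rcvbuf = 1000, two 300-byte releases result in a single atomic
update (the second one crosses the rcvbuf/2 threshold), instead of one
update per packet.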

> 
> inet_recvmsg() is there because with "net: reorganize struct sock for
> better data locality" we get a cache miss while accessing skc_rxhash in
> sock_rps_record_flow(); touching sk_drops is dirtying that cacheline -
> sorry for not noticing this before. Do you have CONFIG_RPS disabled ?
> 
> > But I also want to work on the idea I gave a few days back, having a
> > separate queue and use splice to transfer the 'softirq queue' into
> > a calm queue in a different cache line.
> > 
> > I expect a 50 % performance increase under load, maybe 1.5 Mpps.
> 
> It should work nicely under contention, but won't that increase the
> overhead for the uncontended/single flow scenario? The user space
> reader needs to acquire 2 locks when splicing the 'softirq queue'. On my
> system ksoftirqd and the u/s process work at similar speeds, so splicing
> will happen quite often. 

Well, the splice would happen only if you have more than one message in
the softirq queue. So no real overhead for the uncontended flow scenario.
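
One way the two-queue receive path could look, as a single-threaded
user-space sketch. All names here are hypothetical, and the real locking
is replaced by a counter (input_lock_taken) that only records how often
the shared input lock would have been acquired. The assumption is that
the reader first drains its private queue, and splices only when more
than one message was pending on the input side.

```c
#include <assert.h>
#include <stddef.h>

struct msg { struct msg *next; int id; };

struct queue { struct msg *head, *tail; int len; };

struct two_queue_sock {
    struct queue input;    /* filled by producers, would be lock protected */
    struct queue reader;   /* private to the reading process */
    int input_lock_taken;  /* counts input lock acquisitions, illustration only */
};

static void enqueue(struct queue *q, struct msg *m)
{
    m->next = NULL;
    if (q->tail)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    q->len++;
}

static struct msg *dequeue(struct queue *q)
{
    struct msg *m = q->head;

    if (!m)
        return NULL;
    q->head = m->next;
    if (!q->head)
        q->tail = NULL;
    q->len--;
    m->next = NULL;
    return m;
}

/* Move every message from src to an empty dst in O(1). */
static void splice_all(struct queue *src, struct queue *dst)
{
    assert(dst->len == 0);
    *dst = *src;
    src->head = src->tail = NULL;
    src->len = 0;
}

static struct msg *recv_one(struct two_queue_sock *sk)
{
    struct msg *m = dequeue(&sk->reader);

    if (m)
        return m;                 /* fast path: shared lock not touched */

    sk->input_lock_taken++;       /* lock(input) would go here */
    m = dequeue(&sk->input);
    if (sk->input.len > 0)        /* more than one message was pending */
        splice_all(&sk->input, &sk->reader);
    /* unlock(input) would go here */
    return m;
}
```

With four messages queued, the first recv_one() call takes the input
lock once and moves the remaining three to the private queue; the next
three calls never touch the shared queue. With a single queued message
no splice happens, so the cost matches a plain locked dequeue.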


This reminds me of the busylock I added in __dev_xmit_skb(), which
basically is acquired only when we detect a possible contention on qdisc
lock.

Thanks.