Message-ID: <1323438427.20077.57.camel@zakaz.uk.xensource.com>
Date:	Fri, 9 Dec 2011 13:47:07 +0000
From:	Ian Campbell <Ian.Campbell@...rix.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
CC:	David Miller <davem@...emloft.net>,
	Jesse Brandeburg <jesse.brandeburg@...el.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: [PATCH 0/4] skb paged fragment destructors

On Wed, 2011-12-07 at 13:35 +0000, Ian Campbell wrote:
> On Tue, 2011-12-06 at 13:24 +0000, Eric Dumazet wrote:
> > On Tue, 2011-12-06 at 11:57 +0000, Ian Campbell wrote:
> > > On Wed, 2011-11-09 at 15:01 +0000, Ian Campbell wrote:
> > > >       * split linear data allocation and shinfo allocation into two. I
> > > >         suspect this will have its own performance implications? On the
> > > >         positive side, skb_shared_info could come from its own
> > > >         fixed-size pool/cache, which might have some benefits
> > > 
> > > I played with this to see how it would look. Illustrative patch below. 
> > > 
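> > > (The patch itself is snipped in the quoting; in outline the idea is
> > > roughly the sketch below -- not the actual diff, and the cache name
> > > is invented:)
> > > 
> > >         /* dedicated fixed-size cache for skb_shared_info */
> > >         static struct kmem_cache *skb_shinfo_cache;
> > > 
> > >         /* at init time */
> > >         skb_shinfo_cache = kmem_cache_create("skb_shinfo",
> > >                         sizeof(struct skb_shared_info), 0,
> > >                         SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
> > > 
> > >         /* in __alloc_skb(): the head no longer carries the shinfo
> > >          * at its tail; allocate the two separately
> > >          */
> > >         data = kmalloc_node_track_caller(SKB_DATA_ALIGN(size),
> > >                                          gfp_mask, node);
> > >         shinfo = kmem_cache_alloc_node(skb_shinfo_cache,
> > >                                        gfp_mask, node);
> > > 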
> > > I figure that lots of small frames is the interesting workload for a
> > > change such as this, but I don't know if iperf is necessarily the
> > > best benchmark for measuring that.
> > > Before changing things I got:
> > >         iperf -c qarun -m -t 60 -u -b 10000M -l 64
> > >         ------------------------------------------------------------
> > >         Client connecting to qarun, UDP port 5001
> > >         Sending 64 byte datagrams
> > >         UDP buffer size:   224 KByte (default)
> > >         ------------------------------------------------------------
> > >         [  3] local 10.80.225.63 port 45857 connected with 10.80.224.22 port 5001
> > >         [ ID] Interval       Transfer     Bandwidth
> > >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec
> > >         [  3] Sent 13820376 datagrams
> > >         [  3] Server Report:
> > >         [  3]  0.0-60.0 sec    844 MBytes    118 Mbits/sec  0.005 ms    0/13820375 (0%)
> > >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > > whereas with the patch:
> > >         # iperf -c qarun -m -t 60 -u -b 10000M -l 64
> > >         ------------------------------------------------------------
> > >         Client connecting to qarun, UDP port 5001
> > >         Sending 64 byte datagrams
> > >         UDP buffer size:   224 KByte (default)
> > >         ------------------------------------------------------------
> > >         [  3] local 10.80.225.63 port 42504 connected with 10.80.224.22 port 5001
> > >         [ ID] Interval       Transfer     Bandwidth
> > >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec
> > >         [  3] Sent 13645857 datagrams
> > >         [  3] Server Report:
> > >         [  3]  0.0-60.0 sec    833 MBytes    116 Mbits/sec  0.005 ms    0/13645856 (0%)
> > >         [  3]  0.0-60.0 sec  1 datagrams received out-of-order
> > > 
> > > With 1200 byte datagrams I get basically identical throughput.
> > > 
> > > (nb: none of the skb destructor stuff was present in either case)
> > 
> > Sorry, but the real problem is that if the skb producer and consumer
> > are not on the same CPU, each skb will now hit the SLUB slowpath three
> > times instead of two.
> > 
> > Some workloads are: one CPU fully handling IRQs from the device,
> > dispatching skbs to consumers on other CPUs.
> 
> (I.e. today each skb is two allocations: the skb itself from its own
> cache plus the kmalloc'd head with the shinfo at its tail; the split
> would make the shinfo a third.) So something like a multithreaded
> Apache benchmark would be interesting?
> 
> > Plus skb->truesize is wrong after your patch.
> > Not sure if cloning is correct either...
> 
> Me neither; the patch was just one that happened to work well enough to
> get some numbers from. I'll be sure to double-check/correct that stuff
> if I persist with the approach.
> 
> > Anyway, do we _really_ need 16 frags per skb? I don't know...
> 
> MAX_SKB_FRAGS is:
>         /* To allow 64K frame to be packed as single skb without frag_list. Since
>          * GRO uses frags we allocate at least 16 regardless of page size.
>          */
>         #if (65536/PAGE_SIZE + 2) < 16
>         #define MAX_SKB_FRAGS 16UL
>         #else
>         #define MAX_SKB_FRAGS (65536/PAGE_SIZE + 2)
>         #endif
>         
> So I think it actually turns out to be 18 on systems with 4k pages
> (65536/4096 + 2 = 18, which is not < 16, so the #else branch applies).
> If we reduced that to 16 it would save 48 bytes on amd64, which pulls
> the shinfo size down to 440, and then
> 
>         NET_SKB_PAD (64) + 1500 + 14 + 440 = 2018
> 
> which does fit in half a 4096-byte page (2048 bytes).
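> 
> (As a sanity check one could imagine a compile-time assert along these
> lines; hypothetical, not in the tree, and it ignores SKB_DATA_ALIGN
> rounding -- with the 16-frag shinfo above it holds, 2018 <= 2048:)
> 
>         BUILD_BUG_ON(NET_SKB_PAD + 1500 + ETH_HLEN +
>                      sizeof(struct skb_shared_info) > PAGE_SIZE / 2);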
> 
> Using 16 still means that a 64k frame fits exactly in frags on a system
> with 4096-byte pages (16 * 4096 = 65536). I don't know what the extra 2
> was for (it predates git); perhaps just to allow for some slop due to
> misalignment in the first page of data?

Tracked it down in TGLX's historic tree as:
http://git.kernel.org/?p=linux/kernel/git/tglx/history.git;a=commitdiff;h=80223d5186f73bf42a7e260c66c9cb9f7d8ec9cf

Having found the title, I did a bit of searching around for discussion,
but all I could find was other folks who had no idea where the + 2 comes
from either...

http://kerneltrap.org/mailarchive/linux-netdev/2008/4/21/1529914 sort of
wonders if the + 2 might be accounting for a head which crosses a page
boundary (a misaligned linear head can span two pages, which would give
65536/PAGE_SIZE + 2 pages in total for a full 64k frame). I'm not sure
it is sane/useful to account for that in MAX_SKB_FRAGS. If we had a
concept like MAX_SKB_PAGES then a + 2 would perhaps make sense there,
but AFAICT drivers etc. are already accounting for this appropriately by
adding a further + 2 (or sometimes + 1) to MAX_SKB_FRAGS.
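
(For illustration of that driver-side pattern -- the macro name here is
invented, but the shape is what I mean:)

        /* worst-case TX descriptors per skb: one per paged fragment,
         * plus two more for a linear head that may cross a page boundary
         */
        #define TX_DESCS_PER_SKB (MAX_SKB_FRAGS + 2)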

Ian.

> 
> > This gives problems when/if an skb must be linearized and we hit
> > PAGE_ALLOC_COSTLY_ORDER.
> > 
> > Alternatively, we could use order-1 or order-2 pages on x86 to get
> > 8192/16384-byte frags (falling back to order-0 pages in case of
> > allocation failure).
> 
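> (In skeleton form that fallback would be something like this -- a
> sketch, untested:)
> 
>         /* opportunistically try an order-2 page for frag data, falling
>          * back to a single order-0 page; NORETRY/NOWARN keep the
>          * higher-order attempt cheap when memory is fragmented
>          */
>         page = alloc_pages(gfp_mask | __GFP_NORETRY | __GFP_NOWARN, 2);
>         if (!page)
>                 page = alloc_pages(gfp_mask, 0);
> 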
> Or fall back to a separate allocation for the shinfo?
> 
> Ian.