Date:	Thu, 15 Nov 2012 05:06:01 -0800
From:	Eric Dumazet <eric.dumazet@...il.com>
To:	"Yan, Zheng" <yanzheng@...n.com>
Cc:	netdev <netdev@...r.kernel.org>
Subject: Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()

On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> > We currently use a per-socket page reserve for tcp_sendmsg() operations.
> >
> > It's done to raise the probability of coalescing small write()s into
> > single segments in the skbs.
> >
> > But it wastes a lot of memory for applications handling many mostly
> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
> >
> > I did a small experiment with order-3 pages and it gave me a 10%
> > performance boost, because each TSO skb needs only two 32KB frags
> > instead of sixteen 4KB ones, so we spend less time in ndo_start_xmit()
> > setting up the tx descriptors and in the TX completion path unmapping
> > and freeing the frags.
> >
> > We also spend less time in tcp_sendmsg(), because we call the page
> > allocator 8x less often.
> >
> > Now back to the per-socket page: what about trying to factor it out?
> >
> > Since we can sleep (and/or migrate to another cpu) in tcp_sendmsg(), we
> > can't really use a per-cpu page reserve as we do in __netdev_alloc_frag().
> >
> > We could instead use a per-thread reserve, at the cost of adding a test
> > in the task exit handler.
> >
> > Recap:
> >
> > 1) Use a per-thread page reserve instead of a per-socket one
> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
> >
> >
> 
> Hi,
> 
> This commit makes one of our test cases on a Core 2 machine drop in
> performance by about 60%. The test case runs 2048 instances of the
> netperf 64KB stream test at the same time. Analysis showed that using
> order-3 pages causes more LLC misses; most of the new LLC misses happen
> when the senders copy data into the socket buffers. If we revert to
> using a single page, the sender side triggers only a few LLC misses and
> most LLC misses happen on the receiver side, which means most pages
> allocated by the senders are cache hot. But with order-3 pages,
> 2048 * 32KB = 64MB, which is much larger than the LLC size. Should we
> worry about this regression, or is our test case too impractical?

Hi Yan

You forgot to include some basic information in this mail, like the
hardware configuration, NIC driver, ...

Increasing performance can sometimes change the balance you had on a
prior workload.

The number of in-flight bytes does not depend on the order of the pages,
but on the sizes of the TCP buffers (receiver and sender).

TCP Small Queues was an attempt to reduce the number of in-flight bytes.
If you really need 2048 concurrent flows, you should try changing the
SO_SNDBUF and/or SO_RCVBUF settings (instead of letting the system
autotune them), for example:

Otherwise, each flow can consume up to 6 MB of memory, so obviously your
cpu caches won't hold 2048 * 6 MB = 12 GB of memory...

If the sender is faster (because of this commit) but the receiver is slow
to drain the receive queues, you can end up in a situation where the
memory consumed on the receiver is higher and the receiver might actually
be slower.
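
For reference, the refill scheme under discussion looks roughly like the
sketch below (kernel-style C; the struct and helper names are
illustrative, not the actual patch):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    #define FRAG_ORDER 3    /* 8 pages = 32KB per chunk */

    struct frag_reserve {
            struct page  *page;
            unsigned int offset;    /* next free byte in the chunk */
            unsigned int size;      /* total bytes in the chunk */
    };

    static bool frag_refill(struct frag_reserve *fr, gfp_t gfp)
    {
            if (fr->page) {
                    if (fr->offset < fr->size)
                            return true;    /* room left in current chunk */
                    put_page(fr->page);     /* chunk exhausted, drop our ref */
            }

            /* One order-3 allocation covers what would otherwise take
             * eight order-0 ones, and a 64KB TSO skb then needs two
             * 32KB frags instead of sixteen 4KB ones. */
            fr->page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN |
                                   __GFP_NORETRY, FRAG_ORDER);
            if (fr->page) {
                    fr->size = PAGE_SIZE << FRAG_ORDER;
            } else {
                    /* Fall back to a single page under memory pressure. */
                    fr->page = alloc_pages(gfp, 0);
                    if (!fr->page)
                            return false;
                    fr->size = PAGE_SIZE;
            }
            fr->offset = 0;
            return true;
    }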

Thanks


