Message-ID: <CAAM7YAm13mTaOPqwaq1rEx9Q+OXJjneSwH_cg5QRZTvF-qyCrA@mail.gmail.com>
Date: Fri, 16 Nov 2012 10:36:30 +0800
From: "Yan, Zheng " <yanzheng@...n.com>
To: Eric Dumazet <eric.dumazet@...il.com>
Cc: netdev <netdev@...r.kernel.org>, rick.jones2@...com,
"Yan, Zheng" <zheng.z.yan@...ux.intel.com>
Subject: Re: [RFC] tcp: use order-3 pages in tcp_sendmsg()
On Thu, Nov 15, 2012 at 9:06 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> On Thu, 2012-11-15 at 15:52 +0800, Yan, Zheng wrote:
>> On Mon, Sep 17, 2012 at 3:49 PM, Eric Dumazet <eric.dumazet@...il.com> wrote:
>> > We currently use a per-socket page reserve for tcp_sendmsg() operations.
>> >
>> > It's done to raise the probability of coalescing small write()s into
>> > single segments in the skbs.
>> >
>> > But it wastes a lot of memory for applications handling a lot of mostly
>> > idle sockets, since each socket holds one page in sk->sk_sndmsg_page
>> >
>> > I did a small experiment using order-3 pages and it gave me a 10% performance
>> > boost, because each TSO skb can use only two frags of 32KB instead of 16 frags
>> > of 4KB, so we spend less time in ndo_start_xmit() setting up the tx
>> > descriptors, and less time in the TX completion path unmapping and freeing
>> > the frags.
>> >
>> > We also spend less time in tcp_sendmsg(), because we call page allocator
>> > 8x less often.
>> >
>> > Now back to the per-socket page: what about trying to factorize it?
>> >
>> > Since we can sleep (and/or do a cpu migration) in tcp_sendmsg(), we can't
>> > really use a percpu page reserve as we do in __netdev_alloc_frag().
>> >
>> > We could instead use a per-thread reserve, at the cost of adding a test
>> > in the task exit handler.
>> >
>> > Recap :
>> >
>> > 1) Use a per thread page reserve instead of a per socket one
>> > 2) Use order-3 pages (or order-0 pages if page size is >= 32768)
>> >
>> >
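(For reference, a minimal sketch of the per-thread order-3 reserve described
above; the structure and helper names are illustrative, not the actual patch,
and the per-fragment page refcounting that skbs would need is omitted:)

#include <linux/types.h>
#include <linux/gfp.h>
#include <linux/mm.h>

#define SENDMSG_FRAG_ORDER	3	/* 8 pages = 32KB with 4KB pages */

/* Per-thread fragment cache; in the proposal above this would live in
 * task_struct and be released from the task exit handler. */
struct sendmsg_frag_cache {
	struct page	*page;
	unsigned int	size;		/* bytes in the current page run */
	unsigned int	offset;		/* first free byte */
};

/* Make sure the cache can hold 'needed' more bytes: try an order-3 page
 * first, and fall back to a single order-0 page if that allocation fails. */
static bool sendmsg_frag_refill(struct sendmsg_frag_cache *c,
				unsigned int needed, gfp_t gfp)
{
	if (c->page && c->offset + needed <= c->size)
		return true;

	if (c->page)
		put_page(c->page);

	c->page = alloc_pages(gfp | __GFP_COMP | __GFP_NOWARN,
			      SENDMSG_FRAG_ORDER);
	c->size = PAGE_SIZE << SENDMSG_FRAG_ORDER;
	if (!c->page) {
		c->page = alloc_pages(gfp, 0);
		c->size = PAGE_SIZE;
	}
	c->offset = 0;
	return c->page != NULL;
}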
>>
>> Hi,
>>
>> This commit makes one of our test cases on a Core 2 machine drop in
>> performance by about 60%. The test case runs 2048 instances of the netperf
>> 64KB stream test at the same time. Analysis showed that using order-3 pages
>> causes more LLC misses; most of the new LLC misses happen when the senders
>> copy data into the socket buffers. If we revert to single pages, the sender
>> side triggers only a few LLC misses, and most LLC misses happen on the
>> receiver side. This means most pages allocated by the senders are cache hot.
>> But with order-3 pages, 2048 * 32KB = 64MB, which is much larger than the
>> LLC size. Should we be worried about this regression, or is our test case
>> too impractical?
>
> Hi Yan
>
> You forgot to give some basic information with this mail, like the
> hardware configuration, NIC driver, ...
>
> Increasing performance can sometimes change the balance you had on a
> prior workload.
>
> The number of in-flight bytes does not depend on the order of the pages,
> but on the sizes of the TCP buffers (receiver, sender).
>
> TCP Small Queues was an attempt to reduce the number of in-flight bytes.
> You should try to change either the SO_SNDBUF or SO_RCVBUF settings (instead
> of letting the system autotune them) if you really need 2048 concurrent
> flows.
>
> Otherwise, each flow can consume up to 6 MB of memory, so obviously your
> CPU caches won't hold 2048 * 6MB of memory...
>
> If the sender is faster (because of this commit) but the receiver is slow
> to drain the receive queues, then you can have a situation where the memory
> consumed on the receiver is higher and the receiver might actually be
> slower.
>
I'm sorry, I forgot to mention that the test ran on the loopback device. It's
one test case in our kernel performance test project. This test case is very
sensitive to memory allocation and scheduler behavior changes.
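
For reference, pinning the socket buffer sizes explicitly, as suggested above
(instead of relying on autotuning), would look roughly like the sketch below;
the helper name and the 64KB values are just illustrative examples:

#include <stdio.h>
#include <sys/socket.h>

/* Cap the per-socket send/receive buffers so that 2048 concurrent flows
 * do not inflate the aggregate working set. Call before the data transfer
 * starts; the sizes are arbitrary examples, not recommendations. */
static int pin_socket_buffers(int fd, int sndbuf, int rcvbuf)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0 ||
	    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0) {
		perror("setsockopt");
		return -1;
	}
	return 0;
}

e.g. pin_socket_buffers(fd, 64 * 1024, 64 * 1024) on each benchmark socket.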
Regards
Yan, Zheng