Message-ID: <20130315200907.GA24041@order.stressinduktion.org>
Date: Fri, 15 Mar 2013 21:09:07 +0100
From: Hannes Frederic Sowa <hannes@...essinduktion.org>
To: Ben Hutchings <bhutchings@...arflare.com>
Cc: Jesper Dangaard Brouer <jbrouer@...hat.com>,
Eric Dumazet <eric.dumazet@...il.com>, netdev@...r.kernel.org,
yoshfuji@...ux-ipv6.org
Subject: Re: RFC crap-patch [PATCH] net: Per CPU separate frag mem accounting
On Thu, Mar 14, 2013 at 11:39:44PM +0000, Ben Hutchings wrote:
> On Fri, 2013-03-15 at 00:12 +0100, Hannes Frederic Sowa wrote:
> > On Thu, Mar 14, 2013 at 08:59:03PM +0000, Ben Hutchings wrote:
> > > On Thu, 2013-03-14 at 09:59 +0100, Jesper Dangaard Brouer wrote:
> > > > On Thu, 2013-03-14 at 08:25 +0100, Jesper Dangaard Brouer wrote:
> > > > > This is NOT the patch I just mentioned in the other thread, of removing
> > > > > the LRU list. This patch does real per cpu mem acct, and LRU per CPU.
> > > > >
> > > > > I get really good performance numbers with this patch, but I still think
> > > > > this might not be the correct solution.
> > > >
> > > > The reason is that this depends on the fragments entering the same HW
> > > > queue; some NICs might not put the first fragment (which has the full
> > > > header tuples) and the remaining fragments on the same queue. In that
> > > > case this patch will lose its performance gain.
> > > [...]
> > >
> > > The Microsoft RSS spec only includes port numbers in the flow hash for
> > > TCP, presumably because TCP avoids IP fragmentation whereas datagram
> > > protocols cannot. Some Linux drivers allow UDP ports to be included in
> > > the flow hash but I don't think this is the default for any of them.
> > >
> > > In Solarflare hardware the IPv4 MF bit inhibits layer 4 flow steering,
> > > so all fragments will be unsteered. I don't know whether everyone else
> > > got that right though. :-)
> >
> > Shouldn't they be steered by the IPv4 2-tuple then (if ipv4 hashing is enabled
> > on the card)?
>
> IP fragments should get a flow hash based on the 2-tuple, yes.
Thanks for clearing this up!
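To illustrate that fallback, here is a toy sketch (invented names and a
stand-in mixing function, not the actual kernel flow dissector or any NIC's
implementation) of a flow hash that drops from the port 4-tuple down to the
address 2-tuple whenever the MF bit or a nonzero fragment offset is present:

```c
#include <stdint.h>

#define IP_MF     0x2000  /* more-fragments flag in the frag_off field */
#define IP_OFFSET 0x1fff  /* fragment offset mask */

/* Toy mixing function standing in for something like jhash_3words(). */
static uint32_t mix3(uint32_t a, uint32_t b, uint32_t c)
{
	a ^= b * 0x9e3779b9u;
	a ^= c * 0x85ebca6bu;
	a ^= a >> 16;
	return a * 0xc2b2ae35u;
}

static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport,
			  uint16_t frag_off)
{
	/* Fragmented packet: L4 ports are absent (or unreliable in the
	 * first fragment, since later fragments can't match it), so hash
	 * only the address 2-tuple. */
	if (frag_off & (IP_MF | IP_OFFSET))
		return mix3(saddr, daddr, 0);

	/* Unfragmented: the full 4-tuple can go into the hash. */
	return mix3(saddr, daddr, ((uint32_t)sport << 16) | dport);
}
```

With this, every fragment of a datagram, including the first one (which has
MF set and still carries the ports), gets the same hash and is steered to
the same queue.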
Hm, if we separate the fragmentation caches per cpu, perhaps it would make
sense to recalculate the rxhash as soon as we know we have processed the
first fragment with the more-fragments flag set, and reroute it to another
cpu once (much like RPS). It would burn caches, but the following packets
would already arrive at the correct cpu. This would perhaps be beneficial
if (as I think Jesper said) a common scenario is packets split into at
least 3 fragments. I don't think there would be latency problems either,
because we cannot deliver the first fragment up the stack anyway (given no
packet reordering). So we would no longer have cross-cpu fragment lookups,
but I don't know if the overhead is worth it.
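A minimal sketch of that one-time reroute (all names here are hypothetical,
and the toy hash stands in for the real flow hash; the actual RPS code paths
are of course more involved): recompute the 2-tuple hash for a first
fragment and forward it to the CPU that 2-tuple steering will pick for the
remaining fragments:

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 8  /* assumed CPU count for the sketch */

/* Toy 2-tuple hash standing in for the real flow hash. */
static uint32_t two_tuple_hash(uint32_t saddr, uint32_t daddr)
{
	uint32_t h = saddr ^ (daddr * 0x9e3779b9u);
	h ^= h >> 16;
	return h * 0x85ebca6bu;
}

/* CPU that 2-tuple-based steering sends all fragments of this flow to. */
static unsigned int frag_target_cpu(uint32_t saddr, uint32_t daddr)
{
	return two_tuple_hash(saddr, daddr) % SKETCH_NR_CPUS;
}

/* Called when a first fragment (MF set, offset 0) landed on cur_cpu
 * because the NIC hashed its 4-tuple: return the cpu it should be
 * forwarded to, so it joins the later fragments in the same per-CPU
 * fragment cache. */
static unsigned int reroute_first_frag(unsigned int cur_cpu,
				       uint32_t saddr, uint32_t daddr)
{
	unsigned int target = frag_target_cpu(saddr, daddr);

	return target == cur_cpu ? cur_cpu : target;
}
```

The reroute happens only once per datagram, for the first fragment; every
later fragment already arrives on the target cpu via the NIC's 2-tuple hash.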
This could be done conditionally, via a blacklist where we check whether
the NIC generates broken udp/fragment checksums. The in-kernel flow
dissector already handles this case correctly; we would "just" have to
verify that the network cards do, too. Heh, that's something where the
kernel could tune itself and deactivate cross-fragment cache lookups as
soon as it knows that a given interface handles this case correctly. :)
But this all seems very complex just for handling fragments. :/