[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cd88ac7e-fe82-fdc0-3410-0decf57d3c43@intel.com>
Date: Fri, 2 Jun 2023 18:15:03 +0200
From: Alexander Lobakin <aleksander.lobakin@...el.com>
To: Alexander Duyck <alexander.duyck@...il.com>
CC: Paul Menzel <pmenzel@...gen.mpg.de>,
Jesper Dangaard Brouer <hawk@...nel.org>,
Larysa Zaremba <larysa.zaremba@...el.com>,
<netdev@...r.kernel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>,
<linux-kernel@...r.kernel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Michal Kubiak <michal.kubiak@...el.com>,
<intel-wired-lan@...ts.osuosl.org>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Christoph Hellwig <hch@....de>,
Magnus Karlsson <magnus.karlsson@...el.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx
buffer allocation a bunch
From: Alexander Duyck <alexander.duyck@...il.com>
Date: Fri, 2 Jun 2023 08:04:53 -0700
> On Fri, Jun 2, 2023 at 7:00 AM Alexander Lobakin
> <aleksander.lobakin@...el.com> wrote:
[...]
>> That's for Rx path, but don't forget that the initial allocation on ifup
>> is done in the process context. That's what the maintainers and
>> reviewers usually warn about: to not allocate with %GFP_ATOMIC on ifups.
>
> I can see that for the static values like the queue vectors and rings,
> however for the buffers themselves, but I don't see the point in doing
> that for the regular buffer allocations. Basically it is adding
> overhead for something that should have minimal impact as it usually
> happens early on during boot when the memory should be free anyway so
> GFP_ATOMIC vs GFP_KERNEL wouldn't have much impact in either case
Queue vectors and rings get allocated earlier than buffers, on device
probing :D ifup happens later and it depends on the networking scripts
etc. -- now every init system enables all the interfaces when booting up
like systemd does. Plus, ifdowns-ifups can occur during the normal
system functioning -- resets, XDP setup/remove, Ethtool configuration,
and so on. I wouldn't say Rx buffer allocation happens only on early boot.
[...]
>> Oh, I feel like I'm starting to agree :D OK, then the following doesn't
>> really get out of my head: why do we store skb pointer on the ring then,
>> if we count 1 skb as 1 unit, so that we won't leave the loop until the
>> EOP? Only to handle allocation failures? But skb is already allocated at
>> this point... <confused>
>
> The skb is there to essentially hold the frags. Keep in mind that when
> ixgbe was coded up XDP didn't exist yet.
Ok, maybe I phrased it badly.
If we don't stop the loop until skb is passed up the stack, how we can
go out of the loop with an unfinished skb? Previously, I thought lots of
drivers do that, as you may exhaust your budget prior to reaching the
last fragment, so you'll get back to the skb on the next poll.
But if we count 1 skb as budget unit, not descriptor, how we can end up
breaking the loop prior to finishing the skb? I can imagine only one
situation: HW gave us some buffers, but still processes the EOP buffer,
so we don't have any more descriptors to process, but the skb is still
unfinished. But sounds weird TBH, I thought HW processes frames
"atomically", i.e. it doesn't give you buffers until they hold the whole
frame :D
>
> I think there are drivers that are already getting away from this,
> such as mvneta, by storing an xdp_buff instead of an skb. In theory we
> could do away with most of this and just use a shared_info structure,
> but since that exists in the first frag we still need a pointer to the
> first frag as well.
ice has xdp_buff on the ring for XDP multi-buffer. It's more lightweight
than skb, but also carries the frags, since frags is a part of shinfo,
not skb.
It's totally fine and we'll end up doing the same here, my question was
as I explained below.
>
> Also multi-frag frames are typically not that likely on a normal
> network as most of the frames are less than 1514B in length. In
> addition as I mentioned before a jumbo frame workload will be less
> demanding since the frame rates are so much lower. So when I coded
> this up I had optimized for the non-fragged case with the fragmented
> case being more of an afterthought needed mostly as exception
> handling.
[...]
> That is kind of what I figured. So one thing to watch out for is
> stating performance improvements without providing context on what
> exactly it is you are doing to see that gain. So essentially what we
> have is a microbenchmark that is seeing the gain.
>
> Admittedly my goto used to be IPv4 routing since that exercised both
> the Tx and Rx path for much the same reason. However one thing you
> need to keep in mind is that if you cannot see a gain in the
> full-stack test odds are most users may not notice much of an impact.
Yeah sure. I think more than a half of optimizations in such drivers
nowadays is unnoticeable to end users :D
[...]
> Either that or they were already worn down by the time you started
> adding this type of stuff.. :)
>
> The one I used to do that would really drive people nuts was:
> for (i = loop_count; i--;)
Oh, nice one! Never thought of something like that hehe.
>
> It is more efficient since I don't have to do the comparison to the
> loop counter, but it is definitely counterintuitive to run loops
> backwards like that. I tried to break myself of the habit of using
> those sort of loops anywhere that wasn't performance critical such as
> driver init.
[...]
> Yep, now the question is how many drivers can be pulled into using
> this library. The issue is going to be all the extra features and
> workarounds outside of your basic Tx/Rx will complicate the code since
> all the drivers implement them a bit differently. One of the reasons
> for not consolidating them was to allow for performance optimizing for
> each driver. By combining them you are going to likely need to add a
> number of new conditional paths to the fast path.
When I was counting the number of spots in the Rx polling function that
need to have switch-cases/ifs in order to be able to merge the code
(e.g. parsing the descriptors), it was something around 4-5 (per
packet). So it can only be figured out during the testing whether adding
new branches actually hurts there.
XDP is relatively easy to unify in one place, most of code is
software-only. Ring structures are also easy to merge, wrapping a couple
driver-specific pointers into static inline accessors. Hotpath in
general is the easiest part, that's why I started from it.
But anyway, I'd say if one day I'd have to choice whether to remove 400
locs per driver with having -1% in synthetic tests or not do that at
all, I'd go for the former. As discussed above, it's very unlikely for
such changes to hurt real workloads, esp. with usercopy.
>
>
>>>
>>>> [...]
>>>>
>>>> Thanks for the detailed reviews, stuff that Intel often lacks :s :D
>>>
>>> No problem, it was the least I could do since I am responsible for so
>>> much of this code in the earlier drivers anyway. If nothing else I
>>> figured I could provide a bit of history on why some of this was the
>>> way it was.
>> These history bits are nice and interesting to read actually! And also
>> useful since they give some context and understanding of what is
>> obsolete and can be removed/changed.
>
> Yeah, it is easiest to do these sort of refactors when you have
> somebody to answer the "why" of most of this. I recall going through
> this when I was refactoring the igb/ixgbe drivers back in the day and
> having to purge the dead e1000 code throughout. Of course, after this
> refactor it will be all yours right?.. :D
Nope, maybe from the git-blame PoV only :p
Thanks,
Olek
Powered by blists - more mailing lists