[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f7tv8bmfmgc.fsf@redhat.com>
Date: Wed, 04 Oct 2023 11:15:31 -0400
From: Aaron Conole <aconole@...hat.com>
To: "Nicholas Piggin" <npiggin@...il.com>
Cc: "Eelco Chaudron" <echaudro@...hat.com>, <netdev@...r.kernel.org>,
<dev@...nvswitch.org>, "Ilya Maximets" <imaximet@...hat.com>, "Flavio
Leitner" <fbl@...hat.com>
Subject: Re: [ovs-dev] [RFC PATCH 4/7] net: openvswitch: ovs_vport_receive
reduce stack usage
"Nicholas Piggin" <npiggin@...il.com> writes:
> On Fri Sep 29, 2023 at 6:38 PM AEST, Eelco Chaudron wrote:
>>
>>
>> On 29 Sep 2023, at 9:00, Nicholas Piggin wrote:
>>
>> > On Fri Sep 29, 2023 at 1:26 AM AEST, Aaron Conole wrote:
>> >> Nicholas Piggin <npiggin@...il.com> writes:
>> >>
>> >>> Dynamically allocating the sw_flow_key reduces stack usage of
>> >>> ovs_vport_receive from 544 bytes to 64 bytes at the cost of
>> >>> another GFP_ATOMIC allocation in the receive path.
>> >>>
>> >>> XXX: is this a problem with memory reserves if ovs is in a
>> >>> memory reclaim path, or since we have a skb allocated, is it
>> >>> okay to use some GFP_ATOMIC reserves?
>> >>>
>> >>> Signed-off-by: Nicholas Piggin <npiggin@...il.com>
>> >>> ---
>> >>
>> >> This represents a fairly large performance hit. Just my own quick
>> >> testing on a system using two netns, iperf3, and simple forwarding rules
>> >> shows between 2.5% and 4% performance reduction on x86-64. Note that it
>> >> is a simple case, and doesn't involve a more involved scenario like
>> >> multiple bridges, tunnels, and internal ports. I suspect such cases
>> >> will see even bigger hit.
>> >>
>> >> I don't know the impact of the other changes, but just an FYI that the
>> >> performance impact of this change is extremely noticeable on x86
>> >> platform.
>> >
>> > Thanks for the numbers. This patch is probably the biggest perf cost,
>> > but unfortunately it's also about the biggest saving. I might have an
>> > idea to improve it.
>>
>> Also, were you able to figure out why we do not see this problem on
>> x86 and arm64? Is the stack usage so much larger, or is there some
>> other root cause?
>
> Haven't pinpointed it exactly. ppc64le interrupt entry frame is nearly
> 3x larger than x86-64, about 200 bytes. So there's 400 if a hard
> interrupt (not seen in the backtrace) is what overflowed it. Stack
> alignment I think is 32 bytes vs 16 for x86-64. And different amount of
> spilling and non-volatile register use and inlining choices by the
> compiler could nudge things one way or another. There is little to no
> ppc64le specific data structures on the stack in any of this call chain
> which should cause much more bloat though, AFAIKS.
>
> So other archs should not be far away from overflowing 16kB I think.
>
>> Is there a simple replicator, as this might help you
>> profile the differences between the architectures?
>
> Unfortunately not, it's some kubernetes contraption I don't know how
> to reproduce myself.
If we can get the flow dump and configuration, we can probably make sure
to reproduce it with ovs-dpctl.py (add any missing features, etc). I
guess it should be simple to get (ovs-vsctl show, ovs-appctl
dpctl/dump-flows) and we can try to replicate it.
> Thanks,
> Nick
Powered by blists - more mailing lists