[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <f7ty0xa1hj2.fsf@redhat.com>
Date: Wed, 12 Mar 2025 09:54:09 -0400
From: Aaron Conole <aconole@...hat.com>
To: Ilya Maximets <i.maximets@....org>
Cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>, Eric
Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo
Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
dev@...nvswitch.org, linux-kernel@...r.kernel.org, Pravin B Shelar
<pshelar@....org>, Eelco Chaudron <echaudro@...hat.com>
Subject: Re: [PATCH net] net: openvswitch: remove misbehaving actions length
check
Ilya Maximets <i.maximets@....org> writes:
> On 3/8/25 21:03, Aaron Conole wrote:
>> Ilya Maximets <i.maximets@....org> writes:
>>
>>> The actions length check is unreliable and produces different results
>>> depending on the initial length of the provided netlink attribute and
>>> the composition of the actual actions inside of it. For example, a
>>> user can add 4088 empty clone() actions without triggering -EMSGSIZE,
>>> on attempt to add 4089 such actions the operation will fail with the
>>> -EMSGSIZE verdict. However, if another 16 KB of other actions will
>>> be *appended* to the previous 4089 clone() actions, the check passes
>>> and the flow is successfully installed into the openvswitch datapath.
>>>
>>> The reason for a such a weird behavior is the way memory is allocated.
>>> When ovs_flow_cmd_new() is invoked, it calls ovs_nla_copy_actions(),
>>> that in turn calls nla_alloc_flow_actions() with either the actual
>>> length of the user-provided actions or the MAX_ACTIONS_BUFSIZE. The
>>> function adds the size of the sw_flow_actions structure and then the
>>> actually allocated memory is rounded up to the closest power of two.
>>>
>>> So, if the user-provided actions are larger than MAX_ACTIONS_BUFSIZE,
>>> then MAX_ACTIONS_BUFSIZE + sizeof(*sfa) rounded up is 32K + 24 -> 64K.
>>> Later, while copying individual actions, we look at ksize(), which is
>>> 64K, so this way the MAX_ACTIONS_BUFSIZE check is not actually
>>> triggered and the user can easily allocate almost 64 KB of actions.
>>>
>>> However, when the initial size is less than MAX_ACTIONS_BUFSIZE, but
>>> the actions contain ones that require size increase while copying
>>> (such as clone() or sample()), then the limit check will be performed
>>> during the reserve_sfa_size() and the user will not be allowed to
>>> create actions that yield more than 32 KB internally.
>>>
>>> This is one part of the problem. The other part is that it's not
>>> actually possible for the userspace application to know beforehand
>>> if the particular set of actions will be rejected or not.
>>>
>>> Certain actions require more space in the internal representation,
>>> e.g. an empty clone() takes 4 bytes in the action list passed in by
>>> the user, but it takes 12 bytes in the internal representation due
>>> to an extra nested attribute, and some actions require less space in
>>> the internal representations, e.g. set(tunnel(..)) normally takes
>>> 64+ bytes in the action list provided by the user, but only needs to
>>> store a single pointer in the internal implementation, since all the
>>> data is stored in the tunnel_info structure instead.
>>>
>>> And the action size limit is applied to the internal representation,
>>> not to the action list passed by the user. So, it's not possible for
>>> the userpsace application to predict if the certain combination of
>>> actions will be rejected or not, because it is not possible for it to
>>> calculate how much space these actions will take in the internal
>>> representation without knowing kernel internals.
>>>
>>> All that is causing random failures in ovs-vswitchd in userspace and
>>> inability to handle certain traffic patterns as a result. For example,
>>> it is reported that adding a bit more than a 1100 VMs in an OpenStack
>>> setup breaks the network due to OVS not being able to handle ARP
>>> traffic anymore in some cases (it tries to install a proper datapath
>>> flow, but the kernel rejects it with -EMSGSIZE, even though the action
>>> list isn't actually that large.)
>>>
>>> Kernel behavior must be consistent and predictable in order for the
>>> userspace application to use it in a reasonable way. ovs-vswitchd has
>>> a mechanism to re-direct parts of the traffic and partially handle it
>>> in userspace if the required action list is oversized, but that doesn't
>>> work properly if we can't actually tell if the action list is oversized
>>> or not.
>>>
>>> Solution for this is to check the size of the user-provided actions
>>> instead of the internal representation. This commit just removes the
>>> check from the internal part because there is already an implicit size
>>> check imposed by the netlink protocol. The attribute can't be larger
>>> than 64 KB. Realistically, we could reduce the limit to 32 KB, but
>>> we'll be risking to break some existing setups that rely on the fact
>>> that it's possible to create nearly 64 KB action lists today.
>>>
>>> Vast majority of flows in real setups are below 100-ish bytes. So
>>> removal of the limit will not change real memory consumption on the
>>> system. The absolutely worst case scenario is if someone adds a flow
>>> with 64 KB of empty clone() actions. That will yield a 192 KB in the
>>> internal representation consuming 256 KB block of memory. However,
>>> that list of actions is not meaningful and also a no-op. Real world
>>> very large action lists (that can occur for a rare cases of BUM
>>> traffic handling) are unlikely to contain a large number of clones and
>>> will likely have a lot of tunnel attributes making the internal
>>> representation comparable in size to the original action list.
>>> So, it should be fine to just remove the limit.
>>>
>>> Commit in the 'Fixes' tag is the first one that introduced the
>>> difference between internal representation and the user-provided action
>>> lists, but there were many more afterwards that lead to the situation
>>> we have today.
>>>
>>> Fixes: 7d5437c709de ("openvswitch: Add tunneling interface.")
>>> Signed-off-by: Ilya Maximets <i.maximets@....org>
>>> ---
>>
>> Thanks for the detailed explanation. Do you think it's useful to
>> check with selftest:
>>
>> # python3 ./ovs-dpctl.py add-dp flbr
>> # python3 ./ovs-dpctl.py add-flow flbr \
>> "in_port(0),eth(),eth_type(0x806),arp()" \
>> $(echo 'print("clone(),"*4089)' | python3)
>>
>> I think a limit test is probably a good thing to have anyway (although
>> after this commit we will rely on netlink limits).
>
> I had a similar thought, but as you said, this commit will remove the limit
> so we'll not really be testing OVS code at this point. So, I thought it
> may be better to not include such a test for easier backporting to older
> kernels (given the Fixes tag goes far back). But I agree that it's a good
> thing in general to have tests that cover maximum size cases, so maybe we
> can add something like this to net-next instead, once the fix is accepted?
Yes, that makes sense to me, especially as we don't have any bounds
checking currently, and as you note we're not really exercising an OVS
specific code path.
> Note: 4089 is too small for such a test, it should be somewhere around 16K.
Yes, but 4089 would fail before this patch ;)
>>
>>
>> Reviewed-by: Aaron Conole <aconole@...hat.com>
>>
>
> Thanks for review!
>
> Best regards, Ilya Maximets.
Powered by blists - more mailing lists