Message-ID: <54667115-0286-4d01-9bd9-8ccde3ddefa7@redhat.com>
Date: Wed, 13 Mar 2024 09:28:32 +0100
From: Adrian Moreno <amorenoz@...hat.com>
To: Ilya Maximets <i.maximets@....org>, netdev@...r.kernel.org,
dev@...nvswitch.org
Cc: cmi@...dia.com, yotam.gi@...il.com, aconole@...hat.com,
echaudro@...hat.com, horms@...nel.org, Dumitru Ceara <dceara@...hat.com>
Subject: Re: [RFC PATCH 0/4] net: openvswitch: Add sample multicasting.

On 3/8/24 15:24, Ilya Maximets wrote:
> On 3/7/24 22:29, Ilya Maximets wrote:
>> On 3/7/24 21:59, Adrian Moreno wrote:
>>>
>>>
>>> On 3/7/24 17:54, Ilya Maximets wrote:
>>>> On 3/7/24 16:18, Adrian Moreno wrote:
>>>>> ** Background **
>>>>>
>>>>> Currently, OVS supports several packet sampling mechanisms
>>>>> (sFlow, per-bridge IPFIX, per-flow IPFIX). These end up being
>>>>> translated into a userspace action that needs to be handled by
>>>>> ovs-vswitchd's handler threads only to be forwarded to some
>>>>> third-party application that will somehow process the sample and
>>>>> provide observability on the datapath.
>>>>>
>>>>> The fact that sampled traffic shares netlink sockets and handler
>>>>> thread time with upcalls, apart from being a performance
>>>>> bottleneck in the sample extraction itself, can severely
>>>>> compromise the datapath, making this solution unfit for highly
>>>>> loaded production systems.
>>>>>
>>>>> Users are left with few options other than guessing what
>>>>> sampling rate will be acceptable for their traffic pattern and
>>>>> system load and dealing with the lost accuracy.
>>>>>
>>>>> ** Proposal **
>>>>>
>>>>> In this RFC, I'd like to request feedback on an attempt to fix
>>>>> this situation by adding a flag to the userspace action to
>>>>> indicate that the upcall should be sent to a netlink multicast
>>>>> group instead of being unicast to ovs-vswitchd.
>>>>>
>>>>> This would allow other processes to read samples directly,
>>>>> freeing the netlink sockets and handler threads to process
>>>>> packet upcalls.
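>>>>>
>>>>> To make the idea concrete, the delivery change is roughly the
>>>>> following (a sketch only; the "multicast" field and the group
>>>>> name are placeholders, not necessarily what the patches add):
>>>>>
>>>>>     /* In queue_userspace_packet(): deliver the upcall to a
>>>>>      * multicast group when the action requests it, otherwise
>>>>>      * keep the existing unicast path. */
>>>>>     if (upcall_info->multicast)      /* hypothetical field */
>>>>>             err = genlmsg_multicast_netns(&dp_packet_genl_family,
>>>>>                                           ovs_dp_get_net(dp),
>>>>>                                           user_skb, 0,
>>>>>                                           OVS_PACKET_MCGROUP_SAMPLE,
>>>>>                                           GFP_ATOMIC);
>>>>>     else
>>>>>             err = genlmsg_unicast(ovs_dp_get_net(dp), user_skb,
>>>>>                                   upcall_info->portid);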
>>>>>
>>>>> ** Notes on tc-offloading **
>>>>>
>>>>> I am aware of the efforts being made to offload the sample
>>>>> action with the help of psample. I did consider using psample to
>>>>> multicast the samples. However, I found a limitation that I'd
>>>>> like to discuss: I would like to support OVN-driven per-flow
>>>>> (IPFIX) sampling because it allows OVN to insert two 32-bit
>>>>> values (obs_domain_id and obs_point_id) that can be used to
>>>>> enrich the sample with "high-level" controller metadata (see the
>>>>> debug_drop_domain_id NBDB option in ovn-nb(5)).
>>>>>
>>>>> The existing fields in psample_metadata are not enough to
>>>>> carry this information. Would it be possible to extend this
>>>>> struct to make room for some extra "application-specific"
>>>>> metadata?
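>>>>>
>>>>> For example, something along these lines (purely illustrative;
>>>>> only the last two fields are new, and their names are made up):
>>>>>
>>>>>     struct psample_metadata {
>>>>>             u32 trunc_size;
>>>>>             int in_ifindex;
>>>>>             int out_ifindex;
>>>>>             ...
>>>>>             /* Hypothetical addition: opaque, application-defined
>>>>>              * cookie forwarded verbatim to the psample multicast
>>>>>              * group (e.g. obs_domain_id + obs_point_id). */
>>>>>             const u8 *user_cookie;
>>>>>             u32 user_cookie_len;
>>>>>     };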
>>>>>
>>>>> ** Alternatives **
>>>>>
>>>>> An alternative approach that I'm considering (apart from using
>>>>> psample as explained above) is to use a brand-new action. This
>>>>> leads to a cleaner separation of concerns from the existing
>>>>> userspace action (used for slow paths and OFP_CONTROLLER
>>>>> actions) and cleaner statistics. Also, ovs-vswitchd could more
>>>>> easily make the layout of this new userdata part of the public
>>>>> API, allowing third-party sample collectors to decode it.
>>>>>
>>>>> I am currently exploring this alternative but wanted to send
>>>>> the RFC to get some early feedback, guidance or ideas.
>>>>
>>>>
>>>> Hi, Adrian. Thanks for the patches!
>>>>
>>>
>>> Thanks for the quick feedback. Also adding Dumitru, whom I forgot
>>> to include in the original CC list.
>>>
>>>> Though I'm not sure if broadcasting is generally the best
>>>> approach. These messages contain opaque information that is not
>>>> actually parsable by any entity other than the process that
>>>> created the action. And I don't think the structure of these
>>>> opaque fields should become part of the uAPI of either the kernel
>>>> or OVS userspace.
>>>>
>>>
>>> I understand this can be cumbersome, especially given that the
>>> opaque field is currently also used for some purely internal OVS
>>> actions (e.g., CONTROLLER).
>>>
>>> However, for features such as OVN-driven per-flow sampling, where
>>> OVN-generated identifiers are placed in obs_domain_id and
>>> obs_point_id, it would be _really_ useful if this opaque value
>>> could be somehow decoded by external programs.
>>>
>>> Two ideas come to mind to try to alleviate the potential
>>> maintainability issues:
>>>
>>> - As I suggested, using a new action might make things easier. By
>>>   splitting the current "user_action_cookie" in two, one for
>>>   internal actions and one for "observability" actions, we could
>>>   expose the latter in the OVS userspace API without having to
>>>   expose the former.
>>>
>>> - Exposing functions in OVS that decode the opaque value.
>>>   Third-party applications could link against, say,
>>>   libopenvswitch.so and use it to extract obs_{domain,point}_ids
>>>   (a rough prototype is sketched below).
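>>>
>>> For the second idea, the entry point could be as simple as this
>>> (hypothetical prototype, not an existing libopenvswitch symbol):
>>>
>>>     /* Decode the opaque userdata attached to a sample and extract
>>>      * the OVN-provided observability IDs.  Returns 0 on success,
>>>      * a negative errno value otherwise. */
>>>     int odp_sample_userdata_decode(const void *userdata, size_t len,
>>>                                    uint32_t *obs_domain_id,
>>>                                    uint32_t *obs_point_id);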
>>
>> Linking with OVS libraries is practically the same as just exposing
>> the internal structure, because once the external application is
>> running it either must have the same library version as the process
>> that installs the action, or it may not be able to parse the
>> message.
>>
>> Any form of exposing to an external application will freeze the
>> opaque arguments and effectively make them a form of uAPI.
>>
>> The separate action with a defined uAPI solves this problem by just
>> creating a new uAPI, but I'm not sure why it is needed.
>>
>>>
>>> What do you think?
>>>
>>>> The userspace() action already has an OVS_USERSPACE_ATTR_PID
>>>> argument. And it is not actually used when
>>>> OVS_DP_F_DISPATCH_UPCALL_PER_CPU is enabled. All known users of
>>>> OVS_DP_F_DISPATCH_UPCALL_PER_CPU set OVS_USERSPACE_ATTR_PID to
>>>> UINT32_MAX, which is not a pid that the kernel could generate.
>>>>
>>>> So, with a minimal and pretty much backward-compatible change in
>>>> the output_userspace() function, we can honor
>>>> OVS_USERSPACE_ATTR_PID if it's not U32_MAX. This way a userspace
>>>> process can open a separate socket and configure sampling to
>>>> redirect all packets there, while normal MISS upcalls would still
>>>> arrive on the per-cpu sockets. This should cover the performance
>>>> concern.
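>>>>
>>>> Roughly (an untested sketch of what I mean):
>>>>
>>>>     case OVS_USERSPACE_ATTR_PID:
>>>>             if (dp->user_features &
>>>>                 OVS_DP_F_DISPATCH_UPCALL_PER_CPU &&
>>>>                 nla_get_u32(a) == U32_MAX)
>>>>                     /* Keep the current per-CPU dispatch. */
>>>>                     upcall.portid =
>>>>                        ovs_dp_get_upcall_portid(dp,
>>>>                                                 smp_processor_id());
>>>>             else
>>>>                     /* Honor an explicitly requested socket. */
>>>>                     upcall.portid = nla_get_u32(a);
>>>>             break;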
>>>>
>>>
>>> Do you mean creating a new thread to process samples or using the
>>> handlers? The latter would still have a performance impact, and
>>> the former would likely fail to process all samples in a timely
>>> manner if there are many.
>>>
>>> Besides, the current userspace tc-offloading series uses netlink
>>> broadcast with psample; can't we do the same for non-offloaded
>>> actions? It would enable building external observability
>>> applications without overloading OVS.
>>
>> Creating a separate thread solves the performance issue. But you can
>> also write a separate application that would communicate its PID to
>> the running OVS daemon. Let's say the same application that
>> configures sampling in the OVS database can also write a PID there.
>>
>> The thing is that the existence of an external application
>> immediately breaks the opacity of the arguments and forces us to
>> define a uAPI. However, if there is explicit communication between
>> that application and the OVS userspace daemon, then we can
>> establish a contract (structure of opaque values) between these two
>> userspace applications without defining that contract in the kernel
>> uAPI. But if we're going with multicast, which anyone can subscribe
>> to, then we have to define that contract in the kernel uAPI.
>>
>> Also, in order for this observability to work with the userspace
>> datapath, we'll have to implement userspace-to-userspace netlink
>> multicast (does that even exist?). Running the sample collection
>> within OVS as a thread will be much less painful.
>>
>> One other thing worth mentioning is that the PID approach I
>> suggested is just a minor tweak of what is already supported in the
>> kernel. It doesn't prohibit the introduction of a new action or a
>> multicast group in the future, while a premature uAPI definition
>> may leave us with yet another action that nobody uses. It can be
>> added later if it ends up being actually necessary.
>
> Thinking more about this problem, it seems to make some sense to have
> a way to ask OVS for sampling that multiple observers can subscribe
> to. A unicast socket will not allow such functionality. However, I
> still don't think creation of a new multicast group for that purpose
> is justified. The kernel already has a generic sampling mechanism
> (psample) with a multicast group created specifically for a very
> similar purpose. So, instead of re-inventing it, we can add a small
> modification to the OVS'es sampling action allowing it to sample to
> psample instead of OVS'es own unicast sockets. This can be achieved
> by adding a new OVS_SAMPLE_ATTR that would tell the datapath to
> direct packets to psample instead of executing actions. Or by adding
> a new OVS_USERSPACE_ATTR that would do the same thing but from a
> userspace() action instead, i.e. direct packets to psample instead
> of OVS'es own sockets, copying OVS_PACKET_ATTR_USERDATA into
> PSAMPLE_ATTR_SAMPLE_GROUP. Might be cleaner this way, not sure.
>
> From the perspective of the OVS userspace daemon, this functionality
> can be clearly exposed as a separate sampling mechanism alongside
> IPFIX, sFlow and NetFlow.
>
> I see you alluded to this approach in the original cover letter
> above. So, I'd vote for it instead. Hopefully, the psample API can
> be extended to be more flexible and allow larger userdata to be
> passed in. Maybe have a SAMPLE_SUBGROUP in addition to the existing
> PSAMPLE_ATTR_SAMPLE_GROUP? OTOH, it is not really IPFIX, it's a
> different interface, so it might have different requirements. In any
> case, OVS may check that the userdata fits into the psample
> arguments and reject flows that are not compatible.
>
Thanks for your feedback, Ilya.
I'll send an RFC_v2 with the proposed alternative.
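
Roughly, I'm thinking of something along these lines for the datapath
interface (a provisional sketch; attribute names and values are
placeholders, not final uAPI):

    /* New optional attributes for the sample() action telling the
     * datapath to also deliver the sampled packet to psample. */
    enum {
            /* u32: value to put in PSAMPLE_ATTR_SAMPLE_GROUP.
             * The "100" is just a placeholder. */
            OVS_SAMPLE_ATTR_PSAMPLE_GROUP = 100,
            /* binary: opaque userdata, e.g. obs_domain_id and
             * obs_point_id, attached to the psample message. */
            OVS_SAMPLE_ATTR_PSAMPLE_COOKIE,
    };

ovs-vswitchd would then translate the per-flow IPFIX configuration
into this attribute pair, and any collector subscribed to the psample
multicast group could consume the samples without touching the upcall
sockets.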
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>>
>>>> For the case without per-cpu dispatch, the feature comes for free
>>>> if the userspace application wants to use it. However, there is
>>>> no currently supported version of OVS that doesn't use per-cpu
>>>> dispatch when available.
>>>>
>>>> What do you think?
>>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>
>>
>
--
Adrián Moreno