Message-ID: <83f8aae1-c701-43a6-b91a-2e387e9a865c@ovn.org>
Date: Fri, 8 Mar 2024 15:24:14 +0100
From: Ilya Maximets <i.maximets@....org>
To: Adrian Moreno <amorenoz@...hat.com>, netdev@...r.kernel.org,
 dev@...nvswitch.org
Cc: i.maximets@....org, cmi@...dia.com, yotam.gi@...il.com,
 aconole@...hat.com, echaudro@...hat.com, horms@...nel.org,
 Dumitru Ceara <dceara@...hat.com>
Subject: Re: [RFC PATCH 0/4] net: openvswitch: Add sample multicasting.

On 3/7/24 22:29, Ilya Maximets wrote:
> On 3/7/24 21:59, Adrian Moreno wrote:
>> 
>> 
>> On 3/7/24 17:54, Ilya Maximets wrote:
>>> On 3/7/24 16:18, Adrian Moreno wrote:
>>>> ** Background **
>>>>
>>>> Currently, OVS supports several packet sampling mechanisms
>>>> (sFlow, per-bridge IPFIX, per-flow IPFIX).  These end up being
>>>> translated into a userspace action that needs to be handled by
>>>> ovs-vswitchd's handler threads only to be forwarded to some
>>>> third party application that will somehow process the sample
>>>> and provide observability on the datapath.
>>>> 
>>>> The fact that sampled traffic shares netlink sockets and
>>>> handler thread time with upcalls, apart from being a
>>>> performance bottleneck in the sample extraction itself, can
>>>> severely compromise the datapath, making this solution unfit
>>>> for highly loaded production systems.
>>>> 
>>>> Users are left with few options other than guessing what
>>>> sampling rate will be acceptable for their traffic pattern and
>>>> system load, and dealing with the lost accuracy.
>>>> 
>>>> ** Proposal **
>>>>
>>>> In this RFC, I'd like to request feedback on an attempt to fix
>>>> this situation by adding a flag to the userspace action to
>>>> indicate the upcall should be sent to a netlink multicast group
>>>> instead of unicast to ovs-vswitchd.
>>>>
>>>> This would allow other processes to read samples directly,
>>>> freeing the netlink sockets and handler threads to process
>>>> packet upcalls.
>>>> 
>>>> ** Notes on tc-offloading **
>>>>
>>>> I am aware of the efforts being made to offload the sample
>>>> action with the help of psample.  I did consider using psample
>>>> to multicast the samples.  However, I found a limitation that
>>>> I'd like to discuss: I would like to support OVN-driven
>>>> per-flow (IPFIX) sampling because it allows OVN to insert two
>>>> 32-bit values (obs_domain_id and obs_point_id) that can be used
>>>> to enrich the sample with "high level" controller metadata (see
>>>> the debug_drop_domain_id NBDB option in ovn-nb(5)).
>>>> 
>>>> The existing fields in psample_metadata are not enough to
>>>> carry this information. Would it be possible to extend this
>>>> struct to make room for some extra "application-specific"
>>>> metadata?
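
For reference, the struct in question currently looks roughly like
this (include/net/psample.h, recent kernels; quoting from memory),
with no spare room for a pair of opaque 32-bit IDs:

        struct psample_metadata {
                u32 trunc_size;
                int in_ifindex;
                int out_ifindex;
                u16 out_tc;
                u64 out_tc_occ;
                u64 latency;
                u8  out_tc_valid:1,
                    out_tc_occ_valid:1,
                    latency_valid:1,
                    unused:5;
        };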
>>>> 
>>>> ** Alternatives **
>>>>
>>>> An alternative approach that I'm considering (apart from using
>>>> psample as explained above) is to use a brand-new action.  This
>>>> would lead to a cleaner separation of concerns from the existing
>>>> userspace action (used for slow paths and OFP_CONTROLLER
>>>> actions) and cleaner statistics.  Also, ovs-vswitchd could more
>>>> easily make the layout of this new userdata part of the public
>>>> API, allowing third party sample collectors to decode it.
>>>> 
>>>> I am currently exploring this alternative but wanted to send 
>>>> the RFC to get some early feedback, guidance or ideas.
>>> 
>>> 
>>> Hi, Adrian.  Thanks for the patches!
>>> 
>> 
>> Thanks for the quick feedback.  Also adding Dumitru, whom I forgot
>> to include in the original CC list.
>> 
>>> Though I'm not sure if broadcasting is generally the best
>>> approach.  These messages contain opaque information that is not
>>> actually parsable by any entity other than the process that
>>> created the action.  And I don't think the structure of these
>>> opaque fields should become part of the uAPI of either the
>>> kernel or OVS userspace.
>>> 
>> 
>> I understand this can be cumbersome, especially given that the
>> opaque field is currently also used for some purely-internal OVS
>> actions (e.g., CONTROLLER).
>> 
>> However, for features such as OVN-driven per-flow sampling, where 
>> OVN-generated identifiers are placed in obs_domain_id and 
>> obs_point_id, it would be _really_ useful if this opaque value 
>> could be somehow decoded by external programs.
>> 
>> Two ideas come to mind to try to alleviate the potential
>> maintainability issues:
>>
>> - As I suggested, using a new action may make things easier.  By
>>   splitting the current "user_action_cookie" in two, one for
>>   internal actions and one for "observability" actions, we could
>>   expose the latter in the OVS userspace API without having to
>>   expose the former.
>>
>> - Exposing functions in OVS that decode the opaque value.  Third
>>   party applications could link against, say, libopenvswitch.so
>>   and use it to extract obs_{domain,point}_ids.
> 
> Linking with OVS libraries is practically the same as just exposing
> the internal structure, because the external application must be
> running the same library version as the process that installed the
> action; otherwise it may not be able to parse the message.
> 
> Any form of exposing to an external application will freeze the 
> opaque arguments and effectively make them a form of uAPI.
> 
> The separate action with a defined uAPI solves this problem by just 
> creating a new uAPI, but I'm not sure why it is needed.
> 
>> 
>> What do you think?
>> 
>>> The userspace() action already has an OVS_USERSPACE_ATTR_PID
>>> argument.  And it is not actually used when
>>> OVS_DP_F_DISPATCH_UPCALL_PER_CPU is enabled.  All known users of
>>> OVS_DP_F_DISPATCH_UPCALL_PER_CPU set OVS_USERSPACE_ATTR_PID to
>>> UINT32_MAX, which is not a pid that the kernel could generate.
>>> 
>>> So, with a minimal and pretty much backward compatible change in
>>> the output_userspace() function, we can honor
>>> OVS_USERSPACE_ATTR_PID if it's not U32_MAX.  This way a userspace
>>> process can open a separate socket and configure sampling to
>>> redirect all packets there, while normal MISS upcalls would still
>>> arrive on the per-CPU sockets.  This should cover the performance
>>> concern.
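
To elaborate, I mean roughly the following in output_userspace() in
net/openvswitch/actions.c (an untested sketch of the idea, not an
actual patch):

        case OVS_USERSPACE_ATTR_PID:
                if (dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU &&
                    nla_get_u32(a) == U32_MAX)
                        /* Legacy wildcard: keep using per-CPU sockets. */
                        upcall.portid =
                                ovs_dp_get_upcall_portid(dp,
                                                         smp_processor_id());
                else
                        /* An explicit socket was requested, e.g. a
                         * dedicated one for sampling. */
                        upcall.portid = nla_get_u32(a);
                break;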
>>> 
>> 
>> Do you mean creating a new thread to process samples or using
>> handlers?  The latter would still have a performance impact, and
>> the former would likely fail to process all samples in a timely
>> manner if there are many.
>> 
>> Besides, the current userspace tc-offloading series uses netlink
>> broadcast with psample; can't we do the same for non-offloaded
>> actions?  It enables building external observability applications
>> without overloading OVS.
> 
> Creating a separate thread solves the performance issue.  But you
> can also write a separate application that would communicate its
> PID to the running OVS daemon.  Say, the same application that
> configures sampling in the OVS database could also write its PID
> there.
> 
> The thing is that the existence of an external application
> immediately breaks the opacity of the arguments and forces us to
> define a uAPI.  However, if there is explicit communication between
> that application and the OVS userspace daemon, then we can
> establish a contract (structure of opaque values) between these two
> userspace applications without defining that contract in the kernel
> uAPI.  But if we're going with multicast, which anyone can
> subscribe to, then we have to define that contract in the kernel
> uAPI.
> 
> Also, in order for this observability to work with the userspace
> datapath, we'll have to implement userspace-to-userspace netlink
> multicast (does that even exist?).  Running the sample collection
> within OVS as a thread will be much less painful.
> 
> One other thing worth mentioning is that the PID approach I
> suggested is just a minor tweak of what is already supported in the
> kernel.  It doesn't prohibit introduction of a new action or a
> multicast group in the future, while a premature uAPI definition
> may end up producing another action that nobody uses.  A new action
> can be added later if it turns out to be actually necessary.

Thinking more about this problem, it seems to make some sense to have
a way to ask OVS for sampling that multiple observers can subscribe to.
A unicast socket will not allow such functionality.  However, I still
don't think creation of a new multicast group for that purpose is
justified.  The kernel already has a generic sampling mechanism
(psample) with a multicast group created specifically for a very
similar purpose.  So, instead of re-inventing it, we can add a small
modification to OVS's sampling action allowing it to sample to psample
instead of OVS's own unicast sockets.  This can be achieved by adding
a new OVS_SAMPLE_ATTR that would tell the datapath to direct packets
to psample instead of executing the nested actions.  Or by adding a
new OVS_USERSPACE_ATTR that would do the same thing but from a
userspace() action instead, i.e. direct packets to psample instead of
OVS's own sockets, copying OVS_PACKET_ATTR_USERDATA into the
PSAMPLE_ATTR_SAMPLE_GROUP.  Might be cleaner this way, not sure.
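
Very roughly, the datapath side of that could look something like the
sketch below (the helper and its wiring are invented for illustration;
error handling and most of the metadata are omitted):

        /* Hypothetical helper in net/openvswitch/actions.c: hand the
         * packet to psample instead of executing nested actions. */
        #include <net/psample.h>

        static int ovs_sample_to_psample(struct net *net,
                                         struct sk_buff *skb,
                                         u32 group_num, u32 rate)
        {
                struct psample_group *group;
                struct psample_metadata md = {};

                group = psample_group_get(net, group_num);
                if (!group)
                        return -ENOENT;

                md.in_ifindex = skb->dev ? skb->dev->ifindex : 0;
                psample_sample_packet(group, skb, rate, &md);
                psample_group_put(group);
                return 0;
        }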

From the perspective of the OVS userspace daemon, this functionality
can be cleanly exposed as a separate sampling mechanism alongside
IPFIX, sFlow and NetFlow.

I see you alluded to this approach in the original cover letter above,
so I'd vote for it instead.  Hopefully the psample API can be extended
to be more flexible and allow larger userdata to be passed in.  Maybe
have a SAMPLE_SUBGROUP in addition to the existing
PSAMPLE_ATTR_SAMPLE_GROUP?  OTOH, it is not really IPFIX, it's a
different interface, so it might have different requirements.  In any
case, OVS can check that the userdata fits into the psample arguments
and reject flows that are not compatible.
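
And any third-party collector could then read the samples by simply
joining psample's "packets" generic netlink multicast group, e.g. with
libnl (a minimal sketch, error handling omitted):

        /* cc reader.c $(pkg-config --cflags --libs libnl-genl-3.0) */
        #include <netlink/netlink.h>
        #include <netlink/genl/genl.h>
        #include <netlink/genl/ctrl.h>

        static int on_sample(struct nl_msg *msg, void *arg)
        {
                /* Parse PSAMPLE_ATTR_SAMPLE_GROUP, PSAMPLE_ATTR_DATA,
                 * etc. from the message here. */
                return NL_OK;
        }

        int main(void)
        {
                struct nl_sock *sk = nl_socket_alloc();
                int grp;

                genl_connect(sk);
                grp = genl_ctrl_resolve_grp(sk, "psample", "packets");
                nl_socket_add_membership(sk, grp);
                nl_socket_disable_seq_check(sk);
                nl_socket_modify_cb(sk, NL_CB_VALID, NL_CB_CUSTOM,
                                    on_sample, NULL);

                for (;;)
                        nl_recvmsgs_default(sk);
        }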

> 
> Best regards, Ilya Maximets.
> 
>> 
>> 
>>> For the case without per-cpu dispatch, the feature comes for free
>>> if the userspace application wants to use it.  However, there is
>>> no currently supported version of OVS that doesn't use per-cpu
>>> dispatch when available.
>>>
>>> What do you think?
>>>
>>> Best regards, Ilya Maximets.
>>> 
>> 
> 

