Message-ID: <5f522987-994b-4a46-a489-cde796a4a960@ovn.org>
Date: Thu, 7 Mar 2024 22:29:49 +0100
From: Ilya Maximets <i.maximets@....org>
To: Adrian Moreno <amorenoz@...hat.com>, netdev@...r.kernel.org,
 dev@...nvswitch.org
Cc: i.maximets@....org, cmi@...dia.com, yotam.gi@...il.com,
 aconole@...hat.com, echaudro@...hat.com, horms@...nel.org,
 Dumitru Ceara <dceara@...hat.com>
Subject: Re: [RFC PATCH 0/4] net: openvswitch: Add sample multicasting.

On 3/7/24 21:59, Adrian Moreno wrote:
> 
> 
> On 3/7/24 17:54, Ilya Maximets wrote:
>> On 3/7/24 16:18, Adrian Moreno wrote:
>>> ** Background **
>>> Currently, OVS supports several packet sampling mechanisms (sFlow,
>>> per-bridge IPFIX, per-flow IPFIX). These end up being translated into a
>>> userspace action that needs to be handled by ovs-vswitchd's handler
>>> threads only to be forwarded to some third party application that
>>> will somehow process the sample and provide observability on the
>>> datapath.
>>>
>>> The fact that sampled traffic shares netlink sockets and handler thread
>>> time with upcalls, apart from being a performance bottleneck in the
>>> sample extraction itself, can severely compromise the datapath,
>>> rendering this solution unfit for highly loaded production systems.
>>>
>>> Users are left with few options other than guessing what sampling
>>> rate will be acceptable for their traffic pattern and system load and
>>> dealing with the lost accuracy.
>>>
>>> ** Proposal **
>>> In this RFC, I'd like to request feedback on an attempt to fix this
>>> situation by adding a flag to the userspace action to indicate that the
>>> upcall should be sent to a netlink multicast group instead of unicast
>>> to ovs-vswitchd.
>>>
>>> This would allow other processes to read samples directly, freeing
>>> the netlink sockets and handler threads to process packet upcalls.
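>>>
>>> Roughly, the dispatch choice in the kernel could look like this
>>> (sketch only, modeled on queue_userspace_packet(); the flag and the
>>> multicast group name are illustrative, not necessarily what the
>>> patches use):
>>>
>>>     if (upcall_info->sample_multicast)        /* hypothetical flag */
>>>             err = genlmsg_multicast_netns(&dp_packet_genl_family,
>>>                                           ovs_dp_get_net(dp), user_skb,
>>>                                           0, OVS_PACKET_MCGRP_SAMPLE,
>>>                                           GFP_ATOMIC);
>>>     else
>>>             err = genlmsg_unicast(ovs_dp_get_net(dp), user_skb,
>>>                                   upcall_info->portid);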
>>>
>>> ** Notes on tc-offloading **
>>> I am aware of the efforts being made to offload the sample action with
>>> the help of psample.
>>> I did consider using psample to multicast the samples. However, I
>>> found a limitation that I'd like to discuss:
>>> I would like to support OVN-driven per-flow (IPFIX) sampling because
>>> it allows OVN to insert two 32-bit values (obs_domain_id and
>>> obs_point_id) that can be used to enrich the sample with "high level"
>>> controller metadata (see debug_drop_domain_id NBDB option in ovn-nb(5)).
>>>
>>> The existing fields in psample_metadata are not enough to carry this
>>> information. Would it be possible to extend this struct to make room for
>>> some extra "application-specific" metadata?
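>>>
>>> For illustration, the kind of extension I have in mind (field names
>>> are hypothetical, not a proposal for the exact layout):
>>>
>>>     struct psample_metadata {
>>>             u32 trunc_size;
>>>             int in_ifindex;
>>>             int out_ifindex;
>>>             /* ... existing fields ... */
>>>
>>>             /* New: opaque, application-specific metadata, e.g. the
>>>              * two 32-bit OVN identifiers mentioned above. */
>>>             const u8 *user_cookie;
>>>             u32 user_cookie_len;
>>>     };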
>>>
>>> ** Alternatives **
>>> An alternative approach that I'm considering (apart from using psample
>>> as explained above) is to use a brand-new action. This leads to a cleaner
>>> separation of concerns from the existing userspace action (used for slow
>>> paths and OFP_CONTROLLER actions) and cleaner statistics.
>>> Also, ovs-vswitchd could more easily make the layout of this
>>> new userdata part of the public API, allowing third party sample
>>> collectors to decode it.
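>>>
>>> That is, OVS could ship something like the following in a public
>>> header (sketch only; the name and layout are illustrative):
>>>
>>>     /* Fixed, versioned layout of the sample userdata so that
>>>      * third-party collectors can decode it without depending on
>>>      * OVS internals. */
>>>     struct ovs_sample_userdata {
>>>             uint32_t version;
>>>             uint32_t obs_domain_id;
>>>             uint32_t obs_point_id;
>>>     };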
>>>
>>> I am currently exploring this alternative but wanted to send the RFC to
>>> get some early feedback, guidance or ideas.
>>
>>
>> Hi, Adrian.  Thanks for the patches!
>>
> 
> Thanks for the quick feedback.
> Also adding Dumitru, whom I forgot to include in the original CC list.
> 
>> Though I'm not sure if broadcasting is generally the best approach.
>> These messages contain opaque information that is not actually
>> parsable by any entity other than the process that created the action.
>> And I don't think the structure of these opaque fields should become
>> part of the uAPI in either the kernel or OVS userspace.
>>
> 
> I understand this can be cumbersome, especially given that the opaque field is 
> currently also used for some purely internal OVS actions (e.g., CONTROLLER).
> 
> However, for features such as OVN-driven per-flow sampling, where OVN-generated 
> identifiers are placed in obs_domain_id and obs_point_id, it would be _really_ 
> useful if this opaque value could somehow be decoded by external programs.
> 
> Two ideas come to mind to try to alleviate the potential maintainability issues:
> - As I suggested, using a new action may make things easier. By splitting the 
> current "user_action_cookie" in two, one for internal actions and one for 
> "observability" actions, we could expose the latter in the OVS userspace API 
> without having to expose the former.
> - Exposing functions in OVS that decode the opaque value (sketched below). 
> Third-party applications could link against, say, libopenvswitch.so and use 
> them to extract obs_{domain,point}_ids.
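> 
> Something as small as this, exported from libopenvswitch, could be
> enough (hypothetical declaration, not an existing API):
> 
>     /* Decode the opaque userdata attached to a sample.  Returns 0 on
>      * success, negative if the layout is unknown or incompatible. */
>     int ovs_sample_cookie_decode(const void *userdata, size_t len,
>                                  uint32_t *obs_domain_id,
>                                  uint32_t *obs_point_id);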

Linking with OVS libraries is practically the same as just exposing
the internal structure, because the external application must then run
with the same library version as the process that installed the action,
or it may not be able to parse the message.

Any form of exposing to an external application will freeze the
opaque arguments and effectively make them a form of uAPI.

The separate action with a defined uAPI solves this problem by just
creating a new uAPI, but I'm not sure why it is needed.

> 
> What do you think?
> 
>> The userspace() action already has an OVS_USERSPACE_ATTR_PID argument,
>> and it is not actually used when OVS_DP_F_DISPATCH_UPCALL_PER_CPU is
>> enabled.  All known users of OVS_DP_F_DISPATCH_UPCALL_PER_CPU are
>> setting the OVS_USERSPACE_ATTR_PID to UINT32_MAX, which is not a pid
>> that the kernel could generate.
>>
>> So, with a minimal and pretty much backward compatible change in the
>> output_userspace() function, we can honor OVS_USERSPACE_ATTR_PID if
>> it's not U32_MAX.  This way a userspace process can open a separate
>> socket and configure sampling to redirect all packets there while
>> normal MISS upcalls would still arrive at the per-cpu sockets.  This
>> should cover the performance concern.
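>>
>> The change could be as small as this (untested sketch against
>> output_userspace() in net/openvswitch/actions.c):
>>
>>     case OVS_USERSPACE_ATTR_PID:
>>             upcall.portid = nla_get_u32(a);
>>             /* U32_MAX is what current userspace installs when per-cpu
>>              * dispatch is enabled; keep treating it as "use the
>>              * per-cpu socket". */
>>             if (upcall.portid == U32_MAX &&
>>                 dp->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU)
>>                     upcall.portid =
>>                             ovs_dp_get_upcall_portid(dp,
>>                                                      smp_processor_id());
>>             break;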
>>
> 
> Do you mean creating a new thread to process samples or using handlers?
> The latter would still have a performance impact, and the former would likely 
> fail to process all samples in a timely manner if there are many.
> 
> Besides, the current userspace tc-offloading series uses netlink broadcast with 
> psample; can't we do the same for non-offloaded actions? It enables building 
> external observability applications without overloading OVS.

Creating a separate thread solves the performance issue.  But you can
also write a separate application that would communicate its PID to the
running OVS daemon.  Let's say the same application that configures
sampling in the OVS database can also write a PID there.
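
For example (sketch, assuming libnl-genl; error handling omitted), the
collector only needs to learn the portid of its own netlink socket and
publish it, e.g. in the OVS database, so that ovs-vswitchd can install
it as OVS_USERSPACE_ATTR_PID:

    #include <stdio.h>
    #include <netlink/netlink.h>
    #include <netlink/genl/genl.h>

    int main(void)
    {
            struct nl_sock *sk = nl_socket_alloc();

            genl_connect(sk);  /* bind; the kernel assigns a portid */

            /* This is the portid the kernel would unicast samples to. */
            printf("pid=%u\n", nl_socket_get_local_port(sk));

            /* ... write the pid into OVSDB, then read samples from sk ... */
            return 0;
    }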

The thing is that the existence of an external application immediately
breaks the opacity of the arguments and forces us to define uAPI.
However, if there is an explicit communication channel between that
application and the OVS userspace daemon, then we can establish a
contract (the structure of the opaque values) between these two
userspace applications without defining that contract in the kernel
uAPI.  But if we're going with multicast, which anyone can subscribe
to, then we have to define that contract in the kernel uAPI.

Also, in order for this observability to work with the userspace datapath
we'll have to implement userspace-to-userspace netlink multicast (does
that even exist?).  Running the sample collection within OVS as a thread
will be much less painful.

One other thing worth mentioning is that the PID approach I suggested
is just a minor tweak of what is already supported in the kernel.  It
doesn't prohibit introducing a new action or a multicast group in the
future, while a premature uAPI definition may end up as another action
that nobody uses.  One can be added later if it ends up being actually
necessary.

Best regards, Ilya Maximets.

> 
> 
>> For the case without per-cpu dispatch, the feature comes for free
>> if the userspace application wants to use it.  However, there is
>> currently no supported version of OVS that doesn't use per-cpu
>> dispatch when available.
>>
>> What do you think?
>>
>> Best regards, Ilya Maximets.
>>
> 

