[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <874jf9jmi3.fsf@nvidia.com>
Date: Fri, 19 Jan 2024 17:28:36 +0100
From: Petr Machata <petrm@...dia.com>
To: Jamal Hadi Salim <jhs@...atatu.com>
CC: Petr Machata <petrm@...dia.com>, Petr Machata <me@...chata.org>, "Ido
Schimmel" <idosch@...sch.org>, Jiri Pirko <jiri@...nulli.us>,
<netdev@...r.kernel.org>, <kuba@...nel.org>, <pabeni@...hat.com>,
<davem@...emloft.net>, <edumazet@...gle.com>, <xiyou.wangcong@...il.com>,
<victor@...atatu.com>, <pctammela@...atatu.com>, <mleitner@...hat.com>,
<vladbu@...dia.com>, <paulb@...dia.com>
Subject: Re: [patch net-next] net: sched: move block device tracking into
tcf_block_get/put_ext()
Jamal Hadi Salim <jhs@...atatu.com> writes:
> On Tue, Jan 16, 2024 at 7:03 AM Petr Machata <petrm@...dia.com> wrote:
>>
>> Spectrum supports these mirror headers. They are sandwiched between the
>> encapsulation headers of the mirrored packets and the actual packet, and
>> tell you the RX and TX ports, what was the mirror reason, some
>> timestamps, etc. Without this, you need to encode the information into
>> like parts of IPv6 or what not.
>>
>> But it's a proprietary format. It could be expressed as a netdevice, or
>> maybe a special version in the ERSPAN netdevice. But there's no
>> standard, or public documentation, no open source tools that would
>> consume it. AFAIK. So adding code to ERSPAN to support it looks like a
>> non-starter.
>
> This is the kind of problem that P4 is well suited to solve. We are
> stuck with current kernel implementations and current standards.
> Spectrum can do so much more than the current ERSPAN standard
> provides. It can do so much more than the current kernel code can
> express. I am sure there are at least 5 people who want this feature
> we are talking about but status quo says the only choice to make this
> acceptable is to convince the masses (meaning likely years of chasing)
> and do the ERSPAN standard mods alongside kernel and user space
> implementation. And then in standards bodies like IEEE, IETF, etc you
> politik to death trying to get people to sign onto your proposal with
> +1s (sorry for the rant, but it is one reason i stopped going there).
> Alternatively, just have those 5 people write a P4 program in a very
> short time and not bother anybody else...
>
>> I was also toying with approaches around some push_header action, but it
>> all seemed too clumsy. Also unclear how to express things like
>> timestamps, port labels, mirror reasons, latency, queue depths... It's
>> very, very HW-oriented header.
>>
>
> So at one point Yotam Gigi was trying to use IFE for i think this,
> which makes sense since you can add arbitrary metadata after the
> ethernet header and transport it to a remote machine on the wire or
> terminate on local cpu. You may want to look at that again because he
I did look at IFE way back, but it looks like I should do that again.
> seemed to think it works closely to the h/w approach. Of course it
> suffers from the same "fixed implementation" so both the producer and
> consumer would have to be taught what those metadatum mean i.e the
> kernel and iproute2 code updates will be required. IIRC, the Cumulus
> people pushed back and converted this into traps coming out of the
> spectrum in order to make use of standardized tooling to avoid
> retraining (I think it was sflow as the consumer).
>
>> I suppose with P4 there might be a way for the user to describe the
>> encapsulation and for the driver to recognize it as the mirror header
>> and offload properly. With some squinting: frankly I don't see anybody
>> encoding things like queue depths, or anybody writing the driver code to
>> recognize that indeed that's what it is.
>
> You can define whatever metadata you want and make your datapath
> generate it. As long as you also have a P4 program on the consumer to
> decode it. My fingers cant resist, so let me show extract from a
> simple test program we have for P4TC which transports skb->mark
> between machines:
>
> // define the headers
> header customl2_t {
> bit<32> skbmark;
> bit<16> anothermetadatum
> bit<16> etherType;
> }
> ..
> struct headers_t {
> ethernet_t ethernet;
> customl2_t customl2;
> ipv4_t ip;
> //rest is considered payload
> }
> ...
So let's talk about the queue depth in particular. How would the driver
recognize the field? Does it need to "parse the parser" to figure it
out? Or use magic field names? Use EtherType to just know what is meant
by the header? I don't see how to express the idea in the abstract, for
the driver to recognize it and say, yeah, we can in fact offload this
program, because it the abstract description matches what the HW is
doing for mirror headers version XYZ.
> //the parser
> ....
> state parse_ethernet {
> pkt.extract(hdr.ethernet);
> transition select(hdr.ethernet.etherType) {
> ETHERTYPE_IPV4: parse_ipv4;
> ETHERTYPE_CUSTOML2: parse_customl2;
> default: reject;
> }
> }
> state parse_customl2 {
> pkt.extract(hdr.customl2);
> transition select(hdr.customl2.etherType) {
> ETHERTYPE_IPV4: parse_ipv4;
> default: reject;
> }
> }
> state parse_ipv4 {
> pkt.extract(hdr.ip);
> transition accept;
> }
>
> ...
>
> Then you have a match action table definition which in our case
> matches on dst ip, but you can make it match whatever you want. Add a
> couple of actions; one to push and another to pop the headers. Here's
> one that pushes (not showing the other part that pops and sets
> skb->mark):
>
> action push_custom(@tc_type("macaddr") bit<48> dmac, @tc_type("dev")
> PortId_t port)
> hdr.ethernet.dstAddr = dmac;
> hdr.customl2.etherType = hdr.ethernet.etherType;
> hdr.ethernet.etherType = ETHERTYPE_CUSTOML2;
> hdr.customl2.skbmark = meta.mark;
> hdr.customl2.setValid();
> send_to_port(port);
> }
>
> And at runtime:
> tc p4ctrl create mycustoml2/table/mytable dstAddr 10.99.0.1/32 \
> action push_custom param dmac 66:33:34:35:46:01 param port eno1
>
> And when hw supports it, you could just say skip_sw above..
> That's it - no need to argue about tomatos or tomateos on the mailing
> list for the next 6 months or longer.
OK, thanks for the primer, I'll try to carve out some time to look at it
more closely.
>> I'm not sure what semantics of mirroring to a qevent block are, but
>> beyond that, it shouldn't impact us. Such rule would be punted from HW
>> datapath, because there's no netdevice, and we demand a netdevice (plus
>> conditions on what the netdevice is allowed to be).
>
> Ok, for the hardware path i guess it's however you abstract it. But if
> someone did this in sw as such:
> --
> tc qdisc add dev ens10 egress_block 10 clsact
> tc qdisc add dev ens9 egress_block 10 clsact
> tc qdisc add dev ens8 egress_block 10 clsact
> tc filter add block 22 protocol ip pref 25 \
> matchall action mirred egress mirror blockid 10
> tc qdisc add dev eth0 parent 10:7 handle 107: red limit 1000000 min
> 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> early_drop block 22
> ---
> Then all of ens8-10 will get a copy of the packet on the qevent.
I meant the other way around. Say someone mirrors to blockid 22.
Does the packet go to eth0?
> > > Its a "fixed" ASIC, so it is expected. But: One should be able to
> > > express the Spectrum's ACLs or even the whole datapath as a P4 program
> >
> > Yeah, I think so, too.
> >
> > > and i dont see why it wouldnt work with P4TC. Matty has at least once
> > > in the past, if i am not mistaken, pitched such an idea.
> >
> > I don't see why it wouldn't work. What I'm saying is that at least the
> > ACL bits will just look fairly close to flower, because that's the HW we
> > are working with. And then the benefits of P4 are not as clear, because
> > flower is already here and also looks like flower.
>
> That's fine mostly because the ASIC doesnt change once it is cast. If
> however you want to expose a new field that the ASIC already supports,
> the problem with flower is it is a "fixed datapath". Adding another
> header requires changing the kernel (tc and flowdisector, driver) and
> iproute2..
>
> > On the upside, we would get more flexibility with different matching
> > approaches. Mixing different matchers is awkward (say flower + basic
> > might occasionally be useful), so there's this tendency to cram
> > everything into flower.
> >
>
> It's a good hammer and if a lot of things can be imagined to be nails,
> it works great ;->
>
> > I mentioned the mirror headers above, that's one area where TC just
> > doesn't have the tools we need.
> >
>
> Agreed - but with P4 you have a way out IMO.
Yeah, that's why I'm mentioning it. Those mirror headers are the closest
to where we would want... P4 or something P4-ish, to have the sort of
flexibility we need. Because the alternatives are non-starters.
(Though I'll check out IFE once more.)
> > > I believe you but will have to look to make sense. There's at least
> > > one attribute you mention above carried in some data structure in a
> > > TLV (if i am not mistaken it is queue size either in packet or bytes,
> > > depending on which fifo mode you are running). You are saying you cant
> > > add another one or a flag at least?
> >
> > Not backward-compatibly. The sole attribute's payload is interpreted as
> > follows:
> >
>
> Ok, just took a quick look..
> > - length<4? Bounce.
> > - length>=4? First u32 is queue limit, the rest is ignored garbage.
>
> I think you mean this part:
> struct tc_fifo_qopt *ctl = nla_data(opt);
>
> if (nla_len(opt) < sizeof(*ctl))
> return -EINVAL;
>
> sch->limit = ctl->limit;
>
> Yeah, cant stash a new attribute there unfortunately. TCA_OPTIONS only
> has tc_fifo_qopt. Would have been easier if TCA_OPTIONS was nested
> (like other qdiscs such as TBF) - then you could add a new TLV.
Yep.
> > So you have to stash any new stuff into the now-ignored garbage, thus
> > changing behavior. The I-think-safe approach that I mention above is
> > passing limit=0 and serializing the real attribute tree into the garbage
> > area. So if limit=0 and the garbage parses as an atrribute tree, use
> > that, otherwise it's really just limit of 0 and some garbage.
> >
> > I think this is safe, because the combination of limit=0 (already an
> > improbable configuration) and parseable garbage is unlikely to arise by
> > mistake. But it's kludgy.
> >
> > Or maybe the flag could be in the message header, but that seems wrong
> > given we are talking about extending one qdisc kind.
>
> I can see the dilema - and if i understood correctly what you are
> suggesting, something like:
>
> if (nla_len(opt) == sizeof(*ctl))
> do the old thing here
> else if (nla_len(opt) == sizeof(*mynewstruct))
> do the new thing here
> else
> invalid..
Basically this, but handle the case that a broken userspace is sending
payload such that (nla_len(opt) == sizeof(*mynewstruct)), but only
provides the first four bytes with the queue limit, and the rest is
garbage.
Powered by blists - more mailing lists