Message-ID: <87zfxbmp5i.fsf@nvidia.com>
Date: Thu, 11 Jan 2024 22:44:52 +0100
From: Petr Machata <me@...chata.org>
To: Jamal Hadi Salim <jhs@...atatu.com>
Cc: Petr Machata <petrm@...dia.com>, Ido Schimmel <idosch@...sch.org>, Jiri
Pirko <jiri@...nulli.us>, netdev@...r.kernel.org, kuba@...nel.org,
pabeni@...hat.com, davem@...emloft.net, edumazet@...gle.com,
xiyou.wangcong@...il.com, victor@...atatu.com, pctammela@...atatu.com,
mleitner@...hat.com, vladbu@...dia.com, paulb@...dia.com
Subject: Re: [patch net-next] net: sched: move block device tracking into
tcf_block_get/put_ext()

Jamal Hadi Salim <jhs@...atatu.com> writes:
> On Thu, Jan 11, 2024 at 11:55 AM Petr Machata <petrm@...dia.com> wrote:
>>
>>
>> Jamal Hadi Salim <jhs@...atatu.com> writes:
>>
>> > On Thu, Jan 11, 2024 at 10:40 AM Jamal Hadi Salim <jhs@...atatu.com> wrote:
>> >>
>> >> On Wed, Jan 10, 2024 at 7:10 AM Ido Schimmel <idosch@...sch.org> wrote:
>> >> >
>> >> > On Thu, Jan 04, 2024 at 01:58:44PM +0100, Jiri Pirko wrote:
>> >> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> >> > > index adf5de1ff773..253b26f2eddd 100644
>> >> > > --- a/net/sched/cls_api.c
>> >> > > +++ b/net/sched/cls_api.c
>> >> > > @@ -1428,6 +1428,7 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> >> > > struct tcf_block_ext_info *ei,
>> >> > > struct netlink_ext_ack *extack)
>> >> > > {
>> >> > > + struct net_device *dev = qdisc_dev(q);
>> >> > > struct net *net = qdisc_net(q);
>> >> > > struct tcf_block *block = NULL;
>> >> > > int err;
>> >> > > @@ -1461,9 +1462,18 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
>> >> > > if (err)
>> >> > > goto err_block_offload_bind;
>> >> > >
>> >> > > + if (tcf_block_shared(block)) {
>> >> > > + err = xa_insert(&block->ports, dev->ifindex, dev, GFP_KERNEL);
>> >> > > + if (err) {
>> >> > > + NL_SET_ERR_MSG(extack, "block dev insert failed");
>> >> > > + goto err_dev_insert;
>> >> > > + }
>> >> > > + }
>> >> >
>> >> > While this patch fixes the original issue, it creates another one:
>> >> >
>> >> > # ip link add name swp1 type dummy
>> >> > # tc qdisc replace dev swp1 root handle 10: prio bands 8 priomap 7 6 5 4 3 2 1
>> >> > # tc qdisc add dev swp1 parent 10:8 handle 108: red limit 1000000 min 200000 max 200001 probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
>> >> > RED: set bandwidth to 10Mbit
>> >> > # tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent early_drop block 10
>> >> > RED: set bandwidth to 10Mbit
>> >> > Error: block dev insert failed.
>> >> >
>> >>
>> >>
>> >> +cc Petr
>> >> We'll add a testcase on tdc - it doesn't seem we have any for qevents.
>> >> If you have others that are related, let us know.
>> >> But how does this work? I see no mention of block in the red code and I
>>
>> Look for qe_early_drop and qe_mark in sch_red.c.
>>
>
> I see it...
>
>> >> see no mention of block on the reproducer above.
>> >
>> > Context: Yes, I see it in the red setup, but I don't see any block being set up.
>>
>> qevents are binding locations for blocks, similar in principle to
>> clsact's ingress_block / egress_block. So the way to create a block is
>> the same: just mention the block number for the first time.
>>
>> What qevents there are depends on the qdisc. They are supposed to
>> reflect events that are somehow interesting, from the point of view of
>> an skb within a qdisc. Thus RED has two qevents: early_drop for packets
>> that were chosen to be, well, dropped early, and mark for packets that
>> are ECN-marked. So when a packet is, say, early-dropped, the RED qdisc
>> passes it through the TC block bound at that qevent (if any).
>>
>
> Ok, the confusing part was the missing block command. I am assuming in
> addition to Ido's example one would need to create block 10 and then
> attach a filter to it?
> Something like:
> tc qdisc add dev swp1 parent 10:7 handle 107: red limit 1000000 min
> 500000 max 500001 probability 1.0 avpkt 8000 burst 63 qevent
> early_drop block 10
> tc filter add block 10 ...
Yes, correct.
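
So a full software-only sequence would look something like this (the
device name and numbers are only illustrative; mentioning block 10 on
the qevent is what creates the block):

ip link add name swp1 type dummy
tc qdisc add dev swp1 root handle 107: red limit 1000000 min 500000 \
	max 500001 probability 1.0 avpkt 8000 burst 63 \
	qevent early_drop block 10
tc filter add block 10 pref 1 matchall action drop
tc -s filter show block 10

Incidentally, binding the same block at two qevents on the same device,
as in Ido's reproducer, is presumably exactly what trips the quoted
xa_insert() above: the second bind passes the same dev->ifindex as key,
and xa_insert() returns -EBUSY for an index that is already present,
hence "block dev insert failed".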
> So a packet tagged for early drop will end up being processed in some
> filter chain with some specified actions. So in the case of offload,
> does it mean early drops will be sent to the kernel and land at the
> specific chain? Also trying to understand (in retrospect, not armchair
For offload, the idea is that the silicon is configured to do whatever
the user configures at the qevent.

In the particular case of mlxsw, we only permit adding one chain with
one filter, which has to be a matchall, and the action has to be either
a mirred or a trap, plus it needs to have hw_stats disabled. The HW is
simply limited in what it can do in that context.
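
Concretely, the only rule shape mlxsw accepts there is the one the
selftests quoted further down use, e.g.:

tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
	action trap hw_stats disabled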
> lawyering): why was a block necessary? feels like the goto chain
> action could have worked, no? i.e something like: qevent early_drop
> goto chain x.. Is the block perhaps tied to something in the h/w or is
> it just some clever metainfo that is used to jump to tc block when the
> exceptions happen?
So yeah, blocks are super fancy compared to what the HW can actually do.

The initial idea was to have a single action, but then what if the HW
can do more than that? And qdiscs still exist as SW entities obviously,
so why limit ourselves to a single action in SW? OK, so maybe we make
that one action a goto chain, where the real stuff is, but where would
it look the chain up? On ingress? Egress? And even if we settle on
where to look it up, you end up with totally unhinged actions on an
ingress / egress qdisc that are there just as jump targets of some
qevent. Plus the set of actions permissible on ingress / egress can be
arbitrarily different from the set of actions permissible on a qevent,
which makes the validation in an offloading driver difficult. And chain
reuse is really not a thing in Linux TC, so keeping a chain on its own
seems wrong. Plus the semantics of goto chain are still unclear in that
scenario.

Blocks have no such issues, they are self-contained. They are heavy
compared to what we need, true. But that's not really an issue -- the
driver can bounce invalid configuration just fine. And there's no risk
of making another driver or SW datapath user unhappy because their set
of constraints is something we didn't anticipate. Conceptually it's
also cleaner than having just one action / one rule / one chain,
because you can point at e.g. ingress_block and say: this is the same,
except instead of all ingress packets, only those that are
early-dropped are passed through.
BTW, newer Spectrum chips actually allow (some?) ACL matching to run in
the qevent context, so we may end up relaxing the matchall requirement
in the future and do a more complex offload of qevents.
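
If that relaxation happens, one could imagine qevent rules like the
following becoming offloadable (purely hypothetical; flower matching
on a qevent block is not something mlxsw accepts today):

tc filter add block 10 protocol ip pref 1 flower skip_sw \
	dst_ip 192.0.2.0/24 action trap hw_stats disabled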
> Important thing is we need tests so we can catch these regressions in
> the future. If you can, point me to some (outside of the ones Ido
> posted) and we'll put them on tdc.
We just have the following; pretty sure that's where Ido's came from:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_root.sh
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh
Which is doing this (or s/early_drop/mark/ instead):
tc qdisc add dev $swp3 parent 1: handle 108: red \
limit 1000000 min $BACKLOG max $((BACKLOG + 1)) \
probability 1.0 avpkt 8000 burst 38 qevent early_drop block 10
And then installing a rule like one of these:
tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
action mirred egress mirror dev $swp2 hw_stats disabled
tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
action trap hw_stats disabled
tc filter add block 10 pref 1234 handle 102 matchall skip_sw \
action trap_fwd hw_stats disabled
Then it runs traffic and checks that the right amount gets mirrored or trapped.
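
For a SW-only tdc variant the checks could lean on RED's own counters
instead; something like this (illustrative, using the dummy device from
Ido's reproducer):

tc -s qdisc show dev swp1	# red reports marked/early/pdrop counts
tc -s filter show block 10	# per-filter hit counts in the SW datapath

i.e. compare how many packets red claims to have early-dropped or
marked against how many the block's filters saw.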
>> > Also: Is it only RED, or could other qdiscs behave this way?
>>
>> Currently only red supports any qevents at all, but in principle the
>> mechanism is reusable. With my mlxsw hat on, an obvious next candidate
>> would be tail_drop on FIFO qdisc.
>
> Sounds cool. I can see use even for a s/w-only datapath.
FIFO is tricky to extend BTW. I wrote some patches way back before the
work got back-burnered. The trouble is that the only option payloads
the kernel currently bounces are those shorter than struct
tc_fifo_qopt, i.e. under 4 bytes. Everything else is interpreted to
mean something: the first 4 bytes become the limit, and any extra
garbage is silently ignored, etc. Fun fun.