netdev - Re: Matching on DSCP with IPv4 FIB rules

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZpkVIE1Pod1jrgsc@shredder.mtl.com>
Date: Thu, 18 Jul 2024 16:14:08 +0300
From: Ido Schimmel <idosch@...dia.com>
To: Guillaume Nault <gnault@...hat.com>
Cc: netdev@...r.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules

On Thu, Jul 04, 2024 at 08:19:05PM +0200, Guillaume Nault wrote:
> On Thu, Jun 27, 2024 at 10:54:29PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, thanks for the detailed response!
> > 
> > On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> > > 
> > > Could removing the high order bits mask actually _be_ an option? I was
> > > worried about behaviour change when I started looking into this. But,
> > > with time, I'm more and more thinking about just removing the mask.
> > > 
> > > Here are the reasons why:
> > > 
> > >   * DSCP deprecated the Precedence/TOS bits separation more than
> > >     25 years ago. I've never heard of anyone trying to use the high
> > >     order bits as Preference, while we've had several reports of people
> > >     using (or trying to use) the full DSCP bit range.
> > >     Also, I far as I know, Linux never offered any way to interpret the
> > >     high order bits as Precedence (apart from parsing these bits
> > >     manually with u32 or BPF, but these use cases wouldn't be affected
> > >     if we decided to use the full DSCP bit range in core IPv4 code).
> > > 
> > >   * Ignoring the high order bits creates useless inconsistencies
> > >     between the IPv4 and IPv6 code, while current RFCs make no such
> > >     distinction.
> > > 
> > >   * Even the IPv4 implementation isn't self consistent. While most
> > >     route lookups are done with the high order bits cleared, some parts
> > >     of the code explicitly use the full DSCP bit range.
> > > 
> > >   * In the past, people have sent patches to mask the high order DSCP
> > >     bits and created regressions because of that. People seem to use
> > 
> > By "patches" you mean IPv6 patches?
> 
> Not necessarily. I had the following case in mind:
> https://lore.kernel.org/netdev/20200805024131.2091206-1-liuhangbin@gmail.com/
> 
> I'm pretty sure this revert came after someone complained that setting
> the high order DSCP bits stopped working in VXLAN. But I haven't
> managed to find the original report.
> 
> But there has been fixes for IPv6 too:
> https://lore.kernel.org/netdev/20220805191906.9323-1-matthias.may@westermo.com/
> 
> > >   * It would indeed be a behaviour change to make "tos 0x1c" exactly
> > >     match "0x1c". But I'd be surprised if people really expected "0x1c"
> > >     to actually match "0xfc". Also currently one can set "tos 0x1f" in
> > >     routes, but no packet will ever match. That's probably not
> > >     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
> > >     mean "0x1f" would simplify everyone's life I believe.
> > 
> > Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
> > "tos 0x1f" due to ECN bits being set.
> 
> Yes, 0xfc. I don't know what I had in mind when I wrote 0x1f. That
> value clearly doesn't make any sense in this context.
> 
> > I agree with everything you wrote except the assumption about users'
> > expectations. I honestly do not know if some users are relying on "tos
> > 0x1c" to also match "0xfc", but I am not really interested in finding
> > out especially when undoing the change is not that easy. However, I have
> > another suggestion that might work which seems like a middle ground
> > between both approaches:
> > 
> > 1. Extending the IPv4 flow information structure with a new 'dscp_t'
> > field (e.g., 'struct flowi4::dscp') and initializing it with the full
> > DSCP value throughout the stack. Already did this for all the places
> > where 'flowi4_tos' initialized other than flowi4_init_output() which is
> > next on my list.
> > 
> > 2. Keeping the existing semantics of the "tos" keyword in ip-rule and
> > ip-route to match on the three lower DSCP bits, but changing the IPv4
> > functions that match on 'flowi4_tos' (fib_select_default,
> > fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
> > field with a mask. For example, in fib4_rule_match(), instead of:
> > 
> > if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> > 
> > We will have:
> > 
> > if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))
> 
> So, do you mean to centralise the effect of all the current RT_TOS()
> calls inside a few functions? So that we can eventually remove all
> those RT_TOS() calls later?

Yes. See this patch:

https://github.com/idosch/linux/commit/80f536f629c4dccdb6c015a10ca25d7743233208.patch

I can send it when net-next opens. It should allow us to start removing
the masking of the high order DSCP bits without introducing regressions
as the masking now happens at the core and not at individual call sites
or along the path to the core.

> 
> > I was only able to find two call paths that can reach these functions
> > with a TOS value that does not have its three upper DSCP bits masked:
> > 
> > nl_fib_input()
> > 	nl_fib_lookup()
> > 		flowi4_tos = frn->fl_tos	// Directly from user space
> > 		fib_table_lookup()
> > 
> > nft_fib4_eval()
> > 	flowi4_tos = iph->tos & DSCP_BITS
> > 	fib_lookup()
> > 
> > The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> > quite certain nobody is using and the second belongs to netfilter's fib
> > expression.
> 
> I agree that nl_fib_input() probably doesn't matter.
> 
> For nft_fib4_eval() it really looks like the current behaviour is
> intended. And even though it's possible that nobody currently relies on
> it, I think it's the correct one. So I don't really feel like changing
> it.

Yes, I agree. The patch I mentioned takes care of that by setting the
new 'FLOWI_FLAG_MATCH_FULL_DSCP' in nft_fib4_eval().

> 
> > If regressions are reported for any of them (unlikely, IMO), we can add
> > a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> > tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> > 
> > 3. Removing 'struct flowi4::flowi4_tos'.
> > 
> > 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> > matching "dscp" keyword in iproute2 that accepts values in the range of
> > [0, 63] which both address families will support. IPv4 will support it
> > via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> > using the existing flow label field ('struct flowi6::flowlabel').
> 
> I'm sorry, something isn't clear to me. Since masking the high order
> bits has been centralised at step 2, how will you match them?
> 
> If we continue to take fib4_rule_match() as an example; do you mean to
> extend struct fib4_rule to store the extra information, so that
> fib4_rule_match() knows how to test fl4->dscp? For example:

Yes. See these patches:

https://github.com/idosch/linux/commit/1a79fb59f66731cfc891e3fecb3b08cda6bb0170.patch
https://github.com/idosch/linux/commit/a4990aab8ee4866b9f853777a50de09537255d67.patch
https://github.com/idosch/linux/commit/7328d60b7cfe2b07b2d565c9af36f650e96552a5.patch
https://github.com/idosch/linux/commit/73a739735d27bef613813f0ac0a9280e6427264d.patch

Specifically these hunks from the second patch:

@@ -37,6 +37,7 @@ struct fib4_rule {
 	u8			dst_len;
 	u8			src_len;
 	dscp_t			dscp;
+	u8			is_dscp_sel:1;	/* DSCP or TOS selector */
 	__be32			src;
 	__be32			srcmask;
 	__be32			dst;
@@ -186,7 +187,14 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule,
 	    ((daddr ^ r->dst) & r->dstmask))
 		return 0;
 
-	if (r->dscp && !fib_dscp_match(r->dscp, fl4))
+	/* When DSCP selector is used we need to match on the entire DSCP field
+	 * in the flow information structure. When TOS selector is used we need
+	 * to mask the upper three DSCP bits prior to matching to maintain
+	 * legacy behavior.
+	 */
+	if (r->is_dscp_sel && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
+		return 0;
+	else if (!r->is_dscp_sel && r->dscp && !fib_dscp_match(r->dscp, fl4))
 		return 0;
 
 	if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto))

Note that it is just an RFC. I first need to remove the masking of the
high order DSCP bits before I can send it.

> 
>     /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
>      * would be 0x1c to keep the old behaviour.
>      */
>     if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))

It's a bit more involved. Even if the old TOS selector is used, we don't
always want to mask using 0x1c. If nft_fib4_eval() filled 0xfc in
'flowi4_tos', then by masking using 0x1c it will now match a FIB rule
that was configured with 'tos 0x1c' whereas previously it didn't. The
new 'FLOWI_FLAG_MATCH_FULL_DSCP' takes care of that, but it only applies
to rules configured with the TOS selector. The new DSCP selector will
always match against the entire DSCP field.

>         return 0;
> 
> > The kernel will reject rules that are configured with both "tos" and
> > "dscp".
> > 
> > I do not want to add a similar keyword to ip-route because I have no use
> > case for it and if we add it now we will never be able to remove it. It
> > can always be added later without too much effort.
> > 
> > > 
> > > > Instead, I was thinking of extending the IPv4 flow information structure
> > > > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > > > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > > > range of [0, 63] which both address families will support. This will
> > > > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > > > regards to DSCP matching, without affecting existing use cases.
> > > 
> > > If removing the high order bits mask really isn't feasible, then yes,
> > > that'd probably be our best option. But this would make both the code
> > > and the UAPI more complex. Also we'd have to take care of corner cases
> > > (when both TOS and DSCP are set) and make the same changes to IPv4
> > > routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> > > 
> > > Dropping the high order bits mask, on the other hand, would make
> > > everything consistent and would simplifies both the code and the user
> > > experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> > > (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> > > case really exist.
> > 
> > Whether use cases like that exist or not is the main issue I have with
> > the removal of the high order bits mask. The advantage of this approach
> > is that no new uAPI is required, but the disadvantage is that there is a
> > potential for regressions without an easy mitigation.
> > 
> > I believe that with the approach I outlined above the potential for
> > regressions is lower and we should have a way to mitigate them if/when
> > reported. The disadvantage is that we need to introduce a new "dscp"
> > keyword and a new netlink attribute.
> 
> What I'd really like is to stop the proliferation of RT_TOS() and to
> get consistent behaviour. If the new "dscp" option allows that, while
> still allowing to simplify the implementation, then I'm fine with it.

Yes, I believe the new "dscp" option gets us there.

> 
> > > Side note: I'm actually working on a series to start converting
> > > flowi4_tos to dscp_t. I should have a first patch set ready soon
> > > (converting only a few places). But, I'm keeping the old behaviour of
> > > clearing the 3 high order bits for now (these are just two separate
> > > topics).
> > 
> > I will be happy to review, but I'm not sure what you mean by "converting
> > only a few places". How does it work?
> 
> My idea is to go through the functions that uses ->flowi4_tos one by
> one. I convert their u8 variables to dscp_t, but keep ->flowi4_tos a
> __u8 field for the moment. For example:
> 
> -void my_func(__u8 tos, ...)
> -void my_func(dscp_t dscp, ...)
>  {
>      ...
> -    fl4.flowi4_tos = tos;
> -    fl4.flowi4_tos = inet_dscp_to_dsfield(dscp);
>      ...
>  }
> 
> Of course, the whole call chain should be converted, until the function
> that reads the value from a packet header or from user space:
> 
>  void another_func(const struct iphdr *ip4h)
>  {
> -    __u8 tos = ip4h->tos;
> +    dscp_t dscp = inet_dsfield_to_dscp(ip4h->tos);
>      ...
> -    my_func(tos, ...);
> +    my_func(dscp, ...);
>      ...
>  }
> 
> Depending on how complex is the call chain, I introduce intermediate
> u8/dscp_t conversions, to keep patches simple.
> 
> The idea is to eventually have inet_dsfield_to_dscp() conversions at
> the boundaries of the kernel, and to have temporary internal
> inet_dscp_to_dsfield() calls when interacting with ->flowi4_tos.
> 
> Once all ->flowi4_tos users will actually use a dscp_t value, then
> I'll convert this field from __u8 to dscp_t and remove all the extra
> inet_dscp_to_dsfield() calls. That last patch will have to touch many
> places at once, but, by renaming ->flowi4_tos to ->flowi4_dscp, we can
> rely on the compiler and on sparse to ensure that every place has been
> taken care of. Also, that final patch should be easy to review as it
> should mostly consist of chunks like:
> 
> -    fl4->flowi4_tos = inet_dscp_to_dsfield(dscp);
> +    fl4->flowi4_dscp = dscp;
> 
> But I'm not there yet ;).

OK, I see, thanks for explaining.

Are you OK with the approach that I outlined above? Basically:

1. Submitting the patch that centralizes TOS matching
2. Removing the masking of the high order DSCP bits
3. Adding new DSCP selector for FIB rules

If you already have some patches that convert to 'dscp_t', then I can
work on top of them to avoid conflicts.

> 
> Note that I'm going to be offline for a bit more than a week. I'll
> catch up on this topic after I get back online.
> 
> > > I can allocate more time on the dscp_t conversion and work/help with
> > > removing the high order bits mask if there's interest in this option.
> > > 
> > > > Thanks
> > > > 
> > > > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > > > 
> > > 
> > 
>