netdev - Re: Matching on DSCP with IPv4 FIB rules

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zobnma+cQPMhIlSy@debian>
Date: Thu, 4 Jul 2024 20:19:05 +0200
From: Guillaume Nault <gnault@...hat.com>
To: Ido Schimmel <idosch@...dia.com>
Cc: netdev@...r.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules

On Thu, Jun 27, 2024 at 10:54:29PM +0300, Ido Schimmel wrote:
> Hi Guillaume, thanks for the detailed response!
> 
> On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> > 
> > Could removing the high order bits mask actually _be_ an option? I was
> > worried about behaviour change when I started looking into this. But,
> > with time, I'm more and more thinking about just removing the mask.
> > 
> > Here are the reasons why:
> > 
> >   * DSCP deprecated the Precedence/TOS bits separation more than
> >     25 years ago. I've never heard of anyone trying to use the high
> >     order bits as Preference, while we've had several reports of people
> >     using (or trying to use) the full DSCP bit range.
> >     Also, I far as I know, Linux never offered any way to interpret the
> >     high order bits as Precedence (apart from parsing these bits
> >     manually with u32 or BPF, but these use cases wouldn't be affected
> >     if we decided to use the full DSCP bit range in core IPv4 code).
> > 
> >   * Ignoring the high order bits creates useless inconsistencies
> >     between the IPv4 and IPv6 code, while current RFCs make no such
> >     distinction.
> > 
> >   * Even the IPv4 implementation isn't self consistent. While most
> >     route lookups are done with the high order bits cleared, some parts
> >     of the code explicitly use the full DSCP bit range.
> > 
> >   * In the past, people have sent patches to mask the high order DSCP
> >     bits and created regressions because of that. People seem to use
> 
> By "patches" you mean IPv6 patches?

Not necessarily. I had the following case in mind:
https://lore.kernel.org/netdev/20200805024131.2091206-1-liuhangbin@gmail.com/

I'm pretty sure this revert came after someone complained that setting
the high order DSCP bits stopped working in VXLAN. But I haven't
managed to find the original report.

But there has been fixes for IPv6 too:
https://lore.kernel.org/netdev/20220805191906.9323-1-matthias.may@westermo.com/

> >   * It would indeed be a behaviour change to make "tos 0x1c" exactly
> >     match "0x1c". But I'd be surprised if people really expected "0x1c"
> >     to actually match "0xfc". Also currently one can set "tos 0x1f" in
> >     routes, but no packet will ever match. That's probably not
> >     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
> >     mean "0x1f" would simplify everyone's life I believe.
> 
> Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
> "tos 0x1f" due to ECN bits being set.

Yes, 0xfc. I don't know what I had in mind when I wrote 0x1f. That
value clearly doesn't make any sense in this context.

> I agree with everything you wrote except the assumption about users'
> expectations. I honestly do not know if some users are relying on "tos
> 0x1c" to also match "0xfc", but I am not really interested in finding
> out especially when undoing the change is not that easy. However, I have
> another suggestion that might work which seems like a middle ground
> between both approaches:
> 
> 1. Extending the IPv4 flow information structure with a new 'dscp_t'
> field (e.g., 'struct flowi4::dscp') and initializing it with the full
> DSCP value throughout the stack. Already did this for all the places
> where 'flowi4_tos' initialized other than flowi4_init_output() which is
> next on my list.
> 
> 2. Keeping the existing semantics of the "tos" keyword in ip-rule and
> ip-route to match on the three lower DSCP bits, but changing the IPv4
> functions that match on 'flowi4_tos' (fib_select_default,
> fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
> field with a mask. For example, in fib4_rule_match(), instead of:
> 
> if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
> 
> We will have:
> 
> if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

So, do you mean to centralise the effect of all the current RT_TOS()
calls inside a few functions? So that we can eventually remove all
those RT_TOS() calls later?

> I was only able to find two call paths that can reach these functions
> with a TOS value that does not have its three upper DSCP bits masked:
> 
> nl_fib_input()
> 	nl_fib_lookup()
> 		flowi4_tos = frn->fl_tos	// Directly from user space
> 		fib_table_lookup()
> 
> nft_fib4_eval()
> 	flowi4_tos = iph->tos & DSCP_BITS
> 	fib_lookup()
> 
> The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
> quite certain nobody is using and the second belongs to netfilter's fib
> expression.

I agree that nl_fib_input() probably doesn't matter.

For nft_fib4_eval() it really looks like the current behaviour is
intended. And even though it's possible that nobody currently relies on
it, I think it's the correct one. So I don't really feel like changing
it.

> If regressions are reported for any of them (unlikely, IMO), we can add
> a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
> tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
> 
> 3. Removing 'struct flowi4::flowi4_tos'.
> 
> 4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
> matching "dscp" keyword in iproute2 that accepts values in the range of
> [0, 63] which both address families will support. IPv4 will support it
> via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
> using the existing flow label field ('struct flowi6::flowlabel').

I'm sorry, something isn't clear to me. Since masking the high order
bits has been centralised at step 2, how will you match them?

If we continue to take fib4_rule_match() as an example; do you mean to
extend struct fib4_rule to store the extra information, so that
fib4_rule_match() knows how to test fl4->dscp? For example:

    /* Assuming FRA_DSCP sets ->dscp_mask to 0xff while the default
     * would be 0x1c to keep the old behaviour.
     */
    if (r->dscp && r->dscp != (fl4->dscp & r->dscp_mask))
        return 0;

> The kernel will reject rules that are configured with both "tos" and
> "dscp".
> 
> I do not want to add a similar keyword to ip-route because I have no use
> case for it and if we add it now we will never be able to remove it. It
> can always be added later without too much effort.
> 
> > 
> > > Instead, I was thinking of extending the IPv4 flow information structure
> > > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > > range of [0, 63] which both address families will support. This will
> > > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > > regards to DSCP matching, without affecting existing use cases.
> > 
> > If removing the high order bits mask really isn't feasible, then yes,
> > that'd probably be our best option. But this would make both the code
> > and the UAPI more complex. Also we'd have to take care of corner cases
> > (when both TOS and DSCP are set) and make the same changes to IPv4
> > routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> > 
> > Dropping the high order bits mask, on the other hand, would make
> > everything consistent and would simplifies both the code and the user
> > experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> > (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> > case really exist.
> 
> Whether use cases like that exist or not is the main issue I have with
> the removal of the high order bits mask. The advantage of this approach
> is that no new uAPI is required, but the disadvantage is that there is a
> potential for regressions without an easy mitigation.
> 
> I believe that with the approach I outlined above the potential for
> regressions is lower and we should have a way to mitigate them if/when
> reported. The disadvantage is that we need to introduce a new "dscp"
> keyword and a new netlink attribute.

What I'd really like is to stop the proliferation of RT_TOS() and to
get consistent behaviour. If the new "dscp" option allows that, while
still allowing to simplify the implementation, then I'm fine with it.

> > Side note: I'm actually working on a series to start converting
> > flowi4_tos to dscp_t. I should have a first patch set ready soon
> > (converting only a few places). But, I'm keeping the old behaviour of
> > clearing the 3 high order bits for now (these are just two separate
> > topics).
> 
> I will be happy to review, but I'm not sure what you mean by "converting
> only a few places". How does it work?

My idea is to go through the functions that uses ->flowi4_tos one by
one. I convert their u8 variables to dscp_t, but keep ->flowi4_tos a
__u8 field for the moment. For example:

-void my_func(__u8 tos, ...)
-void my_func(dscp_t dscp, ...)
 {
     ...
-    fl4.flowi4_tos = tos;
-    fl4.flowi4_tos = inet_dscp_to_dsfield(dscp);
     ...
 }

Of course, the whole call chain should be converted, until the function
that reads the value from a packet header or from user space:

 void another_func(const struct iphdr *ip4h)
 {
-    __u8 tos = ip4h->tos;
+    dscp_t dscp = inet_dsfield_to_dscp(ip4h->tos);
     ...
-    my_func(tos, ...);
+    my_func(dscp, ...);
     ...
 }

Depending on how complex is the call chain, I introduce intermediate
u8/dscp_t conversions, to keep patches simple.

The idea is to eventually have inet_dsfield_to_dscp() conversions at
the boundaries of the kernel, and to have temporary internal
inet_dscp_to_dsfield() calls when interacting with ->flowi4_tos.

Once all ->flowi4_tos users will actually use a dscp_t value, then
I'll convert this field from __u8 to dscp_t and remove all the extra
inet_dscp_to_dsfield() calls. That last patch will have to touch many
places at once, but, by renaming ->flowi4_tos to ->flowi4_dscp, we can
rely on the compiler and on sparse to ensure that every place has been
taken care of. Also, that final patch should be easy to review as it
should mostly consist of chunks like:

-    fl4->flowi4_tos = inet_dscp_to_dsfield(dscp);
+    fl4->flowi4_dscp = dscp;

But I'm not there yet ;).

Note that I'm going to be offline for a bit more than a week. I'll
catch up on this topic after I get back online.

> > I can allocate more time on the dscp_t conversion and work/help with
> > removing the high order bits mask if there's interest in this option.
> > 
> > > Thanks
> > > 
> > > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > > 
> > 
>