netdev - Re: Matching on DSCP with IPv4 FIB rules

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zn3DdfGZIVBxN0DR@shredder.lan>
Date: Thu, 27 Jun 2024 22:54:29 +0300
From: Ido Schimmel <idosch@...dia.com>
To: Guillaume Nault <gnault@...hat.com>
Cc: netdev@...r.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules

Hi Guillaume, thanks for the detailed response!

On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, everyone,
> 
> Hi Ido, thanks for reaching out,
> 
> > We have users that would like to direct traffic to a routing table based
> > on the DSCP value in the IP header. While this can be done using IPv6
> > FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> > allows such rules to match on the three TOS bits from RFC 791 (lower
> > three DSCP bits). See more info in Guillaume's excellent presentation
> > here [1].
> > 
> > Extending IPv4 FIB rules to match on DSCP is not easy because of how
> > inconsistently the TOS field in the IPv4 flow information structure
> > (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> > throughout the networking stack.
> > 
> > Redefining the field using 'dscp_t' and removing the masking of the
> > upper three DSCP bits is not an option as it will change existing
> > behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> > will no longer match a FIB rule that matches on 'tos 0x1c'.
> 
> Could removing the high order bits mask actually _be_ an option? I was
> worried about behaviour change when I started looking into this. But,
> with time, I'm more and more thinking about just removing the mask.
> 
> Here are the reasons why:
> 
>   * DSCP deprecated the Precedence/TOS bits separation more than
>     25 years ago. I've never heard of anyone trying to use the high
>     order bits as Preference, while we've had several reports of people
>     using (or trying to use) the full DSCP bit range.
>     Also, I far as I know, Linux never offered any way to interpret the
>     high order bits as Precedence (apart from parsing these bits
>     manually with u32 or BPF, but these use cases wouldn't be affected
>     if we decided to use the full DSCP bit range in core IPv4 code).
> 
>   * Ignoring the high order bits creates useless inconsistencies
>     between the IPv4 and IPv6 code, while current RFCs make no such
>     distinction.
> 
>   * Even the IPv4 implementation isn't self consistent. While most
>     route lookups are done with the high order bits cleared, some parts
>     of the code explicitly use the full DSCP bit range.
> 
>   * In the past, people have sent patches to mask the high order DSCP
>     bits and created regressions because of that. People seem to use

By "patches" you mean IPv6 patches?

>     the RT_TOS() macro on whatever "tos" variable they see, without
>     really understanding the consequences. I think we'd be better off
>     without RT_TOS() and the various IPTOS_* variants, so people
>     wouldn't be tempted to copy/pasting such code.
> 
>   * It would indeed be a behaviour change to make "tos 0x1c" exactly
>     match "0x1c". But I'd be surprised if people really expected "0x1c"
>     to actually match "0xfc". Also currently one can set "tos 0x1f" in
>     routes, but no packet will ever match. That's probably not
>     something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
>     mean "0x1f" would simplify everyone's life I believe.

Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
"tos 0x1f" due to ECN bits being set.

I agree with everything you wrote except the assumption about users'
expectations. I honestly do not know if some users are relying on "tos
0x1c" to also match "0xfc", but I am not really interested in finding
out especially when undoing the change is not that easy. However, I have
another suggestion that might work which seems like a middle ground
between both approaches:

1. Extending the IPv4 flow information structure with a new 'dscp_t'
field (e.g., 'struct flowi4::dscp') and initializing it with the full
DSCP value throughout the stack. Already did this for all the places
where 'flowi4_tos' initialized other than flowi4_init_output() which is
next on my list.

2. Keeping the existing semantics of the "tos" keyword in ip-rule and
ip-route to match on the three lower DSCP bits, but changing the IPv4
functions that match on 'flowi4_tos' (fib_select_default,
fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
field with a mask. For example, in fib4_rule_match(), instead of:

if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))

We will have:

if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))

I was only able to find two call paths that can reach these functions
with a TOS value that does not have its three upper DSCP bits masked:

nl_fib_input()
	nl_fib_lookup()
		flowi4_tos = frn->fl_tos	// Directly from user space
		fib_table_lookup()

nft_fib4_eval()
	flowi4_tos = iph->tos & DSCP_BITS
	fib_lookup()

The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
quite certain nobody is using and the second belongs to netfilter's fib
expression.

If regressions are reported for any of them (unlikely, IMO), we can add
a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.

3. Removing 'struct flowi4::flowi4_tos'.

4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
matching "dscp" keyword in iproute2 that accepts values in the range of
[0, 63] which both address families will support. IPv4 will support it
via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
using the existing flow label field ('struct flowi6::flowlabel').

The kernel will reject rules that are configured with both "tos" and
"dscp".

I do not want to add a similar keyword to ip-route because I have no use
case for it and if we add it now we will never be able to remove it. It
can always be added later without too much effort.

> 
> > Instead, I was thinking of extending the IPv4 flow information structure
> > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > range of [0, 63] which both address families will support. This will
> > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > regards to DSCP matching, without affecting existing use cases.
> 
> If removing the high order bits mask really isn't feasible, then yes,
> that'd probably be our best option. But this would make both the code
> and the UAPI more complex. Also we'd have to take care of corner cases
> (when both TOS and DSCP are set) and make the same changes to IPv4
> routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
> 
> Dropping the high order bits mask, on the other hand, would make
> everything consistent and would simplifies both the code and the user
> experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> case really exist.

Whether use cases like that exist or not is the main issue I have with
the removal of the high order bits mask. The advantage of this approach
is that no new uAPI is required, but the disadvantage is that there is a
potential for regressions without an easy mitigation.

I believe that with the approach I outlined above the potential for
regressions is lower and we should have a way to mitigate them if/when
reported. The disadvantage is that we need to introduce a new "dscp"
keyword and a new netlink attribute.

> 
> > Adding the new field and initializing it correctly throughout the stack
> > is not a small undertaking so I was wondering a) Are you OK with the
> > suggested approach? b) If not, what else would you suggest?
> 
> Sorry for the long text, but I think you have my opinion now.
> And yes, whatever the option, this is going to be a long task.

Yes :(

> 
> Side note: I'm actually working on a series to start converting
> flowi4_tos to dscp_t. I should have a first patch set ready soon
> (converting only a few places). But, I'm keeping the old behaviour of
> clearing the 3 high order bits for now (these are just two separate
> topics).

I will be happy to review, but I'm not sure what you mean by "converting
only a few places". How does it work?

> 
> I can allocate more time on the dscp_t conversion and work/help with
> removing the high order bits mask if there's interest in this option.
> 
> > Thanks
> > 
> > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> > 
>