netdev - Re: Matching on DSCP with IPv4 FIB rules

Message-ID: <ZnypieBfn3CxCGDq@debian>
Date: Thu, 27 Jun 2024 01:51:37 +0200
From: Guillaume Nault <gnault@...hat.com>
To: Ido Schimmel <idosch@...dia.com>
Cc: netdev@...r.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules

On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> Hi Guillaume, everyone,

Hi Ido, thanks for reaching out,

> We have users that would like to direct traffic to a routing table based
> on the DSCP value in the IP header. While this can be done using IPv6
> FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> allows such rules to match on the three TOS bits from RFC 791 (lower
> three DSCP bits). See more info in Guillaume's excellent presentation
> here [1].
> 
> Extending IPv4 FIB rules to match on DSCP is not easy because of how
> inconsistently the TOS field in the IPv4 flow information structure
> (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> throughout the networking stack.
> 
> Redefining the field using 'dscp_t' and removing the masking of the
> upper three DSCP bits is not an option as it will change existing
> behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> will no longer match a FIB rule that matches on 'tos 0x1c'.

Could removing the high order bits mask actually _be_ an option? I was
worried about behaviour change when I started looking into this. But,
with time, I'm more and more thinking about just removing the mask.

Here are the reasons why:

  * DSCP deprecated the Precedence/TOS bits separation more than
    25 years ago. I've never heard of anyone trying to use the high
    order bits as Precedence, while we've had several reports of people
    using (or trying to use) the full DSCP bit range.
    Also, as far as I know, Linux never offered any way to interpret the
    high order bits as Precedence (apart from parsing these bits
    manually with u32 or BPF, but these use cases wouldn't be affected
    if we decided to use the full DSCP bit range in core IPv4 code).

  * Ignoring the high order bits creates useless inconsistencies
    between the IPv4 and IPv6 code, while current RFCs make no such
    distinction.

  * Even the IPv4 implementation isn't self-consistent. While most
    route lookups are done with the high order bits cleared, some parts
    of the code explicitly use the full DSCP bit range.

  * In the past, people have sent patches to mask the high order DSCP
    bits and created regressions because of that. People seem to use
    the RT_TOS() macro on whatever "tos" variable they see, without
    really understanding the consequences. I think we'd be better off
    without RT_TOS() and the various IPTOS_* variants, so people
    wouldn't be tempted to copy/paste such code.

  * It would indeed be a behaviour change to make "tos 0x1c" exactly
    match "0x1c". But I'd be surprised if people really expected "0x1c"
    to actually match "0xfc". Also currently one can set "tos 0x1f" in
    routes, but no packet will ever match. That's probably not
    something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
    mean "0x1f" would simplify everyone's life I believe.
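
To make the masking discussed above concrete, here is a minimal
userspace sketch. It assumes the legacy lookup keeps only bits 0x1c of
the DS field and that full DSCP matching keeps bits 0xfc (ECN bits
excluded). The constants are defined locally for illustration and the
helper names are hypothetical, not kernel API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, defined locally for this sketch. */
#define LEGACY_TOS_MASK 0x1c /* three TOS bits from RFC 791 */
#define FULL_DSCP_MASK  0xfc /* six DSCP bits, ECN bits excluded */

/* Hypothetical helper: match with the high order bits cleared. */
static inline bool legacy_tos_match(uint8_t dsfield, uint8_t rule_tos)
{
	return (dsfield & LEGACY_TOS_MASK) == rule_tos;
}

/* Hypothetical helper: match on the full DSCP bit range. */
static inline bool full_dscp_match(uint8_t dsfield, uint8_t rule_tos)
{
	return (dsfield & FULL_DSCP_MASK) == rule_tos;
}
```

With these helpers, legacy_tos_match(0xfc, 0x1c) is true while
full_dscp_match(0xfc, 0x1c) is false, and no DS field value at all
satisfies legacy_tos_match(x, 0x1f).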

> Instead, I was thinking of extending the IPv4 flow information structure
> with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> range of [0, 63] which both address families will support. This will
> allow user space to get a consistent behavior between IPv4 and IPv6 with
> regards to DSCP matching, without affecting existing use cases.

If removing the high order bits mask really isn't feasible, then yes,
that'd probably be our best option. But this would make both the code
and the UAPI more complex. Also we'd have to take care of corner cases
(when both TOS and DSCP are set) and make the same changes to IPv4
routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
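
As a rough illustration of the corner case mentioned above, here is a
sketch of one possible policy: refuse a rule that sets both selectors,
and refuse DSCP values outside [0, 63]. This policy is only an
assumption for the sake of the example, nothing of the sort has been
decided in this thread. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rule selector; field names are illustrative only. */
struct rule_selector {
	bool has_tos;   /* legacy "tos" selector present */
	bool has_dscp;  /* proposed DSCP selector present */
	uint8_t tos;
	uint8_t dscp;   /* expected range: 0..63 */
};

/* One possible policy (an assumption): refuse ambiguous rules. */
static inline bool rule_selector_valid(const struct rule_selector *sel)
{
	if (sel->has_tos && sel->has_dscp)
		return false;
	if (sel->has_dscp && sel->dscp > 63)
		return false;
	return true;
}

/* Place a 6-bit DSCP value in the upper six bits of the DS field. */
static inline uint8_t dscp_to_dsfield(uint8_t dscp)
{
	assert(dscp <= 63);
	return (uint8_t)(dscp << 2);
}
```

Other policies are conceivable (for example letting one selector take
precedence), which is part of why this option adds complexity.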

Dropping the high order bits mask, on the other hand, would make
everything consistent and would simplify both the code and the user
experience. The only drawback is that "tos 0x1c" would only match "0x1c"
(and not "0xfc" anymore). But, as I said earlier, I doubt such a use
case really exists.

> Adding the new field and initializing it correctly throughout the stack
> is not a small undertaking so I was wondering a) Are you OK with the
> suggested approach? b) If not, what else would you suggest?

Sorry for the long text, but I think you have my opinion now.
And yes, whatever the option, this is going to be a long task.

Side note: I'm actually working on a series to start converting
flowi4_tos to dscp_t. I should have a first patch set ready soon
(converting only a few places). But, I'm keeping the old behaviour of
clearing the 3 high order bits for now (these are just two separate
topics).
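
To show why these are two separate topics, here is a small userspace
sketch of the idea: a dedicated type for DSCP values, with the old mask
still applied at lookup time. The struct wrapper is only a stand-in for
compile-time type checking, and every name below is illustrative rather
than taken from the actual patches.

```c
#include <stdint.h>

/* Userspace stand-in for a dedicated DSCP type: wrapping the byte in a
 * struct makes accidental mixing with plain "tos" integers a compile
 * error. The names below are illustrative only. */
typedef struct { uint8_t v; } sketch_dscp_t;

#define SKETCH_DSCP_MASK   0xfc /* six DSCP bits of the DS field */
#define SKETCH_LEGACY_MASK 0x1c /* bits kept by the old behaviour */

/* Build a typed DSCP value from a raw DS field, dropping ECN bits. */
static inline sketch_dscp_t sketch_dsfield_to_dscp(uint8_t dsfield)
{
	return (sketch_dscp_t){ .v = (uint8_t)(dsfield & SKETCH_DSCP_MASK) };
}

/* Convert back to a raw DS field value. */
static inline uint8_t sketch_dscp_to_dsfield(sketch_dscp_t dscp)
{
	return dscp.v;
}

/* Typed value in, old lookup key out: the conversion changes the type,
 * while the high order bits are still cleared at lookup time. */
static inline uint8_t sketch_legacy_lookup_key(sketch_dscp_t dscp)
{
	return (uint8_t)(sketch_dscp_to_dsfield(dscp) & SKETCH_LEGACY_MASK);
}
```

In this sketch, dropping the high order bits mask later would amount to
removing sketch_legacy_lookup_key() and using the typed value directly.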

I can allocate more time to the dscp_t conversion and work/help with
removing the high order bits mask if there's interest in this option.

> Thanks
> 
> [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
>