[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zn3DdfGZIVBxN0DR@shredder.lan>
Date: Thu, 27 Jun 2024 22:54:29 +0300
From: Ido Schimmel <idosch@...dia.com>
To: Guillaume Nault <gnault@...hat.com>
Cc: netdev@...r.kernel.org
Subject: Re: Matching on DSCP with IPv4 FIB rules
Hi Guillaume, thanks for the detailed response!
On Thu, Jun 27, 2024 at 01:51:37AM +0200, Guillaume Nault wrote:
> On Wed, Jun 26, 2024 at 02:58:17PM +0300, Ido Schimmel wrote:
> > Hi Guillaume, everyone,
>
> Hi Ido, thanks for reaching out,
>
> > We have users that would like to direct traffic to a routing table based
> > on the DSCP value in the IP header. While this can be done using IPv6
> > FIB rules, it cannot be done using IPv4 FIB rules as the kernel only
> > allows such rules to match on the three TOS bits from RFC 791 (lower
> > three DSCP bits). See more info in Guillaume's excellent presentation
> > here [1].
> >
> > Extending IPv4 FIB rules to match on DSCP is not easy because of how
> > inconsistently the TOS field in the IPv4 flow information structure
> > (i.e., 'struct flowi4::flowi4_tos') is initialized and handled
> > throughout the networking stack.
> >
> > Redefining the field using 'dscp_t' and removing the masking of the
> > upper three DSCP bits is not an option as it will change existing
> > behavior. For example, an incoming IPv4 packet with a DS field of 0xfc
> > will no longer match a FIB rule that matches on 'tos 0x1c'.
>
> Could removing the high order bits mask actually _be_ an option? I was
> worried about behaviour change when I started looking into this. But,
> with time, I'm more and more thinking about just removing the mask.
>
> Here are the reasons why:
>
> * DSCP deprecated the Precedence/TOS bits separation more than
> 25 years ago. I've never heard of anyone trying to use the high
> order bits as Preference, while we've had several reports of people
> using (or trying to use) the full DSCP bit range.
> Also, I far as I know, Linux never offered any way to interpret the
> high order bits as Precedence (apart from parsing these bits
> manually with u32 or BPF, but these use cases wouldn't be affected
> if we decided to use the full DSCP bit range in core IPv4 code).
>
> * Ignoring the high order bits creates useless inconsistencies
> between the IPv4 and IPv6 code, while current RFCs make no such
> distinction.
>
> * Even the IPv4 implementation isn't self consistent. While most
> route lookups are done with the high order bits cleared, some parts
> of the code explicitly use the full DSCP bit range.
>
> * In the past, people have sent patches to mask the high order DSCP
> bits and created regressions because of that. People seem to use
By "patches" you mean IPv6 patches?
> the RT_TOS() macro on whatever "tos" variable they see, without
> really understanding the consequences. I think we'd be better off
> without RT_TOS() and the various IPTOS_* variants, so people
> wouldn't be tempted to copy/pasting such code.
>
> * It would indeed be a behaviour change to make "tos 0x1c" exactly
> match "0x1c". But I'd be surprised if people really expected "0x1c"
> to actually match "0xfc". Also currently one can set "tos 0x1f" in
> routes, but no packet will ever match. That's probably not
> something anyone would expect. Making "0x1c" mean "0x1c" and "0x1f"
> mean "0x1f" would simplify everyone's life I believe.
Did you mean "0xfc" instead of "0x1f"? The kernel rejects routes with
"tos 0x1f" due to ECN bits being set.
I agree with everything you wrote except the assumption about users'
expectations. I honestly do not know if some users are relying on "tos
0x1c" to also match "0xfc", but I am not really interested in finding
out especially when undoing the change is not that easy. However, I have
another suggestion that might work which seems like a middle ground
between both approaches:
1. Extending the IPv4 flow information structure with a new 'dscp_t'
field (e.g., 'struct flowi4::dscp') and initializing it with the full
DSCP value throughout the stack. Already did this for all the places
where 'flowi4_tos' initialized other than flowi4_init_output() which is
next on my list.
2. Keeping the existing semantics of the "tos" keyword in ip-rule and
ip-route to match on the three lower DSCP bits, but changing the IPv4
functions that match on 'flowi4_tos' (fib_select_default,
fib4_rule_match, fib_table_lookup) to instead match on the new DSCP
field with a mask. For example, in fib4_rule_match(), instead of:
if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
We will have:
if (r->dscp && r->dscp != (fl4->dscp & IPTOS_RT_MASK))
I was only able to find two call paths that can reach these functions
with a TOS value that does not have its three upper DSCP bits masked:
nl_fib_input()
nl_fib_lookup()
flowi4_tos = frn->fl_tos // Directly from user space
fib_table_lookup()
nft_fib4_eval()
flowi4_tos = iph->tos & DSCP_BITS
fib_lookup()
The first belongs to an ancient "NETLINK_FIB_LOOKUP" family which I am
quite certain nobody is using and the second belongs to netfilter's fib
expression.
If regressions are reported for any of them (unlikely, IMO), we can add
a new flow information flag (e.g., 'FLOWI_FLAG_DSCP_NO_MASK') which will
tell the core routing functions to not apply the 'IPTOS_RT_MASK' mask.
3. Removing 'struct flowi4::flowi4_tos'.
4. Adding a new DSCP FIB rule attribute (e.g., 'FRA_DSCP') with a
matching "dscp" keyword in iproute2 that accepts values in the range of
[0, 63] which both address families will support. IPv4 will support it
via the new DSCP field ('struct flowi4::dscp') and IPv6 will support it
using the existing flow label field ('struct flowi6::flowlabel').
The kernel will reject rules that are configured with both "tos" and
"dscp".
I do not want to add a similar keyword to ip-route because I have no use
case for it and if we add it now we will never be able to remove it. It
can always be added later without too much effort.
>
> > Instead, I was thinking of extending the IPv4 flow information structure
> > with a new 'dscp_t' field (e.g., 'struct flowi4::dscp') and adding a new
> > DSCP FIB rule attribute (e.g., 'FRA_DSCP') that accepts values in the
> > range of [0, 63] which both address families will support. This will
> > allow user space to get a consistent behavior between IPv4 and IPv6 with
> > regards to DSCP matching, without affecting existing use cases.
>
> If removing the high order bits mask really isn't feasible, then yes,
> that'd probably be our best option. But this would make both the code
> and the UAPI more complex. Also we'd have to take care of corner cases
> (when both TOS and DSCP are set) and make the same changes to IPv4
> routes, to keep TOS/DSCP consistent between ip-rule and ip-route.
>
> Dropping the high order bits mask, on the other hand, would make
> everything consistent and would simplifies both the code and the user
> experience. The only drawback is that "tos 0x1c" would only match "0x1c"
> (and not "0x1f" anymore). But, as I said earlier, I doubt if such a use
> case really exist.
Whether use cases like that exist or not is the main issue I have with
the removal of the high order bits mask. The advantage of this approach
is that no new uAPI is required, but the disadvantage is that there is a
potential for regressions without an easy mitigation.
I believe that with the approach I outlined above the potential for
regressions is lower and we should have a way to mitigate them if/when
reported. The disadvantage is that we need to introduce a new "dscp"
keyword and a new netlink attribute.
>
> > Adding the new field and initializing it correctly throughout the stack
> > is not a small undertaking so I was wondering a) Are you OK with the
> > suggested approach? b) If not, what else would you suggest?
>
> Sorry for the long text, but I think you have my opinion now.
> And yes, whatever the option, this is going to be a long task.
Yes :(
>
> Side note: I'm actually working on a series to start converting
> flowi4_tos to dscp_t. I should have a first patch set ready soon
> (converting only a few places). But, I'm keeping the old behaviour of
> clearing the 3 high order bits for now (these are just two separate
> topics).
I will be happy to review, but I'm not sure what you mean by "converting
only a few places". How does it work?
>
> I can allocate more time on the dscp_t conversion and work/help with
> removing the high order bits mask if there's interest in this option.
>
> > Thanks
> >
> > [1] https://lpc.events/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
> >
>
Powered by blists - more mailing lists