netdev - Re: [RFC net] Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d579c817-50c7-5bd5-4b28-f044daabf7f6@uliege.be>
Date:   Mon, 5 Dec 2022 21:44:09 +0100
From:   Justin Iurman <justin.iurman@...ege.be>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev@...r.kernel.org, davem@...emloft.net, kuba@...nel.org,
        yoshfuji@...ux-ipv6.org, dsahern@...nel.org, pabeni@...hat.com,
        stable@...r.kernel.org
Subject: Re: [RFC net] Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue
 depth data field")

On 12/5/22 19:34, Eric Dumazet wrote:
> On Mon, Dec 5, 2022 at 7:24 PM Justin Iurman <justin.iurman@...ege.be> wrote:
>>
>> On 12/5/22 17:50, Eric Dumazet wrote:
>>> On Mon, Dec 5, 2022 at 5:30 PM Eric Dumazet <edumazet@...gle.com> wrote:
>>>>
>>>> Patch title seems
>>>>
>>>> On Mon, Dec 5, 2022 at 4:36 PM Justin Iurman <justin.iurman@...ege.be> wrote:
>>>>>
>>>>> This patch fixes a NULL qdisc pointer when retrieving the TX queue depth
>>>>> for IOAM.
>>>>>
>>>>> IMPORTANT: I suspect this fix is local only and the bug goes deeper (see
>>>>> reasoning below).
>>>>>
>>>>> Kernel panic:
>>>>> [...]
>>>>> RIP: 0010:ioam6_fill_trace_data+0x54f/0x5b0
>>>>> [...]
>>>>>
>>>>> ...which basically points to the call to qdisc_qstats_qlen_backlog
>>>>> inside net/ipv6/ioam6.c.
>>>>>
>>>>>   From there, I directly thought of a NULL pointer (queue->qdisc). To make
>>>>> sure, I added some printk's to know exactly *why* and *when* it happens.
>>>>> Here is the (summarized by queue) output:
>>>>>
>>>>> skb for TX queue 1, qdisc is ffff8b375eee9800, qdisc_sleeping is ffff8b375eee9800
>>>>> skb for TX queue 2, qdisc is ffff8b375eeefc00, qdisc_sleeping is ffff8b375eeefc00
>>>>> skb for TX queue 3, qdisc is ffff8b375eeef800, qdisc_sleeping is ffff8b375eeef800
>>>>> skb for TX queue 4, qdisc is ffff8b375eeec800, qdisc_sleeping is ffff8b375eeec800
>>>>> skb for TX queue 5, qdisc is ffff8b375eeea400, qdisc_sleeping is ffff8b375eeea400
>>>>> skb for TX queue 6, qdisc is ffff8b375eeee000, qdisc_sleeping is ffff8b375eeee000
>>>>> skb for TX queue 7, qdisc is ffff8b375eee8800, qdisc_sleeping is ffff8b375eee8800
>>>>> skb for TX queue 8, qdisc is ffff8b375eeedc00, qdisc_sleeping is ffff8b375eeedc00
>>>>> skb for TX queue 9, qdisc is ffff8b375eee9400, qdisc_sleeping is ffff8b375eee9400
>>>>> skb for TX queue 10, qdisc is ffff8b375eee8000, qdisc_sleeping is ffff8b375eee8000
>>>>> skb for TX queue 11, qdisc is ffff8b375eeed400, qdisc_sleeping is ffff8b375eeed400
>>>>> skb for TX queue 12, qdisc is ffff8b375eeea800, qdisc_sleeping is ffff8b375eeea800
>>>>> skb for TX queue 13, qdisc is ffff8b375eee8c00, qdisc_sleeping is ffff8b375eee8c00
>>>>> skb for TX queue 14, qdisc is ffff8b375eeea000, qdisc_sleeping is ffff8b375eeea000
>>>>> skb for TX queue 15, qdisc is ffff8b375eeeb800, qdisc_sleeping is ffff8b375eeeb800
>>>>> skb for TX queue 16, qdisc is NULL, qdisc_sleeping is NULL
>>>>>
>>>>> What the hell? So, not sure why queue #16 would *never* have a qdisc
>>>>> attached. Is it something expected I'm not aware of? As an FYI, here is
>>>>> the output of "tc qdisc list dev xxx":
>>>>>
>>>>> qdisc mq 0: root
>>>>> qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>> qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
>>>>>
>>>>> By the way, the NIC is an Intel XL710 40GbE QSFP+ (i40e driver, firmware
>>>>> version 8.50 0x8000b6c7 1.3082.0) and it was tested on latest "net"
>>>>> version (6.1.0-rc7+). Is this a bug in the i40e driver?
>>>>>
>>>>
>>>>> Cc: stable@...r.kernel.org
>>>>
>>>> Patch title is mangled. The Fixes: tag should appear here, not in the title.
>>>>
>>>>
>>>> Fixes: b63c5478e9cb ("ipv6: ioam: Support for Queue depth data field")
>>>>
>>>>> Signed-off-by: Justin Iurman <justin.iurman@...ege.be>
>>>>> ---
>>>>>    net/ipv6/ioam6.c | 11 +++++++----
>>>>>    1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/net/ipv6/ioam6.c b/net/ipv6/ioam6.c
>>>>> index 571f0e4d9cf3..2472a8a043c4 100644
>>>>> --- a/net/ipv6/ioam6.c
>>>>> +++ b/net/ipv6/ioam6.c
>>>>> @@ -727,10 +727,13 @@ static void __ioam6_fill_trace_data(struct sk_buff *skb,
>>>>>                           *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
>>>>>                   } else {
>>>>>                           queue = skb_get_tx_queue(skb_dst(skb)->dev, skb);
>>>>
>>>> Are you sure skb_dst(skb)->dev is correct at this stage, what about
>>>> stacked devices ?
>>>>
>>>>> -                       qdisc = rcu_dereference(queue->qdisc);
>>>>> -                       qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
>>>>> -
>>>>> -                       *(__be32 *)data = cpu_to_be32(backlog);
>>>>> +                       if (!queue->qdisc) {
>>>>
>>>> This is racy.
>>>>
>>>>> +                               *(__be32 *)data = cpu_to_be32(IOAM6_U32_UNAVAILABLE);
>>>>> +                       } else {
>>>>> +                               qdisc = rcu_dereference(queue->qdisc);
>>>>> +                               qdisc_qstats_qlen_backlog(qdisc, &qlen, &backlog);
>>>>> +                               *(__be32 *)data = cpu_to_be32(backlog);
>>>>> +                       }
>>>>>                   }
>>>>>                   data += sizeof(__be32);
>>>>>           }
>>>>> --
>>>>> 2.25.1
>>>>>
>>>>
>>>> Quite frankly I suggest to revert b63c5478e9cb completely.
>>>>
>>>> The notion of Queue depth can not be properly gathered in Linux with a
>>>> multi queue model,
>>>> so why trying to get a wrong value ?
>>>
>>> Additional reason for a revert is that qdisc_qstats_qlen_backlog() is
>>> reserved for net/sched
>>
>> If by "reserved" you mean "only used by at the moment", then yes (with
>> the only exception being IOAM). But some other functions are defined as
>> well, and some are used elsewhere than the "net/sched" context. So I
>> don't think it's really an issue to use this function "from somewhere else".
>>
>>> code, I think it needs the qdisc lock to be held.
>>
>> Good point. But is it really needed when called with rcu_read_lock?
> 
> It is needed.

So I guess I could just submit a patch to wrap that part with:

spin_lock(qdisc_lock(qdisc));
[...]
spin_unlock(qdisc_lock(qdisc));

Anyway, there'd still be that queue-without-qdisc bug out there that I'd 
like to discuss but which seems to be avoided in the conversation somehow.

> Please revert this patch.
> 
> Many people use FQ qdisc, where packets are waiting for their Earliest
> Departure Time to be released.

The IOAM queue depth is a very important value and is already used. Just 
reverting the patch is not really an option. Besides, if IOAM is not 
used, then what you described above would not be a problem (probably 
most of the time as IOAM is enabled on limited domains). Not to mention 
that you probably don't need the queue depth in every packet.

> Also, the draft says:
> 
> 5.4.2.7.  queue depth
> 
>     The "queue depth" field is a 4-octet unsigned integer field.  This
>     field indicates the current length of the egress interface queue of
>     the interface from where the packet is forwarded out.  The queue
>     depth is expressed as the current amount of memory buffers used by
>     the queue (a packet could consume one or more memory buffers,
>     depending on its size).
> 
>      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>     |                       queue depth                             |
>     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 
> 
> It is relatively clear that the egress interface is the aggregate
> egress interface,
> not a subset of the interface.

Correct, even though the definition of an interface in RFC 9197 is quite 
abstract (see the end of section 4.4.2.2: "[...] could represent a 
physical interface, a virtual or logical interface, or even a queue").

> If you have 32 TX queues on a NIC, all of them being backlogged (line rate),
> sensing the queue length of one of the queues would give a 97% error
> on the measure.

Why would it? Not sure I get your idea based on that example.

> To properly implement that 'Queue depth' metric, you would need to use
> the backlog of the MQ qdisc.
> That would be very expensive.

Could be a solution, thanks for the suggestion. But if it's too 
expensive, I'd rather have the current solution (with locks) than 
nothing at all. At least it has provided an acceptable solution so far.