linux-kernel - Re: CPU stuck due to the taprio hrtimer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87bk14pw5o.fsf@intel.com>
Date: Tue, 03 Sep 2024 12:57:39 -0300
From: Vinicius Costa Gomes <vinicius.gomes@...el.com>
To: Florian Kauer <florian.kauer@...utronix.de>, luyun <luyun@...inos.cn>,
 jhs@...atatu.com, xiyou.wangcong@...il.com, jiri@...nulli.us
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: CPU stuck due to the taprio hrtimer

Florian Kauer <florian.kauer@...utronix.de> writes:

> On 9/2/24 23:34, Vinicius Costa Gomes wrote:
>> Florian Kauer <florian.kauer@...utronix.de> writes:
>> 
>>> On 9/2/24 11:12, luyun wrote:
>>>>
>>>> 在 2024/6/28 07:30, Vinicius Costa Gomes 写道:
>>>>> Yun Lu <luyun@...inos.cn> writes:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> When I run a taprio test program on the latest kernel(v6.10-rc4), CPU stuck
>>>>>> is detected immediately, and the stack shows that CPU is stuck on taprio
>>>>>> hrtimer.
>>>>>>
>>>>>> The reproducer program link:
>>>>>> https://github.com/xyyluyun/taprio_test/blob/main/taprio_test.c
>>>>>> gcc taprio_test.c -static -o taprio_test
>>>>>>
>>>>>> In this program, start the taprio hrtimer which clockid is set to REALTIME, and
>>>>>> then adjust the system time by a significant value backwards. Thus, CPU will enter
>>>>>> an infinite loop in the__hrtimer_run_queues function, getting stuck and unable to
>>>>>> exit or respond to any interrupts.
>>>>>>
>>>>>> I have tried to avoid this problem by apllying the following patch, and it does work.
>>>>>> But I am not sure if this can be the final solution?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Signed-off-by: Yun Lu <luyun@...inos.cn>
>>>>>> ---
>>>>>>   net/sched/sch_taprio.c | 24 ++++++++++++++++++++++++
>>>>>>   1 file changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
>>>>>> index a0d54b422186..2ff8d34bdbac 100644
>>>>>> --- a/net/sched/sch_taprio.c
>>>>>> +++ b/net/sched/sch_taprio.c
>>>>>> @@ -104,6 +104,7 @@ struct taprio_sched {
>>>>>>       u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
>>>>>>       u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
>>>>>>       u32 txtime_delay;
>>>>>> +    ktime_t offset;
>>>>>>   };
>>>>>>     struct __tc_taprio_qopt_offload {
>>>>>> @@ -170,6 +171,19 @@ static ktime_t sched_base_time(const struct sched_gate_list *sched)
>>>>>>       return ns_to_ktime(sched->base_time);
>>>>>>   }
>>>>>>   +static ktime_t taprio_get_offset(const struct taprio_sched *q)
>>>>>> +{
>>>>>> +    enum tk_offsets tk_offset = READ_ONCE(q->tk_offset);
>>>>>> +    ktime_t time = ktime_get();
>>>>>> +
>>>>>> +    switch (tk_offset) {
>>>>>> +    case TK_OFFS_MAX:
>>>>>> +        return 0;
>>>>>> +    default:
>>>>>> +        return ktime_sub_ns(ktime_mono_to_any(time, tk_offset), time);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>   static ktime_t taprio_mono_to_any(const struct taprio_sched *q, ktime_t mono)
>>>>>>   {
>>>>>>       /* This pairs with WRITE_ONCE() in taprio_parse_clockid() */
>>>>>> @@ -918,6 +932,7 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>>>>>>       int num_tc = netdev_get_num_tc(dev);
>>>>>>       struct sched_entry *entry, *next;
>>>>>>       struct Qdisc *sch = q->root;
>>>>>> +    ktime_t now_offset = taprio_get_offset(q);
>>>>>>       ktime_t end_time;
>>>>>>       int tc;
>>>>>>   @@ -957,6 +972,14 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>>>>>>       end_time = ktime_add_ns(entry->end_time, next->interval);
>>>>>>       end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
>>>>>>   +    if (q->offset != now_offset) {
>>>>>> +        ktime_t diff = ktime_sub_ns(now_offset, q->offset);
>>>>>> +
>>>>>> +        end_time = ktime_add_ns(end_time, diff);
>>>>>> +        oper->cycle_end_time = ktime_add_ns(oper->cycle_end_time, diff);
>>>>>> +        q->offset = now_offset;
>>>>>> +    }
>>>>>> +
>>>>> I think what we should do here is a bit different. Let me try to explain
>>>>> what I have in mind with some context.
>>>>>
>>>>> A bit of context: The idea of taprio is to enforce "TSN" traffic
>>>>> schedules, these schedules require time synchronization, for example via
>>>>> PTP, and in those cases, time jumps are not expected or a sign that
>>>>> something is wrong.
>>>>>
>>>>> In my mind, a time jump, specially a big one, kind of invalidates the
>>>>> schedule, as the schedule is based on an absolute time value (the
>>>>> base_time), and when time jumps that reference in time is lost.
>>>>>
>>>>> BUT making the user's system unresponsive is a bug, a big one, as if
>>>>> this happens in the real world, the user will be unable to investigate
>>>>> what made the system have so big a time correction.
>>>>>
>>>>> So my idea is to warn the user that the time jumped, say that the user
>>>>> needs to reconfigure the schedule, as it is now invalid, and disable the
>>>>> schedule.
>>>>>
>>>>> Does this make sense?
>>>>>
>>>>> Ah, and thanks for the report.
>>>>
>>>> Hello Vinicius,
>>>>
>>>> May I ask is there a fix patch for this issue?
>>>>
>>>> I test it on the latest kernel version,  and it still seems to cause CPU stuck.
>>>>
>>>> As you mentioned, a better way would be to warn the user that the current time has jumped and cancel the hrtimer,
>>>>
>>>> but I'm not sure how to warn the user, or just through printk?
>>>>
>>>> Thanks and best regards.
>>>
>>> I am not sure if it is really the best solution to force the user to reconfigure the schedule
>>> "just" because the clock jumped. Yes, time jumps are a big problem for TAPRIO, but stopping might
>>> make it worse.
>>>
>>> Vinicius wrote that the base_time can no longer reference to the correct point in time,
>>> so the schedule MUST be invalid after the time jump. It is true that the base_time does not longer
>>> refer to the same point in time it referred to before the jump from the view of the local system (!).
>>> But the base_time usually refers to the EXTERNAL time domain (i.e. the time the system SHOULD have
>>> and not the one the system currently has) and is often configured by an external entity.
>>>
>>> So it is quite likely that the schedule was incorrectly phase-shifted BEFORE the time jump and after
>>> the time jump the base_time refers to the CORRECT point in time viewed from the external time domain.
>>>
>>> If you now stop the schedule (and I assume you mean by this to let every queue transmit at any time
>>> as before the schedule was configured) and the user has to reconfigure the schedule again,
>>> it is quite likely that by this you actually increase the interference with the network and in
>>> particular confuse the time synchronization via PTP, so once the schedule is set up again,
>>> you might get a time jump AGAIN.
>>>
>>> So yes, a warning to the user is definitely appropriate in the case of a time jump, but apart
>>> from that I would prefer the system to adapt itself instead of resigning.
>>>
>> 
>> The "warn the user, disable the schedule" is more or less clear in my
>> mind how to implement. But while I was writing this, I was taking
>> another look at the standard, and I think this approach is wrong.
>> 
>> I think what we should do is something like this:
>> 
>> 1. Jump into the past:
>>    1.a. before base-time: Keep all queues open until base-time;
>>    1.b. after base-time: "rewind" the schedule to the new current time;
>> 2. Jump into the future: "fast forward" the schedule to the new current
>>    time;
>> 
>> But I think that for this to fit more neatly, we would need to change
>> how advance_sched() works, right now, it doesn't look at the current
>> time (it considers that the schedule will always advance one-by-one),
>> instead what I am thinking is to consider that every time
>> advance_sched() runs is a pontential "time jump". 
>> 
>> Ideas? Too complicated for an uncommon case? (is it really uncommon?)
>
> I think that would be the correct solution.
> And I don't think it is that uncommon. Especially when the device joins
> the network for the first time and has not time synchronized itself properly yet.
>
> Do you know what the i225/i226 do for hardware offloaded Qbv in that case?
>

In offloaded cases, in my understanding, i225/i226 re-uses most of how
launchtime works for Qbv, so the scheduling is almost per frame, during
transmission, the hw assigns each frame an offset into the gate open
event. And that time is used to calculate when the packet should reach
the wire.

So I would expect to see a few stuck packets (causing transmissions
timeouts) or early transmissions, depending on the direction of the
jump, but things should be able to recover eventually.

>> 
>>> Yun Lu, does this only happen for time jumps into the past or also for large jumps into the future?
>>> And does this also happen for small time "jumps"?
>> 
>> AFAIU this bug will only happen with large jumps into the past. For
>> small jumps into the past, it will spin uselessly for a bit. For jumps
>> into the future, the schedule will be stuck in a particular gate entry
>> until the future becomes now.
>
> Does "it will spin uselessly for a bit" mean the CPU is stuck for that time
> (even if it is short)? If that is the case, it might lead to very strange effects
> in RT systems. And these small time jumps into the past (even if just a few us)
> should actually be relatively common.
>

Yes. The hrtimer will expire, advance_sched() will get the next entry
and its expiration, set the hrtimer expiration to that, this will repeat
until that expiration is after "current" time. As this is hrtimer
function, this will block other things from running on that cpu.

My expectation for taprio in software mode was more something that
people would use to validate new schedules, do some experiments, that
kind of thing. What I was expecting to have in more serious cases was
the offloaded mode.

What I am trying to say is all I am hearing is that the software mode is
being used more seriously than I thought. So this work is a bit more
important. I will add this to my todo list, but I won't be sad at all if
someone beats me to it ;-)

> Greetings,
> Florian
>
>> 
>>>
>>> Thanks,
>>> Florian
>>>
>>>>
>>>>
>>>>>
>>>>>>       for (tc = 0; tc < num_tc; tc++) {
>>>>>>           if (next->gate_duration[tc] == oper->cycle_time)
>>>>>>               next->gate_close_time[tc] = KTIME_MAX;
>>>>>> @@ -1210,6 +1233,7 @@ static int taprio_get_start_time(struct Qdisc *sch,
>>>>>>         base = sched_base_time(sched);
>>>>>>       now = taprio_get_time(q);
>>>>>> +    q->offset = taprio_get_offset(q);
>>>>>>         if (ktime_after(base, now)) {
>>>>>>           *start = base;
>>>>>> -- 
>>>>>> 2.34.1
>>>>>>
>>>>>
>>>>> Cheers,
>>>>
>>>
>> 
>> 
>> Cheers,


Cheers,
-- 
Vinicius