netdev - Re: [PATCH net v1] net/sched: taprio: fix cycle time extension logic

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230530194708.zz6wnzaenau54hcv@skbuf>
Date: Tue, 30 May 2023 22:47:08 +0300
From: Vladimir Oltean <vladimir.oltean@....com>
To: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@...el.com>
Cc: netdev@...r.kernel.org, vinicius.gomes@...el.com, kuba@...nel.org,
	pabeni@...hat.com, linux-kernel@...r.kernel.org,
	tee.min.tan@...ux.intel.com, edumazet@...gle.com
Subject: Re: [PATCH net v1] net/sched: taprio: fix cycle time extension logic

On Tue, May 30, 2023 at 04:25:41PM +0800, Muhammad Husaini Zulkifli wrote:
> From: Tan Tee Min <tee.min.tan@...ux.intel.com>
> 
> According to IEEE Std. 802.1Q-2018 section Q.5 CycleTimeExtension,
> the Cycle Time Extension variable allows this extension of the last old
> cycle to be done in a defined way. If the last complete old cycle would
> normally end less than OperCycleTimeExtension nanoseconds before the new
> base time, then the last complete cycle before AdminBaseTime is reached
> is extended so that it ends at AdminBaseTime.
> 
> This patch extends the last entry of last complete cycle to AdminBaseTime.
> 
> Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
> Signed-off-by: Tan Tee Min <tee.min.tan@...ux.intel.com>
> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@...el.com>
> ---

Thanks for the patch.

I think that the commit message insufficiently describes the problems
with the existing code. Also, the incorrect Fixes: tag suggests it may
even have been incompletely characterized.

Here are some additional thoughts of mine (also insufficient, since they
are based on static code analysis, done now) which may nuance things a
bit more, to change the Fixes tag and the shape of the proposed solution.

Of the problems is that after my commit a1e6ad30fa19 ("net/sched: taprio:
calculate guard band against actual TC gate close time"), the last entry
of the old admin schedule stops being valid from taprio_dequeue_from_txq()'s
perspective.

Before that patch, taprio_dequeue_from_txq() would look at entry->close_time
to decide whether packets are eligible to be sent or not. The old logic
would take a cycle length correction into account like this:

	if (should_change_schedules(admin, oper, close_time)) {
		/* Set things so the next time this runs, the new
		 * schedule runs.
		 */
		close_time = sched_base_time(admin);
		switch_schedules(q, &admin, &oper);
	}

	next->close_time = close_time; // contains the cycle length correction

After that patch, taprio_dequeue_from_txq() -> taprio_entry_allows_tx()
simply does not have logic to take a cycle time correction into consideration;
it just looks at entry->gate_close_time[tc] which is determined from the
previous entry->end_time plus the next->gate_duration[tc]. All
entry->gate_duration[tc] values are calculated only once, during
taprio_calculate_gate_durations(). Nowhere is a dynamic schedule change
taken into account.

Now, taprio_dequeue_from_txq() uses the following "entry" fields:
gate_close_time[tc] and budget[tc]. They are both recalculated
incorrectly by advance_sched(), which your change addresses. That is one
issue which should be described properly, and a patch limited to that.

Sure, you're modifying entry->gate_duration[] to account for the correction,
but now it no longer matches this kind of checks:

	/* Traffic classes which never close have infinite budget */
	if (entry->gate_duration[tc] == sched->cycle_time)
		budget = INT_MAX;

so this is why your choice is to also update the cycle_time. An unfortunate
consequence of that choice is that this might also become transiently visible
to taprio_dump(), which I'm not totally convinced is desirable - because
effectively, the cycle time shouldn't have changed. Although, true, the
old oper schedule is going away soon, we don't really know how soon. The
cycle times can be arbitrarily large. It would be great if we could save
the correction into a dedicated "s64 correction" variable (ranging from
-cycle_time+1 to +cycle_time_extension), and leave the cycle_time alone.

So my patch is partly to blame for today's mishaps, which is something
that this patch fails to identify properly.

But taprio_enqueue() is also a problem, and that isn't addressed. It can be,
that during a corrected cycle, frames which were oversized due to the
queueMaxSDU restriction are transiently not oversized anymore, and
should be allowed to pass - or the other way around (and this is worse):
a frame which would have passed through a full-size window will not pass
through a truncated last cycle (negative correction), and taprio_enqueue()
will not detect this and will not drop the skb.

taprio_update_queue_max_sdu() would need to be called, and that depends
on an up-to-date sched->max_open_gate_duration[tc] which isn't currently
recalculated.

So, one way or another, taprio_calculate_gate_durations() also needs to
be called again after a change to the schedule's correction.

>  net/sched/sch_taprio.c | 32 +++++++++++++++++++++-----------
>  1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index 76db9a10ef504..ef487fef83fce 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -99,6 +99,7 @@ struct taprio_sched {
>  	u32 max_sdu[TC_MAX_QUEUE]; /* save info from the user */
>  	u32 fp[TC_QOPT_MAX_QUEUE]; /* only for dump and offloading */
>  	u32 txtime_delay;
> +	bool sched_changed;

nitpick: sched_change_pending?

>  };
>  
>  struct __tc_taprio_qopt_offload {
> @@ -934,8 +935,10 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	admin = rcu_dereference_protected(q->admin_sched,
>  					  lockdep_is_held(&q->current_entry_lock));
>  
> -	if (!oper)
> +	if (!oper || q->sched_changed) {
> +		q->sched_changed = false;
>  		switch_schedules(q, &admin, &oper);
> +	}
>  
>  	/* This can happen in two cases: 1. this is the very first run
>  	 * of this function (i.e. we weren't running any schedule
> @@ -962,20 +965,27 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  	end_time = ktime_add_ns(entry->end_time, next->interval);
>  	end_time = min_t(ktime_t, end_time, oper->cycle_end_time);
>  
> +	if (should_change_schedules(admin, oper, oper->cycle_end_time) &&
> +	    list_is_last(&next->list, &oper->entries)) {
> +		u32 ori_interval = next->interval;

The choice of operations seems unintuitive, when you could have simply
done:

	ktime_t correction = ktime_sub(sched_base_time(admin), end_time);

	next->interval += correction;
	oper->cycle_time += correction;

	(or possibly save the correction instead, see the feedback above)

> +
> +		next->interval += ktime_sub(sched_base_time(admin), end_time);
> +		oper->cycle_time += next->interval - ori_interval;
> +		end_time = sched_base_time(admin);
> +		oper->cycle_end_time = end_time;
> +		q->sched_changed = true;

I see an issue here: you need to set q->sched_changed = true whenever
should_change_schedules() returned true and not just for the last entry,
correct? Because there could be a schedule change which needs to happens
during entry 2 out of 4 of the current one (truncation case - negative
correction). In that case, I believe that should_change_schedules()
keeps shouting at you "change! change! change!" but you only call
switch_schedules() when you've reached entry 4 with the "next" pointer,
which is not what the standard suggests should be done.

IIUC, the standard says that when an admin and an oper sched meet, the
decision of what to do near the AdminBaseTime - whether to elongate the
next-to-last cycle of the oper sched or to let the last cycle run but to
cut it short - depends on the OperCycleTimeExtension. In a nutshell,
that variable says what is the maximum positive correction applicable to
the last sched entry and to the cycle time. If a positive correction
larger than that would be necessary (relative to the next-to-last cycle),
the decision is to just let the last cycle run for how long it can.

> +	}
> +
>  	for (tc = 0; tc < num_tc; tc++) {
> -		if (next->gate_duration[tc] == oper->cycle_time)
> +		if (next->gate_duration[tc] == oper->cycle_time) {
>  			next->gate_close_time[tc] = KTIME_MAX;
> -		else
> +		} else if (q->sched_changed && next->gate_duration[tc]) {
> +			next->gate_close_time[tc] = end_time;
> +			next->gate_duration[tc] = next->interval;

This deserves a comment because, although I understand what it intends
to do, I fail to understand if it will work or not :)

> +		} else {
>  			next->gate_close_time[tc] = ktime_add_ns(entry->end_time,
>  								 next->gate_duration[tc]);
> -	}
> -
> -	if (should_change_schedules(admin, oper, end_time)) {
> -		/* Set things so the next time this runs, the new
> -		 * schedule runs.
> -		 */
> -		end_time = sched_base_time(admin);
> -		switch_schedules(q, &admin, &oper);

I guess this is also a separate issue with the code. switch_schedules()
changes q->oper_sched too early, and taprio_skb_exceeds_queue_max_sdu()
looks at q->oper_sched.  So during an extension period, the queueMaxSDU
of the new schedule will be applied instead of the (extended) old one.

> +		}
>  	}
>  
>  	next->end_time = end_time;
> -- 
> 2.17.1
>

I guess at some point we should also fix up this comment?

	/* FIXME: the IEEE 802.1Q-2018 Specification isn't clear about
	 * how precisely the extension should be made. So after
	 * conformance testing, this logic may change.
	 */
	if (ktime_compare(next_base_time, extension_time) <= 0)
		return true;