linux-kernel - Re: [PATCH v2 net 3/7] net/sched: taprio: update impacted fields during cycle time adjustment

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231108234112.umxjgvqajnxjr6lj@skbuf>
Date:   Thu, 9 Nov 2023 01:41:12 +0200
From:   Vladimir Oltean <vladimir.oltean@....com>
To:     Faizal Rahim <faizal.abdul.rahim@...ux.intel.com>
Cc:     Vinicius Costa Gomes <vinicius.gomes@...el.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Cong Wang <xiyou.wangcong@...il.com>,
        Jiri Pirko <jiri@...nulli.us>,
        "David S . Miller" <davem@...emloft.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 net 3/7] net/sched: taprio: update impacted fields
 during cycle time adjustment

On Tue, Nov 07, 2023 at 06:20:19AM -0500, Faizal Rahim wrote:
> Update impacted fields in advance_sched() if cycle_corr_active()
> is true, which indicates that the next entry is the last entry
> from oper that it will run.
> 
> Update impacted fields:
> 
> 1. gate_duration[tc], max_open_gate_duration[tc]
> Created a new API update_open_gate_duration().The API sets the
> duration based on the last remaining entry, the original value
> was based on consideration of multiple entries.
> 
> 2. gate_close_time[tc]
> Update next entry gate close time according to the new admin
> base time
> 
> 3. max_sdu[tc], budget[tc]
> Restrict from setting to max value because there's only a single
> entry left to run from oper before changing to the new admin
> schedule, so we shouldn't set to max.
> 
> Signed-off-by: Faizal Rahim <faizal.abdul.rahim@...ux.intel.com>

The commit message shouldn't be a text-to-speech output of the commit body.
Say very shortly how the system should behave, what's wrong such that it
doesn't behave as expected, what's the user-visible impact of the bug,
and try to identify why the bug happened.

In this case, what happened is that commit a306a90c8ffe ("net/sched:
taprio: calculate tc gate durations"), which introduced the impacted
fields you are changing, never took dynamic schedule changes into
consideration. So this commit should also appear in the Fixes: tag.

> ---
>  net/sched/sch_taprio.c | 49 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/net/sched/sch_taprio.c b/net/sched/sch_taprio.c
> index ed32654b46f5..119dec3bbe88 100644
> --- a/net/sched/sch_taprio.c
> +++ b/net/sched/sch_taprio.c
> @@ -288,7 +288,8 @@ static void taprio_update_queue_max_sdu(struct taprio_sched *q,
>  		/* TC gate never closes => keep the queueMaxSDU
>  		 * selected by the user
>  		 */
> -		if (sched->max_open_gate_duration[tc] == sched->cycle_time) {
> +		if (sched->max_open_gate_duration[tc] == sched->cycle_time &&
> +		    !cycle_corr_active(sched->cycle_time_correction)) {
>  			max_sdu_dynamic = U32_MAX;
>  		} else {
>  			u32 max_frm_len;
> @@ -684,7 +685,8 @@ static void taprio_set_budgets(struct taprio_sched *q,
>  
>  	for (tc = 0; tc < num_tc; tc++) {
>  		/* Traffic classes which never close have infinite budget */
> -		if (entry->gate_duration[tc] == sched->cycle_time)
> +		if (entry->gate_duration[tc] == sched->cycle_time &&
> +		    !cycle_corr_active(sched->cycle_time_correction))
>  			budget = INT_MAX;
>  		else
>  			budget = div64_u64((u64)entry->gate_duration[tc] * PSEC_PER_NSEC,
> @@ -896,6 +898,32 @@ static bool should_restart_cycle(const struct sched_gate_list *oper,
>  	return false;
>  }
>  
> +/* Open gate duration were calculated at the beginning with consideration of
> + * multiple entries. If cycle time correction is active, there's only a single
> + * remaining entry left from oper to run.
> + * Update open gate duration based on this last entry.
> + */
> +static void update_open_gate_duration(struct sched_entry *entry,
> +				      struct sched_gate_list *oper,
> +				      int num_tc,
> +				      u64 open_gate_duration)
> +{
> +	int tc;
> +
> +	if (!entry || !oper)
> +		return;
> +
> +	for (tc = 0; tc < num_tc; tc++) {
> +		if (entry->gate_mask & BIT(tc)) {
> +			entry->gate_duration[tc] = open_gate_duration;
> +			oper->max_open_gate_duration[tc] = open_gate_duration;
> +		} else {
> +			entry->gate_duration[tc] = 0;
> +			oper->max_open_gate_duration[tc] = 0;
> +		}
> +	}
> +}
> +
>  static bool should_change_sched(struct sched_gate_list *oper)
>  {
>  	bool change_to_admin_sched = false;
> @@ -1010,13 +1038,28 @@ static enum hrtimer_restart advance_sched(struct hrtimer *timer)
>  			/* The next entry is the last entry we will run from
>  			 * oper, subsequent ones will take from the new admin
>  			 */
> +			u64 new_gate_duration =
> +				next->interval + oper->cycle_time_correction;
> +			struct qdisc_size_table *stab =
> +				rtnl_dereference(q->root->stab);
> +

The lockdep annotation for this RCU accessor is bogus.
rtnl_dereference() is the same as rcu_dereference_protected(..., lockdep_rtnl_is_held()),
which cannot be true in a hrtimer callback, as the rtnetlink lock is a
sleepable mutex and hrtimers run in atomic context.

Running with lockdep enabled will tell you as much:

$ ./test_taprio_cycle_extension.sh 
Testing config change with a delay of 5250000000 ns between schedules
[  100.734925] 
[  100.736703] =============================
[  100.740780] WARNING: suspicious RCU usage
[  100.744857] 6.6.0-10114-gca572939947f #1495 Not tainted
[  100.750162] -----------------------------
[  100.754236] net/sched/sch_taprio.c:1064 suspicious rcu_dereference_protected() usage!
[  100.762155] 
[  100.762155] other info that might help us debug this:
[  100.762155] 
[  100.770242] 
[  100.770242] rcu_scheduler_active = 2, debug_locks = 1
[  100.776851] 1 lock held by swapper/0/0:
[  100.780756]  #0: ffff3d9784b83b00 (&q->current_entry_lock){-...}-{3:3}, at: advance_sched+0x44/0x59c
[  100.790099] 
[  100.790099] stack backtrace:
[  100.794477] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.6.0-10114-gca572939947f #1495
[  100.802346] Hardware name: LS1028A RDB Board (DT)
[  100.807072] Call trace:
[  100.809531]  dump_backtrace+0xf4/0x140
[  100.813305]  show_stack+0x18/0x2c
[  100.816638]  dump_stack_lvl+0x60/0x80
[  100.820321]  dump_stack+0x18/0x24
[  100.823654]  lockdep_rcu_suspicious+0x170/0x210
[  100.828210]  advance_sched+0x384/0x59c
[  100.831978]  __hrtimer_run_queues+0x200/0x430
[  100.836360]  hrtimer_interrupt+0xdc/0x39c
[  100.840392]  arch_timer_handler_phys+0x3c/0x4c
[  100.844862]  handle_percpu_devid_irq+0xb8/0x28c
[  100.849417]  generic_handle_domain_irq+0x2c/0x44
[  100.854060]  gic_handle_irq+0x4c/0x110
[  100.857830]  call_on_irq_stack+0x24/0x4c
[  100.861775]  el1_interrupt+0x74/0xc0
[  100.865370]  el1h_64_irq_handler+0x18/0x24
[  100.869489]  el1h_64_irq+0x64/0x68
[  100.872909]  arch_local_irq_enable+0x8/0xc
[  100.877032]  cpuidle_enter+0x38/0x50
[  100.880629]  do_idle+0x1ec/0x280
[  100.883877]  cpu_startup_entry+0x34/0x38
[  100.887822]  kernel_init+0x0/0x1a0
[  100.891245]  start_kernel+0x0/0x3b0
[  100.894756]  start_kernel+0x2f8/0x3b0
[  100.898439]  __primary_switched+0xbc/0xc4

What I would do is:

			struct qdisc_size_table *stab;

			rcu_read_lock();
			stab = rcu_dereference(q->root->stab);
			taprio_update_queue_max_sdu(q, oper, stab);
			rcu_read_unlock();

>  			oper->cycle_end_time = new_base_time;
>  			end_time = new_base_time;
> +
> +			update_open_gate_duration(next, oper, num_tc,
> +						  new_gate_duration);
> +			taprio_update_queue_max_sdu(q, oper, stab);
>  		}
>  	}
>  
>  	for (tc = 0; tc < num_tc; tc++) {
> -		if (next->gate_duration[tc] == oper->cycle_time)
> +		if (cycle_corr_active(oper->cycle_time_correction) &&
> +		    (next->gate_mask & BIT(tc)))
> +			/* Set to the new base time, ensuring a smooth transition
> +			 * to the new schedule when the next entry finishes.
> +			 */
> +			next->gate_close_time[tc] = end_time;
> +		else if (next->gate_duration[tc] == oper->cycle_time)
>  			next->gate_close_time[tc] = KTIME_MAX;
>  		else
>  			next->gate_close_time[tc] = ktime_add_ns(entry->end_time,
> -- 
> 2.25.1
>