[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250114231507.21672-4-frederic@kernel.org>
Date: Wed, 15 Jan 2025 00:15:06 +0100
From: Frederic Weisbecker <frederic@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: LKML <linux-kernel@...r.kernel.org>,
Frederic Weisbecker <frederic@...nel.org>,
anna-maria@...utronix.de,
kernel test robot <oliver.sang@...el.com>
Subject: [PATCH 3/4] timers/migration: Annotate accesses to ignore flag
The group's ignore flag is:
_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit
When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.
On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.
Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:
BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report
write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
__tmigr_cpu_activate
tmigr_cpu_activate
timer_clear_idle
tick_nohz_restart_sched_tick
tick_nohz_idle_exit
do_idle
cpu_startup_entry
kernel_init
do_initcalls
clear_bss
reserve_bios_regions
common_startup_64
read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
print_report
kcsan_report_known_origin
kcsan_setup_watchpoint
tmigr_next_groupevt
tmigr_update_events
tmigr_inactive_up
__walk_groups+0x50/0x77
walk_groups
__tmigr_cpu_deactivate
tmigr_cpu_deactivate
__get_next_timer_interrupt
timer_base_try_to_set_idle
tick_nohz_stop_tick
tick_nohz_idle_stop_tick
cpuidle_idle_call
do_idle
Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.
Reported-by: kernel test robot <oliver.sang@...el.com>
Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
---
kernel/time/timer_migration.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 371a62a749aa..066c9ddca4ec 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -569,7 +569,7 @@ static struct tmigr_event *tmigr_next_groupevt(struct tmigr_group *group)
while ((node = timerqueue_getnext(&group->events))) {
evt = container_of(node, struct tmigr_event, nextevt);
- if (!evt->ignore) {
+ if (!READ_ONCE(evt->ignore)) {
WRITE_ONCE(group->next_expiry, evt->nextevt.expires);
return evt;
}
@@ -665,7 +665,7 @@ static bool tmigr_active_up(struct tmigr_group *group,
* lock is held while updating the ignore flag in idle path. So this
* state change will not be lost.
*/
- group->groupevt.ignore = true;
+ WRITE_ONCE(group->groupevt.ignore, true);
return walk_done;
}
@@ -726,6 +726,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
union tmigr_state childstate, groupstate;
bool remote = data->remote;
bool walk_done = false;
+ bool ignore;
u64 nextexp;
if (child) {
@@ -744,11 +745,19 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
nextexp = child->next_expiry;
evt = &child->groupevt;
- evt->ignore = (nextexp == KTIME_MAX) ? true : false;
+ /*
+ * This can race with concurrent idle exit (activate).
+ * If the current writer wins, a useless remote expiration may
+ * be scheduled. If the activate wins, the event is properly
+ * ignored.
+ */
+ ignore = (nextexp == KTIME_MAX) ? true : false;
+ WRITE_ONCE(evt->ignore, ignore);
} else {
nextexp = data->nextexp;
first_childevt = evt = data->evt;
+ ignore = evt->ignore;
/*
* Walking the hierarchy is required in any case when a
@@ -774,7 +783,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
* first event information of the group is updated properly and
* also handled properly, so skip this fast return path.
*/
- if (evt->ignore && !remote && group->parent)
+ if (ignore && !remote && group->parent)
return true;
raw_spin_lock(&group->lock);
@@ -788,7 +797,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
* queue when the expiry time changed only or when it could be ignored.
*/
if (timerqueue_node_queued(&evt->nextevt)) {
- if ((evt->nextevt.expires == nextexp) && !evt->ignore) {
+ if ((evt->nextevt.expires == nextexp) && !ignore) {
/* Make sure not to miss a new CPU event with the same expiry */
evt->cpu = first_childevt->cpu;
goto check_toplvl;
@@ -798,7 +807,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
WRITE_ONCE(group->next_expiry, KTIME_MAX);
}
- if (evt->ignore) {
+ if (ignore) {
/*
* When the next child event could be ignored (nextexp is
* KTIME_MAX) and there was no remote timer handling before or
--
2.46.0
Powered by blists - more mailing lists