[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZD1fdvr1Eh8aAdnU@localhost.localdomain>
Date: Mon, 17 Apr 2023 17:02:14 +0200
From: Frederic Weisbecker <frederic@...nel.org>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Victor Hassan <victor@...winnertech.com>, fweisbec@...il.com,
mingo@...nel.org, jindong.yue@....com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] tick/broadcast: Do not set oneshot_mask except
was_periodic was true
Le Sat, Apr 15, 2023 at 11:01:51PM +0200, Thomas Gleixner a écrit :
> From: Thomas Gleixner <tglx@...utronix.de>
> Subject: tick/broadcast: Make broadcast device replacement work correctly
> Date: Wed, 12 Apr 2023 08:34:25 +0800
>
> When a tick broadcast clockevent device is initialized for one shot mode
> then tick_broadcast_setup_oneshot() OR's the periodic broadcast mode
> cpumask into the oneshot broadcast cpumask.
>
> This is required when switching from periodic broadcast mode to oneshot
> broadcast mode to ensure that CPUs which are waiting for periodic
> broadcast are woken up on the next tick.
>
> But it is subtly broken, when an active broadcast device is replaced and
> the system is already in oneshot (NOHZ/HIGHRES) mode. Victor observed
> this and debugged the issue.
>
> Then the OR of the periodic broadcast CPU mask is wrong as the periodic
> cpumask bits are sticky after tick_broadcast_enable() set it for a CPU
> unless explicitly cleared via tick_broadcast_disable().
>
> That means that this sets all other CPUs which have tick broadcasting
> enabled at that point unconditionally in the oneshot broadcast mask.
>
> If the affected CPUs were already idle and had their bits set in the
> oneshot broadcast mask then this does no harm. But for non idle CPUs
> which were not set this corrupts their state.
>
> On their next invocation of tick_broadcast_enable() they observe the bit
> set, which indicates that the broadcast for the CPU is already set up.
> As a consequence they fail to update the broadcast event even if their
> earliest expiring timer is before the actually programmed broadcast
> event.
>
> If the programmed broadcast event is far in the future, then this can
> cause stalls or trigger the hung task detector.
>
> Avoid this by telling tick_broadcast_setup_oneshot() explicitly whether
> this is the initial switch over from periodic to oneshot broadcast which
> must take the periodic broadcast mask into account. In the case of
> initialization of a replacement device this prevents that the broadcast
> oneshot mask is modified.
>
> There is a second problem with broadcast device replacement in this
> function. The broadcast device is only armed when the previous state of
> the device was periodic.
Any chance the patch could be cut in two then?
Thanks.
Powered by blists - more mailing lists