[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YpMVfN7UUGxX0UxF@geday>
Date: Sun, 29 May 2022 03:41:00 -0300
From: Geraldo Nascimento <geraldogabriel@...il.com>
To: Tejun Heo <tj@...nel.org>
Cc: Lai Jiangshan <jiangshanlai@...il.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] workqueue: missing NOT while checking if Workqueue is
offline
On Sat, May 28, 2022 at 08:14:23PM -1000, Tejun Heo wrote:
> On Sun, May 29, 2022 at 02:53:39AM -0300, Geraldo Nascimento wrote:
> > On Sat, May 28, 2022 at 07:24:41PM -1000, Tejun Heo wrote:
> > > On Sun, May 29, 2022 at 01:29:32AM -0300, Geraldo Nascimento wrote:
> > > > I would like very much to hear the opinion of the maintainers!
> > >
> > > I have a hard time understanding what you're trying to do. Can you please
> > > slow down and start from describing the problem itself?
> >
> > Hi Tejun,
> >
> > Sorry for the hurry.
> >
> > The problem is best described in https://gitlab.freedesktop.org/drm/amd/-/issues/1898
> >
> > From my understanding from the context of __cancel_work_timer() we should not
> > ever call __flush_work() but I may be wrong. In the present case as
>
> Yeah, you're wrong.
>
No problem from my side, sorry for wasting your time.
> > described in AMD's GitLab __cancel_work_timer() is being called by
> > cancel_delayed_work_sync() inside kfd_process_notifier_release()
> > from drivers/gpu/drm/amd/amdkfd/kfd_process.c:1157 (Linux 5.18).
>
> Have you confirmed that that actually is the warning which is triggering? I
> don't see how that condition would trigger that late during the boot and the
> warning line being reported doesn't match v5.16 source code, so I'm not sure
> but skimming the instructon sequence, that's the second UD2 sequence, so I'm
> gonna guess that's the second WARN_ON - the !work->func one and someone else
> on the gitlab bug report seems to agree too.
While I can confirm from my dmesg traces it's the second WARN_ON (the one on
(!work->func)) that triggers, remember it's being called from
__flush_work() due to the lack of NOT operator on wq_online, inside
__cancel_work_timer().
To be honest, for me, it only makes sense to call __flush_work() from
the context of __cancel_work_timer() if we are sure the work isn't
executing. One of the few ways to be sure is when kthreads haven't
initialized yet, that means workqueue_init() hasn't fired yet and
wq_online is still false, so it makes sense to negate that false and
exceptionally call __flush_work() without waiting for completion of the
work - i.e., __flush_work() will WARN_ON workqueue offline condition and
return false immediately because task was already idle.
I know I may be repeating myself, but I'm trying to make my point, from
the little understanding I have of the kernel. And I know that you know best,
and that the possibility of a bug like that lying undiscovered on a
code-base as scrutinized as the Linux kernel is, is very small.
>
> It's usually a lot more helpful if the bug report is complete - include the
> full warning message with some context at least, make sure that the kernel
> you're using is an upstream one or something close enough. If not, point to
> the source tree. Also, try to clearly distinguish what you know and what you
> suspect. Both can help but mixing them up together tends to cause confusion
> for everyone involved.
Sorry. Will try to do better.
>
> It just looks like the code is trying to cancel a work item which hasn't
> been initialized and what it prolly needs is an ifdef around that cancel
> call depending on the config option.
Thanks for looking into it,
Geraldo Nascimento
>
> Thanks.
>
> --
> tejun
Powered by blists - more mailing lists