linux-kernel - Re: [RFC PATCH] workqueue: introduce queue_delayed_work_on_offline

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <54ef8fde-fcbf-4610-a42e-dbb4713ef5c1@oracle.com>
Date: Fri, 17 Jan 2025 16:27:32 +1100
From: imran.f.khan@...cle.com
To: andrew+netdev@...n.ch, davem@...emloft.net, edumazet@...gle.com,
        kuba@...nel.org, pabeni@...hat.com,
        James.Bottomley@...senPartnership.com, martin.petersen@...cle.com
Cc: jiangshanlai@...il.com, linux-kernel@...r.kernel.org,
        Tejun Heo <tj@...nel.org>
Subject: Re: [RFC PATCH] workqueue: introduce
 queue_delayed_work_on_offline_safe.

Hello everyone,

Firstly, apologies If I have added too many people. 
I have used maintainer list for drivers mentioned in [1] and 
further mentioned below in [2], [3], [4] and [5].

To give some context, in this thread we are discussing the problem
of, queue_delayed_work_on (with non zero delay) putting the work
on an (already), offlined CPU and that work not getting executed or at least
not getting executed timely because the underlying timer would not
fire on an offlined cpu.
We observed this issue with some in-house changes in RDS code, where some
preferred-cpu was being used to queue the delayed_work , but it was offline
and as a result corresponding work item remained stuck

I could not locate any reports of similar issue in the mailing list but I
see that [2], [4] and [5]are also using some sort of cached/preferred cpu
information to queue the delayed_work on.
So could you please confirm if you have faced issue, similar to one being described
here and if not how are you ensuring that selected cpu is and remains online.

For example [2] is using a preferred-cpu, obtained from [3], but [3] is using
"cpu_online_mask" to confirm that cpu is online. But can't it happen that
cpu was seen as online in cpu_online_mask by [3], but by the time, timer
of delayed_work got added this cpu was offline. I understand that race window
is very slow and very unlikely but it still looks racy or am I understanding
something totally wrong.

Similarly, [4] and [5] are using cq->chann as cpu for queueing delayed_work on,
but its not clear how its ensured that these cpus are online. Is there some
implicit semantics,that is confirming that cq->chann is and remains online,
during submission of delayed_work.

I will very thankful, if you could clarify the above queries.

Thanks,
Imran

[1]: https://lore.kernel.org/all/eab73ac2-d9eb-4ac7-9e0f-9a02b81a31bb@oracle.com/
[2]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/net/ethernet/pensando/ionic/ionic_dev.c#L177
[3]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/net/ethernet/pensando/ionic/ionic_dev.c#L76
[4]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/scsi/lpfc/lpfc_sli.c#L14987
[5]: https://elixir.bootlin.com/linux/v6.12.6/source/drivers/scsi/lpfc/lpfc_sli.c#L15381

On 14/1/2025 2:01 pm, imran.f.khan@...cle.com wrote:
> Hello Tejun,
> Thanks for taking a look into it.
> On 14/1/2025 5:21 am, Tejun Heo wrote:
>> On Mon, Jan 13, 2025 at 03:35:40PM +1100, Imran Khan wrote:
>> ...
>>> I have kept the patch as RFC because from mailing list,
>>> I could not find any users, of queue_delayed_work_on,
>>> that is ending up queuing dwork on an offlined CPU.
>>> We have some in-house code that is running into this problem,
>>> and currently we are fixing it on caller side of queue_delayed_work_on.
>>> Other users who run into this issue, can also use the approach of
>>> fixing it on caller side or we can use the interface introduced
>>> here for such use cases.
>>
>> I'm not sure how necessary this is. If the timer is okay to run on other
>> CPUs, might as well just use queue_delayed_work().
>>
> 
> Yes, right now I can't locate something in upstream kernel that gets
> broken due to the issue mentioned here.
> All (except 3, mentioned further down) users of queued_delayed_work_on
> are using smp_processor_id(), to specify the CPU. So specified CPU can't
> be an already offlined CPU.
> 
> I see below 3 files (in v6.12.6), using queue_delayed_work_on with some sort of cached
> cpu information:
> 
>   drivers/net/ethernet/pensando/ionic/ionic_dev.c -> line 177
>   drivers/scsi/esas2r/esas2r_main.c -> line 1858
>   drivers/scsi/lpfc/lpfc_sli.c
>        -> line 14987
>        -> line 15381 
> But looks like in these cases specified CPU remains online or
> they simply have not encountered the issue mentioned here.
> 
> 
> For this patch, yes the timer is okay to run on other CPUs but that is
> only as a last resort, most of the times it could still run on specified
> CPU (assuming its online)
> 
> Thanks,
> Imran
> 
> 
> 
>> Thanks.
>>
>