Message-ID: <20260106144023.381884-2-ionut.nechita@windriver.com>
Date: Tue,  6 Jan 2026 16:40:21 +0200
From: Ionut Nechita <djiony2011@...il.com>
To: bvanassche@....org
Cc: axboe@...nel.dk,
	gregkh@...uxfoundation.org,
	ionut.nechita@...driver.com,
	linux-block@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	ming.lei@...hat.com,
	muchun.song@...ux.dev,
	sashal@...nel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH v2 2/2] block: Fix WARN_ON in blk_mq_run_hw_queue when called from interrupt context

Hi Bart,

Thank you for the thorough and insightful review. You've identified several critical issues with my submission that I need to address.

> 6.6.71 is pretty far away from Jens' for-next branch. Please use Jens'
> for-next branch for testing kernel patches intended for the upstream kernel.

You're absolutely right. I was testing on the stable Debian kernel (6.6.71-rt), where the issue was originally reported. I will fetch Jens' for-next branch, confirm the issue reproduces there, and validate the fix against it before resubmitting.

> Where in the above call stack is the code that disables interrupts?

This was poorly worded on my part, and I apologize for the confusion. The issue is NOT "interrupt context" in the hardirq sense.

What's actually happening (a small instrumentation sketch follows this list):
- **Context:** kworker thread (async SCSI device scan)
- **State:** Running with preemption disabled (atomic context, not hardirq)
- **Path:** Queue destruction during device probe error cleanup
- **Trigger:** On PREEMPT_RT, in_interrupt() returns true when preemption is disabled, even in process context

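For completeness, here is a tiny debug aid one could drop at the top of blk_mq_run_hw_queue() to confirm the context. This is hypothetical instrumentation, not part of the patch; preempt_count(), in_interrupt() and in_task() are standard kernel helpers:

```
/* Hypothetical debug aid, not part of the patch: dump the state
 * the WARN_ON consults. All three helpers are standard kernel APIs. */
pr_info("blk_mq_run_hw_queue: preempt_count=0x%x in_interrupt=%d in_task=%d\n",
	preempt_count(), !!in_interrupt(), in_task());
```
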
The WARN_ON in blk_mq_run_hw_queue() at line 2291 is:
  WARN_ON_ONCE(!async && in_interrupt());

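For reference, the check decomposes roughly as follows on recent kernels (paraphrased from include/linux/preempt.h; exact definitions vary by version and config):

```
/* Paraphrased from include/linux/preempt.h (6.x-era). */
#ifdef CONFIG_PREEMPT_RT
/* On RT, softirq-disable state is tracked per task. */
# define softirq_count()  (current->softirq_disable_cnt & SOFTIRQ_MASK)
# define irq_count()      ((preempt_count() & (NMI_MASK | HARDIRQ_MASK)) | \
                           softirq_count())
#else
# define softirq_count()  (preempt_count() & SOFTIRQ_MASK)
# define irq_count()      (preempt_count() & \
                           (NMI_MASK | HARDIRQ_MASK | SOFTIRQ_MASK))
#endif

#define in_interrupt()    (irq_count())
```

So on RT, a task running softirq work (or with bottom halves disabled) reports in_interrupt() != 0 while still being in process context, which is consistent with the symptom described above.
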
On PREEMPT_RT, this check fires because:
1. blk_freeze_queue_start() calls blk_mq_run_hw_queues(q, false) ← async=false (caller excerpted after this list)
2. This eventually calls blk_mq_run_hw_queue() with async=false
3. in_interrupt() returns true (because preempt_count indicates atomic state)
4. WARN_ON triggers
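
For reference, the caller looks roughly like this (paraphrased from the 6.6-era block/blk-mq.c; exact code varies by version):

```
void blk_freeze_queue_start(struct request_queue *q)
{
	mutex_lock(&q->mq_freeze_lock);
	if (++q->mq_freeze_depth == 1) {
		percpu_ref_kill(&q->q_usage_counter);
		mutex_unlock(&q->mq_freeze_lock);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);	/* synchronous run */
	} else {
		mutex_unlock(&q->mq_freeze_lock);
	}
}
```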

So it's not "interrupt context"; it's atomic context (preemption disabled) being detected by in_interrupt() on an RT kernel.

> How is the above call stack related to the reported problem? The above
> call stack is about request queue allocation while the reported problem
> happens during request queue destruction.

You're absolutely correct, and I apologize for the confusion. I mistakenly included two different call stacks in my commit message:

1. **"scheduling while atomic" during blk_mq_realloc_hw_ctxs** - This was from queue allocation and is a DIFFERENT issue. It should NOT have been included.

2. **WARN_ON during blk_queue_start_drain** - This is the ACTUAL issue that my patch addresses (queue destruction path).

I will revise the commit message to remove the unrelated allocation stack trace and focus solely on the queue destruction path.

>> I apologize for the confusion in my commit message. Should I:
>> 1. Revise the commit message to accurately describe the blk_queue_start_drain() path?
>> 2. Add details about the PREEMPT_RT context causing the atomic state?
>
> The answer to both questions is yes.

Understood. I will prepare v3 with the following corrections:

1. **Test on Jens' for-next branch** - Fetch, reproduce, and validate the fix on the upstream development tree

2. **Accurate context description** - Replace "IRQ thread context" with "kworker context with preemption disabled (atomic context on RT)"

3. **Single, clear call stack** - Remove the confusing allocation stack trace, focus only on the destruction path:
   ```
   scsi_alloc_sdev (error path)
   → __scsi_remove_device
   → blk_mq_destroy_queue
   → blk_queue_start_drain
   → blk_freeze_queue_start
   → blk_mq_run_hw_queues(q, false)  ← Problem: async=false
   ```

4. **Explain PREEMPT_RT specifics** - Clearly describe why in_interrupt() returns true in atomic context on RT kernels, and how changing to async=true avoids the problem (a sketch of that change follows this list)

5. **Accurate problem statement** - This is about avoiding synchronous queue runs in atomic context on RT, not about MSI-X IRQ thread contention (that was a misunderstanding on my part)

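To make item 4 concrete, the change under discussion is essentially this one-liner (sketched against the 6.6-era blk_freeze_queue_start(); the exact hunk in v3 may differ):

```
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ ... @@ void blk_freeze_queue_start(struct request_queue *q)
 		if (queue_is_mq(q))
-			blk_mq_run_hw_queues(q, false);
+			blk_mq_run_hw_queues(q, true);
```
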
I'll respond again once I've validated on for-next and have a corrected v3 ready.

Thank you again for the detailed feedback.

Best regards,
Ionut
--
2.52.0
