linux-kernel - Re: [PATCH] iommu: Avoid softlockup and rcu stall in fq_flush

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7bede423-690c-4f6a-9c23-def4ad08344e@amd.com>
Date:   Tue, 18 Apr 2023 16:11:21 +0530
From:   Vasant Hegde <vasant.hegde@....com>
To:     Jerry Snitselaar <jsnitsel@...hat.com>,
        Peng Zhang <zhangpeng.00@...edance.com>,
        Robin Murphy <robin.murphy@....com>,
        Joerg Roedel <joro@...tes.org>
Cc:     will@...nel.org, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, Li Bin <huawei.libin@...wei.com>,
        Xie XiuQi <xiexiuqi@...wei.com>,
        Yang Yingliang <yangyingliang@...wei.com>,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [PATCH] iommu: Avoid softlockup and rcu stall in
 fq_flush_timeout().

Hi All,


On 4/11/2023 3:52 AM, Jerry Snitselaar wrote:
> On Thu, Feb 16, 2023 at 03:11:48PM +0800, Peng Zhang wrote:
>> There is softlockup under fio pressure test with smmu enabled:
>> watchdog: BUG: soft lockup - CPU#81 stuck for 22s!  [swapper/81:0]
>> ...
>> Call trace:
>>  fq_flush_timeout+0xc0/0x110
>>  call_timer_fn+0x34/0x178
>>  expire_timers+0xec/0x158
>>  run_timer_softirq+0xc0/0x1f8
>>  __do_softirq+0x120/0x324
>>  irq_exit+0x11c/0x140
>>  __handle_domain_irq+0x6c/0xc0
>>  gic_handle_irq+0x6c/0x170
>>  el1_irq+0xb8/0x140
>>  arch_cpu_idle+0x38/0x1c0
>>  default_idle_call+0x24/0x44
>>  do_idle+0x1f4/0x2d8
>>  cpu_startup_entry+0x2c/0x30
>>  secondary_start_kernel+0x17c/0x1c8
>>
>> Rcu stall may also be triggered:
>>
>> rcu: INFO: rcu_sched self-detected stall on CPU
>> NMI backtrace for cpu 21
>> CPU: 21 PID: 118 Comm: ksoftirqd/21
>> ...
>> Call trace:
>>  fq_flush_timeout+0x6d/0x90
>>  ? fq_ring_free+0xc0/0xc0
>>  call_timer_fn+0x2b/0x120
>>  run_timer_softirq+0x1a6/0x420
>>  ? finish_task_switch+0x80/0x280
>>  __do_softirq+0xda/0x2da
>>  ? sort_range+0x20/0x20
>>  run_ksoftirqd+0x26/0x40
>>  smpboot_thread_fn+0xb8/0x150
>>  kthread+0x110/0x130
>>  ? __kthread_cancel_work+0x40/0x40
>>  ret_from_fork+0x1f/0x30
>>
>> This is because the timer callback fq_flush_timeout may run more than
>> 10ms, and timer may be processed continuously in the softirq so trigger
>> softlockup and rcu stall. We can use work to deal with fq_ring_free for
>> each cpu which may take long time, that to avoid triggering softlockup
>> and rcu stall.
>>
>> This patch is modified from the patch[1] of openEuler.
>>
> 
> Hi Robin,
> 
> I was looking at something similar to this recently were in this case
> they were beating the heck out the system with the hazard io stress
> test, and someone else with some medusa test tool. In one case they
> had them force a dump on the soft lockup. On the 384 core genoa, 90
> cores were spinning the iovad rb tree lock for one domain, 1 had it,
> and the poor flush queue timer handler was having to fight everyone
> for the lock. I'm not sure what would be considered a realistic workload
> compared to these stressors, but could this be an issue over time as
> systems continue to get more cores since the timer handler potentially
> grabs and releases the iova domain rb tree lock for each cpu? The only
> cases I know of are using io stressors, so I don't know how big a deal
> it is.
> 
> I think soft lockups could still be produced with this patch, since
> there would still be the lock contention.

Right. We hit this issue when running stress-ng on 384 CPU system having NVMe
disk. I have tried this patch. This improved the situation. But we still hit
soft lockup after an hour or so.

Backtrace with this patch:

 4551.470797] watchdog: BUG: soft lockup - CPU#328 stuck for 30s!
[kworker/328:1:2710]
[ 4551.479441] Modules linked in: nvme_fabrics(E) intel_rapl_msr(E)
intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) kvm(E)
crct10dif_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) aesni_intel(E)
crypto_simd(E) cryptd(E) rapl(E) wmi_bmof(E) binfmt_misc(E) ipmi_ssif(E)
nls_iso8859_1(E) ast(E) drm_shmem_helper(E) drm_kms_helper(E) i2c_algo_bit(E)
syscopyarea(E) sysfillrect(E) sysimgblt(E) acpi_ipmi(E) ccp(E) k10temp(E)
ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) mac_hid(E) sch_fq_codel(E) msr(E)
parport_pc(E) ppdev(E) lp(E) parport(E) ramoops(E) reed_solomon(E) pstore_blk(E)
pstore_zone(E) efi_pstore(E) drm(E) ip_tables(E) x_tables(E) autofs4(E)
crc32_pclmul(E) nvme(E) ahci(E) xhci_pci(E) tg3(E) nvme_core(E) libahci(E)
i2c_piix4(E) xhci_pci_renesas(E) wmi(E)
[ 4551.479475] CPU: 328 PID: 2710 Comm: kworker/328:1 Kdump: loaded Tainted: G
          E      6.2.9schd_work+ #1
[ 4551.479477] Hardware name: AMD Corporation Genoa, BIOS RTI 03/20/2023
[ 4551.479479] Workqueue: events free_iova_work_func
[ 4551.479484] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x60
[ 4551.479488] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 49 89 f0 48 89
e5 c6 07 00 0f 1f 00 41 f7 c0 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d 50
f8 fc 7d 74 13 5d 31 c0 89 c2 89 c1 89 c6 89 c7 41 89
[ 4551.479489] RSP: 0018:ffa0000092e07e30 EFLAGS: 00000206
[ 4551.479491] RAX: 0000000000000000 RBX: 00000000000000f2 RCX: 0000000000000000
[ 4551.479492] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffd1ff3cf82be060
[ 4551.479492] RBP: ffa0000092e07e30 R08: 0000000000000282 R09: 0000000000000000
[ 4551.479493] R10: 0000000000000000 R11: 0000000000000000 R12: ff1100010d2af6d0
[ 4551.479493] R13: ffd1ff3cf82be060 R14: ffd1ff3cf82bb858 R15: 0000000000000282
[ 4551.479494] FS:  0000000000000000(0000) GS:ff11018056200000(0000)
knlGS:0000000000000000
[ 4551.479495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4551.479496] CR2: 00007ffd392f5f28 CR3: 0000000152f06004 CR4: 0000000000771ee0
[ 4551.479497] PKRU: 55555554
[ 4551.479497] Call Trace:
[ 4551.479498]  <TASK>
[ 4551.479502]  free_iova_work_func+0x6a/0xc0
[ 4551.479505]  process_one_work+0x21f/0x440
[ 4551.479507]  worker_thread+0x50/0x3f0
[ 4551.479509]  ? __pfx_worker_thread+0x10/0x10
[ 4551.479510]  kthread+0xee/0x120
[ 4551.479513]  ? __pfx_kthread+0x10/0x10
[ 4551.479514]  ret_from_fork+0x2c/0x50
[ 4551.479518]  </TASK>

I did slight modification on top of this patch (schedule work on each CPU) and
it worked for me.

@Robin, Joerg,
  Does something like below (schedule work on each CPU to free iova
 (fq_ring_free()) is acceptable?

-Vasant


------

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7aed15e53f10..eb52814e67e5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -157,10 +157,12 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
 static void fq_flush_timeout(struct timer_list *t)
 {
 	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
+	int cpu;

 	atomic_set(&cookie->fq_timer_on, 0);
 	fq_flush_iotlb(cookie);
-	schedule_work(&cookie->free_iova_work);
+	for_each_possible_cpu(cpu)
+		schedule_work_on(cpu, &cookie->free_iova_work);
 }

 static void queue_iova(struct iommu_dma_cookie *cookie,
@@ -235,17 +237,15 @@ static void free_iova_work_func(struct work_struct *work)
 {
 	struct iommu_dma_cookie *cookie;
 	int cpu;
+	unsigned long flags;
+	struct iova_fq *fq;

 	cookie = container_of(work, struct iommu_dma_cookie, free_iova_work);
-	for_each_possible_cpu(cpu) {
-		unsigned long flags;
-		struct iova_fq *fq;
+	fq = this_cpu_ptr(cookie->fq);

-		fq = per_cpu_ptr(cookie->fq, cpu);
-		spin_lock_irqsave(&fq->lock, flags);
-		fq_ring_free(cookie, fq);
-		spin_unlock_irqrestore(&fq->lock, flags);
-	}
+	spin_lock_irqsave(&fq->lock, flags);
+	fq_ring_free(cookie, fq);
+	spin_unlock_irqrestore(&fq->lock, flags);
 }

 /* sysfs updates are serialised by the mutex of the group owning @domain */

---------