Message-ID: <21f69b43-a1e7-6c84-a360-dae410bedb3f@amd.com>
Date:   Fri, 28 Apr 2023 11:14:54 +0530
From:   Vasant Hegde <vasant.hegde@....com>
To:     Jerry Snitselaar <jsnitsel@...hat.com>,
        Peng Zhang <zhangpeng.00@...edance.com>,
        Robin Murphy <robin.murphy@....com>,
        Joerg Roedel <joro@...tes.org>
Cc:     will@...nel.org, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, Li Bin <huawei.libin@...wei.com>,
        Xie XiuQi <xiexiuqi@...wei.com>,
        Yang Yingliang <yangyingliang@...wei.com>,
        Suravee Suthikulpanit <suravee.suthikulpanit@....com>
Subject: Re: [PATCH] iommu: Avoid softlockup and rcu stall in
 fq_flush_timeout().

Hello Robin, Joerg,

Ping. Any suggestions on the below proposal (schedule work on each CPU to free the IOVAs)?

-Vasant



On 4/18/2023 4:11 PM, Vasant Hegde wrote:
> Hi All,
> 
> 
> On 4/11/2023 3:52 AM, Jerry Snitselaar wrote:
>> On Thu, Feb 16, 2023 at 03:11:48PM +0800, Peng Zhang wrote:
>>> There is softlockup under fio pressure test with smmu enabled:
>>> watchdog: BUG: soft lockup - CPU#81 stuck for 22s!  [swapper/81:0]
>>> ...
>>> Call trace:
>>>  fq_flush_timeout+0xc0/0x110
>>>  call_timer_fn+0x34/0x178
>>>  expire_timers+0xec/0x158
>>>  run_timer_softirq+0xc0/0x1f8
>>>  __do_softirq+0x120/0x324
>>>  irq_exit+0x11c/0x140
>>>  __handle_domain_irq+0x6c/0xc0
>>>  gic_handle_irq+0x6c/0x170
>>>  el1_irq+0xb8/0x140
>>>  arch_cpu_idle+0x38/0x1c0
>>>  default_idle_call+0x24/0x44
>>>  do_idle+0x1f4/0x2d8
>>>  cpu_startup_entry+0x2c/0x30
>>>  secondary_start_kernel+0x17c/0x1c8
>>>
>>> Rcu stall may also be triggered:
>>>
>>> rcu: INFO: rcu_sched self-detected stall on CPU
>>> NMI backtrace for cpu 21
>>> CPU: 21 PID: 118 Comm: ksoftirqd/21
>>> ...
>>> Call trace:
>>>  fq_flush_timeout+0x6d/0x90
>>>  ? fq_ring_free+0xc0/0xc0
>>>  call_timer_fn+0x2b/0x120
>>>  run_timer_softirq+0x1a6/0x420
>>>  ? finish_task_switch+0x80/0x280
>>>  __do_softirq+0xda/0x2da
>>>  ? sort_range+0x20/0x20
>>>  run_ksoftirqd+0x26/0x40
>>>  smpboot_thread_fn+0xb8/0x150
>>>  kthread+0x110/0x130
>>>  ? __kthread_cancel_work+0x40/0x40
>>>  ret_from_fork+0x1f/0x30
>>>
>>> This is because the timer callback fq_flush_timeout() may run for more
>>> than 10ms, and timers may be processed back to back in the softirq,
>>> which triggers the softlockup and RCU stall. We can instead use a work
>>> item to handle fq_ring_free() for each CPU, which may take a long time,
>>> and thereby avoid triggering the softlockup and RCU stall.
>>>
>>> This patch is modified from the patch[1] of openEuler.
>>>
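For readers without Peng's original patch handy, the deferral it introduces is
roughly the following (reconstructed from the context lines of the diff further
down): the timer callback stays short and the per-CPU freeing moves to process
context.

static void fq_flush_timeout(struct timer_list *t)
{
	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);

	atomic_set(&cookie->fq_timer_on, 0);
	/* The IOTLB flush itself stays in the timer (softirq) context... */
	fq_flush_iotlb(cookie);
	/* ...but draining every CPU's flush queue can take a long time,
	 * so hand fq_ring_free() off to a work item instead. */
	schedule_work(&cookie->free_iova_work);
}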
>>
>> Hi Robin,
>>
>> I was looking at something similar to this recently, where in one case
>> the system was being hammered with the hazard I/O stress test, and in
>> another with the medusa test tool. In one case they forced a dump on the
>> soft lockup. On the 384-core Genoa, 90 cores were spinning on the iovad
>> rb-tree lock for one domain, one core held it, and the poor flush queue
>> timer handler was having to fight everyone for the lock. I'm not sure
>> what would count as a realistic workload compared to these stressors,
>> but could this become an issue over time as systems continue to get more
>> cores, since the timer handler potentially grabs and releases the iova
>> domain rb-tree lock for each CPU? The only cases I know of use I/O
>> stressors, so I don't know how big a deal it is.
>>
>> I think soft lockups could still be produced with this patch, since
>> there would still be the lock contention.
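(For context, quoting drivers/iommu/iova.c from memory, so worth double-checking
against the tree: the fast-free path stays off the rb-tree lock only while the
per-CPU rcache can absorb the IOVA; once the magazines are full it falls back to
free_iova(), which takes iovad->iova_rbtree_lock, i.e. the lock described above.)

void free_iova_fast(struct iova_domain *iovad, unsigned long pfn,
		    unsigned long size)
{
	/* Try to stash the IOVA in the per-CPU rcache first... */
	if (iova_rcache_insert(iovad, pfn, size))
		return;

	/* ...otherwise fall back to the rb-tree path, which takes
	 * iovad->iova_rbtree_lock. */
	free_iova(iovad, pfn);
}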
> 
> Right. We hit this issue when running stress-ng on a 384-CPU system with an
> NVMe disk. I tried this patch and it improved the situation, but we still hit
> a soft lockup after an hour or so.
> 
> Backtrace with this patch:
> 
> [ 4551.470797] watchdog: BUG: soft lockup - CPU#328 stuck for 30s!
> [kworker/328:1:2710]
> [ 4551.479441] Modules linked in: nvme_fabrics(E) intel_rapl_msr(E)
> intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) kvm_amd(E) kvm(E)
> crct10dif_pclmul(E) ghash_clmulni_intel(E) sha512_ssse3(E) aesni_intel(E)
> crypto_simd(E) cryptd(E) rapl(E) wmi_bmof(E) binfmt_misc(E) ipmi_ssif(E)
> nls_iso8859_1(E) ast(E) drm_shmem_helper(E) drm_kms_helper(E) i2c_algo_bit(E)
> syscopyarea(E) sysfillrect(E) sysimgblt(E) acpi_ipmi(E) ccp(E) k10temp(E)
> ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) mac_hid(E) sch_fq_codel(E) msr(E)
> parport_pc(E) ppdev(E) lp(E) parport(E) ramoops(E) reed_solomon(E) pstore_blk(E)
> pstore_zone(E) efi_pstore(E) drm(E) ip_tables(E) x_tables(E) autofs4(E)
> crc32_pclmul(E) nvme(E) ahci(E) xhci_pci(E) tg3(E) nvme_core(E) libahci(E)
> i2c_piix4(E) xhci_pci_renesas(E) wmi(E)
> [ 4551.479475] CPU: 328 PID: 2710 Comm: kworker/328:1 Kdump: loaded Tainted: G
>           E      6.2.9schd_work+ #1
> [ 4551.479477] Hardware name: AMD Corporation Genoa, BIOS RTI 03/20/2023
> [ 4551.479479] Workqueue: events free_iova_work_func
> [ 4551.479484] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x60
> [ 4551.479488] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 49 89 f0 48 89
> e5 c6 07 00 0f 1f 00 41 f7 c0 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d 50
> f8 fc 7d 74 13 5d 31 c0 89 c2 89 c1 89 c6 89 c7 41 89
> [ 4551.479489] RSP: 0018:ffa0000092e07e30 EFLAGS: 00000206
> [ 4551.479491] RAX: 0000000000000000 RBX: 00000000000000f2 RCX: 0000000000000000
> [ 4551.479492] RDX: 0000000000000000 RSI: 0000000000000282 RDI: ffd1ff3cf82be060
> [ 4551.479492] RBP: ffa0000092e07e30 R08: 0000000000000282 R09: 0000000000000000
> [ 4551.479493] R10: 0000000000000000 R11: 0000000000000000 R12: ff1100010d2af6d0
> [ 4551.479493] R13: ffd1ff3cf82be060 R14: ffd1ff3cf82bb858 R15: 0000000000000282
> [ 4551.479494] FS:  0000000000000000(0000) GS:ff11018056200000(0000)
> knlGS:0000000000000000
> [ 4551.479495] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4551.479496] CR2: 00007ffd392f5f28 CR3: 0000000152f06004 CR4: 0000000000771ee0
> [ 4551.479497] PKRU: 55555554
> [ 4551.479497] Call Trace:
> [ 4551.479498]  <TASK>
> [ 4551.479502]  free_iova_work_func+0x6a/0xc0
> [ 4551.479505]  process_one_work+0x21f/0x440
> [ 4551.479507]  worker_thread+0x50/0x3f0
> [ 4551.479509]  ? __pfx_worker_thread+0x10/0x10
> [ 4551.479510]  kthread+0xee/0x120
> [ 4551.479513]  ? __pfx_kthread+0x10/0x10
> [ 4551.479514]  ret_from_fork+0x2c/0x50
> [ 4551.479518]  </TASK>
> 
> I made a slight modification on top of this patch (schedule the work on each
> CPU) and it worked for me.
> 
> @Robin, Joerg,
>   Is something like the below (schedule work on each CPU to free the IOVAs
>  via fq_ring_free()) acceptable?
> 
> -Vasant
> 
> 
> ------
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index 7aed15e53f10..eb52814e67e5 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -157,10 +157,12 @@ static void fq_flush_iotlb(struct iommu_dma_cookie *cookie)
>  static void fq_flush_timeout(struct timer_list *t)
>  {
>  	struct iommu_dma_cookie *cookie = from_timer(cookie, t, fq_timer);
> +	int cpu;
> 
>  	atomic_set(&cookie->fq_timer_on, 0);
>  	fq_flush_iotlb(cookie);
> -	schedule_work(&cookie->free_iova_work);
> +	for_each_possible_cpu(cpu)
> +		schedule_work_on(cpu, &cookie->free_iova_work);
>  }
> 
>  static void queue_iova(struct iommu_dma_cookie *cookie,
> @@ -235,17 +237,15 @@ static void free_iova_work_func(struct work_struct *work)
>  {
>  	struct iommu_dma_cookie *cookie;
>  	int cpu;
> +	unsigned long flags;
> +	struct iova_fq *fq;
> 
>  	cookie = container_of(work, struct iommu_dma_cookie, free_iova_work);
> -	for_each_possible_cpu(cpu) {
> -		unsigned long flags;
> -		struct iova_fq *fq;
> +	fq = this_cpu_ptr(cookie->fq);
> 
> -		fq = per_cpu_ptr(cookie->fq, cpu);
> -		spin_lock_irqsave(&fq->lock, flags);
> -		fq_ring_free(cookie, fq);
> -		spin_unlock_irqrestore(&fq->lock, flags);
> -	}
> +	spin_lock_irqsave(&fq->lock, flags);
> +	fq_ring_free(cookie, fq);
> +	spin_unlock_irqrestore(&fq->lock, flags);
>  }
> 
>  /* sysfs updates are serialised by the mutex of the group owning @domain */
> 
> ---------
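To illustrate the pairing the modified free_iova_work_func() relies on:
schedule_work_on() runs the handler on the chosen CPU, so this_cpu_ptr() inside
the handler resolves to that CPU's flush queue and only the local lock is taken.
A standalone sketch with made-up names (one work item per CPU here so the
example is self-contained; not the actual dma-iommu code):

#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct demo_fq {
	spinlock_t lock;
	struct work_struct drain_work;	/* one work item per CPU */
	unsigned int pending;
};

static DEFINE_PER_CPU(struct demo_fq, demo_fqs);

static void demo_drain(struct work_struct *work)
{
	/* Queued with schedule_work_on(), so (barring CPU hotplug) this
	 * runs on the owning CPU and this_cpu_ptr() picks its queue. */
	struct demo_fq *fq = this_cpu_ptr(&demo_fqs);
	unsigned long flags;

	spin_lock_irqsave(&fq->lock, flags);
	fq->pending = 0;		/* stand-in for fq_ring_free() */
	spin_unlock_irqrestore(&fq->lock, flags);
}

static void demo_init(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct demo_fq *fq = &per_cpu(demo_fqs, cpu);

		spin_lock_init(&fq->lock);
		INIT_WORK(&fq->drain_work, demo_drain);
	}
}

static void demo_kick_all_cpus(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		schedule_work_on(cpu, &per_cpu(demo_fqs, cpu).drain_work);
}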
> 
> 
> 
> 
> 
