linux-kernel - Re: [RFC PATCH v4] sched: Fix performance regression introduced by mm

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f51288fa-b9d7-c7ef-fca8-b17535694d48@efficios.com>
Date:   Thu, 13 Apr 2023 09:38:46 -0400
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Aaron Lu <aaron.lu@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        linux-kernel@...r.kernel.org, Olivier Dion <odion@...icios.com>,
        michael.christie@...cle.com
Subject: Re: [RFC PATCH v4] sched: Fix performance regression introduced by
 mm_cid

On 2023-04-13 09:13, Aaron Lu wrote:
> On Wed, Apr 12, 2023 at 04:57:50PM -0400, Mathieu Desnoyers wrote:
>> On 2023-04-12 00:27, Aaron Lu wrote:
>>> Just noticed below warning in dmesg on v4, the warning is triggered by
>>> WARN_ON_ONCE((int) mm_cid < 0); in rseq_update_cpu_node_id().
>>
>> I think I know what triggers this in v4.
>>
>> See sched_mm_cid_migrate_from() (near the end):
>>
>> +       /*
>> +        * The src_cid is unused, so it can be unset.
>> +        */
>> +       if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
>> +           mm_cid_set_lazy_put(src_cid))
>> +               return;
>> +       __mm_cid_put(mm, src_cid);
>>
>> There is a short window here between successful cmpxchg and clear mask bit
>> where the thread has ownership of the cid without rq lock held. This can
>> lead to rare over-allocation of cids beyond nr_possible_cpus if the src cpu
>> runqueue reallocates a cid concurrently, which causes __mm_cid_get() to
>> return -1.
>>
>> Because we want to avoid taking the src rq lock in migrate-from, I plan to
>> fix this by doing the following:
>>
>> - disable interrupts around this cmpxchg and __mm_cid_put(),
> 
> To reduce the short window?

Yes, just reduce. Eliminating the window would require grabbing the
src rq lock which we want to avoid.

> 
>> - modify __mm_cid_get so it retries if cpumask_first_zero() returns -1.
>>
>> This should take care of this kind of extremely rare over-allocation
>> scenario through retry, which will be bounded by the duration of the
>> instruction sequence between cmpxchg and clearing the bit in the bitmask.
>>
>> I have not reproduced the warning on my end, only figured this out from
>> code review.
>>
>> Thoughts ?
> 
> Sounds reasonable to me.
> 
> I tested below diff to verify your theory:
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..2c0764e53b83 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
> 
>          cpumask = mm_cidmask(mm);
>          cid = cpumask_first_zero(cpumask);
> -       if (cid >= nr_cpu_ids)
> +       if (cid >= nr_cpu_ids) {
> +               WARN_ON_ONCE(1);
>                  return -1;
> +       }
>          cpumask_set_cpu(cid, cpumask);
>          return cid;
>   }
> 
> And it indeed fired.
> 
> I then applied the below diff according to your description:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 425766cc1300..881571519a90 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -11536,6 +11536,7 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>   	struct mm_struct *mm = t->mm;
>   	struct rq *src_rq;
>   	struct task_struct *src_task;
> +	unsigned long flags;
>   
>   	if (!mm)
>   		return;
> @@ -11623,10 +11624,12 @@ void sched_mm_cid_migrate_from(struct task_struct *t)
>   	/*
>   	 * The src_cid is unused, so it can be unset.
>   	 */
> +	local_irq_save(flags);
>   	if (cmpxchg(src_pcpu_cid, mm_cid_set_lazy_put(src_cid), MM_CID_UNSET) !=
>   	    mm_cid_set_lazy_put(src_cid))
>   		return;
>   	__mm_cid_put(mm, src_cid);
> +	local_irq_restore(flags);
>   }
>   
>   /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f3e7dc2cd1cc..41ed7022bab0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3315,8 +3315,10 @@ static inline int __mm_cid_get_locked(struct mm_struct *mm)
>   
>   	cpumask = mm_cidmask(mm);
>   	cid = cpumask_first_zero(cpumask);
> -	if (cid >= nr_cpu_ids)
> +	if (cid >= nr_cpu_ids) {
> +		WARN_ON_ONCE(1);
>   		return -1;
> +	}
>   	cpumask_set_cpu(cid, cpumask);
>   	return cid;
>   }
> @@ -3345,8 +3347,10 @@ static inline int mm_cid_get(struct mm_struct *mm)
>   		if (cmpxchg(pcpu_cid, cid, MM_CID_UNSET) == cid)
>   			__mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
>   	}
> +	cid = -1;
>   	raw_spin_lock(&mm->cid_lock);
> -	cid = __mm_cid_get_locked(mm);
> +	while (cid == -1)
> +		cid = __mm_cid_get_locked(mm);
>   	raw_spin_unlock(&mm->cid_lock);
>   	WRITE_ONCE(*pcpu_cid, cid);
>   	return cid;
> 
> And did several iterations of hackbench, no more warnings from the
> original place rseq_update_cpu_node_id(). It appears solved the issue,
> will let you know if I ever hit it again.
> 
> One more thing is, with the above diff applied, I still get the warning
> about cid == -1 in __mm_cid_get_locked() so I suppose disabling irq in
> migrate_from can't entirely close the race.

Yes, this is expected. We need to do a retry loop on "get" when we find
a fully set bitmap and retry. This is why I want to disable irqs in
migrate-from: to keep this retry loop done with irqs off and rq lock 
held as small as possible. Disabling interrupts on the other side 
achieves this.

> 
> Another thing, it doesn't appear there is an user of __mm_get_cid()?

True. I'm preparing a v5 which significantly changes things: I introduce
a lock-free get, which hopefully should help reducing contention on the 
qspinlock.

Thanks,

Mathieu


> 
> Thanks,
> Aaron
>   
>>>
>>> [ 1819.649803] ------------[ cut here ]------------
>>> [ 1819.649813] WARNING: CPU: 188 PID: 29881 at kernel/rseq.c:95 __rseq_handle_notify_resume+0x49b/0x590
>>> [ 1819.649823] Modules linked in: veth tls xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel binfmt_misc kvm nls_iso8859_1 rapl intel_cstate mei_me isst_if_mbox_pci isst_if_mmio idxd input_leds joydev isst_if_common idxd_bus mei ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr ramoops pstore_blk reed_solomon pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid ast i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect
>>> [ 1819.649903]  sysimgblt crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel dax_hmem cxl_acpi crypto_simd nvme cryptd cxl_core nvme_core i2c_i801 xhci_pci i40e drm i2c_smbus xhci_pci_renesas i2c_ismt wmi pinctrl_emmitsburg
>>> [ 1819.649924] CPU: 188 PID: 29881 Comm: hackbench Not tainted 6.3.0-rc6-00002-g1acfd6ae9afc #2
>>> [ 1819.649927] Hardware name: Intel Corporation D50DNP1SBB/D50DNP1SBB, BIOS SE5C7411.86B.8901.D03.2210131232 10/13/2022
>>> [ 1819.649929] RIP: 0010:__rseq_handle_notify_resume+0x49b/0x590
>>> [ 1819.649934] Code: f0 ff ff ff ff ff 00 e9 00 fe ff ff 48 ba 00 f0 ff ff ff ff ff 00 e9 3a fe ff ff 48 ba 00 f0 ff ffff ff ff 00 e9 66 fe ff ff <0f> 0b e9 7d fc ff ff 0f 01 ca e9 a6 fc ff ff 48 8b 4c 24 30 48 8b
>>> [ 1819.649936] RSP: 0018:ffa0000018b0fe60 EFLAGS: 00010286
>>> [ 1819.649939] RAX: ff11007f73500000 RBX: 00007f81a7226fe0 RCX: 0000000000000000
>>> [ 1819.649941] RDX: 00000000ffffffff RSI: 0000000000000001 RDI: ffffffff828f9477
>>> [ 1819.649943] RBP: ffa0000018b0fee8 R08: ff110040c192cc28 R09: ff1100408e2a3980
>>> [ 1819.649944] R10: 0000000000000001 R11: 0000000000000000 R12: ff110040caa64000
>>> [ 1819.649946] R13: 000000000002fa40 R14: 0000000000000000 R15: 00000000000000bc
>>> [ 1819.649947] FS:  00007f81a7226640(0000) GS:ff11007f73500000(0000) knlGS:0000000000000000
>>> [ 1819.649950] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 1819.649951] CR2: 00007fd8b00d5000 CR3: 00000040cf060001 CR4: 0000000000f71ee0
>>> [ 1819.649953] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 1819.649955] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
>>> [ 1819.649957] PKRU: 55555554
>>> [ 1819.649958] Call Trace:
>>> [ 1819.649960]  <TASK>
>>> [ 1819.649964]  exit_to_user_mode_prepare+0x13b/0x1a0
>>> [ 1819.649970]  syscall_exit_to_user_mode+0x2a/0x50
>>> [ 1819.649976]  ? __x64_sys_read+0x1d/0x30
>>> [ 1819.649980]  do_syscall_64+0x6d/0x90
>>> [ 1819.649984]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>> [ 1819.649990] RIP: 0033:0x7f81a77149cc
>>> [ 1819.649996] Code: ec 28 48 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 c0 f7 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 34 44 89 c7 48 89 44 24 08 e8 ff c0 f7 ff 48
>>> [ 1819.649998] RSP: 002b:00007f81a7225d70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>>> [ 1819.650004] RAX: 0000000000000064 RBX: 00007f81a7225da0 RCX: 00007f81a77149cc
>>> [ 1819.650005] RDX: 0000000000000064 RSI: 00007f81a7225da0 RDI: 0000000000000135
>>> [ 1819.650007] RBP: 00007f81a7225e50 R08: 0000000000000000 R09: 0000000000000000
>>> [ 1819.650008] R10: 0000000000000008 R11: 0000000000000246 R12: 00000000004a3c12
>>> [ 1819.650009] R13: 00007f81a7225e10 R14: 0000000000000000 R15: 0000558df7af6d00
>>> [ 1819.650012]  </TASK>
>>> [ 1819.650013] ---[ end trace 0000000000000000 ]---
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com