linux-kernel - Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

Message-ID: <871q4td59k.fsf@oracle.com>
Date: Tue, 18 Jun 2024 19:40:23 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, tglx@...utronix.de,
        peterz@...radead.org, torvalds@...ux-foundation.org,
        paulmck@...nel.org, rostedt@...dmis.org, mark.rutland@....com,
        juri.lelli@...hat.com, joel@...lfernandes.org, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        LKML
 <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <sshegde@...ux.ibm.com> writes:

> On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/10/24 12:53 PM, Ankur Arora wrote:
>>>
>> _auto.
>>>>
>>>> 6.10-rc1:
>>>> =========
>>>> 10:09:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>>> 09:45:23 AM  all    4.14    0.00   77.57    0.00   16.92    0.00    0.00    0.00    0.00    1.37
>>>> 09:45:24 AM  all    4.42    0.00   77.62    0.00   16.76    0.00    0.00    0.00    0.00    1.20
>>>> 09:45:25 AM  all    4.43    0.00   77.45    0.00   16.94    0.00    0.00    0.00    0.00    1.18
>>>> 09:45:26 AM  all    4.45    0.00   77.87    0.00   16.68    0.00    0.00    0.00    0.00    0.99
>>>>
>>>> PREEMPT_AUTO:
>>>> ===========
>>>> 10:09:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
>>>> 10:09:56 AM  all    3.11    0.00   72.59    0.00   21.34    0.00    0.00    0.00    0.00    2.96
>>>> 10:09:57 AM  all    3.31    0.00   73.10    0.00   20.99    0.00    0.00    0.00    0.00    2.60
>>>> 10:09:58 AM  all    3.40    0.00   72.83    0.00   20.85    0.00    0.00    0.00    0.00    2.92
>>>> 10:10:00 AM  all    3.21    0.00   72.87    0.00   21.19    0.00    0.00    0.00    0.00    2.73
>>>> 10:10:01 AM  all    3.02    0.00   72.18    0.00   21.08    0.00    0.00    0.00    0.00    3.71
>>>>
>>>> Used the bcc tools hardirq and softirq to see if IRQs are increasing. softirq implied there are more
>>>> timer and sched softirqs. Numbers vary between samples, but the trend seems to be similar.
>>>
>>> Yeah, the %sys is lower and %irq, higher. Can you also see where the
>>> increased %irq is? For instance, are the resched IPI numbers greater?
>>
>> Hi Ankur,
>>
>>
>> Used mpstat -I ALL to capture this info for 20 seconds.
>>
>> HARDIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> 18		19		22		23		48	49	50	51	LOC		BCT	LOC2	SPU	PMI	MCE	NMI	WDG	DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 417956.86	1114642.30	1712683.65	2058664.99	0.00	0.00	18.30	0.39	31978.37	0.00	0.35	351.98	0.00	0.00	0.00	6405.54	329189.45
>>
>> Preempt_auto:
>> ===================
>> 18		19		22		23		48	49	50	51	LOC		BCT	LOC2	SPU	PMI	MCE	NMI	WDG	DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 609509.69	1910413.99	1923503.52	2061876.33	0.00	0.00	19.14	0.30	31916.59	0.00	0.45	497.88	0.00	0.00	0.00	6825.49	88247.85
>>
>> 18, 19, 22 and 23 are XIVE interrupts. These are IPIs. I am not sure which type of IPI these are. Will have to see why they are increasing.
>>
>>
>> SOFTIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> HI	TIMER	NET_TX	NET_RX	BLOCK	IRQ_POLL	TASKLET		SCHED		HRTIMER		RCU
>> 0.00	3966.47	0.00	18.25	0.59	0.00		0.34		12811.00	0.00		9693.95
>>
>> Preempt_auto:
>> ===================
>> HI	TIMER	NET_TX	NET_RX	BLOCK	IRQ_POLL	TASKLET		SCHED		HRTIMER		RCU
>> 0.00	4871.67	0.00	18.94	0.40	0.00		0.25		13518.66	0.00		15732.77
>>
>> Note: RCU softirq seems to increase significantly. Not sure what triggers it; still trying to figure out why.
>> It may be the IRQs triggering softirqs, or the softirqs causing more IPIs.
>>
>>
>>
>> Also noticed the config difference below: these options get removed with preempt auto. This happens because PREEMPTION sets them to N. Changed kernel/Kconfig.locks to get them
>> enabled. I still see the same regression in hackbench. These configs may still need attention?
>>
>> 					6.10				       | 					preempt auto
>>   CONFIG_INLINE_SPIN_UNLOCK_IRQ=y                                              |  CONFIG_UNINLINE_SPIN_UNLOCK=y
>>   CONFIG_INLINE_READ_UNLOCK=y                                                  |  ----------------------------------------------------------------------------
>>   CONFIG_INLINE_READ_UNLOCK_IRQ=y                                              |  ----------------------------------------------------------------------------
>>   CONFIG_INLINE_WRITE_UNLOCK=y                                                 |  ----------------------------------------------------------------------------
>>   CONFIG_INLINE_WRITE_UNLOCK_IRQ=y                                             |  ----------------------------------------------------------------------------
>>
>>
>
> Did an experiment keeping the number of CPUs constant while changing the number of sockets they span.
> When all CPUs belong to the same socket, there is no regression w.r.t. PREEMPT_AUTO. The regression starts when the CPUs start
> spanning across sockets.

Ah. That's really interesting. So, up to 160 CPUs was okay?

> Since PREEMPT_AUTO enables preempt count by default, I think that may cause the regression. I see powerpc uses the generic implementation,
> which may not scale well.

Yeah, this would explain why I don't see similar behaviour on a 384 CPU
x86 box.

Also, IIRC the powerpc numbers on preempt=full were significantly worse
than preempt=none. That test might also be worth doing once you have the
percpu based method working.

> Will try to shift to the percpu-based method and see. Will get back if I can get that done successfully.

Sounds good to me.


Thanks
Ankur
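
For reference, the per-second interrupt rates quoted in this thread can be
compared as relative changes. The snippet below is only an illustrative
sketch and was not part of the original exchange: the numbers are copied from
the quoted mpstat -I ALL output, while the dictionary names and the
percent_change() helper are made up for this example.

```python
# Per-second rates copied from the mpstat -I ALL output quoted in the thread.
# Column labels are kept as mpstat printed them; "IPI 18" etc. refer to the
# XIVE interrupt numbers 18, 19, 22 and 23 described as IPIs in the thread.
baseline_6_10 = {
    "IPI 18": 417956.86,
    "IPI 19": 1114642.30,
    "IPI 22": 1712683.65,
    "IPI 23": 2058664.99,
    "DBL": 329189.45,
    "TIMER softirq": 3966.47,
    "SCHED softirq": 12811.00,
    "RCU softirq": 9693.95,
}
preempt_auto = {
    "IPI 18": 609509.69,
    "IPI 19": 1910413.99,
    "IPI 22": 1923503.52,
    "IPI 23": 2061876.33,
    "DBL": 88247.85,
    "TIMER softirq": 4871.67,
    "SCHED softirq": 13518.66,
    "RCU softirq": 15732.77,
}


def percent_change(before, after):
    """Relative change from before to after, in percent (hypothetical helper)."""
    return (after - before) / before * 100.0


changes = {
    name: percent_change(baseline_6_10[name], preempt_auto[name])
    for name in baseline_6_10
}

# Print the sources sorted from largest increase to largest decrease.
for name, pct in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {pct:+7.1f}%")
```

This only restates the single 20-second sample quoted above, and the thread
notes that numbers vary between samples, so the percentages show a trend
rather than a precise measurement.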