[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <871q4td59k.fsf@oracle.com>
Date: Tue, 18 Jun 2024 19:40:23 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, tglx@...utronix.de,
peterz@...radead.org, torvalds@...ux-foundation.org,
paulmck@...nel.org, rostedt@...dmis.org, mark.rutland@....com,
juri.lelli@...hat.com, joel@...lfernandes.org, raghavendra.kt@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
LKML
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
Shrikanth Hegde <sshegde@...ux.ibm.com> writes:
> On 6/15/24 8:34 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/10/24 12:53 PM, Ankur Arora wrote:
>>>
>> _auto.
>>>>
>>>> 6.10-rc1:
>>>> =========
>>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>>>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>>>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>>>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>>>
>>>> PREEMPT_AUTO:
>>>> ===========
>>>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>>>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>>>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>>>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>>>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>>>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>>>
>>>> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
>>>> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.
>>>
>>> Yeah, the %sys is lower and %irq, higher. Can you also see where the
>>> increased %irq is? For instance are the resched IPIs numbers greater?
>>
>> Hi Ankur,
>>
>>
>> Used mpstat -I ALL to capture this info for 20 seconds.
>>
>> HARDIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45
>>
>> Preempt_auto:
>> ===================
>> 18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85
>>
>> 18,19,22,23 are called XIVE interrupts. These are IPI interrupts. I am not sure which type of IPI are these. will have to see why its increasing.
>>
>>
>> SOFTIRQ per second:
>> ===================
>> 6.10:
>> ===================
>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>> 0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95
>>
>> Preempt_auto:
>> ===================
>> HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
>> 0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77
>>
>> Note: RCU softirq seems to increase significantly. Not sure which one triggers. still trying to figure out why.
>> It maybe irq triggering to softirq or softirq causing more IPI.
>>
>>
>>
>> Also, Noticed a below config difference which gets removed in preempt auto. This happens because PREEMPTION make them as N. Made the changes in kernel/Kconfig.locks to get them
>> enabled. I still see the same regression in hackbench. These configs still may need attention?
>>
>> 6.10 | preempt auto
>> CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
>> CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
>> CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
>>
>>
>
> Did an experiment keeping the number of CPU constant, while changing the number of sockets they span across.
> When all CPU belong to same socket, there is no regression w.r.t to PREEMPT_AUTO. Regression starts when the CPUs start
> spanning across sockets.
Ah. That's really interesting. So, upto 160 CPUs was okay?
> Since Preempt auto by default enables preempt count, I think that may cause the regression. I see Powerpc uses generic implementation
> which may not scale well.
Yeah this would explain why I don't see similar behaviour on a 384 CPU
x86 box.
Also, IIRC the powerpc numbers on preempt=full were significantly worse
than preempt=none. That test might also be worth doing once you have the
percpu based method working.
> Will try to shift to percpu based method and see. will get back if I can get that done successfully.
Sounds good to me.
Thanks
Ankur
Powered by blists - more mailing lists