Message-ID: <28424a58-1159-c3f9-1efb-f1366993afcf@huawei.com>
Date:   Tue, 10 Dec 2019 09:45:45 +0000
From:   John Garry <john.garry@...wei.com>
To:     Ming Lei <ming.lei@...hat.com>, <maz@...nel.org>
CC:     <tglx@...utronix.de>, <chenxiang66@...ilicon.com>,
        <bigeasy@...utronix.de>, <linux-kernel@...r.kernel.org>,
        <hare@...e.com>, <hch@....de>, <axboe@...nel.dk>,
        <bvanassche@....org>, <peterz@...radead.org>, <mingo@...hat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity
 for managed interrupt

On 10/12/2019 01:43, Ming Lei wrote:
> On Mon, Dec 09, 2019 at 02:30:59PM +0000, John Garry wrote:
>> On 07/12/2019 08:03, Ming Lei wrote:
>>> On Fri, Dec 06, 2019 at 10:35:04PM +0800, John Garry wrote:
>>>> Currently the cpu allowed mask for the threaded part of a threaded irq
>>>> handler will be set to the effective affinity of the hard irq.
>>>>
>>>> Typically the effective affinity of the hard irq will be a single cpu. As
>>>> such, the threaded handler would always run on the same cpu as the hard irq.
>>>>
>>>> We have seen scenarios in high data-rate throughput testing where the cpu
>>>> handling the interrupt can become totally saturated handling both the hard
>>>> interrupt and threaded handler parts, limiting throughput.
>>>

Hi Ming,

>>> Frankly speaking, I have never observed a single CPU saturated by one storage
>>> completion queue's interrupt load, because the CPU is still much quicker than
>>> current storage devices.
>>>
>>> If there are more drives, one CPU won't handle more than one queue's (drive's)
>>> interrupts if (nr_drive * nr_hw_queues) < nr_cpu_cores.
>>
>> Are things this simple? I mean, can you guarantee that fio processes are
>> evenly distributed as such?
> 
> That is why I am asking you for the details of your test.
> 
> If you mean hisilicon SAS,

Yes, it is.

> the interrupt load should have been distributed
> well given the device has multiple reply queues for distributing interrupt
> load.
> 
>>
>>>
>>> So could you describe your case in a bit more detail? Then we can confirm
>>> whether this change is really needed.
>>
>> The issue is that the CPU is saturated in servicing the hard and threaded
>> part of the interrupt together - here's the sort of thing which we saw
>> previously:
>> Before:
>> CPU	%usr	%sys	%irq	%soft	%idle
>> all	2.9	13.1	1.2	4.6	78.2				
>> 0	0.0	29.3	10.1	58.6	2.0
>> 1	18.2	39.4	0.0	1.0	41.4
>> 2	0.0	2.0	0.0	0.0	98.0
>>
>> CPU0 effectively has no idle time.
> 
> The result just shows the saturation; we need to root-cause it instead
> of working around it via random changes.
> 
>>
>> Then, by allowing the threaded part to roam:
>> After:
>> CPU	%usr	%sys	%irq	%soft	%idle
>> all	3.5	18.4	2.7	6.8	68.6
>> 0	0.0	20.6	29.9	29.9	19.6
>> 1	0.0	39.8	0.0	50.0	10.2
>>
>> Note: I think that I may be able to reduce the hard irq part of the load in
>> the endpoint driver, but not by enough to make this issue go away.
>>
>>>
>>>>
>>>> For when the interrupt is managed, allow the threaded part to run on all
>>>> cpus in the irq affinity mask.
>>>
>>> I remember that a performance drop was observed with this approach in some
>>> tests.
>>
>> From checking the thread about the NVMe interrupt swamp, just switching to a
>> threaded handler alone degrades performance. I didn't see any specific
>> results for this change from Long Li - https://lkml.org/lkml/2019/8/21/128
> 
> I am pretty clear on the reason in the Azure case: it is caused by aggressive
> interrupt coalescing. That behavior shouldn't be very common, and it can be
> addressed by the following patch:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2019-November/028008.html
> 
> Then please share your lockup story: which HBA/driver, the test steps, whether
> you complete IOs from multiple disks (LUNs) on a single CPU, whether you have
> multiple queues, how many active LUNs are involved in the test, ...

There is no lockup; this change is just about a potential performance boost.
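
For anyone curious about the scope: the mechanics live in
irq_thread_check_affinity() in kernel/irq/manage.c, which is where the
thread's allowed mask is copied from the hard irq's effective affinity.
A sketch of the kind of hunk I mean (illustrative rather than the exact
diff; mask and valid are the existing locals in that function):

	raw_spin_lock_irq(&desc->lock);
	if (cpumask_available(desc->irq_common_data.affinity)) {
		const struct cpumask *m;

		/* Managed irq: let the thread roam the whole affinity mask */
		if (irqd_affinity_is_managed(&desc->irq_data))
			m = irq_data_get_affinity_mask(&desc->irq_data);
		else
			m = irq_data_get_effective_affinity_mask(&desc->irq_data);
		cpumask_copy(mask, m);
	} else {
		valid = false;
	}
	raw_spin_unlock_irq(&desc->lock);

	if (valid)
		set_cpus_allowed_ptr(current, mask);

The hard irq affinity itself is untouched; only the threaded part gets to roam.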

My colleague Xiang Chen can provide specifics of the test, as he is the 
one running it.

But one key bit of info, which I did not think particularly relevant before,
is that we have 2x SAS controllers running the throughput test on the same
host.

As such, the completion queue interrupts would be spread identically over
the CPUs for each controller. I notice that the ARM GICv3 ITS interrupt
controller (which we use) does not use the generic irq matrix allocator,
which I think would really help here.

Hi Marc,

Is there any reason why we couldn't use the generic irq matrix allocator
for GICv3?

Thanks,
John
