netdev - Re: mlx5 broken affinity

Message-ID: <0bf8b9a8-4931-526d-2d0b-848a8e61f359@fb.com>
Date:   Thu, 9 Nov 2017 17:03:38 -0500
From:   Jes Sorensen <jsorensen@...com>
To:     Thomas Gleixner <tglx@...utronix.de>
CC:     Sagi Grimberg <sagi@...mberg.me>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@....mellanox.co.il>,
        Networking <netdev@...r.kernel.org>,
        Leon Romanovsky <leonro@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Kernel Team <kernel-team@...com>,
        Christoph Hellwig <hch@....de>
Subject: Re: mlx5 broken affinity

On 11/08/2017 12:33 PM, Thomas Gleixner wrote:
> On Wed, 8 Nov 2017, Jes Sorensen wrote:
>> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
>>> Depending on the machine and the number of queues this might even result in
>>> completely losing the ability to suspend/hibernate because the number of
>>> available vectors on CPU0 is not sufficient to accommodate all queue
>>> interrupts.
>>
>> Depending on the system, suspend/resume is really the less interesting
>> issue to the user. Pretty much any system with a 10/25Gbps mlx5 NIC in
>> it will be in some sort of rack and is unlikely to ever want to
>> suspend/resume.
> 
> The discussions with Intel about that tell a different story and cpu
> online/offline for power management purposes is - while debatable - widely
> used.

I certainly do not want to dispute that, it just underlines that
different users have different priorities.

>>> A lot of things are possible, the question is whether it makes sense. The
>>> whole point is to have resources (queues, interrupts etc.) per CPU and have
>>> them strictly associated.
>>>
>>> Why would you give the user a knob to destroy what you carefully optimized?
>>>
>>> Just because we can and just because users love those knobs or is there any
>>> real technical reason?
>>
>> Because the user sometimes knows better based on statically assigned
>> loads, or the user wants consistency across kernels. It's great that the
>> system is better at allocating this now, but we also need to allow for a
>> user to change it. Like anything on Linux, a user wanting to blow off
>> his/her own foot, should be allowed to do so.
> 
> That's fine, but that's not what the managed affinity facility provides. If
> you want to leverage the spread mechanism, but avoid the managed part, then
> this is a different story and we need to figure out how to provide that
> without breaking the managed side of it.
> 
> As I said it's possible, but I vehemently disagree that this is a bug in
> the core code, as it was claimed several times in this thread.

So it may be that my original message was confusing on this. Currently
IRQ-affinity.txt describes how a user can change the affinity, but that
no longer works if an IRQ is marked managed. That is what I qualified as
a bug in my original posting. If an IRQ is not meant to have its affinity
modified because it is managed, then I think the permissions on the /proc
file should also change to read-only (r--r--r--) rather than the write
failing with EIO, but that is a minor detail IMHO.
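
To illustrate the behaviour described above, here is a minimal sketch of
writing a CPU mask to /proc/irq/<N>/smp_affinity and treating EIO as "this
IRQ is managed". The function names and the proc_root parameter are
hypothetical and exist only for this illustration; the mask format assumed
here is the hex bitmask, optionally split into comma-separated 32-bit
groups, that the file reports when read.

```python
import errno
import os


def parse_affinity_mask(text):
    """Turn a hex mask such as '00000000,0000000f' into a list of CPU numbers."""
    mask = int(text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def format_affinity_mask(cpus):
    """Turn a list of CPU numbers into comma-separated 32-bit hex groups."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    chunks = []
    while True:
        chunks.append(format(mask & 0xFFFFFFFF, "08x"))
        mask >>= 32
        if mask == 0:
            break
    # The most significant group comes first.
    return ",".join(reversed(chunks))


def try_set_affinity(irq, cpus, proc_root="/proc"):
    """Try to set the affinity of an IRQ.

    Returns True if the write succeeded and False if it failed with EIO,
    which is the error reported in this thread for managed IRQs. Any other
    error (for example EACCES when not running as root) is re-raised.
    """
    path = os.path.join(proc_root, "irq", str(irq), "smp_affinity")
    try:
        with open(path, "w") as f:
            f.write(format_affinity_mask(cpus))
    except OSError as e:
        if e.errno == errno.EIO:
            return False
        raise
    return True
```

A caller would use something like try_set_affinity(42, [0, 1]) as root,
where 42 is a placeholder IRQ number. If the file were read-only for
managed IRQs, as suggested above, the failure would show up as a
permission error at open time instead of EIO at write time.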

None of this is a major showstopper for the next kernel release, as long
as we can work on a longer term fix that satisfies everyone.

What I would ideally like to see is the option for drivers to benefit
from the new allocation scheme without being locked in and also have the
option to go managed if that is right for the given devices. Basically a
best of both worlds situation.

> The real issue is that the driver was converted to something which was
> expected to behave differently. That's hardly a bug in the core code, at
> most it's a documentation problem.

When I hit this issue I read the driver commit stating it was taking
advantage of the new allocation. Not having been part of the discussion
and design of the new core code, I just caught what was described in the
commit message for mlx5.

For now I have just reverted the offending mlx5 patch in our kernel
while we figure out how to resolve it long term.

Cheers,
Jes