netdev - Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

Message-ID: <d88f2288-2292-1569-a336-aa4075dd74ea@grimberg.me>
Date:   Mon, 13 Nov 2017 21:20:41 +0200
From:   Sagi Grimberg <sagi@...mberg.me>
To:     Thomas Gleixner <tglx@...utronix.de>, Jens Axboe <axboe@...com>
Cc:     Jes Sorensen <jsorensen@...com>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@....mellanox.co.il>,
        Networking <netdev@...r.kernel.org>,
        Leon Romanovsky <leonro@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Kernel Team <kernel-team@...com>,
        Christoph Hellwig <hch@....de>
Subject: Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

Hi Thomas,

> What can be done with some time to work on?
> 
> The managed mechanism consists of 3 pieces:
> 
>   1) Vector spreading
> 
>   2) Managed vector allocation, which becomes a guaranteed reservation in
>      4.15 due to the big rework of the vector management code.
> 
>      Non managed interrupts get a best effort reservation to handle the CPU
>      unplug vector pressure problem in a sane way.
> 
>   3) CPU hotplug management
> 
>      If the last CPU in the affinity set goes offline, then the interrupt is
>      shutdown and restarted when the first CPU in the affinity set comes
>      online again. The driver code needs to ensure that the queue associated
>      to that interrupt is drained before shutdown and nothing is queued
>      there after this point.
> 
> So we have options:
> 
> 1) Initial vector spreading
> 
>   Let the driver use the initial vector spreading. That does only the
>   initial affinity setup, but otherwise the interrupts are handled like any
>   other non managed interrupt, i.e. best effort reservation, affinity
>   settings enabled and CPU unplug breaks affinity and moves them to some
>   random other online CPU.
> 
>   The simplest solution of all.
> 
> 2) Allowing a driver supplied mask
> 
>   Certainly simple to do, but as you said it's not really a solution. I'm
>   not sure whether we want to go there as this is going to be replaced fast
>   enough and then create another breakage/frustration level.
> 
> 
> 3) Affinity override in managed mode
> 
>   Doable, but there are a couple of things to think about:

I think it would be good to shoot for (3). Given that there are
driver requirements, I'd say the driver should expose up front whether
it can handle it, and if not we fall back to (1).
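
To make the fallback concrete, here is a minimal sketch in plain C. The
names (affinity_policy, driver_caps, choose_policy) are made up for
illustration and are not existing kernel interfaces; the only assumption
is that the driver declares a single capability bit at setup time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not existing kernel interfaces. */
enum affinity_policy {
    POLICY_INITIAL_SPREAD,   /* option (1): spread once, then non-managed */
    POLICY_MANAGED_OVERRIDE, /* option (3): managed, affinity may be overridden */
};

struct driver_caps {
    /* Set by the driver up front if it can drain/restart queues on a move. */
    bool handles_affinity_override;
};

/* Pick option (3) only when the driver opted in, otherwise fall back to (1). */
static enum affinity_policy choose_policy(const struct driver_caps *caps)
{
    assert(caps != NULL);
    return caps->handles_affinity_override ? POLICY_MANAGED_OVERRIDE
                                           : POLICY_INITIAL_SPREAD;
}
```

A driver that sets nothing ends up with option (1), so existing drivers
would keep today's behavior unless they opt in.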

>    * How is this enabled?
> 
>      - Opt-in by driver
> 	
>      - Extra sysfs/procfs knob
> 
>      We definitely should not enable it per default because that would
>      surprise users/drivers which work with the current managed devices and
>      rely on the affinity files to be non writeable in managed mode.

Do you know if any exist? Would it make sense to have a survey to
understand if anyone relies on it?

From what I've seen so far, drivers that were converted simply worked
with the non-managed facility and didn't have any special code for it.
Perhaps Christoph can comment, as he converted most of them.

But if there aren't any drivers that absolutely rely on it, maybe it's
not a bad idea to allow it by default?


>    * When and how is the driver informed about the change?
> 
>       When:
> 
>         #1 Before the core tries to move the interrupt so it can veto the
>           move if it cannot allocate new resources or whatever is required
>           to operate after the move.

What would the core do if a driver vetoes a move? I'm wondering under what
conditions a driver would be unable to allocate resources for a move to
CPU X but able to allocate them for a move to CPU Y.

> 	
>         #2 After the core made the move effective because:
> 
>            - The interrupt might be moved from an offline set to an online
>              set and needs to be started up, so the related queue must be
>              enabled as well.
> 
>            - The interrupt might be moved from an online set to an offline
>              set, so the queue needs to be drained and disabled.
> 
>            - Resources which have been allocated in the first step must be
>              made effective and old resources freed.
> 
>       How:
> 
>         The existing affinity notification mechanism does not work for this
>         and it's a horrible piece of crap which should go away sooner than
>         later.
> 
>        So we need some sensible way to provide callbacks. Emphasis on
>         callbacks as one multiplexing callback is not a good idea.
> 
>    * How can the change be made effective?
> 
>      When the preliminaries (vector reservation on the new set and
>      possibly resource allocation in the subsystem) have been done, then the
>      actual move can be made.
> 
>      But, there is a caveat. x86 is not good at reassociating interrupts on
>      the fly except when it sits behind an interrupt remapping unit, but we
>      cannot rely on that.
> 
>      So the change flow which works for everything would be:
> 
>      if (reserve_vectors() < 0)
>         return FAIL;
> 
>      if (subsys_prep_callback() < 0) {
>         release_vectors();
>         return FAIL;
>      }
> 
>      shutdown(irq);
> 
>      if (!online(newset))
>         return SUCCESS;
> 
>      startup(irq);
> 
>      subsys_post_callback();
>      return SUCCESS;
> 
>      subsys_prep_callback() must basically work the same way as the CPU
>      offline mechanism and drain the queue and prevent queueing before the
>      irq is restarted. If the move results in keeping it shutdown because
>      the new set is offline, then the irq will be restarted via the CPU
>      hotplug code and the subsystem will be informed about that via the
>      hotplug mechanism as well.
> 
>      subsys_post_callback() is more or less the same as the hotplug callback
>      and restarts the queue. The only difference to the hotplug code as of
>      today is that it might need to make previously allocated resources
>      effective and free the old ones.
> 
>      I named that subsys_*_callback() on purpose because this should be
>      handled in a generic way for multiqueue devices and not done at the
>      driver level.
> 
>    There are some very interesting locking problems to solve, especially
>    vs. CPU hotplug, but that should be solvable.

This looks like it can work to me, but I'm probably not familiar enough
to see the full picture here.
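
For reference, the quoted change flow can be modelled in plain user-space
C as below. This is only a sketch to check the ordering of the steps: the
move_ops structure, move_irq() and the mock backend are hypothetical names
invented here, and none of it is existing kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not existing kernel API. */
enum move_result { MOVE_FAIL = -1, MOVE_SUCCESS = 0 };

struct move_ops {
    int  (*reserve_vectors)(void *ctx);  /* < 0: no vectors on the new set */
    void (*release_vectors)(void *ctx);
    int  (*subsys_prep)(void *ctx);      /* drain, stop queueing; < 0 vetoes */
    void (*shutdown)(void *ctx);
    bool (*newset_online)(void *ctx);
    void (*startup)(void *ctx);
    void (*subsys_post)(void *ctx);      /* restart queue, free old resources */
};

/* Same ordering as the quoted change flow. */
static enum move_result move_irq(const struct move_ops *ops, void *ctx)
{
    assert(ops != NULL);
    if (ops->reserve_vectors(ctx) < 0)
        return MOVE_FAIL;
    if (ops->subsys_prep(ctx) < 0) {
        ops->release_vectors(ctx);
        return MOVE_FAIL;
    }
    ops->shutdown(ctx);
    if (!ops->newset_online(ctx))
        return MOVE_SUCCESS;  /* stays shut down until hotplug restarts it */
    ops->startup(ctx);
    ops->subsys_post(ctx);
    return MOVE_SUCCESS;
}

/* Mock backend that records the call order as one letter per step. */
struct mock {
    int reserve_rc, prep_rc;
    bool online;
    char log[16];
    size_t n;
};

static void rec(void *ctx, char c)
{
    struct mock *m = (struct mock *)ctx;
    if (m->n + 1 < sizeof(m->log))
        m->log[m->n++] = c;
}

static int  m_reserve(void *c)  { rec(c, 'R'); return ((struct mock *)c)->reserve_rc; }
static void m_release(void *c)  { rec(c, 'r'); }
static int  m_prep(void *c)     { rec(c, 'P'); return ((struct mock *)c)->prep_rc; }
static void m_shutdown(void *c) { rec(c, 'D'); }
static bool m_online(void *c)   { rec(c, 'O'); return ((struct mock *)c)->online; }
static void m_startup(void *c)  { rec(c, 'U'); }
static void m_post(void *c)     { rec(c, 'C'); }

static const struct move_ops mock_ops = {
    .reserve_vectors = m_reserve,
    .release_vectors = m_release,
    .subsys_prep     = m_prep,
    .shutdown        = m_shutdown,
    .newset_online   = m_online,
    .startup         = m_startup,
    .subsys_post     = m_post,
};

/* True if the recorded call order matches the expected letter sequence. */
static bool trace_is(const struct mock *m, const char *expected)
{
    return strcmp(m->log, expected) == 0;
}
```

With the mock, a successful move to an online set records "RPDOUC", a
veto in the prep step records "RPr" (vectors released, interrupt never
touched), and a move to an offline set records "RPDO" and leaves the
restart to the hotplug path.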