netdev - [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

Message-ID: <alpine.DEB.2.20.1711092226080.2690@nanos>
Date:   Thu, 9 Nov 2017 22:42:12 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Jens Axboe <axboe@...com>
cc:     Sagi Grimberg <sagi@...mberg.me>, Jes Sorensen <jsorensen@...com>,
        Tariq Toukan <tariqt@...lanox.com>,
        Saeed Mahameed <saeedm@....mellanox.co.il>,
        Networking <netdev@...r.kernel.org>,
        Leon Romanovsky <leonro@...lanox.com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Kernel Team <kernel-team@...com>,
        Christoph Hellwig <hch@....de>
Subject: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

Find below a summary of the technical details, implications and options.

What can be done for 4.14?

  We basically have two options: Revert at the driver level or ship as
  is.

  Even if we come up with a quick and dirty hack, it will be too late
  for proper testing before Sunday.


What can be done with some time to work on it?

The managed mechanism consists of 3 pieces:

 1) Vector spreading

 2) Managed vector allocation, which becomes a guaranteed reservation in
    4.15 due to the big rework of the vector management code.

    Non managed interrupts get a best effort reservation to handle the CPU
    unplug vector pressure problem in a sane way.

 3) CPU hotplug management

    If the last CPU in the affinity set goes offline, then the interrupt is
    shut down and restarted when the first CPU in the affinity set comes
    online again. The driver code needs to ensure that the queue associated
    with that interrupt is drained before shutdown and nothing is queued
    there after this point.
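
    The hotplug rule in 3) can be modelled in a few lines. This is only a
    sketch under stated assumptions, not kernel code: the struct, the plain
    unsigned long CPU bitmask and the names cpu_online()/cpu_offline() are
    all made up for illustration, and "draining" is reduced to a flag.

```c
#include <stdbool.h>

/* Hypothetical model of one managed interrupt; not a kernel structure. */
struct managed_irq {
	unsigned long affinity;   /* bitmask of CPUs in the affinity set */
	bool started;             /* interrupt is active */
	bool queue_enabled;       /* associated queue accepts new work */
};

/* Bitmask of CPUs that are currently online (illustrative global). */
static unsigned long online_mask;

/* A CPU comes online: restart the irq if it is the first CPU of the set. */
static void cpu_online(struct managed_irq *irq, unsigned int cpu)
{
	bool set_was_offline = (irq->affinity & online_mask) == 0;

	online_mask |= 1UL << cpu;

	if (set_was_offline && (irq->affinity & online_mask)) {
		irq->started = true;
		irq->queue_enabled = true;
	}
}

/* A CPU goes offline: shut the irq down if it was the last CPU of the set. */
static void cpu_offline(struct managed_irq *irq, unsigned int cpu)
{
	unsigned long remaining = online_mask & ~(1UL << cpu);

	if (irq->started && (irq->affinity & remaining) == 0) {
		/* Stop queueing (and drain) first, then shut the irq down. */
		irq->queue_enabled = false;
		irq->started = false;
	}

	online_mask = remaining;
}
```

    With an affinity set of CPUs 0-1, taking CPU 0 offline leaves the
    interrupt running; taking CPU 1 offline afterwards shuts it down, and
    bringing either CPU back restarts it.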

So we have options:

1) Initial vector spreading 

 Let the driver use the initial vector spreading. That does only the
 initial affinity setup, but otherwise the interrupts are handled like any
 other non managed interrupt, i.e. best effort reservation, affinity
 settings enabled and CPU unplug breaks affinity and moves them to some
 random other online CPU.

 The simplest solution of all.

2) Allowing a driver supplied mask

 Certainly simple to do, but as you said it's not really a solution. I'm
 not sure whether we want to go there as this is going to be replaced fast
 enough and then create another breakage/frustration level.


3) Affinity override in managed mode

 Doable, but there are a couple of things to think about:

  * How is this enabled?

    - Opt-in by driver
	     
    - Extra sysfs/procfs knob

    We definitely should not enable it by default because that would
    surprise users/drivers which work with the current managed devices and
    rely on the affinity files being non-writable in managed mode.

  * Is it allowed to set the affinity to offline, but present CPUs?

     In principle yes, because the core management code can do that as well
     at setup time.

  * The affinity setting must fail when it cannot do a guaranteed
    reservation on the new target CPU(s).

     This is not much of a question. That's a matter of fact because
     otherwise the association cannot be guaranteed and things fall apart
     all over the place.

  * When and how is the driver informed about the change?

     When:

       #1 Before the core tries to move the interrupt so it can veto the
	  move if it cannot allocate new resources or whatever is required
	  to operate after the move.
	  
       #2 After the core made the move effective because:

          - The interrupt might be moved from an offline set to an online
            set and needs to be started up, so the related queue must be
            enabled as well.

          - The interrupt might be moved from an online set to an offline
            set, so the queue needs to be drained and disabled.

	  - Resources which have been allocated in the first step must be
            made effective and old resources freed.

     How:

       The existing affinity notification mechanism does not work for this
       and it's a horrible piece of crap which should go away sooner rather
       than later.

       So we need some sensible way to provide callbacks. Emphasis on
       callbacks, plural, as one multiplexing callback is not a good idea.

  * How can the change be made effective?

    When the preliminaries (vector reservation on the new set and
    possibly resource allocation in the subsystem) have been done, then the
    actual move can be made.

    But, there is a caveat. x86 is not good at reassociating interrupts on
    the fly except when it sits behind an interrupt remapping unit, but we
    cannot rely on that.

    So the change flow which works for everything would be:

    if (reserve_vectors() < 0)
       return FAIL;

    if (subsys_prep_callback() < 0) {
       release_vectors();
       return FAIL;
    }

    shutdown(irq);

    if (!online(newset))
       return SUCCESS;

    startup(irq);

    subsys_post_callback();
    return SUCCESS;

    subsys_prep_callback() must basically work the same way as the CPU
    offline mechanism: drain the queue and prevent queueing before the
    irq is restarted. If the move results in keeping it shut down because
    the new set is offline, then the irq will be restarted via the CPU
    hotplug code and the subsystem will be informed about that via the
    hotplug mechanism as well.

    subsys_post_callback() is more or less the same as the hotplug callback
    and restarts the queue. The only difference to the hotplug code as of
    today is that it might need to make previously allocated resources
    effective and free the old ones.

    I named that subsys_*_callback() on purpose because this should be
    handled in a generic way for multiqueue devices and not done at the
    driver level.

  There are some very interesting locking problems to solve, especially
  vs. CPU hotplug, but that should be solvable.
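
  To make the change flow from 3) a bit more concrete, here is a small
  self-contained model of it. Everything in it is an assumption made for
  illustration: struct subsys_ops, struct irq_model, the demo queue and the
  unsigned long CPU bitmasks do not exist in the kernel, and locking is
  ignored entirely. It only shows the ordering of reservation, prep
  callback, shutdown, startup and post callback, with two separate
  callbacks instead of one multiplexed one.

```c
#include <stdbool.h>

enum { MOVE_OK = 0, MOVE_FAIL = -1 };

/* Hypothetical subsystem hooks: two separate callbacks, not one multiplexer. */
struct subsys_ops {
	int  (*prep)(void *ctx, unsigned long newset); /* may veto; drains queue */
	void (*post)(void *ctx, unsigned long newset); /* restarts the queue */
};

/* Hypothetical model of a managed interrupt. */
struct irq_model {
	unsigned long affinity;  /* current affinity set */
	unsigned long pending;   /* reservation made for a move in progress */
	bool started;
	const struct subsys_ops *ops;
	void *ctx;
};

static unsigned long online_cpus;        /* CPUs currently online */
static unsigned long cpus_with_vectors;  /* CPUs with a free vector left */

/* Guaranteed reservation: fails unless every CPU in newset has a vector. */
static int reserve_vectors(struct irq_model *irq, unsigned long newset)
{
	if ((newset & cpus_with_vectors) != newset)
		return -1;
	irq->pending = newset;
	return 0;
}

static void release_vectors(struct irq_model *irq)
{
	irq->pending = 0;
}

static int move_managed_irq(struct irq_model *irq, unsigned long newset)
{
	if (reserve_vectors(irq, newset) < 0)
		return MOVE_FAIL;

	if (irq->ops->prep(irq->ctx, newset) < 0) {
		release_vectors(irq);
		return MOVE_FAIL;
	}

	irq->started = false;            /* shutdown(irq) */
	irq->affinity = irq->pending;    /* the move becomes effective */
	irq->pending = 0;

	if (!(newset & online_cpus))
		return MOVE_OK;          /* stays down; hotplug restarts it */

	irq->started = true;             /* startup(irq) */
	irq->ops->post(irq->ctx, newset);
	return MOVE_OK;
}

/* Demo subsystem: a queue which can veto a move and is drained in prep. */
struct demo_queue {
	bool enabled;
	bool veto;
};

static int demo_prep(void *ctx, unsigned long newset)
{
	struct demo_queue *q = ctx;

	(void)newset;
	if (q->veto)
		return -1;
	q->enabled = false;              /* drain and block further queueing */
	return 0;
}

static void demo_post(void *ctx, unsigned long newset)
{
	struct demo_queue *q = ctx;

	(void)newset;
	q->enabled = true;               /* queue may be used again */
}

static const struct subsys_ops demo_ops = {
	.prep = demo_prep,
	.post = demo_post,
};
```

  In this model a move to an offline set succeeds but leaves both the irq
  and the queue disabled, while a failed reservation or a veto from the
  prep callback leaves the old affinity untouched.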

4) Break managed mode when affinity is changed by user

  I'm not going to describe that because this is going to require at least
  as much effort as #2 plus a few extra interesting twists versus vector
  management and CPU hotplug.

5) Other options:

   Maybe ponies, but I have no clue how to implement them.
 

Thoughts?

Thanks,

	tglx