lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <878rdveocj.ffs@tglx>
Date:   Thu, 11 May 2023 01:52:28 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Nipun Gupta <nipun.gupta@....com>,
        "gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
        "maz@...nel.org" <maz@...nel.org>, "jgg@...pe.ca" <jgg@...pe.ca>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Cc:     "git (AMD-Xilinx)" <git@....com>,
        "Anand, Harpreet" <harpreet.anand@....com>,
        "Jansen Van Vuuren, Pieter" <pieter.jansen-van-vuuren@....com>,
        "Agarwal, Nikhil" <nikhil.agarwal@....com>,
        "Simek, Michal" <michal.simek@....com>,
        "Gangurde, Abhijit" <abhijit.gangurde@....com>,
        "Cascon, Pablo" <pablo.cascon@....com>,
        Jason Gunthorpe <jgg@...dia.com>
Subject: Re: [PATCH] cdx: add MSI support for CDX bus

Nipun!

On Thu, May 11 2023 at 00:29, Thomas Gleixner wrote:
> On Wed, May 10 2023 at 19:34, Nipun Gupta wrote:
> #2) That's actually the worse part of it and completely broken versus
>     device setup
>
>     probe()
>       cdx_msi_domain_alloc_irqs()
>       ...
>       request_irq() {
>         ...
>         irq_activate()
>           irq_chip_write_msi_msg()
>             ...
>             queue_work()
>           ...
>       }
>
>       enable_irq_in_device()
>       
>         <- device raises interrupt and eventually uses an uninitialized
>            MSI message because the scheduled work has not yet completed.
>
>     That's going to be a nightmare to debug and it's going to happen
>     once in a blue moon out in the field.

Here is another failure scenario:

cpu_down()
  // No scheduling possible as this happens in stomp_machine() context
  __cpu_disable()
    // Point of no return
    set_cpu_online(cpu, false);
    irq_migrate_all_off_this_cpu()
      ...
      set_affinity()

That works with your current implementation by some definition of works
because you schedule the affinity change async.

But what makes sure that this takes effect _before_ the CPU goes into
lala land and cannot respond to an interrupt anymore?

Nothing, other than pure luck, which is not really a solid engineering
principle.

While the regular operations can be fixed via the bus_lock mechanics,
this one falls flat on its nose because in that context this can't
schedule anymore so acquiring a mutex or waiting for some async muck to
complete is a no-no.

Not the end of the world though. There is a way to handle that
gracefully and we need that for the PCI/IMS implementations which will
do that via command queues too.

irq_migrate_all_off_this_cpu() is invoked from stomp_machine() context
with interrupts disabled because the current x86 interrupt management
hardware trainwreck requires it at least in the case that interrupt
remapping is not available.

Of course that got copied to all other architectures... Whether they
really require it or not from a hardware POV is a different
story. Though they all rely on the stomp_machine() context which
prevents concurrent modifications.

So from a historical POV all of this makes sense to some extent, but
that does not prevent us to fix this and make it an two stage process
which can actually handle both worlds.

I'm way too tired to think that through, but you get the idea and you
are welcome to beat me to it.

Thanks,

        tglx

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ