[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK9=C2XJYTfY4nXWtjK9OP1iXLDXBVF-=mN1SmJDmJ_dO5CohA@mail.gmail.com>
Date: Wed, 7 Feb 2024 14:48:38 +0530
From: Anup Patel <apatel@...tanamicro.com>
To: Björn Töpel <bjorn@...nel.org>
Cc: Anup Patel <anup@...infault.org>, Palmer Dabbelt <palmer@...belt.com>,
Paul Walmsley <paul.walmsley@...ive.com>, Thomas Gleixner <tglx@...utronix.de>,
Rob Herring <robh+dt@...nel.org>,
Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>, Frank Rowand <frowand.list@...il.com>,
Conor Dooley <conor+dt@...nel.org>, devicetree@...r.kernel.org,
Saravana Kannan <saravanak@...gle.com>, Marc Zyngier <maz@...nel.org>, linux-kernel@...r.kernel.org,
Atish Patra <atishp@...shpatra.org>, linux-riscv@...ts.infradead.org,
linux-arm-kernel@...ts.infradead.org, Andrew Jones <ajones@...tanamicro.com>
Subject: Re: [PATCH v12 00/25] Linux RISC-V AIA Support
On Wed, Feb 7, 2024 at 12:57 PM Björn Töpel <bjorn@...nelorg> wrote:
>
> Hi!
>
> Anup Patel <anup@...infault.org> writes:
>
> > On Tue, Feb 6, 2024 at 9:09 PM Björn Töpel <bjorn@...nel.org> wrote:
> >>
> >> Hi Anup,
> >>
> >> Anup Patel <apatel@...tanamicro.com> writes:
> >>
> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> > process. The latest ratified AIA specifcation can be found at:
> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >
> >> > At a high-level, the AIA specification adds three things:
> >> > 1) AIA CSRs
> >> > - Improved local interrupt support
> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> > - Per-HART MSI controller
> >> > - Support MSI virtualization
> >> > - Support IPI along with virtualization
> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> > - Wired interrupt controller
> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator)
> >> > - In Direct-mode, injects external interrupts directly into HARTs
> >> >
> >> > For an overview of the AIA specification, refer the AIA virtualization
> >> > talk at KVM Forum 2022:
> >> > https://static.sched.com/hosted_files/kvmforum2022/a1/AIA_Virtualization_in_KVM_RISCV_final.pdf
> >> > https://www.youtube.com/watch?v=r071dL8Z0yo
> >>
> >> Thank you for continuing to work on this series! I like this
> >> direction of the series!
> >>
> >> TL;DR: I think we can get rid of most of the id/householding data
> >> structures, except for the irq matrix.
> >>
> >> Most of my comments are more of a design/overview nature, so I'll
> >> comment here in the cover letter.
> >>
> >> I took the series for a spin with and it with Alex' ftrace fix it,
> >> passes all my tests nicely!
> >>
> >> Now some thoughts/comments (I'm coming from the x86 side of things!):
> >>
> >> id/enable-tracking: There are a lot of different id/enabled tracking
> >> with corresponding locks, where there's IMO overlap with what the
> >> matrix provides.
> >
> > The matrix allocator does not track the enabled/disabled state of
> > the per-CPU IDs. This is why we have a separate per-CPU
> > ids_enabled_bitmap which is also used for remote synchronization
> > across CPUs.
>
> Exactly, but what I'm asking is if that structure is really needed. More
> below.
>
> >> Let's start with struct imsic_priv:
> >>
> >> | /* Dummy HW interrupt numbers */
> >> | unsigned int nr_hwirqs;
> >> | raw_spinlock_t hwirqs_lock;
> >> | unsigned long *hwirqs_used_bitmap;
> >
> > The matrix allocator manages actual IDs for each CPU whereas
> > the Linux irq_data expects a fixed hwirq which does not change.
> >
> > Due to this, we have a dummy hwirq space which is always
> > fixed. The only thing that is changed under-the-hood by the
> > IMSIC driver is the dummy hwirq to actual HW vector (cpu, id)
> > mapping.
>
> Read below. I'm not talking about local_id from the irq_matrix, I'm
> saying use virq, which has the properties you're asking for, and doesn't
> require an additional structure. When an irq/desc is allocated, you have
> a nice unique number with the virq for the lifetime of the interrupt.
Sure, let me explore using virq in-place of hwirq.
>
> >> These are used to for the domain routing (hwirq -> desc/virq), and not
> >> needed. Just use the same id as virq (at allocation time), and get rid
> >> of these data structures/corresponding functions. The lookup in the
> >> interrupt handler via imsic_local_priv.vectors doesn't care about
> >> hwirq. This is what x86 does... The imsic_vector roughly corresponds
> >> to apic_chip_data (nit: imsic_vector could have the chip_data suffix
> >> as well, at least it would have helped me!)
> >
> > Yes, imsic_vector corresponds to apic_chip_data in the x86 world.
>
> ...and I'm trying to ask the following; Given the IMSIC is pretty much
> x86 vector (arch/x86/kernel/apic/vector.c), I'm trying to figure out the
> rational why IMSIC has all the extra householding data, not needed by
> x86. The x86 has been battle proven, and having to deal with all kind of
> quirks (e.g. lost interrupts on affinity changes).
Understood.
>
> >> Moving/affinity changes. The moving of a vector to another CPU
> >> currently involves:
> >>
> >> 1. Allocate a new vector from the matrix
> >> 2. Disable/enable the corresponding per-cpu ids_enabled_bitmap (nested
> >> spinlocks)
> >> 3. Trigger two IPIs to apply the bitmap
> >> 4. On each CPU target (imsic_local_sync()) loop the bitmap and flip
> >> all bits, and potentially rearm
> >>
> >> This seems a bit heavy-weight: Why are you explicitly setting/clearing
> >> all the bits in a loop at the local sync?
> >
> > This can be certainly optimized by introducing another
> > ids_dirty_bitmap. I will add this in the next revision.
>
> I rather have fewer maps, and less locks! ;-)
>
> >> x86 does it a bit differently (more lazily): The chip_data has
> >> prev_{cpu,vector}/move_in_progress fields, and keep both vectors
> >> enabled until there's an interrupt on the new vector, and then the old
> >> one is cleaned (irq_complete_move()).
> >>
> >> Further; When it's time to remove the old vector, x86 doesn't trigger
> >> an IPI on the disabling side, but queues a cleanup job on a per-cpu
> >> list and triggers a timeout. So, the per-cpu chip_data (per-cpu
> >> "vectors" in your series) can reside in two places during the transit.
> >
> > We can't avoid IPIs when moving vectors from one CPU to another
> > CPU because IMSIC id enable/disable is only possible through
> > CSRs. Also, keep in-mind that irq affinity change might be initiated
> > on CPU X for some interrupt targeting CPU Y which is then changed
> > to target CPU Z.
> >
> > In the case of x86, they have memory mapped registers which
> > allows one CPU to enable/disable the ID of another CPU.
>
> Nope. Same mechanics on x86 -- the cleanup has to be done one the
> originating core. What I asked was "what about using a timer instead of
> an IPI". I think this was up in the last rev as well?
>
> Check out commit bdc1dad299bb ("x86/vector: Replace
> IRQ_MOVE_CLEANUP_VECTOR with a timer callback") Specifically, the
> comment about lost interrupts, and the rational for keeping the original
> target active until there's a new interrupt on the new cpu.
Trying timer interrupt is still TBD on my side because with v12
my goal was to implement per-device MSI domains. Let me
explore timer interrupts for v13.
>
> >> I wonder if this clean up is less intrusive, and you just need to
> >> perform what's in the per-list instead of dealing with the
> >> ids_enabled_bitmap? Maybe we can even remove that bitmap as well. The
> >> chip_data/desc has that information. This would mean that
> >> imsic_local_priv() would only have the local vectors (chip_data), and
> >> a cleanup list/timer.
> >>
> >> My general comment is that instead of having these global id-tracking
> >> structures, use the matrix together with some desc/chip_data local
> >> data, which should be sufficient.
> >
> > The "ids_enabled_bitmap", "dummy hwirqs" and private imsic_vectors
> > are required since the matrix allocator only manages allocation of
> > per-CPU IDs.
>
> The information in ids_enabled_bitmap is/could be inherent in
> imsic_local_priv.vectors (guess what x86 does... ;-)).
>
> Dummy hwirqs could be replaced with the virq.
>
> Hmm, seems like we're talking past each other, or at least I get the
> feeling I can't get my opinions out right. I'll try to do a quick PoC,
> to show you what I mean. That's probably easier than just talking about
> it. ...and maybe I'll come realizing I'm all wrong!
I suggest to wait for my v13 and try something on top of that
otherwise we might duplicate efforts.
>
> My reaction is -- you're doing a lot of householding with a lot of
> locks, and my worry is that we'll just end up with same issues/bloat
> that x86 once had (has? ;-)).
>
> >> Random thought: Do we need to explicitly disable (csr) the vector,
> >> when we're changing the affinity? What if we just leave it enabled,
> >> and only when mask/unmask is performed it's actually explicitly masked
> >> (writes to the csr)?
> >
> > We should not leave it enabled because some rough/buggy device
> > can inject spurious interrupts using MSI writes to unused enabled
> > interrupts.
>
> OK!
>
> >>
> >> Missing features (which can be added later):
> >> * Reservation mode/activate support (allocate many MSI, but only
> >> request/activate a subset)
> >
> > I did not see any PCIe or platform device requiring this kind of
> > reservation. Any examples ?
>
> It's not a requirement. Some devices allocate a gazillion interrupts
> (NICs with many QoS queues, e.g.), but only activate a subset (via
> request_irq()). A system using these kind of devices might run out of
> interrupts.
I don't see how this is not possible currently.
>
> Problems you run into once you leave the embedded world, pretty much.
>
> >> * Handle managed interrupts
> >
> > Any examples of managed interrupts in the RISC-V world ?
>
> E.g. all nvme drives: nvme_setup_irqs(), and I'd assume contemporary
> netdev drivers would use it. Typically devices with per-cpu queues.
We have tested with NVMe devices, e1000e, VirtIO-net, etc and I did
not see any issue.
We can always add new features as separate incremental series as long
as there is clear use-cause backed by real-world devices.
>
> >> * There might be some irqd flags are missing, which mostly cpuhp care
> >> about (e.g. irqd_*_single_target())...
> >
> > Okay, let me check and update.
>
> I haven't dug much into cpuhp, so I'm out on a limb here...
>
> >> Finally; Given that the APLIC requires a lot more patches, depending
> >> on how the review process moves on -- maybe the IMSIC side could go as
> >> a separate series?
> >>
> >
> > The most popular implementation choice across RISC-V platforms
> > will be IMSIC + APLIC so both drivers should go together. In fact,
> > we need both drivers for the QEMU virt machine as well because
> > UART interrupt (and other wired interrupts) on the QEMU virt
> > machine goes through APLIC.
>
> Thanks for clearing that out! Hmm, an IMSIC only QEMU would be awesome.
>
>
> Cheers,
> Björn
Regards,
Anup
Powered by blists - more mailing lists