lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090413193717.GB8393@us.ibm.com>
Date:	Mon, 13 Apr 2009 12:37:17 -0700
From:	Gary Hade <garyhade@...ibm.com>
To:	Yinghai Lu <yhlu.kernel@...il.com>
Cc:	Gary Hade <garyhade@...ibm.com>,
	"Eric W. Biederman" <ebiederm@...ssion.com>, mingo@...e.hu,
	mingo@...hat.com, tglx@...utronix.de, hpa@...or.com,
	x86@...nel.org, linux-kernel@...r.kernel.org, lcm@...ibm.com
Subject: Re: [PATCH 3/3] [BUGFIX] x86/x86_64: fix IRQ migration triggered
	active device IRQ interrruption

On Fri, Apr 10, 2009 at 11:46:39PM -0700, Yinghai Lu wrote:
> On Wed, Apr 8, 2009 at 4:08 PM, Gary Hade <garyhade@...ibm.com> wrote:
> > On Wed, Apr 08, 2009 at 03:03:49PM -0700, Yinghai Lu wrote:
> >> On Wed, Apr 8, 2009 at 2:07 PM, Gary Hade <garyhade@...ibm.com> wrote:
> >> > Impact: Eliminates an issue that can leave the system in an
> >> >        unusable state.
> >> >
> >> > This patch addresses an issue where device generated IRQs
> >> > are no longer seen by the kernel following IRQ affinity
> >> > migration while the device is generating IRQs at a high rate.
> >> > We have seen this problem happen when IRQ affinities are
> >> > adjusted in response to CPU offlining but I believe it
> >> > could also happen in during user initiated IRQ affinity
> >> > changes unrelated to CPU offlining. e.g. while the
> >> > irqbalance daemon is adjusting IRQ affinities when the
> >> > system is heavily loaded.
> >> >
> >> > I have been able to consistently reproduce the problem on
> >> > some of our systems by running the following script (VICTIM_IRQ
> >> > specifies the IRQ for the aic94xx device) while a single instance
> >> > of the command
> >> >  # while true; do find / -exec file {} \;; done
> >> > is keeping the filesystem activity and IRQ rate reasonably high.
> >> >
> >> > #!/bin/sh
> >> >
> >> > SYS_CPU_DIR=/sys/devices/system/cpu
> >> > VICTIM_IRQ=25
> >> > IRQ_MASK=f0
> >> >
> >> > iteration=0
> >> > while true; do
> >> >  echo $iteration
> >> >  echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity
> >> >  for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
> >> >    echo 0 > $cpudir/online
> >> >  done
> >> >  for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
> >> >    echo 1 > $cpudir/online
> >> >  done
> >> >  iteration=`expr $iteration + 1`
> >> > done
> >> >
> >> > The root cause is a known issue already addressed for some
> >> > code paths [e.g. ack_apic_level() and the now obsolete
> >> > migrate_irq_remapped_level_desc()] where the ioapic can
> >> > misbehave when the I/O redirection table register is written
> >> > while the Remote IRR bit is set.
> >> >
> >> > The proposed fix uses the same avoidance method and much
> >> > of same code that the Interrupt Remapping code previously
> >> > used to avoid the same problem.
> >> >
> >> > Signed-off-by: Gary Hade <garyhade@...ibm.com>
> >> >
> >> > ---
> >> >  arch/x86/kernel/apic/io_apic.c |   72 ++++++++++++++++++++++++++++++-
> >> >  1 file changed, 71 insertions(+), 1 deletion(-)
> >> >
> >> > Index: linux-2.6.30-rc1/arch/x86/kernel/apic/io_apic.c
> >> > ===================================================================
> >> > --- linux-2.6.30-rc1.orig/arch/x86/kernel/apic/io_apic.c        2009-04-08 09:24:11.000000000 -0700
> >> > +++ linux-2.6.30-rc1/arch/x86/kernel/apic/io_apic.c     2009-04-08 09:24:23.000000000 -0700
> >> > @@ -2331,7 +2331,8 @@ set_desc_affinity(struct irq_desc *desc,
> >> >  }
> >> >
> >> >  static void
> >> > -set_ioapic_affinity_irq_desc(struct irq_desc *desc, const struct cpumask *mask)
> >> > +set_ioapic_irq_affinity_desc(struct irq_desc *desc,
> >> > +                            const struct cpumask *mask)
> >> >  {
> >> >        struct irq_cfg *cfg;
> >> >        unsigned long flags;
> >> > @@ -2352,6 +2353,75 @@ set_ioapic_affinity_irq_desc(struct irq_
> >> >  }
> >> >
> >> >  static void
> >> > +delayed_irq_move(struct work_struct *work)
> >> > +{
> >> > +       unsigned int irq;
> >> > +       struct irq_desc *desc;
> >> > +
> >> > +       for_each_irq_desc(irq, desc) {
> >> > +               if (desc->status & IRQ_MOVE_PENDING) {
> >> > +                       unsigned long flags;
> >> > +
> >> > +                       spin_lock_irqsave(&desc->lock, flags);
> >> > +                       if (!desc->chip->set_affinity ||
> >> > +                           !(desc->status & IRQ_MOVE_PENDING)) {
> >> > +                               desc->status &= ~IRQ_MOVE_PENDING;
> >> > +                               spin_unlock_irqrestore(&desc->lock, flags);
> >> > +                               continue;
> >> > +                       }
> >> > +
> >> > +                       desc->chip->set_affinity(irq, desc->pending_mask);
> >> > +                       spin_unlock_irqrestore(&desc->lock, flags);
> >> > +               }
> >> > +       }
> >> > +}
> >> > +
> >> > +static DECLARE_DELAYED_WORK(delayed_irq_move_work, delayed_irq_move);
> >> > +
> >> > +static void
> >> > +set_ioapic_irq_affinity_level_desc(struct irq_desc *desc)
> >> > +{
> >> > +
> >> > +       struct irq_cfg *cfg = desc->chip_data;
> >> > +
> >> > +       mask_IO_APIC_irq_desc(desc);
> >> > +
> >> > +       if (io_apic_level_ack_pending(cfg)) {
> >> > +               /*
> >> > +                * Interrupt in progress. Migrating irq now will change
> >> > +                * the vector information in the IO-APIC RTE which will
> >> > +                * confuse the EOI broadcast performed by cpu.
> >> > +                * So, we delay the irq migration.
> >> > +                */
> >> > +               schedule_delayed_work(&delayed_irq_move_work, 1);
> >> > +               goto unmask;
> >> > +       }
> >> > +
> >> > +       /* Interrupt not in progress. we can change the vector
> >> > +        * information in the IO-APIC RTE. */
> >> > +       set_ioapic_irq_affinity_desc(desc, desc->pending_mask);
> >> > +
> >> > +       desc->status &= ~IRQ_MOVE_PENDING;
> >> > +       cpumask_clear(desc->pending_mask);
> >> > +
> >> > +unmask:
> >> > +       unmask_IO_APIC_irq_desc(desc);
> >> > +}
> >> > +
> >> > +static void
> >> > +set_ioapic_affinity_irq_desc(struct irq_desc *desc,
> >> > +                            const struct cpumask *mask)
> >> > +{
> >> > +       if (desc->status & IRQ_LEVEL) {
> >> > +               desc->status |= IRQ_MOVE_PENDING;
> >> > +               cpumask_copy(desc->pending_mask, mask);
> >> > +               set_ioapic_irq_affinity_level_desc(desc);
> >> > +               return;
> >> > +       }
> >> > +       set_ioapic_irq_affinity_desc(desc, mask);
> >> > +}
> >> > +
> >> > +static void
> >> >  set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
> >> >  {
> >> >        struct irq_desc *desc;
> >> > --
> >>
> >> it seems, ack_apic_level() already checked io_apic_level_ack_pending()
> >>
> >>                 cfg = desc->chip_data;
> >>                 if (!io_apic_level_ack_pending(cfg))
> >>                         move_masked_irq(irq);
> >>                 unmask_IO_APIC_irq_desc(desc);
> >
> > Yes, I have actually observed instances where the command
> >   `echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity`
> > in the above test script failed to modify the affinity due to
> > a debug printk confirmed non-zero return from io_apic_level_ack_pending()
> > at this location.  However, this is definitely not catching all the cases.
> > This was confirmed this with a printk in __target_IO_APIC_irq() that would
> > sometimes shows the Remote IRR bit set before the io_apic_modify() call.
> > e.g.
> > __target_IO_APIC_irq: XXX before io_apic_modify irq=25 vector=0xd9 reg=0x1e0d9
> 
> 
> that is strange,  before that irq is masked,
> 
> #ifdef CONFIG_GENERIC_PENDING_IRQ
>         /* If we are moving the irq we need to mask it */
>         if (unlikely(desc->status & IRQ_MOVE_PENDING)) {
>                 do_unmask_irq = 1;
>                 mask_IO_APIC_irq_desc(desc);
>         }
> #endif

When the device is very actively generating IRQs I suspect
that __target_IO_APIC_irq() is sometimes visited (via CPU
offline motivated path) before the IRQ is masked at this location.
I believe the vulnerability window is actually a bit larger
because it spans this mask_IO_APIC_irq_desc() call.

Gary

-- 
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503  IBM T/L: 775-4503
garyhade@...ibm.com
http://www.ibm.com/linux/ltc

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ