Message-ID: <2e0c971a-0199-ff0d-c13c-d007d9f03122@nvidia.com>
Date:   Thu, 19 Jan 2023 07:00:49 -0600
From:   Shanker Donthineni <sdonthineni@...dia.com>
To:     Marc Zyngier <maz@...nel.org>
Cc:     James Morse <james.morse@....com>,
        Catalin Marinas <catalin.marinas@....com>,
        Will Deacon <will@...nel.org>,
        linux-arm-kernel@...ts.infradead.org, kvmarm@...ts.linux.dev,
        linux-kernel@...r.kernel.org, Vikram Sethi <vsethi@...dia.com>,
        Zenghui Yu <yuzenghui@...wei.com>,
        Oliver Upton <oliver.upton@...ux.dev>,
        Suzuki K Poulose <suzuki.poulose@....com>,
        Ard Biesheuvel <ardb@...nel.org>
Subject: Re: [PATCH] KVM: arm64: vgic: Fix soft lockup during VM teardown



On 1/19/23 01:11, Marc Zyngier wrote:
> External email: Use caution opening links or attachments
> 
> 
> [dropping the now dead old kvmarm list]
> 
> On Wed, 18 Jan 2023 19:24:01 +0000,
> Shanker Donthineni <sdonthineni@...dia.com> wrote:
>>
>>
>>
>> On 1/18/23 05:54, Marc Zyngier wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Shanker,
>>>
>>> Please Cc all the KVM/arm64 reviewers when sending KVM/arm64 patches.
>>>
>>> On Wed, 18 Jan 2023 02:23:48 +0000,
>>> Shanker Donthineni <sdonthineni@...dia.com> wrote:
>>>>
>>>> Getting intermittent CPU soft lockups during the virtual machines
>>>> teardown on a system with GICv4 features enabled. The function
>>>> __synchronize_hardirq() has been waiting for IRQD_IRQ_INPROGRESS
>>>> to be cleared forever as per the current implementation.
>>>>
>>>> CPU stuck here for a long time leads to soft lockup:
>>>>     while (irqd_irq_inprogress(&desc->irq_data))
>>>>         cpu_relax();
>>>
>>> Is it a soft-lockup from which the system recovers? or a livelock
>>> which leaves the system dead?
>>>
>> The system is not recovering, did a power cycle to recover.
> 
> This isn't a soft-lockup then. This is at best a livelock.
> 
>>> Are these two traces an indication of concurrent events? Or are they
>>> far apart?
>>>
>> Concurrent.
> 
> So you can see the VM being torn down while the vgic save sequence is
> still in progress?
> 
> If you can actually see that, then this is a much bigger bug than the
> simple race you are describing, and we're missing a reference on the
> kvm structure. This would be a *MAJOR* bug.
> 
How do we know the vGIC save sequence is in progress while the VM is being torn down?
I'm launching/terminating ~32 VMs in a loop to reproduce the issue.
  
> Please post the full traces, not snippets. The absolutely full kernel
> log, the configuration, what you run, how you run it, *EVERYTHING*. I
> need to be able to reproduce this.
Sure, I'll share the complete boot log messages of the host kernel on the next run.
  
> 
>>
>>>>
>>>> irqreturn_t handle_irq_event(struct irq_desc *desc)
>>>> {
>>>>       irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>>       raw_spin_unlock(&desc->lock);
>>>>
>>>>       ret = handle_irq_event_percpu(desc);
>>>>
>>>>       raw_spin_lock(&desc->lock);
>>>>       irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
>>>> }
>>>
>>> How is that relevant to this trace? Do you see this function running
>>> concurrently with the teardown? If it matters here, it must be a VPE
>>> doorbell, right? But you claim that this is on a GICv4 platform, while
>>> this would only affect GICv4.1... Or are you using GICv4.1?
>>>
>> handle_irq_event() is running concurrently with irq_domain_activate_irq()
>> which happens before free_irq() called. Corruption at [78.983544] and
>> teardown started at [87.360891].
> 
> But that doesn't match the description you made of concurrent
> events. Does it take more than 9 seconds for the vgic state to be
> saved to memory?

Is there any other way the IRQD_IRQ_INPROGRESS state bit could get
corrupted, other than by concurrent accesses?

I did another experiment: I changed the bit manipulation of the IRQD
state to atomic operations, and the issue is no longer reproducible.

struct irq_common_data {
-	unsigned int		__private state_use_accessors;
+	atomic_t		__private state_use_accessors;

+#define __irqd_to_state(d)       ACCESS_PRIVATE((d)->common, state_use_accessors)
+#define __irqd_get_state(d)      atomic_read(&__irqd_to_state(d))
+#define __irqd_set_state(d, m)   atomic_or((m), &__irqd_to_state(d))
+#define __irqd_clear_state(d, m) atomic_andnot((m), &__irqd_to_state(d))
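
For reference, a minimal sketch of how the existing accessors in
kernel/irq/internals.h and include/linux/irq.h could sit on top of these
atomic helpers. This is illustrative only, not the exact diff I tested;
irqd_set()/irqd_clear()/irqd_irq_inprogress() and IRQD_IRQ_INPROGRESS are
the upstream names, the bodies below are the sketch:

static inline void irqd_set(struct irq_data *d, unsigned int mask)
{
	/* atomic RMW replaces the non-atomic __irqd_to_state(d) |= mask */
	__irqd_set_state(d, mask);
}

static inline void irqd_clear(struct irq_data *d, unsigned int mask)
{
	/* atomic RMW replaces __irqd_to_state(d) &= ~mask */
	__irqd_clear_state(d, mask);
}

static inline bool irqd_irq_inprogress(struct irq_data *d)
{
	/* a plain atomic_read() is enough for the polling loop in
	 * __synchronize_hardirq() */
	return __irqd_get_state(d) & IRQD_IRQ_INPROGRESS;
}
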
> 
>>
>> [   78.983544] irqd_set_activated: lost IRQD_IRQ_INPROGRESS old=0x10401400, new=0x10441600
>>
>> [   87.360891]  __synchronize_hardirq+0x48/0x140
>>
>> Yes, I'm using GICv4.1, used these 2 functions to trace the issue.
> 
> Then *please* be precise in your descriptions. You send people in the
> wrong direction.
> 
It was my mistake not to mention the exact GIC version.

> [...]
> 
>> I ran stress test launch/teardown multiple VMs for 3hrs. The issue
>> is not reproducible. The same test fails in 10-30min without code
>> changes.
> 
> That doesn't add up with the earlier description of concurrent events,
> with the VM teardown and the vgic saving running in parallel. My patch
> doesn't prevent this.
> 
> So either your testing is insufficient, or your description of
> concurrent events is inaccurate.
> 

I'm using the unmodified 6.2-rc3 and v6.0 kernels from kernel.org and QEMU
to reproduce the issue.

I'm using the steps below to launch/terminate 32 VMs in a loop. The
failure is intermittent. The same issue is also reproducible with
KVMTOOL.

numvms=32
socketcnt=1
iterations=100
while [ ${iterations} -ne 0 ]; do
  iterations=$(( iterations - 1 ))
  for idx in `seq 1 ${numvms}`; do
    vmname=vm${idx}
    cpunode=$((idx % socketcnt))
    cpuset=`lscpu | grep "NUMA node${cpunode} CPU(s)" | awk -F' ' '{ print $4 }'`
    qemu-img create -f qcow2 -F raw -b /var/lib/libvirt/images/cloudimg.raw /var/lib/libvirt/images/${vmname}.qcow2
    virt-sysprep -a /var/lib/libvirt/images/${vmname}.qcow2 --hostname ${vmname}
    virt-install \
      --name ${vmname} \
      --numatune ${cpunode},mode=preferred \
      --vcpus vcpus=16,maxvcpus=16,cpuset=${cpuset} \
      --memory 4096 \
      --import \
      --noautoconsole \
      --os-variant name=ubuntu22.04 \
      --disk path=/var/lib/libvirt/images/${vmname}.qcow2 \
      --boot loader=/usr/share/AAVMF/AAVMF_CODE.fd,loader_ro=yes,loader_type=pflash
    if [ $? -ne 0 ]; then
      echo "Failed to create vm${idx}"
      while true ; do  sleep 1; done
    fi
    sleep 5
  done

  for idx in `seq 1 ${numvms}`; do
    echo -n "Checking vm${idx} boot: "
    while true; do
      ghostname=`virsh guestinfo vm${idx} --hostname 2>/dev/null | awk '{ print $3 }'`
      [ "${ghostname}" == "vm${idx}" ] && echo "done" && break
      sleep 1
    done
  done

  for idx in `seq 1 ${numvms}`; do
    virsh domstate vm${idx} >/dev/null 2>&1 || continue
    echo "Started shutdown & undefine vm${idx}..."
    virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
    while true; do
      virsh domstate vm${idx} 2>/dev/null | grep -w "^shut off" >/dev/null 2>&1
      [ $? -eq 0 ] && break
      sleep 1
      virsh shutdown "vm${idx}" --mode agent >/dev/null 2>&1
    done
    virsh undefine --nvram --remove-all-storage vm${idx} 2>/dev/null
  done
done
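
While the loop above runs, something like the following on the host is
enough to catch the stuck CPU (illustrative only; assumes util-linux dmesg,
and the symbol name matches the trace above):

dmesg --follow | grep -E 'soft lockup|__synchronize_hardirq'
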


>          M.
> 
> --
> Without deviation from the norm, progress is not possible.
