linux-kernel - Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs

Message-ID: <c90abd39-375a-15cc-847a-d1d28115ca97@linux.ibm.com>
Date:   Fri, 21 Jan 2022 15:17:01 +0100
From:   Christian Borntraeger <borntraeger@...ux.ibm.com>
To:     Mark Rutland <mark.rutland@....com>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     linux-kernel@...r.kernel.org,
        Michael Ellerman <mpe@...erman.id.au>,
        aleksandar.qemu.devel@...il.com, alexandru.elisei@....com,
        anup.patel@....com, aou@...s.berkeley.edu, atish.patra@....com,
        bp@...en8.de, catalin.marinas@....com, chenhuacai@...nel.org,
        dave.hansen@...ux.intel.com, frankja@...ux.ibm.com,
        frederic@...nel.org, gor@...ux.ibm.com, hca@...ux.ibm.com,
        james.morse@....com, jmattson@...gle.com, joro@...tes.org,
        luto@...nel.org, maz@...nel.org, mingo@...hat.com,
        nsaenzju@...hat.com, palmer@...belt.com, paulmck@...nel.org,
        paul.walmsley@...ive.com, peterz@...radead.org, seanjc@...gle.com,
        suzuki.poulose@....com, svens@...ux.ibm.com, tglx@...utronix.de,
        tsbogend@...ha.franken.de, vkuznets@...hat.com,
        wanpengli@...cent.com, will@...nel.org,
        Anup Patel <anup@...infault.org>,
        Atish Patra <atishp@...shpatra.org>
Subject: Re: [PATCH v2 0/7] kvm: fix latent guest entry/exit bugs



On 21.01.22 at 10:53, Christian Borntraeger wrote:
> On 20.01.22 at 16:14, Christian Borntraeger wrote:
>>
>>
>> On 20.01.22 at 13:03, Mark Rutland wrote:
>>> On Thu, Jan 20, 2022 at 12:28:09PM +0100, Paolo Bonzini wrote:
>>>> On 1/19/22 20:22, Mark Rutland wrote:
>>>>> I wonder, is the s390 guest entry/exit *preemptible*?
>>>>>
>>>>> If a timer IRQ can preempt in the middle of the EQS, we wouldn't balance
>>>>> things before a ctx-switch to the idle thread, which would then be able
>>>>> to hit this.
>>>>>
>>>>> I'll need to go audit the other architectures for similar.
>>>>
>>>> They don't enable interrupts in the entry/exit path so they should be okay.
>>>
>>> True.
>>>
>>> So it sounds like for s390 adding an explicit preempt_{disable,enable}() is the
>>> right thing to do. I'll add that and explanatory commentary.
>>
>> That would not be trivial, I guess. We do allow (and need) page faults on SIE for guest
>> demand paging, and this piece of arch/s390/mm/fault.c
>>
>>         case GMAP_FAULT:
>>                  if (faulthandler_disabled() || !mm)
>>                          goto out;
>>                  break;
>>          }
>>
>> would no longer work, since faulthandler_disabled() also checks the preempt count.
>>
> 
> 
> Something like this
> 
> 
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index d30f5986fa85..1c7d45346e12 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -385,10 +385,18 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
>                          return 0;
>                  goto out;
>          case USER_FAULT:
> -       case GMAP_FAULT:
>                  if (faulthandler_disabled() || !mm)
>                          goto out;
>                  break;
> +               /*
> +                * We know that we interrupted SIE and we are not in a IRQ.
> +                * preemption might be disabled thus checking for in_atomic
> +                * would result in failures
> +                */
> +       case GMAP_FAULT:
> +               if (pagefault_disabled() || !mm)
> +                       goto out;
> +               break;
>          }
> 
>          perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> 
> seems to work with preemption disabled around sie. Not sure yet if this is correct.


No, it does not work: the fault handler ends up scheduling while preemption is disabled.
[ 1880.448663] BUG: scheduling while atomic: qemu-system-s39/1806/0x00000002
[ 1880.448674] INFO: lockdep is turned off.
[ 1880.448676] Modules linked in: kvm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc mlx5_ib ib_uverbs s390_trng ib_core genwqe_card crc_itu_t vfio_ccw mdev vfio_iommu_type1 eadm_sch vfio zcrypt_cex4 sch_fq_codel configfs ip_tables x_tables mlx5_core ghash_s390 prng aes_s390 des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
[ 1880.448730] Preemption disabled at:
[ 1880.448731] [<000003ff8070da88>] ckc_irq_pending+0x30/0xe0 [kvm]
[ 1880.448778] CPU: 47 PID: 1806 Comm: qemu-system-s39 Tainted: G        W         5.16.0-00007-g89e9021389e2-dirty #15
[ 1880.448782] Hardware name: IBM 3906 M04 704 (LPAR)
[ 1880.448784] Call Trace:
[ 1880.448785]  [<000000000bf001d6>] dump_stack_lvl+0x8e/0xc8
[ 1880.448794]  [<000000000b26e08a>] __schedule_bug+0xe2/0xf8
[ 1880.448801]  [<000000000b26e212>] schedule_debug+0x172/0x1a8
[ 1880.448804]  [<000000000bf0bcae>] __schedule+0x56/0x8b0
[ 1880.448808]  [<000000000bf0c570>] schedule+0x68/0x110
[ 1880.448811]  [<000000000bf13e76>] schedule_timeout+0x106/0x160
[ 1880.448815]  [<000000000bf0ddf2>] wait_for_completion+0xc2/0x110
[ 1880.448818]  [<000000000b258674>] __flush_work+0xd4/0x118
[ 1880.448823]  [<000000000b4e3c88>] __drain_all_pages+0x218/0x308
[ 1880.448829]  [<000000000b4ec3bc>] __alloc_pages_slowpath.constprop.0+0x5bc/0xc98
[ 1880.448832]  [<000000000b4ece5c>] __alloc_pages+0x3c4/0x448
[ 1880.448835]  [<000000000b5143cc>] alloc_pages_vma+0x9c/0x360
[ 1880.448841]  [<000000000b4c0d6e>] do_swap_page+0x66e/0xca0
[ 1880.448845]  [<000000000b4c3012>] __handle_mm_fault+0x29a/0x4b0
[ 1880.448869]  [<000000000b4c33ac>] handle_mm_fault+0x184/0x3a8
[ 1880.448872]  [<000000000b2062ce>] do_exception+0x136/0x490
[ 1880.448877]  [<000000000b206b9a>] do_dat_exception+0x2a/0x50
[ 1880.448880]  [<000000000bf03650>] __do_pgm_check+0x120/0x1f0
[ 1880.448882]  [<000000000bf164ee>] pgm_check_handler+0x11e/0x180
[ 1880.448885]  [<000000000bf16298>] sie_exit+0x0/0x48
[ 1880.448888] ([<000003ff8071e954>] kvm_s390_enter_exit_sie+0x64/0x98 [kvm])
[ 1880.448910]  [<000003ff807061fa>] __vcpu_run+0x2a2/0x5b8 [kvm]
[ 1880.448931]  [<000003ff807069ba>] kvm_arch_vcpu_ioctl_run+0x10a/0x270 [kvm]
[ 1880.448953]  [<000003ff806ed02c>] kvm_vcpu_ioctl+0x27c/0xa40 [kvm]
[ 1880.448975]  [<000000000b58b5c6>] __s390x_sys_ioctl+0xbe/0x100
[ 1880.448982]  [<000000000bf038fa>] __do_syscall+0x1da/0x208
[ 1880.448984]  [<000000000bf16362>] system_call+0x82/0xb0