linux-kernel - Re: linux-next: KVM/s390x regression

Message-ID: <cb85aaa3-e456-4fd8-b323-46c75d453a02@redhat.com>
Date: Sat, 18 Oct 2025 00:41:23 +0200
From: David Hildenbrand <david@...hat.com>
To: Balbir Singh <balbirs@...dia.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>
Cc: Liam.Howlett@...cle.com, airlied@...il.com, akpm@...ux-foundation.org,
 apopple@...dia.com, baohua@...nel.org, baolin.wang@...ux.alibaba.com,
 byungchul@...com, dakr@...nel.org, dev.jain@....com,
 dri-devel@...ts.freedesktop.org, francois.dugast@...el.com,
 gourry@...rry.net, joshua.hahnjy@...il.com, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org, lorenzo.stoakes@...cle.com, lyude@...hat.com,
 matthew.brost@...el.com, mpenttil@...hat.com, npache@...hat.com,
 osalvador@...e.de, rakie.kim@...com, rcampbell@...dia.com,
 ryan.roberts@....com, simona@...ll.ch, ying.huang@...ux.alibaba.com,
 ziy@...dia.com, kvm@...r.kernel.org, linux-s390@...r.kernel.org,
 linux-next@...r.kernel.org
Subject: Re: linux-next: KVM/s390x regression

On 18.10.25 00:15, David Hildenbrand wrote:
> On 17.10.25 23:56, Balbir Singh wrote:
>> On 10/18/25 04:07, David Hildenbrand wrote:
>>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>>
>>>>
>>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>>
>>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>>
>>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>>
>>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>>
>>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>>> related to us disabling THP for the kvm address space?
>>>>>
>>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>>
>>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>>
>>>>>
>>>>> I assume your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>>
>>>> yes.
>>>>
>>>>>
>>>>> I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.
>>>>>
>>>>>
>>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>>
>>>> That partial revert seems to fix the issue
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index 0c847cdf4fd3..567e2d084071 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>>                if (pmdvalp)
>>>>                     *pmdvalp = pmdval;
>>>> -       if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>>> +       if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>>
>>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>>
>>> And I would expect that it's a page table, because otherwise the change
>>> wouldn't make a difference.
>>>
>>> And the weird thing is that this only triggers sometimes, because if
>>> it would always trigger nothing would ever work.
>>>
>>> Is there some weird scenario where s390x might set a leaf page table mapped in a PMD to non-present?
>>>
>>
>> Good point
>>
>>> Staring at the definition of pmd_present() on s390x it's really just
>>>
>>>       return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>>
>>>
>>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>>
>>
>>
>> I am not an s390 expert, but just looking at the code
>>
>> So the check on s390 effectively
>>
>> segment_entry/present = false or segment_entry_empty/invalid = true
> 
> pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set
> 
> because
> 
> 	return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
> 
> is the same as
> 
> 	return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;
> 
> But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.
> 
> I suspect that can only be the gmap tables.
> 
> Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
> because it's a software managed bit for "ordinary" page tables, not gmap
> tables.
> 
> Which raises the question why someone would wrongly use
> pte_offset_map()/__pte_offset_map() on the gmap tables.
> 
> I cannot immediately spot any such usage in kvm/gmap code, though.
> 

Ah, it's all that pte_alloc_map_lock() stuff in gmap.c.

Oh my.

So we're mapping a user PTE table that is linked into the gmap tables
through a PMD table that does not have the sw bits set that we would
expect in a user PMD table.
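
To make the effect of the partial revert quoted above concrete, here is a
rough userspace toy model, not kernel code. All names and bit values are
made up for illustration and are not the real s390x encodings; the only
assumption carried over from the thread is that a gmap segment entry
points to a PTE table without having the software "present" bit set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; NOT the real s390x segment-entry layout. */
#define MODEL_ENTRY_EMPTY   UINT64_C(0x0)
#define MODEL_SW_PRESENT    UINT64_C(0x1)    /* software "present" bit */
#define MODEL_SW_MIGRATION  UINT64_C(0x2)    /* marks a migration entry */
#define MODEL_TABLE_ORIGIN  UINT64_C(0x1000) /* stands in for a PTE table address */

static bool model_pmd_none(uint64_t v)
{
	return v == MODEL_ENTRY_EMPTY;
}

static bool model_pmd_present(uint64_t v)
{
	return (v & MODEL_SW_PRESENT) != 0;
}

static bool model_is_migration_entry(uint64_t v)
{
	return !model_pmd_present(v) && (v & MODEL_SW_MIGRATION) != 0;
}

/* Check as in linux-next: bail out on any non-present entry. */
static bool model_next_check_bails(uint64_t v)
{
	return model_pmd_none(v) || !model_pmd_present(v);
}

/* Check after the partial revert: bail out only on empty or migration entries. */
static bool model_reverted_check_bails(uint64_t v)
{
	return model_pmd_none(v) || model_is_migration_entry(v);
}

/* A user PMD entry carries the sw present bit; the modelled gmap entry does not. */
static const uint64_t model_user_pmd = MODEL_TABLE_ORIGIN | MODEL_SW_PRESENT;
static const uint64_t model_gmap_pmd = MODEL_TABLE_ORIGIN;
```

In this model both checks accept model_user_pmd, but only the linux-next
check bails out on model_gmap_pmd, which would match the observation that
the revert makes the problem go away.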

What's also scary is that pte_alloc_map_lock() would try to pte_alloc() 
a user page table in the gmap, which sounds completely wrong?

Yeah, when walking the gmap and wanting to lock the linked user PTE 
table, we should probably never use the pte_*map variants but obtain
the lock through pte_lockptr().

All the magic we end up doing with RCU etc. in __pte_offset_map_lock()
does not apply to the gmap PMD table.
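
As a rough sketch of why the alloc-and-lock pattern looks wrong here,
consider this userspace toy (hypothetical types and helpers, not the
kernel API): an "alloc, then map and lock" helper populates an empty
entry as a side effect, while a lock-only helper just reports that
there is nothing to lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not kernel types. */
struct model_pte_table {
	int lock_held;                 /* models the PTE table lock */
};

struct model_pmd {
	struct model_pte_table *table; /* NULL models an empty (none) entry */
};

/* Alloc-then-lock: an empty entry gets populated as a side effect. */
static struct model_pte_table *model_alloc_map_lock(struct model_pmd *pmd)
{
	if (!pmd->table) {
		pmd->table = calloc(1, sizeof(*pmd->table));
		if (!pmd->table)
			return NULL;
	}
	pmd->table->lock_held = 1;
	return pmd->table;
}

/* Lock-only: an empty entry is reported to the caller, never populated. */
static struct model_pte_table *model_lock_only(struct model_pmd *pmd)
{
	if (!pmd->table)
		return NULL;
	pmd->table->lock_held = 1;
	return pmd->table;
}

static void model_unlock(struct model_pte_table *table)
{
	table->lock_held = 0;
}
```

With an empty model_pmd, model_lock_only() returns NULL and leaves the
entry untouched, whereas model_alloc_map_lock() installs a fresh table.
The lock-only shape is the behaviour we would want when walking the gmap.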

-- 
Cheers

David / dhildenb