Message-ID: <3a2db8fc-d289-415b-ae67-5a35c9c32a76@redhat.com>
Date: Sat, 18 Oct 2025 00:15:01 +0200
From: David Hildenbrand <david@...hat.com>
To: Balbir Singh <balbirs@...dia.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>
Cc: Liam.Howlett@...cle.com, airlied@...il.com, akpm@...ux-foundation.org,
apopple@...dia.com, baohua@...nel.org, baolin.wang@...ux.alibaba.com,
byungchul@...com, dakr@...nel.org, dev.jain@....com,
dri-devel@...ts.freedesktop.org, francois.dugast@...el.com,
gourry@...rry.net, joshua.hahnjy@...il.com, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, lorenzo.stoakes@...cle.com, lyude@...hat.com,
matthew.brost@...el.com, mpenttil@...hat.com, npache@...hat.com,
osalvador@...e.de, rakie.kim@...com, rcampbell@...dia.com,
ryan.roberts@....com, simona@...ll.ch, ying.huang@...ux.alibaba.com,
ziy@...dia.com, kvm@...r.kernel.org, linux-s390@...r.kernel.org,
linux-next@...r.kernel.org
Subject: Re: linux-next: KVM/s390x regression
On 17.10.25 23:56, Balbir Singh wrote:
> On 10/18/25 04:07, David Hildenbrand wrote:
>> On 17.10.25 17:20, Christian Borntraeger wrote:
>>>
>>>
>>> Am 17.10.25 um 17:07 schrieb David Hildenbrand:
>>>> On 17.10.25 17:01, Christian Borntraeger wrote:
>>>>> Am 17.10.25 um 16:54 schrieb David Hildenbrand:
>>>>>> On 17.10.25 16:49, Christian Borntraeger wrote:
>>>>>>> This patch triggers a regression for s390x kvm as qemu guests can no longer start
>>>>>>>
>>>>>>> error: kvm run failed Cannot allocate memory
>>>>>>> PSW=mask 0000000180000000 addr 000000007fd00600
>>>>>>> R00=0000000000000000 R01=0000000000000000 R02=0000000000000000 R03=0000000000000000
>>>>>>> R04=0000000000000000 R05=0000000000000000 R06=0000000000000000 R07=0000000000000000
>>>>>>> R08=0000000000000000 R09=0000000000000000 R10=0000000000000000 R11=0000000000000000
>>>>>>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>>>>>>> C00=00000000000000e0 C01=0000000000000000 C02=0000000000000000 C03=0000000000000000
>>>>>>> C04=0000000000000000 C05=0000000000000000 C06=0000000000000000 C07=0000000000000000
>>>>>>> C08=0000000000000000 C09=0000000000000000 C10=0000000000000000 C11=0000000000000000
>>>>>>> C12=0000000000000000 C13=0000000000000000 C14=00000000c2000000 C15=0000000000000000
>>>>>>>
>>>>>>> KVM on s390x does not use THP so far, will investigate. Does anyone have a quick idea?
>>>>>>
>>>>>> Only when running KVM guests and apart from that everything else seems to be fine?
>>>>>
>>>>> We have other weirdness in linux-next but in different areas. Could that somehow be
>>>>> related to us disabling THP for the kvm address space?
>>>>
>>>> Not sure ... it's a bit weird. I mean, when KVM disables THPs we essentially just remap everything to be mapped by PTEs. So there shouldn't be any PMDs in that whole process.
>>>>
>>>> Remapping a file THP (shmem) implies zapping the THP completely.
>>>>
>>>>
>>>> I assume your kernel config has CONFIG_ZONE_DEVICE and CONFIG_ARCH_ENABLE_THP_MIGRATION set, right?
>>>
>>> yes.
>>>
>>>>
>>>> I'd rule out copy_huge_pmd() and zap_huge_pmd() as well.
>>>>
>>>>
>>>> What happens if you revert the change in mm/pgtable-generic.c?
>>>
>>> That partial revert seems to fix the issue
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index 0c847cdf4fd3..567e2d084071 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -290,7 +290,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>>         if (pmdvalp)
>>>                 *pmdvalp = pmdval;
>>> -       if (unlikely(pmd_none(pmdval) || !pmd_present(pmdval)))
>>> +       if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>
>> Okay, but that means that effectively we stumble over a PMD entry that is not a migration entry but still non-present.
>>
>> And I would expect that it's a page table, because otherwise the change
>> wouldn't make a difference.
>>
>> And the weird thing is that this only triggers sometimes, because if
>> it would always trigger nothing would ever work.
>>
>> Is there some weird scenario where s390x might set a leaf page table mapped in a PMD to non-present?
>>
>
> Good point
>
>> Staring at the definition of pmd_present() on s390x it's really just
>>
>> return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;
>>
>>
>> Maybe this is happening in the gmap code only and not actually in the core-mm code?
>>
>
>
> I am not an s390 expert, but just looking at the code
>
> So the check on s390 is effectively
>
> segment_entry/present = false or segment_entry_empty/invalid = true

pmd_present() == true iff _SEGMENT_ENTRY_PRESENT is set

because

	return (pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT) != 0;

is the same as

	return pmd_val(pmd) & _SEGMENT_ENTRY_PRESENT;

But that means we have something where _SEGMENT_ENTRY_PRESENT is not set.

I suspect that can only be the gmap tables.

Likely __gmap_link() does not set _SEGMENT_ENTRY_PRESENT, which is fine
because it's a software managed bit for "ordinary" page tables, not gmap
tables.

Which raises the question why someone would wrongly use
pte_offset_map()/__pte_offset_map() on the gmap tables.

I cannot immediately spot any such usage in kvm/gmap code, though.

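
To make the suspicion above concrete, here is a small user-space model of the
two early-exit conditions from the diff. It is only a sketch: the bit values,
the "migration" marker and all model_* names are made up for illustration and
are not the real s390 definitions. The only thing it tries to show is that an
entry that points to a table but lacks the software PRESENT bit passes the
is_pmd_migration_entry() variant and is rejected by the !pmd_present() variant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only (not the s390 values). */
#define MODEL_SEG_INVALID   0x20u  /* assumed hardware "invalid" bit */
#define MODEL_SEG_PRESENT   0x02u  /* assumed software-managed "present" bit */
#define MODEL_SEG_MIGRATION 0x04u  /* assumed marker for a migration entry */
#define MODEL_SEG_EMPTY     MODEL_SEG_INVALID

typedef struct { uint64_t val; } pmd_model_t;

/* Sample entries; 0x1000 stands in for some page table origin. */
static const pmd_model_t model_ordinary_table = { 0x1000u | MODEL_SEG_PRESENT };
static const pmd_model_t model_gmap_style     = { 0x1000u }; /* no PRESENT bit */
static const pmd_model_t model_empty          = { MODEL_SEG_EMPTY };
static const pmd_model_t model_migration      = { MODEL_SEG_INVALID | MODEL_SEG_MIGRATION };

static bool model_pmd_none(pmd_model_t pmd)
{
	return pmd.val == MODEL_SEG_EMPTY;
}

static bool model_pmd_present(pmd_model_t pmd)
{
	return (pmd.val & MODEL_SEG_PRESENT) != 0;
}

/* Simplified stand-in: non-present and carrying the assumed marker bit. */
static bool model_is_pmd_migration_entry(pmd_model_t pmd)
{
	return !model_pmd_present(pmd) && (pmd.val & MODEL_SEG_MIGRATION) != 0;
}

/* Variant with the !pmd_present() check (the "-" line in the partial revert). */
static bool bails_with_present_check(pmd_model_t pmd)
{
	return model_pmd_none(pmd) || !model_pmd_present(pmd);
}

/* Variant with the is_pmd_migration_entry() check (the "+" line). */
static bool bails_with_migration_check(pmd_model_t pmd)
{
	return model_pmd_none(pmd) || model_is_pmd_migration_entry(pmd);
}
```

In this model the two variants agree for the ordinary table, the empty entry
and the migration entry, and differ only for the entry without the PRESENT
bit. Whether gmap entries really look like that is the open question.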
--
Cheers
David / dhildenb