[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <23f50178-59d1-462b-8463-48cc707d6b3f@os.amperecomputing.com>
Date: Tue, 14 Jan 2025 13:24:25 -0800
From: Yang Shi <yang@...amperecomputing.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: arnd@...db.de, gregkh@...uxfoundation.org, Liam.Howlett@...cle.com,
vbabka@...e.cz, jannh@...gle.com, willy@...radead.org,
liushixin2@...wei.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] /dev/zero: make private mapping full anonymous mapping
On 1/14/25 11:13 AM, Lorenzo Stoakes wrote:
> On Tue, Jan 14, 2025 at 11:03:48AM -0800, Yang Shi wrote:
>>
>>
>> On 1/14/25 10:14 AM, Lorenzo Stoakes wrote:
>>> This is getting into realms of discussion so to risk sounding rude - to be
>>> clear - NACK.
>>>
>>> The user-visible change to /proc/$pid/[s]maps kills this patch dead. This
>>> is regardless of any other discussed issue.
>> I admit this is a concern, but I don't think this is really that bad to kill
>> this patch. May this change result in userspace regression? Maybe, likely
>> happens to some debugging and monitoring scripts, typically we don't worry
>> them that much. Of course, I can't completely guarantee no regression for
>> real life applications, it should just be unlikely IMHO.
> Yeah, I don't think we can accept this unfortunately.
>
> This patch is SUPER important though even if rejected, because you've made
> me realise we really need to audit all of these mmap handlers... so it's
> all super appreciated regardless :)
:-)
>
>>> But more importantly, I hadn't realise mmap_zero() was on the .mmap()
>>> callback (sorry my mistake) - you're simply not permitted to change
>>> vm_pgoff and vm_file fields here, the mapping logic doesn't expect it, and
>>> it's broken.
>>>
>>> To me the alternative would be to have a custom fault handler that hands
>>> back the zero page, but I"m not sure that's workable, you'd have to install
>>> a special mapping etc. and huge pages are weird and...
>> TBH, I don't think we need to make fault handler more complicated, it is
>> just handled as anonymous fault handler.
>>
>> I understand your concern about changing those vma filed outside core mm. An
>> alternative is to move such change to vma.c. For example:
>>
>> diff --git a/mm/vma.c b/mm/vma.c
>> index bb2119e5a0d0..2a7ea9901f57 100644
>> --- a/mm/vma.c
>> +++ b/mm/vma.c
>> @@ -2358,6 +2358,12 @@ static int __mmap_new_vma(struct mmap_state *map,
>> struct vm_area_struct **vmap)
>> else
>> vma_set_anonymous(vma);
>>
>> + if (vma_is_anonymous(vma) && vma->vm_file) {
>> + fput(vma->vm_file);
>> + vma->vm_file = NULL;
>> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
>> + }
>> +
> OK that's more interesting. Though the user-facing thing remains...
>
> It's possiible we could detect that the underlying thing is a zero page and
> manually print out /dev/zero, but can somebody create a zero page file
> elsewhere? In which case they might find this confusing.
I'm not sure about file mapping. However reading an anonymous mapping
will instantiate zero page. It should not be marked as /dev/zero mapping.
>
> It's actually a nice idea to have this _explicitly_ covered off as we could
> then also add a comment explaining 'hey there's this weird type of VMA' and
> have it in a place where it's actually obvious to mm folk anyway.
>
> But this maps thing is just a killer. Somebody somewhere will be
> confused. And it is not for us to judge whether that's silly or not...
I just thought of named anonymous VMA may help. We can give the private
/dev/zero mapping a name, for example, just "/dev/zero". However,
"[anon:/dev/zero]" will show up in smaps/maps. We can't keep the device
numbers and inode number either, but it seems it can tell the user this
mapping comes from /dev/zero, and it also explicitly tells us it is
specially treated by kernel. Hopefully setting anon_name is permitted.
>
>> if (error)
>> goto free_iter_vma;
>>
>>
>>> I do appreciate you raising this especially as I was blissfully unaware,
>>> but I don't see how this patch can possibly work, sorry :(
>>>
>>> On Tue, Jan 14, 2025 at 08:53:01AM -0800, Yang Shi wrote:
>>>>
>>>> On 1/14/25 4:05 AM, Lorenzo Stoakes wrote:
>>>>> + Willy for the fs/weirdness elements of this.
>>>>>
>>>>> On Mon, Jan 13, 2025 at 02:30:33PM -0800, Yang Shi wrote:
>>>>>> When creating private mapping for /dev/zero, the driver makes it an
>>>>>> anonymous mapping by calling set_vma_anonymous(). But it just sets
>>>>>> vm_ops to NULL, vm_file is still valid and vm_pgoff is also file offset.
>>>>> Hm yikes.
>>>>>
>>>>>> This is a special case and the VMA doesn't look like either anonymous VMA
>>>>>> or file VMA. It confused other kernel subsystem, for example, khugepaged [1].
>>>>>>
>>>>>> It seems pointless to keep such special case. Making private /dev/zero
>>>>>> mapping a full anonymous mapping doesn't change the semantic of
>>>>>> /dev/zero either.
>>>>> My concern is that ostensibly there _is_ a file right? Are we certain that by
>>>>> not setting this we are not breaking something somewhere else?
>>>>>
>>>>> Are we not creating a sort of other type of 'non-such-beast' here?
>>>> But the file is /dev/zero. I don't see this could break the semantic of
>>>> /dev/zero. The shared mapping of /dev/zero is not affected by this change,
>>>> kernel already treated private mapping of /dev/zero as anonymous mapping,
>>>> but with some weird settings in VMA. When reading the mapping, it returns 0
>>>> with zero page, when writing the mapping, a new anonymous folio is
>>>> allocated.
>>> You're creating a new concept of an anon but not anon but also now with
>>> anon vm_pgoff and missing vm_file even though it does reference a file
>>> and... yeah.
>>>
>>> This is not usual :)
>> It does reference a file, but the file is /dev/zero... And if kernel already
>> treated it as anonymous mapping, it sounds like the file may not matter that
>> much, so why not make it as a real anonymous mapping? Then we end up having
>> anonymous VMA and file VMA only instead of anonymous VMA, file VMA and
>> hybrid special VMA. So we have less thing to worry about. If VMA is
>> anonymous VMA, it is guaranteed vm_file is NULL, vm_ops is NULL and vm_pgoff
>> is linear pgoff. But it is not true now.
> It's about user confusion for me really.
>
>>>>> I mean already setting it anon and setting vm_file non-NULL is really strange.
>>>>>
>>>>>> The user visible effect is the mapping entry shown in /proc/<PID>/smaps
>>>>>> and /proc/<PID>/maps.
>>>>>>
>>>>>> Before the change:
>>>>>> ffffb7190000-ffffb7590000 rw-p 00001000 00:06 8 /dev/zero
>>>>>>
>>>>>> After the change:
>>>>>> ffffb6130000-ffffb6530000 rw-p 00000000 00:00 0
>>>>>>
>>>>> Yeah this seems like it might break somebody to be honest, it's really
>>>>> really really strange to map a file then for it not to be mapped.
>>>> Yes, it is possible if someone really care whether the anonymous-like
>>>> mapping is mapped by /dev/zero or just created by malloc(). But I don't know
>>>> who really do...
>>>>
>>>>> But it's possibly EVEN WEIRDER to map a file and for it to seem mapped as a
>>>>> file but for it to be marked anonymous.
>>>>>
>>>>> God what a mess.
>>>>>
>>>>>> [1]: https://lore.kernel.org/linux-mm/20250111034511.2223353-1-liushixin2@huawei.com/
>>>>> I kind of hate that we have to mitigate like this for a case that should
>>>>> never ever happen so I'm inclined towards your solution but a lot more
>>>>> inclined towards us totally rethinking this.
>>>>>
>>>>> Do we _have_ to make this anonymous?? Why can't we just reference the zero
>>>>> page as if it were in the page cache (Willy - feel free to correct naive
>>>>> misapprehension here).
>>>> TBH, I don't see why page cache has to be involved. When reading, 0 is
>>>> returned by zero page. When writing a CoW is triggered if page cache is
>>>> involved, but the content of the page cache should be just 0, so we copy 0
>>>> to the new folio then write to it. It doesn't make too much sense. I think
>>>> this is why private /dev/zero mapping is treated as anonymous mapping in the
>>>> first place.
>>> I'm obviously not suggesting allocating a bunch of extra folios, I was
>>> thinking there would be some means of handing back the actual zero
>>> page. But I am not sure this is workable.
>> As I mentioned above, even handing back zero page should be not needed.
> Ack.
>
>>>>>> Signed-off-by: Yang Shi <yang@...amperecomputing.com>
>>>>>> ---
>>>>>> drivers/char/mem.c | 4 ++++
>>>>>> 1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
>>>>>> index 169eed162a7f..dae113f7fc1b 100644
>>>>>> --- a/drivers/char/mem.c
>>>>>> +++ b/drivers/char/mem.c
>>>>>> @@ -527,6 +527,10 @@ static int mmap_zero(struct file *file, struct vm_area_struct *vma)
>>>>>> if (vma->vm_flags & VM_SHARED)
>>>>>> return shmem_zero_setup(vma);
>>>>>> vma_set_anonymous(vma);
>>>>>> + fput(vma->vm_file);
>>>>>> + vma->vm_file = NULL;
>>>>>> + vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
>>> This is just not permitted. We maintain mmap state which contains the file
>>> and pgoff state which gets threaded through the mapping operation, and
>>> simply do not expect you to change these fields.
>>>
>>> In future we will assert on this or preferably, restrict users to only
>>> changing VMA flags, the private field and vm_ops.
>> Sure, hardening the VMA initialization code and making less surprising
>> corner case is definitely helpful.
> Yes and I've opened a can of worms and the worms have jumped out and on to
> my face and were not worms but in fact an alien facehugger :P
>
> In other words, I am going to be looking into this very seriously and
> auditing this whole thing... yay for making work for myself... :>)
Thank you for taking the action to kill the alien facehugger :-)
>
>>>>> Hmm, this might have been mremap()'d _potentially_ though? And then now
>>>>> this will be wrong? But then we'd have no way of tracking it correctly...
>>>> I'm not quite familiar with the subtle details and corner cases of
>>>> meremap(). But mmap_zero() should be called by mmap(), so the VMA has not
>>>> been visible to user yet at this point IIUC. How come mremap() could move
>>>> it?
>>> Ah OK, in that case fine on that front.
>>>
>>> But you are not permitted to touch this field (we need to enforce this...)
>>>
>>>>> I've not checked the function but do we mark this as a special mapping of
>>>>> some kind?
>>>>>
>>>>>> +
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> --
>>>>>> 2.47.0
>>>>>>
Powered by blists - more mailing lists