lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6281ffc9-398e-44b9-a95c-2527004e09b7@bytedance.com>
Date: Thu, 27 Feb 2025 14:58:50 +0800
From: Qi Zheng <zhengqi.arch@...edance.com>
To: Johannes Weiner <hannes@...xchg.org>, peterz@...radead.org
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: CONFIG_PT_RECLAIM

Hi Johannes,

On 2/27/25 2:08 PM, Johannes Weiner wrote:
> On Thu, Feb 27, 2025 at 11:04:51AM +0800, Qi Zheng wrote:
>> Hi Johannes,
>>
>> On 2/27/25 2:30 AM, Johannes Weiner wrote:
>>> Does PT_RECLAIM need to be configurable by the user?
>>
>> The PT_RECLAIM will select MMU_GATHER_RCU_TABLE_FREE, but not all archs
>> support MMU_GATHER_RCU_TABLE_FREE, and even before Rik's a37259732a7dc
>> ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE unconditional"), x86 only
>> supports MMU_GATHER_RCU_TABLE_FREE in the case of PARAVIRT.
>>
>> Therefore, PT_RECLAIM also implies the meaning of enabling
>> MMU_GATHER_RCU_TABLE_FREE, so I made it user-configurable. And I just
>> thought that as a new feature, it would be better to give users the
>> ability to turn it on and off.
> 
> New *features*, yes - something that has a significant enough cost
> that clearly not all users want to pay for the benefits.

Got it.

> 
> But it's hard to imagine anybody would WANT to keep the page tables
> around if they madvised away all the pages inside of them. It's a
> great optimization, what would be a reason to opt out?

OK, now I think it makes sense to change it to 'def_bool y'.

> 
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 2761098dbc1a..99383c93db33 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1309,16 +1309,9 @@ config ARCH_SUPPORTS_PT_RECLAIM
>>>    	def_bool n
>>>    
>>>    config PT_RECLAIM
>>> -	bool "reclaim empty user page table pages"
>>> -	default y
>>> +	def_bool y
>>>    	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>>    	select MMU_GATHER_RCU_TABLE_FREE
>>> -	help
>>> -	  Try to reclaim empty user page table pages in paths other than munmap
>>> -	  and exit_mmap path.
>>> -
>>> -	  Note: now only empty user PTE page table pages will be reclaimed.
>>> -
>>
>> Maybe keep the help information?
> 
> I don't find it very helpful :( Which "other paths?" It doesn't
> explain any pros and cons, and why anybody might choose to enable or
> disable it. The Note repeats what's in the sentence before it.

Sorry about that. :(

> 
> Maybe I'm missing something. Could this not just be an #ifdef block
> inside mm/madvise.c, instead of living inside a new file with two new
> config symbols?
> 
> #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> ...
> #endif
> 
> Is there an arch-specific feature that it requires besides
> MMU_GATHER_RCU_TABLE_FREE such that only x86 supports it now?

No, it only needs MMU_GATHER_RCU_TABLE_FREE.

> 
> And why *does* it require MMU_GATHER_RCU_TABLE_FREE?

Because in the madvise(MADV_DONTNEED) path, mmu_gather has been used
to batch flush tlb and free physical pages. It is a better choice to
free PTE pages in this ways as well.

And because PT_RECLAIM needs rcu, we need MMU_GATHER_RCU_TABLE_FREE to
make pte_free_tlb() free PTE pages through rcu. Of course, we also need
to modify __tlb_remove_table_one() to make it use rcu as well.

> 
> Documentation/mm/process_addrs.rst explains why you need rcu, but
> there is free_pte_defer() that THP was using long before x86 needed
> MMU_GATHER_RCU_TABLE_FREE. It seems to me if you could use that, this
> feature would also work fine on architectures that do not generally
> need RCU for flush & frees otherwise. So is the main issue that there

As mentioned above, we want to flush & frees in batches, so we don't
use pte_free_defer().

> just isn't an explicitly deferred variant of pte_free_tlb()?

The pte_free_defer() seems to have been adapted to all archs, so I
wonder if all archs can support MMU_GATHER_RCU_TABLE_FREE, so that
pte_free_tlb() will always use rcu to free PTE pages.

Maybe I missed something.

+Peter.

> 
> If so, this is a fairly non-obvious dependency that should be
> documented. It would help somebody trying to port this to a !RCU
> mmu_gather arch.
> 
> And I apologize if all this was discussed before. But if it was, the
> conclusions should be in the changelog or in code comments. This is a
> very delicate synchronization scheme that I think deserves explicit
> documentation somewhere.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ