Message-ID: <df71c9e6-52e2-4afa-b0bd-42f5aadbad71@amd.com>
Date: Thu, 22 May 2025 13:03:35 +0530
From: Bharata B Rao <bharata@....com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Jonathan.Cameron@...wei.com, dave.hansen@...el.com, hannes@...xchg.org,
mgorman@...hsingularity.net, mingo@...hat.com, peterz@...radead.org,
raghavendra.kt@....com, riel@...riel.com, rientjes@...gle.com,
sj@...nel.org, weixugc@...gle.com, willy@...radead.org,
ying.huang@...ux.alibaba.com, ziy@...dia.com, dave@...olabs.net,
nifan.cxl@...il.com, joshua.hahnjy@...il.com, xuezhengchu@...wei.com,
yiannis@...corp.com, akpm@...ux-foundation.org, david@...hat.com
Subject: Re: [RFC PATCH v0 2/2] mm: sched: Batch-migrate misplaced pages

On 22-May-25 9:25 AM, Gregory Price wrote:
> On Wed, May 21, 2025 at 01:32:38PM +0530, Bharata B Rao wrote:
>>
>> +static void task_check_pending_migrations(struct task_struct *curr)
>> +{
>> +	struct callback_head *work = &curr->numa_mig_work;
>> +
>> +	if (work->next != work)
>> +		return;
>> +
>> +	if (time_after(jiffies, curr->numa_mig_interval) ||
>> +	    (curr->migrate_count > NUMAB_BATCH_MIGRATION_THRESHOLD)) {
>> +		curr->numa_mig_interval = jiffies + HZ;
>> +		task_work_add(curr, work, TWA_RESUME);
>> +	}
>> +}
>> +
>> /*
>> * Drive the periodic memory faults..
>> */
>> @@ -3610,6 +3672,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>>  	if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>>  		return;
>>
>> +	task_check_pending_migrations(curr);
>> +
>
> So I know this was discussed in the cover letter a bit and alluded to in
> the patch, but I want to add my 2 cents from work on the unmapped page
> cache set.
>
> In that set, I chose to always schedule the task work on the next return
> to user-space, rather than defer to a tick like the current numa-balance
> code. That was due to two concerns:
>
> 1) I didn't want to leave a potentially large number of isolated folios
> on a list that may not be reaped for an unknown period of time.
>
> I don't know the real limitations on the number of isolated folios,
> but given what we have here I think we can represent a mathematical
> worst case on the number of stranded folios.
>
> If (N=1,000,000, and M=511) then we could have ~1.8TB of pages
> stranded on these lists - never to be migrated because it never hits
> the threshold. In practice this won't happen to that extreme, but
> it absolutely will happen for some chunk of tasks.
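(As a rough check of that worst case, assuming 4 KiB base pages:
1,000,000 tasks * 511 folios * 4 KiB works out to roughly 2 TB of
isolated memory, so the order of magnitude matches.)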
In addition to the threshold, there is a time limit too, so at the end
of that period the isolated folios do get migrated even if the
threshold isn't hit.
The other thing I haven't taken care of yet is putting back the
isolated folios if the task exits while they are still pending.
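A rough sketch of what that could look like (purely illustrative; it
assumes the per-task numa_mig_list you mention below, and the exact
exit-path hook point is not decided yet):

/*
 * Illustrative only: return still-isolated folios to the LRU if the
 * task exits before the batched migration task_work has run. Assumes
 * a per-task list (numa_mig_list) holding the isolated folios.
 */
static void numa_mig_putback_pending(struct task_struct *tsk)
{
	if (list_empty(&tsk->numa_mig_list))
		return;

	/* putback_movable_pages() drops the isolation and re-adds to the LRU */
	putback_movable_pages(&tsk->numa_mig_list);
	tsk->migrate_count = 0;
}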
>
> So I chose to never leave kernel space with isolated folios on the
> task numa_mig_list.
>
> This discussion changes if the numa_mig_list is not on the
> task_struct and instead some per-cpu list routinely reaped by a
> kthread (kpromoted or whatever).
>
>
> 2) I was not confident I could measure the performance implications of
> the migrations directly when they were deferred. When would I even know
> it happened? The actual goal is to *not* know it happened, right?
>
> But now it might happen during a page fault, or any random syscall.
>
> This concerned me - so I just didn't defer. That was largely out of a
> lack of confidence in my own understanding of the task_work system.
>
>
> So I think this, as presented, is a half-measure - and I don't think
> it's a good half-measure. I think we might need to go all the way to a
> set of per-cpu migration lists that a kernel worker can pluck the head of
> on some interval. That would bound the number of isolated folios to the
> number of CPUs rather than the number of tasks.
Why per-cpu and not per-node? All folios that are targeted for a node
can be in that node's list.
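Roughly something of this shape is what I have in mind (the names here
are made up, just to show the idea):

/*
 * Illustrative sketch only: one list of folios per target node,
 * drained by that node's migrator thread instead of by each
 * task's task_work.
 */
struct node_mig_list {
	spinlock_t		lock;
	struct list_head	folios;		/* folios destined for this node */
	unsigned long		nr_folios;
};

static struct node_mig_list node_mig_lists[MAX_NUMNODES];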
I think that if we are leaving the migration to be done later by a
migrator thread, then isolating the folios beforehand may not be ideal.
In such cases, tracking the hot pages via PFNs, like I did in
kpromoted, may be better.
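That is, track something like the record below per hot page (field
names are illustrative, not the actual kpromoted ones), and let the
migrator isolate the folio only when it actually migrates it:

/*
 * Illustrative only: record a hot page by PFN plus access metadata;
 * the migrator resolves the PFN to a folio and isolates it at
 * migration time.
 */
struct hot_page_record {
	unsigned long	pfn;
	int		target_nid;	/* node to promote/migrate to */
	unsigned long	last_seen;	/* jiffies of last recorded access */
	unsigned int	nr_accesses;	/* how hot the page looked */
};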
Even when we have per-node migrator threads that would handle migration
requests from multiple hot page sources (or a single unified layer), I
still think that there should be a "migrate now" kind of interface
(which is essentially what your migrate_misplaced_folio_batch() is).
That will be more suitable for handling migration requests originating
from locality-based NUMA balancing (the NUMAB=1 case).
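As a sketch of what I mean (the exact signature is whatever your series
settles on; I'm assuming it takes an isolated folio list and a target
node, and target_nid here is just a placeholder):

	LIST_HEAD(migrate_list);

	/*
	 * Sketch only: the NUMAB=1 fault/task_work path isolates the
	 * misplaced folios it just saw and migrates them right away,
	 * instead of queueing them for a migrator thread.
	 */
	/* ... isolate misplaced folios onto migrate_list ... */
	migrate_misplaced_folio_batch(&migrate_list, target_nid);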
Regards,
Bharata.