Message-Id: <20250724084441.380404-1-link@vivo.com>
Date: Thu, 24 Jul 2025 16:44:28 +0800
From: Huan Yang <link@...o.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
	David Hildenbrand <david@...hat.com>,
	Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
	Rik van Riel <riel@...riel.com>,
	"Liam R. Howlett" <Liam.Howlett@...cle.com>,
	Vlastimil Babka <vbabka@...e.cz>,
	Harry Yoo <harry.yoo@...cle.com>,
	Xu Xin <xu.xin16@....com.cn>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Mike Rapoport <rppt@...nel.org>,
	Suren Baghdasaryan <surenb@...gle.com>,
	Michal Hocko <mhocko@...e.com>,
	Zi Yan <ziy@...dia.com>,
	Matthew Brost <matthew.brost@...el.com>,
	Joshua Hahn <joshua.hahnjy@...il.com>,
	Rakie Kim <rakie.kim@...com>,
	Byungchul Park <byungchul@...com>,
	Gregory Price <gourry@...rry.net>,
	Ying Huang <ying.huang@...ux.alibaba.com>,
	Alistair Popple <apopple@...dia.com>,
	"Matthew Wilcox (Oracle)" <willy@...radead.org>,
	Huan Yang <link@...o.com>,
	Christian Brauner <brauner@...nel.org>,
	Usama Arif <usamaarif642@...il.com>,
	Yu Zhao <yuzhao@...gle.com>,
	Baolin Wang <baolin.wang@...ux.alibaba.com>,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: [RFC PATCH 0/9] introduce PGTY_mgt_entry page_type

Summary
==
This patchset reuses page_type to store the migrate entry count during the
period from migrate entry setup to removal. This lets the VMA traversal
stop early when removing migrate entries, on the same principle as the
early termination in try_to_migrate once the folio is unmapped.

In my self-constructed test scenario, the migration time drops from more
than 150 ms to somewhat over 30 ms, nearly a 70% improvement.
Additionally, the flame graph shows that the share of
remove_migration_ptes drops from above 80% to above 60%.

Note: "migrate entry" here refers specifically to a migrate PTE entry,
because large folios do not support reusing page_type while mapcount is
zero.

Principle
==
When a page removes all PTEs in try_to_migrate and sets up a migrate PTE
entry, we can determine whether the traversal of remaining VMAs can be
terminated early by checking if mapcount is zero. This optimization
helps improve performance during migration.

However, when removing migrate PTE entries and setting up PTEs for the
destination folio in remove_migration_ptes, there is no such information
available to assist in deciding whether the traversal of remaining VMAs
can be ended early. Therefore, it is necessary to traverse all VMAs
associated with this folio.

In reality, when a folio is fully unmapped and before all migrate PTE
entries are removed, the mapcount will always be zero. Since page_type
and mapcount share a union, and referring to folio_mapcount, we can
reuse page_type to record the number of migrate PTE entries of the
current folio in the system as long as it's not a large folio. This
reuse does not affect calls to folio_mapcount, which will always return
zero.

Therefore, we can set the folio's page_type to PGTY_mgt_entry when
try_to_migrate completes, the folio is already unmapped, and it's not a
large folio. The remaining 24 bits can then be used to record the number
of migrate PTE entries generated by try_to_migrate.

Then, in remove_migration_ptes, when the nr_mgt_entry count drops to
zero, we can terminate the VMA traversal early.
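
The packing described above can be illustrated with a small user-space
sketch. This is not the code from the patches: the struct, the helper
names and the tag value 0xf7 are all hypothetical, and the only things
taken from the text are the split into a type tag plus a 24-bit count and
the "drop to zero" test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: top 8 bits hold the type tag, low 24 bits the count. */
#define DEMO_TYPE_SHIFT 24
#define DEMO_COUNT_MASK ((UINT32_C(1) << DEMO_TYPE_SHIFT) - 1)
#define DEMO_TYPE_MGT   UINT32_C(0xf7) /* made-up tag value, for illustration only */

/* Stand-in for the field that shares storage with mapcount. */
struct demo_folio {
    uint32_t page_type;
};

static bool demo_is_mgt_entry(const struct demo_folio *f)
{
    return (f->page_type >> DEMO_TYPE_SHIFT) == DEMO_TYPE_MGT;
}

static uint32_t demo_mgt_count(const struct demo_folio *f)
{
    return f->page_type & DEMO_COUNT_MASK;
}

/* Tag the folio and record how many migrate entries were set up. */
static void demo_set_mgt_entry(struct demo_folio *f, uint32_t nr)
{
    assert(nr <= DEMO_COUNT_MASK);
    f->page_type = (DEMO_TYPE_MGT << DEMO_TYPE_SHIFT) | nr;
}

/* One more migrate entry appeared (e.g. a non-present PTE was copied). */
static void demo_mgt_inc(struct demo_folio *f)
{
    assert(demo_is_mgt_entry(f) && demo_mgt_count(f) < DEMO_COUNT_MASK);
    f->page_type++;
}

/* One migrate entry went away; returns true when none are left. */
static bool demo_mgt_dec_and_test(struct demo_folio *f)
{
    assert(demo_is_mgt_entry(f) && demo_mgt_count(f) > 0);
    f->page_type--;
    return demo_mgt_count(f) == 0;
}
```

Because the count never underflows (the assert guards it), decrementing
the low bits leaves the type tag untouched, so the folio stays
recognizable as carrying a migrate entry count until the tag is cleared
explicitly.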

It's important to note that we need to initialize the folio's page_type
to PGTY_mgt_entry and set the migrate entry count only while holding the
rmap walk lock. This is because during the lock period, we can prevent
new VMA fork (which would increase migrate entries) and VMA unmap
(which would decrease migrate entries).

However, I suspect there is actually an additional race window here. For
example, with anon memory:

Process Parent                          fork
try_to_migrate
                                        anon_vma_clone
                                            write_lock
                                                avc_insert_tree tail
                                        ....
    folio_lock_anon_vma_read             copy_pte_range
        vma_iter                            pte_lock
                ....                           pte_present copy
                                            ...
                pte_lock
                    new forked pte clean
....
remove_migration_ptes
    rmap_walk_anon_lock

If my understanding is correct and such a critical section exists, it
shouldn't cause any issues—newly added PTEs can still be properly
removed and converted into migrate entries.

But in this:

Process Parent                          fork
try_to_migrate
                                        anon_vma_clone
                                            write_lock
                                                avc_insert_tree
                                        ....
    folio_lock_anon_vma_read             copy_pte_range
        vma_iter
                pte_lock
                    migrate entry set
                ....                        pte_lock
                                                pte_nonpresent copy
                                            ....
....
remove_migration_ptes
    rmap_walk_anon_lock

If the parent process first acquires the pte_lock to set a migrate
entry, the child process will then directly copy the non-present migrate
entry, resulting in an increase in migrate entries. However, since the
newly added VMA is positioned later in the rb tree of the folio's
anon_vma, when we traverse to this child-process-added migrate entry,
the count of migrate entries will still be correctly recorded, and this
will not cause any issues.

If I misunderstand, please correct me. :)

After a folio exits try_to_migrate and before remove_migration_ptes
acquires the rmap lock, the system can perform normal fork and unmap
operations. Therefore, we need to increment or decrement the migrate
entry count recorded in the folio (if it's of type PGTY_mgt_entry) when
handling copy/zap_nonpresent_pte.

When performing remove_migration_ptes during migration to start removing
migrate entries, we need to dynamically decrement the recorded migrate
entry count. Once this count reaches zero, it indicates there are no
remaining migrate entries in the associated VMAs that need to be cleared
and replaced with the destination PFN. This allows us to safely
terminate the VMA traversal early.
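
The effect of the early exit can be sketched as a toy model in user
space. Everything here is an assumption made for illustration: the
demo_vma struct, the demo_walk function and the flat array standing in
for the anon_vma interval tree are hypothetical and do not correspond to
the real rmap_walk code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one VMA reachable from the folio's anon_vma. */
struct demo_vma {
    bool has_migrate_entry;
};

/*
 * Visit the VMAs in order and "restore" every migrate entry found.
 * If nr_mgt_entry is non-NULL it is decremented per restored entry and
 * the walk stops once it reaches zero; NULL models the untracked walk
 * that has to visit every VMA. Returns the number of VMAs visited.
 */
static size_t demo_walk(struct demo_vma *vmas, size_t nr_vmas,
                        unsigned int *nr_mgt_entry)
{
    size_t visited = 0;

    for (size_t i = 0; i < nr_vmas; i++) {
        visited++;
        if (!vmas[i].has_migrate_entry)
            continue;

        vmas[i].has_migrate_entry = false;
        if (nr_mgt_entry) {
            assert(*nr_mgt_entry > 0); /* the count must never underflow */
            if (--*nr_mgt_entry == 0)
                break; /* nothing left to restore, stop early */
        }
    }
    return visited;
}
```

With 1001 VMAs of which only the first holds a migrate entry, the tracked
walk (count initialized to 1) visits a single VMA, while the untracked
walk visits all 1001. In a real anon_vma the position of the mapping VMA
in the traversal order is not guaranteed, so the saving depends on where
the last migrate entry is found.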

However, it's important to note that if something goes wrong during
migration and an undo is required, PGTY_mgt_entry can no longer be used.
This is because the dst needs to be set back to the src, and the presence
of PGTY_mgt_entry would interfere with the normal use of mapcount when
setting up the rmap information.

Test
==
I set up a 2-node test environment using QEMU, and used mbind to trigger
page migration between nodes for the specified VMA.

The core idea of the test scenario is to create a situation where the
number of VMAs that need to be iterated in the anon_vma is significantly
larger than the folio's mapcount.

To achieve this, I constructed an exaggerated scenario: the parent
process allocates 5MB of memory and binds it to node0, then immediately
forks 1000 child processes. Each child immediately memsets the whole
region so that every page is COWed. Afterwards, the parent process
calls mbind to migrate the memory from node0 to node1, while recording
the time consumed during this period.
Additionally, perf is used to capture a flame graph during the mbind
execution.

The time cost results are as follows:
    Patch1-9               Normal(f817b6d)
      18ms                    197ms
      58ms                    152ms
      40ms                    120ms

Hot paths shown in the flame graph:
    Patch1-9
      move_to_new_folio        38.89%
      remove_migration_ptes    61.11%
      ---------------------
      move_to_new_folio        32.76%
      remove_migration_ptes    67.24%
      ---------------------
      move_to_new_folio        37.50%
      remove_migration_ptes    62.50%

    Normal(f817b6d)
      move_to_new_folio        11.43%
      remove_migration_ptes    87.43%
      ---------------------
      move_to_new_folio        13.91%
      remove_migration_ptes    86.09%
      ---------------------
      move_to_new_folio        12.50%
      remove_migration_ptes    85.83%

Averaged over the three runs, the migration time is reduced by
approximately 75.3%, and the proportion of the remove_migration_ptes
function path decreases by approximately 20%.
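
For reference, the 75.3% figure can be reproduced from the three timing
pairs in the table above by comparing the means. The helper names below
are hypothetical and the arrays simply restate the measured values.

```c
#include <assert.h>
#include <stddef.h>

/* Migration times in ms, copied from the results table. */
static const double demo_patched_ms[]  = { 18.0, 58.0, 40.0 };
static const double demo_baseline_ms[] = { 197.0, 152.0, 120.0 };

static double demo_mean(const double *v, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Relative reduction of the mean time, in percent. */
static double demo_reduction_pct(const double *before, const double *after,
                                 size_t n)
{
    return (1.0 - demo_mean(after, n) / demo_mean(before, n)) * 100.0;
}
```

Calling demo_reduction_pct(demo_baseline_ms, demo_patched_ms, 3) compares
a mean of about 38.7 ms against about 156.3 ms, which gives roughly 75.3.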

Simplified test code:

```c
#define _GNU_SOURCE
#include <numaif.h>     /* mbind(), MPOL_*; link with -lnuma */
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE    (5 << 20)
#define CHILD_COUNT 1000

int main(void)
{
    int *buffer = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    unsigned long mask = 1UL << 0;
    mbind(buffer, BUF_SIZE, MPOL_BIND, &mask, 2, 0);
    /* fault all pages in on node 0 */
    memset(buffer, 0, BUF_SIZE);

    /* fork the children */
    pid_t children[CHILD_COUNT];
    for (int i = 0; i < CHILD_COUNT; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* each child COWs every page, then idles */
            memset(buffer, 0, BUF_SIZE);
            sleep(100000);
            _exit(0); /* never fall back into the fork loop */
        } else {
            children[i] = pid;
        }
    }

    /* you may need to sleep here until the children have finished COW */
    sleep(10);

    /* attach perf here */
    mask = 1UL << 1;
    /* migrate the buffer from node 0 to node 1 */
    mbind(buffer, BUF_SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE);

    return 0;
}
```
Note: most error checks, resource cleanup, timing code and so on have
been removed from this listing.

Why RFC
==
Memory migration is one of the most general-purpose modules.
My own tests cannot cover all system scenarios, and there
may be omissions or misunderstandings in the code modifications.

If this looks good enough, I will send a formal patch.

Patches 1-7 do some code cleanup.
Patch 8 prepares for PGTY_mgt_entry.
Patch 9 applies it.

Huan Yang (9):
  mm: introduce PAGE_TYPE_SHIFT
  mm: add page_type value helper
  mm/rmap: simplify rmap_walk invoke
  mm/rmap: add args in rmap_walk_control done hook
  mm/rmap: introduce exit hook
  mm/rmap: introduce migrate_walk_arg
  mm/migrate: rename rmap_walk_arg folio
  mm/migrate: infrastructure for migrate entry page_type.
  mm/migrate: apply migrate entry page_type

 include/linux/page-flags.h | 106 +++++++++++++++++++++++++++++++++++--
 include/linux/rmap.h       |   7 ++-
 mm/ksm.c                   |   2 +-
 mm/memory.c                |   2 +
 mm/migrate.c               |  38 ++++++++++---
 mm/rmap.c                  |  85 ++++++++++++++++++-----------
 6 files changed, 193 insertions(+), 47 deletions(-)

--
2.34.1

