Message-ID: <57617a1a-15ab-4c08-bcdf-ff0ff9bbaf96@linux.alibaba.com>
Date: Wed, 1 Nov 2023 21:00:26 +0800
From: Rongwei Wang <rongwei.wang@...ux.alibaba.com>
To: Khalid Aziz <khalid.aziz@...cle.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-arch@...r.kernel.org" <linux-arch@...r.kernel.org>,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Peter Xu <peterx@...hat.com>,
Mark Hemment <markhemm@...glemail.com>
Subject: Re: Sharing page tables across processes (mshare)
On 2023/11/1 07:01, Khalid Aziz wrote:
> On 10/29/23 20:45, Rongwei Wang wrote:
>>
>>
>> On 2023/10/24 06:44, Khalid Aziz wrote:
>>> Threads of a process share address space and page tables, which allows
>>> for two key advantages:
>>>
>>> 1. The amount of memory required for PTEs to map physical pages stays
>>> low even when a large number of threads share the same pages, since
>>> PTEs are shared across threads.
>>>
>>> 2. Page protection attributes are shared across threads, and a change
>>> of attributes applies immediately to every thread without any overhead
>>> of coordinating protection bit changes across threads.
>>>
>>> These advantages no longer apply when unrelated processes share pages.
>>> Some applications can require thousands of processes that all access
>>> the same set of data on shared pages. For instance, a database server
>>> may map a large chunk of the database into memory to provide clients
>>> fast access to data through its buffer cache. The server may launch new
>>> processes to provide services to new clients connecting to the shared
>>> database. Each new process will map in the shared database pages. When
>>> the PTEs for mapping in shared pages are not shared across processes,
>>> each process consumes some memory to store these PTEs. On x86_64, each
>>> page requires a PTE that is only 8 bytes long, which is very small
>>> compared to the 4K page size. However, when 2000 processes map the same
>>> page in their address space, each one of them requires 8 bytes for its
>>> PTE, and together that adds up to 16K of memory just to hold the PTEs
>>> for one 4K page. On a database server with a 300GB SGA, a system crash
>>> with an out-of-memory condition was seen when 1500+ clients tried to
>>> share this SGA even though the system had 512GB of memory. On this
>>> server, the worst case scenario of all 1500 processes mapping every
>>> page from the SGA would have required 878GB+ just for the PTEs. If
>>> these PTEs could be shared, the amount of memory saved is very
>>> significant.
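
The arithmetic above can be sanity-checked with a small, illustrative
calculation (my own sketch, not part of the proposal), assuming 4K
pages and 8-byte PTEs on x86_64:

	#include <stdio.h>

	/* Back-of-the-envelope PTE overhead for a 300GB region mapped
	 * by 1500 processes, assuming 4K pages and 8-byte PTEs. */
	int main(void)
	{
		unsigned long long region = 300ULL << 30;    /* 300GB SGA            */
		unsigned long long pages = region / 4096;    /* number of 4K pages   */
		unsigned long long per_proc = pages * 8;     /* PTE bytes per process*/
		unsigned long long total = per_proc * 1500;  /* 1500 processes       */

		printf("PTE memory per process: %llu MB\n", per_proc >> 20);
		printf("PTE memory for 1500 processes: %llu GB\n", total >> 30);
		return 0;
	}

This prints 600 MB per process and 878 GB in total, matching the
numbers quoted above.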
>>>
>>> When PTEs are not shared between processes, each process ends up with
>>> its own set of protection bits for each shared page. Database servers
>>> often need to change protection bits for pages as they manipulate and
>>> update data in the database. When changing page protection for a shared
>>> page, the PTEs in all processes that have mapped the shared page need
>>> to be updated to ensure data integrity. To accomplish this, the process
>>> making the initial change to protection bits sends messages to every
>>> process sharing that page. All processes then block any access to that
>>> page, make the appropriate change to protection bits, and send a
>>> confirmation back. To ensure data consistency, access to the shared
>>> page can be resumed only after all processes have acknowledged the
>>> change. This is a disruptive and expensive coordination process. If
>>> PTEs were shared across processes, a change to page protection for a
>>> shared PTE would apply to all processes instantly, with no coordination
>>> required to ensure consistency. Changing protection bits across all
>>> processes sharing database pages is a common enough operation on Oracle
>>> databases that the cost is significant, and it goes up with the number
>>> of clients.
>>>
>>> This is a proposal to extend the same model of page table sharing used
>>> by threads to unrelated processes. This will allow processes to tap
>>> into the same benefits that threads get from shared page tables.
>>>
>>> Sharing page tables across processes opens their address spaces to each
>>> other and thus must be done carefully. This proposal suggests sharing
>>> PTEs across processes that trust each other and have explicitly agreed
>>> to share page tables. The proposal is to add a new flag to the mmap()
>>> call - MAP_SHARED_PT. This flag can be specified along with MAP_SHARED
>>> by a process to hint to the kernel that it wishes to share the page
>>> table entries for this file-mapping mmap region with other processes.
>>> Any other process that mmaps the same file with the MAP_SHARED_PT flag
>>> can then share the same page table entries. Besides specifying the
>>> MAP_SHARED_PT flag, the processes must map the file at a PMD-aligned
>>> address, with a size that is a multiple of the PMD size, and at the
>>> same virtual address. NOTE: This last requirement of identical virtual
>>> addresses can possibly be relaxed if that is the consensus.
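
For illustration, here is a minimal sketch of how two cooperating
processes might use this. MAP_SHARED_PT is assumed to come from the
patched kernel headers; the file path, address, and 512MB size below
are made up for the example and only need to satisfy the alignment
rules described above:

	#include <sys/mman.h>
	#include <fcntl.h>
	#include <stdio.h>

	/* Both processes run the same code: same file, same PMD-aligned
	 * virtual address, same PMD-multiple size, MAP_SHARED_PT set. */
	int main(void)
	{
		int fd = open("/data/shared_db", O_RDWR);	/* hypothetical file */

		/* The hint address must actually be honored (or MAP_FIXED
		 * used) for sharing to work, since both processes need the
		 * mapping at the same virtual address. */
		void *addr = mmap((void *)(2UL << 40), 512UL << 20,
				  PROT_READ | PROT_WRITE,
				  MAP_SHARED | MAP_SHARED_PT, fd, 0);

		if (addr == MAP_FAILED)
			perror("mmap(MAP_SHARED_PT)");
		return 0;
	}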
>>>
>>> When mmap() is called with the MAP_SHARED_PT flag, a new host mm struct
>>> is created to hold the shared page tables. The host mm struct is not
>>> attached to a process. The start and size of the host mm are set to the
>>> start and size of the mmap region, and a VMA covering this range is
>>> added to the host mm struct. Existing page table entries from the
>>> process that creates the mapping are copied over to the host mm struct.
>>> All processes mapping this shared region are considered guest
>>> processes. When a guest process mmaps the shared region, a vm flag,
>>> VM_SHARED_PT, is added to the VMAs in the guest process. Upon a page
>>> fault, the VMA is checked for the presence of the VM_SHARED_PT flag. If
>>> the flag is found, the corresponding PMD is updated with the PMD from
>>> the host mm struct so that the PMD points to the page tables in the
>>> host mm struct. When a new PTE is created, it is created in the host mm
>>> struct's page tables, and the PMD in the guest mm points to the same
>>> PTEs.
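
To make the fault path described above easier to picture, here is a
rough conceptual sketch in kernel-style C. The helper names
(ptshare_host_mm(), ptshare_host_pmd()) are hypothetical and are not
taken from the patch; only the overall flow follows the description
above:

	/* Conceptual guest page fault handling (hypothetical helpers). */
	if (vma->vm_flags & VM_SHARED_PT) {
		struct mm_struct *host_mm = ptshare_host_mm(vma);	/* hypothetical */
		pmd_t *host_pmd = ptshare_host_pmd(host_mm, address);	/* hypothetical */

		/*
		 * Point the guest's PMD entry at the PTE page used by the
		 * host mm, so guest and host share the same PTEs. New PTEs
		 * are then instantiated in the host mm's page tables.
		 */
		pmd_populate(vma->vm_mm, pmd, pmd_page(*host_pmd));
	}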
>>>
>>>
>>> --------------------------
>>> Evolution of this proposal
>>> --------------------------
>>>
>>> The original proposal -
>>> <https://lore.kernel.org/lkml/cover.1642526745.git.khalid.aziz@oracle.com/>
>>> - was for an mshare() system call that a donor process calls to create
>>> an empty mshare'd region. This shared region is pgdir aligned and a
>>> multiple of the pgdir size. Each mshare'd region creates a
>>> corresponding file under /sys/fs/mshare which can be read to get
>>> information on the region. Once an empty region has been created, any
>>> objects can be mapped into this region and the page tables for those
>>> objects will be shared. A snippet of the code that a donor process
>>> would run looks like this:
>>>
>>>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>>         if (addr == MAP_FAILED)
>>>                 perror("ERROR: mmap failed");
>>>
>>>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>                       GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>>>         if (err < 0) {
>>>                 perror("mshare() syscall failed");
>>>                 exit(1);
>>>         }
>>>
>>>         strncpy(addr, "Some random shared text",
>>>                 sizeof("Some random shared text"));
>>>
>>>
>>> A snippet of the code that a consumer process would execute looks like
>>> this:
>>>
>>>         fd = open("testregion", O_RDONLY);
>>>         if (fd < 0) {
>>>                 perror("open failed");
>>>                 exit(1);
>>>         }
>>>
>>>         if ((count = read(fd, &mshare_info, sizeof(mshare_info))) > 0)
>>>                 printf("INFO: %ld bytes shared at addr %lx\n",
>>>                        mshare_info[1], mshare_info[0]);
>>>         else
>>>                 perror("read failed");
>>>
>>>         close(fd);
>>>
>>>         addr = (char *)mshare_info[0];
>>>         err = syscall(MSHARE_SYSCALL, "testregion",
>>>                       (void *)mshare_info[0],
>>>                       mshare_info[1], O_RDWR, 0600);
>>>         if (err < 0) {
>>>                 perror("mshare() syscall failed");
>>>                 exit(1);
>>>         }
>>>
>>>         printf("Guest mmap at %p:\n", (void *)addr);
>>>         printf("%s\n", addr);
>>>         printf("\nDone\n");
>>>
>>>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>         if (err < 0) {
>>>                 perror("mshare_unlink() failed");
>>>                 exit(1);
>>>         }
>>>
>>>
>>> This proposal evolved into a completely file- and mmap-based API -
>>> <https://lore.kernel.org/lkml/cover.1656531090.git.khalid.aziz@oracle.com/>.
>>>
>>> The new API looks like this:
>>>
>>> 1. Mount msharefs on /sys/fs/mshare -
>>>         mount -t msharefs msharefs /sys/fs/mshare
>>>
>>> 2. mshare regions have alignment and size requirements. The start
>>>    address for the region must be aligned to an address boundary and
>>>    the size must be a multiple of a fixed unit. This alignment and
>>>    size requirement can be obtained by reading the file
>>>    /sys/fs/mshare/mshare_info, which returns a number in text format.
>>>    mshare regions must be aligned to this boundary and be a multiple
>>>    of this size (see the sketch after this list).
>>>
>>> 3. For the process creating an mshare region:
>>>    a. Create a file on /sys/fs/mshare, for example -
>>>         fd = open("/sys/fs/mshare/shareme",
>>>                   O_RDWR|O_CREAT|O_EXCL, 0600);
>>>
>>>    b. mmap this file to establish the starting address and size -
>>>         mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>>              MAP_SHARED, fd, 0);
>>>
>>>    c. Write and read to the mshare'd region normally.
>>>
>>> 4. For processes attaching to the mshare'd region:
>>>    a. Open the file on msharefs, for example -
>>>         fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>>
>>>    b. Get information about the mshare'd region from the file:
>>>         struct mshare_info {
>>>                 unsigned long start;
>>>                 unsigned long size;
>>>         } m_info;
>>>
>>>         read(fd, &m_info, sizeof(m_info));
>>>
>>>    c. mmap the mshare'd region -
>>>         mmap((void *)m_info.start, m_info.size,
>>>              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>>
>>> 5. To delete the mshare region -
>>>         unlink("/sys/fs/mshare/shareme");
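
As a small illustration of step 2 above (this is my own sketch, not
code from the patchset; BUF_SIZE is assumed to be defined elsewhere),
reading the alignment/size unit from mshare_info and rounding a region
size up to it might look like:

	/* Read the alignment/size unit (a number in text form) from
	 * msharefs and round BUF_SIZE up to a multiple of it. */
	int ifd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
	char buf[64];
	ssize_t n;
	unsigned long unit, aligned_size;

	if (ifd < 0) {
		perror("open mshare_info");
		exit(1);
	}
	n = read(ifd, buf, sizeof(buf) - 1);
	if (n <= 0) {
		perror("read mshare_info");
		exit(1);
	}
	buf[n] = '\0';
	unit = strtoul(buf, NULL, 0);
	/* Rounding below assumes the unit is a power of two. */
	aligned_size = (BUF_SIZE + unit - 1) & ~(unit - 1);
	close(ifd);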
>>>
>>>
>>>
>>> Further discussions over mailing lists and at LSF/MM resulted in
>>> eliminating msharefs and making this entirely mmap based -
>>> <https://lore.kernel.org/lkml/cover.1682453344.git.khalid.aziz@oracle.com/>.
>>>
>>> With this change, if two processes map the same file with the same
>>> size, at a PMD-aligned address, at the same virtual address, and both
>>> specify the MAP_SHARED_PT flag, they start sharing PTEs for the file
>>> mapping. These changes eliminate support for mapping arbitrary objects
>>> into an mshare'd region. The latest implementation requires sharing in
>>> chunks of at least PMD size across processes. These changes were
>>> significant enough to make this proposal distinct, so I use a new name
>>> for it - ptshare.
>>>
>>>
>>> ----------
>>> What next?
>>> ----------
>>>
>>> There were some more discussions on this proposal while I was on
>>> leave for a few months. There is enough interest in this feature to
>>> continue refining it. I will refine the code further, but before that
>>> I want to make sure we have a common understanding of what this
>>> feature should do.
>>>
>>> As a result of many discussions, a new version, distinct from the
>>> original proposal, has evolved. Which one do we agree to move forward
>>> with - (1) the current version, which restricts sharing to PMD-sized
>>> and PMD-aligned file mappings only, using just a new mmap flag
>>> (MAP_SHARED_PT), or (2) the original version, which creates an empty
>>> page-table-sharing mshare region using msharefs and mmap, into which
>>> arbitrary objects can be mapped later?
>> Hi, Khalid
>>
>> I am unfamiliar with the original version, but I can provide some
>> feedback on the issues we encountered while implementing the current
>> version (mmap & MAP_SHARED_PT). We implemented our internal pgtable
>> sharing version using the current method, but the code is a bit hacky
>> in some places, e.g. (1) page fault: we need to switch to the original
>> mm to flush the TLB or charge the memcg; (2) memory shrinking: it is a
>> bit complicated to handle PTE entries the way a normal PTE mapping is
>> handled; (3) munmap/madvise support.
>>
>> If these hacks can be resolved, the current method already seems
>> simple and usable enough (just my humble opinion).
> Thanks for taking the time to review. Yes, the code could use some
> improvement and I expect to do that as I get feedback. Can I ask you
> what you mean by "internal pgtable sharing version"? Are you using the
> patch I had sent out or a modified version of it on internal test
> machines?
Yes, a modified version of your mmap(MAP_SHARED_PT) patchset with the
functions mentioned in the previous mail, implemented on kernel 5.10.
If everyone thinks it would be helpful for this discussion, I can send
it out next.
>
> Thanks,
> Khalid
>
>>
>>
>> And besides the above issues, we (in our internal version) do not
>> handle memory migration, compaction, etc. I'm not sure which
>> functions pgtable sharing needs to support. Maybe we can have a
>> discussion about that first, and then decide which one? Here are the
>> things we support in pgtable sharing:
>>
>> a. share pgtables only between parent and child processes;
>> b. support anonymous shared memory and id-known (SYSV) shared memory;
>> c. madvise(MADV_DONTNEED, MADV_DONTDUMP, MADV_DODUMP); MADV_DONTNEED
>> supports 2M granularity (see the small example below);
>> d. reclaim pgtable-sharing memory in the shrinker;
>>
>> The above support is actually what our internal user requested. Plus,
>> we simply skip memory migration, compaction, mprotect, mremap, etc.
>> IMHO, supporting all memory behaviors the way a normal PTE mapping
>> does is unnecessary?
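
Regarding item (c) above, a minimal usage sketch (illustrative only;
it assumes "addr" is the start of a pgtable-shared mapping and that
the pgtable-sharing implementation enforces the 2M granularity):

	#include <sys/mman.h>
	#include <stdio.h>

	#define SZ_2M	(2UL << 20)

	/* Drop one 2M-aligned, 2M-sized chunk of the shared mapping.
	 * With 2M granularity, smaller or misaligned ranges would not
	 * be accepted by the pgtable-sharing implementation. */
	static int drop_chunk(void *addr, unsigned long offset)
	{
		if (madvise((char *)addr + offset, SZ_2M, MADV_DONTNEED)) {
			perror("madvise(MADV_DONTNEED)");
			return -1;
		}
		return 0;
	}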
>> (Next, it seems I need to study your original version :-))
>>
>> Thanks,
>> -wrw
>>>
>>> Thanks,
>>> Khalid
>>