[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8bd15f25-a468-e495-25f2-a8657b308bbe@oracle.com>
Date: Wed, 29 Jun 2022 11:48:17 -0600
From: Khalid Aziz <khalid.aziz@...cle.com>
To: David Hildenbrand <david@...hat.com>,
Barry Song <21cnbao@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
Arnd Bergmann <arnd@...db.de>,
Jonathan Corbet <corbet@....net>,
Dave Hansen <dave.hansen@...ux.intel.com>,
ebiederm@...ssion.com, hagen@...u.net, jack@...e.cz,
Kees Cook <keescook@...omium.org>, kirill@...temov.name,
kucharsk@...il.com, linkinjeon@...nel.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>, longpeng2@...wei.com,
Andy Lutomirski <luto@...nel.org>, markhemm@...glemail.com,
pcc@...gle.com, Mike Rapoport <rppt@...nel.org>,
sieberf@...zon.com, sjpark@...zon.de,
Suren Baghdasaryan <surenb@...gle.com>, tst@...oebel-theuer.de,
Iurii Zaikin <yzaikin@...gle.com>
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes
On 5/30/22 05:18, David Hildenbrand wrote:
> On 30.05.22 12:48, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@...cle.com> wrote:
>>>
>>> Page tables in kernel consume some of the memory and as long as number
>>> of mappings being maintained is small enough, this space consumed by
>>> page tables is not objectionable. When very few memory pages are
>>> shared between processes, the number of page table entries (PTEs) to
>>> maintain is mostly constrained by the number of pages of memory on the
>>> system. As the number of shared pages and the number of times pages
>>> are shared goes up, amount of memory consumed by page tables starts to
>>> become significant.
>>>
>>> Some of the field deployments commonly see memory pages shared across
>>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>>> bytes long which is very small compared to the 4K page size. When 2000
>>> processes map the same page in their address space, each one of them
>>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>>> just to hold the PTEs for one 4K page. On a database server with 300GB
>>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>>> clients tried to share this SGA even though the system had 512GB of
>>> memory. On this server, in the worst case scenario of all 1500
>>> processes mapping every page from SGA would have required 878GB+ for
>>> just the PTEs. If these PTEs could be shared, amount of memory saved
>>> is very significant.
>>>
>>> This patch series implements a mechanism in kernel to allow userspace
>>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>>> mshare(), which can be used by a process to create a region (we will
>>> call it mshare'd region) which can be used by other processes to map
>>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>>> detach from the mshare'd region. Once an mshare'd region is created,
>>> other process(es), assuming they have the right permissions, can make
>>> the mashare() system call to map the shared pages into their address
>>> space using the shared PTEs. When a process is done using this
>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>> access. When the last process accessing mshare'd region calls
>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>> it is freed.
>>>
>>>
>>> API
>>> ===
>>>
>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>
>>> --
>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>
>>> mshare() creates and opens a new, or opens an existing mshare'd
>>> region that will be shared at PTE level. "name" refers to shared object
>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>> of this shared memory area and length is the size of this area.
>>> oflags can be one of:
>>>
>>> - O_RDONLY opens shared memory area for read only access by everyone
>>> - O_RDWR opens shared memory area for read and write access
>>> - O_CREAT creates the named shared memory area if it does not exist
>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>> exists with that name, return an error.
>>>
>>> mode represents the creation mode for the shared object under
>>> /sys/fs/mshare.
>>>
>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> PTEs are shared at pgdir level and hence it imposes following
>>> requirements on the address and size given to the mshare():
>>>
>>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>> This alignment value can be looked up in /proc/sys/vm//mshare_size
>>> - Size must be a multiple of pgdir size
>>> - Any mappings created in this address range at any time become
>>> shared automatically
>>> - Shared address range can have unmapped addresses in it. Any access
>>> to unmapped address will result in SIGBUS
>>>
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in
>>> the shared area. All other processes that want mshare'd access to
>>> this memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>> this file returns two long values - (1) starting address, and (2)
>>> size of the mshare'd region.
>>>
>>> --
>>> int mshare_unlink(char *name)
>>>
>>> A shared address range created by mshare() can be destroyed using
>>> mshare_unlink() which removes the shared named object. Once all
>>> processes have unmapped the shared object, the shared address range
>>> references are de-allocated and destroyed.
>>>
>>> mshare_unlink() returns 0 on success or -1 on error.
>>>
>>>
>>> Example Code
>>> ============
>>>
>>> Snippet of the code that a donor process would run looks like below:
>>>
>>> -----------------
>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>> MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>> if (addr == MAP_FAILED)
>>> perror("ERROR: mmap failed");
>>>
>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>> GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>> if (err < 0) {
>>> perror("mshare() syscall failed");
>>> exit(1);
>>> }
>>>
>>> strncpy(addr, "Some random shared text",
>>> sizeof("Some random shared text"));
>>> -----------------
>>>
>>> Snippet of code that a consumer process would execute looks like:
>>>
>>> -----------------
>>> struct mshare_info minfo;
>>>
>>> fd = open("testregion", O_RDONLY);
>>> if (fd < 0) {
>>> perror("open failed");
>>> exit(1);
>>> }
>>>
>>> if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>>> printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>> minfo.size, minfo.start);
>>> else
>>> perror("read failed");
>>>
>>> close(fd);
>>>
>>> addr = (void *)minfo.start;
>>> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>> O_RDWR, 600);
>>> if (err < 0) {
>>> perror("mshare() syscall failed");
>>> exit(1);
>>> }
>>>
>>> printf("Guest mmap at %px:\n", addr);
>>> printf("%s\n", addr);
>>> printf("\nDone\n");
>>>
>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>> if (err < 0) {
>>> perror("mshare_unlink() failed");
>>> exit(1);
>>> }
>>> -----------------
>>
>>
>> Does that mean those shared pages will get page_mapcount()=1 ?
>
> AFAIU, for mshare() that is the case.
>
>>
>> A big pain for a memory limited system like a desktop/embedded system is
>> that reverse mapping will take tons of cpu in memory reclamation path
>> especially for those pages mapped by multiple processes. sometimes,
>> 100% cpu utilization on LRU to scan and find out if a page is accessed
>> by reading PTE young.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.
Right, that is the question - what does mapcount represent. I am interpreting it as mapcount represents how many ptes
map the page. Since mshare uses one pte for each shared page irrespective of how many processes share the page, a
mapcount of 1 sounds reasonable to me.
>
>>
>> if we result in one PTE only by this patchset, it means we are getting
>> significant
>> performance improvement in kernel LRU particularly when free memory
>> approaches the watermarks.
>>
>> But I don't see how a new system call like mshare(), can be used
>> by those systems as they might need some more automatic PTEs sharing
>> mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.
It is tempting to make this sharing automatic and mshare may evolve to that. Since mshare assumes significant trust
between the processes sharing pages (shared pages share attributes and protection keys possibly) , it sounds dangerous
to make that assumption automatically without processes explicitly declaring that level of trust.
Thanks,
Khalid
Powered by blists - more mailing lists