Message-ID: <8bd15f25-a468-e495-25f2-a8657b308bbe@oracle.com>
Date: Wed, 29 Jun 2022 11:48:17 -0600
From: Khalid Aziz <khalid.aziz@...cle.com>
To: David Hildenbrand <david@...hat.com>,
Barry Song <21cnbao@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Matthew Wilcox <willy@...radead.org>,
Aneesh Kumar <aneesh.kumar@...ux.ibm.com>,
Arnd Bergmann <arnd@...db.de>,
Jonathan Corbet <corbet@....net>,
Dave Hansen <dave.hansen@...ux.intel.com>,
ebiederm@...ssion.com, hagen@...u.net, jack@...e.cz,
Kees Cook <keescook@...omium.org>, kirill@...temov.name,
kucharsk@...il.com, linkinjeon@...nel.org,
linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>, longpeng2@...wei.com,
Andy Lutomirski <luto@...nel.org>, markhemm@...glemail.com,
pcc@...gle.com, Mike Rapoport <rppt@...nel.org>,
sieberf@...zon.com, sjpark@...zon.de,
Suren Baghdasaryan <surenb@...gle.com>, tst@...oebel-theuer.de,
Iurii Zaikin <yzaikin@...gle.com>
Subject: Re: [PATCH v1 00/14] Add support for shared PTEs across processes
On 5/30/22 05:18, David Hildenbrand wrote:
> On 30.05.22 12:48, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@...cle.com> wrote:
>>>
>>> Page tables in the kernel consume some of the memory, and as long as
>>> the number of mappings being maintained is small enough, the space
>>> consumed by page tables is not objectionable. When very few memory
>>> pages are shared between processes, the number of page table entries
>>> (PTEs) to maintain is mostly constrained by the number of pages of
>>> memory on the system. As the number of shared pages and the number
>>> of times pages are shared goes up, the amount of memory consumed by
>>> page tables starts to become significant.
>>>
>>> Some field deployments commonly see memory pages shared across
>>> 1000s of processes. On x86_64, each page requires a PTE that is only
>>> 8 bytes long, which is very small compared to the 4K page size. When
>>> 2000 processes map the same page in their address space, each one of
>>> them requires 8 bytes for its PTE, and together that adds up to 16K
>>> of memory just to hold the PTEs for one 4K page. On a database
>>> server with a 300GB SGA, a system crash from an out-of-memory
>>> condition was seen when 1500+ clients tried to share this SGA, even
>>> though the system had 512GB of memory. On this server, the worst
>>> case scenario of all 1500 processes mapping every page from the SGA
>>> would have required 878GB+ for just the PTEs. If these PTEs could be
>>> shared, the amount of memory saved would be very significant.
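>>>
>>> (As a back-of-the-envelope check of those numbers, assuming 4K
>>> pages and 8-byte PTEs: 300GB / 4K = ~78.6 million pages; 78.6M x 8
>>> bytes = ~600MB of PTEs per process; and 600MB x 1500 processes =
>>> ~879GB, consistent with the 878GB+ figure above.)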
>>>
>>> This patch series implements a mechanism in the kernel to allow
>>> userspace processes to opt into sharing PTEs. It adds two new system
>>> calls - (1) mshare(), which can be used by a process to create a
>>> region (we will call it an mshare'd region) which can be used by
>>> other processes to map the same pages using shared PTEs, and (2)
>>> mshare_unlink(), which is used to detach from the mshare'd region.
>>> Once an mshare'd region is created, other process(es), assuming they
>>> have the right permissions, can make the mshare() system call to map
>>> the shared pages into their address space using the shared PTEs.
>>> When a process is done using this mshare'd region, it makes an
>>> mshare_unlink() system call to end its access. When the last process
>>> accessing the mshare'd region calls mshare_unlink(), the mshare'd
>>> region is torn down and the memory used by it is freed.
>>>
>>>
>>> API
>>> ===
>>>
>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>
>>> --
>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>
>>> mshare() creates and opens a new, or opens an existing, mshare'd
>>> region that will be shared at the PTE level. "name" refers to the
>>> shared object name that exists under /sys/fs/mshare. "addr" is the
>>> starting address of this shared memory area and "length" is the
>>> size of this area. oflags can be one of:
>>>
>>> - O_RDONLY opens shared memory area for read only access by everyone
>>> - O_RDWR opens shared memory area for read and write access
>>> - O_CREAT creates the named shared memory area if it does not exist
>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>   exists with that name, return an error.
>>>
>>> mode represents the creation mode for the shared object under
>>> /sys/fs/mshare.
>>>
>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> PTEs are shared at the pgdir level and hence mshare() imposes the
>>> following requirements on the address and size given to it:
>>>
>>> - Starting address must be aligned to the pgdir size (512GB on
>>>   x86_64). This alignment value can be looked up in
>>>   /proc/sys/vm/mshare_size (see the sketch after this list)
>>> - Size must be a multiple of pgdir size
>>> - Any mappings created in this address range at any time become
>>>   shared automatically
>>> - Shared address range can have unmapped addresses in it. Any access
>>>   to an unmapped address will result in SIGBUS
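>>>
>>> As a rough sketch (not part of this series' example code, and
>>> assuming the sysctl file simply contains the alignment as a decimal
>>> number), a process could discover the required alignment like this:
>>>
>>> -----------------
>>> #include <stdio.h>
>>>
>>> /* Returns the pgdir alignment required by mshare(), or -1. */
>>> long mshare_alignment(void)
>>> {
>>> 	long align = -1;
>>> 	FILE *fp = fopen("/proc/sys/vm/mshare_size", "r");
>>>
>>> 	if (fp) {
>>> 		if (fscanf(fp, "%ld", &align) != 1)
>>> 			align = -1;
>>> 		fclose(fp);
>>> 	}
>>> 	return align;
>>> }
>>> -----------------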
>>>
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in
>>> the shared area. All other processes that want mshare'd access to
>>> this memory area can do so by calling mshare(). After this call, the
>>> address range given to mshare() becomes a shared range in their
>>> address space. Anonymous mappings will be shared and not COWed.
>>>
>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>> this file returns two long values - (1) starting address, and (2)
>>> size of the mshare'd region.
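>>>
>>> (The consumer snippet further below assumes a structure along these
>>> lines for those two values; the field names match the example, but
>>> the exact definition would come from the kernel headers:)
>>>
>>> -----------------
>>> struct mshare_info {
>>> 	unsigned long start;	/* starting address of mshare'd region */
>>> 	unsigned long size;	/* size of mshare'd region */
>>> };
>>> -----------------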
>>>
>>> --
>>> int mshare_unlink(char *name)
>>>
>>> A shared address range created by mshare() can be destroyed using
>>> mshare_unlink() which removes the shared named object. Once all
>>> processes have unmapped the shared object, the shared address range
>>> references are de-allocated and destroyed.
>>>
>>> mshare_unlink() returns 0 on success or -1 on error.
>>>
>>>
>>> Example Code
>>> ============
>>>
>>> A snippet of the code that a donor process would run looks like this:
>>>
>>> -----------------
>>> /* Back the 512GB region at address 2TB with anonymous memory. */
>>> addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
>>> if (addr == MAP_FAILED) {
>>>         perror("ERROR: mmap failed");
>>>         exit(1);
>>> }
>>>
>>> /* Create the mshare'd region (note the octal 0600 mode). */
>>> err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>               GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>>> if (err < 0) {
>>>         perror("mshare() syscall failed");
>>>         exit(1);
>>> }
>>>
>>> strncpy(addr, "Some random shared text",
>>>         sizeof("Some random shared text"));
>>> -----------------
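>>>
>>> (Both snippets assume helper macros and syscall numbers defined
>>> elsewhere in the test program. The syscall numbers below are
>>> placeholders for illustration; the real values come from the patched
>>> kernel's headers:)
>>>
>>> -----------------
>>> #define GB(x)	((unsigned long)(x) << 30)
>>> #define TB(x)	((unsigned long)(x) << 40)
>>>
>>> /* Placeholder syscall numbers, for illustration only. */
>>> #define MSHARE_SYSCALL		450
>>> #define MSHARE_UNLINK_SYSCALL	451
>>> -----------------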
>>>
>>> A snippet of the code that a consumer process would execute looks
>>> like this:
>>>
>>> -----------------
>>> struct mshare_info minfo;
>>>
>>> /* Read the region's start address and size from its mshare file. */
>>> fd = open("/sys/fs/mshare/testregion", O_RDONLY);
>>> if (fd < 0) {
>>>         perror("open failed");
>>>         exit(1);
>>> }
>>>
>>> if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
>>>         printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>>                 minfo.size, minfo.start);
>>> else
>>>         perror("read failed");
>>>
>>> close(fd);
>>>
>>> /* Attach to the mshare'd region at the address it was created at. */
>>> addr = (void *)minfo.start;
>>> err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>>               O_RDWR, 0600);
>>> if (err < 0) {
>>>         perror("mshare() syscall failed");
>>>         exit(1);
>>> }
>>>
>>> printf("Guest mmap at %p:\n", addr);
>>> printf("%s\n", addr);
>>> printf("\nDone\n");
>>>
>>> /* Detach from the region when done. */
>>> err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>> if (err < 0) {
>>>         perror("mshare_unlink() failed");
>>>         exit(1);
>>> }
>>> -----------------
>>
>>
>> Does that mean those shared pages will get page_mapcount()=1?
>
> AFAIU, for mshare() that is the case.
>
>>
>> A big pain for a memory-limited system like a desktop/embedded
>> system is that reverse mapping takes tons of CPU in the memory
>> reclamation path, especially for pages mapped by multiple processes.
>> Sometimes we see 100% CPU utilization in LRU scanning to find out
>> whether a page has been accessed, by reading the PTE young bit.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page-table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing: when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly, so unmapping from a single
> (shared) page table could directly result in the mapcount dropping
> to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.
Right, that is the question - what does mapcount represent? I am
interpreting it as the number of PTEs that map the page. Since mshare
uses one PTE for each shared page irrespective of how many processes
share the page, a mapcount of 1 sounds reasonable to me.
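
To make that concrete with a hypothetical example: if 2000 processes
attach to an mshare'd region, each page in it is mapped by exactly one
shared PTE, so page_mapcount() returns 1 no matter how many sharers
there are. Tearing down the shared page table clears that single PTE
and takes the mapcount from 1 to 0 in one step, instead of the rmap
walk having to visit 2000 separate PTEs.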
>
>>
>> If this patchset results in only one PTE per shared page, it means
>> a significant performance improvement in the kernel LRU path,
>> particularly when free memory approaches the watermarks.
>>
>> But I don't see how a new system call like mshare() can be used by
>> those systems, as they might need a more automatic PTE-sharing
>> mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.
It is tempting to make this sharing automatic, and mshare may evolve in
that direction. However, since mshare assumes significant trust between
the processes sharing pages (shared pages possibly share attributes and
protection keys), it seems dangerous to make that assumption
automatically without the processes explicitly declaring that level of
trust.
Thanks,
Khalid