linux-kernel - Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YFCH8vzFGmfFRCvV@cmpxchg.org>
Date:   Tue, 16 Mar 2021 06:26:58 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Arjun Roy <arjunroy.kdev@...il.com>
Cc:     akpm@...ux-foundation.org, davem@...emloft.net,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        cgroups@...r.kernel.org, linux-mm@...ck.org, arjunroy@...gle.com,
        shakeelb@...gle.com, edumazet@...gle.com, soheil@...gle.com,
        kuba@...nel.org, mhocko@...nel.org, shy828301@...il.com,
        guro@...com
Subject: Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

Hello,

On Mon, Mar 15, 2021 at 09:16:45PM -0700, Arjun Roy wrote:
> From: Arjun Roy <arjunroy@...gle.com>
> 
> TCP zerocopy receive is used by high performance network applications
> to further scale. For RX zerocopy, the memory containing the network
> data filled by the network driver is directly mapped into the address
> space of high performance applications. To keep the TLB cost low,
> these applications unmap the network memory in big batches. So, this
> memory can remain mapped for long time. This can cause a memory
> isolation issue as this memory becomes unaccounted after getting
> mapped into the application address space. This patch adds the memcg
> accounting for such memory.
> 
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages and thus we can not depend on the
> page reference for the uncharging. The page in the pool may keep a
> memcg pinned for arbitrary long time or may get used by other memcg.

The page pool knows when a page is unmapped again and becomes
available for recycling, right? Essentially the 'free' phase of that
private allocator. That's where the uncharge should be done.

For one, it's more aligned with the usual memcg charge lifetime rules.

But also it doesn't add what is essentially a private driver callback
to the generic file unmapping path.

Finally, this will eliminate the need for making up a new charge type
(MEMCG_DATA_SOCK) and allow using the standard kmem charging API.

> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question is, what if the driver
> drops its reference while the page is still mapped? That is fine as
> the address space also holds a reference to the page i.e. the
> reference count can not drop to zero before the map count.
> 
> Signed-off-by: Arjun Roy <arjunroy@...gle.com>
> Co-developed-by: Shakeel Butt <shakeelb@...gle.com>
> Signed-off-by: Shakeel Butt <shakeelb@...gle.com>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@...gle.com>
> ---
> 
> Changelog since v1:
> - Pages accounted for in this manner are now tracked via MEMCG_SOCK.
> - v1 allowed for a brief period of double-charging, now we have a
>   brief period of under-charging to avoid undue memory pressure.

I'm afraid we'll have to go back to v1.

Let's address the issues raised with it:

1. The NR_FILE_MAPPED accounting. It is longstanding Linux behavior
   that driver pages mapped into userspace are accounted as file
   pages, because userspace is actually doing mmap() against a driver
   file/fd (as opposed to an anon mmap). That is how they show up in
   vmstat, in meminfo, and in the per process stats. There is no
   reason to make memcg deviate from this. If we don't like it, it
   should be taken on by changing vm_insert_page() - not trick rmap
   into thinking these arent memcg pages and then fixing it up with
   additional special-cased accounting callbacks.

   v1 did this right, it charged the pages the way we handle all other
   userspace pages: before rmap, and then let the generic VM code do
   the accounting for us with the cgroup-aware vmstat infrastructure.

2. The double charging. Could you elaborate how much we're talking
   about in any given batch? Is this a problem worth worrying about?

   The way I see it, any conflict here is caused by the pages being
   counted in the SOCK counter already, but not actually *tracked* on
   a per page basis. If it's worth addressing, we should look into
   fixing the root cause over there first if possible, before trying
   to work around it here.

   The newly-added GFP_NOFAIL is especially worrisome. The pages
   should be charged before we make promises to userspace, not be
   force-charged when it's too late.

   We have sk context when charging the inserted pages. Can we
   uncharge MEMCG_SOCK after each batch of inserts? That's only 32
   pages worth of overcharging, so not more than the regular charge
   batch memcg is using.

   An even better way would be to do charge stealing where we reuse
   the existing MEMCG_SOCK charges and don't have to get any new ones
   at all - just set up page->memcg and remove the charge from the sk.

   But yeah, it depends a bit if this is a practical concern.

Thanks,
Johannes