Message-ID: <20260203-bestanden-ballhaus-941e4c365701@brauner>
Date: Tue, 3 Feb 2026 15:58:52 +0100
From: Christian Brauner <brauner@...nel.org>
To: Kiryl Shutsemau <kas@...nel.org>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	Hugh Dickins <hughd@...gle.com>, Baolin Wang <baolin.wang@...ux.alibaba.com>, linux-mm@...ck.org, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: Orphan filesystems after mount namespace destruction and tmpfs
 "leak"

On Mon, Feb 02, 2026 at 05:50:30PM +0000, Kiryl Shutsemau wrote:
> Hi,
> 
> In the Meta fleet, we saw a problem where destroying a container didn't
> lead to freeing the shmem memory attributed to a tmpfs mounted inside
> that container. It triggered an OOM when a new container attempted to
> start.
> 
> Investigation has shown that this happened because a process outside of
> the container kept a file from the tmpfs mapped. The mapped file is
> small (4k), but it holds all the contents of the tmpfs (~47GiB) from
> being freed.
> 
> When a tmpfs filesystem is mounted inside a mount namespace (e.g., a
> container), and a process outside that namespace holds an open file
> descriptor to a file on that tmpfs, the tmpfs superblock remains in
> kernel memory indefinitely after:
> 
> 1. All processes inside the mount namespace have exited.
> 2. The mount namespace has been destroyed.
> 3. The tmpfs is no longer visible in any mount namespace.
> 
> The superblock persists with mnt_ns = NULL in its mount structures,
> keeping all tmpfs contents pinned in memory until the external file
> descriptor is closed.
> 
> The problem is not specific to tmpfs, but for filesystems with backing
> storage, the memory impact is not as severe since the page cache is
> reclaimable.
> 
> The obvious solution to the problem is "Don't do that": the file should
> be unmapped/closed upon container destruction.
> 
> But I wonder if the kernel can/should do better here? Currently, this
> scenario is hard to diagnose. It looks like a leak of shmem pages.
> 
> Also, I wonder if the current behavior can lead to data loss on a
> filesystem with backing storage:
>  - The mount namespace where my USB stick was mounted is gone.
>  - The USB stick is no longer mounted anywhere.
>  - I can pull the USB stick out.
>  - Oops, someone was writing there: corruption/data loss.
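The pinning described in the report has a simple, namespace-free analogue using plain POSIX semantics (a sketch, not the actual container scenario): an open file descriptor keeps an unlinked file's contents alive and reachable even though no name for it remains visible, just as the external fd keeps the whole orphaned superblock alive.

```python
import os, tempfile

# An open fd pins an unlinked file's data, analogous to how the
# external fd in the report pins the invisible tmpfs superblock.
fd, path = tempfile.mkstemp()
os.write(fd, b"pinned by the fd")
os.unlink(path)                  # no name left on the filesystem
assert not os.path.exists(path)  # invisible, like the orphaned mount
os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)           # ...but the contents are still there
assert data == b"pinned by the fd"
os.close(fd)                     # only now can the blocks be freed
```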

If the USB stick is yanked and the filesystem uses fs_holder_ops, it will
be notified about the sudden device removal and can decide to handle it
as it sees fit. That works for all devices, including log devices or rt
devices or what have you. Usually it will shut the filesystem down and
return EIO to userspace. I switched all major filesystems to this model
a few kernel releases ago.
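The shutdown-on-removal model can be illustrated with a small userspace mimic of the holder-ops callback pattern (everything below is an illustrative analogue, not the kernel's fs_holder_ops interface): the block layer invokes a "mark dead" callback on the registered holder when the device disappears, and the filesystem then fails all further I/O with EIO.

```python
import errno

class FsInstance:
    """Hypothetical userspace analogue of a filesystem instance that
    the block layer notifies when its backing device is removed."""
    def __init__(self, name):
        self.name = name
        self.shut_down = False

    def mark_dead(self):
        # Analogue of the ->mark_dead() holder callback: the device
        # was yanked underneath us; shut the filesystem down.
        self.shut_down = True

    def read(self, path):
        if self.shut_down:
            # After shutdown, all I/O fails with EIO.
            raise OSError(errno.EIO, "filesystem shut down")
        return b"data"

fs = FsInstance("demo-fs")
fs.mark_dead()            # device removal event
try:
    fs.read("/some/file")
except OSError as e:
    assert e.errno == errno.EIO
```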

> I am not sure what a possible solution would be here. I can only think

None from the kernel's perspective. These are intended semantics that
userspace relies upon (for example, if you have an fdstore then you very
much want that persistence).

We could zap all files and make the fds catatonic. But that's just insane
from the start. It would be a drastic deviation from basic semantics
we've had since forever. It would also be extremely difficult to get
right and make performant, because you'd need a way to find all of them.
Keeping a global map of all files open on a filesystem instance is just
insane, and so is walking all processes and inspecting their fdtables.
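While the kernel should not do that walk, userspace can afford it as a one-off diagnostic for the "leaked tmpfs" scenario: scanning /proc/*/fd finds which processes still hold descriptors on a given path. A minimal sketch (assuming Linux procfs; fds of other users' processes are simply skipped):

```python
import os

def find_holders(match):
    """Report (pid, fd, target) for every readable fd whose link
    target contains `match` -- the fdtable walk done in userspace."""
    holders = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or not ours to inspect
        for fd in fds:
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if match in target:
                holders.append((int(pid), int(fd), target))
    return holders

# Sanity check: find our own open fd.
f = open("/tmp/holder-demo", "w")
assert any(pid == os.getpid() for pid, _, _ in find_holders("holder-demo"))
f.close()
os.remove("/tmp/holder-demo")
```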

I don't believe we need to do anything here, unless you want some
tmpfs-specific black magic where you can issue a shutdown ioctl on tmpfs
that magically frees memory. And I'd still expect that to fsck over
userspace that doesn't expect this behavior.

> of blocking exit(2) for the last process in the namespace until all
> filesystems are cleanly unmounted, but that is not very informative
> either.

If you just want the ability to wait for a filesystem to be completely
gone, that's easier to achieve: inotify has support for this via
IN_UNMOUNT. But, as you say, I doubt that's actually helpful.
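A minimal sketch of setting up such a watch via ctypes (assumptions: Linux with a dlopen-able libc; the flag values are taken from <sys/inotify.h>). Per inotify(7), IN_UNMOUNT is reported by the kernel when the watched object's filesystem goes away, even if the watch mask did not request it explicitly, so watching for any ordinary event suffices.

```python
import ctypes, ctypes.util, os

IN_ATTRIB = 0x00000004    # ordinary event to establish the watch
IN_UNMOUNT = 0x00002000   # delivered by the kernel on unmount

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
fd = libc.inotify_init()
assert fd >= 0
# Watch the mount point; an unmount delivers an event with IN_UNMOUNT set.
wd = libc.inotify_add_watch(fd, b"/tmp", IN_ATTRIB)
assert wd >= 0
# A real consumer would block in read(2) on fd, unpack each
# struct inotify_event, and check: if mask & IN_UNMOUNT, the
# filesystem is gone.
os.close(fd)
```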
