Message-ID: <CAGudoHHuBBX_FWKp96TZV7vs2xvxkFNkukt4wysx7K3OZDsLDw@mail.gmail.com>
Date: Mon, 30 Jun 2025 13:35:08 +0200
From: Mateusz Guzik <mjguzik@...il.com>
To: Sasha Levin <sashal@...nel.org>
Cc: viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz, 
	akpm@...ux-foundation.org, dada1@...mosbay.com, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] fs: Prevent file descriptor table allocations exceeding INT_MAX

On Mon, Jun 30, 2025 at 5:13 AM Sasha Levin <sashal@...nel.org> wrote:
>
> On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
> >On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
> >> When sysctl_nr_open is set to a very high value (for example, 1073741816
> >> as set by systemd), processes attempting to use file descriptors near
> >> the limit can trigger massive memory allocation attempts that exceed
> >> INT_MAX, resulting in a WARNING in mm/slub.c:
> >>
> >>   WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
> >>
> >> This happens because kvmalloc_array() and kvmalloc() check if the
> >> requested size exceeds INT_MAX and emit a warning when the allocation is
> >> not flagged with __GFP_NOWARN.
> >>
> >> Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
> >> process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
> >> - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
> >> - Multiple bitmaps: ~400MB
> >> - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
> >>
> >> Reproducer:
> >> 1. Set /proc/sys/fs/nr_open to 1073741816:
> >>    # echo 1073741816 > /proc/sys/fs/nr_open
> >>
> >> 2. Run a program that uses a high file descriptor:
> >>    #include <unistd.h>
> >>    #include <sys/resource.h>
> >>
> >>    int main() {
> >>        struct rlimit rlim = {1073741824, 1073741824};
> >>        setrlimit(RLIMIT_NOFILE, &rlim);
> >>        dup2(2, 1073741880);  // Triggers the warning
> >>        return 0;
> >>    }
> >>
> >> 3. Observe WARNING in dmesg at mm/slub.c:5027
> >>
> >> systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
> >> maximum possible value. The rationale was that systems with memory
> >> control groups (memcg) no longer need separate file descriptor limits
> >> since memory is properly accounted. However, this change overlooked
> >> that:
> >>
> >> 1. The kernel's allocation functions still enforce INT_MAX as a maximum
> >>    size regardless of memcg accounting
> >> 2. Programs and tests that legitimately test file descriptor limits can
> >>    inadvertently trigger massive allocations
> >> 3. The resulting allocations (>8GB) are impractical and will always fail
> >>
> >
> >alloc_fdtable() seems like the wrong place to do it.
> >
> >If there is an explicit de facto limit, the machinery which alters
> >fs.nr_open should validate against it.
> >
> >I understand this might result in systemd setting a new value which is
> >significantly lower than what it uses now, which is technically a change
> >in behavior, but I don't think it's a big deal.
> >
> >I'm assuming the kernel can't just set the value to something very high
> >by default.
> >
> >But in that case perhaps it could expose the max settable value? Then
> >systemd would not have to guess.
>
> The patch is in alloc_fdtable() because it's addressing a memory
> allocator limitation, not a fundamental file descriptor limitation.
>
> The INT_MAX restriction comes from kvmalloc(), not from any inherent
> constraint on how many FDs a process can have. If we implemented sparse
> FD tables or if kvmalloc() later supports larger allocations, the same
> nr_open value could become usable without any changes to FD handling
> code.
>
> Putting the check at the sysctl layer would codify a temporary
> implementation detail of the memory allocator as if it were a
> fundamental FD limit. By keeping it at the allocation point, the check
> reflects what it actually is - a current limitation of how large a
> contiguous allocation we can make.
>
> This placement also means the limit naturally adjusts if the underlying
> implementation changes, rather than requiring coordinated updates
> between the sysctl validation and the allocator capabilities.
>
> I don't have a strong opinion either way...
>

Allowing privileged userspace to set a limit which the kernel knows it
cannot reach sounds like a bug to me.

Indeed, the limitation is an artifact of the current implementation; I
don't understand the logic behind pretending it's not there.

Regardless, not my call :)
-- 
Mateusz Guzik <mjguzik gmail.com>
