Message-ID: <aGIA18cgkzv-05A2@lappy>
Date: Sun, 29 Jun 2025 23:13:27 -0400
From: Sasha Levin <sashal@...nel.org>
To: Mateusz Guzik <mjguzik@...il.com>
Cc: viro@...iv.linux.org.uk, brauner@...nel.org, jack@...e.cz,
	akpm@...ux-foundation.org, dada1@...mosbay.com,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH] fs: Prevent file descriptor table allocations exceeding
 INT_MAX

On Sun, Jun 29, 2025 at 09:58:12PM +0200, Mateusz Guzik wrote:
>On Sun, Jun 29, 2025 at 03:40:21AM -0400, Sasha Levin wrote:
>> When sysctl_nr_open is set to a very high value (for example, 1073741816
>> as set by systemd), processes attempting to use file descriptors near
>> the limit can trigger massive memory allocation attempts that exceed
>> INT_MAX, resulting in a WARNING in mm/slub.c:
>>
>>   WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288
>>
>> This happens because kvmalloc_array() and kvmalloc() check if the
>> requested size exceeds INT_MAX and emit a warning when the allocation is
>> not flagged with __GFP_NOWARN.
>>
>> Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a
>> process calls dup2(oldfd, 1073741880), the kernel attempts to allocate:
>> - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes
>> - Multiple bitmaps: ~400MB
>> - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647)
>>
>> Reproducer:
>> 1. Set /proc/sys/fs/nr_open to 1073741816:
>>    # echo 1073741816 > /proc/sys/fs/nr_open
>>
>> 2. Run a program that uses a high file descriptor:
>>    #include <unistd.h>
>>    #include <sys/resource.h>
>>
>>    int main() {
>>        struct rlimit rlim = {1073741824, 1073741824};
>>        setrlimit(RLIMIT_NOFILE, &rlim);
>>        dup2(2, 1073741880);  // Triggers the warning
>>        return 0;
>>    }
>>
>> 3. Observe WARNING in dmesg at mm/slub.c:5027
>>
>> systemd commit a8b627a introduced automatic bumping of fs.nr_open to the
>> maximum possible value. The rationale was that systems with memory
>> control groups (memcg) no longer need separate file descriptor limits
>> since memory is properly accounted. However, this change overlooked
>> that:
>>
>> 1. The kernel's allocation functions still enforce INT_MAX as a maximum
>>    size regardless of memcg accounting
>> 2. Programs and tests that legitimately test file descriptor limits can
>>    inadvertently trigger massive allocations
>> 3. The resulting allocations (>8GB) are impractical and will always fail
>>
>
>alloc_fdtable() seems like the wrong place to do it.
>
>If there is an explicit de facto limit, the machinery which alters
>fs.nr_open should validate against it.
>
>I understand this might result in systemd setting a new value which is
>significantly lower than what it uses now, which technically is a
>change in behavior, but I don't think it's a big deal.
>
>I'm assuming the kernel can't just set the value to something very high
>by default.
>
>But in that case perhaps it could expose the max settable value? Then
>systemd would not have to guess.

The patch is in alloc_fdtable() because it's addressing a memory
allocator limitation, not a fundamental file descriptor limitation.

The INT_MAX restriction comes from kvmalloc(), not from any inherent
constraint on how many FDs a process can have. If we implemented sparse
FD tables or if kvmalloc() later supports larger allocations, the same
nr_open value could become usable without any changes to FD handling
code.
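
For reference, the allocator-side check that fires here is roughly the
following, paraphrased from the kvmalloc path in mm/slub.c (exact
placement and surrounding code vary by kernel version):

	/* Don't even allow crazy sizes */
	if (unlikely(size > INT_MAX)) {
		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
		return NULL;
	}

So any single kvmalloc()-backed fdtable component larger than INT_MAX
bytes fails (and warns unless __GFP_NOWARN is set), no matter how much
memory the system or memcg actually has available.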

Putting the check at the sysctl layer would codify a temporary
implementation detail of the memory allocator as if it were a
fundamental FD limit. By keeping it at the allocation point, the check
reflects what it actually is - a current limitation of how large a
contiguous allocation we can make.
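
Concretely, that means validating right where the sizes are computed.
A minimal sketch of the shape of the check (not the exact patch; error
handling simplified):

	/*
	 * Sketch: refuse fd counts whose backing array allocation would
	 * exceed what kvmalloc() can ever satisfy (INT_MAX bytes). The
	 * bitmaps are proportionally much smaller, so the fd array is
	 * the binding constraint.
	 */
	if (unlikely(nr > INT_MAX / sizeof(struct file *)))
		return NULL;	/* callers treat this as an OOM failure */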

This placement also means the limit naturally adjusts if the underlying
implementation changes, rather than requiring coordinated updates
between the sysctl validation and the allocator capabilities.
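
For comparison, doing it at the sysctl layer would mean baking the same
derived constant into the fs.nr_open bounds, e.g. something like this
(hypothetical; the name is invented for illustration):

	/* Hypothetical: cap fs.nr_open so the fd array allocation fits */
	#define NR_OPEN_ALLOC_MAX	(INT_MAX / sizeof(struct file *))

which silently encodes knowledge of kvmalloc()'s INT_MAX ceiling far
away from the allocator itself.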

I don't have a strong opinion either way...

-- 
Thanks,
Sasha
