linux-kernel - Re: [PATCH] io_uring: make overflowing cqe subject to OOM


Message-ID: <d2e534de-f478-4dc6-8f17-a080275c2c5f@kernel.dk>
Date: Tue, 30 Dec 2025 09:01:46 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Alexandre Negrel <alexandre@...rel.dev>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] io_uring: make overflowing cqe subject to OOM

On 12/30/25 7:50 AM, Alexandre Negrel wrote:
>> I'm assuming the issue here is that memcg will look at __GFP_HIGH
>> somehow and allow it to proceed?
> 
> Exactly, the allocation succeeds even though it exceeds the cgroup limits.
> After digging through try_charge_memcg(), it seems that the OOM killer
> isn't involved unless the __GFP_DIRECT_RECLAIM bit is set (see
> gfpflags_allow_blocking).
> 
> https://github.com/torvalds/linux/blob/8640b74557fc8b4c300030f6ccb8cd078f665ec8/mm/memcontrol.c#L2329
> https://github.com/torvalds/linux/blob/8640b74557fc8b4c300030f6ccb8cd078f665ec8/include/linux/gfp.h#L38
> 
>> In any case, the below should then do the same. Can you test?
> 
> I tried it and it seems to fix the issue, but in a different way.
> try_charge_memcg now returns -ENOMEM and the allocation fails. The
> completion queue entry is "dropped on the floor" in
> io_cqring_add_overflow.
>
> So I see 3 options here:
> * use GFP_NOWAIT if dropping the CQE is OK

We're utterly out of memory at that point, so something has to give. We
can't invent memory out of thin air. Hence dropping the event, and
logging it as such, is imho the way to go. Same thing would've happened
with GFP_ATOMIC, just a bit earlier in the process.
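
To make the behaviour discussed above concrete, here is a minimal userspace
sketch of it. It is a toy model, not kernel code: the MODEL_* bit values, the
struct, and the model_* helpers are all made up for illustration, and the
charge rule only encodes what this thread describes (over the limit, a
__GFP_HIGH-style request is let through, while a GFP_NOWAIT-style request gets
-ENOMEM and the CQE is dropped and accounted for).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values for this model only; the real ones are in gfp headers. */
#define MODEL_GFP_HIGH           0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u

/* Neither mask allows blocking; only the ATOMIC-style one carries the HIGH bit. */
#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_NOWAIT (MODEL_GFP_KSWAPD_RECLAIM)

/* Hypothetical stand-in for the bits of ring state that matter here. */
struct model_ring {
	unsigned long cq_overflow; /* number of completions dropped */
	bool cq_dropped;           /* set once any completion was dropped */
	size_t queued;             /* overflow entries successfully queued */
};

/*
 * Toy charge step: under the limit everything succeeds. Over the limit,
 * only requests carrying the HIGH bit are allowed to proceed.
 */
static int model_try_charge(unsigned int gfp, bool over_limit)
{
	if (!over_limit)
		return 0;
	return (gfp & MODEL_GFP_HIGH) ? 0 : -ENOMEM;
}

/*
 * Toy overflow path: if the overflow entry cannot be allocated, drop the
 * completion and record that it happened. Returns true if it was queued.
 */
static bool model_add_overflow(struct model_ring *ring, unsigned int gfp,
			       bool over_limit)
{
	assert(ring != NULL);

	if (model_try_charge(gfp, over_limit) != 0) {
		ring->cq_overflow++;
		ring->cq_dropped = true;
		return false;
	}
	ring->queued++;
	return true;
}
```

Calling model_add_overflow() with MODEL_GFP_ATOMIC while over the limit queues
the entry, which is the original complaint. With MODEL_GFP_NOWAIT the same call
bumps cq_overflow and sets cq_dropped instead.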

It's worth noting that these are extreme circumstances - the kernel is
completely out of memory, and this will cause various spurious failures
to complete syscalls or other events. Additionally, this is the
non-DEFER_TASKRUN case, and DEFER_TASKRUN is what people should be using
anyway.

> * allocate using GFP_KERNEL_ACCOUNT without holding the lock, then add the
>   overflowing entries while holding the completion_lock (iterating twice over
>   compl_reqs)

The only viable way to do that would be to allocate it upfront, which is a
huge waste of time for the normal case where the CQ ring isn't
overflowing. We should not optimize for the slow/broken case, where
userspace overflows the ring.

> * charge memory after releasing the lock. I don't know if this is possible but
>   doing kfree(kmalloc(1, GFP_KERNEL_ACCOUNT)) after releasing the lock does the
>   job (even though it's dirty).

And that's definitely a no-go as well.

-- 
Jens Axboe