Message-ID: <d2e534de-f478-4dc6-8f17-a080275c2c5f@kernel.dk>
Date: Tue, 30 Dec 2025 09:01:46 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Alexandre Negrel <alexandre@...rel.dev>
Cc: io-uring@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] io_uring: make overflowing cqe subject to OOM

On 12/30/25 7:50 AM, Alexandre Negrel wrote:
>> I'm assuming the issue here is that memcg will look at __GFP_HIGH
>> somehow and allow it to proceed?
> 
> Exactly, the allocation succeeds even though it exceeds cgroup limits.
> After digging through try_charge_memcg(), it seems that the OOM killer
> isn't involved unless the __GFP_DIRECT_RECLAIM bit is set (see
> gfpflags_allow_blocking).
> 
> https://github.com/torvalds/linux/blob/8640b74557fc8b4c300030f6ccb8cd078f665ec8/mm/memcontrol.c#L2329
> https://github.com/torvalds/linux/blob/8640b74557fc8b4c300030f6ccb8cd078f665ec8/include/linux/gfp.h#L38
> 
>> In any case, the below should then do the same. Can you test?
> 
> I tried it and it seems to fix the issue, but in a different way.
> try_charge_memcg now returns -ENOMEM and the allocation fails. The
> completion queue entry is "dropped on the floor" in
> io_cqring_add_overflow.
>
> So I see 3 options here:
> * use GFP_NOWAIT if dropping CQE is ok

We're utterly out of memory at that point, so something has to give. We
can't invent memory out of thin air. Hence dropping the event, and
logging it as such, is imho the way to go. Same thing would've happened
with GFP_ATOMIC, just a bit earlier in the process.
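
For illustration only, here is a minimal userspace sketch of the behaviour
described in this thread: a charge that is allowed to exceed the cgroup limit
when the __GFP_HIGH bit is set, and that fails (so the CQE is dropped and
counted) when it is not. Everything prefixed with model_ or MODEL_ is
hypothetical, the bit values and entry size are made up, and reclaim is not
modelled at all. It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up bit values for the model; the real flags are defined by the kernel. */
#define MODEL_GFP_HIGH           0x1u
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u
#define MODEL_GFP_DIRECT_RECLAIM 0x4u

#define MODEL_GFP_ATOMIC (MODEL_GFP_HIGH | MODEL_GFP_KSWAPD_RECLAIM)
#define MODEL_GFP_NOWAIT (MODEL_GFP_KSWAPD_RECLAIM)

/* Assumed size of one overflow entry, chosen only for the example. */
#define MODEL_OCQE_SIZE 32u

struct model_memcg {
    size_t usage;
    size_t limit;
};

struct model_ring {
    struct model_memcg cg;
    unsigned int queued;   /* overflow entries successfully allocated */
    unsigned int dropped;  /* events dropped because the charge failed */
};

/* Same question gfpflags_allow_blocking() asks: may this allocation sleep? */
static bool model_allow_blocking(unsigned int gfp)
{
    return (gfp & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/* Simplified charge: no reclaim and no OOM kill, since neither flag set blocks. */
static bool model_charge(struct model_memcg *cg, unsigned int gfp, size_t size)
{
    if (cg->usage + size <= cg->limit) {
        cg->usage += size;
        return true;
    }
    if (gfp & MODEL_GFP_HIGH) {
        /* Over-limit charge is forced through, as reported for GFP_ATOMIC. */
        cg->usage += size;
        return true;
    }
    return false; /* the -ENOMEM case reported for GFP_NOWAIT */
}

/* Queue one overflow entry, or drop the event and account for the drop. */
static bool model_add_overflow(struct model_ring *r, unsigned int gfp)
{
    assert(r != NULL);
    if (!model_charge(&r->cg, gfp, MODEL_OCQE_SIZE)) {
        r->dropped++;
        return false;
    }
    r->queued++;
    return true;
}
```

In this model, a non-blocking allocation without the high-priority bit can
only fail once the limit is reached, which is the point at which the event is
dropped and logged.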

It's worth noting that these are extreme circumstances - the kernel is
completely out of memory, and this will cause various spurious failures
to complete syscalls or other events. Additionally, this is the
non-DEFER_TASKRUN case, and DEFER_TASKRUN is what people should be using
anyway.

> * allocate using GFP_KERNEL_ACCOUNT without holding the lock then adding
>   overflowing entries while holding the completion_lock (iterating twice over
>   compl_reqs)

Only viable way to do that would be to allocate it upfront, which is a
huge waste of time for the normal case where the CQ ring isn't
overflowing. We should not optimize for the slow/broken case, where
userspace overflows the ring.

> * charge memory after releasing the lock. I don't know if this is possible but
>   doing kfree(kmalloc(1, GFP_KERNEL_ACCOUNT)) after releasing the lock does the
>   job (even though it's dirty).

And that's definitely a no-go as well.

-- 
Jens Axboe
