[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <3e201de2-bed2-6f7d-0783-700d095142e0@colorfullife.com>
Date: Thu, 29 Mar 2018 10:47:45 +0200
From: Manfred Spraul <manfred@...orfullife.com>
To: Davidlohr Bueso <dave@...olabs.net>,
Waiman Long <longman@...hat.com>,
Michael Kerrisk <mtk.manpages@...il.com>
Cc: "Eric W. Biederman" <ebiederm@...ssion.com>,
"Luis R. Rodriguez" <mcgrof@...nel.org>,
Kees Cook <keescook@...omium.org>,
linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Al Viro <viro@...iv.linux.org.uk>,
Matthew Wilcox <willy@...radead.org>,
Stanislav Kinsbursky <skinsbursky@...allels.com>,
Linux Containers <containers@...ts.linux-foundation.org>,
linux-api@...r.kernel.org
Subject: Re: [RFC][PATCH] ipc: Remove IPCMNI
Hello together,
On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:
> Cc'ing mtk, Manfred and linux-api.
>
> See below.
>
> On Thu, 15 Mar 2018, Waiman Long wrote:
>
>> On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
>>> Waiman Long <longman@...hat.com> writes:
>>>
>>>> On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
>>>>> The define IPCMNI was originally the size of a statically sized
>>>>> array in
>>>>> the kernel and that has long since been removed. Therefore there
>>>>> is no
>>>>> fundamental reason for IPCMNI.
>>>>>
>>>>> The only remaining use IPCMNI serves is as a convoluted way to format
>>>>> the ipc id to userspace. It does not appear that anything except for
>>>>> the CHECKPOINT_RESTORE code even cares about this variety of
>>>>> assignment
>>>>> and the CHECKPOINT_RESTORE code only cares about this weirdness
>>>>> because
>>>>> it has to restore these peculiar ids.
>>>>>
My assumption is that if an array is recreated, it should get a
different id.
a=semget(1234,,);
semctl(a,,IPC_RMID);
b=semget(1234,,);
now a!=b.
Rational: semop() calls only refer to the array by the id.
If there is a stale process in the system that tries to access the "old"
array and the new array has the same id, then the locking gets corrupted.
>>>>> Therefore make the assignment of ipc ids match the description in
>>>>> Advanced Programming in the Unix Environment and assign the next id
>>>>> until INT_MAX is hit then loop around to the lower ids.
>>>>>
Ok, sounds good.
That way we really cycle through INT_MAX, right now a==b would happen
after 128k RMID calls.
>>>>> This can be implemented trivially with the current code using
>>>>> idr_alloc_cyclic.
>>>>>
Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?
semop() that do not sleep are fairly fast.
The same applies for msgsnd/msgrcv, if the message is small enough.
@Davidlohr:
Do you know if there are application that frequently call semop() and it
doesn't have to sleep?
From the scalability that was pushed into the kernel, I assume that
this exists.
I have myself only checked postgresql, and postgresql always sleeps.
(and this was long ago)
>>>>> To make it possible to keep checkpoint/restore working I have renamed
>>>>> the sysctls from xxx_next_id to xxx_nextid. That is enough change
>>>>> that
>>>>> a smart CRIU implementation can see that what is exported has
>>>>> changed,
>>>>> and act accordingly. New kernels will be able to restore the old
>>>>> id's.
>>>>>
>>>>> This code still needs some real world testing to verify my
>>>>> assumptions.
>>>>> And some work with the CRIU implementations to actually add the code
>>>>> that deals with the new for of id assignment.
>>>>>
It means that all existing checkpoint/restore application will not work
with a new kernel.
Everyone must first update the checkpoint/restore application, then
update the kernel.
Is this acceptable?
--
Manfred
Powered by blists - more mailing lists