Message-ID: <196418a9-9c9d-fdb5-f5a1-9abc391adc83@redhat.com>
Date: Thu, 14 Dec 2017 13:11:46 -0500
From: Daniel Walsh <dwalsh@...hat.com>
To: Casey Schaufler <casey@...aufler-ca.com>,
Stephen Smalley <sds@...ho.nsa.gov>,
yangjihong <yangjihong1@...wei.com>,
"paul@...l-moore.com" <paul@...l-moore.com>,
"eparis@...isplace.org" <eparis@...isplace.org>,
"selinux@...ho.nsa.gov" <selinux@...ho.nsa.gov>,
Lukas Vrabec <lvrabec@...hat.com>,
Petr Lautrbach <plautrba@...hat.com>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long
time because of too many sidtab context node
On 12/14/2017 12:42 PM, Casey Schaufler wrote:
> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am doing stress testing on a 3.10 kernel (CentOS 7.4), constantly
>>>>>>> starting a number of Docker containers with SELinux enabled, and after
>>>>>>> about 2 days the kernel hits a softlockup panic:
>>>>>>> <IRQ> [<ffffffff810bb778>] sched_show_task+0xb8/0x120
>>>>>>> [<ffffffff8116133f>] show_lock_info+0x20f/0x3a0
>>>>>>> [<ffffffff811226aa>] watchdog_timer_fn+0x1da/0x2f0
>>>>>>> [<ffffffff811224d0>] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>> [<ffffffff810abf82>] __hrtimer_run_queues+0xd2/0x260
>>>>>>> [<ffffffff810ac520>] hrtimer_interrupt+0xb0/0x1e0
>>>>>>> [<ffffffff8104a477>] local_apic_timer_interrupt+0x37/0x60
>>>>>>> [<ffffffff8166fd90>] smp_apic_timer_interrupt+0x50/0x140
>>>>>>> [<ffffffff8166e1dd>] apic_timer_interrupt+0x6d/0x80
>>>>>>> <EOI> [<ffffffff812b4193>] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>> [<ffffffff812b41f0>] ? sidtab_context_to_sid+0x110/0x480
>>>>>>> [<ffffffff812c0d15>] ? mls_setup_user_range+0x145/0x250
>>>>>>> [<ffffffff812bd477>] security_get_user_sids+0x3f7/0x550
>>>>>>> [<ffffffff812b1a8b>] sel_write_user+0x12b/0x210
>>>>>>> [<ffffffff812b1960>] ? sel_write_member+0x200/0x200
>>>>>>> [<ffffffff812b01d8>] selinux_transaction_write+0x48/0x80
>>>>>>> [<ffffffff811f444d>] vfs_write+0xbd/0x1e0
>>>>>>> [<ffffffff811f4eef>] SyS_write+0x7f/0xe0
>>>>>>> [<ffffffff8166d433>] system_call_fastpath+0x16/0x1b
>>>>>>>
>>>>>>> My opinion:
>>>>>>> When a Docker container starts, it mounts an overlay filesystem with a
>>>>>>> different SELinux context; the mount points look like:
>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>
>>>>>>> sidtab_search_context checks whether the context is already in the
>>>>>>> sidtab list; if it is not found, a new node is generated and inserted
>>>>>>> into the list. As the number of containers grows, so does the number
>>>>>>> of context nodes. In our test the node count eventually reached
>>>>>>> 300,000+, at which point sidtab_context_to_sid takes 100-200 ms per
>>>>>>> call, which leads to the system softlockup.
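
(For illustration only: a minimal userspace sketch of the kind of linear
reverse lookup being described above. This is not the actual kernel sidtab
code; the names and types here are invented, but the shape of the cost is
the same: every unknown context walks the whole chain before it is added.)

/* Sketch, not the kernel code: a reverse lookup over a singly linked list
 * of (sid, context) pairs.  Inserting the Nth distinct context costs O(N),
 * and nodes are never freed, so the list only grows. */
#include <stdlib.h>
#include <string.h>

struct ctx_node {
	unsigned int sid;
	char *context;          /* e.g. "system_u:object_r:...:s0:c414,c873" */
	struct ctx_node *next;
};

static unsigned int context_to_sid(struct ctx_node **head,
				   const char *context,
				   unsigned int *next_sid)
{
	struct ctx_node *n;

	for (n = *head; n; n = n->next)         /* the slow linear scan */
		if (strcmp(n->context, context) == 0)
			return n->sid;

	n = malloc(sizeof(*n));                 /* not found: add a new node */
	n->sid = (*next_sid)++;
	n->context = strdup(context);
	n->next = *head;
	*head = n;
	return n->sid;
}
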
>>>>>>>
>>>>>>> Is this an SELinux bug? When a filesystem is unmounted, why is its
>>>>>>> context node not deleted? I cannot find the relevant function to
>>>>>>> delete a node in sidtab.c.
>>>>>>>
>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>> So, does docker just keep allocating a unique category set for
>>>>>> every
>>>>>> new container, never reusing them even if the container is
>>>>>> destroyed?
>>>>>> That would be a bug in docker IMHO. Or are you creating an
>>>>>> unbounded
>>>>>> number of containers and never destroying the older ones?
>>>>> You can't reuse the security context. A process in ContainerA
>>>>> sends
>>>>> a labeled packet to MachineB. ContainerA goes away and its
>>>>> context
>>>>> is recycled in ContainerC. MachineB responds some time later,
>>>>> again
>>>>> with a labeled packet. ContainerC gets information intended for
>>>>> ContainerA, and uses the information to take over the Elbonian
>>>>> government.
>>>> Docker isn't using labeled networking (nor is anything else by
>>>> default;
>>>> it is only enabled if explicitly configured).
>>> If labeled networking weren't an issue we'd have full security
>>> module stacking by now. Yes, it's an edge case. If you want to
>>> use labeled NFS or a local filesystem that gets mounted in each
>>> container (don't tell me that nobody would do that) you've got
>>> the same problem.
>> Even if someone were to configure labeled networking, Docker is not
>> presently relying on that or SELinux network enforcement for any
>> security properties, so it really doesn't matter.
> True enough. I can imagine a use case, but as you point out, it
> would be a very complex configuration and coordination exercise
> using SELinux.
>
>> And if they wanted
>> to do that, they'd have to coordinate category assignments across all
>> systems involved, for which no facility exists AFAIK. If you have two
>> docker instances running on different hosts, I'd wager that they can
>> hand out the same category sets today to different containers.
>>
>> With respect to labeled NFS, that's also not the default for nfs
>> mounts, so again it is a custom configuration and Docker isn't relying
>> on it for any guarantees today. For local filesystems, they would
>> normally be context-mounted or using genfscon rather than xattrs in
>> order to be accessible to the container, thus no persistent storage of
>> the category sets.
Well, Kubernetes and OpenShift do set the labels to be the same within a
project, and they can manage that across nodes. But yes, we are not using
labeled networking at this point.
> I know that is the intended configuration, but I see people do
> all sorts of stoopid things for what they believe are good reasons.
> Unfortunately, lots of people count on containers to provide
> isolation, but create "solutions" for data sharing that defeat it.
>
>> Certainly docker could provide an option to not reuse category sets,
>> but making that the default is not sane and just guarantees exhaustion
>> of the SID and context space (just create and tear down lots of
>> containers every day or more frequently).
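
(A rough illustration of how fast that space runs out, assuming the usual
sVirt scheme of two distinct categories drawn from c0..c1023, which is what
the c414,c873 and c431,c651 mounts quoted above use: there are only
1024 * 1023 / 2 = 523,776 possible category pairs. A host that churns
through a few thousand containers a day without reuse exhausts the pairs in
a matter of months, and the sidtab grows monotonically the whole time.)
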
> It seems that Docker might have a similar issue with UIDs,
> but it takes longer to run out of UIDs than sidtab entries.
>
>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>> use
>>>>>> of
>>>>>> /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
>>>>>> entirely, which is what triggered this for you.
>>>>>>
>>>>>> We cannot currently delete a sidtab node because we have no way
>>>>>> of
>>>>>> knowing if there are any lingering references to the
>>>>>> SID. Fixing
>>>>>> that
>>>>>> would require reference-counted SIDs, which goes beyond just
>>>>>> SELinux
>>>>>> since SIDs/secids are returned by LSM hooks and cached in other
>>>>>> kernel
>>>>>> data structures.
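
(Purely to illustrate what reference-counted SIDs would mean structurally,
a sketch and not a proposal; the names below are invented. The hard part is
not the structure itself but that nothing in the kernel currently takes or
drops such references when it caches a secid.)

#include <stdatomic.h>

struct context;                 /* opaque: the parsed security context */

struct sid_entry {
	unsigned int sid;
	struct context *ctx;
	atomic_uint refcount;   /* reaches 0 => entry could be freed */
};

static void sid_entry_get(struct sid_entry *e)
{
	atomic_fetch_add(&e->refcount, 1);
}

/* Returns nonzero when the last reference is gone. */
static int sid_entry_put(struct sid_entry *e)
{
	return atomic_fetch_sub(&e->refcount, 1) == 1;
}
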
>>>>> You could delete a sidtab node. The code already deals with
>>>>> unfindable
>>>>> SIDs. The issue is that eventually you run out of SIDs. Then you
>>>>> are
>>>>> forced to recycle SIDs, which leads to the overthrow of the
>>>>> Elbonian
>>>>> government.
>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>> aren't
>>>> reference counted and we can't know whether it is still in use
>>>> somewhere in the kernel. Doing so prematurely would lead to the
>>>> SID
>>>> being remapped to the unlabeled context, and then likely to
>>>> undesired
>>>> denials.
>>> I would suggest that if you delete a sidtab node and someone
>>> comes along later and tries to use it that denial is exactly
>>> what you would desire. I don't see any other rational action.
>> Yes, if we know that the SID wasn't in use at the time we tore it down.
>> But if we're just randomly deleting sidtab entries based on age or
>> something (since we have no reference count), we'll almost certainly
>> encounter situations where a SID hasn't been accessed in a long time
>> but is still being legitimately cached somewhere. Just a file that
>> hasn't been accessed in a while might have that SID still cached in its
>> inode security blob, or anywhere else.
>>
>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>> negative
>>>>>> case; there was an earlier optimization for the positive case
>>>>>> by
>>>>>> adding
>>>>>> a cache to sidtab_context_to_sid() prior to calling it. It's a
>>>>>> reverse
>>>>>> lookup in the sidtab.
>>>>> This seems like a bad idea.
>>>> Not sure what you mean, but it can certainly be changed to at least
>>>> use
>>>> a hash table for these reverse lookups.
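
(For what it's worth, a minimal userspace sketch of that idea, illustrative
only and not a kernel patch; the bucket count, hash function, and names
below are arbitrary choices.)

#include <stdlib.h>
#include <string.h>

#define REVTAB_BUCKETS 1024             /* assumption: power-of-two buckets */

struct rev_node {
	unsigned int sid;
	char *context;
	struct rev_node *next;
};

static struct rev_node *revtab[REVTAB_BUCKETS];

static unsigned int ctx_hash(const char *s)
{
	unsigned int h = 5381;          /* djb2, purely illustrative */

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h & (REVTAB_BUCKETS - 1);
}

static int rev_lookup(const char *context, unsigned int *sid)
{
	struct rev_node *n;

	for (n = revtab[ctx_hash(context)]; n; n = n->next) {
		if (strcmp(n->context, context) == 0) {
			*sid = n->sid;
			return 0;
		}
	}
	return -1;                      /* miss: caller allocates a new SID */
}

static void rev_insert(const char *context, unsigned int sid)
{
	unsigned int h = ctx_hash(context);
	struct rev_node *n = malloc(sizeof(*n));

	n->sid = sid;
	n->context = strdup(context);
	n->next = revtab[h];
	revtab[h] = n;
}

Keyed this way, the miss path becomes an average-case walk of one short
bucket instead of a scan of the entire table.
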
>>>>
>>>>
>
>
>