Message-ID: <8b1fff6b-1acc-4582-0abe-86f948ac4b4a@ozlabs.ru>
Date: Sat, 23 Jan 2021 17:35:03 +1100
From: Alexey Kardashevskiy <aik@...abs.ru>
To: Hillf Danton <hdanton@...a.com>
Cc: Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
Dmitry Vyukov <dvyukov@...gle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Tejun Heo <tj@...nel.org>
Subject: Re: BUG: MAX_LOCKDEP_KEYS too low!
On 23/01/2021 17:01, Hillf Danton wrote:
> On Sat, 23 Jan 2021 09:53:42 +1100 Alexey Kardashevskiy wrote:
>> On 23/01/2021 02:30, Tetsuo Handa wrote:
>>> On 2021/01/22 22:28, Tetsuo Handa wrote:
>>>> On 2021/01/22 21:10, Dmitry Vyukov wrote:
>>>>> On Fri, Jan 22, 2021 at 1:03 PM Alexey Kardashevskiy <aik@...abs.ru> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 22/01/2021 21:30, Tetsuo Handa wrote:
>>>>>>> On 2021/01/22 18:16, Dmitry Vyukov wrote:
>>>>>>>> The reproducer only does 2 bpf syscalls, so something is slowly leaking in bpf.
>>>>>>>> My first suspect would be one of these. Since workqueue is async, it
>>>>>>>> may cause such slow drain that happens only when tasks are spawned
>>>>>>>> fast. I don't know if there is a procfs/debugfs introspection file to
>>>>>>>> monitor workqueue lengths to verify this hypothesis.
>>>>>>>
>>>>>>> If you can reproduce locally, you can call show_workqueue_state()
>>>>>>> (part of SysRq-t) when hitting the limit.
>>>>>>>
>>>>>>> --- a/kernel/locking/lockdep.c
>>>>>>> +++ b/kernel/locking/lockdep.c
>>>>>>> @@ -1277,6 +1277,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
>>>>>>>
>>>>>>> print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
>>>>>>> dump_stack();
>>>>>>> + show_workqueue_state();
>>>>>>> return NULL;
>>>>>>> }
>>>>>>> nr_lock_classes++;
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here is the result:
>>>>>> https://pastebin.com/rPn0Cytu
>>>>>
>>>>> Do you mind posting this publicly?
>>>>> Yes, it seems that bpf_map_free_deferred is the problem (11138
>>>>> outstanding callbacks).
>
> Need to set up a local queue for releasing bpf maps if 10,000 outstanding
> callbacks means flooding.
>
>>>>>
>>>>
>>>> Wow. Horribly stuck queue. I guess BPF wants to try WQ created by
>>>>
>>>> alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
>>>>
>>>> rather than system_wq . You can add Tejun Heo <tj@...nel.org> for WQ.
>>>>
>>>> Anyway, please post your result to ML.
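[A minimal sketch of what that suggestion could look like in kernel/bpf/syscall.c
is below; the bpf_free_wq_init() hook and its late_initcall placement are
illustrative assumptions, only the alloc_workqueue() flags come from the line
quoted above:

static struct workqueue_struct *bpf_free_wq;

static int __init bpf_free_wq_init(void)
{
	/* dedicated queue so deferred map frees do not pile up on system_wq */
	bpf_free_wq = alloc_workqueue("bpf_free_wq",
				      WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
	return bpf_free_wq ? 0 : -ENOMEM;
}
late_initcall(bpf_free_wq_init);

static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
{
	if (atomic64_dec_and_test(&map->refcnt)) {
		/* bpf_map_free_id() must be called first */
		bpf_map_free_id(map, do_idr_lock);
		btf_put(map->btf);
		INIT_WORK(&map->work, bpf_map_free_deferred);
		/* queue on the dedicated wq instead of schedule_work() */
		queue_work(bpf_free_wq, &map->work);
	}
}

WQ_MEM_RECLAIM gives the queue a rescuer thread so frees can still make
progress under memory pressure, and WQ_UNBOUND lets the work items spread
across CPUs instead of being bound to the submitting CPU.]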
>>
>> https://pastebin.com/JfrmzguK is with the patch below applied. There seems
>> to be less output. Interestingly, just as I was about to hit "send", the
>> OOM killer kicked in and tried killing a bunch of "maxlockdep" processes
>> (my test case):
>>
>> [ 891.037315] [ 31007] 0 31007 281 5 49152 0 1000 maxlockdep
>> [ 891.037540] [ 31009] 0 31009 281 5 49152 0 1000 maxlockdep
>> [ 891.037760] [ 31012] 0 31012 281 5 49152 0 1000 maxlockdep
>> [ 891.037980] [ 31013] 0 31013 281 5 47104 0 0 maxlockdep
>> [ 891.038210] [ 31014] 0 31014 281 5 49152 0 1000 maxlockdep
>> [ 891.038429] [ 31018] 0 31018 281 5 47104 0 0 maxlockdep
>> [ 891.038652] [ 31019] 0 31019 281 5 49152 0 1000 maxlockdep
>> [ 891.038874] [ 31020] 0 31020 281 5 49152 0 1000 maxlockdep
>> [ 891.039095] [ 31021] 0 31021 281 5 49152 0 1000 maxlockdep
>> [ 891.039317] [ 31022] 0 31022 281 5 47104 0 0 maxlockdep
>>
>
> A local queue (a list head plus a spin lock) helps collapse the individual
> map->work items into a single work item, in order to cut the risk of
> flooding the WQ with work.
>
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -448,16 +448,40 @@ static void bpf_map_release_memcg(struct
> }
> #endif
>
> -/* called from workqueue */
> +static int worker_idle = 1;
> +static LIST_HEAD(bpf_map_work_list);
> +static DEFINE_SPINLOCK(bpf_map_work_lock);
> +
> static void bpf_map_free_deferred(struct work_struct *work)
> {
> - struct bpf_map *map = container_of(work, struct bpf_map, work);
> + struct bpf_map *map;
> +
> + worker_idle = 0;
> +again:
> + map = NULL;
> + spin_lock_irq(&bpf_map_work_lock);
> +
> + if (!list_empty(&bpf_map_work_list)) {
> + map = list_first_entry(&bpf_map_work_list, struct bpf_map,
> + work_list);
> + list_del_init(&map->work_list);
> + } else
> + worker_idle = 1;
> +
> + spin_unlock_irq(&bpf_map_work_lock);
> +
> + if (!map)
> + return;
>
> security_bpf_map_free(map);
> bpf_map_release_memcg(map);
> /* implementation dependent freeing */
> map->ops->map_free(map);
> +
> + cond_resched();
> + goto again;
> }
> +static DECLARE_WORK(bpf_map_release_work, bpf_map_free_deferred);
>
> static void bpf_map_put_uref(struct bpf_map *map)
> {
> @@ -473,11 +497,20 @@ static void bpf_map_put_uref(struct bpf_
> static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
> {
> if (atomic64_dec_and_test(&map->refcnt)) {
> + unsigned long flags;
> + int idle;
> +
> /* bpf_map_free_id() must be called first */
> bpf_map_free_id(map, do_idr_lock);
> btf_put(map->btf);
> - INIT_WORK(&map->work, bpf_map_free_deferred);
> - schedule_work(&map->work);
> +
> + spin_lock_irqsave(&bpf_map_work_lock, flags);
> + idle = worker_idle;
> + list_add(&map->work_list, &bpf_map_work_list);
> + spin_unlock_irqrestore(&bpf_map_work_lock, flags);
> +
> + if (idle)
> + queue_work(system_unbound_wq, &bpf_map_release_work);
> }
> }
>
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -149,7 +149,7 @@ struct bpf_map {
> */
> atomic64_t refcnt ____cacheline_aligned;
> atomic64_t usercnt;
> - struct work_struct work;
> + struct list_head work_list;
> struct mutex freeze_mutex;
> u64 writecnt; /* writable mmap cnt; protected by freeze_mutex */
> };
> --
>
This behaves quite differently but still produces the message (I have
show_workqueue_state() right after the bug message):
[ 85.803991] BUG: MAX_LOCKDEP_KEYS too low!
[ 85.804338] turning off the locking correctness validator.
[ 85.804474] Showing busy workqueues and worker pools:
[ 85.804620] workqueue events_unbound: flags=0x2
[ 85.804764] pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3
[ 85.804965] in-flight: 81:bpf_map_free_deferred
[ 85.805229] workqueue events_power_efficient: flags=0x80
[ 85.805357] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 85.805558] in-flight: 57:gc_worker
[ 85.805877] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 82 24
[ 85.806147] pool 16: cpus=0-7 flags=0x4 nice=0 hung=69s workers=3 idle: 7 251
^C[ 100.129747] maxlockdep (5104) used greatest stack depth: 8032 bytes left
root@...dbg:~# grep "lock-classes" /proc/lockdep_stats
lock-classes: 8192 [max: 8192]
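
[For reference, the 8192 ceiling reported here is lockdep's compile-time key
table size; in kernels of this vintage it comes from definitions roughly like
the following in the lockdep headers:

/* approximate definitions, shown for reference only */
#define MAX_LOCKDEP_KEYS_BITS	13
#define MAX_LOCKDEP_KEYS	(1UL << MAX_LOCKDEP_KEYS_BITS)	/* = 8192 lock classes */
]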
--
Alexey