Message-ID: <465cf5c6-46be-2701-1c26-3e90f31f05e4@mellanox.com>
Date:   Wed, 16 May 2018 10:04:37 +0300
From:   Tariq Toukan <tariqt@...lanox.com>
To:     Qing Huang <qing.huang@...cle.com>,
        Tariq Toukan <tariqt@...lanox.com>, davem@...emloft.net,
        haakon.bugge@...cle.com, yanjun.zhu@...cle.com
Cc:     netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH V2] mlx4_core: allocate ICM memory in page size chunks



On 15/05/2018 9:53 PM, Qing Huang wrote:
> 
> 
> On 5/15/2018 2:19 AM, Tariq Toukan wrote:
>>
>>
>> On 14/05/2018 7:41 PM, Qing Huang wrote:
>>>
>>>
>>> On 5/13/2018 2:00 AM, Tariq Toukan wrote:
>>>>
>>>>
>>>> On 11/05/2018 10:23 PM, Qing Huang wrote:
>>>>> When a system is under memory pressure (high usage with fragments),
>>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>>> memory management to enter slow path doing memory compact/migration
>>>>> ops in order to complete high order memory allocations.
>>>>>
>>>>> When that happens, user processes calling uverb APIs may get stuck
>>>>> for more than 120s easily even though there are a lot of free pages
>>>>> in smaller chunks available in the system.
>>>>>
>>>>> Syslog:
>>>>> ...
>>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>>> ...
>>>>>
>>>>> With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.
>>>>>
>>>>> However in order to support smaller ICM chunk size, we need to fix
>>>>> another issue in large size kcalloc allocations.
>>>>>
>>>>> E.g.
>>>>> Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
>>>>> size, each ICM chunk can only hold 512 mtt entries (8 bytes for 
>>>>> each mtt
>>>>> entry). So we need a 16MB allocation for a table->icm pointer array to
>>>>> hold 2M pointers which can easily cause kcalloc to fail.
>>>>>
>>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>>> for contiguous memory pages for a driver meta data structure (no need
>>>>> of DMA ops).
>>>>>
>>>>> Signed-off-by: Qing Huang <qing.huang@...cle.com>
>>>>> Acked-by: Daniel Jurgens <danielj@...lanox.com>
>>>>> Reviewed-by: Zhu Yanjun <yanjun.zhu@...cle.com>
>>>>> ---
>>>>> v1 -> v2: adjusted chunk size to reflect different architectures.
>>>>>
>>>>>   drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
>>>>>   1 file changed, 7 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
>>>>> b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>> index a822f7a..ccb62b8 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
>>>>> @@ -43,12 +43,12 @@
>>>>>   #include "fw.h"
>>>>>     /*
>>>>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>>>>> - * per chunk.
>>>>> + * We allocate in page size (default 4KB on many archs) chunks to 
>>>>> avoid high
>>>>> + * order memory allocations in fragmented/high usage memory 
>>>>> situation.
>>>>>    */
>>>>>   enum {
>>>>> -    MLX4_ICM_ALLOC_SIZE    = 1 << 18,
>>>>> -    MLX4_TABLE_CHUNK_SIZE    = 1 << 18
>>>>> +    MLX4_ICM_ALLOC_SIZE    = 1 << PAGE_SHIFT,
>>>>> +    MLX4_TABLE_CHUNK_SIZE    = 1 << PAGE_SHIFT
>>>>
>>>> Which is actually PAGE_SIZE.
>>>
>>> Yes, we wanted to avoid high order memory allocations.
>>>
>>
>> Then please use PAGE_SIZE instead.
> 
> PAGE_SIZE is usually defined as 1 << PAGE_SHIFT. So I think PAGE_SHIFT 
> is actually more appropriate here.
> 

Definition of PAGE_SIZE varies among different archs.
It is not always as simple as 1 << PAGE_SHIFT.
It might be:
PAGE_SIZE (1UL << PAGE_SHIFT)
PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
etc...

Please replace 1 << PAGE_SHIFT with PAGE_SIZE.

> 
>>
>>>> Also, please add a comma at the end of the last entry.
>>>
>>> Hmm..., I followed the existing code style, and checkpatch.pl didn't 
>>> complain about the missing comma.
>>>
>>
>> I am in favor of having a comma also after the last element, so that 
>> when another enum element is added we do not modify this line again, 
>> which would falsely affect git blame.
>>
>> I know it didn't exist before your patch, but once we're here, let's 
>> do it.
> 
> I'm okay either way. If adding an extra comma is preferred by many 
> people, someone should update checkpatch.pl to enforce it. :)
> 
I agree.
Until then, please use an extra comma in this patch.
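Taken together with the PAGE_SIZE request earlier in this thread, the enum being asked for would look like this (a sketch of a possible v3, not the applied patch):

```c
enum {
	MLX4_ICM_ALLOC_SIZE	= PAGE_SIZE,
	MLX4_TABLE_CHUNK_SIZE	= PAGE_SIZE,	/* trailing comma kept */
};
```

With the trailing comma in place, appending a new constant later only adds a line, so git blame on the existing entries stays unchanged.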

>>
>>>>
>>>>>   };
>>>>>     static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct 
>>>>> mlx4_icm_chunk *chunk)
>>>>> @@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>>>> struct mlx4_icm_table *table,
>>>>>       obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
>>>>>       num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;
>>>>>   -    table->icm      = kcalloc(num_icm, sizeof(*table->icm), 
>>>>> GFP_KERNEL);
>>>>> +    table->icm      = vzalloc(num_icm * sizeof(*table->icm));
>>>>
>>>> Why not kvzalloc ?
>>>
>>> I think table->icm pointer array doesn't really need physically 
>>> contiguous memory. Sometimes high order
>>> memory allocation by kmalloc variants may trigger slow path and cause 
>>> tasks to be blocked.
>>>
>>
>> This is control path so it is less latency-sensitive.
>> Let's not produce unnecessary degradation here, please call kvzalloc 
>> so we maintain a similar behavior when contiguous memory is available, 
>> and a fallback for resiliency.
> 
> Not sure what degradation exactly would be caused by vzalloc here. I think 
> it's better to leave physically contiguous pages for other requests that 
> really need them. Besides, the slow path/memory compacting can be really 
> expensive.
> 

Degradation is expected when you replace contiguous memory with non-contiguous 
memory, without any perf test to show otherwise.
We agree that when contiguous memory is not available, we should fall back to 
non-contiguous memory instead of simply failing, and for this you can call 
kvzalloc.
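A sketch of the change being requested here (kvzalloc() attempts a kmalloc() first and transparently falls back to vmalloc() when contiguous pages are unavailable; kvfree() frees either kind):

```c
	table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
	...
	kvfree(table->icm);	/* in both the error path and cleanup */
```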

>>
>>> Thanks,
>>> Qing
>>>
>>>>
>>>>>       if (!table->icm)
>>>>>           return -ENOMEM;
>>>>>       table->virt     = virt;
>>>>> @@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, 
>>>>> struct mlx4_icm_table *table,
>>>>>               mlx4_free_icm(dev, table->icm[i], use_coherent);
>>>>>           }
>>>>>   -    kfree(table->icm);
>>>>> +    vfree(table->icm);
>>>>>         return -ENOMEM;
>>>>>   }
>>>>> @@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev 
>>>>> *dev, struct mlx4_icm_table *table)
>>>>>               mlx4_free_icm(dev, table->icm[i], table->coherent);
>>>>>           }
>>>>>   -    kfree(table->icm);
>>>>> +    vfree(table->icm);
>>>>>   }
>>>>>
>>>>
>>>> Thanks for your patch.
>>>>
>>>> I need to verify there is no dramatic performance degradation here.
>>>> You can prepare and send a v3 in the meanwhile.
>>>>
>>>> Thanks,
>>>> Tariq
>>>> -- 
>>>> To unsubscribe from this list: send the line "unsubscribe 
>>>> linux-rdma" in
>>>> the body of a message to majordomo@...r.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
> 
