[PATCH] NVMe: Avoid calculating cq head doorbell in nvme_process_cq()

Haiyan Hu huhaiyan at huawei.com
Thu Sep 5 04:55:43 EDT 2013


On 2013/9/5 0:51, Matthew Wilcox wrote:
> On Wed, Sep 04, 2013 at 05:15:51PM +0800, Haiyan Hu wrote:
>> On 2013/9/3 23:53, Matthew Wilcox wrote:
>>> Have you been able to measure a difference?  If so, can you share
>>> relative numbers?
>>
>> Sorry for my inaccurate description.
>> Our test data shows that this patch does not change command processing latency.
>> But I think it can improve code efficiency, maybe a little.
> 
> Right, so it's a trade-off.  You want to add 8 bytes to the queue data
> structure in order to save a few instructions from being executed at
> queue processing time.  We need to quantify the savings (since we know
> the costs).  I approximated the code by replacing:
> 
> -       writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
> +       writel(head, nvmeq->q_db);
> 
> Here's the code before:
> 
>      2ca:       48 8b 43 08             mov    0x8(%rbx),%rax     # rax = nvmeq->dev
>      2ce:       48 8b 93 88 00 00 00    mov    0x88(%rbx),%rdx    # rdx = nvmeq->q_db
>      2d5:       8b 48 40                mov    0x40(%rax),%ecx    # ecx = dev->db_stride
>      2d8:       b8 01 00 00 00          mov    $0x1,%eax          # eax = 1
>      2dd:       d3 e0                   shl    %cl,%eax           # eax = 1 << db_stride
>      2df:       48 98                   cltq                      # sign-extend (db_stride is int)
>      2e1:       48 8d 14 82             lea    (%rdx,%rax,4),%rdx # rdx = q_db + (1 << db_stride)
>      2e5:       41 0f b7 c4             movzwl %r12w,%eax         # eax = head
>      2e9:       89 02                   mov    %eax,(%rdx)        # the writel() store
> 
> Here's the code after:
> 
>      2ca:       48 8b 93 88 00 00 00    mov    0x88(%rbx),%rdx
>      2d1:       41 0f b7 c4             movzwl %r12w,%eax
>      2d5:       89 02                   mov    %eax,(%rdx)
> 
> That's 6 instructions saved, and to be fair they have some pretty tight
> dependencies.  Still, on a 3GHz processor, that's maybe 2 nanoseconds.
> If we're doing a million IOPS, we could save 2ms/s, or 0.2%.  Pretty tough
> to measure, let alone justify.
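[For reference, the approach being costed above -- caching the precomputed CQ
head doorbell address in the queue structure -- would look roughly like the
sketch below; the field name cq_head_dbl is illustrative and not taken from
the original patch:

    /* Cache the CQ head doorbell address at queue creation so that
     * nvme_process_cq() need not recompute q_db + (1 << db_stride)
     * on every completion pass.  Costs one pointer (8 bytes) per
     * struct nvme_queue. */
    struct nvme_queue {
            /* ... existing fields ... */
            u32 __iomem *q_db;              /* SQ tail doorbell */
            u32 __iomem *cq_head_dbl;       /* cached CQ head doorbell */
    };

    /* once, when the queue is set up: */
    nvmeq->cq_head_dbl = nvmeq->q_db + (1 << nvmeq->dev->db_stride);

    /* in nvme_process_cq(): */
    writel(head, nvmeq->cq_head_dbl);
]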
> 
> Here's an alternative that doesn't require adding 8 bytes to the
> nvme_queue.  Instead of storing the shift in db_stride, we can store the
> actual stride.  Then there is less calculation to be done at
> completion time, and the code looks like this:
> 
> -       writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
> +       writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
> 
>      2ca:       48 8b 43 08             mov    0x8(%rbx),%rax
>      2ce:       48 63 50 40             movslq 0x40(%rax),%rdx
>      2d2:       48 8b 83 88 00 00 00    mov    0x88(%rbx),%rax
>      2d9:       48 8d 14 90             lea    (%rax,%rdx,4),%rdx
>      2dd:       41 0f b7 c4             movzwl %r12w,%eax
>      2e1:       89 02                   mov    %eax,(%rdx)
> 
> That saves half the instructions, for no increased data structure usage.
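[In the driver of that era, db_stride is derived from the controller's
CAP.DSTRD field at initialisation, so storing the expanded stride would be a
one-line change there; a sketch, assuming the existing NVME_CAP_STRIDE()
helper:

    -       dev->db_stride = NVME_CAP_STRIDE(readq(&dev->bar->cap));
    +       dev->db_stride = 1 << NVME_CAP_STRIDE(readq(&dev->bar->cap));
]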
> 
> One other micro-optimisation we can do is change the type of db_stride from
> int to unsigned.  Then the compiler produces:
> 
>      2ca:       48 8b 43 08             mov    0x8(%rbx),%rax
>      2ce:       8b 50 40                mov    0x40(%rax),%edx
>      2d1:       48 8b 83 88 00 00 00    mov    0x88(%rbx),%rax
>      2d8:       48 8d 14 90             lea    (%rax,%rdx,4),%rdx
>      2dc:       41 0f b7 c4             movzwl %r12w,%eax
>      2e0:       89 02                   mov    %eax,(%rdx)
> 
> which uses one fewer byte (at address 2ce, it uses mov instead of movslq).
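[The corresponding type change is equally small; a sketch of the field in
struct nvme_dev, whose placement is implied by nvmeq->dev->db_stride above:

    struct nvme_dev {
            /* ... */
    -       int db_stride;
    +       unsigned db_stride;     /* now holds 1 << CAP.DSTRD */
            /* ... */
    };
]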
> 
> 
> Would you like to send a patch which changes the type of db_stride to
> unsigned and changes the value stored there to be 1 << the current value?
> 
> Or you can try to persuade me again that the tradeoff is worth adding
> the extra 8 bytes to the data structure :-)
> 
Your analysis is more comprehensive. Of course, I'd like to send a new patch
as you advised. Should the updated variable be named "db_stride_shift"?

Thank you.
