[PATCH] NVMe: Avoid caculate cq head doorbel in nvme_process_cq()
Haiyan Hu
huhaiyan at huawei.com
Thu Sep 5 04:55:43 EDT 2013
On 2013/9/5 0:51, Matthew Wilcox wrote:
> On Wed, Sep 04, 2013 at 05:15:51PM +0800, Haiyan Hu wrote:
>> On 2013/9/3 23:53, Matthew Wilcox wrote:
>>> Have you been able to measure a difference? If so, can you share
>>> relative numbers?
>>
>> Sorry for my inaccurate description.
>> Our test data shows that this patch does not change cmd process latency.
>> But I think it can improve code efficiency, maybe a little.
>
> Right, so it's a trade-off. You want to add 8 bytes to the queue data
> structure in order to save a few instructions from being executed at
> queue processing time. We need to quantify the savings (since we know
> the costs). I approximated the code by replacing:
>
> - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
> + writel(head, nvmeq->q_db);
>
> Here's the code before:
>
> 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
> 2ce: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx
> 2d5: 8b 48 40 mov 0x40(%rax),%ecx
> 2d8: b8 01 00 00 00 mov $0x1,%eax
> 2dd: d3 e0 shl %cl,%eax
> 2df: 48 98 cltq
> 2e1: 48 8d 14 82 lea (%rdx,%rax,4),%rdx
> 2e5: 41 0f b7 c4 movzwl %r12w,%eax
> 2e9: 89 02 mov %eax,(%rdx)
>
> Here's the code after:
>
> 2ca: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx
> 2d1: 41 0f b7 c4 movzwl %r12w,%eax
> 2d5: 89 02 mov %eax,(%rdx)
>
> That's 6 instructions sved, and to be fair they have some pretty tight
> dependencies. Still, on a 3GHz processor, that's maybe 2 nanoseconds.
> If we're doing a million IOPS, we could save 2ms/s, or 0.2%. Pretty tough
> to measure, let alone justify.
>
> Here's an alternative that doesn't require adding 8 bytes to the
> nvme_queue. Instead of storing the shift in db_stride, we can store the
> actual stride. Then there is less calculation to be done at
> completion time, and the code looks like this:
>
> - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
> + writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
>
> 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
> 2ce: 48 63 50 40 movslq 0x40(%rax),%rdx
> 2d2: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax
> 2d9: 48 8d 14 90 lea (%rax,%rdx,4),%rdx
> 2dd: 41 0f b7 c4 movzwl %r12w,%eax
> 2e1: 89 02 mov %eax,(%rdx)
>
> That saves half the instructions, for no increased data structure usage.
>
> One other microoptimisation we can do is change the type of db_stride from
> int to unsigned. Then the compiler produces:
>
> 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
> 2ce: 8b 50 40 mov 0x40(%rax),%edx
> 2d1: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax
> 2d8: 48 8d 14 90 lea (%rax,%rdx,4),%rdx
> 2dc: 41 0f b7 c4 movzwl %r12w,%eax
> 2e0: 89 02 mov %eax,(%rdx)
>
> which uses one fewer byte (at address 2ce, it uses mov instead of movslq).
>
>
> Would you like to send a patch which changes the type of db_stride to
> unsigned and changes the value stored there to be 1 << the current value?
>
> Or you can try to persuade me again that the tradeoff is worth adding
> the extra 8 bytes to the data structure :-)
> .
>
Your analysis is more comprehensive. Of course, I'd like to send a new patch that
you advised. Do you think the updated variable named "db_stride_shift"?
Thank you.
More information about the Linux-nvme
mailing list