[PATCH] NVMe: Avoid caculate cq head doorbel in nvme_process_cq()
Matthew Wilcox
willy at linux.intel.com
Wed Sep 4 12:51:42 EDT 2013
On Wed, Sep 04, 2013 at 05:15:51PM +0800, Haiyan Hu wrote:
> On 2013/9/3 23:53, Matthew Wilcox wrote:
> > Have you been able to measure a difference? If so, can you share
> > relative numbers?
>
> Sorry for my inaccurate description.
> Our test data shows that this patch does not change cmd process latency.
> But I think it can improve code efficiency, maybe a little.
Right, so it's a trade-off. You want to add 8 bytes to the queue data
structure in order to save a few instructions from being executed at
queue processing time. We need to quantify the savings (since we know
the costs). I approximated the code by replacing:
- writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
+ writel(head, nvmeq->q_db);
Here's the code before:
2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
2ce: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx
2d5: 8b 48 40 mov 0x40(%rax),%ecx
2d8: b8 01 00 00 00 mov $0x1,%eax
2dd: d3 e0 shl %cl,%eax
2df: 48 98 cltq
2e1: 48 8d 14 82 lea (%rdx,%rax,4),%rdx
2e5: 41 0f b7 c4 movzwl %r12w,%eax
2e9: 89 02 mov %eax,(%rdx)
Here's the code after:
2ca: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx
2d1: 41 0f b7 c4 movzwl %r12w,%eax
2d5: 89 02 mov %eax,(%rdx)
That's 6 instructions sved, and to be fair they have some pretty tight
dependencies. Still, on a 3GHz processor, that's maybe 2 nanoseconds.
If we're doing a million IOPS, we could save 2ms/s, or 0.2%. Pretty tough
to measure, let alone justify.
Here's an alternative that doesn't require adding 8 bytes to the
nvme_queue. Instead of storing the shift in db_stride, we can store the
actual stride. Then there is less calculation to be done at
completion time, and the code looks like this:
- writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride));
+ writel(head, nvmeq->q_db + nvmeq->dev->db_stride);
2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
2ce: 48 63 50 40 movslq 0x40(%rax),%rdx
2d2: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax
2d9: 48 8d 14 90 lea (%rax,%rdx,4),%rdx
2dd: 41 0f b7 c4 movzwl %r12w,%eax
2e1: 89 02 mov %eax,(%rdx)
That saves half the instructions, for no increased data structure usage.
One other microoptimisation we can do is change the type of db_stride from
int to unsigned. Then the compiler produces:
2ca: 48 8b 43 08 mov 0x8(%rbx),%rax
2ce: 8b 50 40 mov 0x40(%rax),%edx
2d1: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax
2d8: 48 8d 14 90 lea (%rax,%rdx,4),%rdx
2dc: 41 0f b7 c4 movzwl %r12w,%eax
2e0: 89 02 mov %eax,(%rdx)
which uses one fewer byte (at address 2ce, it uses mov instead of movslq).
Would you like to send a patch which changes the type of db_stride to
unsigned and changes the value stored there to be 1 << the current value?
Or you can try to persuade me again that the tradeoff is worth adding
the extra 8 bytes to the data structure :-)
More information about the Linux-nvme
mailing list