Title: An Ever-Curious Mind - SHLD vs ROL

Created: 2017-08-14 14:14
Link: https://scz.617.cn/misc/201708141414.txt

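In the 64-bit ntdll!KiUserApcDispatcher() I came across "shld rcx,rcx,0x20",
an instruction I had never used before. The F5 output showed ">> 0x20", so I
took it for granted that shld was a 64-bit logical right shift, without even
noticing the difference between l and r. Reading the assembly later, the code
that followed made no sense if this were a logical right shift. I checked the
manual: "shld rcx,rcx,0x20" is equivalent to "rol rcx,0x20". Curiosity kicked
in: why not use rol directly, instead of the "uncommon" shld?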

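In 1994 I bought a book on the 80486 at the Xinhua Bookstore in Yuanjialing,
Changsha, for 17 yuan. It is a very thin book, and one of only three that I
did not abandon while drifting north and south. Somehow it survived to see
the day the revolution was won and made it alive onto my present bookshelf.
For the foreseeable future its fate is no longer troubled, and it can live
out its years in peace. I flipped through it yesterday; it says shld was
already introduced with the 80386. This is a 3-operand instruction, and
thinking back I suddenly realized that I never used a 3-operand instruction
in my student days.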

See:

"Intel 64 and IA-32 Architectures Software Developer Manual: Vol 2"
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Pseudocode for ROL:

--------------------------------------------------------------------------
IF OperandSize = 64
    THEN COUNTMASK = 3FH;
    ELSE COUNTMASK = 1FH;
FI;
tempCOUNT <- (COUNT & COUNTMASK) MOD SIZE
WHILE (tempCOUNT <> 0)
    DO
        tempCF <- MSB(DEST);
        DEST <- (DEST * 2) + tempCF;
        tempCOUNT <- tempCOUNT - 1;
    OD;
ELIHW;
IF (COUNT & COUNTMASK) <> 0
    THEN CF <- LSB(DEST);
FI;
IF (COUNT & COUNTMASK) = 1
    THEN OF <- MSB(DEST) XOR CF;
    ELSE OF is undefined;
FI;
--------------------------------------------------------------------------

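The CF flag holds the last bit rotated out of the most significant bit. When
COUNT=1 the OF flag is meaningful: OF=1 if the rotate changed the sign bit,
OF=0 if the sign bit is the same before and after. When COUNT>1 the OF flag
is meaningless (undefined).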
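
The ROL pseudocode can be modeled in Python roughly as follows. This is only
a sketch for checking results by hand. The function name rol and the use of
None for "flag unchanged or undefined" are conventions chosen for this
illustration; they are not taken from the manual.

```python
def rol(dest, count, size=64):
    """Model of ROL following the manual's pseudocode.

    Returns (result, cf, of). cf is None when the flag is left unchanged;
    of is None when it is unchanged or undefined.
    """
    mask = (1 << size) - 1
    count_mask = 0x3F if size == 64 else 0x1F
    masked = count & count_mask
    temp_count = masked % size
    dest &= mask
    for _ in range(temp_count):
        msb = dest >> (size - 1)            # bit about to be rotated out
        dest = ((dest << 1) & mask) | msb   # DEST <- (DEST * 2) + tempCF
    cf = None
    of = None
    if masked != 0:
        cf = dest & 1                       # CF <- LSB(DEST)
    if masked == 1:
        of = (dest >> (size - 1)) ^ cf      # OF <- MSB(DEST) XOR CF
    return dest, cf, of


if __name__ == "__main__":
    result, cf, of = rol(0x0123456789ABCDEF, 0x20)
    print(hex(result), cf, of)
```

With a count of 1 on a 32-bit operand, rol(0x40000000, 1, 32) reports OF=1
(the sign bit changed) and rol(0xC0000000, 1, 32) reports OF=0.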

Pseudocode for SHLD:

--------------------------------------------------------------------------
IF (In 64-Bit Mode and REX.W = 1)
    THEN COUNT <- COUNT MOD 64;
    ELSE COUNT <- COUNT MOD 32;
FI
SIZE <- OperandSize;
IF COUNT = 0
    THEN
        No operation;
    ELSE
        IF COUNT > SIZE
            THEN (* Bad parameters *)
                DEST is undefined;
                CF, OF, SF, ZF, AF, PF are undefined;
            ELSE (* Perform the shift *)
                CF <- BIT[DEST, SIZE - COUNT];
                (* Last bit shifted out on exit *)
                FOR i <- SIZE - 1 DOWN TO COUNT
                    DO
                        Bit(DEST, i) <- Bit(DEST, i - COUNT);
                    OD;
                FOR i <- COUNT - 1 DOWN TO 0
                    DO
                        BIT[DEST, i] <- BIT[SRC, i - COUNT + SIZE];
                    OD;
        FI;
FI;
--------------------------------------------------------------------------
If the count is 1 or greater, the CF flag is filled with the last bit
shifted out of the destination operand. For a 1-bit shift, the OF flag is
set if a sign change occurred; otherwise, it is cleared. If the count
operand is 0, flags are not affected.
--------------------------------------------------------------------------

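The CF flag holds the last bit shifted out of the most significant bit. The
pseudocode above does not show how OF changes, but Intel's text description
covers it; it behaves the same as for ROL.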

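Most people use shld with op1 and op2 different, but op1 and op2 may in fact
be the same, in which case it amounts to a rotate. When op1 and op2 are the
same there is no "overlap" effect; think of op2 as being copied into a
temporary variable before the operation.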
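
A minimal Python sketch of the SHLD pseudocode, used here to check that SHLD
with identical operands matches a left rotate for every count from 1 to 63.
The names shld and rotate_left are made up for this illustration, and flags
other than CF are not modeled. Because the function works on copies of its
arguments, it mirrors the "temporary variable" view of op2.

```python
def shld(dest, src, count, size=64):
    """Return (result, cf) for SHLD dest, src, count; cf is None if count is 0."""
    mask = (1 << size) - 1
    count %= 64 if size == 64 else 32
    dest &= mask
    src &= mask
    if count == 0:
        return dest, None                    # no operation, flags unaffected
    if count > size:
        raise ValueError("count > operand size: result is undefined")
    cf = (dest >> (size - count)) & 1        # CF <- BIT[DEST, SIZE - COUNT]
    # High bits come from dest shifted left, low bits from the top of src.
    result = ((dest << count) & mask) | (src >> (size - count))
    return result, cf


def rotate_left(value, count, size=64):
    """Plain left rotate, used only as the reference to compare against."""
    mask = (1 << size) - 1
    value &= mask
    count %= size
    return ((value << count) | (value >> (size - count))) & mask


if __name__ == "__main__":
    x = 0x0123456789ABCDEF
    for c in range(1, 64):
        assert shld(x, x, c)[0] == rotate_left(x, c)
    print(hex(shld(x, x, 0x20)[0]))
```

With different operands the low bits are filled from the top of op2, for
example shld(0x11111111, 0xF0000000, 4, 32) gives 0x1111111F with CF=1.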

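This also turned up some rasm2 bugs. With the x86.olly engine the assembled
bytes are right, but the disassembler prints the immediate 0x20 as 0x32
(apparently decimal 32 with a 0x prefix):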

$ rasm2 -a x86.olly -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20"
0fa4c920c1c120
$ rasm2 -a x86.olly -b 32 -s intel -o 0 -D 0fa4c920c1c120
0x00000000   4                 0fa4c920  shld ecx, ecx, 0x32
0x00000004   3                   c1c120  rol ecx, 0x32

The assembler of the x86 engine does not handle the shld instruction
correctly:

$ rasm2 -a x86 -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20"
c1f100c1c120
$ rasm2 -a x86 -b 32 -s intel -o 0 -D c1f100c1c120
0x00000000   3                   c1f100  shl ecx, 0x0
0x00000003   3                   c1c120  rol ecx, 0x20
$ rasm2 -a x86 -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20" | rasm2 -a x86 -b 32 -s intel -o 0 -D -f -
0x00000000   3                   c1f100  shl ecx, 0x0
0x00000003   3                   c1c120  rol ecx, 0x20

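rol is correct, but shld is assembled as shl, which is a completely
different instruction.

The x86.olly engine does not support 64-bit at all, neither for assembling
nor for disassembling.

The x86 engine supports 64-bit, but as noted above its assembler has a bug: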

$ rasm2 -a x86 -b 64 -s intel -o 0 "shld rcx,rcx,0x20;rol rcx,0x20"
48c1f10048c1c120
$ rasm2 -a x86 -b 64 -s intel -o 0 -D 48c1f10048c1c120
0x00000000   4                 48c1f100  shl rcx, 0x0
0x00000004   4                 48c1c120  rol rcx, 0x20

The correct machine code is:

$ rasm2 -a x86 -b 64 -s intel -o 0 -D 480fa4c92048c1c120
0x00000000   5               480fa4c920  shld rcx, rcx, 0x20
0x00000005   4                 48c1c120  rol rcx, 0x20

Comparing 32-bit and 64-bit, the machine code of each 64-bit instruction has
an extra 0x48 (the REX.W prefix) at the front.
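
The byte sequences above can be reproduced with a few lines of Python. This
is a sketch under narrow assumptions: register-direct operands only, register
numbers 0 to 7 only (r8 to r15 would need further REX bits), and the encodings
"0F A4 /r ib" for SHLD r/m, r, imm8 and "C1 /0 ib" for ROL r/m, imm8 from the
Intel manual. The helper names are invented for this example.

```python
REX_W = 0x48   # REX prefix with W=1: selects 64-bit operand size
RCX = 1        # register number of ecx/rcx


def modrm(reg, rm):
    """ModRM byte with mod=11 (register-direct)."""
    assert 0 <= reg <= 7 and 0 <= rm <= 7
    return 0xC0 | (reg << 3) | rm


def encode_shld(dest, src, imm8, bits=32):
    """SHLD r/m, r, imm8: 0F A4 /r ib (reg field = src, rm field = dest)."""
    body = bytes([0x0F, 0xA4, modrm(src, dest), imm8 & 0xFF])
    return (bytes([REX_W]) if bits == 64 else b"") + body


def encode_rol(dest, imm8, bits=32):
    """ROL r/m, imm8: C1 /0 ib (reg field = opcode extension 0)."""
    body = bytes([0xC1, modrm(0, dest), imm8 & 0xFF])
    return (bytes([REX_W]) if bits == 64 else b"") + body


if __name__ == "__main__":
    for bits in (32, 64):
        code = encode_shld(RCX, RCX, 0x20, bits) + encode_rol(RCX, 0x20, bits)
        print(bits, code.hex())
```

The 32-bit and 64-bit outputs differ only in the leading 0x48 of each
instruction.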

There is a website that offers online assembling and disassembling:

https://defuse.ca/online-x86-assembler.htm

It is fine to use in a pinch.

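The assembler in cdb.exe does not support 64-bit instructions, but the
disassembler does; that is, the a command does not accept 64-bit registers
while the u command does. The windbg help says: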

--------------------------------------------------------------------------
The a command does not support 64-bit instruction mnemonics. However, the
a command is enabled regardless of whether you are debugging a 32-bit
target or a 64-bit target. Because of the similarities between x86 and x64
instructions, you can sometimes use the a command successfully when
debugging a 64-bit target.
--------------------------------------------------------------------------

For a quick test in cdb.exe, just enter the machine code with the eb command.

> .dvalloc 0x1000
Allocated 1000 bytes starting at 00000000`00060000
> r $t0=0`00060000
> !vprot @$t0
BaseAddress:       0000000000060000
AllocationBase:    0000000000060000
AllocationProtect: 00000040  PAGE_EXECUTE_READWRITE
RegionSize:        0000000000001000
State:             00001000  MEM_COMMIT
Protect:           00000040  PAGE_EXECUTE_READWRITE
Type:              00020000  MEM_PRIVATE
> eb @$t0 48 0f a4 c9 20 48 c1 c1 20
> u @$t0 l 2
00000000`00060000 480fa4c920      shld    rcx,rcx,20h
00000000`00060005 48c1c120        rol     rcx,20h
> r rcx=0x0123456789abcdef
> r rip=@$t0
> p
> r cf,rcx
cf=1 rcx=89abcdef01234567
> p
> r cf,rcx
cf=1 rcx=0123456789abcdef

windbg's .dvfree has a bug and cannot free the memory.

bluerust dug up some material on "shld vs rol".

See:

"Intel 64 and IA-32 Architectures Optimization Reference Manual"
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

Section "3.5.1.5 Bitwise Rotation" contains the following:

--------------------------------------------------------------------------
In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate
has 1-cycle throughput, SHLD/SHRD using the same register as source and
destination by an immediate constant has 1-cycle latency with 0.5 cycle
throughput. The "ROL/ROR reg, imm8" instruction has two micro-ops with the
latency of 1-cycle for the rotate register result and 2-cycles for the
flags, if used.

In Intel microarchitecture code name Ivy Bridge, The "ROL/ROR reg, imm8"
instruction with immediate greater than 1, is one micro-op with one-cycle
latency when the overflow flag result is used. When the immediate is one,
dependency on the overflow flag result of ROL/ROR by a subsequent
instruction will see the ROL/ROR instruction with two-cycle latency.
--------------------------------------------------------------------------

For programmers who do not deal much with hardware design, the passage above
is a bit of a mouthful.

One article explains latency and throughput quite well:

Understanding Latency versus Throughput - Sergio Ramirez [2010-09-13]
https://community.cadence.com/cadence_blogs_8/b/sd/archive/2010/09/13/understanding-latency-vs-throughput

--------------------------------------------------------------------------
Latency is the time required to perform some action or to produce some
result. Latency is measured in units of time -- hours, minutes, seconds,
nanoseconds or clock periods.

Throughput is the number of such actions executed or results produced per
unit of time. This is measured in units of whatever is being produced
(cars, motorcycles, I/O samples, memory words, iterations) per unit of
time. The term "memory bandwidth" is sometimes used to specify the
throughput of memory systems.
--------------------------------------------------------------------------

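Going back to the Intel document, the conclusion can be simplified to this:
on Sandy Bridge, SHLD/SHRD by an immediate with the same register as source
and destination has better throughput than ROL/ROR by an immediate.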
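
A rough way to read the Sandy Bridge figures quoted above is a toy cycle
count. The model is deliberately simplified and rests on assumptions: ideal
pipelining, nothing else competing for execution ports, and Intel's
"throughput" figures read as cycles per instruction. The function names are
hypothetical.

```python
def cycles_dependent(n, latency):
    """n instructions where each one needs the previous result."""
    return n * latency


def cycles_independent(n, latency, cycles_per_instruction):
    """n independent instructions issued back to back."""
    if n == 0:
        return 0
    # The first result appears after `latency`; each further instruction
    # adds only the issue interval.
    return latency + (n - 1) * cycles_per_instruction


if __name__ == "__main__":
    n = 1000
    # Figures from the quoted passage: ROL imm has 1-cycle latency for the
    # register result and 1-cycle throughput; SHLD with the same register
    # has 1-cycle latency and 0.5-cycle throughput.
    print("rol  independent:", cycles_independent(n, 1, 1))
    print("shld independent:", cycles_independent(n, 1, 0.5))
    print("rol  dependent:  ", cycles_dependent(n, 1))
    print("shld dependent:  ", cycles_dependent(n, 1))
```

Under this model the difference only appears for independent instructions;
in a dependent chain the equal latencies give the same cycle count.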

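There is basically no legitimate reason to care about this level of detail;
the optimizations in modern compilers are no joke. This is purely my
curiosity flaring up during reverse engineering, so I wrote it down.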