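Title: An Ever-Curious Mind -- SHLD vs ROL

Created: 2017-08-14 14:14
Link: https://scz.617.cn/misc/201708141414.txt

In the 64-bit ntdll!KiUserApcDispatcher() I came across "shld rcx,rcx,0x20",
an instruction I had never used before. The F5 decompilation showed
">> 0x20", so I took it for granted that shld was a 64-bit logical right
shift and did not even notice the difference between l and r. Reading the
assembly later, I found that the code following it made no sense if this
were a logical right shift. A look at the manual settled it:
"shld rcx,rcx,0x20" is equivalent to "rol rcx,0x20". That set my curiosity
off: why not simply use rol, rather than the "uncommon" shld?

In '94 I bought a book on the 80486 at the Xinhua Bookstore in Yuanjialing,
Changsha, for 17 yuan; a very thin book. It is one of only three books that
I did not abandon while drifting north and south. It somehow held out until
the day the revolution was won and made it alive onto my present bookshelf;
for the foreseeable future its troubles are over and it can live out its
years in peace. Flipping through it yesterday, I saw that shld was already
introduced with the 80386. It is a 3-operand instruction, and thinking back
I suddenly realized that I never used a 3-operand instruction in my student
days.

See:

Intel 64 and IA-32 Architectures Software Developer Manual: Vol 2
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf

Pseudocode for ROL:

--------------------------------------------------------------------------
IF OperandSize = 64
    THEN COUNTMASK = 3FH;
    ELSE COUNTMASK = 1FH;
FI;

tempCOUNT <- (COUNT & COUNTMASK) MOD SIZE
WHILE (tempCOUNT <> 0)
    DO
        tempCF <- MSB(DEST);
        DEST <- (DEST * 2) + tempCF;
        tempCOUNT <- tempCOUNT - 1;
    OD;
ELIHW;
IF (COUNT & COUNTMASK) <> 0
    THEN CF <- LSB(DEST);
FI;
IF (COUNT & COUNTMASK) = 1
    THEN OF <- MSB(DEST) XOR CF;
    ELSE OF is undefined;
FI;
--------------------------------------------------------------------------

The CF flag holds the last bit shifted out of the most significant
position. The OF flag is meaningful only when COUNT=1: OF=1 if the rotate
changed the sign bit of the data, OF=0 if the sign bit is the same before
and after. When COUNT>1, the OF flag is meaningless (undefined).

Pseudocode for SHLD:

--------------------------------------------------------------------------
IF (In 64-Bit Mode and REX.W = 1)
    THEN COUNT <- COUNT MOD 64;
    ELSE COUNT <- COUNT MOD 32;
FI
SIZE <- OperandSize;
IF COUNT = 0
    THEN
        No operation;
    ELSE
        IF COUNT > SIZE
            THEN (* Bad parameters *)
                DEST is undefined;
                CF, OF, SF, ZF, AF, PF are undefined;
            ELSE (* Perform the shift *)
                CF <- BIT[DEST, SIZE - COUNT];
                (* Last bit shifted out on exit *)
                FOR i <- SIZE - 1 DOWN TO COUNT
                    DO
                        Bit(DEST, i) <- Bit(DEST, i - COUNT);
                    OD;
                FOR i <- COUNT - 1 DOWN TO 0
                    DO
                        BIT[DEST, i] <- BIT[SRC, i - COUNT + SIZE];
                    OD;
        FI;
FI;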
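--------------------------------------------------------------------------
If the count is 1 or greater, the CF flag is filled with the last bit
shifted out of the destination operand. For a 1-bit shift, the OF flag is
set if a sign change occurred; otherwise, it is cleared. If the count
operand is 0, flags are not affected.
--------------------------------------------------------------------------

The CF flag holds the last bit shifted out of the most significant
position. The pseudocode above does not show how OF changes, but Intel's
text description does mention it; it behaves the same as for ROL.

Most people use shld with op1 and op2 being different registers, but op1
and op2 may in fact be the same, in which case the instruction amounts to a
rotate. When op1 and op2 are the same there is no "overlap" effect; think
of op2 as being copied to a temporary variable before the operation.

This also turned up a few bugs in rasm2.

$ rasm2 -a x86.olly -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20"
0fa4c920c1c120

$ rasm2 -a x86.olly -b 32 -s intel -o 0 -D 0fa4c920c1c120
0x00000000   4                 0fa4c920  shld ecx, ecx, 0x32
0x00000004   3                   c1c120  rol ecx, 0x32

Note the immediate: the bytes encode 0x20, yet the x86.olly disassembler
prints 0x32 (apparently decimal 32 with a hex prefix).

The assembler of the x86 engine does not handle the shld instruction
correctly:

$ rasm2 -a x86 -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20"
c1f100c1c120

$ rasm2 -a x86 -b 32 -s intel -o 0 -D c1f100c1c120
0x00000000   3                   c1f100  shl ecx, 0x0
0x00000003   3                   c1c120  rol ecx, 0x20

$ rasm2 -a x86 -b 32 -s intel -o 0 "shld ecx,ecx,0x20;rol ecx,0x20" | rasm2 -a x86 -b 32 -s intel -o 0 -D -f -
0x00000000   3                   c1f100  shl ecx, 0x0
0x00000003   3                   c1c120  rol ecx, 0x20

rol is correct, but shld is assembled as shl, and these are two completely
different instructions.

The x86.olly engine does not support 64-bit at all, neither for assembling
nor for disassembling.

The x86 engine does support 64-bit, but as noted above its assembler has a
bug:

$ rasm2 -a x86 -b 64 -s intel -o 0 "shld rcx,rcx,0x20;rol rcx,0x20"
48c1f10048c1c120

$ rasm2 -a x86 -b 64 -s intel -o 0 -D 48c1f10048c1c120
0x00000000   4                 48c1f100  shl rcx, 0x0
0x00000004   4                 48c1c120  rol rcx, 0x20

The correct machine code is:

$ rasm2 -a x86 -b 64 -s intel -o 0 -D 480fa4c92048c1c120
0x00000000   5               480fa4c920  shld rcx, rcx, 0x20
0x00000005   4                 48c1c120  rol rcx, 0x20

Comparing 32-bit with 64-bit, each instruction's byte sequence in the
latter gains a leading 0x48 (the REX.W prefix).

There is a website offering online assembling and disassembling:

https://defuse.ca/online-x86-assembler.htm

It is fine to use in a pinch.

The assembler in cdb.exe does not support 64-bit instructions, but its
disassembler does; that is, the a command does not accept 64-bit registers
while the u command handles them. The windbg help says:

--------------------------------------------------------------------------
The a command does not support 64-bit instruction mnemonics.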
However, the a command is enabled regardless of whether you are debugging
a 32-bit target or a 64-bit target. Because of the similarities between
x86 and x64 instructions, you can sometimes use the a command successfully
when debugging a 64-bit target.
--------------------------------------------------------------------------

For a quick test in cdb.exe, just enter the machine code with the eb
command.

> .dvalloc 0x1000
Allocated 1000 bytes starting at 00000000`00060000
> r $t0=0`00060000
> !vprot @$t0
BaseAddress:       0000000000060000
AllocationBase:    0000000000060000
AllocationProtect: 00000040  PAGE_EXECUTE_READWRITE
RegionSize:        0000000000001000
State:             00001000  MEM_COMMIT
Protect:           00000040  PAGE_EXECUTE_READWRITE
Type:              00020000  MEM_PRIVATE
> eb @$t0 48 0f a4 c9 20 48 c1 c1 20
> u @$t0 l 2
00000000`00060000 480fa4c920      shld    rcx,rcx,20h
00000000`00060005 48c1c120        rol     rcx,20h
> r rcx=0x0123456789abcdef
> r rip=@$t0
> p
> r cf,rcx
cf=1 rcx=89abcdef01234567
> p
> r cf,rcx
cf=1 rcx=0123456789abcdef

windbg's .dvfree has a bug and fails to free the allocation.

bluerust dug up some material on "shld vs rol".

See:

Intel 64 and IA-32 Architectures Optimization Reference Manual
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

The section "3.5.1.5 Bitwise Rotation" contains the following:

--------------------------------------------------------------------------
In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate
has 1-cycle throughput, SHLD/SHRD using the same register as source and
destination by an immediate constant has 1-cycle latency with 0.5 cycle
throughput. The "ROL/ROR reg, imm8" instruction has two micro-ops with the
latency of 1-cycle for the rotate register result and 2-cycles for the
flags, if used.

In Intel microarchitecture code name Ivy Bridge, The "ROL/ROR reg, imm8"
instruction with immediate greater than 1, is one micro-op with one-cycle
latency when the overflow flag result is used.
When the immediate is one, dependency on the overflow flag result of
ROL/ROR by a subsequent instruction will see the ROL/ROR instruction with
two-cycle latency.
--------------------------------------------------------------------------

For programmers who rarely deal with hardware design, the passage above is
a bit of a mouthful.

One article gives a good explanation of latency and throughput:

Understanding Latency versus Throughput - Sergio Ramirez [2010-09-13]
https://community.cadence.com/cadence_blogs_8/b/sd/archive/2010/09/13/understanding-latency-vs-throughput

--------------------------------------------------------------------------
Latency is the time required to perform some action or to produce some
result. Latency is measured in units of time -- hours, minutes, seconds,
nanoseconds or clock periods.

Throughput is the number of such actions executed or results produced per
unit of time. This is measured in units of whatever is being produced
(cars, motorcycles, I/O samples, memory words, iterations) per unit of
time. The term "memory bandwidth" is sometimes used to specify the
throughput of memory systems.
--------------------------------------------------------------------------

Going back to the Intel document, the conclusion can be boiled down to
this: on Sandy Bridge, SHLD/SHRD with the same register as source and
destination and an immediate count is faster than ROL/ROR by an immediate
(0.5-cycle versus 1-cycle throughput).

There is hardly any legitimate reason to care about details like this;
modern compilers' optimizations are no joke. This is purely my curiosity
flaring up in the middle of reverse engineering, hence this note.
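
The equivalence between "shld rcx,rcx,0x20" and "rol rcx,0x20" noted earlier
can also be checked without a debugger. The following is only a minimal
Python sketch that transcribes the two pieces of manual pseudocode for
64-bit operands; the names MASK64, rol64 and shld64 are made up for this
illustration, the OF flag is not modelled, and a CF of None stands for
"flag not affected".

```python
MASK64 = (1 << 64) - 1  # hypothetical helper: keeps values within 64 bits


def rol64(dest, count):
    """ROL on a 64-bit operand, following the manual's pseudocode.

    Returns (result, cf); cf is None when the masked count is 0.
    """
    masked = count & 0x3F          # COUNTMASK = 3FH for 64-bit operands
    dest &= MASK64
    for _ in range(masked % 64):   # tempCOUNT = (COUNT & COUNTMASK) MOD SIZE
        msb = (dest >> 63) & 1
        dest = ((dest << 1) & MASK64) | msb
    cf = (dest & 1) if masked != 0 else None   # CF <- LSB(DEST)
    return dest, cf


def shld64(dest, src, count):
    """SHLD on 64-bit operands, following the manual's pseudocode.

    Returns (result, cf); cf is None when the count is 0 (no operation).
    """
    count %= 64                    # COUNT <- COUNT MOD 64
    dest &= MASK64
    src &= MASK64
    if count == 0:
        return dest, None
    cf = (dest >> (64 - count)) & 1            # CF <- BIT[DEST, SIZE - COUNT]
    # Shift DEST left, then fill the vacated low bits from the top of SRC.
    result = ((dest << count) & MASK64) | (src >> (64 - count))
    return result, cf


if __name__ == "__main__":
    rcx = 0x0123456789ABCDEF
    rcx, cf = shld64(rcx, rcx, 0x20)
    print(f"cf={cf} rcx={rcx:016x}")
    rcx, cf = rol64(rcx, 0x20)
    print(f"cf={cf} rcx={rcx:016x}")
```

Run as a script, this should print the same two cf/rcx pairs as the cdb.exe
session. Note that shld64 reads src before producing the result, which
mirrors the "temporary variable" way of thinking about identical operands.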