Title: How to Debug a Child Process with GDB on a Legacy System

Created: 2019-07-05 18:34
Updated: 2019-07-09 09:22
Link: https://scz.617.cn/unix/201907051834.txt

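This article is not an introduction. It has strong prerequisites: readers
are expected to have a clear understanding of the OS and of GDB. Beginners
should not waste their time here.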

On Windows, windbg supports the "-o" command-line option, or you can enter
".childdbg 1" at the interactive prompt to explicitly enable child-process
debugging.

On modern Linux, gdb supports child-process debugging through the following
settings:

set follow-fork-mode child
set follow-exec-mode new
set detach-on-fork on

See also:

--------------------------------------------------------------------------
《How to Debug a Child Process with GDB》
https://scz.617.cn/unix/201206011217.txt

《How to Stop as Early as Possible When GDB Starts the Debuggee》
https://scz.617.cn/unix/201901161404.txt

《start the inferior without using a shell》
https://scz.617.cn/unix/201901171036.txt
--------------------------------------------------------------------------

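Across the various *nix variants, whether gdb can debug a child process
directly depends on the specific kernel, libc, and gdb implementations, so
the gdb settings above may not work. This article looks at child-process
debugging under extremely inconvenient conditions on a legacy system.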

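The following approaches are not considered:

a) Intercepting fork() and changing its return value to force the child path
b) Having the source code and inserting sleep() in the child path to wait for an attach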

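The test system is x86/FreeBSD 6.1. It is old, but some devices are derived
from it, so this odd requirement clearly does not come from an ivory tower.
The experiment: debug what happens when id is run from csh.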

Locate the PLT entries for fork(), exec(), and sleep() in csh:

$ objdump -j .plt -d csh | grep -A3 "fork@plt>:"
08049ea4 <fork@plt>:
 8049ea4:       ff 25 c4 fa 08 08       jmp    *0x808fac4
 8049eaa:       68 98 02 00 00          push   $0x298
 8049eaf:       e9 b0 fa ff ff          jmp    8049964 <_init+0x14>
--
0804a054 <vfork@plt>:
 804a054:       ff 25 30 fb 08 08       jmp    *0x808fb30
 804a05a:       68 70 03 00 00          push   $0x370
 804a05f:       e9 00 f9 ff ff          jmp    8049964 <_init+0x14>

$ objdump -j .plt -d csh | grep -A3 "exec.*@plt>:"
08049eb4 <execv@plt>:
 8049eb4:       ff 25 c8 fa 08 08       jmp    *0x808fac8
 8049eba:       68 a0 02 00 00          push   $0x2a0
 8049ebf:       e9 a0 fa ff ff          jmp    8049964 <_init+0x14>

$ objdump -j .plt -d csh | grep -A3 "sleep@plt>:"
08049ba4 <sleep@plt>:
 8049ba4:       ff 25 04 fa 08 08       jmp    *0x808fa04
 8049baa:       68 18 01 00 00          push   $0x118
 8049baf:       e9 b0 fd ff ff          jmp    8049964 <_init+0x14>

This version of csh imports fork(), vfork(), execv(), and sleep().
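As a cross-check on the objdump output above, an i386 PLT entry has a fixed
three-instruction shape, so the GOT slot it jumps through can be read
straight from the bytes. The sketch below is illustrative only: the helper
name decode_plt_stub is hypothetical, and the byte strings are copied from
the fork@plt and vfork@plt listings above.

```python
import struct


def decode_plt_stub(stub: bytes):
    """Decode one 16-byte i386 PLT entry (hypothetical helper).

    Expected layout:
        ff 25 <abs32>   jmp  *GOT_slot
        68 <imm32>      push reloc_offset
        e9 <rel32>      jmp  PLT[0]
    Returns (GOT slot address, relocation offset).
    """
    if (len(stub) != 16 or stub[0:2] != b"\xff\x25"
            or stub[6] != 0x68 or stub[11] != 0xE9):
        raise ValueError("not an i386 PLT entry")
    (got_slot,) = struct.unpack_from("<I", stub, 2)
    (reloc_offset,) = struct.unpack_from("<I", stub, 7)
    return got_slot, reloc_offset


# Bytes copied from the objdump listing above.
FORK_PLT = bytes.fromhex("ff25c4fa0808" "6898020000" "e9b0faffff")
VFORK_PLT = bytes.fromhex("ff2530fb0808" "6870030000" "e900f9ffff")

if __name__ == "__main__":
    for name, stub in (("fork", FORK_PLT), ("vfork", VFORK_PLT)):
        got, off = decode_plt_stub(stub)
        print(f"{name}: GOT slot {got:#x}, reloc offset {off:#x}")
```

The GOT slot is the address that a GOT-based redirection of the function
would have to overwrite.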

Check the PID of csh:

$ echo $$
28416

Attach to it from another SSH session:

$ gdb-7.6 -q -nx -x /tmp/gdbinit_x86_bsd.txt -p 28416

(gdb) display/5i $pc

Dynamic debugging confirms that csh uses vfork() when it runs id. The flow
around the caller of vfork(), as analyzed with IDA:

--------------------------------------------------------------------------
08064DB8 A1 38 08 09 08              mov     eax, ds:dword_8090838
...
08064DD8 85 C0                       test    eax, eax
...
08064E0E 0F 84 1A 03 00 00           jz      loc_806512E
08064E14 E8 8B 50 FE FF              call    _fork
...
0806512E E8 21 4F FE FF              call    _vfork
--------------------------------------------------------------------------
if ( dword_8090838 )
{
    pid = fork();
}
else
{
    pid = vfork();
}
--------------------------------------------------------------------------
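To tie the IDA listing back to the PLT entries found with objdump, the
rel32 operand of each call can be decoded by hand. The sketch below is a
minimal illustration; call_target is a hypothetical helper, and the
addresses and bytes are copied from the listing above.

```python
import struct


def call_target(addr: int, insn: bytes) -> int:
    """Return the destination of a 5-byte 'call rel32' (opcode E8) at addr."""
    if len(insn) != 5 or insn[0] != 0xE8:
        raise ValueError("not a call rel32 instruction")
    (rel,) = struct.unpack_from("<i", insn, 1)  # signed 32-bit displacement
    # The displacement is relative to the address of the next instruction.
    return (addr + 5 + rel) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(call_target(0x08064E14, bytes.fromhex("E88B50FEFF"))))  # call _fork
    print(hex(call_target(0x0806512E, bytes.fromhex("E8214FFEFF"))))  # call _vfork
```

Both results land on the fork@plt and vfork@plt addresses reported by
objdump, which confirms that _fork and _vfork in IDA are the PLT stubs.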

The following snippet sits at the first instruction executed when pid is 0,
that is, the starting point of the child path after vfork():

--------------------------------------------------------------------------
08064E31 A1 E0 10 0A 08              mov     eax, ds:dword_80A10E0
08064E36 31 FF                       xor     edi, edi
08064E38 85 C0                       test    eax, eax
08064E3A 0F 85 9C 05 00 00           jnz     loc_80653DC
--------------------------------------------------------------------------

(gdb) b *0x8064e36
Breakpoint 1 at 0x8064e36
(gdb) c

If you simply set a breakpoint at 0x8064e36 and run id in csh, this happens:

$ id
Trace/BPT trap (core dumped)

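The int3 (0xcc) is hit in the child and raises SIGTRAP. Ideally, a gdb that
supports child-process debugging would catch and handle that signal. In the
"FreeBSD 6.1 + GDB 7.6" setup, however, gdb cannot debug the child directly
and does not catch the signal, so it is delivered straight to the child.
The child has no SIGTRAP handler installed, so the default action applies
and the child is terminated.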
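The default-action behavior described above can be reproduced without gdb.
The sketch below is an assumption-laden stand-in: instead of executing a
real int3, the child sends SIGTRAP to itself, and the function name
child_killed_by_sigtrap is hypothetical. It only illustrates that an
unhandled SIGTRAP terminates a process that no debugger is tracing.

```python
import os
import resource
import signal


def child_killed_by_sigtrap():
    """Fork a child that raises SIGTRAP with no handler and no debugger.

    Returns the number of the signal that terminated the child, or None if
    the child exited normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: keep the default disposition, as in the untraced csh child.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))  # no core file
        signal.signal(signal.SIGTRAP, signal.SIG_DFL)
        os.kill(os.getpid(), signal.SIGTRAP)  # stand-in for hitting int3
        os._exit(0)  # only reached if the signal did not terminate us
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


if __name__ == "__main__":
    print(child_killed_by_sigtrap())
```

The parent sees the child reported as killed by SIGTRAP, which is what csh
prints as "Trace/BPT trap".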

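So the child cannot be debugged the simple way. The idea is to modify code
on the child path to create an infinite loop, then attach to the child with
GDB from another SSH session. The patch point is 0x8064e36.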

$ kstoolex x32nasm "jmp ." 0 0x8064e36
0000000008064e36 [ eb fe ] jmp .
$ rasm2 -a x86 -b 32 -s intel -o 0x8064e36 -D "eb fe"
0x08064e36   2                     ebfe  jmp 0x8064e36

(gdb) x/3i 0x8064e36
   0x8064e36:   xor    edi,edi
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
(gdb) x/1wx 0x8064e36
0x8064e36:      0xc085ff31
(gdb) set *(short int*)0x8064e36=0xfeeb
(gdb) x/1i 0x8064e36
   0x8064e36:   jmp    0x8064e36
(gdb) x/1wx 0x8064e36
0x8064e36:      0xc085feeb

0x8064e36 is now an infinite loop.
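For readers without kstoolex or rasm2 at hand, the two-byte patch can be
derived by hand. The sketch below assumes the standard x86 short-jump
encoding (EB rel8); encode_jmp_short is a hypothetical helper, and the
original dword 0xc085ff31 is taken from the x/1wx output above.

```python
def encode_jmp_short(addr: int, target: int) -> bytes:
    """Encode 'jmp short target' placed at addr (opcode EB, signed rel8)."""
    rel = target - (addr + 2)  # relative to the end of the 2-byte instruction
    if not -128 <= rel <= 127:
        raise ValueError("target out of range for a short jump")
    return bytes([0xEB, rel & 0xFF])


PATCH_ADDR = 0x8064E36
loop = encode_jmp_short(PATCH_ADDR, PATCH_ADDR)   # "jmp ." jumps to itself
value16 = int.from_bytes(loop, "little")          # operand for the gdb 'set'

# Overwriting only the low 16 bits of the original dword gives what
# x/1wx shows after the patch.
original = 0xC085FF31
patched = (original & 0xFFFF0000) | value16

if __name__ == "__main__":
    print(loop.hex(), hex(value16), hex(patched))
```

The little-endian view explains why the gdb command writes 0xfeeb even
though the bytes in memory are eb fe.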

(gdb) c

Run id in csh to trigger the infinite loop in the child. Find the child PID
from another SSH session:

$ ps l | grep csh | grep 28416
 1000 28416 28601   0   8  0  4728  3632 ppwait DX+   p1    0:00.01 csh
 1000 28605 28416 172 117  0  4728  3632 -      RV+   p1    0:46.02 csh

$ ps uwx -p 28605
USER    PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
scz   28605 99.0  1.4  4728  3632  p1  RV+  11:25AM  13:00.37 csh

28605 is the child PID. Its 99% CPU usage is the visible symptom of the
infinite loop.
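Picking the child out of the ps output can also be scripted. The sketch
below is only an illustration: it assumes the "ps l" column order seen in
the sample above (UID, PID, PPID, CPU, PRI, NI, VSZ, RSS, MWCHAN, STAT, TT,
TIME, COMMAND), and the helper name children_of is hypothetical.

```python
SAMPLE = """\
 1000 28416 28601   0   8  0  4728  3632 ppwait DX+   p1    0:00.01 csh
 1000 28605 28416 172 117  0  4728  3632 -      RV+   p1    0:46.02 csh
"""


def children_of(ps_l_output: str, parent_pid: int):
    """Return [(pid, stat), ...] for rows whose PPID equals parent_pid.

    Assumes whitespace-separated 'ps l' rows without a header line, with
    PID in column 2, PPID in column 3 and STAT in column 10.
    """
    found = []
    for line in ps_l_output.splitlines():
        fields = line.split()
        if len(fields) < 13 or not fields[1].isdigit():
            continue  # skip headers, blank lines, or truncated rows
        pid, ppid, stat = int(fields[1]), int(fields[2]), fields[9]
        if ppid == parent_pid:
            found.append((pid, stat))
    return found


if __name__ == "__main__":
    print(children_of(SAMPLE, 28416))
```

Matching on the PPID column is more precise than "grep 28416", which also
matches the parent's own row.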

The STAT column flags mean:

D   Marks a process in disk (or other short term, uninterruptible) wait.
I   Marks a process that is idle (sleeping for longer than about 20 seconds).
L   Marks a process that is waiting to acquire a lock.
R   Marks a runnable process.
S   Marks a process that is sleeping for less than about 20 seconds.
T   Marks a stopped process.
W   Marks an idle interrupt thread.
Z   Marks a dead process (a ``zombie'').

+   The process is in the foreground process group of its control terminal.
<   The process has raised CPU scheduling priority.
E   The process is trying to exit.
J   Marks a process which is in jail(2).  The hostname of the prison can be found in /proc/<pid>/status.
L   The process has pages locked in core (for example, for raw I/O).
N   The process has reduced CPU scheduling priority (see setpriority(2)).
s   The process is a session leader.
V   The process is suspended during a vfork(2).
W   The process is swapped out.
X   The process is being traced or debugged.

FreeBSD and Linux display the STAT column differently; do not consult the
Linux man page.
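The table above can be turned into a small lookup so that strings such as
"RV+" or "RXV+" are easier to read. This is a sketch based only on the
table quoted above; the descriptions are abbreviated and the helper name
decode_stat is hypothetical. The first character is treated as the process
state and the remaining characters as flags.

```python
STATE = {
    "D": "uninterruptible wait",
    "I": "idle (sleeping longer than about 20 seconds)",
    "L": "waiting to acquire a lock",
    "R": "runnable",
    "S": "sleeping (less than about 20 seconds)",
    "T": "stopped",
    "W": "idle interrupt thread",
    "Z": "zombie",
}

FLAGS = {
    "+": "foreground process group",
    "<": "raised scheduling priority",
    "E": "trying to exit",
    "J": "in jail(2)",
    "L": "pages locked in core",
    "N": "reduced scheduling priority",
    "s": "session leader",
    "V": "suspended during vfork(2)",
    "W": "swapped out",
    "X": "traced or debugged",
}


def decode_stat(stat: str):
    """Split a FreeBSD ps STAT string into state and flag descriptions."""
    if not stat:
        raise ValueError("empty STAT string")
    state = STATE.get(stat[0], "unknown state")
    flags = [FLAGS.get(ch, "unknown flag") for ch in stat[1:]]
    return [state] + flags


if __name__ == "__main__":
    for s in ("DX+", "RV+", "RXV+", "IV+"):
        print(s, "->", ", ".join(decode_stat(s)))
```

Keeping two separate tables matters because L and W mean different things
in the state position and in the flag position.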

Attach to the child from another SSH session:

$ gdb-7.6 -q -nx -x /tmp/gdbinit_x86_bsd.txt -p 28605

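After the attach, the expected "(gdb)" prompt never appears. This gdb hangs,
and so does the first one; Ctrl-C interrupts neither of them. I expected the
second gdb to stop near 0x8064e36, but evidently something else interferes.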

$ ps uwx -p 28605
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
scz  28605 99.0  1.4  4728  3632  p1  RXV+ 11:25AM  18:49.31 csh

The X in the STAT column shows that the child is being debugged, so why did
gdb not stop?

$ kill -9 28605

The second gdb shows:

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb)

The first gdb is no longer hung and can be interrupted with Ctrl-C.

Since the infinite loop misbehaves, try sleep() instead.

$ echo -n -e "push 0x7fffffff\ncall 0x8049ba4" | kstoolex x32nasm - 0 0x8064e36
0000000008064e36 [ 68 ff ff ff 7f ] push 0x7fffffff
0000000008064e3b [ e8 64 4d fe ff ] call 0x8049ba4
$ echo -n -e "push 0x7fffffff\ncall 0x8049ba4" | kstoolex x32nasm - q 0x8064e36
68 ff ff ff 7f e8 64 4d fe ff
$ rasm2 -a x86 -b 32 -s intel -o 0x8064e36 -D "68 ff ff ff 7f e8 64 4d fe ff"
0x08064e36   5               68ffffff7f  push 0x7fffffff
0x08064e3b   5               e8644dfeff  call 0x8049ba4

Look at the original bytes:

(gdb) x/3i 0x8064e36
   0x8064e36:   xor    edi,edi
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
(gdb) x/3wx 0x8064e36
0x8064e36:      0xc085ff31      0x059c850f      0x94a10000
(gdb) db 0x8064e36 10
08064e36: 31 ff 85 c0 0f 85 9c 05 00 00                    1.........

Patch:

set *(int*)0x8064e36=0xffffff68
set *(int*)(0x8064e36+4)=0x4d64e87f
set *(short int*)(0x8064e36+8)=0xfffe

UnPatch:

set *(int*)0x8064e36=0xc085ff31
set *(int*)(0x8064e36+4)=0x059c850f
set *(int*)(0x8064e36+8)=0x94a10000
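The Patch values above can be derived mechanically from the instruction
bytes. The sketch below is one possible way to do it, not the tooling used
in this article: encode_push_call and gdb_set_commands are hypothetical
helpers, the encodings assumed are the standard x86 "push imm32" (68) and
"call rel32" (E8), and the addresses come from the listings above.

```python
import struct


def encode_push_call(addr: int, imm32: int, target: int) -> bytes:
    """Encode 'push imm32; call target' as placed at addr."""
    push = b"\x68" + struct.pack("<I", imm32)
    call_addr = addr + len(push)
    rel = target - (call_addr + 5)  # relative to the instruction after call
    call = b"\xe8" + struct.pack("<i", rel)
    return push + call


def gdb_set_commands(addr: int, data: bytes):
    """Split data into little-endian 4/2/1-byte writes as gdb 'set' commands."""
    cmds, off = [], 0
    while off < len(data):
        remaining = len(data) - off
        if remaining >= 4:
            size, ctype = 4, "int"
        elif remaining >= 2:
            size, ctype = 2, "short int"
        else:
            size, ctype = 1, "char"
        value = int.from_bytes(data[off:off + size], "little")
        where = f"{addr:#x}" if off == 0 else f"({addr:#x}+{off})"
        cmds.append(f"set *({ctype}*){where}={value:#x}")
        off += size
    return cmds


# sleep@plt is at 0x8049ba4 according to the objdump output above.
PATCH = encode_push_call(0x8064E36, 0x7FFFFFFF, 0x8049BA4)

if __name__ == "__main__":
    print(PATCH.hex(" "))
    print("\n".join(gdb_set_commands(0x8064E36, PATCH)))
```

Writing exactly 10 bytes (4 + 4 + 2) avoids touching the two bytes at
0x8064e40 and 0x8064e41, which belong to the next original instruction.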

Look at the patched code:

(gdb) x/3i 0x8064e36
   0x8064e36:   push   0x7fffffff
   0x8064e3b:   call   0x8049ba4 <sleep@plt>
   0x8064e40:   mov    eax,ds:0x80d6194
(gdb) x/3wx 0x8064e36
0x8064e36:      0xffffff68      0x4d64e87f      0x94a1fffe
(gdb) db 0x8064e36 10
08064e36: 68 ff ff ff 7f e8 64 4d fe ff                    h.....dM..

0x8064e36 now calls sleep(0x7fffffff). sleep() takes its argument in
seconds, so this is long enough.

(gdb) c

Run id in csh to trigger sleep(0x7fffffff) in the child. Find the child PID
from another SSH session:

$ ps l | grep csh | grep 28416
 1000 28416 28601   0   8  0  4728  3632 ppwait DX+   p1    0:00.01 csh
 1000 28704 28416   0   8  0  4728  3632 nanslp IV+   p1    0:00.00 csh

$ ps uwx -p 28704
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
scz  28704  0.0  1.4  4728  3632  p1  IV+  12:01PM   0:00.00 csh

28704 is the child PID. The I in the STAT column shows that the child is in
sleep().

Attach to the child from another SSH session:

$ gdb-7.6 -q -nx -x /tmp/gdbinit_x86_bsd.txt -p 28704

0x2804e3c8 in .rtld_start () from /libexec/ld-elf.so.1
(gdb) display/5i $pc
1: x/5i $pc
=> 0x2804e3c8 <.rtld_start>:    xor    ebp,ebp
   0x2804e3ca <.rtld_start+2>:  mov    eax,esp
   0x2804e3cc <.rtld_start+4>:  mov    esi,esp
   0x2804e3ce <.rtld_start+6>:  and    esp,0xfffffff0
   0x2804e3d1 <.rtld_start+9>:  sub    esp,0x10

Unlike the infinite-loop case, the child in sleep() does stop after the
attach:

(gdb) x/3i 0x8064e36
   0x8064e36:   Cannot access memory at address 0x8064e36
(gdb) info file
Symbols from "/usr/bin/id".
Unix child process:
        Using the running image of attached process 28704.
        While running this, GDB does not access memory from...
Local exec file:
        `/usr/bin/id', file type elf32-i386-freebsd.
        Entry point: 0x80489ec
        ...

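I assumed the child was still csh at this point, but gdb reports id, the
patch address cannot be read, and $PC is neither in sleep() nor near
0x8064e36 as I imagined.

Execution has already switched into id's address space, but very early:
control has not yet reached id's e_entry. A temporary breakpoint on id's
e_entry gets hit: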

(gdb) tb *0x80489ec
Temporary breakpoint 1 at 0x80489ec
(gdb) c
Continuing.

Program received signal SIGSTOP, Stopped (signal).
0x2804e3c8 in .rtld_start () from /libexec/ld-elf.so.1
1: x/5i $pc
=> 0x2804e3c8 <.rtld_start>:    xor    ebp,ebp
   0x2804e3ca <.rtld_start+2>:  mov    eax,esp
   0x2804e3cc <.rtld_start+4>:  mov    esi,esp
   0x2804e3ce <.rtld_start+6>:  and    esp,0xfffffff0
   0x2804e3d1 <.rtld_start+9>:  sub    esp,0x10
(gdb) c
Continuing.

Temporary breakpoint 1, 0x080489ec in ?? ()
1: x/5i $pc
=> 0x80489ec:   push   ebp
   0x80489ed:   mov    ebp,esp
   0x80489ef:   push   edi
   0x80489f0:   push   esi
   0x80489f1:   push   ebx
(gdb) c
Continuing.
[Inferior 1 (process 28704) exited normally]

Keep issuing c and id finishes normally, with its output.

The original flow at 0x8064e36 is:

--------------------------------------------------------------------------
v35 = 0;
if ( dword_80A10E0 )
{
    sigsetmask( ::mask );
    dword_80A10E0   = 0;
}
--------------------------------------------------------------------------

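With the sleep() patch in place the snippet above is never executed, but
this small difference does not affect the later execution of id, which is
why id finished normally after the repeated c commands.

At this point a question comes to mind: is this strange behavior related to
vfork()?

vfork(2) says: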

--------------------------------------------------------------------------
The vfork() system call can be used to create new processes without fully
copying the address space of the old process, which is horrendously
inefficient in a paged environment. It is useful when the purpose of
fork(2) would have been to create a new system context for an execve(2).
The vfork() system call differs from fork(2) in that the child borrows the
parent's memory and thread of control until a call to execve(2) or an exit
(either by a call to _exit(2) or abnormally). The parent process is
suspended while the child is using its resources.

The vfork() system call returns 0 in the child's context and (later) the
pid of the child in the parent's context.

The vfork() system call can normally be used just like fork(2). It does
not work, however, to return while running in the child's context from the
procedure that called vfork() since the eventual return from vfork() would
then return to a no longer existent stack frame. Be careful, also, to call
_exit(2) rather than exit(3) if you cannot execve(2), since exit(3) will
flush and close standard I/O channels, and thereby mess up the parent
processes standard I/O data structures. (Even with fork(2) it is wrong to
call exit(3) since buffered data would then be flushed twice.)

This system call will be eliminated when proper system sharing mechanisms
are implemented. Users should not depend on the memory sharing semantics
of vfork() as it will, in that case, be made synonymous to fork(2).

To avoid a possible deadlock situation, processes that are children in the
middle of a vfork() are never sent SIGTTOU or SIGTTIN signals; rather,
output or ioctl(2) calls are allowed and input attempts result in an
end-of-file indication.
--------------------------------------------------------------------------

W. Richard Stevens compares fork() and vfork() in section 8.4 of APUE. Two
points deserve attention:

--------------------------------------------------------------------------
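a)

A child created by vfork() shares the parent's address space until it calls
exec*(). During that window, any memory the child modifies affects the
parent; "memory" here includes global variables and the stack.

b)

A child created by vfork() is scheduled ahead of the parent. The parent only
gets to run after the child calls exec*() (or exits). If the child depends
on any action by the parent during that window, a deadlock results.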
--------------------------------------------------------------------------

A guess at what happens when the child is attached in the infinite-loop and
sleep() states described earlier:

--------------------------------------------------------------------------
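For a child created by vfork(), an attaching GDB only reaches the "(gdb)"
prompt after the child calls exec*(). Before that, the attach does affect
the child's execution, but gdb does not stop at the prompt.

Infinite-loop case: the child can never call exec*(), so the attaching gdb
never stops, and the second gdb appears hung. The parent never gets
scheduled, so the first gdb appears hung as well.

sleep() case: the attach interrupts the child's nanosleep(), which returns
-1 with errno set to EINTR. The child continues from 0x8064e40 until it
calls exec*(), and only then does gdb reach the "(gdb)" prompt. This
explains why "info file" shows id rather than csh, and why 0x8064e36 cannot
be accessed: the address-space layout is no longer that of csh.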
--------------------------------------------------------------------------

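The above is only a plausible guess; I did not debug the FreeBSD kernel or
the GDB 7.6 code.

The code at 0x8064db8 shows that fork() is called if the value at 0x8090838
is nonzero, and vfork() otherwise.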

(gdb) x/1wx 0x8090838
0x8090838:      0x00000000

As gdb shows, the value at 0x8090838 is 0. As an experiment, set it to 1 to
force the parent to call fork(), and see whether the child can be debugged
this time.

set *(int*)0x8090838=1
set *(int*)0x8064e36=0xffffff68
set *(int*)(0x8064e36+4)=0x4d64e87f
set *(short int*)(0x8064e36+8)=0xfffe

Test the sleep() case.

(gdb) c

Run id in csh to trigger sleep(0x7fffffff) in the child. Find the child PID
from another SSH session:

$ ps l | grep csh | grep 28416
 1000 28416 28601   0  20  0  4728  3632 pause  IX+   p1    0:00.02 csh
 1000 28888 28416   0   8  0  4728  3632 nanslp I+    p1    0:00.00 csh

$ ps uwx -p 28888
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
scz  28888  0.0  1.4  4728  3632  p1  I+    1:15PM   0:00.00 csh

28888 is the child PID.

Attach to the child from another SSH session:

$ gdb-7.6 -q -nx -x /tmp/gdbinit_x86_bsd.txt -p 28888

0x281aa217 in nanosleep () from /lib/libc.so.6
(gdb) display/5i $pc
1: x/5i $pc
=> 0x281aa217 <nanosleep+7>:    jb     0x281aa1fc
   0x281aa219 <nanosleep+9>:    ret
   0x281aa21a <nanosleep+10>:   nop
   0x281aa21b <nanosleep+11>:   nop
   0x281aa21c <nanosleep+12>:   push   ebx
(gdb) bt
#0  0x281aa217 in nanosleep () from /lib/libc.so.6
#1  0x2818e669 in sleep () from /lib/libc.so.6
#2  0x08064e40 in ?? ()
#3  0x08064a24 in ?? ()
#4  0x0804a8f5 in ?? ()
#5  0x0804c8bc in ?? ()
#6  0x0804a26a in ?? ()
#7  0x00000001 in ?? ()
(gdb) info file
Symbols from "/bin/csh".
Unix child process:
        Using the running image of attached process 28888.
        While running this, GDB does not access memory from...
Local exec file:
        `/bin/csh', file type elf32-i386-freebsd.
        Entry point: 0x804a1f4
        ...

As expected, it stops inside nanosleep(). The backtrace contains 0x8064e40,
and the child is still csh at this point.

Restore the pre-patch state in the child:

set *(int*)0x8090838=0
set *(int*)0x8064e36=0xc085ff31
set *(int*)(0x8064e36+4)=0x059c850f
set *(int*)(0x8064e36+8)=0x94a10000

(gdb) x/1wx 0x8090838
0x8090838:      0x00000000
(gdb) x/3i 0x8064e36
   0x8064e36:   xor    edi,edi
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc

Modify the return address of sleep() saved on the stack so that it points
to 0x8064e36:

(gdb) frame 1
#1  0x2818e669 in sleep () from /lib/libc.so.6
(gdb) x/2wx $ebp
0xbfbf1be4:     0xbfbf40d8      0x08064e40
(gdb) set *(int*)($ebp+4)=0x8064e36
(gdb) bt 3
#0  0x281aa217 in nanosleep () from /lib/libc.so.6
#1  0x2818e669 in sleep () from /lib/libc.so.6
#2  0x08064e36 in ?? ()
(More stack frames follow...)

Set a temporary breakpoint at 0x8064e36 and let it hit:

(gdb) tb *0x08064e36
Temporary breakpoint 1 at 0x8064e36
(gdb) c
Continuing.

Temporary breakpoint 1, 0x08064e36 in ?? ()
1: x/5i $pc
=> 0x8064e36:   xor    edi,edi
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
   0x8064e40:   mov    eax,ds:0x80d6194
   0x8064e45:   test   eax,eax
(gdb) i r eax esp
eax            0x7ffffff7       2147483639
esp            0xbfbf1bec       0xbfbf1bec

Restore eax:

(gdb) x/1wx 0x80a10e0
0x80a10e0:      0x00000001
(gdb) set $eax=1

The "push 0x7fffffff" patched in at 0x8064e36 consumed 4 extra bytes of
stack, so esp must be restored:

(gdb) set $esp+=4
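The 4-byte adjustment can be checked with simple stack bookkeeping. The
sketch below is a toy model that assumes the usual i386 behavior: push and
call each lower esp by 4, ret raises it by 4, and the caller is responsible
for removing the argument. The original esp value 0xbfbf1bf0 is inferred
(the observed 0xbfbf1bec plus 4), and the function name is hypothetical.

```python
def esp_after_patched_sleep(esp_at_patch_point: int):
    """Track esp across 'push imm32; call sleep' followed by sleep()'s ret.

    Returns (esp after sleep() returns, address of the saved return address).
    """
    esp = esp_at_patch_point
    esp -= 4            # push 0x7fffffff: the argument for sleep()
    esp -= 4            # call sleep@plt: pushes the return address
    ret_slot = esp      # this slot is what 'set *(int*)($ebp+4)=...' rewrote
    esp += 4            # ret inside sleep(): pops the return address
    return esp, ret_slot


if __name__ == "__main__":
    esp, ret_slot = esp_after_patched_sleep(0xBFBF1BF0)
    print(hex(esp), hex(ret_slot))
```

The model reproduces the esp printed by "i r" (0xbfbf1bec) and places the
return address at $ebp+4 of the sleep() frame (0xbfbf1be4 + 4), so adding 4
to esp removes the leftover argument.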

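The child is now strictly back in its original, unpatched state, stopped at
the starting point of the child path that follows fork()/vfork()
(0x8064e36).

Intercept the call to execv() in the child: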

(gdb) tb *0x8049eb4
(gdb) c
Continuing.

Temporary breakpoint 2, 0x08049eb4 in execv@plt ()
1: x/5i $pc
=> 0x8049eb4 <execv@plt>:       jmp    DWORD PTR ds:0x808fac8
   0x8049eba <execv@plt+6>:     push   0x2a0
   0x8049ebf <execv@plt+11>:    jmp    0x8049964

It stops as hoped. Locate the call site:

(gdb) x/1wx $esp
0xbfbf1b48:     0x08052c57
(gdb) bt
#0  0x08049eb4 in execv@plt ()
#1  0x08052c57 in ?? ()
#2  0x08053219 in ?? ()
#3  0x080648c5 in ?? ()
#4  0x08064a24 in ?? ()
#5  0x0804a8f5 in ?? ()
#6  0x0804c8bc in ?? ()
#7  0x0804a26a in ?? ()
#8  0x00000001 in ?? ()

--------------------------------------------------------------------------
08052C52 E8 5D 72 FF FF              call    _execv
08052C57 C7 05 60 83 0F 08 00 00+    mov     ds:dword_80F8360, 0
--------------------------------------------------------------------------
080648C0 E8 AB E6 FE FF              call    sub_8052F70
080648C5 83 C4 10                    add     esp, 10h
--------------------------------------------------------------------------

0x80648c0 (the call that eventually leads to execv) and 0x806512e (the call
to vfork) are in the same function, sub_80643B8.

(gdb) c
Continuing.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x2804e3c8 in ?? ()
1: x/5i $pc
=> 0x2804e3c8:  xor    ebp,ebp
   0x2804e3ca:  mov    eax,esp
   0x2804e3cc:  mov    esi,esp
   0x2804e3ce:  and    esp,0xfffffff0
   0x2804e3d1:  sub    esp,0x10
(gdb) info file
Symbols from "/bin/csh".
Unix child process:
        Using the running image of attached process 28888.
        While running this, GDB does not access memory from...
Local exec file:
        `/bin/csh', file type elf32-i386-freebsd.
        Entry point: 0x804a1f4
        ...

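The SIGTRAP after c is caused by execv(): the process has changed from csh
to id. "info file" fails to reflect this and still wrongly shows csh.
"ps uwx -p 28888" shows that the COMMAND column has changed from csh to id.

Note that 0x2804e3c8 appeared once before, where it carried the symbol
".rtld_start".

A temporary breakpoint on id's e_entry gets hit: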

(gdb) tb *0x80489ec
Temporary breakpoint 3 at 0x80489ec
(gdb) c
Continuing.

Temporary breakpoint 3, 0x080489ec in ?? ()
2: x/5i $pc
=> 0x80489ec:   push   ebp
   0x80489ed:   mov    ebp,esp
   0x80489ef:   push   edi
   0x80489f0:   push   esi
   0x80489f1:   push   ebx

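This system does not support "catch exec", but the demonstration above
achieves the same effect.

So after forcing fork(), the child can be debugged successfully, whether
execution is before or after execv(). What if, after forcing fork(), the
infinite loop is used instead of sleep()?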

set *(int*)0x8090838=1
set *(short int*)0x8064e36=0xfeeb

Test the infinite-loop case.

(gdb) c

Run id in csh to trigger the infinite loop in the child. Find the child PID
from another SSH session:

$ ps l | grep csh | grep 28416
 1000 28416 28601   0  20  0  4728  3632 pause  SX+   p1    0:00.03 csh
 1000 29526 28416 119 110  0  4728  3632 -      R+    p1    0:02.27 csh

$ ps uwx -p 29526
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
scz  29526 97.0  1.4  4728  3632  p1  R+    6:09PM   0:17.18 csh

29526 is the child PID.

Attach to the child from another SSH session:

$ gdb-7.6 -q -nx -x /tmp/gdbinit_x86_bsd.txt -p 29526

0x08064e36 in ?? ()
(gdb) display/5i $pc
1: x/5i $pc
=> 0x8064e36:   jmp    0x8064e36
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
   0x8064e40:   mov    eax,ds:0x80d6194
   0x8064e45:   test   eax,eax

As hoped, it stops at 0x8064e36, the patched infinite loop, and the child
is still csh at this point.

Restore the pre-patch state in the child:

set *(int*)0x8090838=0
set *(int*)0x8064e36=0xc085ff31

Single-stepping works normally:

(gdb) x/3i $pc
=> 0x8064e36:   xor    edi,edi
   0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
(gdb) i r eax esp
eax            0x1      1
esp            0xbfbf1bf0       0xbfbf1bf0
(gdb) si
0x08064e38 in ?? ()
1: x/5i $pc
=> 0x8064e38:   test   eax,eax
   0x8064e3a:   jne    0x80653dc
   0x8064e40:   mov    eax,ds:0x80d6194
   0x8064e45:   test   eax,eax
   0x8064e47:   je     0x8064e78

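After forcing fork(), the infinite loop is much simpler than sleep(): there
is no need to restore eax and esp. The infinite-loop approach failed at the
start because of vfork().

Nobody actually needs to debug "running id from csh"; it only serves as an
example of debugging a child process under extremely inconvenient
conditions. gdb-7.6 is used here because it is the highest version that
builds on FreeBSD 6.1.

Summary of the key points: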

--------------------------------------------------------------------------
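a)

Find a way to replace vfork() with fork(). Options include modifying a
global variable that steers the flow, the GOT[i] (.got.plt) or PLT[j] (.plt)
entry for vfork(), or even patching .text directly.

b)

While debugging the parent, patch an infinite loop near the starting point
of the future child path (where the fork() return value pid is confirmed to
be 0). This is CPU-specific and requires assembly skills.

c)

Trigger the fork() of the child, which then falls into the patched infinite
loop. Identify the child PID from another SSH session and attach to the
child with another GDB.

d)

Restore the code around the patch point in the child. From then on, the
child path after fork() and before execv() can be debugged.

e)

After the child calls execv(), it receives SIGTRAP and stops at the "(gdb)"
prompt, much like "catch exec". Do not be misled by "info file". From then
on, the new program that runs after execv() can be debugged.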
--------------------------------------------------------------------------