Let's look at the implementation.

3.4.1 The clone( ), fork( ), and vfork( ) System Calls
Lightweight processes are created in Linux by using a function named clone( ), which uses four parameters:

fn

Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.

arg

Points to data passed to the fn( ) function.

flags

Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD signal is generally selected. The remaining three bytes encode a group of clone flags, which specify the resources to be shared between the parent and the child process as follows:

CLONE_VM

Shares the memory descriptor and all Page Tables (see Chapter 8).

CLONE_FS

Shares the table that identifies the root directory and the current working directory, as well as the value of the bitmask used to mask the initial file permissions of a new file (the so-called file umask).

CLONE_FILES

Shares the table that identifies the open files (see Chapter 12).

CLONE_PARENT

Sets the parent of the child (p_pptr and p_opptr fields in the process descriptor) to the parent of the calling process.

CLONE_PID

Shares the PID.[6]

[6] As we shall see later, the CLONE_PID flag can be used only by a process having a PID of 0; in a uniprocessor system, no two lightweight processes have the same PID.

CLONE_PTRACE

If a ptrace( ) system call is causing the parent process to be traced, the child will also be traced.

CLONE_SIGHAND

Shares the table that identifies the signal handlers (see Chapter 10).

CLONE_THREAD

Inserts the child into the same thread group as the parent, and the child's tgid field is set accordingly. If this flag is set, it implicitly enforces CLONE_PARENT.

CLONE_SIGNAL

Equivalent to setting both CLONE_SIGHAND and CLONE_THREAD, so that it is possible to send a signal to all threads of a multithreaded application.

CLONE_VFORK

Used for the vfork( ) system call (see later in this section).

child_stack

Specifies the User Mode stack pointer to be assigned to the esp register of the child process. If it is equal to 0, the kernel assigns the current parent stack pointer to the child. Therefore, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack. However, this parameter must have a non-null value if the child process shares the same address space as the parent.
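As a rough illustration of the four parameters, here is a minimal sketch that uses the clone( ) wrapper of a modern GNU C library. Note that the library prototype takes its arguments in the order fn, stack, flags, arg, which differs from the order of the list above. The names child_fn, run_lightweight_child, termination_signal, and CHILD_STACK_SIZE are hypothetical and chosen only for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for clone( ) and the CLONE_* flags */
#endif
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)   /* arbitrary size for this sketch */

/* The low byte of flags holds the termination signal (mask CSIGNAL). */
int termination_signal(int flags)
{
    return flags & CSIGNAL;
}

/* fn: executed by the child; its return value becomes the exit code. */
static int child_fn(void *arg)
{
    int *shared = arg;
    *shared = 42;               /* seen by the parent because of CLONE_VM */
    return 7;
}

/* Creates a lightweight process that shares memory, filesystem
 * information, and open files with the caller. Returns the child's
 * exit code, or -1 on error. */
int run_lightweight_child(int *shared_out)
{
    /* CLONE_VM is set, so child_stack must be non-null. */
    char *stack = malloc(CHILD_STACK_SIZE);
    if (stack == NULL)
        return -1;

    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD;

    /* The stack grows downward on x86, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE, flags, shared_out);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);                /* safe: the child has terminated or wait failed */
    if (waited == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the child shares the parent's address space, the value it stores through arg is visible to the parent once waitpid( ) returns; without CLONE_VM the write would land in the child's private copy.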

clone( ) is actually a wrapper function defined in the C library (see Section 9.1), which in turn uses a clone( ) system call hidden from the programmer. This system call receives only the flags and child_stack parameters; the new process always starts its execution from the instruction following the system call invocation. When the system call returns to the clone( ) wrapper, the wrapper determines whether it is running in the parent or in the child and forces the child to execute the fn( ) function.

The traditional fork( ) system call is implemented by Linux as a clone( ) system call whose flags parameter specifies the SIGCHLD signal with all the clone flags cleared, and whose child_stack parameter is 0.
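The wrapper's parent/child dispatch and the fork-style parameters can be sketched together. The following is a toy stand-in, not the C library's code: toy_clone, return_arg, and toy_clone_exit_code are hypothetical names, and the raw system call is invoked with flags equal to SIGCHLD and a child_stack of 0. The argument order of the raw call is architecture dependent; this sketch assumes x86 or x86-64, where flags comes first and the stack pointer second.

```c
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy wrapper: issue the raw clone system call, then decide from its
 * return value whether we are the parent or the child. */
pid_t toy_clone(int (*fn)(void *), void *arg)
{
    /* flags = SIGCHLD, child_stack = 0: the same parameters fork( ) uses.
     * The remaining arguments are unused here and passed as 0. */
    long ret = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, 0UL, 0UL, 0UL);

    if (ret == 0)
        _exit(fn(arg));     /* child: run fn( ), exit with its return value */

    return (pid_t)ret;      /* parent: child's PID, or -1 with errno set */
}

/* Example fn( ): returns the integer that arg points to. */
static int return_arg(void *arg)
{
    return *(int *)arg;
}

/* Runs return_arg( ) in a child and reports the child's exit code. */
int toy_clone_exit_code(int code)
{
    pid_t pid = toy_clone(return_arg, &code);
    if (pid == -1)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The dispatch can be written in C here only because the child keeps running on a copy of the parent's stack. When a fresh child_stack is supplied, the child cannot return through the parent's stack frame, which is why a real wrapper has to be written in assembly language.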

The vfork( ) system call, described in the previous section, is implemented by Linux as a clone( ) system call whose first parameter specifies both a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK, and whose second parameter is equal to 0.
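From the application's side, the effect of CLONE_VM and CLONE_VFORK is that the parent stays suspended while the child borrows its address space. A minimal sketch follows; vfork_exit_code is a hypothetical name, and the child does nothing except terminate.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a child with vfork( ) and returns its exit code, or -1 on error. */
int vfork_exit_code(void)
{
    pid_t pid = vfork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: it runs in the parent's address space, so it should
         * only call _exit( ) or one of the exec functions. */
        _exit(3);
    }

    /* Parent: execution resumes here only after the child has
     * terminated or executed a new program. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```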

When a clone( ), fork( ), or vfork( ) system call is issued, the kernel invokes the do_fork( ) function, which executes the following steps:

1. If the CLONE_PID flag is specified, the do_fork( ) function checks whether the PID of the parent process is not 0; if so, it returns an error code. Only the swapper process is allowed to set CLONE_PID; this is required when initializing a multiprocessor system.

2. The alloc_task_struct( ) function is invoked to get a new 8 KB union task_union memory area to store the process descriptor and the Kernel Mode stack of the new process.

3. The function follows the current pointer to obtain the parent process descriptor and copies it into the new process descriptor in the memory area just allocated.

4. A few checks occur to make sure the user has the resources necessary to start a new process. First, the function checks whether current->rlim[RLIMIT_NPROC].rlim_cur is smaller than or equal to the current number of processes owned by the user. If so, an error code is returned, unless the process has root privileges. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.

5. The function checks that the number of processes is smaller than the value of the max_threads variable. The initial value of this variable depends on the amount of RAM in the system. The general rule is that the space taken by all process descriptors and Kernel Mode stacks cannot exceed 1/8 of the physical memory. However, the system administrator may change this value by writing in the /proc/sys/kernel/threads-max file.

6. If the parent process uses any kernel modules, the function increments the corresponding reference counters. As we shall see in Appendix B, each kernel module has its own reference counter, which ensures that the module will not be unloaded while it is being used.

7. The function then updates some of the flags included in the flags field that have been copied from the parent process:

a. It clears the PF_SUPERPRIV flag, which indicates whether the process has used any of its superuser privileges.

b. It clears the PF_USEDFPU flag.

c. It sets the PF_FORKNOEXEC flag, which indicates that the child process has not yet issued an execve( ) system call.

8. Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid( ) function to obtain a new PID, which will be assigned to the child process (unless the CLONE_PID flag is set).

9. The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.

10. Unless specified differently by the flags parameter, it invokes copy_files( ), copy_fs( ), copy_sighand( ), and copy_mm( ) to create new data structures and copy into them the values of the corresponding parent process data structures.

11. The do_fork( ) function invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 9). However, the function forces the value 0 into the field corresponding to the eax register. The thread.esp field in the descriptor of the child process is initialized with the base address of the child's Kernel Mode stack, and the address of an assembly language function (ret_from_fork( )) is stored in the thread.eip field. The copy_thread( ) function also invokes unlazy_fpu( ) on the parent and duplicates the contents of the thread.i387 field.

12. If either CLONE_THREAD or CLONE_PARENT is set, the function copies the value of the p_opptr and p_pptr fields of the parent into the corresponding fields of the child. The parent of the child thus appears as the parent of the current process. Otherwise, the function stores the process descriptor address of current into the p_opptr and p_pptr fields of the child.

13. If the CLONE_PTRACE flag is not set, the function sets the ptrace field in the child process descriptor to 0. This field stores a few flags used when a process is being traced by another process. Thus, even if the current process is being traced, the child will not be.

14. Conversely, if the CLONE_PTRACE flag is set, the function checks whether the parent process is being traced because in this case, the child should be traced too. Therefore, if PT_PTRACED is set in current->ptrace, the function copies the current->p_pptr field into the corresponding field of the child.

15. The do_fork( ) function checks the value of CLONE_THREAD. If the flag is set, the function inserts the child in the thread group of the parent and copies the value of the parent's tgid into the child's tgid field; otherwise, the function sets the child's tgid field to the value of its pid field.

16. The function uses the SET_LINKS macro to insert the new process descriptor in the process list.

17. The function invokes hash_pid( ) to insert the new process descriptor in the pidhash hash table.

18. The function increments the values of nr_threads and current->user->processes.

19. If the child is being traced, the function sends a SIGSTOP signal to it so that the debugger has a chance to look at it before it starts executing.

20. It invokes wake_up_process( ) to set the state field of the child process descriptor to TASK_RUNNING and to insert the child in the runqueue list.

21. If the CLONE_VFORK flag is specified, the function inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).

22. The do_fork( ) function returns the PID of the child, which is eventually read by the parent process in User Mode.
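A few of the bookkeeping steps above (1, 3, 5, 8, 12, and 15) can be mimicked in a small user-space model. This is only a sketch: struct toy_task, the TOY_ constants (whose values are arbitrary and not the kernel's), toy_next_pid, toy_max_threads, and toy_do_fork are all hypothetical names, and the real do_fork( ) does far more work.

```c
#include <stddef.h>

#define TOY_TASK_UNION_SIZE 8192UL   /* 8 KB descriptor plus Kernel Mode stack */

/* Illustrative flag bits; NOT the kernel's actual values. */
enum { TOY_CLONE_PID = 1, TOY_CLONE_PARENT = 2, TOY_CLONE_THREAD = 4 };

struct toy_task {
    int pid;
    int tgid;
    struct toy_task *p_opptr;   /* original parent */
    struct toy_task *p_pptr;    /* current parent */
};

static int toy_next_pid = 1;    /* stand-in for get_pid( ) */

/* Step 5: descriptors and stacks may use at most 1/8 of physical memory. */
unsigned long toy_max_threads(unsigned long ram_bytes)
{
    return ram_bytes / 8 / TOY_TASK_UNION_SIZE;
}

/* Fills *child from *current; returns the child's PID or -1 on error. */
int toy_do_fork(struct toy_task *current, unsigned long flags,
                struct toy_task *child)
{
    /* Step 1: only the process with PID 0 may share its PID. */
    if ((flags & TOY_CLONE_PID) && current->pid != 0)
        return -1;

    /* Step 3: start from a copy of the parent's descriptor. */
    *child = *current;

    /* Step 8: assign a new PID unless the PID is shared. */
    if (!(flags & TOY_CLONE_PID))
        child->pid = toy_next_pid++;

    /* Step 12: choose the child's parent. */
    if (flags & (TOY_CLONE_PARENT | TOY_CLONE_THREAD)) {
        child->p_opptr = current->p_opptr;
        child->p_pptr = current->p_pptr;
    } else {
        child->p_opptr = current;
        child->p_pptr = current;
    }

    /* Step 15: join the parent's thread group or start a new one. */
    child->tgid = (flags & TOY_CLONE_THREAD) ? current->tgid : child->pid;

    return child->pid;
}
```

Under the 1/8 rule, a machine with 256 MB of RAM would allow 256 MB / 8 / 8 KB = 4,096 process descriptors in this model.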

Now we have a complete child process in the runnable state. But it isn't actually running. It is up to the scheduler to decide when to give the CPU to this child. At some future process switch, the scheduler bestows this favor on the child process by loading a few CPU registers with the values of the thread field of the child's process descriptor. In particular, esp is loaded with thread.esp (that is, with the address of the child's Kernel Mode stack), and eip is loaded with the address of ret_from_fork( ). This assembly language function, in turn, invokes the ret_from_sys_call( ) function (see Chapter 9), which reloads all other registers with the values stored in the stack and forces the CPU back to User Mode. The new process then starts its execution right at the end of the fork( ), vfork( ), or clone( ) system call. The value returned by the system call is contained in eax: the value is 0 for the child and equal to the child's PID for the parent.

The child process executes the same code as the parent, except that fork( ) returns 0 in the child. Application developers can exploit this fact, in a manner familiar to Unix programmers, by inserting a conditional statement on the returned value that makes the child behave differently from the parent process.
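The familiar idiom looks roughly like the following sketch. The function name fork_return_values_agree is hypothetical; the child reports its own PID through a pipe so that the parent can compare it with the value fork( ) returned.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the PID seen by the child matches the value fork( )
 * returned to the parent, 0 if not, and -1 on error. */
int fork_return_values_agree(void)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child branch: fork( ) returned 0. */
        close(fds[0]);
        pid_t self = getpid();
        ssize_t n = write(fds[1], &self, sizeof self);
        _exit(n == (ssize_t)sizeof self ? 0 : 1);
    }

    /* Parent branch: fork( ) returned the child's PID. */
    close(fds[1]);
    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    waitpid(pid, NULL, 0);      /* reap the child */

    if (n != (ssize_t)sizeof reported)
        return -1;
    return reported == pid;
}
```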
