A major problem that can cause false-positives in multi-variant execution is asynchronous signal delivery. As an example, assume variant (process) p1 receives a signal and starts executing its signal handler. p1's signal handler invokes system call s1, causing the monitor to wait for the same system call from p2. Meanwhile, variant p2 has not received the signal and is still running its main program code. When p2 calls system call s2, the monitor will detect the difference between s1 and s2 and raise an alarm. Therefore, it is essential that the variants receive signals at the same state of execution.
Whenever a signal is delivered to a variant, the OS pauses the variant and notifies the monitor. The monitor has the choice to deliver the signal to the variant or ignore it. The monitor immediately delivers signals that terminate program execution, such as SIGTERM and SIGSEGV. Other signals are delivered to all variants synchronously, meaning that signals are delivered to all variants either before or after a synchronization point. If at least half of the variants receive a signal before making a system call, and the rest invoke the system call, the monitor makes the latter variants skip the system call by replacing it with a non-state-changing call and forces them to wait for the signal. The monitor then delivers the signal to all variants and then restores the system call in those variants that have been made to skip it. The variants that are forced to wait for a signal and do not receive it within a configurable amount of time are considered as non-complying.
If fewer than half of the variants receive a signal and the rest invoke a system call, the signal is ignored and the variants which are stopped by the signal are resumed. The monitor keeps a list of pending signals for each variant. The ignored signals are added to these lists. As more variants receive the signal, the monitor checks the lists and when half of the variants have received the signal, the signal is delivered using the method mentioned above. The only difference is that the signal has to be sent again to the variants that ignored it. The monitor sends the signal to these variants again and removes it from the variants' pending signal lists.
The following flowchart depicts the algorithm that removes false-positives caused by asynchronous signal delivery in multi-variant environments (Click for larger picture). One could call this method "semi-synchronous signal delivery".
Tuesday, November 11, 2008
Monday, May 12, 2008
Low-overhead access to the memory space of a traced process, Part III
I addressed low overhead access to the memory space of a traced process in a previous post. That technique was based on creating anonymous pipes between the tracer and the tracees. While that methods works without any problem, it cannot be used to access the memory space of child processes created by the tracees. The anonymous pipes can connect a parent process to its children and since the children of the tracees are not created by the tracer, the tracer cannot establish pipes to them.
Using named pipes (FIFOs) instead of anonymous pipes solves the problem. FIFOs can be established between processes, regardless of whether they are related or not. The mechanism used to read from or write to the FIFOs is the same as the one described for pipes. The only difference is that we postpone creation and connection to the FIFOs until they are needed.
The downside of using FIFOs is the higher security risk, since any process can connect to them and try to read their contents. When we create the FIFOs, their permissions are set so that only the user who has executed the monitor can read or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases are possible only when the system is already compromised.
We measured the performance of FIFOs versus ptrace and observed that for buffer sizes of 160 bytes or smaller, using ptrace is more efficient than using FIFOs, but the time needed to transfer buffers using ptrace increases linearly with the buffer sizes. Transferring a 4KB buffer using ptrace takes 16 times as much as it takes using FIFOs.
Using named pipes (FIFOs) instead of anonymous pipes solves the problem. FIFOs can be established between processes, regardless of whether they are related or not. The mechanism used to read from or write to the FIFOs is the same as the one described for pipes. The only difference is that we postpone creation and connection to the FIFOs until they are needed.
The downside of using FIFOs is the higher security risk, since any process can connect to them and try to read their contents. When we create the FIFOs, their permissions are set so that only the user who has executed the monitor can read or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases are possible only when the system is already compromised.
We measured the performance of FIFOs versus ptrace and observed that for buffer sizes of 160 bytes or smaller, using ptrace is more efficient than using FIFOs, but the time needed to transfer buffers using ptrace increases linearly with the buffer sizes. Transferring a 4KB buffer using ptrace takes 16 times as much as it takes using FIFOs.
Wednesday, March 26, 2008
Duplicates of stdio
Any file requested to be opened for writing is opened by the Monitor and all subsequent operations on this file are performed by the Monitor. Also, operations on the standard I/O are performed only by the Monitor. In order to detect operations on the standard I/O, checking the file descriptors is not enough, because the file descriptors can be duplicated or redirected. The Monitor needs to keep track of duplications and redirections of the standard I/O in order to perform these operations correctly.
Tuesday, March 18, 2008
Running User Mode Linux - Part I
Recently, I have been trying to run User Mode Linux (UML) on our monitor. Running UML is very challenging, because it performs many low level operations that need special handling. For example, UML reads CPU Time Stamp Counters. Reading these counters is done at user-level (using rdtsc) and without invoking any system call. Hence, the monitor is not informed of this operation and cannot send identical results to all the variants. This can cause discrepancies among the variants. We replace the rdtsc with gettimeofday so that we can catch the system call and send the same time to all variants.
Processes executed on top of UML are supervised using ptrace. In fact, UML runs processes in debugging mode. Therefore, it is essential that UML should be able to run ptrace successfully and attach to the processes. Our monitor used to attach to all the variants and all their children preventing UML from performing correctly. Now that the variants are UML, the monitor should not attach to the children. To support this, we have added a command line argument that can be used when running UML. The argument causes the monitor to trace the variants only and not their children.
(We should investigate if this kind of monitoring is enough.)
Another challenge when running UML is supporting mmap. Our monitor does most of the file operations itself and just sends the results to the variants, however, if variants try to mmap a file that is not opened by themselves, mmap fails. This is the case when loading dynamic libraries. They are opened as read-only and then are mmaped. To solve this, we allow the variants to open files only if the requested permission is read-only. This solves the problem of loading dynamic libraries, but UML goes further. It opens temporary files in /dev/shm/ with full permission (read/write/execute) and then mmaps them. Since the files are opened with full permission, they are opened only by the monitor not by the variants and, therefore, mmap fails. This is a problem that should be addressed.
Processes executed on top of UML are supervised using ptrace. In fact, UML runs processes in debugging mode. Therefore, it is essential that UML should be able to run ptrace successfully and attach to the processes. Our monitor used to attach to all the variants and all their children preventing UML from performing correctly. Now that the variants are UML, the monitor should not attach to the children. To support this, we have added a command line argument that can be used when running UML. The argument causes the monitor to trace the variants only and not their children.
(We should investigate if this kind of monitoring is enough.)
Another challenge when running UML is supporting mmap. Our monitor does most of the file operations itself and just sends the results to the variants, however, if variants try to mmap a file that is not opened by themselves, mmap fails. This is the case when loading dynamic libraries. They are opened as read-only and then are mmaped. To solve this, we allow the variants to open files only if the requested permission is read-only. This solves the problem of loading dynamic libraries, but UML goes further. It opens temporary files in /dev/shm/ with full permission (read/write/execute) and then mmaps them. Since the files are opened with full permission, they are opened only by the monitor not by the variants and, therefore, mmap fails. This is a problem that should be addressed.
Thursday, February 7, 2008
Sending Signals in The Multi-Variant Execution System
As mentioned in the previous posts, multi-variant execution must be transparent to the variants. Also as mentioned previously, the variants must call the same system calls with equal or equivalent arguments.
The system calls which obtain the process ID (PID) of a running process (getpid), or its parent (getppid) or its children (fork), return a different value for each variant when executed normally. These IDs may be used later as arguments to other system calls. Different IDs in each variant make equivalence checking on the syscall arguments difficult. In order to make the equivalence checking more efficient, the monitor intercepts the ID returning syscalls before they return the results to the variants and replaces their results as though the syscalls were invoked by one of the variants. For example, the return value of getpid is replaced by the ID of the first variant and it seems that getpid returns the same ID, no matter which variant has called it.
While this method makes the equivalence checking fast and efficient, it can cause confusion in the system calls that use process IDs, such as wait4, waitpid, kill, .... In our system, these system calls are called with the same ID by all the variants. For example, if the variants want to send signals to their parents, they invoke kill with the ID of the first variant's parent. Obviously, if the system call is invoked as is, only the parent of the first variant receives the signal. To avoid this situation, the monitor keeps the real PID of all variants and their parents and children and when any of the system calls that use PIDs is encountered, the PID is compared to the variants real PIDs. If the PID is equal to that of one of the variants, it is replaced in each variant by the real PID of that variant before letting the system call get executed. If the PID is equal to the PID of one of the variants' parents, it is replaced in each variant by the PID of that variants' parent. A similar action is taken if the PID is equal to the PID of one the children. The only difference is that each process can have many children and the PID which replaces the syscall argument must be the PID of one of the children that corresponds to the child whose PID is reported to all the variants and is currently passed as the syscall argument.
After replacing the PID, the monitor lets the variants continue and invoke the system call.
The system calls which obtain the process ID (PID) of a running process (getpid), or its parent (getppid) or its children (fork), return a different value for each variant when executed normally. These IDs may be used later as arguments to other system calls. Different IDs in each variant make equivalence checking on the syscall arguments difficult. In order to make the equivalence checking more efficient, the monitor intercepts the ID returning syscalls before they return the results to the variants and replaces their results as though the syscalls were invoked by one of the variants. For example, the return value of getpid is replaced by the ID of the first variant and it seems that getpid returns the same ID, no matter which variant has called it.
While this method makes the equivalence checking fast and efficient, it can cause confusion in the system calls that use process IDs, such as wait4, waitpid, kill, .... In our system, these system calls are called with the same ID by all the variants. For example, if the variants want to send signals to their parents, they invoke kill with the ID of the first variant's parent. Obviously, if the system call is invoked as is, only the parent of the first variant receives the signal. To avoid this situation, the monitor keeps the real PID of all variants and their parents and children and when any of the system calls that use PIDs is encountered, the PID is compared to the variants real PIDs. If the PID is equal to that of one of the variants, it is replaced in each variant by the real PID of that variant before letting the system call get executed. If the PID is equal to the PID of one of the variants' parents, it is replaced in each variant by the PID of that variants' parent. A similar action is taken if the PID is equal to the PID of one the children. The only difference is that each process can have many children and the PID which replaces the syscall argument must be the PID of one of the children that corresponds to the child whose PID is reported to all the variants and is currently passed as the syscall argument.
After replacing the PID, the monitor lets the variants continue and invoke the system call.
Thursday, January 31, 2008
Supporting "fork"
System calls that create child processes, such as fork, must be handled specifically. A child spawned by a variant must be synchronized to its corresponding children in other variants. Therefore, the monitor should group the children that correspond to each other for synchronization and supervision purposes. Monitoring the main variants and their children in a single threaded monitor can impose significant overhead. For example, consider this scenario: the main variants invoke a system call that needs to be executed by the monitor and takes a long time. In the meantime the children invoke a system call that just needs a quick approval from the monitor. The children have to wait a long time for the monitor to finish the execution of the main variants' system call.
To tackle this problem, our monitor spawns a new thread every time child processes are created by the variants and hands over the monitoring of the newly created children to the new thread. The new thread is terminated when the children terminate. Handing the control over to the new thread is not straightforward, since ptrace is not designed to be used in a multi-threaded debugger. The new thread is not allowed to trace the children unless the parent thread detaches from the children first and lets the new thread attach to them. When the parent thread detaches from the children the kernel sends a signal to the children and let them continue execution normally and without notifying the monitor at system call invocations. This would cause some system calls to escape the monitoring.
We solved the problem by letting the parent thread start monitoring the new child processes until they invoke the first system call. At this point, the parent thread saves the system call and its arguments and replaces it by sigsuspend. We block all the signals using sigsuspend. Since SIGSTOP and SIGKILL cannot be blocked, we still receive these two signals. Other than replacing the system call, we also decrement the instruction pointer by 2 so that the int 0x80 is executed again after the child is resumed. This is required to restore the original system call and run it. After these changes, the parent thread detaches from the children. The children run sigsuspend and get suspended. Then the parent thread spawns a new monitoring thread and passes process IDs of the child processes to it. The new monitoring thread attaches to the children; kernel sends SIGSTOP to the children and wakes them up. They run int 0x80 again which is intercepted by the new monitoring thread. The monitoring thread restores the original system call replaced by sigsuspend and starts monitoring the new children without missing any system call.
In a multi-threaded monitor, any thread can receive signals raised in any traced process; meaning that a thread can receive signals raised for the processes monitored by another thread. To solve this problem, we simply use wait4 instead of wait. This way each thread only receives signals raised in processes monitored by itself.
The above mechanism also works for tracing children created with clone. In order to receive signals received in cloned children, we should use __WALL with wait4.
To tackle this problem, our monitor spawns a new thread every time child processes are created by the variants and hands over the monitoring of the newly created children to the new thread. The new thread is terminated when the children terminate. Handing the control over to the new thread is not straightforward, since ptrace is not designed to be used in a multi-threaded debugger. The new thread is not allowed to trace the children unless the parent thread detaches from the children first and lets the new thread attach to them. When the parent thread detaches from the children the kernel sends a signal to the children and let them continue execution normally and without notifying the monitor at system call invocations. This would cause some system calls to escape the monitoring.
We solved the problem by letting the parent thread start monitoring the new child processes until they invoke the first system call. At this point, the parent thread saves the system call and its arguments and replaces it by sigsuspend. We block all the signals using sigsuspend. Since SIGSTOP and SIGKILL cannot be blocked, we still receive these two signals. Other than replacing the system call, we also decrement the instruction pointer by 2 so that the int 0x80 is executed again after the child is resumed. This is required to restore the original system call and run it. After these changes, the parent thread detaches from the children. The children run sigsuspend and get suspended. Then the parent thread spawns a new monitoring thread and passes process IDs of the child processes to it. The new monitoring thread attaches to the children; kernel sends SIGSTOP to the children and wakes them up. They run int 0x80 again which is intercepted by the new monitoring thread. The monitoring thread restores the original system call replaced by sigsuspend and starts monitoring the new children without missing any system call.
In a multi-threaded monitor, any thread can receive signals raised in any traced process; meaning that a thread can receive signals raised for the processes monitored by another thread. To solve this problem, we simply use wait4 instead of wait. This way each thread only receives signals raised in processes monitored by itself.
The above mechanism also works for tracing children created with clone. In order to receive signals received in cloned children, we should use __WALL with wait4.
Tuesday, January 8, 2008
Signal Handling in Reverse Stack Executables
One of the challenges in reverse stack manipulation is signal handling. If a signal handler is defined for a signal, when the signal is raised, the kernel sets up a signal frame and saves the context of the process on the stack and calls the corresponding handler. Since the kernel expects normal stack growth direction, e.g. downwards in x86, the context saved by the kernel overwrites data on a reverse growing stack. To tackle this problem, we allocate a small block of memory and call sigaltstack to notify the OS that it must use this memory block as the signal stack to set up the signal frame and save the process' context.
The problem is that the handler which is defined by the programmer, is compiled for reverse stack. Now when the signal rises, the kernel saves the context on the stack and calls the handler. The handler uses the same signal handling stack and when it starts execution, the stack pointer is located just below the context saved by the OS. Therefore, a handler compiled for reverse growing stack can overwrite and destroy the context of the process, causing a crash when the handler returns.
To solve this problem, we change the interface to sigaction system call in libc. sigaction registers a new handler for a specified signal number. We change the interface to the system call so that whenever it is called, the new interface sets the new handler to a wrapper function that we have defined in libc. The wrapper function increments the stack pointer to bypass the area used for saving the process' context and then calls the user-defined function. After the user-defined function returns, the wrapper decrements the stack pointer to point back to its original location and returns. Using this method, the saved context remains intact and the kernel is able to restore it while everything remains transparent to the kernel.
As mentioned above, we allocate a block of memory to use as the alternative stack. We pass a pointer close to the beginning of the block to sigaltstack. The kernel uses this pointer as the beginning of the alternative stack and saves the context down this point and towards the beginning of the block. The pointer is far enough from the start of the block to provide adequate room for saving the context. After saving the context, our wrapper function increments the stack pointer to go past the context and uses the rest of the memory block as an upward growing stack for the signal handler.
The problem is that the handler which is defined by the programmer, is compiled for reverse stack. Now when the signal rises, the kernel saves the context on the stack and calls the handler. The handler uses the same signal handling stack and when it starts execution, the stack pointer is located just below the context saved by the OS. Therefore, a handler compiled for reverse growing stack can overwrite and destroy the context of the process, causing a crash when the handler returns.
To solve this problem, we change the interface to sigaction system call in libc. sigaction registers a new handler for a specified signal number. We change the interface to the system call so that whenever it is called, the new interface sets the new handler to a wrapper function that we have defined in libc. The wrapper function increments the stack pointer to bypass the area used for saving the process' context and then calls the user-defined function. After the user-defined function returns, the wrapper decrements the stack pointer to point back to its original location and returns. Using this method, the saved context remains intact and the kernel is able to restore it while everything remains transparent to the kernel.
As mentioned above, we allocate a block of memory to use as the alternative stack. We pass a pointer close to the beginning of the block to sigaltstack. The kernel uses this pointer as the beginning of the alternative stack and saves the context down this point and towards the beginning of the block. The pointer is far enough from the start of the block to provide adequate room for saving the context. After saving the context, our wrapper function increments the stack pointer to go past the context and uses the rest of the memory block as an upward growing stack for the signal handler.
Subscribe to:
Posts (Atom)