A flowchart of our synchronous signal delivery algorithm was given in a previous post.
This post provides techniques to implement the algorithm. Our monitor sometimes needs to make the variants skip a system call temporarily in order to deliver signals synchronously. After skipping the system call, the monitor has to make the variant wait for the signal. A small tight loop is used for this purpose. The monitor injects the code of the loop to the memory space of the variant and changes the instruction pointer of the variant to point to this small loop. The variant starts executing the loop immediately after skipping the system call.
The number of iterations of this loop determines the maximum wait time for a signal. It can be configured, but we always use one billion iterations in our prototype system. Normally, not all of the iterations are executed. The monitor is notified as soon as the variant receives the signal.
After being notified, the monitor restores the original system call in the variant and the remaining iterations of the loop are skipped.
When the signal is not received after all loop iterations are executed, the variant is considered non-compliant. We insert a system call invocation instruction (int 0x80) after the loop to dispatch control back to the monitor when the loop finishes execution. Execution of this instruction indicates that the variant has not received the signal in the alloted time period and is non-compliant.
Our evaluations show that our algorithm causes less than 0.5 mili-second delay on average in delivering signals. Previously, other researchers had proposed delivering signals at system calls in order to deliver them synchronously. That approach could cause hundreds of mili-seconds delay in CPU-bound programs which might not be acceptable for certain signals, such as timer signals. The delay could also cause noticeable jitter in certain applications.
We used SPEC CPU2000 benchmark to evaluate our synchronous signal delivery mechanism and set a timer signal to be delivered to the tests frequently and then computed the average delay.
Friday, June 12, 2009
Thursday, February 5, 2009
Low-overhead access to the memory space of a traced process, Part IV
Low-overhead access to the memory space of a traced process is a major challenge when using a user-space tracer. I have devoted a few posts to mechanisms that allow us access a traced process's memory with low overhead. My last post on low-overhead memory access suggests using FIFOs. However, FIFOs have a limitation: size of FIFOs is often small and not configurable without recompiling the Linux kernel. For example, it is 4KB in Ubuntu 8.10. This means that multiple iterations of FIFOs are needed to transfer buffers whose sizes are larger than 4KB. Each iteration needs multiple context switches which increase the overhead.
Recently, I replaced the FIFOs by shared memory and obtained very good results. Shared memory size is by default several mega bytes (use "cat /proc/sys/kernel/shmmax" in Linux to see its default size). The default size is more than enough to read system call arguments from traced processes and to write back system call results to the processes.
Similar to FIFOs, the downside of using shared memory is the security risk, since any process can connect to them and try to access their contents. However, each shared memory block has a key and processes are allowed to attach a block only if they have the correct key. When we create shared memory blocks, their permissions are set so that only the user who has executed the monitor can read from or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases would be possible only when the system is already compromised.
Evaluating performance of shared memory versus FIFOs versus ptrace, I observed that shared memory is about 20 times faster than FIFOs and 900 faster than ptrace when transferring a 128KB buffer.
Recently, I replaced the FIFOs by shared memory and obtained very good results. Shared memory size is by default several mega bytes (use "cat /proc/sys/kernel/shmmax" in Linux to see its default size). The default size is more than enough to read system call arguments from traced processes and to write back system call results to the processes.
Similar to FIFOs, the downside of using shared memory is the security risk, since any process can connect to them and try to access their contents. However, each shared memory block has a key and processes are allowed to attach a block only if they have the correct key. When we create shared memory blocks, their permissions are set so that only the user who has executed the monitor can read from or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases would be possible only when the system is already compromised.
Evaluating performance of shared memory versus FIFOs versus ptrace, I observed that shared memory is about 20 times faster than FIFOs and 900 faster than ptrace when transferring a 128KB buffer.
Labels:
FIFO,
memory access,
Multi-variant Execution,
ptrace,
shared memory
Tuesday, November 11, 2008
Dealing with Asynchronous Signal Delivery in Multi-Variant Execution
A major problem that can cause false-positives in multi-variant execution is asynchronous signal delivery. As an example, assume variant (process) p1 receives a signal and starts executing its signal handler. p1's signal handler invokes system call s1, causing the monitor to wait for the same system call from p2. Meanwhile, variant p2 has not received the signal and is still running its main program code. When p2 calls system call s2, the monitor will detect the difference between s1 and s2 and raise an alarm. Therefore, it is essential that the variants receive signals at the same state of execution.
Whenever a signal is delivered to a variant, the OS pauses the variant and notifies the monitor. The monitor has the choice to deliver the signal to the variant or ignore it. The monitor immediately delivers signals that terminate program execution, such as SIGTERM and SIGSEGV. Other signals are delivered to all variants synchronously, meaning that signals are delivered to all variants either before or after a synchronization point. If at least half of the variants receive a signal before making a system call, and the rest invoke the system call, the monitor makes the latter variants skip the system call by replacing it with a non-state-changing call and forces them to wait for the signal. The monitor then delivers the signal to all variants and then restores the system call in those variants that have been made to skip it. The variants that are forced to wait for a signal and do not receive it within a configurable amount of time are considered as non-complying.
If fewer than half of the variants receive a signal and the rest invoke a system call, the signal is ignored and the variants which are stopped by the signal are resumed. The monitor keeps a list of pending signals for each variant. The ignored signals are added to these lists. As more variants receive the signal, the monitor checks the lists and when half of the variants have received the signal, the signal is delivered using the method mentioned above. The only difference is that the signal has to be sent again to the variants that ignored it. The monitor sends the signal to these variants again and removes it from the variants' pending signal lists.
The following flowchart depicts the algorithm that removes false-positives caused by asynchronous signal delivery in multi-variant environments (Click for larger picture). One could call this method "semi-synchronous signal delivery".
Whenever a signal is delivered to a variant, the OS pauses the variant and notifies the monitor. The monitor has the choice to deliver the signal to the variant or ignore it. The monitor immediately delivers signals that terminate program execution, such as SIGTERM and SIGSEGV. Other signals are delivered to all variants synchronously, meaning that signals are delivered to all variants either before or after a synchronization point. If at least half of the variants receive a signal before making a system call, and the rest invoke the system call, the monitor makes the latter variants skip the system call by replacing it with a non-state-changing call and forces them to wait for the signal. The monitor then delivers the signal to all variants and then restores the system call in those variants that have been made to skip it. The variants that are forced to wait for a signal and do not receive it within a configurable amount of time are considered as non-complying.
If fewer than half of the variants receive a signal and the rest invoke a system call, the signal is ignored and the variants which are stopped by the signal are resumed. The monitor keeps a list of pending signals for each variant. The ignored signals are added to these lists. As more variants receive the signal, the monitor checks the lists and when half of the variants have received the signal, the signal is delivered using the method mentioned above. The only difference is that the signal has to be sent again to the variants that ignored it. The monitor sends the signal to these variants again and removes it from the variants' pending signal lists.
The following flowchart depicts the algorithm that removes false-positives caused by asynchronous signal delivery in multi-variant environments (Click for larger picture). One could call this method "semi-synchronous signal delivery".
Monday, May 12, 2008
Low-overhead access to the memory space of a traced process, Part III
I addressed low overhead access to the memory space of a traced process in a previous post. That technique was based on creating anonymous pipes between the tracer and the tracees. While that methods works without any problem, it cannot be used to access the memory space of child processes created by the tracees. The anonymous pipes can connect a parent process to its children and since the children of the tracees are not created by the tracer, the tracer cannot establish pipes to them.
Using named pipes (FIFOs) instead of anonymous pipes solves the problem. FIFOs can be established between processes, regardless of whether they are related or not. The mechanism used to read from or write to the FIFOs is the same as the one described for pipes. The only difference is that we postpone creation and connection to the FIFOs until they are needed.
The downside of using FIFOs is the higher security risk, since any process can connect to them and try to read their contents. When we create the FIFOs, their permissions are set so that only the user who has executed the monitor can read or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases are possible only when the system is already compromised.
We measured the performance of FIFOs versus ptrace and observed that for buffer sizes of 160 bytes or smaller, using ptrace is more efficient than using FIFOs, but the time needed to transfer buffers using ptrace increases linearly with the buffer sizes. Transferring a 4KB buffer using ptrace takes 16 times as much as it takes using FIFOs.
Using named pipes (FIFOs) instead of anonymous pipes solves the problem. FIFOs can be established between processes, regardless of whether they are related or not. The mechanism used to read from or write to the FIFOs is the same as the one described for pipes. The only difference is that we postpone creation and connection to the FIFOs until they are needed.
The downside of using FIFOs is the higher security risk, since any process can connect to them and try to read their contents. When we create the FIFOs, their permissions are set so that only the user who has executed the monitor can read or write to them. Therefore, the risk is limited to the case of a malicious program that is executed in the context of the same user or a super user. Both cases are possible only when the system is already compromised.
We measured the performance of FIFOs versus ptrace and observed that for buffer sizes of 160 bytes or smaller, using ptrace is more efficient than using FIFOs, but the time needed to transfer buffers using ptrace increases linearly with the buffer sizes. Transferring a 4KB buffer using ptrace takes 16 times as much as it takes using FIFOs.
Wednesday, March 26, 2008
Duplicates of stdio
Any file requested to be opened for writing is opened by the Monitor and all subsequent operations on this file are performed by the Monitor. Also, operations on the standard I/O are performed only by the Monitor. In order to detect operations on the standard I/O, checking the file descriptors is not enough, because the file descriptors can be duplicated or redirected. The Monitor needs to keep track of duplications and redirections of the standard I/O in order to perform these operations correctly.
Tuesday, March 18, 2008
Running User Mode Linux - Part I
Recently, I have been trying to run User Mode Linux (UML) on our monitor. Running UML is very challenging, because it performs many low level operations that need special handling. For example, UML reads CPU Time Stamp Counters. Reading these counters is done at user-level (using rdtsc) and without invoking any system call. Hence, the monitor is not informed of this operation and cannot send identical results to all the variants. This can cause discrepancies among the variants. We replace the rdtsc with gettimeofday so that we can catch the system call and send the same time to all variants.
Processes executed on top of UML are supervised using ptrace. In fact, UML runs processes in debugging mode. Therefore, it is essential that UML should be able to run ptrace successfully and attach to the processes. Our monitor used to attach to all the variants and all their children preventing UML from performing correctly. Now that the variants are UML, the monitor should not attach to the children. To support this, we have added a command line argument that can be used when running UML. The argument causes the monitor to trace the variants only and not their children.
(We should investigate if this kind of monitoring is enough.)
Another challenge when running UML is supporting mmap. Our monitor does most of the file operations itself and just sends the results to the variants, however, if variants try to mmap a file that is not opened by themselves, mmap fails. This is the case when loading dynamic libraries. They are opened as read-only and then are mmaped. To solve this, we allow the variants to open files only if the requested permission is read-only. This solves the problem of loading dynamic libraries, but UML goes further. It opens temporary files in /dev/shm/ with full permission (read/write/execute) and then mmaps them. Since the files are opened with full permission, they are opened only by the monitor not by the variants and, therefore, mmap fails. This is a problem that should be addressed.
Processes executed on top of UML are supervised using ptrace. In fact, UML runs processes in debugging mode. Therefore, it is essential that UML should be able to run ptrace successfully and attach to the processes. Our monitor used to attach to all the variants and all their children preventing UML from performing correctly. Now that the variants are UML, the monitor should not attach to the children. To support this, we have added a command line argument that can be used when running UML. The argument causes the monitor to trace the variants only and not their children.
(We should investigate if this kind of monitoring is enough.)
Another challenge when running UML is supporting mmap. Our monitor does most of the file operations itself and just sends the results to the variants, however, if variants try to mmap a file that is not opened by themselves, mmap fails. This is the case when loading dynamic libraries. They are opened as read-only and then are mmaped. To solve this, we allow the variants to open files only if the requested permission is read-only. This solves the problem of loading dynamic libraries, but UML goes further. It opens temporary files in /dev/shm/ with full permission (read/write/execute) and then mmaps them. Since the files are opened with full permission, they are opened only by the monitor not by the variants and, therefore, mmap fails. This is a problem that should be addressed.
Thursday, February 7, 2008
Sending Signals in The Multi-Variant Execution System
As mentioned in the previous posts, multi-variant execution must be transparent to the variants. Also as mentioned previously, the variants must call the same system calls with equal or equivalent arguments.
The system calls which obtain the process ID (PID) of a running process (getpid), or its parent (getppid) or its children (fork), return a different value for each variant when executed normally. These IDs may be used later as arguments to other system calls. Different IDs in each variant make equivalence checking on the syscall arguments difficult. In order to make the equivalence checking more efficient, the monitor intercepts the ID returning syscalls before they return the results to the variants and replaces their results as though the syscalls were invoked by one of the variants. For example, the return value of getpid is replaced by the ID of the first variant and it seems that getpid returns the same ID, no matter which variant has called it.
While this method makes the equivalence checking fast and efficient, it can cause confusion in the system calls that use process IDs, such as wait4, waitpid, kill, .... In our system, these system calls are called with the same ID by all the variants. For example, if the variants want to send signals to their parents, they invoke kill with the ID of the first variant's parent. Obviously, if the system call is invoked as is, only the parent of the first variant receives the signal. To avoid this situation, the monitor keeps the real PID of all variants and their parents and children and when any of the system calls that use PIDs is encountered, the PID is compared to the variants real PIDs. If the PID is equal to that of one of the variants, it is replaced in each variant by the real PID of that variant before letting the system call get executed. If the PID is equal to the PID of one of the variants' parents, it is replaced in each variant by the PID of that variants' parent. A similar action is taken if the PID is equal to the PID of one the children. The only difference is that each process can have many children and the PID which replaces the syscall argument must be the PID of one of the children that corresponds to the child whose PID is reported to all the variants and is currently passed as the syscall argument.
After replacing the PID, the monitor lets the variants continue and invoke the system call.
The system calls which obtain the process ID (PID) of a running process (getpid), or its parent (getppid) or its children (fork), return a different value for each variant when executed normally. These IDs may be used later as arguments to other system calls. Different IDs in each variant make equivalence checking on the syscall arguments difficult. In order to make the equivalence checking more efficient, the monitor intercepts the ID returning syscalls before they return the results to the variants and replaces their results as though the syscalls were invoked by one of the variants. For example, the return value of getpid is replaced by the ID of the first variant and it seems that getpid returns the same ID, no matter which variant has called it.
While this method makes the equivalence checking fast and efficient, it can cause confusion in the system calls that use process IDs, such as wait4, waitpid, kill, .... In our system, these system calls are called with the same ID by all the variants. For example, if the variants want to send signals to their parents, they invoke kill with the ID of the first variant's parent. Obviously, if the system call is invoked as is, only the parent of the first variant receives the signal. To avoid this situation, the monitor keeps the real PID of all variants and their parents and children and when any of the system calls that use PIDs is encountered, the PID is compared to the variants real PIDs. If the PID is equal to that of one of the variants, it is replaced in each variant by the real PID of that variant before letting the system call get executed. If the PID is equal to the PID of one of the variants' parents, it is replaced in each variant by the PID of that variants' parent. A similar action is taken if the PID is equal to the PID of one the children. The only difference is that each process can have many children and the PID which replaces the syscall argument must be the PID of one of the children that corresponds to the child whose PID is reported to all the variants and is currently passed as the syscall argument.
After replacing the PID, the monitor lets the variants continue and invoke the system call.
Subscribe to:
Posts (Atom)