Friday, June 12, 2009

Implementation of Synchronous Signal Delivery

A flowchart of our synchronous signal delivery algorithm was given in a previous post.
This post provides techniques to implement the algorithm. To deliver signals synchronously, our monitor sometimes needs to make the variants skip a system call temporarily. After skipping the system call, the monitor has to make the variant wait for the signal. A small, tight loop is used for this purpose: the monitor injects the code of the loop into the memory space of the variant and changes the variant's instruction pointer to point to this small loop. The variant starts executing the loop immediately after skipping the system call.
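As an illustration, here is a minimal sketch of how such an injection can be done with ptrace on 32-bit x86. The helper name and the word-aligned length are assumptions for the example, not our monitor's exact code:

    #include <string.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>

    /* Illustrative helper: copy `len` bytes of loop code to address `addr`
     * in the variant and point its instruction pointer there.  The original
     * bytes are saved so the monitor can restore them later.  Assumes a
     * 32-bit x86 target and a word-aligned `len`. */
    static void inject_loop(pid_t variant, unsigned long addr,
                            const unsigned char *code, size_t len,
                            unsigned char *saved)
    {
        struct user_regs_struct regs;
        size_t i;

        for (i = 0; i < len; i += sizeof(long)) {
            /* ptrace transfers one word per call */
            long word = ptrace(PTRACE_PEEKTEXT, variant, (void *)(addr + i), NULL);
            memcpy(saved + i, &word, sizeof(long));   /* back up the original code */
            memcpy(&word, code + i, sizeof(long));
            ptrace(PTRACE_POKETEXT, variant, (void *)(addr + i), (void *)word);
        }

        /* Redirect the variant to the injected loop */
        ptrace(PTRACE_GETREGS, variant, NULL, &regs);
        regs.eip = addr;
        ptrace(PTRACE_SETREGS, variant, NULL, &regs);
    }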

The number of iterations of this loop determines the maximum wait time for a signal. The count is configurable, but our prototype system always uses one billion iterations. Normally, not all of the iterations are executed: the monitor is notified as soon as the variant receives the signal. After being notified, the monitor restores the original system call in the variant and skips the remaining iterations of the loop.
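In a ptrace-based monitor, this notification comes for free: delivering any signal to a traced process stops it and wakes the tracer in waitpid. A rough sketch, where restore_code is a hypothetical helper that undoes the injection shown above:

    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    void restore_code(pid_t variant);   /* hypothetical: puts the original
                                           code and registers back */

    static void await_signal(pid_t variant, int expected_sig)
    {
        int status;
        waitpid(variant, &status, 0);   /* blocks until the variant stops */
        if (WIFSTOPPED(status) && WSTOPSIG(status) == expected_sig) {
            restore_code(variant);
            ptrace(PTRACE_CONT, variant, NULL,
                   (void *)(long)expected_sig);   /* forward the signal */
        }
    }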

If the signal is not received after all loop iterations have executed, the variant is considered non-compliant. We insert a system call invocation instruction (int 0x80) after the loop to dispatch control back to the monitor when the loop finishes. Execution of this instruction indicates that the variant has not received the signal within the allotted time and is non-compliant.
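One plausible encoding of the injected code on 32-bit x86 is only four bytes, assuming the monitor preloads the iteration count into ecx via PTRACE_SETREGS (this encoding is my illustration, not necessarily the exact bytes our prototype uses):

    /* Injected stub, 32-bit x86.  The monitor is assumed to have set
     * ecx to the iteration count (one billion) beforehand:
     *
     *   spin: loop spin    ; decrement ecx; branch back while ecx != 0
     *         int  $0x80   ; fell through: the signal never arrived
     */
    static const unsigned char wait_stub[] = {
        0xe2, 0xfe,          /* loop spin (rel8 = -2, i.e. to itself) */
        0xcd, 0x80           /* int 0x80: trap back to the monitor */
    };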

Our evaluations show that our algorithm adds less than 0.5 milliseconds of delay, on average, to signal delivery. Previously, other researchers had proposed delivering signals at system calls in order to deliver them synchronously. That approach can cause hundreds of milliseconds of delay in CPU-bound programs, which may be unacceptable for certain signals, such as timer signals. The delay can also cause noticeable jitter in certain applications.

We used the SPEC CPU2000 benchmark suite to evaluate our synchronous signal delivery mechanism: we set a timer signal to be delivered frequently to the benchmarks and then computed the average delivery delay.
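The measurement itself can be done with an interval timer. The sketch below only illustrates the idea; the 10 ms period and the use of CLOCK_MONOTONIC are my assumptions, not the parameters of our actual experiments:

    #include <signal.h>
    #include <sys/time.h>
    #include <time.h>

    #define PERIOD_NS (10 * 1000000LL)   /* assumed 10 ms period */

    static struct timespec expected;     /* next expected delivery time */
    static long long total_delay_ns;
    static long deliveries;              /* average = total_delay_ns / deliveries */

    static void on_alarm(int sig)
    {
        struct timespec now;
        (void)sig;
        clock_gettime(CLOCK_MONOTONIC, &now);
        /* delay = actual delivery time minus expected delivery time */
        total_delay_ns += (now.tv_sec - expected.tv_sec) * 1000000000LL
                        + (now.tv_nsec - expected.tv_nsec);
        deliveries++;
        expected.tv_nsec += PERIOD_NS;   /* schedule the next expected tick */
        if (expected.tv_nsec >= 1000000000L) {
            expected.tv_sec++;
            expected.tv_nsec -= 1000000000L;
        }
    }

    static void start_timer(void)
    {
        struct itimerval itv = { { 0, 10000 }, { 0, 10000 } };   /* 10 ms */
        signal(SIGALRM, on_alarm);
        clock_gettime(CLOCK_MONOTONIC, &expected);
        expected.tv_nsec += PERIOD_NS;
        setitimer(ITIMER_REAL, &itv, NULL);
    }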

Thursday, February 5, 2009

Low-overhead access to the memory space of a traced process, Part IV

Low-overhead access to the memory space of a traced process is a major challenge when using a user-space tracer. I have devoted a few posts to mechanisms that allow us to access a traced process's memory with low overhead. My last post on the subject suggested using FIFOs. However, FIFOs have a limitation: their size is small and is not configurable without recompiling the Linux kernel. For example, it is 4KB in Ubuntu 8.10. This means that multiple iterations are needed to transfer buffers larger than 4KB, and each iteration requires several context switches, which increases the overhead.
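For illustration, this is roughly what pushing a large buffer through a 4KB FIFO looks like; every chunk can block until the other side drains it, so the transfer ping-pongs between the two processes (the helper name is mine):

    #include <unistd.h>

    #define FIFO_CAPACITY 4096            /* FIFO size on Ubuntu 8.10 */

    /* Illustrative: a buffer larger than the FIFO must be transferred in
     * several iterations; each write may block until the reader drains
     * the FIFO, costing extra context switches. */
    static int send_buffer(int fifo_fd, const char *buf, size_t len)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off;
            if (chunk > FIFO_CAPACITY)
                chunk = FIFO_CAPACITY;
            ssize_t n = write(fifo_fd, buf + off, chunk);
            if (n < 0)
                return -1;                /* error handling elided */
            off += (size_t)n;
        }
        return 0;
    }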

Recently, I replaced the FIFOs with shared memory and obtained very good results. The maximum shared memory segment size defaults to several megabytes (run "cat /proc/sys/kernel/shmmax" on Linux to see the limit). The default is more than enough to read system call arguments from traced processes and to write system call results back to them.

As with FIFOs, the downside of using shared memory is the security risk, since other processes can try to attach to the segments and access their contents. However, each shared memory segment has a key, and a process is allowed to attach to a segment only if it has the correct key. When we create shared memory segments, we set their permissions so that only the user who runs the monitor can read from or write to them. The risk is therefore limited to a malicious program running as the same user or as the superuser, either of which is possible only when the system is already compromised.
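A minimal sketch of the setup, assuming System V shared memory; the path passed to ftok and the 4MB size are illustrative:

    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* Create (or attach to) the monitor/variant channel.  Mode 0600
     * ensures only the user running the monitor can read or write the
     * segment; the key is derived from a path both sides know. */
    static void *attach_channel(const char *path)
    {
        key_t key   = ftok(path, 'M');
        int   shmid = shmget(key, 4 * 1024 * 1024, IPC_CREAT | 0600);
        void *base;
        if (shmid < 0)
            return NULL;
        base = shmat(shmid, NULL, 0);     /* map into our address space */
        return base == (void *)-1 ? NULL : base;
    }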

Evaluating the performance of shared memory versus FIFOs versus ptrace, I observed that shared memory is about 20 times faster than FIFOs and about 900 times faster than ptrace when transferring a 128KB buffer.