Thursday, November 15, 2007

Low-overhead access to the memory space of a traced process, part II

We found a way to improve the performance of accessing a child process's memory. We create two pipes between the tracing process (monitor) and the traced process: one for reading from the process (read-pipe) and one for writing to it (write-pipe). To start running the process, the monitor spawns a child using fork and then runs a given executable inside the child. The communication pipes are created after the child is spawned and before the executable starts. Since the monitor starts the executable with execve, which preserves open file descriptors, the pipes remain open after execution begins.

When the process is stopped at a system call and control is transferred back to the monitor, the monitor replaces the original system call with a write to the read-pipe. It sets the arguments of the write to the address and length of the buffer it wants to read. The process is then resumed, the kernel executes the write, and the contents of the process's memory are written to the read-pipe. The OS notifies the monitor after executing the call, and the monitor fetches the whole buffer from the pipe with a single read. Writing to the process's memory works similarly, except that the monitor first writes the data to the write-pipe and then replaces the system call with a read from that pipe into the desired address with the specified length.

Reading and writing the memory of a traced process through the pipes is not always faster, though. Our experiments have shown that for buffers shorter than 40 bytes, ptrace is more efficient, while for buffers larger than 40 bytes the pipes are faster.

Friday, October 19, 2007

Files

One of the difficulties with multi-variant execution is handling file operations.
When the variants try to open a file, we can open the file in the platform and send the result to the variants. This solution fails if the variants try to mmap the file: they use a file descriptor that was sent to them by the platform and is not really open in their own contexts, so the mmap returns an error. This situation happens frequently when variants map shared libraries.
Another solution is to open all files in all variants, while preventing them from writing to the files; we allow only one of the variants (or the platform) to write to a file. This solves the problem of the previous method and keeps the file descriptors valid in the variants, but the kernel's locking mechanisms do not let us open the same file for writing in several processes simultaneously.
An alternative is to open read-only files in all variants and open the other files only in the platform. Obviously, this still suffers from the first solution's problem: files that are not opened in all variants cannot be mmapped. It is not as severe, though, because in most cases the files that are mmapped are opened read-only and, therefore, are opened by all variants.
The only complication is that a file opened by the platform and one opened by the variants can end up with the same file descriptor. For example, suppose the platform receives a request to open the file "a.txt" for reading and writing. The platform doesn't allow the variants to open the file and opens it itself; let's assume it is assigned file descriptor 5. This descriptor is sent to the variants and they use it in further requests to operate on the file.
Now the variants want to open "b.txt" read-only. The platform lets them run the syscall and open the file themselves. This time they get the file descriptor directly from the kernel, and it can be 5 again! If they now invoke read on file descriptor 5, it is not clear whether they mean "a.txt" or "b.txt". To solve this, the platform adds a large constant to every file descriptor it opens itself before sending it to the variants. In Linux 2.6, a process can open at most 256 files by default, so the largest file descriptor is 255. If we add 256 to all file descriptors opened by the platform before sending them to the variants, we can distinguish the files opened by the variants from those opened by the platform: any file operation requested by a variant on a descriptor of 256 or above is performed on a file opened by the platform, while smaller descriptors refer to files the variants opened directly.
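The bookkeeping boils down to a pair of translation helpers. This is our own sketch of the scheme, with FD_OFFSET and the helper names standing in for whatever the platform actually uses:

```c
/* Sketch of the descriptor-disambiguation scheme; FD_OFFSET plays the role
 * of the "big number" (256) from the post. */
#include <stdbool.h>

#define FD_OFFSET 256   /* larger than any fd the kernel hands out */

/* Platform side: shift its own fds before sending them to the variants. */
static int platform_to_variant_fd(int platform_fd) { return platform_fd + FD_OFFSET; }

/* On a later request from a variant: decide who owns the descriptor. */
static bool is_platform_fd(int fd) { return fd >= FD_OFFSET; }

/* Recover the platform's real fd from a shifted one. */
static int variant_to_platform_fd(int fd) { return fd - FD_OFFSET; }
```

In the running example, "a.txt" opened by the platform as descriptor 5 reaches the variants as 261, while "b.txt" opened directly by the variants keeps descriptor 5, so the two can no longer collide.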

Tuesday, October 2, 2007

Low-overhead access to the memory space of a traced process, Part I

The contents of a traced process's memory can be read using ptrace, but ptrace reads only one word (a long int, 4 bytes on a 32-bit machine) at a time. So, if you want to read a big chunk of memory, you have to call ptrace many times. Moreover, ptrace is a system call, which means that every invocation causes a context switch into the kernel. Context switches are expensive operations that can take thousands of CPU cycles to complete. Therefore, in order to lower the overhead of our technique, we need a better way of reading the memory of a traced process.

Recent Linux kernels allow reading a traced process's memory by opening the /proc/PID/mem file, where PID is the pid of the traced process. This is only allowed while the traced process is suspended. Also, you should use the open syscall to open this pseudo file instead of fopen or other library functions.
A problem I found is that the read fails if the address falls into the memory range of the stack. I don't know whether this is intentional or just a bug, but either way it doesn't work.

Now, I am trying to find another low-overhead method to read from the memory of a traced process.

Saturday, September 15, 2007

getpid for syscall emulation

A solution to the problem that I talked about in my previous post is to use PTRACE_SYSCALL and have the child execute another system-call that doesn't change the state of the program, such as getpid, instead of the requested system-call.
Here is how it works. A child process calls a system-call that shouldn't be executed by the child, e.g. opening a file for writing. PTRACE_SYSCALL returns control to the parent process after the int instruction has executed but before the kernel runs the syscall. The parent replaces the syscall number (the eax register) with the getpid syscall number and lets the child continue. The kernel runs getpid instead of the open system-call and returns control to the parent at the end of the syscall. Now the parent replaces the child's registers with appropriate values (perhaps the values the parent obtained by running or emulating the system-call itself) and resumes the child. The child continues execution using the values the parent has given it.
Using this method, we can use PTRACE_SYSCALL and still emulate system-calls for a traced process.

Monday, September 10, 2007

Multi-Variant Execution Environment

After modifying gcc to generate executables with an upward-growing stack and porting a C library for this purpose, it is time to build the multi-variant execution environment. The purpose of this environment is to check that whenever one of the variants invokes a system call, all the others invoke the same system call with the same arguments. Moreover, the whole multi-variant environment should impersonate a single process. Therefore, some system calls have to be intercepted by the multi-variant environment and emulated rather than executed by all the variants. For example, when the variants try to open a file for writing, the file should be opened only once by the environment and the resulting descriptor sent back to all the variants.

Since I would prefer not to touch the Linux kernel, I decided to use ptrace with the PTRACE_SYSCALL request to intercept the system calls. But this request only stops the traced process and returns control to the tracer after the int instruction has executed; the system call itself still runs once the process resumes, which is not what we need. As mentioned before, some system calls should not be executed by the variants at all, and PTRACE_SYSCALL by itself gives us no way to prevent them from executing. Reading more about ptrace, I found that PTRACE_SYSEMU was what I was looking for. Unfortunately, it is platform specific and apparently works well only on i386. Furthermore, our investigation showed that it is not implemented in all Linux kernels. Thus, I am reluctant to use it.
So, the question remains open: "How can we intercept the syscalls and prevent the variants from executing them, with minimal performance overhead and without modifying the OS kernel?".

Monday, September 3, 2007

Sibling Calls in GCC

When we generate code for an upward-growing stack on the x86 target, the stack pointer (SP) is adjusted at the beginning of every function to bypass the return address (refer to this article). However, when a sibling function is called, control is transferred to it using a jump instruction rather than a "call", which means that no return address is pushed onto the stack. In these cases, we have to "anti_adjust" (decrement, in our case) the SP before the jump to compensate for the value added to the SP at the beginning of the function.

I first made gcc insert the code that decrements the SP when emitting a call to a sibling function, but that muddled the stack-offset computations used to restore saved registers. So I removed that code and added the adjustment to the epilogue expansion instead. This works flawlessly.

Friday, August 31, 2007

The holy GCC

After running the benchmarks compiled with llvm-gcc and making sure that the technique worked, it was time to modify gcc. In order to take advantage of gcc's optimizations, I changed my strategy a little and modified the RTL generation phase to generate code for an upward-growing stack. Modifying the code generation phase was still unavoidable, because our target was x86, which supports only a downward-growing stack, and some instructions had to be augmented with extra instructions to work properly when the stack grows upward.

One x86 instruction that is difficult to transform for an upward-growing stack is RET <16-bit integer>. This instruction pops the return address off the stack, internally increments the stack pointer (SP) by the value of its operand, and then jumps to the popped return address. When the stack grows upward, the stack pointer has to be decremented instead. If we simply replace this instruction with one that decrements the stack pointer followed by a plain RET (with no operand), the value that the RET reads is not the correct return address, because the stack pointer no longer points to the return address.

A solution to this problem is to use a short sequence of instructions that reads the return address into a temporary register, decrements the stack pointer, and jumps indirectly through the temporary register. We use ECX as the temporary register, because it is a volatile register that is assumed to be clobbered across a function call and it is not used to return values to the caller either.

The value subtracted from the stack pointer should be <RET operand> minus 4. The reason is that the stack pointer is decremented by 4 after every CALL instruction (refer to this article); that adjustment compensates for the amount a RET adds to the SP. Now that we remove the RET instruction, the SP should not be adjusted after CALLs to these functions. By subtracting 4 from the value, we cancel out the effect of the SP adjustment performed after the CALL instructions.

If you are interested in the details of instruction augmentation for reverse stack, you can take a look at this article.

Thursday, August 30, 2007

Compiler and Library

I modified LLVM to generate executables that write the stack in the opposite direction. It took me several weeks until I had a customized llvm-gcc that could generate reverse-stack executables that ran correctly. My approach was to modify the code generator, patch stack-manipulating instructions, and change the stack offset computations.

Modifying a compiler is not enough to produce reverse-stack executables; you also need a reverse-stack library to link against. It may seem that once we have the compiler, we can simply compile a library for reverse stack growth, but life is not always that easy! Every library contains a good amount of assembly code that has to be modified for the reverse stack. I picked dietlibc and modified its assembly files manually. Since my intent was to run the SPEC CPU benchmarks, I also had to modify some of the C sources to make the library compatible with the standard libc; without these modifications, runspec would fail during output verification.

In the end, the results exceeded my expectations: less than one percent performance reduction on average, compared to a normal executable!

Wednesday, August 29, 2007

The Project

I am trying to build a multi-variant execution environment which runs multiple variants of a single program on different processors/cores and monitors their outputs. Any divergence among the outputs raises an exception and interrupts the execution. The goal of this system is to make programs resilient against malware (viruses, internet worms, etc.).

More specifically, we are targeting buffer overflows and similar vulnerabilities, such as boundary errors and format string bugs, that give an attacker the opportunity to overwrite activation records (the return address or frame pointer) on the stack. Overwriting activation records can lead to malicious code execution.

To fight activation-record overwrites, we run two variants of the same program that write the stack in different directions. For example, on x86 the stack grows downward, and executables generated by compilers conform to this convention. We have modified a compiler to generate executables that write the stack upward. Running a normal executable alongside a reverse-stack executable in a multi-variant environment can disrupt all known buffer overflow exploits. The reason is that all inputs to the program are delivered identically to both variants, so an attacker cannot send a different attack vector to each variant. An attack vector that overwrites activation records in one variant has a different effect on the other, which causes the outputs to diverge. Output divergence is detected by our multi-variant environment and execution is interrupted.