- Open Access
An adaptive approach for Linux memory analysis based on kernel code reconstruction
© The Author(s) 2016
- Received: 6 January 2016
- Accepted: 14 June 2016
- Published: 27 June 2016
Memory forensics plays an important role in security and forensic investigations. Hence, numerous studies have investigated Windows memory forensics, and considerable progress has been made. In contrast, research on Linux memory forensics is relatively sparse, and the current knowledge does not meet the requirements of forensic investigators. Existing solutions are not especially sophisticated, and their complicated operation and limited treatment range are unsatisfactory. This paper describes an adaptive approach for Linux memory analysis that can automatically identify the kernel version and recovery symbol information from an image. In particular, given a memory image or a memory snapshot without any additional information, the proposed technique can automatically reconstruct the kernel code, identify the kernel version, recover symbol table files, and extract live system information. Experimental results indicate that our method runs satisfactorily across a wide range of operating system versions.
- Memory forensics
- Linux memory analysis
- Kernel symbol
The physical memory of a computer is highly useful but can be a challenging resource for the collection of digital evidence. Physical memory may first appear to be a large, amorphous, and unstructured collection of data. In fact, by examining a memory image, we can extract details of volatile data, such as running processes, logged-in users, current network connections, users’ sessions, drivers, and open files. Although criminals tend to avoid leaving any evidence in a computer’s persistent storage, it is extremely hard for them to completely remove their footprints from the memory. In some cases, physical memory is the only place where evidence can be found. In a computer operating system (OS) that boots and runs completely from CD-ROM, nearly all of the valuable information exists in the physical memory of the computer. Therefore, memory forensics is becoming increasingly important.
Before 2005, the physical memory of a computer was mainly captured to retrieve strings, e.g., passwords, credit card numbers, fragments of chat conversations, IP addresses, or email addresses. In 2005, the Digital Forensics Research Workshop (DFRWS) organized a memory analysis challenge . Since then, the capture and analysis of the content of physical memory, known as memory forensics, has become an area of intense research and experimentation . Numerous studies have analyzed Windows memory images. The MemParser tool enables an examiner to load a physical memory dump of certain Windows systems, reconstruct process information, and extract data relating to specific processes . PTFinder is a proof-of-concept implementation with the ability to reveal hidden and terminated processes and threads . A method based on the Kernel Processor Control Region (KPCR) structure in Windows was proposed to determine the OS version and realize the translation from virtual address to physical address . Moreover, technical details related to memory analysis have been discussed, including address translation [6, 7], pool allocation , swap integration , carving out memory [9, 10], sensitive information extraction [11, 12], and malicious code detection . In short, memory analysis has been used in the wider context of digital forensics, virtual machine introspection, and malware detection .
Compared with Windows memory forensics, memory analysis of Linux systems presents some practical challenges. Current Linux memory analysis technologies require precise knowledge of the OS edition and kernel symbol information, which is generated at compile time. Kernel symbol information varies with the different OS editions. Furthermore, the Linux kernel is highly configurable. During the kernel build process, users may specify a large number of different options through the kernel’s configuration system. These options affect the kernel symbol information, resulting in distinct key structures for the same OS version. To obtain kernel symbol information, one must configure an environment in which the OS version and configuration options are exactly the same as those in the target system. For incident response applications, obtaining precise and relevant information is currently a slow, manual process, which limits its usefulness in rapid triaging.
We present a multi-aspect approach to automatically identify the precise kernel version when provided with only a physical memory dump. The approach is universal, and does not rely on any prior knowledge for particular OSs.
We devise a set of novel techniques to obtain kernel symbols from the physical memory dump instead of obtaining symbol information from the target kernel’s “System.map” file. Each time a new kernel is compiled, various symbols are assigned different addresses. New kernel versions of Linux are released frequently, and it is inconvenient to find all System.map files.
As the symbols in the System.map file are important, and symbols exported from modules are critical for investigators, a method of parsing symbols exported from modules is presented. To recover and analyze loaded kernel module information, it is essential to understand the relevant data structures used by the target OS. As the inclusion or exclusion of a kernel configuration option can cause the insertion or removal of several members of key structures, analysis methods that rely on a stable key structure layout are inadvisable. A method to accurately and dynamically build representative data structures is also presented.
Based on the above techniques, we develop a new Linux memory analysis system named RAMAnalyzer that can identify the OS version and acquire symbol information automatically. Live system information can subsequently be retrieved. We examine the performance of RAMAnalyzer on various recent Linux kernels, and show that it is an adaptive solution for the Linux memory analysis problem.
The remainder of this paper is organized as follows. Section 2 introduces some background information. The proposed techniques based on dynamic reconstruction are described thoroughly in Sections 3 and 4. In Section 5, we evaluate the proposed forensics tool in terms of effectiveness and performance. The final section summarizes this study and states our conclusions and indicates some opportunities for future research in this area.
Kernel version identification (KVI): this component allows the OS version to be detected in two ways: linux_banner content identification and vmcoreinfo_data content identification.
kallsyms location symbol values recovery (KLSR): the symbol table file can be recovered from memory using kallsyms location symbols such as kallsyms_addresses, kallsyms_num_syms, kallsyms_names, kallsyms_token_table, and kallsyms_token_index. Based on kernel code reconstruction, this component provides a method for discovering the above symbol values.
Symbol table file recovery (STFR): using the kallsyms location symbol values obtained from KLSR, the symbol table file content and kernel symbol information can be recovered.
Live system information extraction (LSIE): several key kernel symbols are selected to extract live system information, such as process information and module information. Furthermore, the symbols exported from modules can be parsed.
Database information extension: the addresses of the symbols are identical for identical kernel versions and compile configurations. To improve system efficiency, a database records symbol information for known kernel versions. When the kallsyms location symbol values of new versions are recovered from KLSR, these values are saved in the database.
Step I: given a physical memory image, identify the precise kernel version of the target system. Using the KVI module, the kernel version and the values of _stext and swapper_pg_dir can be obtained.
Step II: check the database for preexisting symbol information. If the processed kernel version exists, extract the symbol addresses from the database and go to step IV; otherwise, go to step III.
Step III: recover kallsyms location symbol values using the KLSR module. New acquisition data are recorded in the database.
Step IV: after the acquisition of kallsyms location symbol values, the symbol table file is recovered using the STFR module.
Step V: extract the _stext symbol from the symbol table file and compare its value with that obtained from step I. If these two values are equal, the kallsyms location symbol values are correct; go to step VI. Otherwise, go to step III, adjust the kallsyms_address candidate value and retrieve the kallsyms location symbol value again.
Step VI: using the kernel symbols in the symbol table file, live system information can be extracted. In particular, symbols exported by loaded kernel modules are analyzed.
From the above description, we can see that our solution has an adaptive ability to cope with different kernel versions. More details are introduced in the next section.
In this section, we describe the detailed processes of kernel version identification, kallsyms location symbol values recovery, symbol table file recovery, and live system information extraction.
4.1 Kernel version identification
There are two ways to obtain kernel version information from the memory image: vmcoreinfo_data content identification and linux_banner content identification.
4.1.1 linux_banner content identification
The start_kernel() function, which is called by the startup_32() function, initializes all of the data structures needed by the kernel, enables interrupts, and creates another kernel thread named process 1. Finally, linux_banner information is printed in the following format:
const char linux_banner  =
“Linux version ” UTS_RELEASE “ (" LINUX_ COMPILE_BY “@”
LINUX_COMPILE_HOST “) (” LINUX_COMPILER “) ”UTS_VERSION “\n”;
By searching for the characteristic character “Linux version”, kernel version information can be obtained.
4.1.2 vmcoreinfo_data Content Identification
The kernel version in linux_banner and vmcoreinfo_data content should be the same. The latter contains information about the _stext, swapper_pg_dir, vmlist, mem_map, and init_uts_ns symbols, which are also stored in the symbol table files. The values of these symbols are virtual addresses. Generally, if swapper_pg_dir has a length value of 8, the OS is 32 bit and the physical address of swapper_pg_dir is its virtual address minus 0 ×c0000000. If swapper_pg_dir has a length value of 16, the OS is 64 bit and its physical address is its virtual address minus 0 ×ffffffff80000000. swapper_pg_dir is the page global directory (pdg) for a process named “swapper” and can be used to translate between the virtual address and the physical address in the kernel address space. This symbol name differs between architectures in the symbol table file, being called swapper_pg_dir on both ×86 and PPC64, but it is named init_level4_pgt on ×86_64.
4.2 Kallsyms location symbol values recovery
The algorithm for the kallsyms location symbol values recovery has four main steps:
Step I: Kallsyms_address candidate value scanning. Because _stext is one of the kernel symbols, the values obtained from the procedure described in Section 4.1.2 can be used to locate the kallsyms_addresses. During the search procedure, the value of _stext may be found in multiple places. To enhance the efficiency of the algorithm, we impose some restrictions. For 32-bit systems, for example, the content before and after the found address are the addresses of kernel symbols, and so the values should be greater than 0xc0000000. Tracing back from the found address, we can obtain the value of the startup_32 symbol, which can be calculated from _stext&0 ×ffff0000. The first symbol in /proc/kallsyms is generally startup_32, and this is where the address of the startup_32 symbol resides. This provides a candidate physical address for kallsyms_addresses. However, in 64-bit systems, there are several symbols before the startup_64 symbol, and so the test times are higher than for 32-bit systems.
Step II: The kernel code disassembly. To obtain the other four symbol values, the kernel code in memory must be disassembled correctly. Unfortunately, because the instructions for different systems and architectures are of various lengths, starting from the wrong instruction location will disassemble a completely different instruction sequence. To address this challenge, we decompile the smallest amount of kernel code, instead of the whole block of required function calls.
Analyzing the source code of a Linux kernel, it is clear that operations related to the kernel symbols are mainly present in Linux/kernel/kallsyms.c. The symbols that will be re-linked against their real values during the second link stage are defined below:
extern const unsigned long kallsyms_addresses _weak;
extern const u8 kallsyms_names _weak;
extern const unsigned long kallsyms_num_syms;
extern const u8 kallsyms_token_table _weak;
extern const u16 kallsyms_token_index _weak;
Once the principle of the update_iter function is fully understood, we can use the candidate value of kallsyms_addresses obtained from step I to obtain the values of the other four symbols.
An image from the 3.6.10-4.fc18.i686.PAE system is used to illustrate the method. The value of the kallsyms_addresses symbol is found at offset 0xc4 for the update_iter function’s binary code in the image. Therefore, we step back three bytes and disassemble the binary code. As mentioned above, our principle is to reduce the amount of disassembled code by as much as possible and improve the precision of our method. Approximately 0 ×2a bytes are chosen to be decompiled, and the results are as follows:
Through a combined analysis of the disassembled output and the update_iter function’s source code, the values of kallsyms_names, kallsyms_token_index, and kallsyms_token_table can be obtained. In 32-bit systems, the value of kallsyms_num_syms is the value of kallsyms_names minus 4.
There are some differences in 64-bit systems. Bits 0–31 of the symbol addresses are used, and the value of kallsyms_num_syms is the value of kallsyms_names minus 8. Although some changes take place in the binary code of the update_iter function for different Linux systems, the differences in the code segment used here are minimal.
4.3 Symbol table file recovery
After acquiring the kallsyms location symbols, the STFR method proceeds as follows:
The kallsyms_num_syms symbol points to the kernel symbol “num” in /proc/kallsyms. The kallsyms_addresses symbol points to the addresses of all kernel symbols in order. Each symbol address has a length of 4 in 32-bit systems and 8 in 64-bit systems. To obtain the corresponding name, the kallsyms_names, kallsyms_token_table, and kallsyms_token_index symbols are needed. The kallsyms_names symbol points to a list of length-prefixed byte arrays that encode indexes into the token table. According to the length, the bytes of each array are acquired and used to construct a substring. Finally, the substrings are joined together to form the type and name of the required symbol.
We again use an image from the 3.6.10-4.fc18.i686.PAE system to describe the above method.
The addresses of the required symbols are determined from the database:
c09e8c0c R kallsyms_addresses
c0a247b0 R kallsyms_num_syms
c0a247b4 R kallsyms_names
c0ad9210 R kallsyms_token_table
c0ad95a0 R kallsyms_token_index
To verify the correctness of the symbol values, the value of _stext obtained from the process described in this section is compared with that obtained in Section 4.1.2. If the two values are the same, the obtained symbols are available. If not, we recover the symbols using the algorithm described in Section 4.2.
4.4 Live system information extraction
After determining the kernel symbols, several key symbols are selected to obtain the live system information.
4.4.1 Gathering offsets of structure members
Code fragments 1–3 show how equivalent statements can be compiled to form radically different instruction sequences. The C source code in code fragment 1 is from the module_get_kallsym() function within /kernel/module.c of the Linux kernel source base. This function was used to help find the offset of the num_symtab, symtab, and strtab members of the struct module.
In the above code fragments, the constant 0 ×114 within the indexed instructions is the offset for the num_symtab member. As the methods used by compilers can be very different, all possible instruction formats for various architectures must be clarified. Code fragment 4 is again from the module_get_kallsym() function, and fragments 5–8 illustrate the disassembly of the instruction that accesses the strtab and symtab members of the module for different architectures.
The module_get_kallsym function is exported as a symbol to /proc/kallsyms, and its address can be obtained from the process described in Section 4.3. In this way, the state, name, module_core, and source_list members of module structures can be analyzed based on the kdb_lsmod function defined in /kernel/debug/kdb/kdb_main.c.
4.4.2 Process information extraction
Every process is represented by a structure named task_struct, which is defined in the /usr/src/linxu-2.4/include/linux/sched.h file. The init_task symbol corresponds to the task_struct structure address of the swapper, where the PID is zero. The task_struct structures of all active processes are doubly linked to each other. By traversing the double-linked list, all of the running processes can be identified. Moreover, the task_struct structure includes some objects that correspond to information regarding the current state of a process, such as struct mm_struct *mm, struct fs_struct *fs, struct files_struct *files, and struct thread_struct thread. Using these objects, we can obtain information on the memory management, file, and thread of the processes.
4.4.3 Module information extraction
Similar to the process information, all module structures are doubly linked to each other. By acquiring a module using the module symbol, the other modules can be identified from this doubly linked list.
To link a module, the sys_init_module() service initializes the syms and gpl_syms fields of the module object so that they point to the in-memory tables of symbols exported by the module. Some special kernel symbol tables are used by the kernel to store the symbols that can be accessed by modules with their corresponding addresses. These are contained in three sections of the kernel code segment: the _kstrtab section includes the names of the symbols, the _ksymtab section includes the addresses of the symbols, and the _ksymtab_gpl section includes the addresses of the symbols that can be used by the modules released under a GPL-compatible license. Only the kernel symbols actually used by some existing modules are included in the table. Linked modules can also export their own symbols so that other modules can access them. Although these symbols are critical during an investigation, they have largely been neglected in previous research. For instance, the vm_list symbol exported by the kvm module can be used to analyze the virtual machine information running on the current physical machine.
To obtain the exported symbols from the memory image, some objects of the module structure can be used, such as const struct kernel_symbol *syms, const struct kernel_symbol *gpl_syms, Elf_Sym *symtab, and Char *strtab. Among these objects, the symtab object is particularly important because it assists in the recovery of the symbol and string tables for kallsyms.
As for other system information, the rt_hash_mask, rt_hash_table, and net_namespace_list symbols are used to obtain information about the network configuration and current network connections; the boot_cpu_data symbol is used to obtain CPU information; the log_buf symbol corresponds to system log and debug information; and the iomem_resource symbol reflects the available physical address space of the target computer.
Based on the techniques described above, we developed a Linux memory analysis system named RAMAnalyzer. In this section, we present our experimental results. An experiment to test the effectiveness of RAMAnalyzer with 26 Linux kernels (from 2.6.18 to 4.2.0) is described in Section 5.1, and the performance of the proposed tool is reported in Section 5.2. All of our experiments were performed on a host machine with an Intel Core i5-4210U CPU, 4-GB memory, and a 64-bit Windows 7 OS.
The following memory images were chosen: DFRWS 2008 forensics challenge, volatility memory samples, and memory snapshots from virtual machines running on the VMware Workstation. The test flow and execution results of RAMAnalyzer are described below.
5.2 Performance evaluation
Sample of kernel versions used for testing RAMAnalyzer
Oracle Linux 5
The experimental results prove that RAMAnalyzer can deal with memory images from a wide range of kernel versions and demonstrate that its execution time is acceptable.
Based on kernel code reconstruction, this paper has proposed an adaptive approach for Linux memory analysis that can address a Linux memory image without information about the kernel version or System.map file. We implemented a prototype named RAMAnalyzer that is made up of five main components: kernel version identification, symbol table file recovery, kallsyms location symbol value recovery, live system information extraction, and database information extension. Our experimental results with a number of Linux memory images show that RAMAnalyzer can automatically identify the kernel version and recovery kernel symbols. Based on the kernel symbols, RAMAnalyzer then extracts live system information about the target system at the time of memory acquisition.
The ability to deal with memory images without precise kernel version information and symbol information.
The ability to identify the precise kernel version and recovery kernel symbols automatically, which means it can deal with memory images from different kernel versions. Furthermore, kernel symbol information obtained in this way is more accurate because symbol information from identical kernel versions can vary under different configuration options.
As well as kernel symbol information, RAMAnalyzer can acquire the symbols exported from modules, which play an important role in the investigation procedure.
Based on the above techniques, RAMAnalyzer has adaptability to deal with mainstream Linux kernel memory images and has high execution efficiency.
From the advantages of RAMAnalyzer, we can see that our solution can provide a solution for the challenge described in Section 1 and meet the need of scenarios described in Section 2.1. With the advent of mobile cloud computing, the development of Linux is accelerating and its security is becoming increasingly crucial. The techniques proposed in this paper provide forensics researchers with a starting point to delve into Linux memory forensics, which plays an important role in security and forensics investigations. Furthermore, these techniques can be conveniently embedded into other forensics frameworks.
To enhance the performance of RAMAnalyzer, the following research will be undertaken: first, owing to the differences in the kallsyms configurations of Linux kernel versions, there are various initial /proc/kallsyms. To improve the processing speed, it is essential to scan the kalsyms_address candidate values. Second, to improve the adaptive capacity of cloud environments, RAMAnalyzer was verified to be effective using memory images from the KVM host machine. However, further experiments are required to verify its efficacy on memory images from Xen host machines.
This research is supported by the National Natural Science Foundation of China (61472225), Shandong Natural Science Foundation (ZR2014FM003), and Shandong Province Outstanding Young Scientists Research Award Fund Project(BS2014DX007).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- DFRWS 2005 Forensics Challenge. http://www.dfrws.org/conferences/dfrws-usa-2005.
- L Wang, L Xu, S Zhang, Network connections information extraction of 64-bit windows 7 memory images. Forensics in Telecommunications, Information, and Multimedia. 56:, 90–98 (2010).View ArticleGoogle Scholar
- DFRWS 2005 Forensics Challenge: Memparser. http://sourceforge.net/projects/memparser.
- A Schuster, Searching for processes and threads in microsoft windows memory dumps. Digit. Investig. 3:, 10–16 (2006).View ArticleGoogle Scholar
- R Zhang, L Wang, S Zhang, Windows memory analysis based on kpcr. Fifth International Conference On Information Assurance and Security. 2:, 677–680 (2009).View ArticleGoogle Scholar
- JD Kornblum, Using every part of the buffalo in windows memory analysis. Digit Investig. 4(1), 24–29 (2007).View ArticleGoogle Scholar
- A Schuster, The impact of microsoft windows pool allocation strategies on memory forensics. Digit. Investig.5:, 58–64 (2008).View ArticleGoogle Scholar
- S Zhang, L Wang, R Zhang, Q Guo, Exploratory study on memory analysis of windows 7 operating system. 3rd International Conference On Advanced Computer Theory and Engineering (ICACTE). 6:, 373–377 (2010).Google Scholar
- B Dolan-Gavitt, Forensic analysis of the windows registry in memory. Digit. Investig.5:, 26–32 (2008).View ArticleGoogle Scholar
- R Van Baar, W Alink, A Van Ballegooij, Forensic memory analysis: Files mapped in memory. Digit. Investig.5:, 52–57 (2008).View ArticleGoogle Scholar
- S Hejazi, C Talhi, M Debbabi, Extraction of forensically sensitive information from windows physical memory. Digit. Investig.6:, 121–131 (2009).View ArticleGoogle Scholar
- Q Zhao, T Cao, Collecting sensitive information from windows physical memory. J. Comput.4(1), 3–10 (2009).MathSciNetView ArticleGoogle Scholar
- M Sikorski, A Honig, Practical Malware Analysis: the Hands-on Guide to Dissecting Malicious Software (no starch press, Scn Francisco, 2012).Google Scholar
- A Ligh, MH Case, J Levy, A Walters, The Art of Memory Forensics: Detecting Malware and Threats in Windows, Linux, and Mac Memory (John Wiley & Sons, 2014).Google Scholar
- Volatility: Linux Memory Forensics. https://code.google.com/archive/p/volatility/wikis/LinuxMemoryForensics.wiki.
- Rekall Memory Forensic Framework. http://www.rekall-forensic.com/.
- MI Cohen, Characterization of the windows kernel version variability for accurate memory analysis. Digit. Investig.12:, 38–49 (2015).View ArticleGoogle Scholar
- A Petroni, NL Walters, T Fraser, WA Arbaugh, Fatkit: A framework for the extraction and analysis of digital forensic data from volatile system memory. Digit. Investig.3(4), 197–210 (2006).View ArticleGoogle Scholar
- SecondLook Linux Memory Dump Samples. https://www.forcepoint.com/.
- A Case, L Marziale, GG Richard, Dynamic recreation of kernel data structures for live forensics. Digit. Investig.7:, 32–40 (2010).View ArticleGoogle Scholar
- T Haruyama, H Suzuki. >https://media.blackhat.com/bh-eu-12/Haruyama/bh-eu-12-Haruyama-Memory_Forensic-Slides.pdf.