
Writing a Primitive Debugger: Part 4 (Symbols)

December 11th, 2014

Up to now, we have developed a debugger that can attach to and detach from a process, set and remove breakpoints, print registers and a call stack, and modify control flow by changing the executing thread context. These are all pretty essential features of a debugger. The topic of this post, debug symbols, is more of a “nice-to-have”. An application may or may not ship with debug symbols, but in the event that it does, e.g. it’s your own application, the process of debugging becomes significantly simpler.

Debug Symbols

At its simplest definition, a debug symbol is a piece of information that shows how specific parts of a compiled program map back to the source level. For example, a debug symbol might give the name of a variable at a memory address, or indicate which line of code, and in which file, a series of assembly instructions maps to. Debug symbols are typically generated during debug builds and are used to provide some clarity to a developer who is debugging (or reverse engineering) a piece of code. There is no universal debug symbol format for a language, and formats may vary between compilers. On the modern Windows platform, debug symbols come in the form of Program Database (PDB) files, ending with a .pdb extension.

These files hold a lot of useful information about the compiled executable or DLL. As mentioned above, they can contain information regarding which source file and line number (or which object file) a symbol at a certain address maps to. They can contain the names and types of global, static, and local variables, as well as classes and structs. They can also contain information about compiler optimizations that were used when compiling the code. Some of these things may not be present if the code was compiled with stripped symbols. During a debugging session, the debugger will initialize a symbol handler, which begins looking for matching PDB files, searching recursively in common directories and/or user-specified directories, and parsing* the ones it finds. While a user is debugging, symbol information can then be retrieved, and names and source line numbers can be displayed to them (if available).
* This is a useful open source parser that can parse the proprietary format of PDB files.
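
To make the address-to-source mapping concrete, here is a minimal sketch in portable C++ of the kind of lookup that symbol information makes possible. The SymbolRecord and SymbolTable types are hypothetical and have nothing to do with the actual PDB format. The only point is that a symbol starts at an address, so resolving an arbitrary address means finding the nearest symbol at or below it and reporting the offset into it.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical record: what a debugger might keep per symbol.
struct SymbolRecord
{
    std::string name;
    std::string sourceFile;
    unsigned int line;
};

// Hypothetical in-memory symbol table keyed by the symbol's start address.
class SymbolTable
{
public:
    void Add(const std::uint64_t address, const SymbolRecord &record)
    {
        m_symbols[address] = record;
    }

    // Finds the nearest symbol starting at or below `address`.
    // On success, `displacement` receives the offset into that symbol.
    const SymbolRecord *Resolve(const std::uint64_t address, std::uint64_t &displacement) const
    {
        auto it = m_symbols.upper_bound(address);
        if (it == m_symbols.begin())
        {
            return nullptr; // address lies below every known symbol
        }
        --it;
        displacement = address - it->first;
        return &it->second;
    }

private:
    std::map<std::uint64_t, SymbolRecord> m_symbols;
};
```

A real symbol table would also record each symbol's size, so that an address past the end of the last function is not attributed to it. This sketch leaves that out.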

Implementation

Microsoft provides a very rich set of APIs for handling symbols through the DbgHelp API. There are functions to load/enumerate symbols for a module, find a symbol by name or address, enumerate source file and line references found in PDBs, dynamically add or remove entries from the symbol table, interact with symbol stores, and much more. Given the very large API, I’ve only chosen to demonstrate implementation of the more common features. One thing to consider is that all functions in the DbgHelp API set are single threaded. The example code only calls them from a single thread, but it has no synchronization to enforce that, so if you’re implementing something based on this code that uses more than one thread, make sure that you add synchronization around every DbgHelp call.
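
One way to add that synchronization is to funnel every symbol API call through a single lock. The following is a minimal sketch in standard C++ and is not part of the example code. The names g_symbolApiLock and WithSymbolLock are hypothetical, and the recursive mutex is an assumption made so that one guarded helper can call another without deadlocking.

```cpp
#include <mutex>

// One process-wide lock shared by every call into the symbol API.
// A recursive mutex lets a guarded helper call another guarded helper.
static std::recursive_mutex g_symbolApiLock;

// Runs `fn` while holding the lock and returns whatever `fn` returns.
template <typename Fn>
auto WithSymbolLock(Fn &&fn) -> decltype(fn())
{
    std::lock_guard<std::recursive_mutex> guard(g_symbolApiLock);
    return fn();
}
```

A call site might then wrap the API call in a lambda, for example WithSymbolLock([&] { return SymFromAddr(m_hProcess, dwAddress, &dwDisplacement, pSymInfo); }), so that no two threads are ever inside DbgHelp at the same time.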

Initializing a symbol handler is pretty straightforward: it merely involves calling SymInitialize. The function takes a process handle, which is opened by the debugger when it attaches. There is also a parameter for the user search path to locate PDB files, and a third parameter to specify whether the debugger is to enumerate all of the loaded modules in the process and load their symbols as well. For an attaching debugger, whether to enable this behavior depends on the situation. There are cases, such as the debugger creating the target process to debug, or delay-loaded DLLs, in which some symbols would not be loaded. Additionally, if this third parameter is set to true and the symbol handler is initialized prior to receiving all of the LOAD_DLL_DEBUG_EVENT events, then some symbols may not be loaded. The sample implementation therefore defaults it to false and loads the symbols for modules in the CREATE_PROCESS_DEBUG_EVENT and LOAD_DLL_DEBUG_EVENT event handlers. This ensures that the symbol files for every module are properly loaded.
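
The per-event loading described above can be pictured with a small, portable simulation. Everything in it is hypothetical: DebugEventKind stands in for the real debug event codes, ModuleTracker is not a class from the example code, and the place where a real debugger would load a module's symbols is only marked with a comment.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for the debug events named above.
enum class DebugEventKind { CreateProcess, LoadDll, UnloadDll, Other };

struct DebugEventInfo
{
    DebugEventKind kind;
    std::string imagePath;     // path of the executable or DLL, if any
    std::uint64_t baseAddress; // where the image is mapped
};

struct LoadedModule
{
    std::string imagePath;
    std::uint64_t baseAddress;
};

class ModuleTracker
{
public:
    // Returns true when the event introduced a module whose symbols should be loaded.
    bool OnDebugEvent(const DebugEventInfo &event)
    {
        switch (event.kind)
        {
        case DebugEventKind::CreateProcess: // the main executable
        case DebugEventKind::LoadDll:       // every DLL, including ones loaded late
            // A real debugger would load the module's symbols at this point.
            m_modules.push_back(LoadedModule{ event.imagePath, event.baseAddress });
            return true;
        default:
            return false;
        }
    }

    const std::vector<LoadedModule> &Modules() const { return m_modules; }

private:
    std::vector<LoadedModule> m_modules;
};
```

Because modules are registered as their events arrive, a DLL that is loaded long after the debugger attached is handled the same way as one that was present from the start.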

Prior to initializing the symbol handler, the SymSetOptions function should be called, which configures how and what information the symbol handler will load. Simply put into code, the initialization routine looks like the following:

Symbols::Symbols(const HANDLE hProcess, const HANDLE hFile, const bool bLoadAll /*= false*/)
    : m_hProcess{ hProcess }, m_hFile{ hFile }
{
    (void)SymSetOptions(SYMOPT_CASE_INSENSITIVE | SYMOPT_DEFERRED_LOADS |
        SYMOPT_LOAD_LINES | SYMOPT_UNDNAME);
 
    const bool bSuccess = BOOLIFY(SymInitialize(hProcess, nullptr, bLoadAll));
    if (!bSuccess)
    {
        fprintf(stderr, "Could not initialize symbol handler. Error = %X.\n",
            GetLastError());
    }
}

The options here specify that symbol searches will be case insensitive, that symbols won’t be loaded until a reference is made (not to be confused with the delay-loaded DLLs mentioned above), that line information will be loaded, and that symbols will be displayed in an undecorated form. Case insensitivity and undecorated names are there for convenience; it would be annoying to search for exact symbol names such as “?f@@YAHD@Z” otherwise.
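
If the debugger keeps its own name-to-address cache, the same convenience can be had there with a case-insensitive comparator. This is only a sketch of that idea in standard C++ and says nothing about how DbgHelp implements its option internally. NameCache and LookupAddress are hypothetical names, and only ASCII case is handled.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <map>
#include <string>

// Orders strings without regard to ASCII case.
struct CaseInsensitiveLess
{
    bool operator()(const std::string &lhs, const std::string &rhs) const
    {
        return std::lexicographical_compare(
            lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
            [](const unsigned char a, const unsigned char b)
            {
                return std::tolower(a) < std::tolower(b);
            });
    }
};

// Hypothetical name-to-address cache that ignores case on lookup.
typedef std::map<std::string, std::uint64_t, CaseInsensitiveLess> NameCache;

// Returns the cached address for `name`, or 0 if the name is unknown.
inline std::uint64_t LookupAddress(const NameCache &cache, const std::string &name)
{
    const auto it = cache.find(name);
    return it == cache.end() ? 0 : it->second;
}
```

With this comparator, a user who types MAINCRTSTARTUP at the prompt still finds the entry stored as mainCRTStartup.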

When the symbol handler is finished, i.e. the debugger is detaching from the process, a simple call to SymCleanup will terminate the symbol handler:

Symbols::~Symbols()
{
    const bool bSuccess = BOOLIFY(SymCleanup(m_hProcess));
    if (!bSuccess)
    {
        fprintf(stderr, "Could not terminate symbol handler. Error = %X.\n",
            GetLastError());
    }
}

That sets up the initialization and termination of the symbol handler. Time for everything in between.

Enumerating Symbols

One useful feature of a debugger might be to internally enumerate all symbols of a module. This can allow for storage and fast lookup at a later time, or it can allow for a graphical display for the user and easy navigation to a symbol’s address from its name. Enumerating symbols is a two-step process: first SymLoadModuleEx is called to load the symbol table for the module, then SymEnumSymbols can be called with the base address of the module. SymEnumSymbols takes a callback of type PSYM_ENUMERATESYMBOLS_CALLBACK as a parameter. This callback will be called for every symbol found in the module’s symbol table and will receive a SYMBOL_INFO structure that holds information about the symbol, such as its name, address, whether it is a register, what value it holds if it’s a constant, etc. Put into code, this is rather straightforward:

const bool Symbols::EnumerateModuleSymbols(const char * const pModulePath, const DWORD64 dwBaseAddress)
{
    DWORD64 dwBaseOfDll = SymLoadModuleEx(m_hProcess, m_hFile, pModulePath, nullptr,
        dwBaseAddress, 0, nullptr, 0);
    if (dwBaseOfDll == 0)
    {
        fprintf(stderr, "Could not load modules for %s. Error = %X.\n",
            pModulePath, GetLastError());
        return false;
    }
 
    UserContext userContext = { this, pModulePath };
    const bool bSuccess = 
       BOOLIFY(SymEnumSymbols(m_hProcess, dwBaseOfDll, "*!*", SymEnumCallback, &userContext));
    if (!bSuccess)
    {
        fprintf(stderr, "Could not enumerate symbols for %s. Error = %X.\n",
            pModulePath, GetLastError());
    }
 
    return bSuccess;
}
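
The callback itself is not shown above, so here is a portable sketch of the general pattern: a C-style callback that recovers its state from an untyped context pointer. The types and functions below (MiniSymbol, EnumerateSymbols, CollectContext, CollectCallback) are hypothetical stand-ins and not the DbgHelp declarations. The sketch assumes the usual contract that returning false from the callback stops the enumeration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the record handed to the callback.
struct MiniSymbol
{
    const char *name;
    std::uint64_t address;
};

// C-style callback: return false to stop the enumeration early.
typedef bool (*EnumCallback)(const MiniSymbol *symbol, void *userContext);

// Hypothetical enumerator that calls the callback once per symbol.
// Returns false if the callback asked to stop before the end.
inline bool EnumerateSymbols(const std::vector<MiniSymbol> &symbols,
    const EnumCallback callback, void *userContext)
{
    for (const MiniSymbol &symbol : symbols)
    {
        if (!callback(&symbol, userContext))
        {
            return false;
        }
    }
    return true;
}

// State the callback needs, passed through the untyped context pointer.
struct CollectContext
{
    std::string modulePath;
    std::vector<std::string> collected;
};

inline bool CollectCallback(const MiniSymbol *symbol, void *userContext)
{
    auto *context = static_cast<CollectContext *>(userContext);
    context->collected.push_back(context->modulePath + "!" + symbol->name);
    return true; // keep enumerating
}
```

Passing a small context structure rather than relying on globals is what allows the same callback to serve several modules, or several symbol handler instances, at once.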

Resolving Symbols

There are several ways to resolve symbols, but the two most common are by name and by address. This can be achieved by calling SymFromName and SymFromAddr respectively. Both of these populate a SYMBOL_INFO structure, just as calling SymEnumSymbols does. Invoking them is also rather straightforward:

const bool Symbols::SymbolFromAddress(const DWORD64 dwAddress, const SymbolInfo **pFullSymbolInfo)
{
    char pBuffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(char)] = { 0 };
    PSYMBOL_INFO pSymInfo = (PSYMBOL_INFO)pBuffer;
 
    pSymInfo->SizeOfStruct = sizeof(SYMBOL_INFO);
    pSymInfo->MaxNameLen = MAX_SYM_NAME;
 
    DWORD64 dwDisplacement = 0;
    const bool bSuccess = BOOLIFY(SymFromAddr(m_hProcess, dwAddress, &dwDisplacement, pSymInfo));
    if (!bSuccess)
    {
        fprintf(stderr, "Could not retrieve symbol from address %p. Error = %X.\n",
            (DWORD_PTR)dwAddress, GetLastError());
        return false;
    }
 
    fprintf(stderr, "Symbol found at %p. Name: %.*s. Base address of module: %p\n",
        (DWORD_PTR)dwAddress, pSymInfo->NameLen, pSymInfo->Name, (DWORD_PTR)pSymInfo->ModBase);
 
    *pFullSymbolInfo = FindSymbolByName(pSymInfo->Name);
 
    return bSuccess;
}
 
const bool Symbols::SymbolFromName(const char * const pName, const SymbolInfo **pFullSymbolInfo)
{
    char pBuffer[sizeof(SYMBOL_INFO) + MAX_SYM_NAME * sizeof(char)] = { 0 };
    PSYMBOL_INFO pSymInfo = (PSYMBOL_INFO)pBuffer;
 
    pSymInfo->SizeOfStruct = sizeof(SYMBOL_INFO);
    pSymInfo->MaxNameLen = MAX_SYM_NAME;
 
    const bool bSuccess = BOOLIFY(SymFromName(m_hProcess, pName, pSymInfo));
    if (!bSuccess)
    {
        fprintf(stderr, "Could not retrieve symbol for name %s. Error = %X.\n",
            pName, GetLastError());
        return false;
    }
 
    fprintf(stderr, "Symbol found for %s. Name: %.*s. Address: %p. Base address of module: %p\n",
        pName, pSymInfo->NameLen, pSymInfo->Name, (DWORD_PTR)pSymInfo->Address,
        (DWORD_PTR)pSymInfo->ModBase);
 
    *pFullSymbolInfo = FindSymbolByAddress((DWORD_PTR)pSymInfo->Address);
 
    return bSuccess;
}

Here, SymbolInfo is an extended structure that holds information about source files and line numbers (see example code).
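
Both functions rely on a buffer that is larger than the structure itself, because the symbol name is stored directly after the fixed-size fields. An alternative to a char array is an array of 64-bit words, which keeps the buffer aligned; the byte count then has to be rounded up before dividing. The sketch below shows that arithmetic on a hypothetical MiniSymbolHeader type rather than on the real SYMBOL_INFO, and kMaxName is an assumed capacity chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>
#include <vector>

// Hypothetical header with a trailing name, modelled on the layout described above.
struct MiniSymbolHeader
{
    std::uint32_t sizeOfStruct; // size of the fixed part only
    std::uint32_t maxNameLen;   // capacity of the name area, in characters
    std::uint32_t nameLen;      // characters actually written
    char name[1];               // first character; the rest follows in the buffer
};

const std::size_t kMaxName = 2000; // assumed capacity for this sketch

// Number of 64-bit words needed for the header plus kMaxName characters.
// The parentheses matter: the whole byte count is rounded up, then divided.
inline std::size_t WordsNeeded()
{
    const std::size_t bytes = sizeof(MiniSymbolHeader) + kMaxName * sizeof(char);
    return (bytes + sizeof(std::uint64_t) - 1) / sizeof(std::uint64_t);
}

// Writes `text` into the name area of a buffer sized by WordsNeeded().
inline MiniSymbolHeader *FillName(std::vector<std::uint64_t> &buffer, const char *text)
{
    auto *header = new (buffer.data()) MiniSymbolHeader();
    header->sizeOfStruct = sizeof(MiniSymbolHeader);
    header->maxNameLen = static_cast<std::uint32_t>(kMaxName);

    const std::size_t length = std::strlen(text);
    const std::size_t copied = length < kMaxName ? length : kMaxName;
    // Copy through the raw buffer so the write is not limited to name[1].
    char *nameArea = reinterpret_cast<char *>(buffer.data()) + offsetof(MiniSymbolHeader, name);
    std::memcpy(nameArea, text, copied);
    header->nameLen = static_cast<std::uint32_t>(copied);
    return header;
}
```

Since the name is not guaranteed to be null-terminated when it fills the whole area, printing it with an explicit length, as the %.*s format specifiers above do, is the safer choice.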

Testing the functionality

To test this functionality, we can take the sample program from the previous post (reproduced below) and see the difference in how call stacks look. This version adds the ability to resolve symbols for the addresses in the call stack. The debugger was also augmented with two new abilities: dumping all symbols from a module, and setting/removing breakpoints on a symbol by name.

#include <cstdio>
 
void d()
{
    printf("d called.\n");
}
 
void c()
{
    printf("c called.\n");
    d();
}
 
void b()
{
    printf("b called.\n");
    c();
}
 
void a()
{
    printf("a called.\n");
    b();
}
 
int main(int argc, char *argv[])
{
    printf("Addresses: \n"
        "a: %p\n"
        "b: %p\n"
        "c: %p\n"
        "d: %p\n",
        a, b, c, d);
 
    getchar();
    while (true)
    {
        a();
        getchar();
    }
 
    return 0;
}

Setting a breakpoint on the d function and printing the call stack shows how much more useful this version of the debugger is than the previous one. In the session below, the entered commands are a, s, d, and l, and the symbol-related lines (symbol name, symbol address, displacement, source file, and line number) are the new information.

a
[A]ddress or [s]ymbol name? s
Name: d
Received breakpoint at address 00401090.
Press c to continue or s to begin stepping.
l
Frame: 0
Execution address: 00401090
Stack address: 00000000
Frame address: 0018FDE8
Symbol name: d
Symbol address: 00401090
Address displacement: 0
Source file: c:\users\demo\desktop\demoapp\source.cpp
Line number: 4
Frame: 1
Execution address: 0040107C
Stack address: 00000000
Frame address: 0018FDEC
Symbol found at 0040107C. Name: c. Base address of module: 00400000
Symbol name: c
Symbol address: 00401060
Address displacement: 0
Source file: c:\users\demo\desktop\demoapp\source.cpp
Line number: 9
Frame: 2
Execution address: 0040104C
Stack address: 00000000
Frame address: 0018FE40
Symbol found at 0040104C. Name: b. Base address of module: 00400000
Symbol name: b
Symbol address: 00401030
Address displacement: 0
Source file: c:\users\demo\desktop\demoapp\source.cpp
Line number: 15
Frame: 3
Execution address: 0040101C
Stack address: 00000000
Frame address: 0018FE94
Symbol found at 0040101C. Name: a. Base address of module: 00400000
Symbol name: a
Symbol address: 00401000
Address displacement: 0
Source file: c:\users\demo\desktop\demoapp\source.cpp
Line number: 21
Frame: 4
Execution address: 004010EF
Stack address: 00000000
Frame address: 0018FEE8
Symbol found at 004010EF. Name: main. Base address of module: 00400000
Symbol name: main
Symbol address: 004010B0
Address displacement: 0
Source file: c:\users\demo\desktop\demoapp\source.cpp
Line number: 27
Frame: 5
Execution address: 004013A9
Stack address: 00000000
Frame address: 0018FF3C
Symbol found at 004013A9. Name: __tmainCRTStartup. Base address of module: 00400000
Symbol name: __tmainCRTStartup
Symbol address: 00401210
Address displacement: 0
Source file: f:\dd\vctools\crt\crtw32\dllstuff\crtexe.c
Line number: 473
Frame: 6
Execution address: 004014ED
Stack address: 00000000
Frame address: 0018FF8C
Symbol found at 004014ED. Name: mainCRTStartup. Base address of module: 00400000
Symbol name: mainCRTStartup
Symbol address: 004014E0
Address displacement: 0
Source file: f:\dd\vctools\crt\crtw32\dllstuff\crtexe.c
Line number: 456
Frame: 7
Execution address: 76AE919F
Stack address: 00000000
Frame address: 0018FF94
Symbol found at 76AE919F. Name: BaseThreadInitThunk. Base address of module: 00000000
Symbol name: BaseThreadInitThunk
Symbol address: 76AE9191
Address displacement: 0
Source file: (null)
Line number: 0
Frame: 8
Execution address: 77430BBB
Stack address: 00000000
Frame address: 0018FFA0
Symbol found at 77430BBB. Name: RtlInitializeExceptionChain. Base address of module: 00000000
Symbol name: RtlInitializeExceptionChain
Symbol address: 77430B37
Address displacement: 0
Source file: (null)
Line number: 0
Frame: 9
Execution address: 77430B91
Stack address: 00000000
Frame address: 0018FFE4
Symbol found at 77430B91. Name: RtlInitializeExceptionChain. Base address of module: 00000000
Symbol name: RtlInitializeExceptionChain
Symbol address: 77430B37
Address displacement: 0
Source file: (null)
Line number: 0
StackWalk64 finished.

This looks much more useful compared to just getting absolute addresses as in the previous version. Here, for some symbols, the source files can be found on the host machine and presented to the user alongside the raw assembly. Additionally, symbols can be printed for any module, as shown below:

y
Enter in module name to dump symbols for: kernel32.dll
Symbol name: QuirkIsEnabledWorker
Symbol address: 76AE0010
Address displacement: 0
Source file: (null)
Line number: 0
Symbol name: EnumCalendarInfoExEx
Symbol address: 76AE03BD
Address displacement: 0
Source file: (null)
Line number: 0
Symbol name: GetFileMUIPath
Symbol address: 76AE03CE
Address displacement: 0
Source file: (null)
Line number: 0
...

That concludes the topic of symbols. The implementation presented here only scratches the surface of what is available in the DbgHelp API, and I recommend that those interested further explore the MSDN documentation on the topic. The next article will conclude the series with a collection of miscellaneous features that debuggers typically possess. It will probably include the ability to step over code (step into is currently implemented), present a disassembly listing to the user for x86 and x64, and allow for modification of arbitrary memory, instead of just registers and/or a thread context.

Article Roadmap
Future posts will closely follow the topics below:

  • Basics
  • Adding/Removing Breakpoints, Single-stepping
  • Call Stack, Registers, Contexts
  • Symbols
  • Miscellaneous Features

The full source code relating to this can be found here. C++11 features were used, so MSVC 2012/2013 is most likely required.
