18 January 2009

Debug Symbols for MOSA #4 - PDB File Format

I'll continue my series with this post about the PDB file format. Several places on the net already describe the format, but I've found inaccuracies in most of them, probably because the format itself is not open and everyone interprets the data they see. So take this post with a grain of salt, too. I'm going to talk about managed PDB files explicitly, so expect some differences compared to native code PDB files.

Test Code

Before we dive into the format itself, I want to show you the sample code I'm going to use to describe it. The following C# source is compiled with debug support in order to generate a PDB file.

    using System;

    namespace PdbExample
    {
        class Program
        {
            static void Main(string[] args)
            {
                Console.WriteLine("Hello World!");
                for (int i = 0; i < 10; i++)
                    Console.WriteLine("Count: {0}", i);
            }
        }
    }

The examples shown in this series use the C# compiler that ships with .NET 3.5 (C# 3.0); however, I believe the format hasn't changed since C# 1.0.

PDB Header

PDB files start with a pretty large header of 32 bytes, which can be used to identify the file. The following dump shows a header of the resulting PDB file:

    00000000 4D 69 63 72 6F 73 6F 66 74 20 43 2F 43 2B 2B 20  Microsoft C/C++
    00000010 4D 53 46 20 37 2E 30 30 0D 0A 1A 44 53 00 00 00  MSF 7.00...DS...

The header has some interesting properties. First, it is an ASCII string that contains a line break, followed by the byte 1A (Ctrl-Z, the traditional DOS end-of-file marker), and is zero terminated. This makes it possible to pass a PDB file to the DOS command 'type' and see the version of the PDB file we have:

    D:\My Projects\Tests\CSharpConsoleBlog\CSharpConsoleBlog>type Program10.pdb
    Microsoft C/C++ MSF 7.00

    D:\My Projects\Tests\CSharpConsoleBlog\CSharpConsoleBlog>

You don't get garbage displayed on the screen, but you still get some valuable information from the file itself. The two letters DS in the header are the initials of Dan Spalding, who owned the linker and much of the PDB code for many years according to Andy Pennell.
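As a rough sketch, a reader could check this signature before touching anything else. The class and method names below are made up for illustration; the expected bytes are simply the 32 bytes from the dump above.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any official API.
static class PdbSignature
{
    // The 32 header bytes from the dump: the text, CR LF, 0x1A, "DS" and three zero bytes.
    static readonly byte[] Expected =
        Encoding.ASCII.GetBytes("Microsoft C/C++ MSF 7.00\r\n\u001ADS\0\0\0");

    // Returns true if the stream starts with the 32-byte MSF 7.00 signature.
    public static bool HasMsf7Signature(Stream stream)
    {
        byte[] actual = new byte[Expected.Length];
        int read = 0;
        while (read < actual.Length)
        {
            int n = stream.Read(actual, read, actual.Length - read);
            if (n == 0)
                return false; // the file is shorter than the signature
            read += n;
        }

        for (int i = 0; i < actual.Length; i++)
        {
            if (actual[i] != Expected[i])
                return false;
        }
        return true;
    }
}
```

Comparing all 32 bytes, including the trailing DS and zero bytes, keeps the check strict; a more lenient reader might only compare the readable part.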

Next, a structure starts at byte 0x00000020. It holds the settings and information a PDB reader needs to find its way around the file:

    Field       Size (bytes)  Meaning
    pageSize    4             The size of a page in the file.
    bitmapPage  4             The page number of the bitmap page.
    filePages   4             The number of pages in the file.
    rootBytes   4             The number of bytes in the root stream.
    reserved    4             Unused as far as I know.
    indexPage   4             The page number of the index page.

Ok, so a PDB file is divided into fixed size pages (size in the pageSize field) and there's a bitmap that specifies if a page is in use or not. Sounds familiar? Well, yes it's the same strategy as used for the FAT file system, OLE compound files and in a lot of other areas.

The filePages field can be used to make sure the PDB file is completely available: multiply pageSize by filePages and you should get the size of the PDB file.
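Reading this structure could look like the following sketch. The class is hypothetical; the field names are taken from the table above, and it assumes the fields are stored as little-endian 32-bit integers directly after the 32-byte signature.

```csharp
using System;
using System.IO;

// Hypothetical container for the six header fields described above.
sealed class PdbHeader
{
    public const int SignatureLength = 32;

    public int PageSize;
    public int BitmapPage;
    public int FilePages;
    public int RootBytes;
    public int Reserved;
    public int IndexPage;

    // Reads the six 4-byte fields that start at offset 0x20.
    // Assumption: little-endian layout (BinaryReader always reads little-endian).
    public static PdbHeader Read(Stream stream)
    {
        stream.Seek(SignatureLength, SeekOrigin.Begin);
        BinaryReader reader = new BinaryReader(stream);

        PdbHeader header = new PdbHeader();
        header.PageSize = reader.ReadInt32();
        header.BitmapPage = reader.ReadInt32();
        header.FilePages = reader.ReadInt32();
        header.RootBytes = reader.ReadInt32();
        header.Reserved = reader.ReadInt32();
        header.IndexPage = reader.ReadInt32();
        return header;
    }

    // The completeness check from the text: pageSize * filePages == file size.
    public bool MatchesFileLength(long fileLength)
    {
        return (long)PageSize * FilePages == fileLength;
    }
}
```

A caller would typically open the file, call PdbHeader.Read and then pass the stream's Length to MatchesFileLength before trusting any page number.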

PDB Root Stream

Using the fields we have so far, we still can't unlock the contents of the PDB file. To get there we need to combine the rootBytes and indexPage fields. The indexPage field points to a page that contains the page numbers of the root stream: an array of 4-byte page numbers listing, in order, the pages that hold the contents of the root stream. To determine the number of entries in the index, divide rootBytes by pageSize and round up, since the last page of the root stream may be only partially filled.
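The lookup described here can be sketched as follows. The names are hypothetical, and the sketch assumes that page N starts at file offset N * pageSize, that page numbers are little-endian, and that the whole index fits into the single page indexPage points to.

```csharp
using System;
using System.IO;

// Hypothetical helper for locating the pages of the root stream.
static class PdbRootIndex
{
    // Number of pages needed to hold byteCount bytes (division rounded up).
    public static int PagesNeeded(int byteCount, int pageSize)
    {
        return (byteCount + pageSize - 1) / pageSize;
    }

    // Reads the array of 4-byte page numbers stored on the index page.
    public static int[] ReadRootPages(Stream stream, int pageSize, int indexPage, int rootBytes)
    {
        int count = PagesNeeded(rootBytes, pageSize);

        // Assumption: page N starts at byte offset N * pageSize.
        stream.Seek((long)indexPage * pageSize, SeekOrigin.Begin);
        BinaryReader reader = new BinaryReader(stream);

        int[] pages = new int[count];
        for (int i = 0; i < count; i++)
            pages[i] = reader.ReadInt32();
        return pages;
    }
}
```

For example, a root stream of 20 bytes with a page size of 16 needs two index entries, not one, which is why the division has to round up.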

If you read all pages in the order given by the index, you've read the root stream. The root stream tells us what is contained in the file. It starts with 4 bytes that give the total number of streams in the file. An array of stream lengths follows the stream count, i.e. each entry in this array is the length of the corresponding stream. What follows next is a page index for all streams, i.e. an array of page numbers for stream #1, an array of page numbers for stream #2 etc. The number of entries for each stream is again the length of that stream divided by the page size, rounded up.
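Put together, reading a stream from its page list and decoding the root stream might look like this sketch. All names are hypothetical. It assumes 4-byte little-endian stream lengths and page numbers, and it treats a negative length as "no pages", which is a guess on my side rather than something described above.

```csharp
using System;
using System.IO;

// Hypothetical helper for reading paged streams and the root stream layout.
static class PdbRootStream
{
    // Concatenates the given pages, in order, into a buffer of 'length' bytes.
    public static byte[] ReadPages(Stream file, int pageSize, int[] pages, int length)
    {
        byte[] data = new byte[length];
        int offset = 0;
        foreach (int page in pages)
        {
            int chunk = Math.Min(pageSize, length - offset);
            if (chunk <= 0)
                break;
            file.Seek((long)page * pageSize, SeekOrigin.Begin);
            int read = 0;
            while (read < chunk)
            {
                int n = file.Read(data, offset + read, chunk - read);
                if (n == 0)
                    throw new EndOfStreamException();
                read += n;
            }
            offset += chunk;
        }
        return data;
    }

    // Parses the root stream: stream count, stream lengths, then one page list per stream.
    public static int[][] ReadStreamPages(byte[] rootStream, int pageSize, out int[] streamLengths)
    {
        BinaryReader reader = new BinaryReader(new MemoryStream(rootStream));
        int streamCount = reader.ReadInt32();

        streamLengths = new int[streamCount];
        for (int i = 0; i < streamCount; i++)
            streamLengths[i] = reader.ReadInt32();

        int[][] pages = new int[streamCount][];
        for (int i = 0; i < streamCount; i++)
        {
            // Assumption: a negative length means the stream occupies no pages.
            int length = Math.Max(streamLengths[i], 0);
            int pageCount = (length + pageSize - 1) / pageSize;
            pages[i] = new int[pageCount];
            for (int j = 0; j < pageCount; j++)
                pages[i][j] = reader.ReadInt32();
        }
        return pages;
    }
}
```

The same ReadPages routine serves for the root stream and for every other stream, since both are just ordered lists of page numbers plus a byte length.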

The root stream basically gives us an index of all available streams and how they're spread across the PDB file.

Starting with the next posting, I'll dive into the important streams in order.


17 January 2009

Updated my blog style again...

Ok, this is going to be the last update for some time. I'll add some more categories for postings I've planned, but this will be it - unless I find some more bugs somewhere.

16 January 2009

Debug Symbols for MOSA #3 - Accessing PDB files

In the last post, I wrote about the debug symbol formats used by Microsoft in recent years. This post tells you where to look for official APIs to access these files. If the world consisted only of Windows, we could stop here: we wouldn't need to understand the file format itself or be able to read the files without the APIs described below. However, the world isn't a monoculture, so I'll stick to my goal of describing the PDB format in the next posts.

Essentially Microsoft only makes four APIs available to access debugging symbols:
  • Image Helper Library
  • Debug Help Library
  • Debug Interface Access SDK

and the only remaining API is in the .NET System.Diagnostics.SymbolStore namespace in mscorlib.

Of all of these APIs, the Image Helper Library provides the most features, followed by the Debug Help Library. The latter is used by the Microsoft debuggers to load symbol information. While both of these libraries are regular Win32 DLLs with WINAPI entry points, the Debug Interface Access SDK provides COM objects to access the contents of symbol files. The library is very easy to work with.

As one can easily see these three options don't work outside the Microsoft world, well they don't except for maybe Wine or ReactOS. My first hope was that using the System.Diagnostics.SymbolStore namespace would be sufficient for our purpose of retrieving the symbol information, but again this quickly makes the code a Windows only option.

I've taken the following snippet from Mike Stall's .NET Debugging Blog. It comes from the sample code of the PDB2XML tool, which uses ISymbolReader to read a PDB file and writes its contents to an XML file.

    // We demand Unmanaged code permissions because we're reading from the file system and calling out to the Symbol Reader
    // @TODO - make this more specific.
    [System.Security.Permissions.SecurityPermission(System.Security.Permissions.SecurityAction.Demand,
        Flags = System.Security.Permissions.SecurityPermissionFlag.UnmanagedCode)]
    public static ISymbolReader GetSymbolReaderForFile(SymbolBinder binder, string pathModule, string searchPath)
    {
        // Guids for imported metadata interfaces.
        Guid dispenserClassID = new Guid(0xe5cb7a31, 0x7512, 0x11d2, 0x89, 0xce, 0x00, 0x80, 0xc7, 0x92, 0xe5, 0xd8); // CLSID_CorMetaDataDispenser
        Guid dispenserIID = new Guid(0x809c652e, 0x7396, 0x11d2, 0x97, 0x71, 0x00, 0xa0, 0xc9, 0xb4, 0xd5, 0x0c); // IID_IMetaDataDispenser
        Guid importerIID = new Guid(0x7dac8207, 0xd3ae, 0x4c75, 0x9b, 0x67, 0x92, 0x80, 0x1a, 0x49, 0x7d, 0x44); // IID_IMetaDataImport

        // First create the Metadata dispenser.
        object objDispenser;
        NativeMethods.CoCreateInstance(ref dispenserClassID, null, 1, ref dispenserIID, out objDispenser);

        // Now open an Importer on the given filename. We'll end up passing this importer straight
        // through to the Binder.
        object objImporter;
        IMetaDataDispenser dispenser = (IMetaDataDispenser)objDispenser;
        dispenser.OpenScope(pathModule, 0, ref importerIID, out objImporter);

        IntPtr importerPtr = IntPtr.Zero;
        ISymbolReader reader;
        try
        {
            // This will manually AddRef the underlying object, so we need to be very careful to Release it.
            importerPtr = Marshal.GetComInterfaceForObject(objImporter, typeof(IMetadataImport));

            reader = binder.GetReader(importerPtr, pathModule, searchPath);
        }
        finally
        {
            if (importerPtr != IntPtr.Zero)
            {
                Marshal.Release(importerPtr);
            }
        }
        return reader;
    }

Ouch, that was a lot of code just to get the symbol reader - but wait a minute, what are the Guids doing there and the CoCreateInstance call? This screams trouble for cross platform code... It turns out that ISymbolReader is not useful without an object, which implements the IMetadataImport interface. This is a COM interface implemented by mscoree.dll, the Microsoft .NET Runtime Execution Engine. And you can't get an ISymbolReader without a SymBinder, which is not even defined in the namespace. These GUIDs and the COM classes are defined in ISymWrapper.dll, a COM interop assembly.

But that's not all: the ISymbolBinder1 interface (don't ask, look up ISymbolBinder and ISymbolBinder1 and figure out the reason for ISymbolBinder1) uses an IntPtr to access this IMetadataImport interface. Essentially you're passing a COM interface in an unmanaged pointer (or native int in CIL speak) to another unmanaged COM object.

Somehow this is messed up. Really messed up. It looks like symbol information was an afterthought in the development of .NET and hasn't received any priority ever since .NET 1.0 - this mess has been this way since .NET 1.0 was released. I hope that things get better with .NET 4.0, but for some reason I doubt that.

Now we've collected lots of unusable APIs and we still can't read those PDB files anywhere outside of Windows. However this gives us something else: All of those APIs have documented some structures to pass symbol information to the calling application. These structures are very likely to be similar to what is stored on disk - at least these APIs give us some hints that the format is more complex than one might think. And finally the .NET namespace gives us some design guidelines to realize a PDB reader/writer using plain .NET.

More in the next post.

New blog design

Ok, another blog design. The green didn't really look good so now I've been looking on the web for a better blog template and have found some nice ones made by Andreas Viklund. Thank you for sharing these templates.

15 January 2009

Debug Symbols for MOSA #2 - Debug Symbol Formats

This continues the series of Debug Symbols for MOSA, started in Why we need them.

As the title says, there's a multitude of debug symbol formats out there. It usually depends on the operating system, the compiler, compiler switches and linker of choice what kind of symbols are emitted or even worse if any at all. I'm only going to talk about the Microsoft formats, as these are the ones I've actively been working with.

Microsoft has created a multitude of symbol formats in the past, and even PDB files exist in multiple formats. This post sheds some light on which debug symbols are out there.

The general history for Microsoft Symbol Formats is that there are mainly three kinds:
  • Pre-CodeView
  • CodeView
  • PDB
I will not dive into the Pre-CodeView era mainly because I don't have much knowledge about it.

So why does Microsoft change the symbol format all the time? The answer was given by Matt Pietrek: as the compilers and debuggers advanced, more information was stored in the file. Some changes were made due to the 16->32 bit transition, but most changes can be attributed to advances in the debugger. Edit and Continue is an example of this.

CodeView

CodeView is a format developed by Microsoft along with the CodeView debugger, which was later integrated into Microsoft C, which in turn became the Visual Studio we know today. There are several revisions of CodeView, which adapt the format to the specific compiler version in use.

There's even a public specification for the CodeView format available in various places on the internet.

The CodeView format has been stored in various containers (files) over the years, namely the *.dbg files up to Windows 2000. It is still in use today in the *.pdb files emitted by Visual Studio compilers since around 1997 and, more importantly for MOSA, it is also emitted by the .NET compilers.

PDB

PDB files have been in use for quite some time now, but even this file format has gone through at least three transitions. There's at least one format for managed symbols produced by csc, vbc and the other .NET compilers - yes, another format. Again.

CILDB

Microsoft has submitted the Common Language Infrastructure to ECMA for standardization. The latest standardized edition I'm aware of is ISO/IEC 23271, published on 2006-10-01. Partition V of this standard defines a Debug Interchange Format, specifically called CILDB. The specification is available for download.

The specification introduction says this:
Portable CILDB files provide a standard way to interchange debugging information between CLI producers and consumers. This partition serves to fill in gaps not covered by metadata, notably the names of local variables and source line correspondences.

Even though Microsoft has pushed this format as part of the specification, no Microsoft tool included with the .NET Framework SDK, the Framework itself or Visual Studio is able to generate or consume these files. So the interchange aspect of this standard is not realized. There are both open and closed source apps, that are able to convert Microsoft PDB files to CILDB - all with a drawback I'll talk about in the next post.

More in the next post. However one last format for managed code remains:

MDB

This is the mono debug format - it is used by mdb and MonoDevelop. There's integrated support in mono using a SymbolWriter/SymbolReader to produce and consume these files from managed code. Talk about fun!

The MDB option is definitely one we should follow to debug applications on MOSA, but it is not one we are able to use for kernel debugging or debugging native code.

Basically this means that mosacl (our ahead of time compiler) must be able to read PDB, MDB and CILDB files in order to map the source code to appropriate places in the native code. Again - more in another post.

14 January 2009

Debug Symbols for MOSA #1 - Why we need them

The MOSA compiler is nearing its 0.1 alpha release and one of the things that has been bugging me since the start was creating debug symbols for compiled assemblies. The MOSA compiler converts CIL assemblies to native code for a specific target architecture. In the process however the mapping of source code to native code gets lost, unless the compiler is able to create new symbol information to map the native code back to the managed source code.

There are various reasons that support is needed for symbols, one of them is that it makes kernel debugging a whole lot easier if the debugger allows stepping in the source code and variable inspection (the Visual Studio experience.) The other point is that various .NET APIs allow creating debug symbols (CodeDOM or Reflection.Emit) or inspecting them using the System.Diagnostics.SymbolStore namespace.

As I'll explain in later posts, this is no easy task, but one that'll raise the productivity of kernel development a lot.