Clang Lex

Lexer

In the C++ standard, translation is divided into multiple phases, the first several of which make up lexical analysis:

  1. Phase 1: each byte of the source file is mapped from the source file's encoding into the implementation's source character set. This mainly covers platform-specific newline translation, universal-character-name (UCN) mapping, and trigraph mapping.

  2. Phase 2: every line ending in a backslash (`\`) is spliced together with the line that follows. The splice is applied only once per line: a line ending in `\\` followed by an empty line is not spliced again onto the line after that. Forming a UCN as a result of line splicing is undefined behavior, and so is a source file whose last line ends in `\`.

  3. Phase 3: the source is decomposed into comments, whitespace, and preprocessing tokens. The preprocessing tokens are:

    1. header names following #include

    2. identifiers

    3. preprocessing numbers

    4. character literals and string literals, including user-defined ones

    5. operators and punctuators, including the alternative tokens

    6. any remaining characters that fit none of the categories above

    Note that C++11 introduced raw string literals, inside which the transformations performed in phase 1 and phase 2 must be reverted.

  4. Phase 4: the preprocessor executes. Every #include'd file recursively goes through phases 1-4. Once this phase completes, the file contains no further preprocessing directives.

  5. Phase 5: character and string literals are converted to the execution character set; escape sequences are resolved at this point.

  6. Phase 6: adjacent string literals are concatenated.

  7. Phase 7: each token is converted into a grammar token according to the syntax rules, and the whole becomes a translation unit.

  8. Phase 8: each translation unit instantiates the templates it uses, becoming an instantiation unit.

  9. Phase 9: instantiation units, translation units, and the referenced library components are combined into an executable image.

Of these nine phases, the first six relate to lexical analysis; the sections below go through them.

Phase 1-2: character handling and line splicing

The first thing the lexer must deal with is the character set. The current standard distinguishes the following kinds of character handling:

  1. Plain ASCII characters.
  2. UCNs (universal character names), used to denote characters outside the basic set, written `\uXXXX` or `\UXXXXXXXX`, where each X is a hexadecimal digit; `\u` takes four hex digits (two bytes) and `\U` eight (four bytes), and `\uXXXX` is equivalent to `\U0000XXXX`. The value is further restricted: it must not fall within 0xD800-0xDFFF, because those are the surrogate code points. For the full requirements on UCNs see ISO/IEC 10646. The corresponding code is in LiteralSupport.cpp, in clang::expandUCNs and ProcessUCNEscape. Clang decodes a UCN into a uint32_t, i.e. UTF-32, which later still needs converting to UTF-8/16 and so on; those functions, MeasureUCNEscape and EncodeUCNEscape, are also in LiteralSupport.
  3. Trigraphs: three-character sequences starting with ??, whose last character has only nine valid choices; the three characters together convert into a single new character. Trigraphs are deprecated and were finally removed in C++17, having been kept alive that long largely for IBM's compatibility needs. The relevant code is GetTrigraphCharForLetter in Lexer.cpp, and it is very simple.

Character handling itself lives in Lexer.h, in getAndAdvanceChar and ConsumeChar; the functions involved are:

  1. isObviouslySimpleCharacter — checks whether a character is `?` or `\`, since only sequences starting with one of these can trigger trigraph, UCN, or line-splice processing.

  2. getAndAdvanceChar — reads the next character from the buffer. If it is a simple character (i.e. isObviouslySimpleCharacter returns true), it is returned directly; otherwise getCharAndSizeSlow handles the possible multi-character cases. Once done, the buffer pointer advances to the next character.

  3. ConsumeChar — much the same as the previous function.

  4. getCharAndSize — only peeks at the next character without moving the buffer pointer; the logic is the same as above, starting with the obviously-simple check.

  5. getEscapedNewLineSize — measures the whitespace between a `\` and the following newline; such whitespace triggers a warning.

  6. getCharAndSizeSlow — handles possible line splices and trigraphs. After a splice, the next character may again be an escaped newline, so the function calls itself recursively. Trigraph handling is delegated to DecodeTrigraphChar, but a trigraph can itself involve a line splice (even though the standard declares that undefined), so a goto is used to avoid re-entering trigraph handling after a splice.

  7. DecodeTrigraphChar — essentially a switch statement over the conversion table, plus a trigraph-usage warning depending on the language options.

  8. tryReadUCN — reads a UCN and checks that the resulting code point is valid; internally it calls getCharAndSize to cope with line splices, even though per the standard a UCN produced by line splicing is undefined behavior.

Phase 3: basic lexing

Literal Support

The code for numeric literals, character literals, and string literals all lives in LiteralSupport.

Numeric literals

Numeric literal handling is defined in NumericLiteralParser in LiteralSupport. You construct one from the start position and the preprocessor, and the constructor performs the parse automatically. It also defines ParseNumberStartingWithZero for numbers whose first digit is 0, covering base-2, 8, and 16 integers and some floating-point forms. The constructor does not compute the value directly; it records format information: whether a decimal point or exponent was seen, whether the suffix is valid (including user-defined suffixes), and so on. Obtaining the actual value requires calling GetIntegerValue or GetFloatValue. GetFloatValue does little work itself, delegating the numeric text to APFloat::convertFromString.

Integer handling includes an optimization that avoids overflow checking:

static bool alwaysFitsInto64Bits(unsigned Radix, unsigned NumDigits)
{
    switch (Radix)
    {
    case 2:
        return NumDigits <= 64;
    case 8:
        return NumDigits <= 64 / 3; // Digits are groups of 3 bits.
    case 10:
        return NumDigits <= 19; // floor(log10(2^64))
    case 16:
        return NumDigits <= 64 / 4; // Digits are groups of 4 bits.
    default:
        llvm_unreachable("impossible Radix");
    }
}

This function gives, for each radix, a digit count below which the value can never overflow 64 bits. When computing the value, the parser first checks whether the literal's length is within the bound for its radix: if so, the accumulated value is returned directly; only if not does every accumulation step check for overflow. Since the integers we write are mostly small, this avoids a great deal of overflow checking.

Character and string literals

Character literals also get a parser object, CharLiteralParser. The main things to watch for are multibyte characters and characters containing UCNs; otherwise there is little of note.

String handling is considerably more involved. It all sits in StringLiteralParser's init function, which deals with user-defined literal suffixes, in-string escapes, multibyte encodings, raw strings, and the other cases.

Lex helper functions

The lexer defines the following helper functions, each of which lexes the remainder of a token of a specific type:

    // Helper functions to lex the remainder of a token of the specific type.
    bool LexIdentifier(Token &Result, const char *CurPtr);
    bool LexNumericConstant(Token &Result, const char *CurPtr);
    bool LexStringLiteral(Token &Result, const char *CurPtr,
                          tok::TokenKind Kind);
    bool LexRawStringLiteral(Token &Result, const char *CurPtr,
                             tok::TokenKind Kind);
    bool LexAngledStringLiteral(Token &Result, const char *CurPtr);
    bool LexCharConstant(Token &Result, const char *CurPtr,
                         tok::TokenKind Kind);
    bool LexEndOfFile(Token &Result, const char *CurPtr);
    bool SkipWhitespace(Token &Result, const char *CurPtr,
                        bool &TokAtPhysicalStartOfLine);
    bool SkipLineComment(Token &Result, const char *CurPtr,
                         bool &TokAtPhysicalStartOfLine);
    bool SkipBlockComment(Token &Result, const char *CurPtr,
                          bool &TokAtPhysicalStartOfLine);
    bool SaveLineComment(Token &Result, const char *CurPtr);
  1. LexIdentifier — lexes an identifier, split into a fast path and a slow path. The fast path handles only ASCII; the slow path has to deal with extended character sets, UCNs, trigraphs, and so on, all handled internally by getCharAndSize. After the slow path finishes, it reuses some of the fast-path code, with a goto to avoid duplication.

  2. LexNumericConstant — since a numeric literal can contain several sub-parts that each look numeric, this function calls itself recursively to absorb every possible numeric form. There are plenty of odd cases: hexadecimal floats, digit separators (`'`), user-defined suffixes containing UCNs, and more. When it can recurse no further, the start position is recorded into the token.

  3. LexStringLiteral — nothing special: it just keeps calling getAndAdvanceChar.

  4. LexRawStringLiteral — cannot use getAndAdvanceChar, because the C++ standard requires that inside a raw string every transformation from phase 1 and phase 2 be reverted.

  5. LexAngledStringLiteral — handles the header name after #include.

  6. LexCharConstant — reads a single character literal, again just via getAndAdvanceChar.

  7. SkipBlockComment — handles comments opened with /*, scanning for / and finishing when the preceding character is *. Most of the code here is an optimization of that scan, using vectorized comparisons to examine 16 bytes at a time.

Each of these functions produces a token when it finishes.

The lex entry point

The lexer's entry point is Lexer::Lex, which returns a Token. The function itself does almost nothing, forwarding entirely to Lexer::LexTokenInternal, whose body is essentially one big switch that dispatches to one of the helper lex functions above based on the leading character:

  1. 0 — the EOF marker; just call LexEndOfFile.

  2. 26, i.e. the ^Z character — a very old end-of-file marker; handled like the above.

  3. Newline characters — these terminate processing of the current preprocessor directive line.

  4. Whitespace — if what follows is the tail of a comment, the trailing comment characters are consumed along with it, purely as a speedup.

  5. Digits 0-9 — call LexNumericConstant.

  6. u — depending on the current standard's encoding prefixes for string literals, either begins a UTF-16 string literal or is classed as the start of an ordinary identifier.

  7. U — same as the previous, but UTF-32.

  8. R — depending on language options, begins an unprefixed raw string literal or an identifier.

  9. L — likewise has to decide between a wide character/string literal and an identifier.

  10. Any other character usable in an identifier — call LexIdentifier.

  11. ' — a character literal; LexCharConstant.

  12. " — a string literal; LexStringLiteral.

  13. ?[]() — marked directly as plain punctuation.

  14. . — possibly numeric; may go to LexNumericConstant.

  15. Arithmetic-operator characters — all need lookahead to determine the token kind.

  16. # — three cases: ##, #@, and plain #.

  17. \ — possibly a UCN; call tryReadUCN and LexUnicode.

Token

A token is the basic unit of information the source code is divided into after processing. In clang, Token's data members are as follows:

class Token
{
    /// The location of the token. This is actually a SourceLocation.
    unsigned Loc;

    /// UintData - This holds either the length of the token text, when
    /// a normal token, or the end of the SourceRange when an annotation
    /// token.
    // (Arguably redundant: offset plus length would give the end.)
    unsigned UintData;

    /// PtrData - This is a union of four different pointer types, which depends
    /// on what type of token this is:
    ///  Identifiers, keywords, etc:
    ///    This is an IdentifierInfo*, which contains the uniqued identifier
    ///    spelling.
    ///  Literals:  isLiteral() returns true.
    ///    This is a pointer to the start of the token in a text buffer, which
    ///    may be dirty (have trigraphs / escaped newlines).
    ///  Annotations (resolved type names, C++ scopes, etc): isAnnotation().
    ///    This is a pointer to sema-specific data for the annotation token.
    ///  Eof:
    //     This is a pointer to a Decl.
    ///  Other:
    ///    This is null.
    void *PtrData;

    /// Kind - The actual flavor of token this is.
    tok::TokenKind Kind;

    /// Flags - Bits we track about this token, members of the TokenFlags enum.
    unsigned short Flags;
}

The important part is what PtrData can refer to, which falls into these categories:

  1. Identifier or keyword;

  2. Literal — a constant value: a character, string, or numeric literal;

  3. Annotation — a mixed bag: namespaces, qualified type names, specialized template functions, decltype, and the rest are pragmas;

  4. eof — marks the end.

For the full definition of TokenKind see TokenKinds.def, whose use of macros is quite spectacular.

Flags, meanwhile, is a packed bit set whose value is composed from the following enum:

// Various flags set per token:
enum TokenFlags
{
    StartOfLine = 0x01,    // At start of line or only after whitespace
                           // (considering the line after macro expansion).
    LeadingSpace = 0x02,   // Whitespace exists before this token (considering
                           // whitespace after macro expansion).
    DisableExpand = 0x04,  // This identifier may never be macro expanded.
    NeedsCleaning = 0x08,  // Contained an escaped newline or trigraph.
    LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
    HasUDSuffix = 0x20,    // This string or character literal has a ud-suffix.
    HasUCN = 0x40,         // This identifier contains a UCN.
    IgnoredComma = 0x80,   // This comma is not a macro argument separator (MS).
    StringifiedInMacro = 0x100, // This string or character literal is formed by
                                // macro stringizing or charizing operator.
};

Most of these flags exist to handle macro expansion. For their exact meaning, consult the macro-expansion sections of C99; the code here also accounts for the various GCC and MSVC preprocessor extensions.

Preprocessing helper structures

During preprocessing, header insertion and macro expansion must be performed. To handle these operations, clang provides several helper structures:

  1. Identifier management — handles Identifiers.

  2. Header management — handles header file insertion.

  3. Macro management — handles macro expansion.

Header management

Multiple Include Optimization

This code handles headers that are included multiple times. The main state it manages is the following:

/// \brief Implements the simple state machine that the Lexer class uses to
/// detect files subject to the 'multiple-include' optimization.
///
/// The public methods in this class are triggered by various
/// events that occur when a file is lexed, and after the entire file is lexed,
/// information about which macro (if any) controls the header is returned.
class MultipleIncludeOpt
{
    /// ReadAnyTokens - This is set to false when a file is first opened and true
    /// any time a token is returned to the client or a (non-multiple-include)
    /// directive is parsed.  When the final \#endif is parsed this is reset back
    /// to false, that way any tokens before the first \#ifdef or after the last
    /// \#endif can be easily detected.
    bool ReadAnyTokens;

    /// ImmediatelyAfterTopLevelIfndef - This is true when the only tokens
    /// processed in the file so far is an #ifndef and an identifier.  Used in
    /// the detection of header guards in a file.
    bool ImmediatelyAfterTopLevelIfndef;

    /// DidMacroExpansion - This is set to true any time a macro is expanded
    /// while lexing the file.
    bool DidMacroExpansion;

    /// TheMacro - The controlling macro for a file, if valid.
    ///
    const IdentifierInfo *TheMacro;

    /// DefinedMacro - The macro defined right after TheMacro, if any.
    const IdentifierInfo *DefinedMacro;

    SourceLocation MacroLoc;
    SourceLocation DefinedLoc;
}
  1. ReadAnyTokens is used to detect tokens outside the header guard. It is set to false when the file is opened, set to true once a token is returned during lexing, and reset to false at the final #endif.

  2. ImmediatelyAfterTopLevelIfndef indicates that we have just processed the #ifndef of a header guard.

  3. DidMacroExpansion records whether a macro was expanded while processing the header guard's #ifndef; if one was, the header guard cannot be used.

  4. TheMacro is the header-guard macro itself (named rather like The One).

  5. DefinedMacro is the macro defined right after the #ifndef, null if there is no #define.

At initialization, these values must be set conservatively:

MultipleIncludeOpt()
{
    ReadAnyTokens = false;
    ImmediatelyAfterTopLevelIfndef = false;
    DidMacroExpansion = false;
    TheMacro = nullptr;
    DefinedMacro = nullptr;
}

If we find that the current file cannot use a header guard, we mark it permanently:

/// Invalidate - Permanently mark this file as not being suitable for the
/// include-file optimization.
void Invalidate()
{
    // If we have read tokens but have no controlling macro, the state-machine
    // below can never "accept".
    ReadAnyTokens = true;
    ImmediatelyAfterTopLevelIfndef = false;
    DefinedMacro = nullptr;
    TheMacro = nullptr;
}

The whole Invalidate process is driven by the two state-machine functions below. One is EnterTopLevelIfndef:

/// \brief Called when entering a top-level \#ifndef directive (or the
/// "\#if !defined" equivalent) without any preceding tokens.
///
/// Note, we don't care about the input value of 'ReadAnyTokens'.  The caller
/// ensures that this is only called if there are no tokens read before the
/// \#ifndef.  The caller is required to do this, because reading the \#if
/// line obviously reads in in tokens.
void EnterTopLevelIfndef(const IdentifierInfo *M, SourceLocation Loc)
{
    // If the macro is already set, this is after the top-level #endif.
    if (TheMacro)
        return Invalidate();

    // If we have already expanded a macro by the end of the #ifndef line, then
    // there is a macro expansion *in* the #ifndef line.  This means that the
    // condition could evaluate differently when subsequently #included.  Reject
    // this.
    if (DidMacroExpansion)
        return Invalidate();

    // Remember that we're in the #if and that we have the macro.
    ReadAnyTokens = true;
    ImmediatelyAfterTopLevelIfndef = true;
    TheMacro = M;
    MacroLoc = Loc;
}

The other is ExitTopLevelConditional:

/// \brief Called when the lexer exits the top-level conditional.
void ExitTopLevelConditional()
{
    // If we have a macro, that means the top of the file was ok.  Set our state
    // back to "not having read any tokens" so we can detect anything after the
    // #endif.
    if (!TheMacro) return Invalidate();

    // At this point, we haven't "read any tokens" but we do have a controlling
    // macro.
    ReadAnyTokens = false;
    ImmediatelyAfterTopLevelIfndef = false;
}

HeaderMap

This class abstracts over header files: it implements the Apple "header map" concept to paper over the underlying file system. Its data members are:

/// This class represents an Apple concept known as a 'header map'.  To the
/// \#include file resolution process, it basically acts like a directory of
/// symlinks to files.  Its advantages are that it is dense and more efficient
/// to create and process than a directory of symlinks.
class HeaderMap
{
    std::unique_ptr<const llvm::MemoryBuffer> FileBuffer;
    bool NeedsBSwap;
}

One member holds the file's memory buffer; the other indicates whether byte order (big- or little-endian) needs fixing up.

Using it requires some additional structures:

struct HMapBucket
{
    uint32_t Key;          // Offset (into strings) of key.

    uint32_t Prefix;     // Offset (into strings) of value prefix.
    uint32_t Suffix;     // Offset (into strings) of value suffix.
};

struct HMapHeader
{
    uint32_t Magic;           // Magic word, also indicates byte order.
    uint16_t Version;         // Version number -- currently 1.
    uint16_t Reserved;        // Reserved for future use - zero for now.
    uint32_t StringsOffset;   // Offset to start of string pool.
    uint32_t NumEntries;      // Number of entries in the string table.
    uint32_t NumBuckets;      // Number of buckets (always a power of 2).
    uint32_t MaxValueLength;  // Length of longest result path (excluding nul).
    // An array of 'NumBuckets' HMapBucket objects follows this header.
    // Strings follow the buckets, at StringsOffset.
};

HMapHeader is the file header of a serialized HeaderMap. To deserialize a HeaderMap from a FileEntry, we first check whether the file's leading bytes form a valid HMapHeader. The Magic and Version values are defined in an enum:

enum
{
    HMAP_HeaderMagicNumber = ('h' << 24) | ('m' << 16) | ('a' << 8) | 'p',
    HMAP_HeaderVersion = 1,

    HMAP_EmptyBucketKey = 0
};

The static factory for HeaderMap can then be implemented as follows: first check that the file is large enough to hold an HMapHeader, then compare the version number and magic number:

/// HeaderMap::Create - This attempts to load the specified file as a header
/// map.  If it doesn't look like a HeaderMap, it gives up and returns null.
/// If it looks like a HeaderMap but is obviously corrupted, it puts a reason
/// into the string error argument and returns null.
const HeaderMap *HeaderMap::Create(const FileEntry *FE, FileManager &FM)
{
    // If the file is too small to be a header map, ignore it.
    unsigned FileSize = FE->getSize();
    if (FileSize <= sizeof(HMapHeader)) return nullptr;

    auto FileBuffer = FM.getBufferForFile(FE);
    if (!FileBuffer) return nullptr;  // Unreadable file?
    const char *FileStart = (*FileBuffer)->getBufferStart();

    // We know the file is at least as big as the header, check it now.
    const HMapHeader *Header = reinterpret_cast<const HMapHeader*>(FileStart);

    // Sniff it to see if it's a headermap by checking the magic number and
    // version.
    bool NeedsByteSwap;
    if (Header->Magic == HMAP_HeaderMagicNumber &&
        Header->Version == HMAP_HeaderVersion)
        NeedsByteSwap = false;
    else if (Header->Magic == llvm::ByteSwap_32(HMAP_HeaderMagicNumber) &&
             Header->Version == llvm::ByteSwap_16(HMAP_HeaderVersion))
        NeedsByteSwap = true;  // Mixed endianness headermap.
    else
        return nullptr;  // Not a header map.

    if (Header->Reserved != 0) return nullptr;

    // Okay, everything looks good, create the header map.
    return new HeaderMap(std::move(*FileBuffer), NeedsByteSwap);
}

The HeaderMap stores a hash table whose basic unit is the HMapBucket structure shown earlier.

Fetching a bucket is straightforward: compute the offset of the hash table, add the index, and check for out-of-bounds access:

/// getBucket - Return the specified hash table bucket from the header map,
/// bswap'ing its fields as appropriate.  If the bucket number is not valid,
/// this return a bucket with an empty key (0).
HMapBucket HeaderMap::getBucket(unsigned BucketNo) const
{
    HMapBucket Result;
    Result.Key = HMAP_EmptyBucketKey;

    const HMapBucket *BucketArray =
        reinterpret_cast<const HMapBucket*>(FileBuffer->getBufferStart() +
                                            sizeof(HMapHeader));

    const HMapBucket *BucketPtr = BucketArray + BucketNo;
    if ((const char*)(BucketPtr + 1) > FileBuffer->getBufferEnd())
    {
        Result.Prefix = 0;
        Result.Suffix = 0;
        return Result;  // Invalid buffer, corrupt hmap.
    }

    // Otherwise, the bucket is valid.  Load the values, bswapping as needed.
    Result.Key = getEndianAdjustedWord(BucketPtr->Key);
    Result.Prefix = getEndianAdjustedWord(BucketPtr->Prefix);
    Result.Suffix = getEndianAdjustedWord(BucketPtr->Suffix);
    return Result;
}

A HeaderMap effectively keeps the header directory's file names in memory, so locating a header never needs to touch the file system. Looking up whether a filename exists in a HeaderMap is implemented as a simple hash probe:

StringRef HeaderMap::lookupFilename(StringRef Filename,
                                    SmallVectorImpl<char> &DestPath) const
{
    const HMapHeader &Hdr = getHeader();
    unsigned NumBuckets = getEndianAdjustedWord(Hdr.NumBuckets);

    // If the number of buckets is not a power of two, the headermap is corrupt.
    // Don't probe infinitely.
    if (NumBuckets & (NumBuckets - 1))
    return StringRef();

    // Linearly probe the hash table.
    for (unsigned Bucket = HashHMapKey(Filename);; ++Bucket)
    {
        HMapBucket B = getBucket(Bucket & (NumBuckets - 1));
        if (B.Key == HMAP_EmptyBucketKey) return StringRef(); // Hash miss.

        // See if the key matches.  If not, probe on.
        if (!Filename.equals_lower(getString(B.Key)))
            continue;

        // If so, we have a match in the hash table.  Construct the destination
        // path.
        StringRef Prefix = getString(B.Prefix);
        StringRef Suffix = getString(B.Suffix);
        DestPath.clear();
        DestPath.append(Prefix.begin(), Prefix.end());
        DestPath.append(Suffix.begin(), Suffix.end());
        return StringRef(DestPath.begin(), DestPath.size());
    }
}

Header lookup

Header lookup is described in HeaderSearch.h, which contains the rather large class HeaderSearch. Let's analyze it piece by piece.

First, the per-header information structure HeaderFileInfo, which is mostly flag bits:

struct HeaderFileInfo
{
    /// \brief True if this is a \#import'd or \#pragma once file.
    unsigned isImport : 1;

    /// \brief True if this is a \#pragma once file.
    unsigned isPragmaOnce : 1;

    /// DirInfo - Keep track of whether this is a system header, and if so,
    /// whether it is C++ clean or not.  This can be set by the include paths or
    /// by \#pragma gcc system_header.  This is an instance of
    /// SrcMgr::CharacteristicKind.
    unsigned DirInfo : 2;

    /// \brief Whether this header file info was supplied by an external source,
    /// and has not changed since.
    unsigned External : 1;

    /// \brief Whether this header is part of a module.
    unsigned isModuleHeader : 1;

    /// \brief Whether this header is part of the module that we are building.
    unsigned isCompilingModuleHeader : 1;

    /// \brief Whether this structure is considered to already have been
    /// "resolved", meaning that it was loaded from the external source.
    unsigned Resolved : 1;

    /// \brief Whether this is a header inside a framework that is currently
    /// being built. 
    ///
    /// When a framework is being built, the headers have not yet been placed
    /// into the appropriate framework subdirectories, and therefore are
    /// provided via a header map. This bit indicates when this is one of
    /// those framework headers.
    unsigned IndexHeaderMapHeader : 1;

    /// \brief Whether this file has been looked up as a header.
    unsigned IsValid : 1;

    /// \brief The number of times the file has been included already.
    unsigned short NumIncludes;

    /// \brief The ID number of the controlling macro.
    ///
    /// This ID number will be non-zero when there is a controlling
    /// macro whose IdentifierInfo may not yet have been loaded from
    /// external storage.
    unsigned ControllingMacroID;

    /// If this file has a \#ifndef XXX (or equivalent) guard that
    /// protects the entire contents of the file, this is the identifier
    /// for the macro that controls whether or not it has any effect.
    ///
    /// Note: Most clients should use getControllingMacro() to access
    /// the controlling macro of this header, since
    /// getControllingMacro() is able to load a controlling macro from
    /// external storage.
    const IdentifierInfo *ControllingMacro;
}

The notable field is ControllingMacro, i.e. the familiar header guard.

The rest of the file is HeaderSearch proper. It has many fields we don't need, so we look only at the highlights:

/// \#include search path information.  Requests for \#include "x" search the
/// directory of the \#including file first, then each directory in SearchDirs
/// consecutively. Requests for <x> search the current dir first, then each
/// directory in SearchDirs, starting at AngledDirIdx, consecutively.  If
/// NoCurDirSearch is true, then the check for the file in the current
/// directory is suppressed.
std::vector<DirectoryLookup> SearchDirs;
unsigned AngledDirIdx;
unsigned SystemDirIdx;
bool NoCurDirSearch;

The comment spells out the search logic: all header search paths are stored in SearchDirs, with AngledDirIdx marking where the <>-style directories begin and SystemDirIdx where the system directories begin.

The other important pieces are the per-file header information and the FileEntry-to-HeaderMap mapping:

/// \brief All of the preprocessor-specific data about files that are
/// included, indexed by the FileEntry's UID.
mutable std::vector<HeaderFileInfo> FileInfo;
/// HeaderMaps - This is a mapping from FileEntry -> HeaderMap, uniquing
/// headermaps.  This vector owns the headermap.
std::vector<std::pair<const FileEntry*, const HeaderMap*> > HeaderMaps;

The rest concerns modules, frameworks, and statistics; not being the core issue, they are skipped.

The real core is the lookup function:

const FileEntry *LookupFile(
    StringRef Filename, SourceLocation IncludeLoc, bool isAngled,
    const DirectoryLookup *FromDir, const DirectoryLookup *&CurDir,
    ArrayRef<std::pair<const FileEntry *, const DirectoryEntry *>> Includers,
    SmallVectorImpl<char> *SearchPath, SmallVectorImpl<char> *RelativePath,
    Module *RequestingModule, ModuleMap::KnownHeader *SuggestedModule,
    bool SkipCache = false);

The commentary on this function is also very long, and the implementation runs to some 300 lines. It first checks whether the path is absolute; if so, it simply checks whether that path exists:

if (llvm::sys::path::is_absolute(Filename))
{
    CurDir = nullptr;

    // If this was an #include_next "/absolute/file", fail.
    if (FromDir) return nullptr;

    if (SearchPath)
        SearchPath->clear();
    if (RelativePath)
    {
        RelativePath->clear();
        RelativePath->append(Filename.begin(), Filename.end());
    }
    // Otherwise, just return the file.
    return getFileAndSuggestModule(Filename, nullptr,
                                   /*IsSystemHeaderDir*/false,
                                   RequestingModule, SuggestedModule);
}

For a relative path, we must distinguish angle-bracket includes from double-quoted ones. A double-quoted include first searches the directories of the including files before falling back to the search path, whereas an angle-bracket include goes straight to the search path.

Angle-bracket searches also have to account for Microsoft's include_next extension; we won't dig into the exact flow here.

Framework and module support I'll set aside for now; it's not currently of interest.

Pretokenized headers

PTH file format

One further job of the PTH file is to map FileEntrys to PTH offsets. The mapping uses uint32_t throughout, so the stored record is simply:

class PTHFileData
{
    const uint32_t TokenOff;
    const uint32_t PPCondOff;
}

Very simple: two offset fields, whose full meaning requires the file manager to interpret. Reading them is just as simple:

static PTHFileData ReadData(const internal_key_type& k,
                            const unsigned char* d, unsigned)
{
    assert(k.first == 0x1 && "Only file lookups can match!");
    using namespace llvm::support;
    uint32_t x = endian::readNext<uint32_t, little, unaligned>(d);
    uint32_t y = endian::readNext<uint32_t, little, unaligned>(d);
    return PTHFileData(x, y);
}

PTHLexer

This file handles pretokenized input, i.e. pretokenized header files. Most importantly, the class inherits from PreprocessorLexer:

class PTHLexer : public PreprocessorLexer
{
    SourceLocation FileStartLoc;

    /// TokBuf - Buffer from PTH file containing raw token data.
    const unsigned char* TokBuf;

    /// CurPtr - Pointer into current offset of the token buffer where
    ///  the next token will be read.
    const unsigned char* CurPtr;

    /// LastHashTokPtr - Pointer into TokBuf of the last processed '#'
    ///  token that appears at the start of a line.
    const unsigned char* LastHashTokPtr;

    /// PPCond - Pointer to a side table in the PTH file that provides a
    ///  a consise summary of the preproccessor conditional block structure.
    ///  This is used to perform quick skipping of conditional blocks.
    const unsigned char* PPCond;

    /// CurPPCondPtr - Pointer inside PPCond that refers to the next entry
    ///  to process when doing quick skipping of preprocessor blocks.
    const unsigned char* CurPPCondPtr;
}

This class processes the Pretokenized Header format, which is LLVM's own design, so I won't say much about PTH-only functionality; it isn't part of the standard.

The implementation file is about 900 lines; let's pick out the key functions. By order of difficulty, the simplest is LexEndOfFile:

bool PTHLexer::LexEndOfFile(Token &Result)
{
    // If we hit the end of the file while parsing a preprocessor directive,
    // end the preprocessor directive first.  The next token returned will
    // then be the end of file.
    if (ParsingPreprocessorDirective)
    {
        ParsingPreprocessorDirective = false; // Done parsing the "line".
        return true;  // Have a token.
    }

    assert(!LexingRawMode);

    // If we are in a #if directive, emit an error.
    while (!ConditionalStack.empty())
    {
        if (PP->getCodeCompletionFileLoc() != FileStartLoc)
            PP->Diag(ConditionalStack.back().IfLoc,
                     diag::err_pp_unterminated_conditional);
        ConditionalStack.pop_back();
    }

    // Finally, let the preprocessor handle this.
    return PP->HandleEndOfFile(Result);
}

This function answers what to do when we reach the end of a file. The return value says whether the caller should call Lex again for the next token. If we hit EOF while processing a preprocessor directive (a "line" such as #line), we simply return true. If the conditional-compilation stack still contains #ifs, we report an unterminated-conditional diagnostic for each and pop them all. Finally, we let the preprocessor handle the EOF; its signature is:

/// \brief Callback invoked when the lexer hits the end of the current file.
///
/// This either returns the EOF token and returns true, or
/// pops a level off the include stack and returns false, at which point the
/// client should call lex again.
bool HandleEndOfFile(Token &Result, bool isEndOfMacro = false);

The gist of that function is popping the include stack; along the way, custom callbacks we have registered may be invoked.

With that exceptional case handled, we turn to the ordinary PTH Lex operation, which proceeds in several steps. The first reads a token's raw data from the PTH buffer:

//===--------------------------------------==//
// Read the raw token data.
//===--------------------------------------==//
using namespace llvm::support;

// Shadow CurPtr into an automatic variable.
const unsigned char *CurPtrShadow = CurPtr;

// Read in the data for the token.
unsigned Word0 = endian::readNext<uint32_t, little, aligned>(CurPtrShadow);
uint32_t IdentifierID =
    endian::readNext<uint32_t, little, aligned>(CurPtrShadow);
uint32_t FileOffset =
    endian::readNext<uint32_t, little, aligned>(CurPtrShadow);

tok::TokenKind TKind = (tok::TokenKind) (Word0 & 0xFF);
Token::TokenFlags TFlags = (Token::TokenFlags) ((Word0 >> 8) & 0xFF);
uint32_t Len = Word0 >> 16;

CurPtr = CurPtrShadow;

This code reveals the concrete PTH layout: three uint32_t values carry a token's basic information. The remainder constructs the Token from them:

//===--------------------------------------==//
// Construct the token itself.
//===--------------------------------------==//

Tok.startToken();
Tok.setKind(TKind);
Tok.setFlag(TFlags);
assert(!LexingRawMode);
Tok.setLocation(FileStartLoc.getLocWithOffset(FileOffset));
Tok.setLength(Len);

// Handle identifiers.
if (Tok.isLiteral())
{
    Tok.setLiteralData((const char*)(PTHMgr.SpellingBase + IdentifierID));
}
else if (IdentifierID)
{
    MIOpt.ReadToken();
    IdentifierInfo *II = PTHMgr.GetIdentifierInfo(IdentifierID - 1);

    Tok.setIdentifierInfo(II);

    // Change the kind of this identifier to the appropriate token kind, e.g.
    // turning "for" into a keyword.
    Tok.setKind(II->getTokenID());

    if (II->isHandleIdentifierCase())
    return PP->HandleIdentifier(Tok);

    return true;
}

The IdentifierInfo corresponding to such a token is managed by PTHMgr.

After the token is constructed, further processing handles the special token kinds:

//===--------------------------------------==//
// Process the token.
//===--------------------------------------==//
if (TKind == tok::eof)
{
    // Save the end-of-file token.
    EofToken = Tok;

    assert(!ParsingPreprocessorDirective);
    assert(!LexingRawMode);

    return LexEndOfFile(Tok);
}

if (TKind == tok::hash && Tok.isAtStartOfLine())
{
    LastHashTokPtr = CurPtr - StoredTokenSize;
    assert(!LexingRawMode);
    PP->HandleDirective(Tok);

    return false;
}

if (TKind == tok::eod)
{
    assert(ParsingPreprocessorDirective);
    ParsingPreprocessorDirective = false;
    return true;
}

MIOpt.ReadToken();
return true;

This flow calls the preprocessor's HandleDirective, which processes lines beginning with #; its signature:

/// \brief Callback invoked when the lexer sees a # token at the start of a
/// line.
///
/// This consumes the directive, modifies the lexer/preprocessor state, and
/// advances the lexer(s) so that the next token read is the correct one
void HandleDirective(Token &Result);

PTHLexer.cpp holds one more important function, used to skip conditional-compilation blocks. It is rather long, so only the signature is shown:

/// SkipBlock - Used by Preprocessor to skip the current conditional block.
bool PTHLexer::SkipBlock()

PTHManager

PTHManager is similar to the earlier SourceManager: it keeps a pile of smart pointers managing various resources, and it provides a String-to-IdentifierInfo lookup interface, inherited from IdentifierInfoLookup. The full set of data members:

class PTHManager : public IdentifierInfoLookup
{

    class PTHStringLookupTrait;
    class PTHFileLookupTrait;
    typedef llvm::OnDiskChainedHashTable<PTHStringLookupTrait> PTHStringIdLookup;
    typedef llvm::OnDiskChainedHashTable<PTHFileLookupTrait> PTHFileLookup;

    /// The memory mapped PTH file.
    std::unique_ptr<const llvm::MemoryBuffer> Buf;

    /// Alloc - Allocator used for IdentifierInfo objects.
    llvm::BumpPtrAllocator Alloc;

    /// IdMap - A lazily generated cache mapping from persistent identifiers to
    ///  IdentifierInfo*.
    std::unique_ptr<IdentifierInfo *[], llvm::FreeDeleter> PerIDCache;

    /// FileLookup - Abstract data structure used for mapping between files
    ///  and token data in the PTH file.
    std::unique_ptr<PTHFileLookup> FileLookup;

    /// IdDataTable - Array representing the mapping from persistent IDs to the
    ///  data offset within the PTH file containing the information to
    ///  reconsitute an IdentifierInfo.
    const unsigned char* const IdDataTable;

    /// SortedIdTable - Abstract data structure mapping from strings to
    ///  persistent IDs.  This is used by get().
    std::unique_ptr<PTHStringIdLookup> StringIdLookup;

    /// NumIds - The number of identifiers in the PTH file.
    const unsigned NumIds;

    /// PP - The Preprocessor object that will use this PTHManager to create
    ///  PTHLexer objects.
    Preprocessor* PP;

    /// SpellingBase - The base offset within the PTH memory buffer that
    ///  contains the cached spellings for literals.
    const unsigned char* const SpellingBase;

    /// OriginalSourceFile - A null-terminated C-string that specifies the name
    ///  if the file (if any) that was to used to generate the PTH cache.
    const char* OriginalSourceFile;
}

这里有两个用来管理资源的智能指针:

  1. Buf:保存了所有的PTH对应的MemoryBuffer;

  2. PerIDCache:一些IdentifierInfo的cache;

这里还有两个用来维持抽象查询接口的智能指针:

  1. FileLookUp:管理各个文件与Token之间的映射。

  2. StringIdLookup:管理ID与字符串之间的映射,虽然现在我还不知道这个ID是什么,看样子是PTH自己维护的一套ID。

Construction

The construction of this rather complex class is wrapped in one complex static factory function:

/// Create - This method creates PTHManager objects.  The 'file' argument
///  is the name of the PTH file.  This method returns NULL upon failure.
static PTHManager *Create(StringRef file, DiagnosticsEngine &Diags);

Since the implementation is long, here is an outline of the steps:

  1. Read the PTH file named by the file argument, performing some format checks.

  2. After the format checks pass, read the leading ID->NameOffset table and build the file table, spelling base, string-ID table, and ID table. The offsets of these tables are laid out back to back; see the comments in OnDiskHashTable for the exact format.

  3. Compute the length of the original source file.

Name lookup

PTHManager provides four main operations. The first is get(), which maps an identifier's name to its IdentifierInfo; its signature is:

/// get - Return the identifier token info for the specified named identifier.
///  Unlike the version in IdentifierTable, this returns a pointer instead
///  of a reference.  If the pointer is NULL then the IdentifierInfo cannot
///  be found.
IdentifierInfo *get(StringRef Name) override;

Internally the lookup goes through an intermediate layer, i.e. two mappings: String -> ID -> IdentifierInfo.

// Double check our assumption that the last character isn't '\0'.
assert(Name.empty() || Name.back() != '\0');
PTHStringIdLookup::iterator I =
StringIdLookup->find(std::make_pair(Name.data(), Name.size()));
if (I == StringIdLookup->end()) // No identifier found?
return nullptr;

// Match found.  Return the identifier!
assert(*I > 0);
return GetIdentifierInfo(*I - 1);

StringIdLookup handles the first mapping; the second is delegated to GetIdentifierInfo, implemented as follows:

/// GetIdentifierInfo - Used to reconstruct IdentifierInfo objects from the
///  PTH file.
inline IdentifierInfo* GetIdentifierInfo(unsigned PersistentID)
{
    // Check if the IdentifierInfo has already been resolved.
    if (IdentifierInfo* II = PerIDCache[PersistentID])
    return II;
    return LazilyCreateIdentifierInfo(PersistentID);
}

Another level of indirection: PerIDCache stores the ID -> IdentifierInfo mapping, but an entry may not have been loaded yet. Loading requires the following function:

using namespace llvm::support;
// Look in the PTH file for the string data for the IdentifierInfo object.
const unsigned char* TableEntry = IdDataTable + sizeof(uint32_t)*PersistentID;
const unsigned char *IDData =
(const unsigned char *)Buf->getBufferStart() +
endian::readNext<uint32_t, little, aligned>(TableEntry);
assert(IDData < (const unsigned char*)Buf->getBufferEnd());

// Allocate the object.
std::pair<IdentifierInfo, const unsigned char*> *Mem =
Alloc.Allocate<std::pair<IdentifierInfo, const unsigned char*> >();

Mem->second = IDData;
assert(IDData[0] != '\0');
IdentifierInfo *II = new ((void*)Mem) IdentifierInfo();

// Store the new IdentifierInfo in the cache.
PerIDCache[PersistentID] = II;
assert(II->getNameStart() && II->getNameStart()[0] != '\0');
return II;

IdDataTable stores, for each ID, the offset of its string within the MemoryBuffer. The code then allocates space to construct a std::pair<IdentifierInfo, const unsigned char*> and returns the address of the newly constructed IdentifierInfo. In short, this only reserves a slot for the IdentifierInfo; how it gets filled in is up to later stages.

CreateLexer

This function constructs the PTHLexer corresponding to a given FileID. The process is straightforward:

PTHLexer *PTHManager::CreateLexer(FileID FID)
{
    const FileEntry *FE = PP->getSourceManager().getFileEntryForID(FID);
    if (!FE)
    return nullptr;

    using namespace llvm::support;

    // Lookup the FileEntry object in our file lookup data structure.  It will
    // return a variant that indicates whether or not there is an offset within
    // the PTH file that contains cached tokens.
    PTHFileLookup::iterator I = FileLookup->find(FE);

    if (I == FileLookup->end()) // No tokens available?
    return nullptr;

    const PTHFileData& FileData = *I;

    const unsigned char *BufStart = (const unsigned char *)Buf->getBufferStart();
    // Compute the offset of the token data within the buffer.
    const unsigned char* data = BufStart + FileData.getTokenOffset();

    // Get the location of pp-conditional table.
    const unsigned char* ppcond = BufStart + FileData.getPPCondOffset();
    uint32_t Len = endian::readNext<uint32_t, little, aligned>(ppcond);
    if (Len == 0) ppcond = nullptr;

    assert(PP && "No preprocessor set yet!");
    return new PTHLexer(*PP, FID, data, ppcond, *this);
}

Identifier management

Identifier definition

The class for identifiers is IdentifierInfo, but this important type is defined alongside IdentifierTable and stored together with that symbol table. IdentifierInfo does not have many data members; most are flag bits:

class IdentifierInfo
{
    unsigned TokenID : 9; // Front-end token ID or tok::identifier.
    // Objective-C keyword ('protocol' in '@protocol') or builtin (__builtin_inf).
    // First NUM_OBJC_KEYWORDS values are for Objective-C, the remaining values
    // are for builtins.
    unsigned ObjCOrBuiltinID : 13;
    bool HasMacro : 1; // True if there is a #define for this.
    bool HadMacro : 1; // True if there was a #define for this.
    bool IsExtension : 1; // True if identifier is a lang extension.
    bool IsFutureCompatKeyword : 1; // True if identifier is a keyword in a
                                 // newer Standard or proposed Standard.
    bool IsPoisoned : 1; // True if identifier is poisoned.
    bool IsCPPOperatorKeyword : 1; // True if ident is a C++ operator keyword.
    bool NeedsHandleIdentifier : 1; // See "RecomputeNeedsHandleIdentifier".
    bool IsFromAST : 1; // True if identifier was loaded (at least 
                                 // partially) from an AST file.
    bool ChangedAfterLoad : 1; // True if identifier has changed from the
                                 // definition loaded from an AST file.
    bool RevertedTokenID : 1; // True if revertTokenIDToIdentifier was
                                 // called.
    bool OutOfDate : 1; // True if there may be additional
                                 // information about this identifier
                                 // stored externally.
    bool IsModulesImport : 1; // True if this is the 'import' contextual
                                 // keyword.
    // 30 bit left in 64-bit word.

    void *FETokenInfo;               // Managed by the language front-end.
    llvm::StringMapEntry<IdentifierInfo*> *Entry;
}

TokenID here is essentially a TokenKind; macro tricks reuse the TokenKinds.def file for it. IsPoisoned indicates whether the identifier is poisoned: if it is, every subsequent use produces an error or warning. Another member that deserves special mention is Entry: it points back to the StringMapEntry that stores this IdentifierInfo, forming a mutually referencing structure. From the member functions defined in the class we can infer quite a bit more about IdentifierTable. For example, here is the implementation that returns the identifier's name:

/// \brief Return the beginning of the actual null-terminated string for this
/// identifier.
///
const char *getNameStart() const
{
    if (Entry) return Entry->getKeyData();
    // FIXME: This is gross. It would be best not to embed specific details
    // of the PTH file format here.
    // The 'this' pointer really points to a
    // std::pair<IdentifierInfo, const char*>, where internal pointer
    // points to the external string data.
    typedef std::pair<IdentifierInfo, const char*> actualtype;
    return ((const actualtype*) this)->second;
}

There are two paths: if the identifier lives in an IdentifierTable, return the entry's key data directly; if it came from a PTH file, return the pointer stored in the pair right after the current object. The second path exposes the PTH implementation, and the comment admits this is ugly.

getLength reuses the same trick:

/// \brief Efficiently return the length of this identifier info.
///
unsigned getLength() const
{
    if (Entry) return Entry->getKeyLength();
    // FIXME: This is gross. It would be best not to embed specific details
    // of the PTH file format here.
    // The 'this' pointer really points to a
    // std::pair<IdentifierInfo, const char*>, where internal pointer
    // points to the external string data.
    typedef std::pair<IdentifierInfo, const char*> actualtype;
    const char* p = ((const actualtype*) this)->second - 2;
    return (((unsigned)p[0]) | (((unsigned)p[1]) << 8)) - 1;
}

The length trick here is outrageous: the two bytes immediately preceding the string data encode (length + 1), in little-endian order. This also exposes a limitation: the identifier length is stored in the two bytes before the string in the PTH buffer, so it is capped by the two-byte encoding, though in practice that is never a problem.

The identifier table

The symbol table class is IdentifierTable; it has just two data members:

/// \brief Implements an efficient mapping from strings to IdentifierInfo nodes.
///
/// This has no other purpose, but this is an extremely performance-critical
/// piece of the code, as each occurrence of every identifier goes through
/// here when lexed.
class IdentifierTable
{
    // Shark shows that using MallocAllocator is *much* slower than using this
    // BumpPtrAllocator!
    typedef llvm::StringMap<IdentifierInfo*, llvm::BumpPtrAllocator> HashTableTy;
    HashTableTy HashTable;

    IdentifierInfoLookup* ExternalLookup;
}

HashTable here is a StringMap<IdentifierInfo*>, holding all local name-to-identifier mappings. External identifiers are resolved through ExternalLookup. The overall lookup function is:

/// \brief Return the identifier token info for the specified named
/// identifier.
IdentifierInfo &get(StringRef Name)
{
    auto &Entry = *HashTable.insert(std::make_pair(Name, nullptr)).first;

    IdentifierInfo *&II = Entry.second;
    if (II) return *II;

    // No entry; if we have an external lookup, look there first.
    if (ExternalLookup)
    {
        II = ExternalLookup->get(Name);
        if (II)
            return *II;
    }

    // Lookups failed, make a new IdentifierInfo.
    void *Mem = getAllocator().Allocate<IdentifierInfo>();
    II = new (Mem) IdentifierInfo();

    // Make sure getName() knows how to find the IdentifierInfo
    // contents.
    II->Entry = &Entry;

    return *II;
}

The flow: look up locally first, then externally, and if both fail insert a new local entry. Note the final II->Entry = &Entry;, which maintains the back-reference.

There is another lookup variant dedicated to local-only lookup; its code is essentially the same as the function above, so it is not covered.

The symbol-table headers also contain another very important class, DeclarationNameExtra, which encodes the kinds of special names declared inside a class:

/// DeclarationNameExtra - Common base of the MultiKeywordSelector,
/// CXXSpecialName, and CXXOperatorIdName classes, all of which are
/// private classes that describe different kinds of names.
class DeclarationNameExtra
{
public:
    /// ExtraKind - The kind of "extra" information stored in the
    /// DeclarationName. See @c ExtraKindOrNumArgs for an explanation of
    /// how these enumerator values are used.
    enum ExtraKind
    {
        CXXConstructor = 0,
        CXXDestructor,
        CXXConversionFunction,
#define OVERLOADED_OPERATOR(Name,Spelling,Token,Unary,Binary,MemberOnly) \
CXXOperator##Name,
#include "clang/Basic/OperatorKinds.def"
        CXXLiteralOperator,
        CXXUsingDirective,
        NUM_EXTRA_KINDS
    };

    /// ExtraKindOrNumArgs - Either the kind of C++ special name or
    /// operator-id (if the value is one of the CXX* enumerators of
    /// ExtraKind), in which case the DeclarationNameExtra is also a
    /// CXXSpecialName, (for CXXConstructor, CXXDestructor, or
    /// CXXConversionFunction) CXXOperatorIdName, or CXXLiteralOperatorName,
    /// it may be also name common to C++ using-directives (CXXUsingDirective),
    /// otherwise it is NUM_EXTRA_KINDS+NumArgs, where NumArgs is the number of
    /// arguments in the Objective-C selector, in which case the
    /// DeclarationNameExtra is also a MultiKeywordSelector.
    unsigned ExtraKindOrNumArgs;
};

See the #define here? Pure sorcery. This macro together with OperatorKinds.def conjures up more than a hundred new enumerator values. OperatorKinds.def can also yield different sets of values depending on which macros are defined before including it; the supported macros are all listed in the header-guard section at the top of the file.

Macro management

Macro handling centers on two classes, MacroInfo and MacroArgs. Macros come in two kinds, object-like and function-like, distinguished by whether they take arguments. MacroInfo describes both, while MacroArgs stores a function-like macro's argument list.

MacroArgs

The layout of this class is fairly simple:

class MacroArgs
{
    unsigned NumUnexpArgTokens;
    bool VarargsElided;
    std::vector<std::vector<Token> > PreExpArgTokens;
    std::vector<Token> StringifiedArgs;
    MacroArgs *ArgCache;
}
  1. NumUnexpArgTokens: the number of unexpanded argument tokens. The actual argument tokens are laid out in memory immediately after the MacroArgs object, and each actual argument is terminated by an EOF token.
  2. VarargsElided: whether C99 variadic arguments were omitted; not important here.
  3. PreExpArgTokens: the pre-expanded actual arguments. Since one argument may expand to several tokens, each argument is a vector<Token> terminated by EOF, and the whole set is a vector<vector<Token>>. Arguments not yet expanded are left as empty vectors.
  4. StringifiedArgs: the results of the # operator.
  5. ArgCache: head pointer of the free list of MacroArgs memory blocks.

First, a note on where UnexpArgTokens is stored: the comments say these tokens live at the end of the current object, which makes the memory access rather peculiar...

const Token *MacroArgs::getUnexpArgument(unsigned Arg) const
{
    // The unexpanded argument tokens start immediately after the MacroArgs object
    // in memory.
    // Implementation-dependent: nothing guarantees the tokens start
    // immediately after this object with no padding.
    const Token *Start = (const Token *)(this + 1);
    const Token *Result = Start;
    // Scan to find Arg.
    for (; Arg; ++Result)
    {
        assert(Result < Start + NumUnexpArgTokens && "Invalid arg #");
        if (Result->is(tok::eof))
            --Arg;
    }
    assert(Result < Start + NumUnexpArgTokens && "Invalid arg #");
    return Result;
}

The memory layout and alignment are forced by (const Token *)(this + 1). From this implementation we can see that the unexpanded arguments form one contiguously allocated Token array, with the individual arguments separated by EOF tokens.

The lifetime of a MacroArgs object is managed by the Preprocessor, so it must be created through the following static function:

static MacroArgs *create(const MacroInfo *MI,
        ArrayRef<Token> UnexpArgTokens,
        bool VarargsElided, Preprocessor &PP);

Its implementation scans the preprocessor's free list PP.MacroArgCache for the smallest entry satisfying (*Entry)->NumUnexpArgTokens >= UnexpArgTokens.size(); if none is found it calls malloc, followed by placement new. After construction, UnexpArgTokens is copied into the memory right after the result object. Again, this exposes the tokens' concrete storage layout, which is not pretty.

Since creation goes through the Preprocessor, destruction does too.

void MacroArgs::destroy(Preprocessor &PP)
{
    StringifiedArgs.clear();

    // Don't clear PreExpArgTokens, just clear the entries.  Clearing the entries
    // would deallocate the element vectors.
    for (unsigned i = 0, e = PreExpArgTokens.size(); i != e; ++i)
        PreExpArgTokens[i].clear();

    // Add this to the preprocessor's free list.
    // The free-list head lives in PP.MacroArgCache; this is a plain push
    // onto the head of a singly linked list, one node at a time.
    ArgCache = PP.MacroArgCache;
    PP.MacroArgCache = this;
}

destroy here only releases the expanded argument tokens, then pushes the current node onto the head of the preprocessor's free list.

The real, complete destruction is this:

MacroArgs *MacroArgs::deallocate()
{
    MacroArgs *Next = ArgCache;

    // Run the dtor to deallocate the vectors.
    this->~MacroArgs();

    // Release the memory for the object.
    free(this);

    return Next;
}

free(this) is every bit as perilous as delete this: misuse crashes spectacularly, which is why this function may only be called from within the Preprocessor.

Access to the expanded arguments is fused with the expansion itself: an argument is expanded only when needed, i.e. lazily.

const std::vector<Token> &
MacroArgs::getPreExpArgument(unsigned Arg, const MacroInfo *MI,
    Preprocessor &PP)
{
    assert(Arg < MI->getNumArgs() && "Invalid argument number!");

    // If we have already computed this, return it.
    if (PreExpArgTokens.size() < MI->getNumArgs())
        PreExpArgTokens.resize(MI->getNumArgs());

    std::vector<Token> &Result = PreExpArgTokens[Arg];
    if (!Result.empty()) return Result;

    SaveAndRestore<bool> PreExpandingMacroArgs(PP.InMacroArgPreExpansion, true);

    const Token *AT = getUnexpArgument(Arg);
    unsigned NumToks = getArgLength(AT) + 1;  // Include the EOF.

    // Otherwise, we have to pre-expand this argument, populating Result.  To do
    // this, we set up a fake TokenLexer to lex from the unexpanded argument
    // list.  With this installed, we lex expanded tokens until we hit the EOF
    // token at the end of the unexp list.
    PP.EnterTokenStream(AT, NumToks, false /*disable expand*/,
        false /*owns tokens*/);
}

On access, it first checks whether the argument has already been expanded (PreExpArgTokens[Arg] non-empty); otherwise it performs the expansion. This uses a temporary token stream, and the results are pulled out one by one through PP.Lex(). The expanded tokens are then stored back:

PP.EnterTokenStream(AT, NumToks, false /*disable expand*/,
    false /*owns tokens*/);
// Lex all of the macro-expanded tokens into Result.
do
{
    Result.push_back(Token());
    Token &Tok = Result.back();
    PP.Lex(Tok);
}
while (Result.back().isNot(tok::eof));

// Pop the token stream off the top of the stack.  We know that the internal
// pointer inside of it is to the "end" of the token stream, but the stack
// will not otherwise be popped until the next token is lexed.  The problem is
// that the token may be lexed sometime after the vector of tokens itself is
// destroyed, which would be badness.
if (PP.InCachingLexMode())
    PP.ExitCachingLexMode();
PP.RemoveTopOfLexerStack();
return Result;

There is one more heavyweight function, stringify, which handles the # operator by combining the adjoined tokens into a single string. Its logic is fairly contorted and is not covered here; see C99 6.10.3.2p2 for the rules.

MacroInfo

This class is much larger in scope than MacroArgs, but much of it is flag bits that need little attention. The main data members are:

class MacroInfo
{
    //===--------------------------------------------------------------------===//
    // State set when the macro is defined.

    /// \brief The location the macro is defined.
    SourceLocation Location;
    /// \brief The location of the last token in the macro.
    SourceLocation EndLocation;

    /// \brief The list of arguments for a function-like macro.
    ///
    /// ArgumentList points to the first of NumArguments pointers.
    ///
    /// This can be empty, for, e.g. "#define X()".  In a C99-style variadic
    /// macro, this includes the \c __VA_ARGS__ identifier on the list.
    IdentifierInfo **ArgumentList;

    /// \see ArgumentList
    unsigned NumArguments;

    /// \brief This is the list of tokens that the macro is defined to.
    SmallVector<Token, 8> ReplacementTokens;

    /// \brief Length in characters of the macro definition.
    mutable unsigned DefinitionLength;
    mutable bool IsDefinitionLengthCached : 1;
}

A function-like macro has a parameter list: the pointers to all parameters are stored in order in the memory starting at ArgumentList, NumArguments of them in total. The total length of the macro definition is DefinitionLength, computed lazily; the adjacent IsDefinitionLengthCached flags whether it has been computed yet. The lazy computation is also straightforward: take the first and last replacement tokens, obtain their expansion locations, and subtract:

unsigned MacroInfo::getDefinitionLengthSlow(SourceManager &SM) const
{
    assert(!IsDefinitionLengthCached);
    IsDefinitionLengthCached = true;

    if (ReplacementTokens.empty())
        return (DefinitionLength = 0);

    const Token &firstToken = ReplacementTokens.front();
    const Token &lastToken = ReplacementTokens.back();
    SourceLocation macroStart = firstToken.getLocation();
    SourceLocation macroEnd = lastToken.getLocation();
    assert(macroStart.isValid() && macroEnd.isValid());
    assert((macroStart.isFileID() || firstToken.is(tok::comment)) &&
        "Macro defined in macro?");
    assert((macroEnd.isFileID() || lastToken.is(tok::comment)) &&
        "Macro defined in macro?");
    std::pair<FileID, unsigned>
        startInfo = SM.getDecomposedExpansionLoc(macroStart);
    std::pair<FileID, unsigned>
        endInfo = SM.getDecomposedExpansionLoc(macroEnd);
    assert(startInfo.first == endInfo.first &&
        "Macro definition spanning multiple FileIDs ?");
    assert(startInfo.second <= endInfo.second);
    DefinitionLength = endInfo.second - startInfo.second;
    DefinitionLength += lastToken.getLength();

    return DefinitionLength;
}

The rest are flag bits of two kinds: intrinsic properties of the macro, and usage state. The intrinsic properties:

// Whether this is a function-like or an object-like macro.
bool IsFunctionLike : 1;

// Whether this is a C99 variadic macro.
bool IsC99Varargs : 1;

// Whether this is a GNU variadic macro.
bool IsGNUVarargs : 1;

// Whether this is a builtin macro, such as __FILE__ or __LINE__.
bool IsBuiltinMacro : 1;

/// \brief Whether this macro contains the sequence ", ## __VA_ARGS__"
bool HasCommaPasting : 1;

And the usage state:

//===--------------------------------------------------------------------===//
// State that changes as the macro is used.

/// \brief True if we have started an expansion of this macro already.
///
/// This disables recursive expansion, which would be quite bad for things
/// like \#define A A.
bool IsDisabled : 1;

/// \brief True if this macro is either defined in the main file and has
/// been used, or if it is not defined in the main file.
///
/// This is used to emit -Wunused-macros diagnostics.
bool IsUsed : 1;

/// \brief True if this macro can be redefined without emitting a warning.
bool IsAllowRedefinitionsWithoutWarning : 1;

/// \brief Must warn if the macro is unused at the end of translation unit.
bool IsWarnIfUnused : 1;

/// \brief Whether this macro info was loaded from an AST file.
unsigned FromASTFile : 1;

/// \brief Whether this macro was used as header guard.
bool UsedForHeaderGuard : 1;

The class has many more functions manipulating these mutable bits, but they are trivial and not detailed here.

The most interesting MacroInfo operation is deciding whether two macros are identical. Identity comes in two flavors: semantic, where parameter names may differ, and lexical, a token-by-token comparison. Hence during comparison, when a parameter appears in the body, only the parameter's index is compared.

bool MacroInfo::isIdenticalTo(const MacroInfo &Other, Preprocessor &PP,
bool Syntactically) const
{
    bool Lexically = !Syntactically;

    // Check # tokens in replacement, number of args, and various flags all match.
    if (ReplacementTokens.size() != Other.ReplacementTokens.size() ||
        getNumArgs() != Other.getNumArgs() ||
        isFunctionLike() != Other.isFunctionLike() ||
        isC99Varargs() != Other.isC99Varargs() ||
        isGNUVarargs() != Other.isGNUVarargs())
        return false;

    if (Lexically)
    {
        // Check arguments.
        for (arg_iterator I = arg_begin(), OI = Other.arg_begin(), E = arg_end();
        I != E; ++I, ++OI)
            if (*I != *OI) return false;
    }

    // Check all the tokens.
    for (unsigned i = 0, e = ReplacementTokens.size(); i != e; ++i)
    {
        const Token &A = ReplacementTokens[i];
        const Token &B = Other.ReplacementTokens[i];
        if (A.getKind() != B.getKind())
            return false;

        // If this isn't the first first token, check that the whitespace and
        // start-of-line characteristics match.
        if (i != 0 &&
            (A.isAtStartOfLine() != B.isAtStartOfLine() ||
                A.hasLeadingSpace() != B.hasLeadingSpace()))
            return false;

        // If this is an identifier, it is easy.
        if (A.getIdentifierInfo() || B.getIdentifierInfo())
        {
            if (A.getIdentifierInfo() == B.getIdentifierInfo())
                continue;
            if (Lexically)
                return false;
            // With syntactic equivalence the parameter names can be different as long
            // as they are used in the same place.
            int AArgNum = getArgumentNum(A.getIdentifierInfo());
            if (AArgNum == -1)
                return false;
            if (AArgNum != Other.getArgumentNum(B.getIdentifierInfo()))
                return false;
            continue;
        }

        // Otherwise, check the spelling.
        if (PP.getSpelling(A) != PP.getSpelling(B))
            return false;
    }
    }

    return true;
}

MacroDirective

MacroDirective is something like a namespace for macros, delimiting a macro's visibility. For visibility it defines the following enum:

enum Kind
{
    MD_Define, MD_Undefine, MD_Visibility
};

MacroDirective itself has few data members:

class MacroDirective
{
    /// \brief Previous macro directive for the same identifier, or NULL.
    MacroDirective *Previous;

    SourceLocation Loc;

    /// \brief MacroDirective kind.
    unsigned MDKind : 2;

    /// \brief True if the macro directive was loaded from a PCH file.
    bool IsFromPCH : 1;

    // Used by VisibilityMacroDirective ----------------------------------------//

    /// \brief Whether the macro has public visibility (when described in a
    /// module).
    bool IsPublic : 1;
}

So there are really just three pieces: a kind, a location, and a chain pointer. Nothing special, so no further explanation.

Based on the concrete MDKind there are three specializations, i.e. subclasses: DefMacroDirective, UndefMacroDirective, and VisibilityMacroDirective. DefMacroDirective inherits its definition almost directly:

class DefMacroDirective : public MacroDirective
{
    MacroInfo *Info;
}

It also carries a MacroInfo pointer.

Similarly there is UndefMacroDirective:

class UndefMacroDirective : public MacroDirective

This one carries no MacroInfo member, oddly enough. The same goes for VisibilityMacroDirective:

class VisibilityMacroDirective : public MacroDirective

The following function, which finds a macro's definition, exploits dynamic casts among these subclasses:

MacroDirective::DefInfo MacroDirective::getDefinition()
{
    MacroDirective *MD = this;
    SourceLocation UndefLoc;
    Optional<bool> isPublic;
    for (; MD; MD = MD->getPrevious())
    {
        if (DefMacroDirective *DefMD = dyn_cast<DefMacroDirective>(MD))
            return DefInfo(DefMD, UndefLoc,
                !isPublic.hasValue() || isPublic.getValue());

        if (UndefMacroDirective *UndefMD = dyn_cast<UndefMacroDirective>(MD))
        {
            UndefLoc = UndefMD->getLocation();
            continue;
        }

        VisibilityMacroDirective *VisMD = cast<VisibilityMacroDirective>(MD);
        if (!isPublic.hasValue())
            isPublic = VisMD->isPublic();
    }

    return DefInfo(nullptr, UndefLoc,
        !isPublic.hasValue() || isPublic.getValue());
}

ModuleMacro

ModuleMacro represents a macro directive imported from an external module. Several modules may define the same macro, so the override chain between definitions must also be recorded:

class ModuleMacro : public llvm::FoldingSetNode
{
    /// The name defined by the macro.
    IdentifierInfo *II;
    /// The body of the #define, or nullptr if this is a #undef.
    MacroInfo *Macro;
    /// The module that exports this macro.
    Module *OwningModule;
    /// The number of module macros that override this one.
    unsigned NumOverriddenBy;
    /// The number of modules whose macros are directly overridden by this one.
    unsigned NumOverrides;
    // ModuleMacro *OverriddenMacros[NumOverrides];
}

Since the type inherits from FoldingSetNode, all ModuleMacros are organized into a FoldingSet.

MacroDefinition

This type records the current definition of a macro, along with the history of its definitions:

class MacroDefinition
{
    llvm::PointerIntPair<DefMacroDirective *, 1, bool> LatestLocalAndAmbiguous;
    ArrayRef<ModuleMacro *> ModuleMacros;
}

The low bit of LatestLocalAndAmbiguous records whether the current definition is ambiguous (conflicting visible macro definitions). The most recent module definitions sit at the back of ModuleMacros.

Macro expansion and TokenLexer

This class handles both macro expansion and _Pragma. Since we only care about common C syntax here, only the macro-expansion part is covered.

TokenLexer is not particularly complex; its main macro-related data members are:

/// Macro - The macro we are expanding from.  This is null if expanding a
/// token stream.
///
MacroInfo *Macro;

/// ActualArgs - The actual arguments specified for a function-like macro, or
/// null.  The TokenLexer owns the pointed-to object.
MacroArgs *ActualArgs;

/// PP - The current preprocessor object we are expanding for.
///
Preprocessor &PP;

/// Tokens - This is the pointer to an array of tokens that the macro is
/// defined to, with arguments expanded for function-like macros.  If this is
/// a token stream, these are the tokens we are returning.  This points into
/// the macro definition we are lexing from, a cache buffer that is owned by
/// the preprocessor, or some other buffer that we may or may not own
/// (depending on OwnsTokens).
/// Note that if it points into Preprocessor's cache buffer, the Preprocessor
/// may update the pointer as needed.
const Token *Tokens;

That is: the macro, its actual arguments, and its body.

The main source-location-related members are:

/// NumTokens - This is the length of the Tokens array.
///
unsigned NumTokens;

/// CurToken - This is the next token that Lex will return.
///
unsigned CurToken;

/// ExpandLocStart/End - The source location range where this macro was
/// expanded.
SourceLocation ExpandLocStart, ExpandLocEnd;

/// \brief Source location pointing at the source location entry chunk that
/// was reserved for the current macro expansion.
SourceLocation MacroExpansionStart;

/// \brief The offset of the macro expansion in the
/// "source location address space".
unsigned MacroStartSLocOffset;

/// \brief Location of the macro definition.
SourceLocation MacroDefStart;
/// \brief Length of the macro definition.
unsigned MacroDefLength;

/// Lexical information about the expansion point of the macro: the identifier
/// that the macro expanded from had these properties.
bool AtStartOfLine : 1;
bool HasLeadingSpace : 1;

// NextTokGetsSpace - When this is true, the next token appended to the
// output list during function argument expansion will get a leading space,
// regardless of whether it had one to begin with or not. This is used for
// placemarker support. If still true after function argument expansion, the
// leading space will be applied to the first token following the macro
// expansion.
bool NextTokGetsSpace : 1;

/// OwnsTokens - This is true if this TokenLexer allocated the Tokens
/// array, and thus needs to free it when destroyed.  For simple object-like
/// macros (for example) we just point into the token buffer of the macro
/// definition, we don't make a copy of it.
bool OwnsTokens : 1;

/// DisableMacroExpansion - This is true when tokens lexed from the TokenLexer
/// should not be subject to further macro expansion.
bool DisableMacroExpansion : 1;

The comments here are thorough enough that no further explanation is needed.

Function Expansion

The most important TokenLexer operation is expansion, encapsulated in ExpandFunctionArguments. Its signature is trivial: no parameters, no return value. But the function runs to over 250 lines, so we dissect it step by step.

First, the basic bookkeeping variables:

SmallVector<Token, 128> ResultToks;

// Loop through 'Tokens', expanding them into ResultToks.  Keep
// track of whether we change anything.  If not, no need to keep them.  If so,
// we install the newly expanded sequence as the new 'Tokens' list.
bool MadeChange = false;

While scanning the macro body, the first thing to check for is a # or #@ operator. If one is found, the following token must be a macro parameter; if it is not, that is an error. If it is a parameter, many details must be handled, as in the following code:

const Token &CurTok = Tokens[i];
if (i != 0 && !Tokens[i - 1].is(tok::hashhash) && CurTok.hasLeadingSpace())
    NextTokGetsSpace = true;

if (CurTok.isOneOf(tok::hash, tok::hashat))
{
    int ArgNo = Macro->getArgumentNum(Tokens[i + 1].getIdentifierInfo());
    assert(ArgNo != -1 && "Token following # is not an argument?");

    SourceLocation ExpansionLocStart =
        getExpansionLocForMacroDefLoc(CurTok.getLocation());
    SourceLocation ExpansionLocEnd =
        getExpansionLocForMacroDefLoc(Tokens[i + 1].getLocation());

    Token Res;
    if (CurTok.is(tok::hash))  // Stringify
        Res = ActualArgs->getStringifiedArgument(ArgNo, PP,
            ExpansionLocStart,
            ExpansionLocEnd);
    else
    {
        // 'charify': don't bother caching these.
        Res = MacroArgs::StringifyArgument(ActualArgs->getUnexpArgument(ArgNo),
            PP, true,
            ExpansionLocStart,
            ExpansionLocEnd);
    }
    Res.setFlag(Token::StringifiedInMacro);

    // The stringified/charified string leading space flag gets set to match
    // the #/#@ operator.
    if (NextTokGetsSpace)
        Res.setFlag(Token::LeadingSpace);

    ResultToks.push_back(Res);
    MadeChange = true;
    ++i;  // Skip arg name.
    NextTokGetsSpace = false;
    continue;
}

To understand this code, one must distinguish the two special operators # and #@: one stringifies, the other charifies. Processing automatically consumes the following token (the parameter name), hence the ++i.

Next, the ## paste operator needs special treatment: check whether the tokens before and after the current token are paste operators.

// Find out if there is a paste (##) operator before or after the token.
bool NonEmptyPasteBefore =
    !ResultToks.empty() && ResultToks.back().is(tok::hashhash);
bool PasteBefore = i != 0 && Tokens[i - 1].is(tok::hashhash);
bool PasteAfter = i + 1 != e && Tokens[i + 1].is(tok::hashhash);
assert(!NonEmptyPasteBefore || PasteBefore);

Then check whether the current token is a macro parameter; if not, it is copied straight to the output:

// Otherwise, if this is not an argument token, just add the token to the
// output buffer.
IdentifierInfo *II = CurTok.getIdentifierInfo();
int ArgNo = II ? Macro->getArgumentNum(II) : -1;
if (ArgNo == -1)
{
    // This isn't an argument, just add it.
    ResultToks.push_back(CurTok);

    if (NextTokGetsSpace)
    {
        ResultToks.back().setFlag(Token::LeadingSpace);
        NextTokGetsSpace = false;
    }
    else if (PasteBefore && !NonEmptyPasteBefore)
        ResultToks.back().clearFlag(Token::LeadingSpace);

    continue;
}

Otherwise only one case remains: the current token is a parameter and must be substituted. This splits into two sub-cases, ordinary substitution and ##.

Ordinary substitution

For ordinary substitution, the argument is pre-expanded first and then substituted:

// If it is not the LHS/RHS of a ## operator, we must pre-expand the
// argument and substitute the expanded tokens into the result.  This is
// C99 6.10.3.1p1.
if (!PasteBefore && !PasteAfter)
{
    const Token *ResultArgToks;

    // Only preexpand the argument if it could possibly need it.  This
    // avoids some work in common cases.
    const Token *ArgTok = ActualArgs->getUnexpArgument(ArgNo);
    if (ActualArgs->ArgNeedsPreexpansion(ArgTok, PP))
        ResultArgToks = &ActualArgs->getPreExpArgument(ArgNo, Macro, PP)[0];
    else
        ResultArgToks = ArgTok;  // Use non-preexpanded tokens.

// If the arg token expanded into anything, append it.
    if (ResultArgToks->isNot(tok::eof))
    {
        unsigned FirstResult = ResultToks.size();
        unsigned NumToks = MacroArgs::getArgLength(ResultArgToks);
        ResultToks.append(ResultArgToks, ResultArgToks + NumToks);

        // In Microsoft-compatibility mode, we follow MSVC's preprocessing
        // behavior by not considering single commas from nested macro
        // expansions as argument separators. Set a flag on the token so we can
        // test for this later when the macro expansion is processed.
        if (PP.getLangOpts().MSVCCompat && NumToks == 1 &&
            ResultToks.back().is(tok::comma))
            ResultToks.back().setFlag(Token::IgnoredComma);

        // If the '##' came from expanding an argument, turn it into 'unknown'
        // to avoid pasting.
        for (unsigned i = FirstResult, e = ResultToks.size(); i != e; ++i)
        {
            Token &Tok = ResultToks[i];
            if (Tok.is(tok::hashhash))
                Tok.setKind(tok::unknown);
        }

        if (ExpandLocStart.isValid())
        {
            updateLocForMacroArgTokens(CurTok.getLocation(),
                ResultToks.begin() + FirstResult,
                ResultToks.end());
        }

        // If any tokens were substituted from the argument, the whitespace
        // before the first token should match the whitespace of the arg
        // identifier.
        ResultToks[FirstResult].setFlagValue(Token::LeadingSpace,
            NextTokGetsSpace);
        NextTokGetsSpace = false;
    }
    continue;
}

Once the argument has been computed, besides the token substitution itself this code handles three issues: Microsoft's comma-separation compatibility, ## tokens produced by argument expansion, and recomputing source locations. The first two just set flags; the last calls a function to recompute the SourceLocations, defined as follows:

/// \brief Creates SLocEntries and updates the locations of macro argument
/// tokens to their new expanded locations.
///
/// \param ArgIdDefLoc the location of the macro argument id inside the macro
/// definition.
/// \param Tokens the macro argument tokens to update.
void TokenLexer::updateLocForMacroArgTokens(SourceLocation ArgIdSpellLoc,
    Token *begin_tokens,
    Token *end_tokens)
{
    SourceManager &SM = PP.getSourceManager();

    SourceLocation InstLoc =
        getExpansionLocForMacroDefLoc(ArgIdSpellLoc);

    while (begin_tokens < end_tokens)
    {
        // If there's only one token just create a SLocEntry for it.
        if (end_tokens - begin_tokens == 1)
        {
            Token &Tok = *begin_tokens;
            Tok.setLocation(SM.createMacroArgExpansionLoc(Tok.getLocation(),
                InstLoc,
                Tok.getLength()));
            return;
        }

        updateConsecutiveMacroArgTokens(SM, InstLoc, begin_tokens, end_tokens);
    }
}

Inside the while loop, updateConsecutiveMacroArgTokens (quite a long name) is called to adjust the locations of runs of consecutive tokens. Note that begin_tokens is passed in by reference (end_tokens by value), which is what lets the loop make progress and terminate. Its signature:

/// \brief Finds the tokens that are consecutive (from the same FileID)
/// creates a single SLocEntry, and assigns SourceLocations to each token that
/// point to that SLocEntry. e.g for
///   assert(foo == bar);
/// There will be a single SLocEntry for the "foo == bar" chunk and locations
/// for the 'foo', '==', 'bar' tokens will point inside that chunk.
///
/// \arg begin_tokens will be updated to a position past all the found
/// consecutive tokens.
static void updateConsecutiveMacroArgTokens(SourceManager &SM,
    SourceLocation InstLoc,
    Token *&begin_tokens,
    Token * end_tokens)

The internals are simple: it scans for tokens that come from the same file and are no more than 50 characters apart, and merges them into a single SLocEntry to compress storage; each Token still keeps its own SourceLocation. The scan that finds the consecutive tokens:

assert(begin_tokens < end_tokens);

SourceLocation FirstLoc = begin_tokens->getLocation();
SourceLocation CurLoc = FirstLoc;

// Compare the source location offset of tokens and group together tokens that
// are close, even if their locations point to different FileIDs. e.g.
//
//  |bar    |  foo | cake   |  (3 tokens from 3 consecutive FileIDs)
//  ^                    ^
//  |bar       foo   cake|     (one SLocEntry chunk for all tokens)
//
// we can perform this "merge" since the token's spelling location depends
// on the relative offset.

Token *NextTok = begin_tokens + 1;
for (; NextTok < end_tokens; ++NextTok)
{
    SourceLocation NextLoc = NextTok->getLocation();
    if (CurLoc.isFileID() != NextLoc.isFileID())
        break; // Token from different kind of FileID.

    int RelOffs;
    if (!SM.isInSameSLocAddrSpace(CurLoc, NextLoc, &RelOffs))
        break; // Token from different local/loaded location.
 // Check that token is not before the previous token or more than 50
 // "characters" away.
    if (RelOffs < 0 || RelOffs > 50)
        break;
    CurLoc = NextLoc;
}

What follows is comparatively simple: fill in the corresponding SLocEntry while keeping each Token's own SourceLocation:

// For the consecutive tokens, find the length of the SLocEntry to contain
// all of them.
Token &LastConsecutiveTok = *(NextTok - 1);
int LastRelOffs = 0;
SM.isInSameSLocAddrSpace(FirstLoc, LastConsecutiveTok.getLocation(),
    &LastRelOffs);
unsigned FullLength = LastRelOffs + LastConsecutiveTok.getLength();

// Create a macro expansion SLocEntry that will "contain" all of the tokens.
SourceLocation Expansion =
    SM.createMacroArgExpansionLoc(FirstLoc, InstLoc, FullLength);

// Change the location of the tokens from the spelling location to the new
// expanded location.
for (; begin_tokens < NextTok; ++begin_tokens)
{
    Token &Tok = *begin_tokens;
    int RelOffs = 0;
    SM.isInSameSLocAddrSpace(FirstLoc, Tok.getLocation(), &RelOffs);
    Tok.setLocation(Expansion.getLocWithOffset(RelOffs));
}

Token Pasting (##)

Briefly: C macros may be nested, and nested expansion normally proceeds like function calls: arguments are expanded first, then the outer macro, i.e. from the inside out. But when a macro body contains #, the argument is not expanded at all; and with ##, the macro itself is expanded first and the arguments afterwards. So when handling ##, we cannot use the pre-expanded form of an actual argument, only the unexpanded form. The first step is therefore to fetch the unexpanded argument:

// Okay, we have a token that is either the LHS or RHS of a paste (##)
// argument.  It gets substituted as its non-pre-expanded tokens.
const Token *ArgToks = ActualArgs->getUnexpArgument(ArgNo);
unsigned NumToks = MacroArgs::getArgLength(ArgToks);

If the actual argument is not empty, the paste has to be performed. Here we run into the annoying GNU comma extension, written `, ## __VA_ARGS__`: we record the use of the extension, pop the `##`, and then continue.

The argument's tokens are first appended to the result token array; any `##` among the appended tokens is marked as tok::unknown. We also have to decide whether the first token keeps its leading space. The whole block:

if (NumToks)
{  // Not an empty argument?
// If this is the GNU ", ## __VA_ARGS__" extension, and we just learned
// that __VA_ARGS__ expands to multiple tokens, avoid a pasting error when
// the expander trys to paste ',' with the first token of the __VA_ARGS__
// expansion.
    if (NonEmptyPasteBefore && ResultToks.size() >= 2 &&
        ResultToks[ResultToks.size() - 2].is(tok::comma) &&
        (unsigned)ArgNo == Macro->getNumArgs() - 1 &&
        Macro->isVariadic())
    {
        // Remove the paste operator, report use of the extension.
        PP.Diag(ResultToks.pop_back_val().getLocation(), diag::ext_paste_comma);
    }

    ResultToks.append(ArgToks, ArgToks + NumToks);

    // If the '##' came from expanding an argument, turn it into 'unknown'
    // to avoid pasting.
    for (unsigned i = ResultToks.size() - NumToks, e = ResultToks.size();
    i != e; ++i)
    {
        Token &Tok = ResultToks[i];
        if (Tok.is(tok::hashhash))
            Tok.setKind(tok::unknown);
    }

    if (ExpandLocStart.isValid())
    {
        updateLocForMacroArgTokens(CurTok.getLocation(),
            ResultToks.end() - NumToks, ResultToks.end());
    }

    // If this token (the macro argument) was supposed to get leading
    // whitespace, transfer this information onto the first token of the
    // expansion.
    //
    // Do not do this if the paste operator occurs before the macro argument,
    // as in "A ## MACROARG".  In valid code, the first token will get
    // smooshed onto the preceding one anyway (forming AMACROARG).  In
    // assembler-with-cpp mode, invalid pastes are allowed through: in this
    // case, we do not want the extra whitespace to be added.  For example,
    // we want ". ## foo" -> ".foo" not ". foo".
    if (NextTokGetsSpace)
        ResultToks[ResultToks.size() - NumToks].setFlag(Token::LeadingSpace);

    NextTokGetsSpace = false;
    continue;
}

There is one more special case: if either operand of a `##` is empty, the paste operator is simply discarded.

// If an empty argument is on the LHS or RHS of a paste, the standard (C99
// 6.10.3.3p2,3) calls for a bunch of placemarker stuff to occur.  We
// implement this by eating ## operators when a LHS or RHS expands to
// empty.
if (PasteAfter)
{
    // Discard the argument token and skip (don't copy to the expansion
    // buffer) the paste operator after it.
    ++i;
    continue;
}

Macro Cache

If this macro will be expanded multiple times, the previous analysis can be reused: the tokens are placed in the preprocessor's macro-expansion cache, and ownership of the current ResultToks is transferred to the Preprocessor.

if (MadeChange)
{
    assert(!OwnsTokens && "This would leak if we already own the token list");
    // This is deleted in the dtor.
    NumTokens = ResultToks.size();
    // The tokens will be added to Preprocessor's cache and will be removed
    // when this TokenLexer finishes lexing them.
    Tokens = PP.cacheMacroExpandedTokens(this, ResultToks);

    // The preprocessor cache of macro expanded tokens owns these tokens,not us.
    OwnsTokens = false;
}

Every TokenLexer constructor calls Init, which shows how central it is. Its signature:

/// Init - Initialize this TokenLexer to expand from the specified macro
/// with the specified argument information.  Note that this ctor takes
/// ownership of the ActualArgs pointer.  ILEnd specifies the location of the
/// ')' for a function-like macro or the identifier for an object-like macro.
void Init(Token &Tok, SourceLocation ILEnd, MacroInfo *MI,
    MacroArgs *ActualArgs);

The parameters are: the token where expansion starts, the location where it ends, the macro definition info, and the macro's actual arguments.

Since a TokenLexer is a reusable object, Init must first release the resources held from the previous use by calling destroy:

void TokenLexer::destroy()
{
    // If this was a function-like macro that actually uses its arguments, delete
    // the expanded tokens.
    if (OwnsTokens)
    {
        delete[] Tokens;
        Tokens = nullptr;
        OwnsTokens = false;
    }

    // TokenLexer owns its formal arguments.
    if (ActualArgs) ActualArgs->destroy(PP);
}

This function releases the owned Tokens and ActualArgs, both of which belong to the previous macro.

After releasing those resources, the TokenLexer is reset from the input parameters, so Init begins like this:

destroy();

Macro = MI;
ActualArgs = Actuals;
CurToken = 0;

ExpandLocStart = Tok.getLocation();
ExpandLocEnd = ELEnd;
AtStartOfLine = Tok.isAtStartOfLine();
HasLeadingSpace = Tok.hasLeadingSpace();
NextTokGetsSpace = false;
Tokens = &*Macro->tokens_begin();
OwnsTokens = false;
DisableMacroExpansion = false;
NumTokens = Macro->tokens_end() - Macro->tokens_begin();
MacroExpansionStart = SourceLocation();

SourceManager &SM = PP.getSourceManager();
MacroStartSLocOffset = SM.getNextLocalOffset();

All kinds of initialization! Then a source-location chunk is reserved:

if (NumTokens > 0)
{
    assert(Tokens[0].getLocation().isValid());
    assert((Tokens[0].getLocation().isFileID() || Tokens[0].is(tok::comment)) &&
        "Macro defined in macro?");
    assert(ExpandLocStart.isValid());

    // Reserve a source location entry chunk for the length of the macro
    // definition. Tokens that get lexed directly from the definition will
    // have their locations pointing inside this chunk. This is to avoid
    // creating separate source location entries for each token.
    MacroDefStart = SM.getExpansionLoc(Tokens[0].getLocation());
    MacroDefLength = Macro->getDefinitionLength(SM);
    MacroExpansionStart = SM.createExpansionLoc(MacroDefStart,
        ExpandLocStart,
        ExpandLocEnd,
        MacroDefLength);
}

If this is a function-like macro with parameters, its arguments are expanded:

// If this is a function-like macro, expand the arguments and change
// Tokens to point to the expanded tokens.
if (Macro->isFunctionLike() && Macro->getNumArgs())
    ExpandFunctionArguments();

Lex

Each call to Lex returns one Token and removes it from the result, much like a queue.pop_front() operation. Its signature:

/// Lex - Lex and return a token from this macro stream.
///
bool TokenLexer::Lex(Token &Tok)

On entry it first checks whether we have reached the end of the buffer. If so, the current macro is re-enabled so it may be expanded elsewhere, and the final handling is delegated to the Preprocessor:

// Lexing off the end of the macro, pop this macro off the expansion stack.
if (isAtEnd())
{
    // If this is a macro (not a token stream), mark the macro enabled now
    // that it is no longer being expanded.
    if (Macro) Macro->EnableMacro();

    Tok.startToken();
    Tok.setFlagValue(Token::StartOfLine, AtStartOfLine);
    Tok.setFlagValue(Token::LeadingSpace, HasLeadingSpace || NextTokGetsSpace);
    if (CurToken == 0)
        Tok.setFlag(Token::LeadingEmptyMacro);
    return PP.HandleEndOfTokenLexer(Tok);
}

Otherwise, fetch the current Token and check whether the next one is a `##`:

SourceManager &SM = PP.getSourceManager();

// If this is the first token of the expanded result, we inherit spacing
// properties later.
bool isFirstToken = CurToken == 0;

// Get the next token to return.
Tok = Tokens[CurToken++];

bool TokenIsFromPaste = false;

If it is `##`, a PasteTokens operation is needed:

// If this token is followed by a token paste (##) operator, paste the tokens!
// Note that ## is a normal token when not expanding a macro.
if (!isAtEnd() && Macro &&
    (Tokens[CurToken].is(tok::hashhash) ||
        // Special processing of L#x macros in -fms-compatibility mode.
        // Microsoft compiler is able to form a wide string literal from
        // 'L#macro_arg' construct in a function-like macro.
        (PP.getLangOpts().MSVCCompat &&
            isWideStringLiteralFromMacro(Tok, Tokens[CurToken]))))
{
    // When handling the microsoft /##/ extension, the final token is
    // returned by PasteTokens, not the pasted token.
    if (PasteTokens(Tok))
        return true;

    TokenIsFromPaste = true;
}

PasteTokens is very long and will not be covered in detail here (mainly because I do not fully understand it myself yet!). Roughly, it joins the tokens on both sides of `##` into a single identifier; note that some cases require iteration, e.g. multiple consecutive paste operators. Its signature and comment:

/// PasteTokens - Tok is the LHS of a ## operator, and CurToken is the ##
/// operator.  Read the ## and RHS, and paste the LHS/RHS together.  If there
/// are more ## after it, chomp them iteratively.  Return the result as Tok.
/// If this returns true, the caller should immediately return the token.
bool TokenLexer::PasteTokens(Token &Tok)

If the next token is an ordinary one, we only need to fix up the location information:

// The token's current location indicate where the token was lexed from.  We
// need this information to compute the spelling of the token, but any
// diagnostics for the expanded token should appear as if they came from
// ExpansionLoc.  Pull this information together into a new SourceLocation
// that captures all of this.
if (ExpandLocStart.isValid() &&   // Don't do this for token streams.
    // Check that the token's location was not already set properly.
    SM.isBeforeInSLocAddrSpace(Tok.getLocation(), MacroStartSLocOffset))
{
    SourceLocation instLoc;
    if (Tok.is(tok::comment))
    {
        instLoc = SM.createExpansionLoc(Tok.getLocation(),
            ExpandLocStart,
            ExpandLocEnd,
            Tok.getLength());
    }
    else
    {
        instLoc = getExpansionLocForMacroDefLoc(Tok.getLocation());
    }

    Tok.setLocation(instLoc);
}

What remains is mostly error handling. If the current Token is an identifier, we check whether it is valid; if not, it is handed to error handling:

// Handle recursive expansion!
if (!Tok.isAnnotation() && Tok.getIdentifierInfo() != nullptr)
{
    // Change the kind of this identifier to the appropriate token kind, e.g.
    // turning "for" into a keyword.
    IdentifierInfo *II = Tok.getIdentifierInfo();
    Tok.setKind(II->getTokenID());

    // If this identifier was poisoned and from a paste, emit an error.  This
    // won't be handled by Preprocessor::HandleIdentifier because this is coming
    // from a macro expansion.
    if (II->isPoisoned() && TokenIsFromPaste)
    {
        PP.HandlePoisonedIdentifier(Tok);
    }

    if (!DisableMacroExpansion && II->isHandleIdentifierCase())
        return PP.HandleIdentifier(Tok);
}

Preprocessor

Preprocessor Callbacks

The preprocessor callback interfaces live in PPCallbacks, which exposes many hooks during preprocessing. The more important ones:

  1. InclusionDirective — called when a header-inclusion or module-import directive is processed
  2. PragmaDirective — called when a pragma directive is processed
  3. MacroExpands — called when Preprocessor::HandleMacroExpandedIdentifier finds a macro expansion
  4. MacroDefined, MacroUndefined, Defined — self-explanatory
  5. If, Elif, Ifdef, Ifndef, Else, Endif — self-explanatory

To support multiple callbacks there is also PPChainedCallbacks, simply a pair of callbacks: each hook first invokes First, then Second.

/// \brief Simple wrapper class for chaining callbacks.
class PPChainedCallbacks : public PPCallbacks
{
    virtual void anchor();
    std::unique_ptr<PPCallbacks> First, Second;

public:
    PPChainedCallbacks(std::unique_ptr<PPCallbacks> _First,
                       std::unique_ptr<PPCallbacks> _Second)
        : First(std::move(_First)), Second(std::move(_Second))
    {}
}

These interfaces alone give a good overview of the main work done during preprocessing.

Pragma

Pragma handling mostly lives in Pragma.h. Clang distinguishes three ways a pragma can be introduced:

enum PragmaIntroducerKind
{
    /**
    * \brief The pragma was introduced via \#pragma.
    */
    PIK_HashPragma,

    /**
    * \brief The pragma was introduced via the C99 _Pragma(string-literal).
    */
    PIK__Pragma,

    /**
    * \brief The pragma was introduced via the Microsoft
    * __pragma(token-string).
    */
    PIK___pragma
};

Each time a pragma is introduced, a corresponding PragmaHandler processes it:

class PragmaHandler
{
    std::string Name;
    public:
    explicit PragmaHandler(StringRef name) : Name(name)
    {}
    PragmaHandler()
    {}
    virtual ~PragmaHandler();

    StringRef getName() const
    {
        return Name;
    }
    virtual void HandlePragma(Preprocessor &PP, PragmaIntroducerKind Introducer,
    Token &FirstToken) = 0;

    /// getIfNamespace - If this is a namespace, return it.  This is equivalent to
    /// using a dynamic_cast, but doesn't require RTTI.
    virtual PragmaNamespace *getIfNamespace()
    {
        return nullptr;
    }
};

Each handler carries the pragma's name, and handlers are organized under a PragmaNamespace. PragmaNamespace itself also inherits from PragmaHandler, but additionally stores an llvm::StringMap<PragmaHandler*>, so it can be understood as a directory-tree structure.

/// PragmaNamespace - This PragmaHandler subdivides the namespace of pragmas,
/// allowing hierarchical pragmas to be defined.  Common examples of namespaces
/// are "\#pragma GCC", "\#pragma STDC", and "\#pragma omp", but any namespaces
/// may be (potentially recursively) defined.
class PragmaNamespace : public PragmaHandler
{
    /// Handlers - This is a map of the handlers in this namespace with their name
    /// as key.
    ///
    llvm::StringMap<PragmaHandler*> Handlers;
}

The actual processing function is HandlePragma: it looks up the matching handler in the current PragmaNamespace and delegates to it:

void PragmaNamespace::HandlePragma(Preprocessor &PP,
                                   PragmaIntroducerKind Introducer,
                                   Token &Tok)
{
    // Read the 'namespace' that the directive is in, e.g. STDC.  Do not macro
    // expand it, the user can have a STDC #define, that should not affect this.
    PP.LexUnexpandedToken(Tok);

    // Get the handler for this token.  If there is no handler, ignore the pragma.
    PragmaHandler *Handler
    = FindHandler(Tok.getIdentifierInfo() ? Tok.getIdentifierInfo()->getName()
    : StringRef(),
    /*IgnoreNull=*/false);
    if (!Handler)
    {
        PP.Diag(Tok, diag::warn_pragma_ignored);
        return;
    }

    // Otherwise, pass it down.
    Handler->HandlePragma(PP, Introducer, Tok);
}

In general, pragma processing works like this: when a pragma is found, lex to end-of-line, treat the first token as the pragma's name, and dispatch to the matching PragmaHandler. The complication is macros: we need to know whether the pragma appears during macro expansion, since expansion may cause the pragma to be ignored. Here is an example of a pragma and a macro interfering with each other:

#define EMPTY(x)
#define INACTIVE(x) EMPTY(x)
INACTIVE(_Pragma("clang diagnostic ignored \"-Wconversion\""))

So the final flow is amended: if the pragma occurs inside a macro expansion, it is not processed immediately; the position is recorded as a backtrack point and lexing continues. Only once macro processing has validated the form do we return to that backtrack point and actually handle the pragma.

One special pragma is #pragma once, which marks the current header as already processed so it need not be included again:

/// HandlePragmaOnce - Handle \#pragma once.  OnceTok is the 'once'.
///
void Preprocessor::HandlePragmaOnce(Token &OnceTok)
{
    if (isInPrimaryFile())
    {
        Diag(OnceTok, diag::pp_pragma_once_in_main_file);
        return;
    }

    // Get the current file lexer we're looking at.  Ignore _Pragma 'files' etc.
    // Mark the file as a once-only file now.
    HeaderInfo.MarkFileIncludeOnce(getCurrentFileLexer()->getFileEntry());
}

There is also a very interesting pair: #pragma push_macro and #pragma pop_macro.

  1. #pragma push_macro("MACRONAME") saves the value currently associated with macro MACRONAME onto a stack;
  2. #pragma pop_macro("MACRONAME") re-associates MACRONAME with the value previously saved on that stack.

Typically this is used when the code before and after some section needs one definition of a macro, while the middle section needs another:

#define MACRO_FOO an_str_value // first definition of MACRO_FOO
/*
 * code using MACRO_FOO == an_str_value
 */
#pragma push_macro("MACRO_FOO") // save the an_str_value associated with MACRO_FOO
#undef MACRO_FOO // do not forget this, or you will get a redefinition warning

#define MACRO_FOO another_str_value // second definition of MACRO_FOO
/*
 * code using MACRO_FOO == another_str_value
 */

// no #undef MACRO_FOO needed before pop_macro: we only rebind the value
#pragma pop_macro("MACRO_FOO") // restore MACRO_FOO to the saved an_str_value
/*
 * code using MACRO_FOO == an_str_value again
 */

This technique is mainly used to hook macros that ship with the standard library, such as new.

Clang comes with a set of built-in pragma handlers, all registered up front:

void Preprocessor::RegisterBuiltinPragmas()
{
    AddPragmaHandler(new PragmaOnceHandler());
    AddPragmaHandler(new PragmaMarkHandler());
    AddPragmaHandler(new PragmaPushMacroHandler());
    AddPragmaHandler(new PragmaPopMacroHandler());
    AddPragmaHandler(new PragmaMessageHandler(PPCallbacks::PMK_Message));

    // #pragma GCC ...
    AddPragmaHandler("GCC", new PragmaPoisonHandler());
    AddPragmaHandler("GCC", new PragmaSystemHeaderHandler());
    AddPragmaHandler("GCC", new PragmaDependencyHandler());
    AddPragmaHandler("GCC", new PragmaDiagnosticHandler("GCC"));
    AddPragmaHandler("GCC", new PragmaMessageHandler(PPCallbacks::PMK_Warning,
    "GCC"));
    AddPragmaHandler("GCC", new PragmaMessageHandler(PPCallbacks::PMK_Error,
    "GCC"));
    // #pragma clang ...
    AddPragmaHandler("clang", new PragmaPoisonHandler());
    AddPragmaHandler("clang", new PragmaSystemHeaderHandler());
    AddPragmaHandler("clang", new PragmaDebugHandler());
    AddPragmaHandler("clang", new PragmaDependencyHandler());
    AddPragmaHandler("clang", new PragmaDiagnosticHandler("clang"));
    AddPragmaHandler("clang", new PragmaARCCFCodeAuditedHandler());

    AddPragmaHandler("STDC", new PragmaSTDC_FENV_ACCESSHandler());
    AddPragmaHandler("STDC", new PragmaSTDC_CX_LIMITED_RANGEHandler());
    AddPragmaHandler("STDC", new PragmaSTDC_UnknownHandler());

    // MS extensions.
    if (LangOpts.MicrosoftExt)
    {
        AddPragmaHandler(new PragmaWarningHandler());
        AddPragmaHandler(new PragmaIncludeAliasHandler());
        AddPragmaHandler(new PragmaRegionHandler("region"));
        AddPragmaHandler(new PragmaRegionHandler("endregion"));
    }
}

Conditional Compilation

Conditional-compilation handling lives in PPConditionalDirectiveRecord.h. It defines PPConditionalDirectiveRecord, a subclass of PPCallbacks that overrides the conditional-directive hooks, plus CondDirectiveLoc, which stores the location of a conditional directive, together with its comparator Comp. PPConditionalDirectiveRecord keeps a vector of all conditional-directive locations, effectively an ordered stack:

typedef std::vector<CondDirectiveLoc> CondDirectiveLocsTy;
/// \brief The locations of conditional directives in source order.
CondDirectiveLocsTy CondDirectiveLocs;

This ordered stack is used to decide which conditional region a source range falls in; two query interfaces are provided for that:

/// \brief Returns true if the given range intersects with a conditional
/// directive. if a \#if/\#endif block is fully contained within the range,
/// this function will return false.
bool rangeIntersectsConditionalDirective(SourceRange Range) const;

/// \brief Returns true if the given locations are in different regions,
/// separated by conditional directive blocks.
bool areInDifferentConditionalDirectiveRegion(SourceLocation LHS,
SourceLocation RHS) const
{
    return findConditionalDirectiveRegionLoc(LHS) !=
    findConditionalDirectiveRegionLoc(RHS);
}

SourceLocation findConditionalDirectiveRegionLoc(SourceLocation Loc) const;

The most basic of these three interfaces is findConditionalDirectiveRegionLoc, which is essentially a std::lower_bound search plus a check:

SourceLocation PPConditionalDirectiveRecord::findConditionalDirectiveRegionLoc(
SourceLocation Loc) const {
    if (Loc.isInvalid())
    return SourceLocation();
    if (CondDirectiveLocs.empty())
    return SourceLocation();

    if (SourceMgr.isBeforeInTranslationUnit(CondDirectiveLocs.back().getLoc(),
    Loc))
    return CondDirectiveStack.back();

    CondDirectiveLocsTy::const_iterator
    low = std::lower_bound(CondDirectiveLocs.begin(), CondDirectiveLocs.end(),
    Loc, CondDirectiveLoc::Comp(SourceMgr));
    assert(low != CondDirectiveLocs.end());
    return low->getRegionLoc();
}

Preprocessor Expression Evaluation

Expression evaluation happens in #if directives. The operands are unsigned integers only, the operators are the usual arithmetic ones, and there are no user-defined functions. This code lives in PPExpressions.cpp.

It first defines a PPValue structure representing an expression's value together with the SourceLocation range the expression occupies; think of it as a node in the expression evaluation tree.

Subexpressions of the form defined(X) or !defined(X) may also appear; clang uses a DefinedTracker to record whether a subexpression has one of these two forms.

struct DefinedTracker
{
    /// Each time a Value is evaluated, it returns information about whether the
    /// parsed value is of the form defined(X), !defined(X) or is something else.
    enum TrackerState
    {
        DefinedMacro,        // defined(X)
        NotDefinedMacro,     // !defined(X)
        Unknown              // Something else.
    } State;
    /// TheMacro - When the state is DefinedMacro or NotDefinedMacro, this
    /// indicates the macro that was checked.
    IdentifierInfo *TheMacro;
};

A dedicated function, EvaluateDefined, handles this form. It mostly performs validity checks, and if the operand names a macro, marks that macro as used.

Most expressions, though, are arithmetic; the defined form is comparatively rare. Arithmetic evaluation first needs the value of each individual term, computed in EvaluateValue, which may call EvaluateDefined since defined also counts as a term. The remaining cases fall into three kinds: boolean and integer literals, and character literals (true/false count as predefined macros). Somewhat surprisingly, EvaluateValue also handles parentheses, unary plus/minus, bitwise not and logical not. To handle an opening parenthesis it calls EvaluateDirectiveSubExpr, the function that evaluates interior expression-tree nodes, which in turn calls back into EvaluateValue.

When EvaluateDirectiveSubExpr is called, it already has the left operand's value and the precedence of the operator to its left. It then peeks at the next operator:

  1. If that operator's precedence is lower than the recorded left precedence, return immediately.
  2. If it is a short-circuiting logical operator, check whether the left value alone already determines the subexpression's value, and if so mark the right-hand side as dead (IsDead).
  3. If its precedence is higher, call EvaluateValue on the RHS and peek at the operator after it to decide whether this subexpression node can be closed; if not, recurse into EvaluateDirectiveSubExpr. The conditional operator needs special handling: on seeing '?' it is given the precedence of ','. For other operators, the precedence passed down is incremented by one, which is how right-associative operators are handled.

The entry point for both functions is EvaluateDirectiveExpression, which first peeks at the token and then calls EvaluateValue to evaluate the complete expression.

Preprocessing Record

Everything that happens during preprocessing is recorded as a PreprocessedEntity; these entities capture all the preprocessing details. Preprocessing events fall into the following kinds:

enum EntityKind
{
    /// \brief Indicates a problem trying to load the preprocessed entity.
    InvalidKind,

    /// \brief A macro expansion.
    MacroExpansionKind,

    /// \defgroup Preprocessing directives
    /// @{

        /// \brief A macro definition.
        MacroDefinitionKind,

        /// \brief An inclusion directive, such as \c \#include, \c
        /// \#import, or \c \#include_next.
        InclusionDirectiveKind,

        /// @}

    FirstPreprocessingDirective = MacroDefinitionKind,
    LastPreprocessingDirective = InclusionDirectiveKind
};

All entities inherit from a base class, PreprocessedEntity, which records the record's kind and the source range where it occurred:

class PreprocessedEntity
{
    public:
    /// \brief The kind of preprocessed entity an object describes.
    private:
    /// \brief The kind of preprocessed entity that this object describes.
    EntityKind Kind;

    /// \brief The source range that covers this preprocessed entity.
    SourceRange Range;

    protected:
    PreprocessedEntity(EntityKind Kind, SourceRange Range)
    : Kind(Kind), Range(Range)
    {}

    friend class PreprocessingRecord;

    public:
    /// \brief Retrieve the kind of preprocessed entity stored in this object.
    EntityKind getKind() const
    {
        return Kind;
    }

    /// \brief Retrieve the source range that covers this entire preprocessed 
    /// entity.
    SourceRange getSourceRange() const LLVM_READONLY
    {
        return Range;
    }

    /// \brief Returns true if there was a problem loading the preprocessed
    /// entity.
    bool isInvalid() const
    {
        return Kind == InvalidKind;
    }
}

The base class also overrides new and delete so that all allocations are 8-byte aligned.

Four subclasses derive from this base, matching the kinds defined in the enum. The notable one is MacroExpansion, which distinguishes builtin macros from macros defined in source files:

llvm::PointerUnion<IdentifierInfo *, MacroDefinition *> NameOrDef;

All preprocessed entities are managed by a PreprocessingRecord, which allocates their memory and assigns entity IDs. IDs are signed: positive IDs denote entities introduced by the current preprocessor, while negative IDs denote entities loaded from external sources. Two vectors record the entities seen locally and the entities actually loaded:

/// \brief The set of preprocessed entities in this record, in order they
/// were seen.
std::vector<PreprocessedEntity *> PreprocessedEntities;

/// \brief The set of preprocessed entities in this record that have been
/// loaded from external sources.
///
/// The entries in this vector are loaded lazily from the external source,
/// and are referenced by the iterator using negative indices.
std::vector<PreprocessedEntity *> LoadedPreprocessedEntities;

/// \brief The set of ranges that were skipped by the preprocessor,
std::vector<SourceRange> SkippedRanges;

A map also records, for each MacroInfo, the corresponding MacroDefinition, i.e. the macro's definition:

/// \brief Mapping from MacroInfo structures to their definitions.
llvm::DenseMap<const MacroInfo *, MacroDefinition *> MacroDefinitions;

The record provides construction and registration hooks for every kind of entity:

void MacroExpands(const Token &Id, const MacroDirective *MD,
                  SourceRange Range, const MacroArgs *Args) override;
void MacroDefined(const Token &Id, const MacroDirective *MD) override;
void MacroUndefined(const Token &Id, const MacroDirective *MD) override;
void InclusionDirective(SourceLocation HashLoc, const Token &IncludeTok,
                        StringRef FileName, bool IsAngled,
                        CharSourceRange FilenameRange,
                        const FileEntry *File, StringRef SearchPath,
                        StringRef RelativePath,
                        const Module *Imported) override;
void Ifdef(SourceLocation Loc, const Token &MacroNameTok,
           const MacroDirective *MD) override;
void Ifndef(SourceLocation Loc, const Token &MacroNameTok,
            const MacroDirective *MD) override;
/// \brief Hook called whenever the 'defined' operator is seen.
void Defined(const Token &MacroNameTok, const MacroDirective *MD,
             SourceRange Range) override;

void SourceRangeSkipped(SourceRange Range) override;

void addMacroExpansion(const Token &Id, const MacroInfo *MI,
                       SourceRange Range);

These functions all end up in the overloaded new, then register the entity via addPreprocessedEntity, which is simply a push_back into the vector.

PreprocessorLexer.h

I do not yet fully understand what this class is for; it seems to deal only with header inclusion and conditional-compilation state. Its data members:

class PreprocessorLexer
{
    virtual void anchor();
    protected:
    Preprocessor *PP;              // Preprocessor object controlling lexing.

    /// The SourceManager FileID corresponding to the file being lexed.
    const FileID FID;

    /// \brief Number of SLocEntries before lexing the file.
    unsigned InitialNumSLocEntries;

    //===--------------------------------------------------------------------===//
    // Context-specific lexing flags set by the preprocessor.
    //===--------------------------------------------------------------------===//

    /// \brief True when parsing \#XXX; turns '\\n' into a tok::eod token.
    bool ParsingPreprocessorDirective;

    /// \brief True after \#include; turns \<xx> into a tok::angle_string_literal
    /// token.
    bool ParsingFilename;

    /// \brief True if in raw mode.
    ///
    /// Raw mode disables interpretation of tokens and is a far faster mode to
    /// lex in than non-raw-mode.  This flag:
    ///  1. If EOF of the current lexer is found, the include stack isn't popped.
    ///  2. Identifier information is not looked up for identifier tokens.  As an
    ///     effect of this, implicit macro expansion is naturally disabled.
    ///  3. "#" tokens at the start of a line are treated as normal tokens, not
    ///     implicitly transformed by the lexer.
    ///  4. All diagnostic messages are disabled.
    ///  5. No callbacks are made into the preprocessor.
    ///
    /// Note that in raw mode that the PP pointer may be null.
    bool LexingRawMode;

    /// \brief A state machine that detects the \#ifndef-wrapping a file
    /// idiom for the multiple-include optimization.
    MultipleIncludeOpt MIOpt;

    /// \brief Information about the set of \#if/\#ifdef/\#ifndef blocks
    /// we are currently in.
    SmallVector<PPConditionalInfo, 4> ConditionalStack;
}

The first two bools form a state machine that, together with MIOpt, is dedicated to parsing header inclusion, while ConditionalStack handles conditional compilation. The latter relies on PPConditionalInfo, a POD defined in Token.h:

/// \brief Information about the conditional stack (\#if directives)
/// currently active.
struct PPConditionalInfo
{
    /// \brief Location where the conditional started.
    SourceLocation IfLoc;

    /// \brief True if this was contained in a skipping directive, e.g.,
    /// in a "\#if 0" block.
    bool WasSkipping;

    /// \brief True if we have emitted tokens already, and now we're in
    /// an \#else block or something.  Only useful in Skipping blocks.
    bool FoundNonSkip;

    /// \brief True if we've seen a \#else in this block.  If so,
    /// \#elif/\#else directives are not allowed.
    bool FoundElse;
};

This structure records #if/#else-style conditional-compilation state. When processing a conditional we push like this:

/// pushConditionalLevel - When we enter a \#if directive, this keeps track of
/// what we are currently in for diagnostic emission (e.g. \#if with missing
/// \#endif).
void pushConditionalLevel(SourceLocation DirectiveStart, bool WasSkipping,
bool FoundNonSkip, bool FoundElse)
{
    PPConditionalInfo CI;
    CI.IfLoc = DirectiveStart;
    CI.WasSkipping = WasSkipping;
    CI.FoundNonSkip = FoundNonSkip;
    CI.FoundElse = FoundElse;
    ConditionalStack.push_back(CI);
}

The matching pop operation:

/// popConditionalLevel - Remove an entry off the top of the conditional
/// stack, returning information about it.  If the conditional stack is empty,
/// this returns true and does not fill in the arguments.
bool popConditionalLevel(PPConditionalInfo &CI)
{
    if (ConditionalStack.empty())
    return true;
    CI = ConditionalStack.pop_back_val();
    return false;
}

Besides these stack operations, PreprocessorLexer also declares a few virtual interface functions:

virtual void IndirectLex(Token& Result) = 0;

/// \brief Return the source location for the next observable location.
virtual SourceLocation getSourceLocation() = 0;

These must be implemented by derived classes; IndirectLex in particular acts in each implementation as a trampoline forwarding to the concrete lexer's Lex function, as we will see later.

Directive Handling

With all this machinery in place, the handling code for all directives is defined in PPDirectives.cpp.

When a #define or #undef is encountered, CheckMacroName is called to check that the macro name being operated on is legal; for example, #undef may not target certain predefined macros. CheckMacroName works together with ReadMacroName: if the name is deemed illegal, all remaining tokens are consumed until the end of the directive line.

When an #if directive is encountered, SkipExcludedConditionalBlock skips all disabled code blocks until the matching #endif. Note that conditional blocks may be nested inside, so recursion must be handled. A separate function, PTHSkipExcludedConditionalBlock, does the same for conditionals in PTH files.

LookupFile performs header lookup: using the header-search machinery described earlier, it resolves a header name to a physical file in the order mandated by the standard.

All of the above are eventually reached from HandleDirective, which is invoked for every line starting with '#'. After some checks, it dispatches on the value of the next token to the corresponding Handle* function:

  1. HandleLineDirective handles #line directives: after reading the line number it calls SourceMgr.AddLineNote to apply the change.

  2. HandleUserDiagnosticDirective handles #warning and #error directives, i.e. user-defined compile-time diagnostics.

  3. HandleIncludeDirective handles header inclusion: it first obtains the header name and its FileID, then asks the header-file manager whether the file still needs to be processed at all (#pragma once or header guard). Once every check passes, it finally calls EnterSourceFile to process the newly included header.

  4. ReadMacroDefinitionArgList reads a macro's parameter list and performs some checks on it.

  5. HandleDefineDirective handles #define: it performs some basic checks first, and for a function-like macro additionally calls ReadMacroDefinitionArgList to read the parameter list. It then processes the macro body:

     - For an object-like macro, all remaining tokens on the line are collected and recorded via AddTokenToBody.

     - For a function-like macro the process is similar, except that the operand to the right of every # and ## must be a macro parameter, and ## may not appear at the beginning or end of the body.

Once the macro body has been processed, the preprocessor checks whether this macro was already defined. If so, the old and new definitions must be compared for consistency; a mismatch produces a warning or an error. Finally the operation is recorded via appendDefMacroDirective.

  6. HandleUndefDirective performs some basic checks, then records the operation via appendMacroDirective.

  7. HandleIfdefDirective mainly has to account for the header-guard idiom. If the condition holds, a new conditional level is opened and recorded via pushConditionalLevel; otherwise the current block is skipped straight to the corresponding #else region.

  8. HandleIfDirective handles #if: it calls EvaluateDirectiveExpression to evaluate the controlling expression, after which the handling is the same as HandleIfdefDirective above.

  9. HandleEndifDirective handles #endif; it essentially just calls popConditionalLevel to pop the conditional stack.

  10. HandleElseDirective is much like the #endif case: it also pops the stack directly via popConditionalLevel.

  11. HandleElifDirective: same as above.

All of these handlers also invoke the registered preprocessor callbacks, so the concrete behavior further depends on each implementation derived from the callback base class.

As analyzed above, header processing eventually reaches Preprocessor::EnterSourceFile to take care of the new file. This function is defined in PPLexerChange.cpp; after some checks it ends up executing this line:

EnterSourceFileWithLexer(new Lexer(FID, InputFile, *this), CurDir);

That is, a fresh Lexer is created to process the new file, sharing the same preprocessor *this. The main job of EnterSourceFileWithLexer is then to install this lexer as the current one, save the previous lexer so it can be restored later, and notify any registered callbacks.

For macro expansion, Preprocessor::EnterMacro is used. Much like include handling, it switches the current lexer, in this case to a TokenLexer:

void Preprocessor::EnterMacro(Token &Tok, SourceLocation ILEnd,
                              MacroInfo *Macro, MacroArgs *Args)
{
    std::unique_ptr<TokenLexer> TokLexer;
    if (NumCachedTokenLexers == 0)
    {
        TokLexer = llvm::make_unique<TokenLexer>(Tok, ILEnd, Macro, Args, *this);
    }
    else
    {
        TokLexer = std::move(TokenLexerCache[--NumCachedTokenLexers]);
        TokLexer->Init(Tok, ILEnd, Macro, Args);
    }

    PushIncludeMacroStack();
    CurDirLookup = nullptr;
    CurTokenLexer = std::move(TokLexer);
    if (CurLexerKind != CLK_LexAfterModuleImport)
        CurLexerKind = CLK_TokenLexer;
}

Similar to EnterMacro there is another function, EnterTokenStream, which treats a token stream as a temporary "header file" placed on top of the stack and then lexes from it.

EOF marks the end of a file. When the lexer reaches EOF it either emits an EOF token or pops the top entry of the include stack. The concrete handling lives in HandleEndOfFile in PPLexerChange.cpp:

  1. First, check whether the current file is wrapped in a header guard; if a presumed header guard turns out not to match, a warning is emitted.
  2. If the include stack is not empty, pop it via RemoveTopOfLexerStack.
  3. If the include stack is empty, this is the end of a translation unit, so return an EOF token.

Macro expansion ends the same way: a macro can be treated as a special kind of included "file" scope, and its end is likewise funneled through HandleEndOfFile.

Published:
2016-11-27 22:29
Tag:
CPP15