Clang Basic

Basic Types

FileManager

FileID

A FileID contains a single data member, int ID, which serves as an index for the file it represents. Through this index the SourceManager can reach the file's MemoryBuffer and the location at which the file was included.

There is also the notion of where an entry was loaded from: if the file belongs to the current module the ID is positive, otherwise it is negative. Local IDs are handed out incrementally as indices into the local table; entries loaded from other modules live at index -ID-2 of the loaded table. An ID of 0 denotes an invalid FileID, and -1 is reserved as a sentinel value.
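
As a concrete illustration, here is a minimal sketch (not the actual Clang code) of how an ID value would map onto the local and loaded SLocEntry tables described further below:

#include <cassert>
#include <cstddef>

// Hypothetical helper, only to illustrate the ID -> table-index convention.
inline std::size_t slocTableIndex(int ID, bool &IsLoaded) {
    assert(ID != 0 && ID != -1 && "invalid or sentinel FileID");
    if (ID > 0) {                            // local entry: the ID is the index
        IsLoaded = false;
        return static_cast<std::size_t>(ID);
    }
    IsLoaded = true;                         // loaded entry: index is -ID - 2
    return static_cast<std::size_t>(-ID - 2);
}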

FileInfo

FileInfo records the include-related information for one file: where the file itself was included from, and what it introduced in turn.

class FileInfo
{
    /// \brief The location of the #include that brought in this file.
    ///
    /// This is an invalid SLOC for the main file (top of the #include chain).
    unsigned IncludeLoc;  // Really a SourceLocation

    /// \brief Number of FileIDs (files and macros) that were created during
    /// preprocessing of this #include, including this SLocEntry.
    ///
    /// Zero means the preprocessor didn't provide such info for this SLocEntry.
    unsigned NumCreatedFIDs;

    /// \brief Contains the ContentCache* and the bits indicating the
    /// characteristic of the file and whether it has #line info, all
    /// bitmangled together.
    uintptr_t Data;
}
  • IncludeLoc: the location where this file was included, i.e. the position of the #include line that pulled it in. It is represented as a SourceLocation, which is really just an unsigned. For the main .cpp file, the head of the include chain, its value is 0.

  • NumCreatedFIDs: the number of new FileIDs (included files and macro expansions) this file introduced during preprocessing.

  • Data: a rather intricate pointer field; as the comment says, it mangles the ContentCache* together with a couple of flag bits.

A FileInfo is constructed through a static function:

static FileInfo get(SourceLocation IL, const ContentCache *Con,
    CharacteristicKind FileCharacter)
{
    FileInfo X;
    X.IncludeLoc = IL.getRawEncoding();
    X.NumCreatedFIDs = 0;
    X.Data = (uintptr_t)Con;
    assert((X.Data & 7) == 0 && "ContentCache pointer insufficiently aligned");
    assert((unsigned)FileCharacter < 4 && "invalid file character");
    X.Data |= (unsigned)FileCharacter;
    return X;
}

CharacteristicKind is a simple enum indicating whether the file is a user file, a system file, or a system file pulled in through extern "C":

enum CharacteristicKind
{
    C_User, C_System, C_ExternCSystem
};

ContentCache

clang keeps a ContentCache structure for every file it reads in, to manage the cached contents:

class LLVM_ALIGNAS(8) ContentCache
{
    enum CCFlags
    {
        /// \brief Whether the buffer is invalid.
        InvalidFlag = 0x01,
        /// \brief Whether the buffer should not be freed on destruction.
        DoNotFreeFlag = 0x02
    };

    /// \brief The actual buffer containing the characters from the input
    /// file.
    ///
    /// This is owned by the ContentCache object.  The bits indicate
    /// whether the buffer is invalid.
    mutable llvm::PointerIntPair<llvm::MemoryBuffer *, 2> Buffer;

public:
    unsigned NumLines : 31;
    const FileEntry *OrigEntry;
    const FileEntry *ContentsEntry;
    unsigned *SourceLineCache;
    unsigned BufferOverridden : 1;
    unsigned IsSystemFile : 1;
}

The CCFlags here are flag bits stored in the low two bits of the llvm::PointerIntPair<llvm::MemoryBuffer *, 2>. PointerIntPair<T*, unsigned num> is a compressed pointer: it assumes the pointer refers to a suitably aligned object of type T and reuses the pointer's low num bits as flag bits. Object allocations are typically 8-byte aligned, so at most three bits are available, i.e. \(num\le 3\). The integer bits are packed starting from the most significant of those free low bits, which is what makes nested pairs such as PointerIntPair<PointerIntPair<T,2>,1> work; filling from the lowest bit first would break that.
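
A short usage sketch of llvm::PointerIntPair (the 8-byte-aligned Node type and the tag value 3 are arbitrary choices for illustration):

#include "llvm/ADT/PointerIntPair.h"

struct alignas(8) Node { int Value; };

void pointerIntPairDemo(Node *N) {
    llvm::PointerIntPair<Node *, 2> P(N, /*IntVal=*/3); // 2 tag bits packed into the pointer
    Node *Ptr = P.getPointer();   // the original pointer, tag bits masked off
    unsigned Tag = P.getInt();    // the 2-bit tag, here 3
    (void)Ptr; (void)Tag;
}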

A somewhat confusing pair here is OrigEntry versus ContentsEntry. OrigEntry is the handle of the file itself, while ContentsEntry is the handle of the file that actually supplies the contents; normally the two are the same. If the file is a temporary one built in memory, OrigEntry may be null. If the file's contents are overridden by another file, ContentsEntry is the handle of the overriding file.

The actual line index of a ContentCache lives in the SourceLineCache member, which points at a lazily built array of line-start offsets allocated from a bump-pointer allocator (a simple arena allocator), with NumLines recording how many lines it covers. BufferOverridden indicates that the buffer for this file has been overridden, e.g. through overrideFileContents.

With all these layers of wrapping in place, the handling inside SourceManager becomes quite simple:

llvm::MemoryBuffer *SourceManager::getMemoryBufferForFile(const FileEntry *File,
    bool *Invalid)
{
    const SrcMgr::ContentCache *IR = getOrCreateContentCache(File);
    assert(IR && "getOrCreateContentCache() cannot return NULL");
    return IR->getBuffer(Diag, *this, SourceLocation(), Invalid);
}

The function that obtains the MemoryBuffer from a ContentCache is fairly involved, so let's go through it in pieces. Its signature:

llvm::MemoryBuffer *ContentCache::getBuffer(DiagnosticsEngine &Diag,
    const SourceManager &SM,
    SourceLocation Loc,
    bool *Invalid) const

First of all, we cannot simply return the MemoryBuffer pointer that the ContentCache stores. A file's MemoryBuffer is constructed lazily: the resources are only allocated when they are actually needed, otherwise memory usage would explode.

So we first check whether the internal MemoryBuffer pointer held by the ContentCache is valid. If it is, return it directly; otherwise build a new MemoryBuffer and return that.

// Lazily create the Buffer for ContentCaches that wrap files.  If we already
// computed it, just return what we have.
if (Buffer.getPointer() || !ContentsEntry)
{
    if (Invalid)
        *Invalid = isBufferInvalid();

    return Buffer.getPointer();
}

bool isVolatile = SM.userFilesAreVolatile() && !IsSystemFile;
auto BufferOrError =
    SM.getFileManager().getBufferForFile(ContentsEntry, isVolatile);

If this operation returns an error instead of a MemoryBuffer (almost certainly the source file has gone missing or been deleted), some error handling is needed:

if (!BufferOrError)
{
    StringRef FillStr("<<<MISSING SOURCE FILE>>>\n");
    Buffer.setPointer(MemoryBuffer::getNewUninitMemBuffer(
        ContentsEntry->getSize(), "<invalid>").release());
    char *Ptr = const_cast<char*>(Buffer.getPointer()->getBufferStart());
    for (unsigned i = 0, e = ContentsEntry->getSize(); i != e; ++i)
        Ptr[i] = FillStr[i % FillStr.size()];

    if (Diag.isDiagnosticInFlight())
        Diag.SetDelayedDiagnostic(diag::err_cannot_open_file,
            ContentsEntry->getName(),
            BufferOrError.getError().message());
    else
        Diag.Report(Loc, diag::err_cannot_open_file)
        << ContentsEntry->getName() << BufferOrError.getError().message();

    Buffer.setInt(Buffer.getInt() | InvalidFlag);

    if (Invalid) *Invalid = true;
    return Buffer.getPointer();
}

The main work is to allocate a MemoryBuffer named <invalid> whose contents are the error message repeated over and over, report the error to the diagnostics engine, and finally return this dummy MemoryBuffer.

If no error was returned, check whether the freshly created MemoryBuffer has the size we expect. If not, the source file has been modified, so report an error to the diagnostics engine and return the buffer anyway. If the size matches, go on to check the file's UTF Byte Order Mark.

// If the buffer is valid, check to see if it has a UTF Byte Order Mark
// (BOM).  We only support UTF-8 with and without a BOM right now.  See
// http://en.wikipedia.org/wiki/Byte_order_mark for more information.
StringRef BufStr = Buffer.getPointer()->getBuffer();
const char *InvalidBOM = llvm::StringSwitch<const char *>(BufStr)
    .StartsWith("\xFE\xFF", "UTF-16 (BE)")
    .StartsWith("\xFF\xFE", "UTF-16 (LE)")
    .StartsWith("\x00\x00\xFE\xFF", "UTF-32 (BE)")
    .StartsWith("\xFF\xFE\x00\x00", "UTF-32 (LE)")
    .StartsWith("\x2B\x2F\x76", "UTF-7")
    .StartsWith("\xF7\x64\x4C", "UTF-1")
    .StartsWith("\xDD\x73\x66\x73", "UTF-EBCDIC")
    .StartsWith("\x0E\xFE\xFF", "SDSU")
    .StartsWith("\xFB\xEE\x28", "BOCU-1")
    .StartsWith("\x84\x31\x95\x33", "GB-18030")
    .Default(nullptr);

If the file starts with one of these BOMs, it is an encoding we do not support, so an error is reported as well.

So the overall flow is: first get the ContentCache for the file, then get the MemoryBuffer from that ContentCache.

If a file's contents are overridden with a MemoryBuffer, the MemoryBuffer inside the corresponding ContentCache is simply swapped out and the file is registered in the override table. Lazy construction shows up here again, which is a bit annoying.

void SourceManager::overrideFileContents(const FileEntry *SourceFile,
    llvm::MemoryBuffer *Buffer,
    bool DoNotFree)
{
    const SrcMgr::ContentCache *IR = getOrCreateContentCache(SourceFile);
    assert(IR && "getOrCreateContentCache() cannot return NULL");

    const_cast<SrcMgr::ContentCache *>(IR)->replaceBuffer(Buffer, DoNotFree);
    const_cast<SrcMgr::ContentCache *>(IR)->BufferOverridden = true;

    getOverriddenFilesInfo().OverriddenFilesWithBuffer.insert(SourceFile);
}

If a file's contents are overridden by another file, the handling is even simpler: just add a mapping to the file-override table.

void SourceManager::overrideFileContents(const FileEntry *SourceFile,
    const FileEntry *NewFile)
{
    assert(SourceFile->getSize() == NewFile->getSize() &&
        "Different sizes, use the FileManager to create a virtual file with "
        "the correct size");
    assert(FileInfos.count(SourceFile) == 0 &&
        "This function should be called at the initialization stage, before "
        "any parsing occurs.");
    getOverriddenFilesInfo().OverriddenFiles[SourceFile] = NewFile;
}

Amusingly, an undo operation is provided as well:

void SourceManager::disableFileContentsOverride(const FileEntry *File)
{
    if (!isFileOverridden(File))
        return;

    const SrcMgr::ContentCache *IR = getOrCreateContentCache(File);
    const_cast<SrcMgr::ContentCache *>(IR)->replaceBuffer(nullptr);
    const_cast<SrcMgr::ContentCache *>(IR)->ContentsEntry = IR->OrigEntry;

    assert(OverriddenFilesInfo);
    OverriddenFilesInfo->OverriddenFiles.erase(File);
    OverriddenFilesInfo->OverriddenFilesWithBuffer.erase(File);
}

SourceLocation

SourceLocation: "Encodes a location in the source. The SourceManager can decode this to get at the full include stack, line and column information." The three "important" locations a compiler needs (line number, column number, and the file in which something is declared or used) all involve the SourceManager; line and column are expressed with SourceLocation. clang distinguishes three kinds of location:

  • Spelling Location: where the token was literally written, e.g. inside the macro definition

  • Expansion Location: where the macro is expanded, i.e. the use site

  • Presumed Location: the expansion location after adjustment by #line directives

For one and the same physical position, the expansion location and the presumed location differ only in the line number, and the difference comes from #line directives. For example, in the following code

#define min(x,y) (x)>(y)?(y):(x)

int main(int argc, char **argv) {

#line 4
int a=min(2,4);
}

Take a token produced on line 6 by the expansion of min, say the > coming from the macro body: its spelling location is on line 1 (inside the macro definition), its expansion location is on line 6, and its presumed location is line 4, thanks to the #line directive. The argument tokens 2 and 4, in contrast, are spelled on line 6 itself.
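
For reference, a small sketch of how these three views are queried through the SourceManager API (SM and Loc are assumed to come from an actual compiler invocation):

#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"
#include "llvm/Support/raw_ostream.h"

void dumpLocationViews(const clang::SourceManager &SM, clang::SourceLocation Loc) {
    clang::SourceLocation Spelling = SM.getSpellingLoc(Loc);    // where the token text lives
    clang::SourceLocation Expansion = SM.getExpansionLoc(Loc);  // where the macro was used
    clang::PresumedLoc Presumed = SM.getPresumedLoc(Loc);       // after #line adjustment
    llvm::errs() << "spelling line " << SM.getSpellingLineNumber(Spelling)
                 << ", expansion line " << SM.getExpansionLineNumber(Expansion)
                 << ", presumed line " << Presumed.getLine() << "\n";
}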

SourceLocation

SourceLocation is an offset; the whole type is four bytes (4 == sizeof(SourceLocation)), and a SourceRange is the interval between two SourceLocations. The definition of SourceLocation is given below; its only data member is an unsigned int:

class SourceLocation
{
    unsigned ID;
    enum : unsigned
    {
        MacroIDBit = 1U << 31
    };
}

The highest bit of the ID marks whether the location refers to a macro expansion or to ordinary file text, and a valid SourceLocation has a non-zero ID. The rest of the ID is simply an offset into the source-location address space managed by the SourceManager, with MacroIDBit set for macro locations. Based on this, the class has a few built-in helpers, all quite simple, to test validity and macro-ness:

SourceLocation() : ID(0) {}
// the default constructor yields an invalid location
bool isFileID() const { return (ID & MacroIDBit) == 0; }
bool isMacroID() const { return (ID & MacroIDBit) != 0; }

/// \brief Return true if this is a valid SourceLocation object.
///
/// Invalid SourceLocations are often used when events have no corresponding
/// location in the source (e.g. a diagnostic is required for a command line
/// option).
bool isValid() const { return ID != 0; }
bool isInvalid() const { return ID == 0; }

An invalid location is typically used as the end of a SourceRange, to say that no end is needed; this usually happens when dealing with a single token.

To construct a valid SourceLocation, a couple of static factory functions are used:

/// \brief Return the offset into the manager's global input view.
unsigned getOffset() const
{
    return ID & ~MacroIDBit;
}

static SourceLocation getFileLoc(unsigned ID)
{
    assert((ID & MacroIDBit) == 0 && "Ran out of source locations!");
    SourceLocation L;
    L.ID = ID;
    return L;
}

static SourceLocation getMacroLoc(unsigned ID)
{
    assert((ID & MacroIDBit) == 0 && "Ran out of source locations!");
    SourceLocation L;
    L.ID = MacroIDBit | ID;
    return L;
}

These are all private, but since several friend classes are declared they can still be called.

SourceRange is trivial: just two SourceLocations put together.

class SourceRange
{
    SourceLocation B;
    SourceLocation E;
}

The associated code is simple and direct, basically get/set and validity methods, so I won't go through it.

CharSourceRange

There is also a CharSourceRange, which adds a bool IsTokenRange flag on top of SourceRange.

class CharSourceRange
{
    SourceRange Range;
    bool IsTokenRange;
}

When IsTokenRange is true, the end of the Range points at the first byte of the last token in the designated region; otherwise it points at the last byte of the region.
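
A quick sketch of the two ways such a range is normally built (B and E are assumed to be valid locations):

#include "clang/Basic/SourceLocation.h"

// Token range: the end location is the start of the last token in the range.
clang::CharSourceRange asTokenRange(clang::SourceLocation B, clang::SourceLocation E) {
    return clang::CharSourceRange::getTokenRange(B, E);
}

// Character range: the end location refers to the last character itself.
clang::CharSourceRange asCharRange(clang::SourceLocation B, clang::SourceLocation E) {
    return clang::CharSourceRange::getCharRange(B, E);
}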

FullSourceLoc

A SourceLocation holds nothing but an ID; its concrete meaning has to be interpreted by the corresponding SourceManager. The combination of a SourceLocation and a SourceManager is a FullSourceLoc:

class FullSourceLoc : public SourceLocation
{
    const SourceManager *SrcMgr;
}

This type mainly exists to pass the SourceLocation and the SourceManager around together, as one packaged argument.

In fact a lot of operations are packed into FullSourceLoc, chiefly line number, column number, file name, macro definition location, macro expansion location and so on:

FileID getFileID() const;

FullSourceLoc getExpansionLoc() const;
FullSourceLoc getSpellingLoc() const;

unsigned getExpansionLineNumber(bool *Invalid = nullptr) const;
unsigned getExpansionColumnNumber(bool *Invalid = nullptr) const;

unsigned getSpellingLineNumber(bool *Invalid = nullptr) const;
unsigned getSpellingColumnNumber(bool *Invalid = nullptr) const;

const char *getCharacterData(bool *Invalid = nullptr) const;

All of these are delegated to the corresponding SourceManager operations; the functions here are just adapters.
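
A usage sketch of the adapter (Loc and SM are assumed to come from a real compilation):

#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"

unsigned spellingLineOf(clang::SourceLocation Loc, const clang::SourceManager &SM) {
    clang::FullSourceLoc Full(Loc, SM);   // bundle the location with its manager
    return Full.getSpellingLineNumber();  // internally delegates back to SM
}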

PresumedLoc

PresumedLoc is also a bundling type. Working out the line number adjusted by #line directives requires the SourceManager; but when that result is passed on, we don't want the receiving side to have to poke at the SourceManager, so the relevant pieces are extracted into a PresumedLoc:

class PresumedLoc
{
    const char *Filename;
    unsigned Line, Col;
    SourceLocation IncludeLoc;
}

A PresumedLoc is in fact constructed from a SourceLocation through the SourceManager, and the construction is rather involved:

PresumedLoc SourceManager::getPresumedLoc(SourceLocation Loc,
    bool UseLineDirectives) const

We won't go through the logic of this function here, because it depends on the definition of SourceManager, whose operations need to be introduced first.

LineTable

The LineTable exists to handle #line directives, so it has to record information about such directives across all files. Each one is stored as a LineEntry structure recording the offset in the file at which the directive occurs (FileOffset), the line number declared by the directive (LineNo), and the file name (FilenameID):

struct LineEntry
{
    /// \brief The offset in this file that the line entry occurs at.
    unsigned FileOffset;

    /// \brief The presumed line number of this line entry: #line 4.
    unsigned LineNo;

    /// \brief The ID of the filename identified by this line entry:
    /// #line 4 "foo.c".  This is -1 if not specified.
    int FilenameID;

    /// \brief Set the 0 if no flags, 1 if a system header,
    SrcMgr::CharacteristicKind FileKind;

    /// \brief The offset of the virtual include stack location,
    /// which is manipulated by GNU linemarker directives.
    ///
    /// If this is 0 then there is no virtual #includer.
    // all courtesy of GNU linemarker extensions
    unsigned IncludeOffset;
}

As usual, construction of this type goes through a static function that simply assigns the fields, so no further comment.

All LineEntry records are kept in a LineTableInfo, which tracks the LineEntries contained in each file.

/// \brief Used to hold and unique data used to represent #line information.
class LineTableInfo
{
    /// \brief Map used to assign unique IDs to filenames in #line directives. 
    ///
    /// This allows us to unique the filenames that
    /// frequently reoccur and reference them with indices.  FilenameIDs holds
    /// the mapping from string -> ID, and FilenamesByID holds the mapping of ID
    /// to string.
    llvm::StringMap<unsigned, llvm::BumpPtrAllocator> FilenameIDs;
    std::vector<llvm::StringMapEntry<unsigned>*> FilenamesByID;

    /// \brief Map from FileIDs to a list of line entries (sorted by the offset
    /// at which they occur in the file).
    std::map<FileID, std::vector<LineEntry> > LineEntries;
}

FilenameIDs and FilenamesByID form a classic mutual-index structure: a new filename first gets its ID from FilenamesByID.size(), the pair is then inserted into FilenameIDs, and the resulting llvm::StringMapEntry is pushed onto FilenamesByID. This is what manages the registration and handing out of filename IDs.
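
Here is a minimal standalone sketch of that mutual-index pattern (not the LineTableInfo code itself; std::unordered_map stands in for llvm::StringMap):

#include <string>
#include <unordered_map>
#include <vector>

struct FilenameTable {
    std::unordered_map<std::string, unsigned> IDs;  // name -> ID
    std::vector<const std::string *> ByID;          // ID -> name

    unsigned getOrCreateID(const std::string &Name) {
        auto It = IDs.find(Name);
        if (It != IDs.end())
            return It->second;                      // already registered
        unsigned ID = static_cast<unsigned>(ByID.size());
        auto Ins = IDs.emplace(Name, ID).first;     // register the new name
        ByID.push_back(&Ins->first);                // back-reference for ID -> name
        return ID;
    }
};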

When it is used, we have a line number in the expansion view and want the corresponding presumed line number. That requires finding the LineEntry nearest before the given position in the current file, which is essentially a binary search:

const LineEntry *LineTableInfo::FindNearestLineEntry(FileID FID,
    unsigned Offset)
{
    const std::vector<LineEntry> &Entries = LineEntries[FID];
    assert(!Entries.empty() && "No #line entries for this FID after all!");

    // It is very common for the query to be after the last #line, check this
    // first.
    if (Entries.back().FileOffset <= Offset)
        return &Entries.back();

    // Do a binary search to find the maximal element that is still before Offset.
    std::vector<LineEntry>::const_iterator I =
        std::upper_bound(Entries.begin(), Entries.end(), Offset);
    if (I == Entries.begin()) return nullptr;
    return &*--I;
}

ExpansionInfo

ExpansionInfo describes a macro expansion: the location of the expansion and the location where the macro was spelled.

class ExpansionInfo
{
    // Really these are all SourceLocations.
    /// \brief Where the spelling for the token can be found.
    unsigned SpellingLoc;

    /// In a macro expansion, ExpansionLocStart and ExpansionLocEnd
    /// indicate the start and end of the expansion. In object-like macros,
    /// they will be the same. In a function-like macro expansion, the start
    /// will be the identifier and the end will be the ')'. Finally, in
    /// macro-argument instantiations, the end will be 'SourceLocation()', an
    /// invalid location.
    unsigned ExpansionLocStart, ExpansionLocEnd;
}

The comment above already says it all:

  • For a plain object-like (textual) macro, start and end are equal.

  • For a function-like macro expansion, end points at the closing ).

  • For a macro-argument expansion, end is an invalid location, i.e. its value is 0.

The default constructor of this type also yields an invalid object; construction is delegated to the static function below:

static ExpansionInfo create(SourceLocation SpellingLoc,
    SourceLocation Start, SourceLocation End)
{
    ExpansionInfo X;
    X.SpellingLoc = SpellingLoc.getRawEncoding();
    X.ExpansionLocStart = Start.getRawEncoding();
    X.ExpansionLocEnd = End.getRawEncoding();
    return X;
}

For argument expansion the function above is simply reused, which is rather clever:

static ExpansionInfo createForMacroArg(SourceLocation SpellingLoc,
    SourceLocation ExpansionLoc)
{
    // We store an intentionally invalid source location for the end of the
    // expansion range to mark that this is a macro argument location rather than
    // a normal one.
    return create(SpellingLoc, ExpansionLoc, SourceLocation());
}

SLocEntry

This type is a tagged union over the two kinds, FileInfo and ExpansionInfo:

class SLocEntry
{
    unsigned Offset;   // low bit is set for expansion info.
    union
    {
        FileInfo File;
        ExpansionInfo Expansion;
    };
}

The lowest bit of Offset is used as the tag that tells the two kinds apart:

unsigned getOffset() const { return Offset >> 1; }
bool isExpansion() const { return Offset & 1; }
bool isFile() const { return !isExpansion(); }

The constructors of this class are static as well, mainly because these objects are frequently built on the fly and a static helper is convenient for that.

static SLocEntry get(unsigned Offset, const FileInfo &FI)
{
    SLocEntry E;
    E.Offset = Offset << 1;
    E.File = FI;
    return E;
}

static SLocEntry get(unsigned Offset, const ExpansionInfo &Expansion)
{
    SLocEntry E;
    E.Offset = (Offset << 1) | 1;
    E.Expansion = Expansion;
    return E;
}

SourceManager

This is a very large class that manages every resource related to source files. It is also an intrusively reference-counted resource (managed through llvm::IntrusiveRefCntPtr), because copies of SourceManager pointers are kept in many other places.

class SourceManager : public RefCountedBase<SourceManager>

Resource Management

Resource handles

SourceManager中,首先保存了DiagnosticEngineFileManager这两个引用,至此源文件管理相关的大类已经集合完毕。

DiagnosticsEngine &Diag;

FileManager &FileMgr;

The first thing on record is the mapping between files (FileEntry) and their caches (ContentCache), implemented as a map:

llvm::DenseMap<const FileEntry*, SrcMgr::ContentCache*> FileInfos;

SourceManager also records the cases where a file has been overridden:

struct OverriddenFilesInfoTy
{
    /// \brief Files that have been overridden with the contents from another
    /// file.
    llvm::DenseMap<const FileEntry *, const FileEntry *> OverriddenFiles;
    /// \brief Files that were overridden with a memory buffer.
    llvm::DenseSet<const FileEntry *> OverriddenFilesWithBuffer;
};

/// \brief Lazily create the object keeping overridden files info, since
/// it is uncommonly used.
std::unique_ptr<OverriddenFilesInfoTy> OverriddenFilesInfo;

OverriddenFilesInfoTy &getOverriddenFilesInfo()
{
    if (!OverriddenFilesInfo)
        OverriddenFilesInfo.reset(new OverriddenFilesInfoTy);
    return *OverriddenFilesInfo;
}

The override structure is itself lazily instantiated; the comment claims it is rarely used. As if the memory of those two maps and that set really mattered: premature optimization is the root of all evil.

All ContentCaches are managed here as well:

std::vector<SrcMgr::ContentCache*> MemBufferInfos;

A buffer is also set aside specifically for error recovery:

// Cache for the "fake" buffer used for error-recovery purposes.
mutable std::unique_ptr<llvm::MemoryBuffer> FakeBufferForRecovery;

mutable std::unique_ptr<SrcMgr::ContentCache> FakeContentCacheForRecovery;

SLocEntries are managed in two parts: those local to this module and those loaded from other modules, i.e. the local versus loaded split.

/// \brief The table of SLocEntries that are local to this module.
///
/// Positive FileIDs are indexes into this table. Entry 0 indicates an invalid
/// expansion.
SmallVector<SrcMgr::SLocEntry, 0> LocalSLocEntryTable;

/// \brief The table of SLocEntries that are loaded from other modules.
///
/// Negative FileIDs are indexes into this table. To get from ID to an index,
/// use (-ID - 2).
mutable SmallVector<SrcMgr::SLocEntry, 0> LoadedSLocEntryTable;

There is also a bitmap recording whether each external SLocEntry has actually been loaded yet, so loading is evidently lazy:

/// \brief A bitmap that indicates whether the entries of LoadedSLocEntryTable
/// have already been loaded from the external source.
///
/// Same indexing as LoadedSLocEntryTable.
std::vector<bool> SLocEntryLoaded;

SourceManager中也有一个LineTable的指针,用来延迟实例化这个结构。由于延迟实例化导致了每次使用的时候都需要判断是否已经实例化,真是蛋疼。

/// \brief Holds information for #line directives.
///
/// This is referenced by indices from SLocEntryTable.
LineTableInfo *LineTable;

Macro arguments are handled through yet another map:

/// \brief Lazily computed map of macro argument chunks to their expanded
/// source location.
typedef std::map<unsigned, SourceLocation> MacroArgsMap;

mutable llvm::DenseMap<FileID, MacroArgsMap *> MacroArgsCacheMap;

A few important FileIDs are recorded too, including the main file and the precompiled preamble. What is that precompiled preamble for, anyway?

/// \brief The file ID for the main source file of the translation unit.
FileID MainFileID;

/// \brief The file ID for the precompiled preamble there is one.
FileID PreambleFileID;

References and dependencies between modules are stored in a vector:

/// \brief The stack of modules being built, which is used to detect
/// cycles in the module dependency graph as modules are being built, as
/// well as to describe why we're rebuilding a particular module.
///
/// There is no way to set this value from the command line. If we ever need
/// to do so (e.g., if on-demand module construction moves out-of-process),
/// we can add a cc1-level option to do so.
SmallVector<std::pair<std::string, FullSourceLoc>, 2> StoredModuleBuildStack;

In the pair<string, FullSourceLoc>, the string is the module name and the FullSourceLoc is the location at which the module was imported.

void pushModuleBuildStack(StringRef moduleName, FullSourceLoc importLoc)
{
    StoredModuleBuildStack.push_back(std::make_pair(moduleName.str(), importLoc));
}

Building the MemoryBuffer for a File

File准备MemoryBuffer分为了好几步来进行。首先记录文件的大小,如果文件是处于编辑状态的话,设置大小为-1。如果文件为打开状态,则调用vfs::getBuffer准备好Buffer,并按照要求选择是否关闭文件,感觉像把文件完整的读进来啊。如果文件没有打开的话,则打开或者建立这个文件,然后获得MemoryBuffer。值得注意的是,这个操作有可能不成功,从而返回错误。具体见下面的代码:

llvm::ErrorOr<std::unique_ptr<llvm::MemoryBuffer>>
FileManager::getBufferForFile(const FileEntry *Entry, bool isVolatile,
    bool ShouldCloseOpenFile)
{
    uint64_t FileSize = Entry->getSize();
    // If there's a high enough chance that the file have changed since we
    // got its size, force a stat before opening it.
    if (isVolatile)
        FileSize = -1;

    const char *Filename = Entry->getName();
    // If the file is already open, use the open file descriptor.
    if (Entry->File)
    {
        auto Result =
            Entry->File->getBuffer(Filename, FileSize,
                /*RequiresNullTerminator=*/true, isVolatile);
        // FIXME: we need a set of APIs that can make guarantees about whether a
        // FileEntry is open or not.
        if (ShouldCloseOpenFile)
            Entry->closeFile();
        return Result;
    }

    // Otherwise, open the file.

    if (FileSystemOpts.WorkingDir.empty())
        return FS->getBufferForFile(Filename, FileSize,
            /*RequiresNullTerminator=*/true, isVolatile);

    SmallString<128> FilePath(Entry->getName());
    FixupRelativePath(FilePath);
    return FS->getBufferForFile(FilePath, FileSize,
        /*RequiresNullTerminator=*/true, isVolatile);
}

Building the ContentCache for a File

For each file, its ContentCache is the in-memory image of its contents. We cannot simply load every file into memory up front, that would blow the memory budget, so the ContentCache is loaded lazily. Any reference to a file's ContentCache therefore has to handle the not-yet-loaded case, and everything goes through getOrCreateContentCache:

/// getOrCreateContentCache - Create or return a cached ContentCache for the
/// specified file.
const ContentCache *
SourceManager::getOrCreateContentCache(const FileEntry *FileEnt,
    bool isSystemFile)
{
    assert(FileEnt && "Didn't specify a file entry to use?");

    // Do we already have information about this file?
    ContentCache *&Entry = FileInfos[FileEnt];
    if (Entry) return Entry;

    // Nope, create a new Cache entry.
    Entry = ContentCacheAlloc.Allocate<ContentCache>();

    if (OverriddenFilesInfo)
    {
        // If the file contents are overridden with contents from another file,
        // pass that file to ContentCache.
        llvm::DenseMap<const FileEntry *, const FileEntry *>::iterator
            overI = OverriddenFilesInfo->OverriddenFiles.find(FileEnt);
        if (overI == OverriddenFilesInfo->OverriddenFiles.end())
            new (Entry) ContentCache(FileEnt);
        else
            new (Entry) ContentCache(OverridenFilesKeepOriginalName ? FileEnt
                : overI->second,
                overI->second);
    }
    else
    {
        new (Entry) ContentCache(FileEnt);
    }

    Entry->IsSystemFile = isSystemFile;

    return Entry;
}

The logic of this function is fairly simple: first check whether a ContentCache already exists for the file, and create one if not. During creation we also have to check whether this file has been overridden by another file, and if so pass that override relationship along.

Creating a new FileID

FileIDs are created in two ways: one handles header-file inclusion, the other handles memory buffers.

For the inclusion form, the following function is called:

/// \brief Create a new FileID that represents the specified file
/// being #included from the specified IncludePosition.
///
/// This translates NULL into standard input.
FileID createFileID(const FileEntry *SourceFile, SourceLocation IncludePos,
    SrcMgr::CharacteristicKind FileCharacter,
    int LoadedID = 0, unsigned LoadedOffset = 0)
{
    const SrcMgr::ContentCache *
        IR = getOrCreateContentCache(SourceFile,
            /*isSystemFile=*/FileCharacter != SrcMgr::C_User);
    assert(IR && "getOrCreateContentCache() cannot return NULL");
    return createFileID(IR, IncludePos, FileCharacter, LoadedID, LoadedOffset);
}

It first obtains the ContentCache for the FileEntry and then calls the function that does the real work.

If instead we want to create a new FileID from a memory buffer, a different overload is needed:

/// \brief Create a new FileID that represents the specified memory buffer.
///
/// This does no caching of the buffer and takes ownership of the
/// MemoryBuffer, so only pass a MemoryBuffer to this once.
FileID createFileID(std::unique_ptr<llvm::MemoryBuffer> Buffer,
    SrcMgr::CharacteristicKind FileCharacter = SrcMgr::C_User,
    int LoadedID = 0, unsigned LoadedOffset = 0,
    SourceLocation IncludeLoc = SourceLocation())
{
    return createFileID(createMemBufferContentCache(std::move(Buffer)),
        IncludeLoc, FileCharacter, LoadedID, LoadedOffset);
}

The logic is much the same: build a ContentCache from the MemoryBuffer, which essentially just takes a move.

Both creation paths end up calling this createFileID:

/// createFileID - Create a new FileID for the specified ContentCache and
/// include position.  This works regardless of whether the ContentCache
/// corresponds to a file or some other input source.
FileID SourceManager::createFileID(const ContentCache *File,
    SourceLocation IncludePos,
    SrcMgr::CharacteristicKind FileCharacter,
    int LoadedID, unsigned LoadedOffset)

The LoadedID parameter here is effectively the FileID number. If it is negative, the entry was loaded from an external module and its index, -LoadedID-2, must lie within LoadedSLocEntryTable.size(). In that case the logic below runs; lazy loading is involved again, hence the vector<bool> recording which entries have been loaded:

assert(LoadedID != -1 && "Loading sentinel FileID");
unsigned Index = unsigned(-LoadedID) - 2;
assert(Index < LoadedSLocEntryTable.size() && "FileID out of range");
assert(!SLocEntryLoaded[Index] && "FileID already loaded");
LoadedSLocEntryTable[Index] = SLocEntry::get(LoadedOffset,
    FileInfo::get(IncludePos, File, FileCharacter));
SLocEntryLoaded[Index] = true;
return FileID::get(LoadedID);

Otherwise the ID is greater than or equal to zero and the entry belongs to this module, so it is simply appended to LocalSLocEntryTable and the index of the new entry is returned:

LocalSLocEntryTable.push_back(SLocEntry::get(NextLocalOffset,
    FileInfo::get(IncludePos, File,
        FileCharacter)));
unsigned FileSize = File->getSize();
assert(NextLocalOffset + FileSize + 1 > NextLocalOffset &&
    NextLocalOffset + FileSize + 1 <= CurrentLoadedOffset &&
    "Ran out of source locations!");
// We do a +1 here because we want a SourceLocation that means "the end of the
// file", e.g. for the "no newline at the end of the file" diagnostic.
NextLocalOffset += FileSize + 1;

// Set LastFileIDLookup to the newly created file.  The next getFileID call is
// almost guaranteed to be from that file.
FileID FID = FileID::get(LocalSLocEntryTable.size() - 1);
return LastFileIDLookup = FID;

Building the line-number index for a ContentCache

The SourceLineCache inside ContentCache is filled lazily: the line-number cache, which records the starting offset of every line in the file, is only built when it is first used. The function that builds it has this signature:

static LLVM_ATTRIBUTE_NOINLINE void
ComputeLineNumbers(DiagnosticsEngine &Diag, ContentCache *FI,
    llvm::BumpPtrAllocator &Alloc,
    const SourceManager &SM, bool &Invalid);

This is a free, file-local helper rather than a member function. Its implementation is fairly long, and this conditional include is a bit of a jolt:

#ifdef __SSE2__
#include <emmintrin.h>
#endif

Its logic is actually very simple: linearly scan the ContentCache for line breaks, exactly the kind of loop SSE instructions are good at:

__m128i CRs = _mm_set1_epi8('\r');
__m128i LFs = _mm_set1_epi8('\n');
while (NextBuf + 16 <= End)
{
    const __m128i Chunk = *(const __m128i*)NextBuf;
    __m128i Cmp = _mm_or_si128(_mm_cmpeq_epi8(Chunk, CRs),
        _mm_cmpeq_epi8(Chunk, LFs));
    unsigned Mask = _mm_movemask_epi8(Cmp);

    // If we found a newline, adjust the pointer and jump to the handling code.
    if (Mask != 0)
    {
        NextBuf += llvm::countTrailingZeros(Mask);
        goto FoundSpecialChar;
    }
    NextBuf += 16;
}

Whenever a line break is found, its position is recorded into ContentCache.SourceLineCache.
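
A scalar sketch of what that scan produces (the SSE code above is only a fast path for the same thing); like Clang, it treats a "\r\n" pair as a single line break:

#include <vector>

// Record the offset of the first character of every line.
std::vector<unsigned> computeLineStarts(const char *Buf, unsigned Size) {
    std::vector<unsigned> Starts;
    Starts.push_back(0);                              // line 1 starts at offset 0
    for (unsigned i = 0; i != Size; ++i) {
        if (Buf[i] == '\n' || Buf[i] == '\r') {
            if (Buf[i] == '\r' && i + 1 != Size && Buf[i + 1] == '\n')
                ++i;                                  // swallow the LF of a CRLF pair
            Starts.push_back(i + 1);                  // the next line starts after the break
        }
    }
    return Starts;
}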

Finding the line number for a given offset within the file is then just a binary search over SourceLineCache.
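
A minimal sketch of that lookup over the array built above (not Clang's code; the real getLineNumber adds caching on top, as shown later):

#include <algorithm>
#include <vector>

// 1-based line number of the line containing Offset, given the line-start offsets.
unsigned lineNumberForOffset(const std::vector<unsigned> &LineStarts, unsigned Offset) {
    // Find the first line that starts strictly after Offset; its index is the answer.
    auto It = std::upper_bound(LineStarts.begin(), LineStarts.end(), Offset);
    return static_cast<unsigned>(It - LineStarts.begin());
}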

Creating macro expansion locations

Macro expansion comes in two flavors: expansion of a macro invocation and expansion of a macro argument, the former containing the latter. Both first build an ExpansionInfo and then call the function that actually carries out the task.

SourceLocation
SourceManager::createMacroArgExpansionLoc(SourceLocation SpellingLoc,
    SourceLocation ExpansionLoc,
    unsigned TokLength)
{
    ExpansionInfo Info = ExpansionInfo::createForMacroArg(SpellingLoc,
        ExpansionLoc);
    return createExpansionLocImpl(Info, TokLength);
}

SourceLocation
SourceManager::createExpansionLoc(SourceLocation SpellingLoc,
    SourceLocation ExpansionLocStart,
    SourceLocation ExpansionLocEnd,
    unsigned TokLength,
    int LoadedID,
    unsigned LoadedOffset)
{
    ExpansionInfo Info = ExpansionInfo::create(SpellingLoc, ExpansionLocStart,
        ExpansionLocEnd);
    return createExpansionLocImpl(Info, TokLength, LoadedID, LoadedOffset);
}

In the code above, createForMacroArg needs only one ExpansionLoc because a macro argument is a single token; the second SourceLocation defaults to an invalid one.

The public entry point is declared as follows, with two defaulted parameters; the real work is carried out by createExpansionLocImpl:

/// \brief Return a new SourceLocation that encodes the fact
/// that a token from SpellingLoc should actually be referenced from
/// ExpansionLoc.
SourceLocation createExpansionLoc(SourceLocation Loc,
    SourceLocation ExpansionLocStart,
    SourceLocation ExpansionLocEnd,
    unsigned TokLength,
    int LoadedID = 0,
    unsigned LoadedOffset = 0);

createExpansionLocImpl follows much the same logic as createFileID above, again split between external modules and the local module. When the expansion being recorded comes from an external module, this code runs:

if (LoadedID < 0)
{
    assert(LoadedID != -1 && "Loading sentinel FileID");
    unsigned Index = unsigned(-LoadedID) - 2;
    assert(Index < LoadedSLocEntryTable.size() && "FileID out of range");
    assert(!SLocEntryLoaded[Index] && "FileID already loaded");
    LoadedSLocEntryTable[Index] = SLocEntry::get(LoadedOffset, Info);
    SLocEntryLoaded[Index] = true;
    return SourceLocation::getMacroLoc(LoadedOffset);
}

When it is the local module, this runs instead:

LocalSLocEntryTable.push_back(SLocEntry::get(NextLocalOffset, Info));
assert(NextLocalOffset + TokLength + 1 > NextLocalOffset &&
    NextLocalOffset + TokLength + 1 <= CurrentLoadedOffset &&
    "Ran out of source locations!");
// See createFileID for that +1.
NextLocalOffset += TokLength + 1;
return SourceLocation::getMacroLoc(NextLocalOffset - (TokLength + 1));

As these snippets show, the main difference between the local and the external case is that the local path has to advance NextLocalOffset while the external one does not; on the other hand, many FileIDs in external modules are loaded lazily, so the code has to check whether they have been loaded.

Building the macro-argument map for a FileID

The macro-argument cache goes by the name MacroArgsMap, and clang's own explanation of it is still not terribly clear:

/// \brief Compute a map of macro argument chunks to their expanded source
/// location. Chunks that are not part of a macro argument will map to an
/// invalid source location. e.g. if a file contains one macro argument at
/// offset 100 with length 10, this is how the map will be formed:
///     0   -> SourceLocation()
///     100 -> Expanded macro arg location
///     110 -> SourceLocation()

The gist seems to be: the source file is cut into chunks according to whether a region belongs to a macro argument; regions that are macro arguments map to the SourceLocation of their expansion, and everything else maps to an invalid SourceLocation.
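
A sketch of how such a chunk map would be queried (std::map stands in for the real MacroArgsMap; a raw value of 0 plays the role of the invalid location):

#include <map>

// chunk start offset -> raw location of the expanded macro argument (0 == plain file text)
using ChunkMap = std::map<unsigned, unsigned>;

unsigned lookupChunk(const ChunkMap &M, unsigned Offset) {
    auto It = M.upper_bound(Offset);   // first chunk starting strictly after Offset
    if (It == M.begin())
        return 0;                      // before the first chunk: ordinary file text
    --It;                              // the chunk that covers Offset
    return It->second;
}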

A given FileID can introduce several new FileIDs, some through header inclusion and some through macro expansion. To process them we have to be clear about how FileIDs are allocated: each newly encountered include or expansion bumps the global FileID counter, and a newly included header is processed completely before control returns to the original position to continue with the rest of the current file. The first FileID after such an include is therefore offset by the number of FileIDs that the include created.

So the full logic is this: keep incrementing the FileID. When a file entry is encountered, first check whether its include location is in the current file; if not, the FileID has walked past the current file and nothing more needs to be done, otherwise skip ahead with FileID += NumCreatedFIDs. When an expansion entry is encountered, likewise check whether the expansion comes from the current file; if not, stop, otherwise record it into MacroArgsMap if it is a macro-argument expansion. The complete code:

int ID = FID.ID;
while (1)
{
    ++ID;
    // Stop if there are no more FileIDs to check.
    if (ID > 0)
    {
        if (unsigned(ID) >= local_sloc_entry_size())
            return;
    }
    else if (ID == -1)
    {
        return;
    }

    bool Invalid = false;
    const SrcMgr::SLocEntry &Entry = getSLocEntryByID(ID, &Invalid);
    if (Invalid)
        return;
    if (Entry.isFile())
    {
        SourceLocation IncludeLoc = Entry.getFile().getIncludeLoc();
        if (IncludeLoc.isInvalid())
            continue;
        if (!isInFileID(IncludeLoc, FID))
            return; // No more files/macros that may be "contained" in this file.

  // Skip the files/macros of the #include'd file, we only care about macros
  // that lexed macro arguments from our file.
        if (Entry.getFile().NumCreatedFIDs)
            ID += Entry.getFile().NumCreatedFIDs - 1/*because of next ++ID*/;
        continue;
    }

    const ExpansionInfo &ExpInfo = Entry.getExpansion();

    if (ExpInfo.getExpansionLocStart().isFileID())
    {
        if (!isInFileID(ExpInfo.getExpansionLocStart(), FID))
            return; // No more files/macros that may be "contained" in this file.
    }

    if (!ExpInfo.isMacroArgExpansion())
        continue;

    associateFileChunkWithMacroArgExp(MacroArgsCache, FID,
        ExpInfo.getSpellingLoc(),
        SourceLocation::getMacroLoc(Entry.getOffset()),
        getFileIDSize(FileID::get(ID)));
}

The final recording step is associateFileChunkWithMacroArgExp. The case that needs care is when the macro argument is itself a macro invocation; then the whole chain of argument expansion has to be recorded, so the function recurses.

if (!SpellLoc.isFileID())
{
    unsigned SpellBeginOffs = SpellLoc.getOffset();
    unsigned SpellEndOffs = SpellBeginOffs + ExpansionLength;

    // The spelling range for this macro argument expansion can span multiple
    // consecutive FileID entries. Go through each entry contained in the
    // spelling range and if one is itself a macro argument expansion, recurse
    // and associate the file chunk that it represents.

    FileID SpellFID; // Current FileID in the spelling range.
    unsigned SpellRelativeOffs;
    std::tie(SpellFID, SpellRelativeOffs) = getDecomposedLoc(SpellLoc);
    while (1)
    {
        const SLocEntry &Entry = getSLocEntry(SpellFID);
        unsigned SpellFIDBeginOffs = Entry.getOffset();
        unsigned SpellFIDSize = getFileIDSize(SpellFID);
        unsigned SpellFIDEndOffs = SpellFIDBeginOffs + SpellFIDSize;
        const ExpansionInfo &Info = Entry.getExpansion();
        if (Info.isMacroArgExpansion())
        {
            unsigned CurrSpellLength;
            if (SpellFIDEndOffs < SpellEndOffs)
                CurrSpellLength = SpellFIDSize - SpellRelativeOffs;
            else
                CurrSpellLength = ExpansionLength;
            associateFileChunkWithMacroArgExp(MacroArgsCache, FID,
                Info.getSpellingLoc().getLocWithOffset(SpellRelativeOffs),
                ExpansionLoc, CurrSpellLength);
        }

        if (SpellFIDEndOffs >= SpellEndOffs)
            return; // we covered all FileID entries in the spelling range.

  // Move to the next FileID entry in the spelling range.
        unsigned advance = SpellFIDSize - SpellRelativeOffs + 1;
        ExpansionLoc = ExpansionLoc.getLocWithOffset(advance);
        ExpansionLength -= advance;
        ++SpellFID.ID;
        SpellRelativeOffs = 0;
    }

}

Queries

Lookup caches and statistics

The class devotes quite a bit of space to recording recent lookups and some statistics:

/// \brief A one-entry cache to speed up getFileID.
///
/// LastFileIDLookup records the last FileID looked up or created, because it
/// is very common to look up many tokens from the same file.
mutable FileID LastFileIDLookup;
/// \brief These ivars serve as a cache used in the getLineNumber
/// method which is used to speedup getLineNumber calls to nearby locations.
mutable FileID LastLineNoFileIDQuery;
mutable SrcMgr::ContentCache *LastLineNoContentCache;
mutable unsigned LastLineNoFilePos;
mutable unsigned LastLineNoResult;
// Statistics for -print-stats.
mutable unsigned NumLinearScans, NumBinaryProbes;

/// \brief Associates a FileID with its "included/expanded in" decomposed
/// location.
///
/// Used to cache results from and speed-up \c getDecomposedIncludedLoc
/// function.
mutable llvm::DenseMap<FileID, std::pair<FileID, unsigned> > IncludedLocMap;

Looking up a FileID from an SLocOffset

In the createFileID code shown earlier, the following assertion guards the maintenance of NextLocalOffset:

assert(NextLocalOffset + FileSize + 1 > NextLocalOffset &&
    NextLocalOffset + FileSize + 1 <= CurrentLoadedOffset &&
    "Ran out of source locations!");

The assert checks that NextLocalOffset stays valid: the first condition guards against overflow, the second makes sure NextLocalOffset does not run into CurrentLoadedOffset. The comparison with CurrentLoadedOffset is needed because there is only one source-location offset space: NextLocalOffset and CurrentLoadedOffset share it, one growing upwards from 0 and the other downwards from \(2^{31}\).

Deciding whether a given SLocOffset is local or loaded then only takes a comparison against one of the two, as the code below shows:

FileID SourceManager::getFileIDSlow(unsigned SLocOffset) const
{
    if (!SLocOffset)
        return FileID::get(0);

    // Now it is time to search for the correct file. See where the SLocOffset
    // sits in the global view and consult local or loaded buffers for it.
    if (SLocOffset < NextLocalOffset)
        return getFileIDLocal(SLocOffset);
    return getFileIDLoaded(SLocOffset);
}

The handling of the local case is rather long, so let's take it in pieces.

First check that the local offset is valid by comparing it against NextLocalOffset. If it is, find an SLocEntry whose offset is larger than it, reusing the position of the previous query whenever possible; clang says locality here is very good.

const SrcMgr::SLocEntry *I;

if (LastFileIDLookup.ID < 0 ||
    LocalSLocEntryTable[LastFileIDLookup.ID].getOffset() < SLocOffset)
{
    // Neither loc prunes our search.
    I = LocalSLocEntryTable.end();
}
else
{
    // Perhaps it is near the file point.
    I = LocalSLocEntryTable.begin() + LastFileIDLookup.ID;
}

Starting from that position, walk backwards to find the first SLocEntry whose offset is not larger than ours; that is the entry containing the offset. The search strategy is a little quirky: first a linear scan of up to 8 entries, returning immediately on a hit.

// Find the FileID that contains this.  "I" is an iterator that points to a
// FileID whose offset is known to be larger than SLocOffset.
unsigned NumProbes = 0;
while (1)
{
    --I;
    if (I->getOffset() <= SLocOffset)
    {
        FileID Res = FileID::get(int(I - LocalSLocEntryTable.begin()));

        // If this isn't an expansion, remember it.  We have good locality across
        // FileID lookups.
        if (!I->isExpansion())
            LastFileIDLookup = Res;
        NumLinearScans += NumProbes + 1;
        return Res;
    }
    if (++NumProbes == 8)
        break;
}

// Convert "I" back into an index.  We know that it is an entry whose index is
// larger than the offset we are looking for.
unsigned GreaterIndex = I - LocalSLocEntryTable.begin();
// LessIndex - This is the lower bound of the range that we're searching.
// We know that the offset corresponding to the FileID is is less than
// SLocOffset.

If all 8 probes miss, fall back to binary search:

unsigned LessIndex = 0;
NumProbes = 0;
while (1)
{
    bool Invalid = false;
    unsigned MiddleIndex = (GreaterIndex - LessIndex) / 2 + LessIndex;
    unsigned MidOffset = getLocalSLocEntry(MiddleIndex, &Invalid).getOffset();
    if (Invalid)
        return FileID::get(0);

    ++NumProbes;

    // If the offset of the midpoint is too large, chop the high side of the
    // range to the midpoint.
    if (MidOffset > SLocOffset)
    {
        GreaterIndex = MiddleIndex;
        continue;
    }

    // If the middle index contains the value, succeed and return.
    // FIXME: This could be made faster by using a function that's aware of
    // being in the local area.
    if (isOffsetInFileID(FileID::get(MiddleIndex), SLocOffset))
    {
        FileID Res = FileID::get(MiddleIndex);

        // If this isn't a macro expansion, remember it.  We have good locality
        // across FileID lookups.
        if (!LocalSLocEntryTable[MiddleIndex].isExpansion())
            LastFileIDLookup = Res;
        NumBinaryProbes += NumProbes;
        return Res;
    }

    // Otherwise, move the low-side up to the middle index.
    LessIndex = MiddleIndex;
}

Looking up a loaded FileID is much the same; the only differences are that the comparison goes the other way (the table is sorted in the opposite direction) and that some entries may not have been loaded yet.

First the linear-scan part:

// Sanity checking, otherwise a bug may lead to hanging in release build.
if (SLocOffset < CurrentLoadedOffset)
{
    assert(0 && "Invalid SLocOffset or bad function choice");
    return FileID();
}

// Essentially the same as the local case, but the loaded array is sorted
// in the other direction.

// First do a linear scan from the last lookup position, if possible.
unsigned I;
int LastID = LastFileIDLookup.ID;
if (LastID >= 0 || getLoadedSLocEntryByID(LastID).getOffset() < SLocOffset)
    I = 0;
else
    I = (-LastID - 2) + 1;

unsigned NumProbes;
for (NumProbes = 0; NumProbes < 8; ++NumProbes, ++I)
{
    // Make sure the entry is loaded!
    const SrcMgr::SLocEntry &E = getLoadedSLocEntry(I);
    if (E.getOffset() <= SLocOffset)
    {
        FileID Res = FileID::get(-int(I) - 2);

        if (!E.isExpansion())
            LastFileIDLookup = Res;
        NumLinearScans += NumProbes + 1;
        return Res;
    }
}

Then the binary-search part:

// Linear scan failed. Do the binary search. Note the reverse sorting of the
// table: GreaterIndex is the one where the offset is greater, which is
// actually a lower index!
unsigned GreaterIndex = I;
unsigned LessIndex = LoadedSLocEntryTable.size();
NumProbes = 0;
while (1)
{
    ++NumProbes;
    unsigned MiddleIndex = (LessIndex - GreaterIndex) / 2 + GreaterIndex;
    const SrcMgr::SLocEntry &E = getLoadedSLocEntry(MiddleIndex);
    if (E.getOffset() == 0)
        return FileID(); // invalid entry.

    ++NumProbes;

    if (E.getOffset() > SLocOffset)
    {
        // Sanity checking, otherwise a bug may lead to hanging in release build.
        if (GreaterIndex == MiddleIndex)
        {
            assert(0 && "binary search missed the entry");
            return FileID();
        }
        GreaterIndex = MiddleIndex;
        continue;
    }

    if (isOffsetInFileID(FileID::get(-int(MiddleIndex) - 2), SLocOffset))
    {
        FileID Res = FileID::get(-int(MiddleIndex) - 2);
        if (!E.isExpansion())
            LastFileIDLookup = Res;
        NumBinaryProbes += NumProbes;
        return Res;
    }

    // Sanity checking, otherwise a bug may lead to hanging in release build.
    if (LessIndex == MiddleIndex)
    {
        assert(0 && "binary search missed the entry");
        return FileID();
    }
    LessIndex = MiddleIndex;
}

With that, mapping an offset back to its FileID is done. The main caller of this machinery is the function below, which finds the FileID for a SpellingLoc, the case that actually requires searching across files:

/// This is a very hot method that is used for all SourceManager queries
/// that start with a SourceLocation object.  It is responsible for finding
/// the entry in SLocEntryTable which contains the specified location.
///
FileID getFileID(SourceLocation SpellingLoc) const
{
    unsigned SLocOffset = SpellingLoc.getOffset();

    // If our one-entry cache covers this offset, just return it.
    if (isOffsetInFileID(LastFileIDLookup, SLocOffset))
        return LastFileIDLookup;

    return getFileIDSlow(SLocOffset);
}

Note the one-entry cache check at the top, isOffsetInFileID(LastFileIDLookup, SLocOffset); lookup locality is evidently quite high.

Very often what we really want is the pair of FileID and offset within that file. Once an SLocOffset can be mapped to its FileID, this is easy: a subtraction yields the in-file offset:

/// \brief Decompose the specified location into a raw FileID + Offset pair.
///
/// The first element is the FileID, the second is the offset from the
/// start of the buffer of the location.
std::pair<FileID, unsigned> getDecomposedLoc(SourceLocation Loc) const
{
    FileID FID = getFileID(Loc);
    bool Invalid = false;
    const SrcMgr::SLocEntry &E = getSLocEntry(FID, &Invalid);
    if (Invalid)
        return std::make_pair(FileID(), 0);
    return std::make_pair(FID, Loc.getOffset() - E.getOffset());
}

Looking up a FileID from a FileEntry

This operation is carried out by translateFile. Getting the FileID for a FileEntry sounds simple, yet the function runs to 140-odd lines, so let's take it in parts. First the signature:

/// \brief Get the FileID for the given file.
///
/// If the source file is included multiple times, the FileID will be the
/// first inclusion.
FileID SourceManager::translateFile(const FileEntry *SourceFile) const

The comment notes that a file may be included multiple times; the FileID of the first inclusion is returned.

First check whether the main file of the current translation unit is already the file we're looking for:

FileID FirstFID;

// First, check the main file ID, since it is common to look for a
// location in the main file.
Optional<llvm::sys::fs::UniqueID> SourceFileUID;
Optional<StringRef> SourceFileName;
if (MainFileID.isValid())
{
    bool Invalid = false;
    const SLocEntry &MainSLoc = getSLocEntry(MainFileID, &Invalid);
    if (Invalid)
        return FileID();

    if (MainSLoc.isFile())
    {
        const ContentCache *MainContentCache
            = MainSLoc.getFile().getContentCache();
        if (!MainContentCache)
        {
            // Can't do anything
        }
        else if (MainContentCache->OrigEntry == SourceFile)
        {
            FirstFID = MainFileID;
        }
        else
        {
            // Fall back: check whether we have the same base name and inode
            // as the main file.
            // (a longer fallback omitted here)
        }
    }
}

That many ifs is rather terrifying, but the flow is actually simple: from MainFileID get the SLocEntry, from that the ContentCache, and compare its stored OrigEntry against the FileEntry in question. If they are equivalent, return; otherwise check whether the base names match, and if they do, whether the file system's unique IDs match.

const FileEntry *MainFile = MainContentCache->OrigEntry;
SourceFileName = llvm::sys::path::filename(SourceFile->getName());
if (*SourceFileName == llvm::sys::path::filename(MainFile->getName()))
{
    SourceFileUID = getActualFileUID(SourceFile);
    if (SourceFileUID)
    {
        if (Optional<llvm::sys::fs::UniqueID> MainFileUID =
            getActualFileUID(MainFile))
        {
            if (*SourceFileUID == *MainFileUID)
            {
                FirstFID = MainFileID;
                SourceFile = MainFile;
            }
        }
    }
}

If none of those checks succeed, iterate over all local SLocEntries and test them one by one; not very clever:

// The location we're looking for isn't in the main file; look
// through all of the local source locations.
for (unsigned I = 0, N = local_sloc_entry_size(); I != N; ++I)
{
    bool Invalid = false;
    const SLocEntry &SLoc = getLocalSLocEntry(I, &Invalid);
    if (Invalid)
        return FileID();

    if (SLoc.isFile() &&
        SLoc.getFile().getContentCache() &&
        SLoc.getFile().getContentCache()->OrigEntry == SourceFile)
    {
        FirstFID = FileID::get(I);
        break;
    }
}

Still not found? Then iterate over the external LoadedSLocEntries, again one by one, equally unsophisticated:

for (unsigned I = 0, N = loaded_sloc_entry_size(); I != N; ++I)
{
    const SLocEntry &SLoc = getLoadedSLocEntry(I);
    if (SLoc.isFile() &&
        SLoc.getFile().getContentCache() &&
        SLoc.getFile().getContentCache()->OrigEntry == SourceFile)
    {
        FirstFID = FileID::get(-int(I) - 2);
        break;
    }
}

If even that fails, the local SLocEntries are walked one more time, this time with the file-system-level comparison added, much like the MainFileID check above; I'll spare you the code.

Why is this so painful? Wouldn't building a lookup table have settled it? Instead we get a traversal, and twice at that!

Querying macro expansion locations

Given a SourceLocation, we sometimes need the starting point of the corresponding macro expansion. This is a recursive process, because macro expansion itself is recursive.

/// \brief Given a SourceLocation object \p Loc, return the expansion
/// location referenced by the ID.
SourceLocation getExpansionLoc(SourceLocation Loc) const
{
    // Handle the non-mapped case inline, defer to out of line code to handle
    // expansions.
    if (Loc.isFileID()) return Loc;
    return getExpansionLocSlowCase(Loc);
}
SourceLocation SourceManager::
getExpansionLocSlowCase(SourceLocation Loc) const
{
    do
    {
        // Note: If Loc indicates an offset into a token that came from a macro
        // expansion (e.g. the 5th character of the token) we do not want to add
        // this offset when going to the expansion location.  The expansion
        // location is the macro invocation, which the offset has nothing to do
        // with.  This is unlike when we get the spelling loc, because the offset
        // directly correspond to the token whose spelling we're inspecting.
        Loc = getSLocEntry(getFileID(Loc)).getExpansion().getExpansionLocStart();
    }
    while (!Loc.isFileID());

    return Loc;
}

As long as the intermediate result is still an expansion-related SourceLocation, the query keeps recursing.

Similarly, finding the SpellingLoc is a recursive query that continues until the location lands in a real FileID:

SourceLocation SourceManager::getSpellingLocSlowCase(SourceLocation Loc) const
{
    do
    {
        std::pair<FileID, unsigned> LocInfo = getDecomposedLoc(Loc);
        Loc = getSLocEntry(LocInfo.first).getExpansion().getSpellingLoc();
        Loc = Loc.getLocWithOffset(LocInfo.second);
    }
    while (!Loc.isFileID());
    return Loc;
}

Note the Loc.getLocWithOffset(LocInfo.second) at the end of each step: it simply adds the in-file offset onto the raw encoding. This is safe because SourceLocation keeps its macro flag in the top bit (MacroIDBit), which an in-file offset can never carry into, and because the offset produced by getDecomposedLoc is already a plain byte offset; the low-bit flag trick belongs to SLocEntry::Offset, and SLocEntry::getOffset() has already stripped it:

/// \brief Decompose the specified location into a raw FileID + Offset pair.
///
/// The first element is the FileID, the second is the offset from the
/// start of the buffer of the location.
std::pair<FileID, unsigned> getDecomposedLoc(SourceLocation Loc) const
{
    FileID FID = getFileID(Loc);
    bool Invalid = false;
    const SrcMgr::SLocEntry &E = getSLocEntry(FID, &Invalid);
    if (Invalid)
        return std::make_pair(FileID(), 0);
    return std::make_pair(FID, Loc.getOffset() - E.getOffset());
}

Since SLocEntry::getOffset() returns the offset with its flag bit already removed, LocInfo.second contains no stray flag bits and the addition above is well defined.

That function only handles the starting point of an expansion; often we want both the start and the end, i.e. the complete expansion range. Two functions are involved: one resolves a single expansion step, the other the whole expansion chain:

/// \brief Return the start/end of the expansion information for an
/// expansion location.
///
/// \pre \p Loc is required to be an expansion location.
std::pair<SourceLocation, SourceLocation>
getImmediateExpansionRange(SourceLocation Loc) const;
/// \brief Given a SourceLocation object, return the range of
/// tokens covered by the expansion in the ultimate file.
std::pair<SourceLocation, SourceLocation>
    getExpansionRange(SourceLocation Loc) const;

The immediate expansion is easy to handle: just return the stored expansion range:

/// getImmediateExpansionRange - Loc is required to be an expansion location.
/// Return the start/end of the expansion information.
std::pair<SourceLocation, SourceLocation>
SourceManager::getImmediateExpansionRange(SourceLocation Loc) const
{
    assert(Loc.isMacroID() && "Not a macro expansion loc!");
    const ExpansionInfo &Expansion = getSLocEntry(getFileID(Loc)).getExpansion();
    return Expansion.getExpansionLocRange();
}

For the complete macro expansion that is not enough; recursion is needed once more, resolving the range obtained above step by step:

/// getExpansionRange - Given a SourceLocation object, return the range of
/// tokens covered by the expansion in the ultimate file.
std::pair<SourceLocation, SourceLocation>
SourceManager::getExpansionRange(SourceLocation Loc) const
{
    if (Loc.isFileID()) return std::make_pair(Loc, Loc);

    std::pair<SourceLocation, SourceLocation> Res =
        getImmediateExpansionRange(Loc);

    // Fully resolve the start and end locations to their ultimate expansion
    // points.
    while (!Res.first.isFileID())
        Res.first = getImmediateExpansionRange(Res.first).first;
    while (!Res.second.isFileID())
        Res.second = getImmediateExpansionRange(Res.second).second;
    return Res;
}

Finally there are two more interesting functions: one checks whether a location is at the start of its immediate macro expansion, the other whether it is at the end. Both follow the same pattern: first validate the position (is it a macro expansion location at all, is it at the beginning or end of its FileID), then fetch the expansion Range and keep checking.

BeforeInTUCache

I'm still not quite sure what the name means; it seems to deal with the ordering of two SourceLocations within the translation unit, i.e. which one comes first. A usage sketch follows the declarations below.

/// The key value into the IsBeforeInTUCache table.
typedef std::pair<FileID, FileID> IsBeforeInTUCacheKey;

/// The IsBeforeInTranslationUnitCache is a mapping from FileID pairs
/// to cache results.
typedef llvm::DenseMap<IsBeforeInTUCacheKey, InBeforeInTUCacheEntry>
    InBeforeInTUCache;

/// Cache results for the isBeforeInTranslationUnit method.
mutable InBeforeInTUCache IBTUCache;
mutable InBeforeInTUCacheEntry IBTUCacheOverflow;

/// Return the cache entry for comparing the given file IDs
/// for isBeforeInTranslationUnit.
InBeforeInTUCacheEntry &getInBeforeInTUCache(FileID LFID, FileID RFID) const;
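
The cache above backs SourceManager::isBeforeInTranslationUnit, which yields a total order over the locations of one translation unit. A typical (hypothetical) use is sorting things by source position:

#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/SourceManager.h"
#include <algorithm>
#include <vector>

void sortBySourceOrder(const clang::SourceManager &SM,
                       std::vector<clang::SourceLocation> &Locs) {
    std::sort(Locs.begin(), Locs.end(),
              [&SM](clang::SourceLocation A, clang::SourceLocation B) {
                  return SM.isBeforeInTranslationUnit(A, B);
              });
}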

Line and column management

Column lookup

The comments say the column number is considerably cheaper to compute than the line number, so let's start with columns. It is computed from a file identifier and an offset within the file; the signature:

/// getColumnNumber - Return the column # for the specified file position.
/// this is significantly cheaper to compute than the line number.
unsigned SourceManager::getColumnNumber(FileID FID, unsigned FilePos,
    bool *Invalid) const

The flow: first check whether the file has a corresponding MemoryBuffer and return an invalid result if not; likewise if the offset exceeds the buffer size.

bool MyInvalid = false;
llvm::MemoryBuffer *MemBuf = getBuffer(FID, &MyInvalid);
if (Invalid)
    *Invalid = MyInvalid;

if (MyInvalid)
    return 1;

// It is okay to request a position just past the end of the buffer.
if (FilePos > MemBuf->getBufferSize())
{
    if (Invalid)
        *Invalid = true;
    return 1;
}

Then the most recent query recorded in the SourceManager is consulted: if the position falls within the most recently queried line, the column offset is computed directly.

// See if we just calculated the line number for this FilePos and can use
// that to lookup the start of the line instead of searching for it.
if (LastLineNoFileIDQuery == FID &&
    LastLineNoContentCache->SourceLineCache != nullptr &&
    LastLineNoResult < LastLineNoContentCache->NumLines)
{
    unsigned *SourceLineCache = LastLineNoContentCache->SourceLineCache;
    unsigned LineStart = SourceLineCache[LastLineNoResult - 1];
    unsigned LineEnd = SourceLineCache[LastLineNoResult];
    if (FilePos >= LineStart && FilePos < LineEnd)
        return FilePos - LineStart + 1;
}

Otherwise, walk backwards to the previous line break ('\n' or '\r'), count how many steps back that took, and return it.

const char *Buf = MemBuf->getBufferStart();
unsigned LineStart = FilePos;
while (LineStart && Buf[LineStart - 1] != '\n' && Buf[LineStart - 1] != '\r')
    --LineStart;
return FilePos - LineStart + 1;

Line lookup

Computing the line number is nowhere near as pleasant. SourceManager has a function that performs the line lookup, with the following signature:

unsigned SourceManager::getLineNumber(FileID FID, unsigned FilePos,
    bool *Invalid) const

This function is also fairly involved, so again in pieces. First find the ContentCache of the file, possibly straight from the cache of the last query:

ContentCache *Content;
if (LastLineNoFileIDQuery == FID)
    Content = LastLineNoContentCache;
else
{
    bool MyInvalid = false;
    const SLocEntry &Entry = getSLocEntry(FID, &MyInvalid);
    if (MyInvalid || !Entry.isFile())
    {
        if (Invalid)
            *Invalid = true;
        return 1;
    }

    Content = const_cast<ContentCache*>(Entry.getFile().getContentCache());
}

With the ContentCache in hand, get its SourceLineCache; if it doesn't exist yet, the ComputeLineNumbers function from earlier has to build it:

// If this is the first use of line information for this buffer, compute the
/// SourceLineCache for it on demand.
if (!Content->SourceLineCache)
{
    bool MyInvalid = false;
    ComputeLineNumbers(Diag, Content, ContentCacheAlloc, *this, MyInvalid);
    if (Invalid)
        *Invalid = MyInvalid;
    if (MyInvalid)
        return 1;
}
else if (Invalid)
    *Invalid = false;

After that comes a binary search, but first some convoluted work to find a tighter upper bound: if the last query was in the same file, the current query point is probably not far from the last one. Was it really worth the trouble?

if (LastLineNoFileIDQuery == FID)
{
    if (QueriedFilePos >= LastLineNoFilePos)
    {
        // FIXME: Potential overflow?
        SourceLineCache = SourceLineCache + LastLineNoResult - 1;

        // The query is likely to be nearby the previous one.  Here we check to
        // see if it is within 5, 10 or 20 lines.  It can be far away in cases
        // where big comment blocks and vertical whitespace eat up lines but
        // contribute no tokens.
        if (SourceLineCache + 5 < SourceLineCacheEnd)
        {
            if (SourceLineCache[5] > QueriedFilePos)
                SourceLineCacheEnd = SourceLineCache + 5;
            else if (SourceLineCache + 10 < SourceLineCacheEnd)
            {
                if (SourceLineCache[10] > QueriedFilePos)
                    SourceLineCacheEnd = SourceLineCache + 10;
                else if (SourceLineCache + 20 < SourceLineCacheEnd)
                {
                    if (SourceLineCache[20] > QueriedFilePos)
                        SourceLineCacheEnd = SourceLineCache + 20;
                }
            }
        }
    }
    else
    {
        if (LastLineNoResult < Content->NumLines)
            SourceLineCacheEnd = SourceLineCache + LastLineNoResult + 1;
    }
}

Only now does the actual binary search run, with the recent-query cache filled in at the same time:

unsigned *Pos
    = std::lower_bound(SourceLineCache, SourceLineCacheEnd, QueriedFilePos + 1);
unsigned LineNo = Pos - SourceLineCacheStart;

LastLineNoFileIDQuery = FID;
LastLineNoContentCache = Content;
LastLineNoFilePos = QueriedFilePos;
LastLineNoResult = LineNo;
return LineNo;

After all this groundwork, the functions one level up become very simple:

unsigned SourceManager::getSpellingLineNumber(SourceLocation Loc,
    bool *Invalid) const
{
    if (isInvalid(Loc, Invalid)) return 0;
    std::pair<FileID, unsigned> LocInfo = getDecomposedSpellingLoc(Loc);
    return getLineNumber(LocInfo.first, LocInfo.second);
}
unsigned SourceManager::getExpansionLineNumber(SourceLocation Loc,
    bool *Invalid) const
{
    if (isInvalid(Loc, Invalid)) return 0;
    std::pair<FileID, unsigned> LocInfo = getDecomposedExpansionLoc(Loc);
    return getLineNumber(LocInfo.first, LocInfo.second);
}

Presumed line lookup

SpellingLineNumber and ExpansionLineNumber can reuse the functions above fairly directly, but for the PresumedLineNumber things are not that simple.

unsigned SourceManager::getPresumedLineNumber(SourceLocation Loc,
    bool *Invalid) const
{
    if (isInvalid(Loc, Invalid)) return 0;
    return getPresumedLoc(Loc).getLine();
}

This relies on the behavior of getPresumedLoc, which is special enough to deserve a closer look. Its signature:

/// getPresumedLoc - This method returns the "presumed" location of a
/// SourceLocation specifies.  A "presumed location" can be modified by #line
/// or GNU line marker directives.  This provides a view on the data that a
/// user should see in diagnostics, for example.
///
/// Note that a presumed location is always given as the expansion point of an
/// expansion location, not at the spelling location.
PresumedLoc SourceManager::getPresumedLoc(SourceLocation Loc,
    bool UseLineDirectives) const

First the expansion point of the location is decomposed in order to pin down the physical position:

if (Loc.isInvalid()) return PresumedLoc();

// Presumed locations are always for expansion points.
std::pair<FileID, unsigned> LocInfo = getDecomposedExpansionLoc(Loc);

bool Invalid = false;
const SLocEntry &Entry = getSLocEntry(LocInfo.first, &Invalid);
if (Invalid || !Entry.isFile())
    return PresumedLoc();

const SrcMgr::FileInfo &FI = Entry.getFile();
const SrcMgr::ContentCache *C = FI.getContentCache();
// To get the source name, first consult the FileEntry (if one exists)
// before the MemBuffer as this will avoid unnecessarily paging in the
// MemBuffer.
const char *Filename;
if (C->OrigEntry)
    Filename = C->OrigEntry->getName();
else
    Filename = C->getBuffer(Diag, *this)->getBufferIdentifier();

unsigned LineNo = getLineNumber(LocInfo.first, LocInfo.second, &Invalid);
if (Invalid)
    return PresumedLoc();
unsigned ColNo = getColumnNumber(LocInfo.first, LocInfo.second, &Invalid);
if (Invalid)
    return PresumedLoc();

With that position in hand, find the nearest preceding #line directive, adjust by the difference, and return. For instance, if the marker #line 4 sits on physical line 5 and the query is on physical line 6, the presumed line is 4 + (6 - 5 - 1) = 4, matching the earlier example.

// If we have #line directives in this file, update and overwrite the physical
// location info if appropriate.
if (UseLineDirectives && FI.hasLineDirectives())
{
    assert(LineTable && "Can't have linetable entries without a LineTable!");
    // See if there is a #line directive before this.  If so, get it.
    if (const LineEntry *Entry =
        LineTable->FindNearestLineEntry(LocInfo.first, LocInfo.second))
    {
        // If the LineEntry indicates a filename, use it.
        if (Entry->FilenameID != -1)
            Filename = LineTable->getFilename(Entry->FilenameID);

        // Use the line number specified by the LineEntry.  This line number may
        // be multiple lines down from the line entry.  Add the difference in
        // physical line numbers from the query point and the line marker to the
        // total.
        unsigned MarkerLineNo = getLineNumber(LocInfo.first, Entry->FileOffset);
        LineNo = Entry->LineNo + (LineNo - MarkerLineNo - 1);

        // Note that column numbers are not molested by line markers.

        // Handle virtual #include manipulation.
        if (Entry->IncludeOffset)
        {
            IncludeLoc = getLocForStartOfFile(LocInfo.first);
            IncludeLoc = IncludeLoc.getLocWithOffset(Entry->IncludeOffset);
        }
    }
}

return PresumedLoc(Filename, LineNo, ColNo, IncludeLoc);

Constructing a SourceLocation from line and column

This is essentially the inverse of the previous operations: given the file plus line and column, the corresponding SourceLocation is obtained by calling:

/// \brief Get the source location for the given file:line:col triplet.
///
/// If the source file is included multiple times, the source location will
/// be based upon an arbitrary inclusion.
SourceLocation SourceManager::translateFileLineCol(const FileEntry *SourceFile,
    unsigned Line,
    unsigned Col) const
{
    assert(SourceFile && "Null source file!");
    assert(Line && Col && "Line and column should start from 1!");

    FileID FirstFID = translateFile(SourceFile);
    return translateLineCol(FirstFID, Line, Col);
}

This in turn needs two lower-level functions. translateFile was covered above, so what remains is translateLineCol. The flow is simple enough: from the FileID get the SLocEntry and then the ContentCache, building the SourceLineCache on demand:

// If this is the first use of line information for this buffer, compute the
// SourceLineCache for it on demand.
if (!Content->SourceLineCache)
{
    bool MyInvalid = false;
    ComputeLineNumbers(Diag, Content, ContentCacheAlloc, *this, MyInvalid);
    if (MyInvalid)
        return SourceLocation();
}

If the line number is out of range, the location of the last byte is returned... I'd argue an invalid location would be the nicer answer.

if (Line > Content->NumLines)
{
    unsigned Size = Content->getBuffer(Diag, *this)->getBufferSize();
    if (Size > 0)
        --Size;
    return FileLoc.getLocWithOffset(Size);
}

The rest is arithmetic on the line and column; if the column is out of range, the last column of that line is returned.

llvm::MemoryBuffer *Buffer = Content->getBuffer(Diag, *this);
unsigned FilePos = Content->SourceLineCache[Line - 1];
const char *Buf = Buffer->getBufferStart() + FilePos;
unsigned BufLength = Buffer->getBufferSize() - FilePos;
if (BufLength == 0)
    return FileLoc.getLocWithOffset(FilePos);

unsigned i = 0;

// Check that the given column is valid.
while (i < BufLength - 1 && i < Col - 1 && Buf[i] != '\n' && Buf[i] != '\r')
    ++i;
return FileLoc.getLocWithOffset(FilePos + i);