Clang for fuzzy parsing C++

Is it at all possible to parse C++ with incomplete declarations with clang with its existing libclang API ? Ie parse .cpp file without including all the headers, deducing declarations on the fly. so, eg The following text:

A B::Foo(){return stuff();}

Will detect unknown symbol A, call my callback that deducts A is a class using my magic heuristic, then call this callback the same way with B and Foo and stuff. In the end I want to be able to infer that I saw a member Foo of class B returning A, and stuff is a function.. Or something to that effect. context: I wanna see if I can do sensible syntax highlighting and on the fly code analysis without parsing all the headers very quickly.

[EDIT] To clarify, I'm looking for very heavily restricted C++ parsing, possibly with some heuristic to lift some of the restrictions.

C++ grammar is full of context dependencies. Is Foo() a function call or a construction of a temporary of class Foo? Is Foo<Bar> stuff; a template Foo<Bar> instantiation and declaration of variable stuff, or is it weird-looking 2 calls to overloaded operator < and operator > ? It's only possible to tell in context, and context often comes from parsing the headers.

What I'm looking for is a way to plug my custom convention rules. Eg I know that I don't overload Win32 symbols, so I can safely assume that CreateFile is always a function, and I even know its signature. I also know that all my classes start with a capital letter and are nouns, and functions are usually verbs, so I can reasonably guess that Foo and Bar are class names. In a more complex scenario, I know I don't write side-effect-free expressions like a < b > c; so I can assume that a is always a template instantiation. And so on.

So, the question is whether it's possible to use Clang API to call back every time it encounters an unknown symbol, and give it an answer using my own non-C++ heuristic. If my heuristic fails, then the parse fails, obviously. And I'm not talking about parsing Boost library :) I'm talking about very simple C++, probably without templates, restricted to some minimum that clang can handle in this case.


Another solution which I think will suit more the OP than fuzzy parsing.

When parsing, clang maintains Semantic information through the Sema part of the analyzer. When encountering an unknown symbol, Sema will fallback to ExternalSemaSource to get some information about this symbol. Through this, you could implement what you want.

Here is a quick example how to set up it. It is not entirely functional (I'm not doing anything in the LookupUnqualified method), you might need to do further investigations and I think it is a good start.

// Declares clang::SyntaxOnlyAction.
#include <clang/Frontend/FrontendActions.h>
#include <clang/Tooling/CommonOptionsParser.h>
#include <clang/Tooling/Tooling.h>
#include <llvm/Support/CommandLine.h>
#include <clang/AST/AST.h>
#include <clang/AST/ASTConsumer.h>
#include <clang/AST/RecursiveASTVisitor.h>
#include <clang/Frontend/ASTConsumers.h>
#include <clang/Frontend/FrontendActions.h>
#include <clang/Frontend/CompilerInstance.h>
#include <clang/Tooling/CommonOptionsParser.h>
#include <clang/Tooling/Tooling.h>
#include <clang/Rewrite/Core/Rewriter.h>
#include <llvm/Support/raw_ostream.h>
#include <clang/Sema/ExternalSemaSource.h>
#include <clang/Sema/Sema.h>
#include "clang/Basic/DiagnosticOptions.h"
#include "clang/Frontend/TextDiagnosticPrinter.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Basic/TargetOptions.h"
#include "clang/Basic/TargetInfo.h"
#include "clang/Basic/FileManager.h"
#include "clang/Basic/SourceManager.h"
#include "clang/Lex/Preprocessor.h"
#include "clang/Basic/Diagnostic.h"
#include "clang/AST/ASTContext.h"
#include "clang/AST/ASTConsumer.h"
#include "clang/Parse/Parser.h"
#include "clang/Parse/ParseAST.h"
#include <clang/Sema/Lookup.h>

#include <iostream>
using namespace clang;
using namespace clang::tooling;
using namespace llvm;

class ExampleVisitor : public RecursiveASTVisitor<ExampleVisitor> {
private:
  ASTContext *astContext;

public:
  explicit ExampleVisitor(CompilerInstance *CI, StringRef file)
      : astContext(&(CI->getASTContext())) {}

  virtual bool VisitVarDecl(VarDecl *d) {
    std::cout << d->getNameAsString() << "@n";
    return true;
  }
};

class ExampleASTConsumer : public ASTConsumer {
private:
  ExampleVisitor visitor;

public:
  explicit ExampleASTConsumer(CompilerInstance *CI, StringRef file)
      : visitor(CI, file) {}
  virtual void HandleTranslationUnit(ASTContext &Context) {
    // de cette façon, on applique le visiteur sur l'ensemble de la translation
    // unit
    visitor.TraverseDecl(Context.getTranslationUnitDecl());
  }
};

class DynamicIDHandler : public clang::ExternalSemaSource {
public:
  DynamicIDHandler(clang::Sema *Sema)
      : m_Sema(Sema), m_Context(Sema->getASTContext()) {}
  ~DynamicIDHandler() = default;

  /// brief Provides last resort lookup for failed unqualified lookups
  ///
  /// If there is failed lookup, tell sema to create an artificial declaration
  /// which is of dependent type. So the lookup result is marked as dependent
  /// and the diagnostics are suppressed. After that is's an interpreter's
  /// responsibility to fix all these fake declarations and lookups.
  /// It is done by the DynamicExprTransformer.
  ///
  /// @param[out] R The recovered symbol.
  /// @param[in] S The scope in which the lookup failed.
  virtual bool LookupUnqualified(clang::LookupResult &R, clang::Scope *S) {
     DeclarationName Name = R.getLookupName();
     std::cout << Name.getAsString() << "n";
    // IdentifierInfo *II = Name.getAsIdentifierInfo();
    // SourceLocation Loc = R.getNameLoc();
    // VarDecl *Result =
    //     // VarDecl::Create(m_Context, R.getSema().getFunctionLevelDeclContext(),
    //     //                 Loc, Loc, II, m_Context.DependentTy,
    //     //                 /*TypeSourceInfo*/ 0, SC_None, SC_None);
    // if (Result) {
    //   R.addDecl(Result);
    //   // Say that we can handle the situation. Clang should try to recover
    //   return true;
    // } else{
    //   return false;
    // }
    return false;
  }

private:
  clang::Sema *m_Sema;
  clang::ASTContext &m_Context;
};

// *****************************************************************************/

LangOptions getFormattingLangOpts(bool Cpp03 = false) {
  LangOptions LangOpts;
  LangOpts.CPlusPlus = 1;
  LangOpts.CPlusPlus11 = Cpp03 ? 0 : 1;
  LangOpts.CPlusPlus14 = Cpp03 ? 0 : 1;
  LangOpts.LineComment = 1;
  LangOpts.Bool = 1;
  LangOpts.ObjC1 = 1;
  LangOpts.ObjC2 = 1;
  return LangOpts;
}

int main() {
  using clang::CompilerInstance;
  using clang::TargetOptions;
  using clang::TargetInfo;
  using clang::FileEntry;
  using clang::Token;
  using clang::ASTContext;
  using clang::ASTConsumer;
  using clang::Parser;
  using clang::DiagnosticOptions;
  using clang::TextDiagnosticPrinter;

  CompilerInstance ci;
  ci.getLangOpts() = getFormattingLangOpts(false);
  DiagnosticOptions diagnosticOptions;
  ci.createDiagnostics();

  std::shared_ptr<clang::TargetOptions> pto = std::make_shared<clang::TargetOptions>();
  pto->Triple = llvm::sys::getDefaultTargetTriple();

  TargetInfo *pti = TargetInfo::CreateTargetInfo(ci.getDiagnostics(), pto);

  ci.setTarget(pti);
  ci.createFileManager();
  ci.createSourceManager(ci.getFileManager());
  ci.createPreprocessor(clang::TU_Complete);
  ci.getPreprocessorOpts().UsePredefines = false;
  ci.createASTContext();

  ci.setASTConsumer(
      llvm::make_unique<ExampleASTConsumer>(&ci, "../src/test.cpp"));

  ci.createSema(TU_Complete, nullptr);
  auto &sema = ci.getSema();
  sema.Initialize();
  DynamicIDHandler handler(&sema);
  sema.addExternalSource(&handler);

  const FileEntry *pFile = ci.getFileManager().getFile("../src/test.cpp");
  ci.getSourceManager().setMainFileID(ci.getSourceManager().createFileID(
      pFile, clang::SourceLocation(), clang::SrcMgr::C_User));
  ci.getDiagnosticClient().BeginSourceFile(ci.getLangOpts(),
                                           &ci.getPreprocessor());
  clang::ParseAST(sema,true,false);
  ci.getDiagnosticClient().EndSourceFile();

  return 0;
}

The idea and the DynamicIDHandler class are from cling project where unknown symbols are variable (hence the comments and the code).


Unless you heavily restrict the code that people are allowed to write, it is basically impossible to do a good job of parsing C++ (and hence syntax highlighting beyond keywords/regular expressions) without parsing all the headers. The pre-processor is particularly good at screwing things up for you.

There are some thoughts on the difficulties of fuzzy parsing (in the context of visual studio) here which might be of interest: http://blogs.msdn.com/b/vcblog/archive/2011/03/03/10136696.aspx


I know the question is fairly old, but have a look here :

LibFuzzy is a library for heuristically parsing C++ based on Clang's Lexer. The fuzzy parser is fault-tolerant, works without knowledge of the build system and on incomplete source files. As the parser necessarily makes guesses, the resulting syntax tree may be partially wrong.

It is a sub-project from clang-highlight, an (experimental?) tool which seems to be no longer developed.

I'm only interested in the fuzzy parsing part and forked it on my github page where I fixed several minor issues and made the tool autonomous (it can be compiled outside clang's source tree). Don't try to compile it with C++14 (which G++ 6's default mode), because there will be conflicts with make_unique .

According to this page, clang-format has its own fuzzy parser (and is actively developed), but the parser was (is ?) more tighly coupled to the tool.

链接地址: http://www.djcxy.com/p/94570.html

上一篇: 通过引用与getter通过复制获得两个不同的单词

下一篇: Clang模糊解析C ++