Getting "actual" registers from MCInsts (x86)

2018-06-25 21:24:10

I'm using llvm-mc with the goal of making a relatively smart disassembler (identifying and tracking locals, easily following branches, etc), and part of that is creating a string representation of the disassembled instructions.

When I started this, I expected that I would be able to identify the registers and values used by MCInsts fairly easily and whip up another representation of my own that I could easily work with. However, after some investigation, I realized that the correlation between the operands shown in the textual representation of an instruction and the operands actually present in the MCInst object is fairly low. Here are a few examples (Intel syntax):

Moving, say, 11587 as a 32-bit immediate into eax would be done with the MOV32ri opcode. The textual representation would be mov eax, 11587. The corresponding MCInst would have two operands, a register and an immediate. This works for me. This is great.

Adding 11587 to eax would be done with the ADD32ri opcode. The textual representation would be add eax, 11587. However, this time, the corresponding MCInst has three operands: eax is there twice and the immediate is at the end. This isn't so great. I can assume that this is an artifact of the lowering process, that the first instance of eax is the destination register and that the second one is there to be the source (even though x86 does not distinguish between the two), and I can hack around that.

Moving a 32-bit value from an absolute memory address into eax would be done with the MOV32ao32 opcode. The textual representation would be mov eax, dword ptr [11587]. In this case, the MCInst doesn't even have an operand for eax; it can only be inferred from the operand type present in the opcode name. I can hack around that too, but things are getting less and less pretty and I've only tested 5-6 different instructions out of the 1300+ that x86 supports.

Obviously, for the purpose of showing text, I could get the textual representation with an MCInstPrinter, but the mapping between what's shown there and what the MCInst has is still muddy.

Is there a straightforward way to tell which operands appear in the textual representation of an instruction?

ADD having three operands sounds like a compiler builder's preference for three-address code bleeding through, since there is no justification for it in Intel assembly (you can't add and store to a different register with the ADD instruction, though you can with LEA).

The opcodes run into the hundreds if you count all extensions (like SSE, FPU, etc.), and, worse, there are multiple variants of an opcode due to addressing modes and prefixes.

The NASM assembler has some tables in the source that you could try to mine if your llvm-mc system doesn't provide the functionality.

The MC level is very low and the operand layout depends on the opcode. That said, there are mapping tables that tell you what is where. MCInstrDesc and MCOperandInfo will tell you which operands are sources and which are destinations, whether they are immediates, registers, etc., along with a set of flags.
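To make the idea of per-opcode mapping tables concrete, here is a minimal, self-contained sketch. It does not use LLVM: ToyInstrDesc, ToyInst, classify and withoutTiedDuplicates are hypothetical names, and the two table entries are hand-written assumptions that mirror what the question observed rather than LLVM's generated tables. In LLVM itself the corresponding information is exposed by MCInstrDesc, for example getNumDefs() and getOperandConstraint(i, MCOI::TIED_TO), plus a list of implicit definitions whose accessor name varies between versions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a per-opcode descriptor; this is NOT the LLVM class.
struct ToyInstrDesc {
    std::string name;
    std::size_t numDefs;                    // leading explicit operands that are written
    std::vector<int> tiedTo;                // per operand: index it is tied to, or -1
    std::vector<std::string> implicitDefs;  // registers written without an explicit operand
};

struct ToyInst {
    ToyInstrDesc desc;
    std::vector<std::string> operands;  // explicit operands, already rendered as text
};

struct Roles {
    std::vector<std::string> written;
    std::vector<std::string> read;
};

// Split an instruction's operands into what it writes and what it reads.
Roles classify(const ToyInst& inst) {
    Roles roles;
    for (std::size_t i = 0; i < inst.operands.size(); ++i) {
        if (i < inst.desc.numDefs)
            roles.written.push_back(inst.operands[i]);
        else
            roles.read.push_back(inst.operands[i]);
    }
    // Implicit definitions never appear in the operand list.
    for (const std::string& reg : inst.desc.implicitDefs)
        roles.written.push_back(reg);
    return roles;
}

// Explicit operands with tied duplicates dropped, so each is listed once.
std::vector<std::string> withoutTiedDuplicates(const ToyInst& inst) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < inst.operands.size(); ++i) {
        bool tied = i < inst.desc.tiedTo.size() && inst.desc.tiedTo[i] >= 0;
        if (!tied)
            out.push_back(inst.operands[i]);
    }
    return out;
}

// Hand-written example entries (assumptions for illustration only).
ToyInst makeAdd32ri() {
    // One def, the second operand tied to the first, then the immediate.
    return ToyInst{{"ADD32ri", 1, {-1, 0, -1}, {"eflags"}},
                   {"eax", "eax", "11587"}};
}

ToyInst makeMov32ao32() {
    // No explicit def; eax is modelled as an implicit definition.
    // The memory operand is simplified to a single rendered string.
    return ToyInst{{"MOV32ao32", 0, {-1}, {"eax"}}, {"[11587]"}};
}
```

With these entries, classify(makeAdd32ri()) reports eax as both written and read, withoutTiedDuplicates collapses the repeated eax into one, and classify(makeMov32ao32()) recovers eax as written even though it is not in the operand list. The point of the design is that the consumer never special-cases an opcode; it only consults the descriptor.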

You'll also need to get familiar with MCRegisterClass and MCRegisterInfo and a bunch of other stuff. It's a complicated interface because the task of representing arbitrary target information is complicated.

I would look at the code for the various MC-based tools to get started. You shouldn't need your own representation; MC should have everything you need.
