[IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode #149214

Open
wants to merge 2 commits into
base: main
79 changes: 66 additions & 13 deletions llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -13,17 +13,21 @@ DESCRIPTION

:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
generates IR2Vec embeddings for LLVM IR and supports triplet generation
for vocabulary training. It provides two main operation modes:
for vocabulary training. It provides three main operation modes:

1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
training from LLVM IR.

2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
training.

3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
at different granularity levels (instruction, basic block, or function).

The tool is designed to facilitate machine learning applications that work with
LLVM IR by converting the IR into numerical representations that can be used by
ML models.
ML models. The triplet mode generates numeric IDs directly instead of string
triplets, streamlining the training data preparation workflow.

.. note::

@@ -34,18 +38,46 @@ ML models.
OPERATION MODES
---------------

Triplet Generation and Entity Mapping Modes are used for preparing
vocabulary and training data for knowledge graph embeddings. The Embedding Mode
is used for generating embeddings from LLVM IR using a pre-trained vocabulary.

The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
by modeling the relationships between opcodes, types, and operands as a knowledge
graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
triplets and entity mappings in the standard format used for knowledge graph
embedding training (see
<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
for details).

Triplet Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~

In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
consisting of opcodes, types, and operands. These triplets can be used to train
vocabularies for embedding generation.
In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
are generated in train2id format. The tool outputs numeric IDs directly using
the ir2vec::Vocabulary mapping infrastructure, eliminating the need for
string-to-ID preprocessing.

Usage:

.. code-block:: bash

llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt

Entity Mapping Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
types, and operands) with their corresponding numeric IDs, and does not depend on
any particular LLVM IR input file.

Usage:

.. code-block:: bash

llvm-ir2vec --mode=triplets input.bc -o triplets.txt
llvm-ir2vec --mode=entities -o entity2id.txt

Embedding Generation Mode
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
Specify the operation mode. Valid values are:

* ``triplets`` - Generate triplets for vocabulary training
* ``entities`` - Generate entity mappings for vocabulary training
* ``embeddings`` - Generate embeddings using trained vocabulary (default)

.. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS

``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
mode. These options are ignored in triplet mode.
mode. These options are ignored in triplet and entity modes.

INPUT FILE FORMAT
-----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
Triplet Mode Output
~~~~~~~~~~~~~~~~~~~

In triplet mode, the output consists of lines containing space-separated triplets:
In triplet mode, the output consists of numeric triplets in train2id format with
metadata headers. The format includes:

.. code-block:: text

MAX_RELATION=<max_relation_id>
<head_entity_id> <tail_entity_id> <relation_id>
<head_entity_id> <tail_entity_id> <relation_id>
...

Each line after the metadata header represents one instruction relationship,
with numeric IDs for the head entity, tail entity, and relation, in that order.
The metadata header (MAX_RELATION) records the largest relation ID that appears
in the output, for use in post-processing and training setup.
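
As a rough illustration of how a consumer might read this output, the following Python sketch splits the text into metadata headers and numeric triplets. It is not part of the tool or the PR: the function name ``parse_train2id`` and the sample string are assumptions made for the example. The parser treats any ``KEY=VALUE`` line as metadata, so it does not depend on the exact header spelling.

```python
from typing import Dict, List, Tuple

# (head_entity_id, tail_entity_id, relation_id), in the column order shown above.
Triplet = Tuple[int, int, int]


def parse_train2id(text: str) -> Tuple[Dict[str, int], List[Triplet]]:
    """Split triplet-mode output into metadata headers and numeric triplets.

    Hypothetical helper for illustration; not provided by llvm-ir2vec.
    """
    metadata: Dict[str, int] = {}
    triplets: List[Triplet] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            # Metadata header of the form KEY=VALUE; the key is kept as-is.
            key, value = line.split("=", 1)
            metadata[key.strip()] = int(value)
            continue
        head, tail, relation = (int(field) for field in line.split())
        triplets.append((head, tail, relation))
    return metadata, triplets


# Sample text copied from the first lines expected by the triplets.ll test.
sample = "MAX_RELATION=3\n12 79 0\n12 91 2\n12 91 3\n12 0 1\n"
metadata, triplets = parse_train2id(sample)
print(metadata)
print(triplets)
```

A line that is neither a header nor three integers raises ``ValueError``, which is usually preferable to silently skipping malformed training data.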

Entity Mode Output
~~~~~~~~~~~~~~~~~~

In entity mode, the output consists of entity mappings in the following format:

.. code-block:: text

<opcode> <type> <operand1> <operand2> ...
<total_entities>
<entity_string> <numeric_id>
<entity_string> <numeric_id>
...

Each line represents the information of one instruction, with the opcode, type,
and operands.
The first line contains the total number of entities. Each following line holds
one entity mapping: the entity string and its numeric ID, separated by a tab.
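
The sketch below shows one possible way to load such a file in Python and check it against the count on the first line. The helper name ``parse_entity2id`` and the three-entry sample are assumptions for illustration only; the sample is a shortened stand-in, not real tool output. The result is keyed by numeric ID rather than by name, because the entities.ll test in this PR lists some names (for example ``FloatTy``) under several IDs.

```python
from typing import Dict


def parse_entity2id(text: str) -> Dict[int, str]:
    """Parse entity-mode output into an ID -> entity-name dictionary.

    Hypothetical helper for illustration; not provided by llvm-ir2vec.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty entity mapping")
    total = int(lines[0])  # first line: declared number of entities
    id_to_name: Dict[int, str] = {}
    for line in lines[1:]:
        # Split on the last run of whitespace, so a tab or a space both work.
        name, numeric_id = line.rsplit(None, 1)
        id_to_name[int(numeric_id)] = name
    if len(id_to_name) != total:
        raise ValueError(
            f"header says {total} entities, found {len(id_to_name)}"
        )
    return id_to_name


# Shortened, made-up sample: a real mapping lists every supported entity.
sample = "3\nRet\t0\nBr\t1\nSwitch\t2\n"
mapping = parse_entity2id(sample)
print(mapping[2])
```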

Embedding Mode Output
~~~~~~~~~~~~~~~~~~~~~
95 changes: 95 additions & 0 deletions llvm/test/tools/llvm-ir2vec/entities.ll
@@ -0,0 +1,95 @@
; RUN: llvm-ir2vec --mode=entities | FileCheck %s

CHECK: 92
CHECK-NEXT: Ret 0
CHECK-NEXT: Br 1
CHECK-NEXT: Switch 2
CHECK-NEXT: IndirectBr 3
CHECK-NEXT: Invoke 4
CHECK-NEXT: Resume 5
CHECK-NEXT: Unreachable 6
CHECK-NEXT: CleanupRet 7
CHECK-NEXT: CatchRet 8
CHECK-NEXT: CatchSwitch 9
CHECK-NEXT: CallBr 10
CHECK-NEXT: FNeg 11
CHECK-NEXT: Add 12
CHECK-NEXT: FAdd 13
CHECK-NEXT: Sub 14
CHECK-NEXT: FSub 15
CHECK-NEXT: Mul 16
CHECK-NEXT: FMul 17
CHECK-NEXT: UDiv 18
CHECK-NEXT: SDiv 19
CHECK-NEXT: FDiv 20
CHECK-NEXT: URem 21
CHECK-NEXT: SRem 22
CHECK-NEXT: FRem 23
CHECK-NEXT: Shl 24
CHECK-NEXT: LShr 25
CHECK-NEXT: AShr 26
CHECK-NEXT: And 27
CHECK-NEXT: Or 28
CHECK-NEXT: Xor 29
CHECK-NEXT: Alloca 30
CHECK-NEXT: Load 31
CHECK-NEXT: Store 32
CHECK-NEXT: GetElementPtr 33
CHECK-NEXT: Fence 34
CHECK-NEXT: AtomicCmpXchg 35
CHECK-NEXT: AtomicRMW 36
CHECK-NEXT: Trunc 37
CHECK-NEXT: ZExt 38
CHECK-NEXT: SExt 39
CHECK-NEXT: FPToUI 40
CHECK-NEXT: FPToSI 41
CHECK-NEXT: UIToFP 42
CHECK-NEXT: SIToFP 43
CHECK-NEXT: FPTrunc 44
CHECK-NEXT: FPExt 45
CHECK-NEXT: PtrToInt 46
CHECK-NEXT: IntToPtr 47
CHECK-NEXT: BitCast 48
CHECK-NEXT: AddrSpaceCast 49
CHECK-NEXT: CleanupPad 50
CHECK-NEXT: CatchPad 51
CHECK-NEXT: ICmp 52
CHECK-NEXT: FCmp 53
CHECK-NEXT: PHI 54
CHECK-NEXT: Call 55
CHECK-NEXT: Select 56
CHECK-NEXT: UserOp1 57
CHECK-NEXT: UserOp2 58
CHECK-NEXT: VAArg 59
CHECK-NEXT: ExtractElement 60
CHECK-NEXT: InsertElement 61
CHECK-NEXT: ShuffleVector 62
CHECK-NEXT: ExtractValue 63
CHECK-NEXT: InsertValue 64
CHECK-NEXT: LandingPad 65
CHECK-NEXT: Freeze 66
CHECK-NEXT: FloatTy 67
CHECK-NEXT: FloatTy 68
CHECK-NEXT: FloatTy 69
CHECK-NEXT: FloatTy 70
CHECK-NEXT: FloatTy 71
CHECK-NEXT: FloatTy 72
CHECK-NEXT: FloatTy 73
CHECK-NEXT: VoidTy 74
CHECK-NEXT: LabelTy 75
CHECK-NEXT: MetadataTy 76
CHECK-NEXT: UnknownTy 77
CHECK-NEXT: TokenTy 78
CHECK-NEXT: IntegerTy 79
CHECK-NEXT: FunctionTy 80
CHECK-NEXT: PointerTy 81
CHECK-NEXT: StructTy 82
CHECK-NEXT: ArrayTy 83
CHECK-NEXT: VectorTy 84
CHECK-NEXT: VectorTy 85
CHECK-NEXT: PointerTy 86
CHECK-NEXT: UnknownTy 87
CHECK-NEXT: Function 88
CHECK-NEXT: Pointer 89
CHECK-NEXT: Constant 90
CHECK-NEXT: Variable 91
51 changes: 39 additions & 12 deletions llvm/test/tools/llvm-ir2vec/triplets.ll
@@ -24,15 +24,42 @@ entry:
ret i32 %result
}

; TRIPLETS: Add IntegerTy Variable Variable
; TRIPLETS-NEXT: Ret VoidTy Variable
; TRIPLETS-NEXT: Mul IntegerTy Variable Variable
; TRIPLETS-NEXT: Ret VoidTy Variable
; TRIPLETS-NEXT: Alloca PointerTy Constant
; TRIPLETS-NEXT: Alloca PointerTy Constant
; TRIPLETS-NEXT: Store VoidTy Variable Pointer
; TRIPLETS-NEXT: Store VoidTy Variable Pointer
; TRIPLETS-NEXT: Load IntegerTy Pointer
; TRIPLETS-NEXT: Load IntegerTy Pointer
; TRIPLETS-NEXT: Add IntegerTy Variable Variable
; TRIPLETS-NEXT: Ret VoidTy Variable
; TRIPLETS: MAX_RELATION=3
; TRIPLETS-NEXT: 12 79 0
; TRIPLETS-NEXT: 12 91 2
; TRIPLETS-NEXT: 12 91 3
; TRIPLETS-NEXT: 12 0 1
; TRIPLETS-NEXT: 0 74 0
; TRIPLETS-NEXT: 0 91 2
; TRIPLETS-NEXT: 16 79 0
; TRIPLETS-NEXT: 16 91 2
; TRIPLETS-NEXT: 16 91 3
; TRIPLETS-NEXT: 16 0 1
; TRIPLETS-NEXT: 0 74 0
; TRIPLETS-NEXT: 0 91 2
; TRIPLETS-NEXT: 30 81 0
; TRIPLETS-NEXT: 30 90 2
; TRIPLETS-NEXT: 30 30 1
; TRIPLETS-NEXT: 30 81 0
; TRIPLETS-NEXT: 30 90 2
; TRIPLETS-NEXT: 30 32 1
; TRIPLETS-NEXT: 32 74 0
; TRIPLETS-NEXT: 32 91 2
; TRIPLETS-NEXT: 32 89 3
; TRIPLETS-NEXT: 32 32 1
; TRIPLETS-NEXT: 32 74 0
; TRIPLETS-NEXT: 32 91 2
; TRIPLETS-NEXT: 32 89 3
; TRIPLETS-NEXT: 32 31 1
; TRIPLETS-NEXT: 31 79 0
; TRIPLETS-NEXT: 31 89 2
; TRIPLETS-NEXT: 31 31 1
; TRIPLETS-NEXT: 31 79 0
; TRIPLETS-NEXT: 31 89 2
; TRIPLETS-NEXT: 31 12 1
; TRIPLETS-NEXT: 12 79 0
; TRIPLETS-NEXT: 12 91 2
; TRIPLETS-NEXT: 12 91 3
; TRIPLETS-NEXT: 12 0 1
; TRIPLETS-NEXT: 0 74 0
; TRIPLETS-NEXT: 0 91 2
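
To relate the new numeric checks to the old string triplets, the following Python sketch decodes the first six expected lines using entity IDs taken from the entities.ll test above. The relation names are an assumption inferred from the data (0 looks like the type relation, 1 the next-instruction relation, and 2 onward the operand positions); the PR text shown here does not name them. The ``describe`` helper is hypothetical.

```python
# Entity IDs copied from the entities.ll CHECK lines in this PR.
ENTITY_NAMES = {
    0: "Ret",
    12: "Add",
    74: "VoidTy",
    79: "IntegerTy",
    91: "Variable",
}

# Assumed relation meanings, inferred from the expected output; not confirmed
# by the documentation in this diff.
RELATION_NAMES = {0: "Type", 1: "Next"}


def describe(head: int, tail: int, relation: int) -> str:
    """Render one (head, tail, relation) triplet in readable form."""
    # Relations 2, 3, ... are treated as operand positions 0, 1, ...
    rel = RELATION_NAMES.get(relation, f"Arg{relation - 2}")
    return f"{ENTITY_NAMES[head]} --{rel}--> {ENTITY_NAMES[tail]}"


# First six expected triplets from the TRIPLETS checks above.
rows = [(12, 79, 0), (12, 91, 2), (12, 91, 3), (12, 0, 1), (0, 74, 0), (0, 91, 2)]
for row in rows:
    print(describe(*row))
```

Under these assumptions, the first four rows correspond to the old check ``Add IntegerTy Variable Variable`` plus a link to the following ``Ret``, and the last two to ``Ret VoidTy Variable``.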