[IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode #149214

Open · svkeerthy wants to merge 2 commits into base: main


svkeerthy (Contributor) commented Jul 16, 2025

Add entity mapping mode to llvm-ir2vec and improve triplet generation format for knowledge graph embedding training.

This change streamlines the workflow for training the vocabulary embeddings with IR2Vec by:

  1. Directly generating numeric IDs instead of requiring string-to-ID preprocessing
  2. Providing entity mappings in standard knowledge graph embedding format
  3. Structuring triplet output in train2id format compatible with knowledge graph embedding frameworks
  4. Adding metadata headers to simplify post-processing and training setup

These improvements make IR2Vec more compatible with standard knowledge graph embedding training pipelines and reduce the preprocessing steps needed before training.

See #149215 for more details on how it is used.

(Tracking issue - #141817)
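
For readers who want to consume the new triplet output, here is a minimal C++ sketch of a reader. It assumes only what the patch shows: a first line of the form `MAX_RELATION=N`, then one `head tail relation` row per line. The names `Triplet`, `TripletFile`, and `parseTriplets` are hypothetical and are not part of llvm-ir2vec; rejecting rows whose relation exceeds the header value is an extra sanity check added for illustration.

```cpp
#include <istream>
#include <optional>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One row of the triplet output, in the column order the tool writes:
// head entity ID, tail entity ID, relation ID.
struct Triplet {
  unsigned Head;
  unsigned Tail;
  unsigned Relation;
};

// Parsed contents of a triplet file (hypothetical helper type).
struct TripletFile {
  unsigned MaxRelation = 0;
  std::vector<Triplet> Rows;
};

// Parses a "MAX_RELATION=N" header followed by whitespace-separated ID rows.
// Returns std::nullopt if the header is missing or any row is malformed.
std::optional<TripletFile> parseTriplets(std::istream &In) {
  static const std::string Prefix = "MAX_RELATION=";
  std::string Line;
  if (!std::getline(In, Line) || Line.compare(0, Prefix.size(), Prefix) != 0)
    return std::nullopt;

  TripletFile Result;
  try {
    Result.MaxRelation =
        static_cast<unsigned>(std::stoul(Line.substr(Prefix.size())));
  } catch (const std::exception &) {
    return std::nullopt; // header value is not a number
  }

  while (std::getline(In, Line)) {
    // Skip blank lines (for example a trailing newline at end of file).
    if (Line.find_first_not_of(" \t\r") == std::string::npos)
      continue;
    std::istringstream Row(Line);
    Triplet T{};
    if (!(Row >> T.Head >> T.Tail >> T.Relation) ||
        T.Relation > Result.MaxRelation)
      return std::nullopt;
    Result.Rows.push_back(T);
  }
  return Result;
}
```

Because rows are split on any whitespace, the sketch accepts both tab-separated and space-separated files.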

@svkeerthy svkeerthy changed the title revamp-triplet-gen [IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode Jul 16, 2025
@svkeerthy svkeerthy marked this pull request as ready for review July 16, 2025 23:08
llvmbot (Member) commented Jul 16, 2025

@llvm/pr-subscribers-mlgo

@llvm/pr-subscribers-llvm-binary-utilities

Author: S. VenkataKeerthy (svkeerthy)


Patch is 20.46 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/149214.diff

4 Files Affected:

  • (modified) llvm/docs/CommandGuide/llvm-ir2vec.rst (+66-13)
  • (added) llvm/test/tools/llvm-ir2vec/entities.ll (+95)
  • (modified) llvm/test/tools/llvm-ir2vec/triplets.ll (+39-12)
  • (modified) llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp (+136-64)
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index 13fe4996b968f..56ece4f509f6e 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -13,17 +13,21 @@ DESCRIPTION
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation 
-for vocabulary training. It provides two main operation modes:
+for vocabulary training. It provides three main operation modes:
 
-1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
    training from LLVM IR.
 
-2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary 
+   training.
+
+3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
 The tool is designed to facilitate machine learning applications that work with
 LLVM IR by converting the IR into numerical representations that can be used by
-ML models.
+ML models. The triplet mode generates numeric IDs directly instead of string 
+triplets, streamlining the training data preparation workflow.
 
 .. note::
 
@@ -34,18 +38,46 @@ ML models.
 OPERATION MODES
 ---------------
 
+Triplet Generation and Entity Mapping Modes are used for preparing
+vocabulary and training data for knowledge graph embeddings. The Embedding Mode
+is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
+
+The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
+by modeling the relationships between opcodes, types, and operands as a knowledge
+graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
+triplets and entity mappings in the standard format used for knowledge graph
+embedding training (see 
+<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format> 
+for details).
+
 Triplet Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
-consisting of opcodes, types, and operands. These triplets can be used to train
-vocabularies for embedding generation.
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
+triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets 
+are generated in train2id format. The tool outputs numeric IDs directly using 
+the ir2vec::Vocabulary mapping infrastructure, eliminating the need for 
+string-to-ID preprocessing.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+
+Entity Mapping Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
+IR2Vec in entity2id format. This mode outputs all supported entities (opcodes, 
+types, and operands) with their corresponding numeric IDs, and is not specific to
+any LLVM IR file.
 
 Usage:
 
 .. code-block:: bash
 
-   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+   llvm-ir2vec --mode=entities -o entity2id.txt
 
 Embedding Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -67,6 +99,7 @@ OPTIONS
  Specify the operation mode. Valid values are:
 
  * ``triplets`` - Generate triplets for vocabulary training
+ * ``entities`` - Generate entity mappings for vocabulary training
  * ``embeddings`` - Generate embeddings using trained vocabulary (default)
 
 .. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS
 
    ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``, 
    ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding 
-   mode. These options are ignored in triplet mode.
+   mode. These options are ignored in triplet and entity modes.
 
 INPUT FILE FORMAT
 -----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
 Triplet Mode Output
 ~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, the output consists of lines containing space-separated triplets:
+In triplet mode, the output consists of numeric triplets in train2id format with
+metadata headers. The format includes:
+
+.. code-block:: text
+
+   MAX_RELATION=<max_relation_id>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   ...
+
+Each line after the metadata header represents one instruction relationship,
+with numeric IDs for head entity, tail entity, and relation. The metadata
+header (MAX_RELATION) gives the largest relation ID, for use in training setup.
+
+Entity Mode Output
+~~~~~~~~~~~~~~~~~~
+
+In entity mode, the output consists of entity mappings in the following format:
 
 .. code-block:: text
 
-   <opcode> <type> <operand1> <operand2> ...
+   <total_entities>
+   <entity_string>	<numeric_id>
+   <entity_string>	<numeric_id>
+   ...
 
-Each line represents the information of one instruction, with the opcode, type,
-and operands.
+The first line contains the total number of entities, followed by one entity
+mapping per line with tab-separated entity string and numeric ID.
 
 Embedding Mode Output
 ~~~~~~~~~~~~~~~~~~~~~
diff --git a/llvm/test/tools/llvm-ir2vec/entities.ll b/llvm/test/tools/llvm-ir2vec/entities.ll
new file mode 100644
index 0000000000000..57c3d6fa6d6c4
--- /dev/null
+++ b/llvm/test/tools/llvm-ir2vec/entities.ll
@@ -0,0 +1,95 @@
+; RUN: llvm-ir2vec --mode=entities | FileCheck %s
+
+CHECK: 92
+CHECK-NEXT: Ret     0
+CHECK-NEXT: Br      1
+CHECK-NEXT: Switch  2
+CHECK-NEXT: IndirectBr      3
+CHECK-NEXT: Invoke  4
+CHECK-NEXT: Resume  5
+CHECK-NEXT: Unreachable     6
+CHECK-NEXT: CleanupRet      7
+CHECK-NEXT: CatchRet        8
+CHECK-NEXT: CatchSwitch     9
+CHECK-NEXT: CallBr  10
+CHECK-NEXT: FNeg    11
+CHECK-NEXT: Add     12
+CHECK-NEXT: FAdd    13
+CHECK-NEXT: Sub     14
+CHECK-NEXT: FSub    15
+CHECK-NEXT: Mul     16
+CHECK-NEXT: FMul    17
+CHECK-NEXT: UDiv    18
+CHECK-NEXT: SDiv    19
+CHECK-NEXT: FDiv    20
+CHECK-NEXT: URem    21
+CHECK-NEXT: SRem    22
+CHECK-NEXT: FRem    23
+CHECK-NEXT: Shl     24
+CHECK-NEXT: LShr    25
+CHECK-NEXT: AShr    26
+CHECK-NEXT: And     27
+CHECK-NEXT: Or      28
+CHECK-NEXT: Xor     29
+CHECK-NEXT: Alloca  30
+CHECK-NEXT: Load    31
+CHECK-NEXT: Store   32
+CHECK-NEXT: GetElementPtr   33
+CHECK-NEXT: Fence   34
+CHECK-NEXT: AtomicCmpXchg   35
+CHECK-NEXT: AtomicRMW       36
+CHECK-NEXT: Trunc   37
+CHECK-NEXT: ZExt    38
+CHECK-NEXT: SExt    39
+CHECK-NEXT: FPToUI  40
+CHECK-NEXT: FPToSI  41
+CHECK-NEXT: UIToFP  42
+CHECK-NEXT: SIToFP  43
+CHECK-NEXT: FPTrunc 44
+CHECK-NEXT: FPExt   45
+CHECK-NEXT: PtrToInt        46
+CHECK-NEXT: IntToPtr        47
+CHECK-NEXT: BitCast 48
+CHECK-NEXT: AddrSpaceCast   49
+CHECK-NEXT: CleanupPad      50
+CHECK-NEXT: CatchPad        51
+CHECK-NEXT: ICmp    52
+CHECK-NEXT: FCmp    53
+CHECK-NEXT: PHI     54
+CHECK-NEXT: Call    55
+CHECK-NEXT: Select  56
+CHECK-NEXT: UserOp1 57
+CHECK-NEXT: UserOp2 58
+CHECK-NEXT: VAArg   59
+CHECK-NEXT: ExtractElement  60
+CHECK-NEXT: InsertElement   61
+CHECK-NEXT: ShuffleVector   62
+CHECK-NEXT: ExtractValue    63
+CHECK-NEXT: InsertValue     64
+CHECK-NEXT: LandingPad      65
+CHECK-NEXT: Freeze  66
+CHECK-NEXT: FloatTy 67
+CHECK-NEXT: FloatTy 68
+CHECK-NEXT: FloatTy 69
+CHECK-NEXT: FloatTy 70
+CHECK-NEXT: FloatTy 71
+CHECK-NEXT: FloatTy 72
+CHECK-NEXT: FloatTy 73
+CHECK-NEXT: VoidTy  74
+CHECK-NEXT: LabelTy 75
+CHECK-NEXT: MetadataTy      76
+CHECK-NEXT: UnknownTy       77
+CHECK-NEXT: TokenTy 78
+CHECK-NEXT: IntegerTy       79
+CHECK-NEXT: FunctionTy      80
+CHECK-NEXT: PointerTy       81
+CHECK-NEXT: StructTy        82
+CHECK-NEXT: ArrayTy 83
+CHECK-NEXT: VectorTy        84
+CHECK-NEXT: VectorTy        85
+CHECK-NEXT: PointerTy       86
+CHECK-NEXT: UnknownTy       87
+CHECK-NEXT: Function        88
+CHECK-NEXT: Pointer 89
+CHECK-NEXT: Constant        90
+CHECK-NEXT: Variable        91
diff --git a/llvm/test/tools/llvm-ir2vec/triplets.ll b/llvm/test/tools/llvm-ir2vec/triplets.ll
index d1ef5b388e258..dcd1dc9afb478 100644
--- a/llvm/test/tools/llvm-ir2vec/triplets.ll
+++ b/llvm/test/tools/llvm-ir2vec/triplets.ll
@@ -24,15 +24,42 @@ entry:
   ret i32 %result
 }
 
-; TRIPLETS: Add IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
-; TRIPLETS-NEXT: Mul IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
-; TRIPLETS-NEXT: Alloca PointerTy Constant
-; TRIPLETS-NEXT: Alloca PointerTy Constant
-; TRIPLETS-NEXT: Store VoidTy Variable Pointer
-; TRIPLETS-NEXT: Store VoidTy Variable Pointer
-; TRIPLETS-NEXT: Load IntegerTy Pointer
-; TRIPLETS-NEXT: Load IntegerTy Pointer
-; TRIPLETS-NEXT: Add IntegerTy Variable Variable
-; TRIPLETS-NEXT: Ret VoidTy Variable
+; TRIPLETS: MAX_RELATION=3
+; TRIPLETS-NEXT: 12      79      0
+; TRIPLETS-NEXT: 12      91      2
+; TRIPLETS-NEXT: 12      91      3
+; TRIPLETS-NEXT: 12      0       1
+; TRIPLETS-NEXT: 0       74      0
+; TRIPLETS-NEXT: 0       91      2
+; TRIPLETS-NEXT: 16      79      0
+; TRIPLETS-NEXT: 16      91      2
+; TRIPLETS-NEXT: 16      91      3
+; TRIPLETS-NEXT: 16      0       1
+; TRIPLETS-NEXT: 0       74      0
+; TRIPLETS-NEXT: 0       91      2
+; TRIPLETS-NEXT: 30      81      0
+; TRIPLETS-NEXT: 30      90      2
+; TRIPLETS-NEXT: 30      30      1
+; TRIPLETS-NEXT: 30      81      0
+; TRIPLETS-NEXT: 30      90      2
+; TRIPLETS-NEXT: 30      32      1
+; TRIPLETS-NEXT: 32      74      0
+; TRIPLETS-NEXT: 32      91      2
+; TRIPLETS-NEXT: 32      89      3
+; TRIPLETS-NEXT: 32      32      1
+; TRIPLETS-NEXT: 32      74      0
+; TRIPLETS-NEXT: 32      91      2
+; TRIPLETS-NEXT: 32      89      3
+; TRIPLETS-NEXT: 32      31      1
+; TRIPLETS-NEXT: 31      79      0
+; TRIPLETS-NEXT: 31      89      2
+; TRIPLETS-NEXT: 31      31      1
+; TRIPLETS-NEXT: 31      79      0
+; TRIPLETS-NEXT: 31      89      2
+; TRIPLETS-NEXT: 31      12      1
+; TRIPLETS-NEXT: 12      79      0
+; TRIPLETS-NEXT: 12      91      2
+; TRIPLETS-NEXT: 12      91      3
+; TRIPLETS-NEXT: 12      0       1
+; TRIPLETS-NEXT: 0       74      0
+; TRIPLETS-NEXT: 0       91      2
diff --git a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
index 3e6cb4b64fde5..40257c0d6aba4 100644
--- a/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
+++ b/llvm/tools/llvm-ir2vec/llvm-ir2vec.cpp
@@ -9,13 +9,20 @@
 /// \file
 /// This file implements the IR2Vec embedding generation tool.
 ///
-/// This tool provides two main functionalities:
+/// This tool provides three main modes:
 ///
 /// 1. Triplet Generation Mode (--mode=triplets):
-///    Generates triplets (opcode, type, operands) for vocabulary training.
-///    Usage: llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+///    Generates numeric triplets (head, tail, relation) for vocabulary
+///    training. Output format: MAX_RELATION=N header followed by
+///    head\ttail\trelation lines. Relations: 0=Type, 1=Next, 2+=Arg0,Arg1,...
+///    Usage: llvm-ir2vec --mode=triplets input.bc -o train2id.txt
 ///
-/// 2. Embedding Generation Mode (--mode=embeddings):
+/// 2. Entities Generation Mode (--mode=entities):
+///    Generates entity mappings for vocabulary training.
+///    Output format: <total_entities> header followed by entity\tid lines.
+///    Usage: llvm-ir2vec --mode=entities -o entity2id.txt
+///
+/// 3. Embedding Generation Mode (--mode=embeddings):
 ///    Generates IR2Vec embeddings using a trained vocabulary.
 ///    Usage: llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json
 ///    --level=func input.bc -o embeddings.txt Levels: --level=inst
@@ -61,16 +68,19 @@ static cl::opt<std::string> OutputFilename("o", cl::desc("Output filename"),
 
 enum ToolMode {
   TripletMode,  // Generate triplets for vocabulary training
+  EntityMode,   // Generate entity mappings for vocabulary training
   EmbeddingMode // Generate embeddings using trained vocabulary
 };
 
-static cl::opt<ToolMode>
-    Mode("mode", cl::desc("Tool operation mode:"),
-         cl::values(clEnumValN(TripletMode, "triplets",
-                               "Generate triplets for vocabulary training"),
-                    clEnumValN(EmbeddingMode, "embeddings",
-                               "Generate embeddings using trained vocabulary")),
-         cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory));
+static cl::opt<ToolMode> Mode(
+    "mode", cl::desc("Tool operation mode:"),
+    cl::values(clEnumValN(TripletMode, "triplets",
+                          "Generate triplets for vocabulary training"),
+               clEnumValN(EntityMode, "entities",
+                          "Generate entity mappings for vocabulary training"),
+               clEnumValN(EmbeddingMode, "embeddings",
+                          "Generate embeddings using trained vocabulary")),
+    cl::init(EmbeddingMode), cl::cat(IR2VecToolCategory));
 
 static cl::opt<std::string>
     FunctionName("function", cl::desc("Process specific function only"),
@@ -95,6 +105,13 @@ static cl::opt<EmbeddingLevel>
 
 namespace {
 
+/// Relation types for triplet generation
+enum RelationType {
+  TypeRelation = 0, ///< Instruction to type relationship
+  NextRelation = 1, ///< Sequential instruction relationship
+  ArgRelation = 2   ///< Instruction to operand relationship (ArgRelation + N)
+};
+
 /// Helper class for collecting IR triplets and generating embeddings
 class IR2VecTool {
 private:
@@ -116,25 +133,96 @@ class IR2VecTool {
     return Vocab->isValid();
   }
 
-  /// Generate triplets for the entire module
+  /// Generate triplets for the module
+  /// Output format: MAX_RELATION=N header followed by relationships
   void generateTriplets(raw_ostream &OS) const {
-    for (const Function &F : M)
-      generateTriplets(F, OS);
+    unsigned MaxRelation = NextRelation; // Track maximum relation ID
+    std::string Relationships;
+    raw_string_ostream RelOS(Relationships);
+
+    for (const Function &F : M) {
+      unsigned FuncMaxRelation = generateTriplets(F, RelOS);
+      MaxRelation = std::max(MaxRelation, FuncMaxRelation);
+    }
+
+    RelOS.flush();
+
+    // Write metadata header followed by relationships
+    OS << "MAX_RELATION=" << MaxRelation << '\n';
+    OS << Relationships;
   }
 
   /// Generate triplets for a single function
-  void generateTriplets(const Function &F, raw_ostream &OS) const {
+  /// Returns the maximum relation ID used in this function
+  unsigned generateTriplets(const Function &F, raw_ostream &OS) const {
     if (F.isDeclaration())
-      return;
+      return 0;
+
+    unsigned MaxRelation = 1;
+    unsigned PrevOpcode = 0;
+    bool HasPrevOpcode = false;
+
+    for (const BasicBlock &BB : F) {
+      for (const auto &I : BB.instructionsWithoutDebug()) {
+        unsigned Opcode = Vocabulary::getNumericID(I.getOpcode());
+        unsigned TypeID = Vocabulary::getNumericID(I.getType()->getTypeID());
+
+        // Add "Next" relationship with previous instruction
+        if (HasPrevOpcode) {
+          OS << PrevOpcode << '\t' << Opcode << '\t' << NextRelation << '\n';
+          LLVM_DEBUG(dbgs()
+                     << Vocabulary::getVocabKeyForOpcode(PrevOpcode + 1) << '\t'
+                     << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t'
+                     << "Next\n");
+        }
 
-    std::string LocalOutput;
-    raw_string_ostream LocalOS(LocalOutput);
+        // Add "Type" relationship
+        OS << Opcode << '\t' << TypeID << '\t' << TypeRelation << '\n';
+        LLVM_DEBUG(
+            dbgs() << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t'
+                   << Vocabulary::getVocabKeyForTypeID(I.getType()->getTypeID())
+                   << '\t' << "Type\n");
+
+        // Add "Arg" relationships
+        unsigned ArgIndex = 0;
+        for (const Use &U : I.operands()) {
+          unsigned OperandID = Vocabulary::getNumericID(U.get());
+          unsigned RelationID = ArgRelation + ArgIndex;
+          OS << Opcode << '\t' << OperandID << '\t' << RelationID << '\n';
+
+          LLVM_DEBUG({
+            StringRef OperandStr = Vocabulary::getVocabKeyForOperandKind(
+                Vocabulary::getOperandKind(U.get()));
+            dbgs() << Vocabulary::getVocabKeyForOpcode(Opcode + 1) << '\t'
+                   << OperandStr << '\t' << "Arg" << ArgIndex << '\n';
+          });
+
+          ArgIndex++;
+        }
+        // Only update MaxRelation if there were operands
+        if (ArgIndex > 0) {
+          MaxRelation = std::max(MaxRelation, ArgRelation + ArgIndex - 1);
+        }
+        PrevOpcode = Opcode;
+        HasPrevOpcode = true;
+      }
+    }
 
-    for (const BasicBlock &BB : F)
-      traverseBasicBlock(BB, LocalOS);
+    return MaxRelation;
+  }
 
-    LocalOS.flush();
-    OS << LocalOutput;
+  /// Dump entity ID to string mappings
+  static void generateEntityMappings(raw_ostream &OS) {
+    // FIXME: Currently, the generated entity mappings are not one-to-one;
+    // Multiple TypeIDs map to same string key (Like Half, BFloat, etc. map to
+    // FloatTy). This would hinder learning good seed embeddings.
+    // We should fix this in the future by ensuring unique string keys either by
+    // post-processing here without changing the mapping in ir2vec::Vocabulary,
+    // or by changing the Vocabulary generation logic to ensure unique keys.
+    auto EntityLen = Vocabulary::expectedSize();
+    OS << EntityLen << "\n";
+    for (unsigned EntityID = 0; EntityID < EntityLen; ++EntityID)
+      OS << Vocabulary::getStringKey(EntityID) << '\t' << EntityID << '\n';
   }
 
   /// Generate embeddings for the entire module
@@ -198,27 +286,6 @@ class IR2VecTool {
     }
     }
   }
-
-private:
-  /// Process a single basic block for triplet generation
-  void traverseBasicBlock(const BasicBlock &BB, raw_string_ostream &OS) const {
-    // Consider only non-debug and non-pseudo instructions
-    for (const auto &I : BB.instructionsWithoutDebug()) {
-      StringRef OpcStr = Vocabulary::getVocabKeyForOpcode(I.getOpcode());
-      StringRef TypeStr =
-          Vocabulary::getVocabKeyForTypeID(I.getType()->getTypeID());
-
-      OS << '\n' << OpcStr << ' ' << TypeStr << ' ';
-
-      LLVM_DEBUG(I.print(dbgs()); dbgs() << "\n");
-      LLVM_DEBUG(I.getType()->print(dbgs()); dbgs() << " Type\n");
-
-      for (const Use &U : I.operands())
-        OS << Vocabulary::getVocabKeyForOperandKind(
-                  Vocabulary::getOperandKind(U.get()))
-           << ' ';
-    }
-  }
 };
 
 Error processModule(Module &M, raw_ostream &OS) {
@@ -246,18 +313,7 @@ Error processModule(Module &M, raw_ostream &OS) {
       Tool.generateEmbeddings(OS);
     }
   } else {
-    // Triplet generation mode - no vocabulary needed
-    if (!FunctionName.empty())
-      // Process single function
-      if (const Function *F = M.getFunction(FunctionName))
-        Tool.generateTriplets(*F, OS);
-      else
-        return createStringError(errc::invalid_argument,
-                                 "Function '%s' not found",
-                                 FunctionName.c_str());
-    else
-      // Process all functions
-      Tool.generateTriplets(OS);
+    Tool.generateTriplets(OS);
   }
   return Error::success();
 }
@@ -280,9 +336,32 @@ int main(int argc, char **argv) {
       "See https://llvm.org/docs/CommandGuide/llvm-ir2vec.html for more "
       "information.\n");
 
+  // Validate input file requirement
+  if (InputFilename.empty() && Mode != EntityMode) {
+    errs() << "Error: Input file (.bc/.ll) or stdin (-) is required\n";
+    return 1;
+  }
+
   // Validate command line options
-  if (Mode == TripletMode && Level.getNumOccurrences() > 0)
-    errs() << "Warning: --level option is ignored in triplet mode\n";
+  if (Mode != EmbeddingMode) {
+    if (Level.getNumOccurrences() > 0)
+      errs() << "Warning: --level option is ignored\n";
+    if (FunctionName.getNumOccurrences() > 0)
+      errs() << "Warning: --function option is ignored\n";
+  }
+
+  std::error_code EC;
+  raw_fd_ostream OS(OutputFilename, EC);
+  if (EC) {
+    errs() << "Error opening output file: " << EC.message() << "\n";
+    return 1;
+  }
+
+  if (Mode == EntityMode) {
+    // Just dump entity mappings without processing any IR
+    IR2VecTool::generateEntityMappings(OS);
+    return 0;
+  }
 
   ...
[truncated]
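
The relation numbering in the patch (0 = Type, 1 = Next, 2 + N = operand N) can be illustrated without LLVM. The sketch below is a standalone approximation of the per-function loop in `generateTriplets`, assuming instructions have already been reduced to numeric IDs; `InstIDs`, `emitFunctionTriplets`, and `functionTripletsAsString` are hypothetical names, not part of the tool.

```cpp
#include <algorithm>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction whose opcode, type, and operand
// kinds were already mapped to numeric IDs. This is not an LLVM type.
struct InstIDs {
  unsigned Opcode;
  unsigned Type;
  std::vector<unsigned> Operands;
};

constexpr unsigned TypeRelation = 0; // instruction -> its type
constexpr unsigned NextRelation = 1; // previous instruction -> this one
constexpr unsigned ArgRelation = 2;  // operand N uses ArgRelation + N

// Writes head\ttail\trelation rows for one function and returns the largest
// relation ID used (never less than NextRelation).
unsigned emitFunctionTriplets(const std::vector<InstIDs> &Insts,
                              std::ostream &OS) {
  unsigned MaxRelation = NextRelation;
  const InstIDs *Prev = nullptr;
  for (const InstIDs &I : Insts) {
    // "Next" row links the previous opcode to the current one.
    if (Prev)
      OS << Prev->Opcode << '\t' << I.Opcode << '\t' << NextRelation << '\n';
    // "Type" row.
    OS << I.Opcode << '\t' << I.Type << '\t' << TypeRelation << '\n';
    // One "Arg" row per operand, with an increasing relation ID.
    unsigned ArgIndex = 0;
    for (unsigned Operand : I.Operands) {
      OS << I.Opcode << '\t' << Operand << '\t' << (ArgRelation + ArgIndex)
         << '\n';
      ++ArgIndex;
    }
    if (ArgIndex > 0)
      MaxRelation = std::max(MaxRelation, ArgRelation + ArgIndex - 1);
    Prev = &I;
  }
  return MaxRelation;
}

// Convenience wrapper that returns the rows as a string.
std::string functionTripletsAsString(const std::vector<InstIDs> &Insts) {
  std::ostringstream OS;
  emitFunctionTriplets(Insts, OS);
  return OS.str();
}
```

Feeding it `{{12, 79, {91, 91}}, {0, 74, {91}}}` (an `Add` followed by a `Ret`, using the IDs from the test expectations) yields the first six rows checked in `triplets.ll`.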

@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from 42f9479 to 528ac7b Compare July 16, 2025 23:32
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec branch from 6efc8a8 to 9e17794 Compare July 16, 2025 23:32
@llvmbot llvmbot added the mlgo label Jul 16, 2025
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec branch from 9e17794 to 1e22261 Compare July 16, 2025 23:46
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch 2 times, most recently from 09d483a to 3f8c21f Compare July 17, 2025 18:04
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec branch 2 times, most recently from 0903552 to 7fee589 Compare July 17, 2025 19:12
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from 3f8c21f to dff3bdb Compare July 17, 2025 19:12
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec branch from 7fee589 to 36ecab5 Compare July 17, 2025 19:55
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch 2 times, most recently from ea01937 to b209998 Compare July 17, 2025 19:58
svkeerthy added a commit that referenced this pull request Jul 17, 2025
…#149212)

Add helper methods to IR2Vec's Vocabulary class for numeric ID mapping and vocabulary size calculation. These APIs will be useful in triplet generation for `llvm-ir2vec` tool (See #149214). 

(Tracking issue - #141817)
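
Since the entity output depends on this numeric ID mapping, and the FIXME in `generateEntityMappings` notes that several IDs can share one string key, a small checker can make the duplicates visible. The following is a sketch under stated assumptions: the input is a count line followed by `<entity>\t<id>` rows, and entity strings contain no whitespace (true for the keys in `entities.ll`). `groupEntityIDs` and `duplicatedKeys` are hypothetical helper names.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

using EntityGroups = std::map<std::string, std::vector<unsigned>>;

// Reads "<count>" and then "<entity>\t<id>" rows, grouping IDs by entity
// string. Returns an empty map if the header is missing or if the number of
// parsed rows does not match the header count.
EntityGroups groupEntityIDs(std::istream &In) {
  EntityGroups ByKey;
  std::string Line;
  std::size_t Expected = 0, Seen = 0;
  if (!std::getline(In, Line))
    return {};
  std::istringstream Header(Line);
  if (!(Header >> Expected))
    return {};

  while (std::getline(In, Line)) {
    std::istringstream Row(Line);
    std::string Key;
    unsigned ID = 0;
    if (!(Row >> Key >> ID))
      continue; // skip blank or malformed lines
    ByKey[Key].push_back(ID);
    ++Seen;
  }
  if (Seen != Expected)
    return {};
  return ByKey;
}

// Returns the entity strings that map to more than one numeric ID,
// in sorted order.
std::vector<std::string> duplicatedKeys(const EntityGroups &ByKey) {
  std::vector<std::string> Out;
  for (const auto &[Key, IDs] : ByKey)
    if (IDs.size() > 1)
      Out.push_back(Key);
  return Out;
}
```

On the expected output in `entities.ll`, such a check would flag keys like `FloatTy`, which appears for IDs 67 through 73.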
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec branch from 32275ce to a47b7f7 Compare July 17, 2025 20:41
Base automatically changed from users/svkeerthy/07-16-support-stdin-input-llvm-ir2vec to main July 17, 2025 20:43
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from b209998 to 2f6e0b4 Compare July 17, 2025 20:45