
[Store] Add a way to store document content when using Chroma DB #288


Draft: wants to merge 1 commit into main


dorrogeray
Contributor

| Q | A |
| --- | --- |
| Bug fix? | no |
| New feature? | yes |
| Docs? | yes |
| Issues | N/A |
| License | MIT |

This is just a draft for now, to explain what I am trying to achieve. I would like to store the original document content in the Chroma DB records, but VectorDocument currently offers no way to pass it in other than through the metadata. I suspect this requirement is not specific to Chroma DB: other databases used for embeddings are likely to support storing the original document content alongside the vectors as well.

Any suggestions on how to approach this holistically? Should the VectorDocument be expanded with an optional field like ?string $content = null?


@OskarStark
Contributor

> Should the VectorDocument be expanded with an optional field like ?string $content = null?

IIRC we had something like this in the past, didn't we, @chr-hertel?

Other than that, it could be a good idea I guess.

@chr-hertel
Member

Is the documents field in ChromaDB more like a document reference (name?) or the original content?


Other than that, with #57 I implemented a source field that references the "original" while splitting.
Maybe we can adopt that if we're also talking about a reference here - maybe that also has some synergy with #158

So, my theory here:
TextDocuments represent a textual version of another resource, e.g. a URL, an entity, a file, ...

The $id with a UUID as identifier is obviously too limited, but I would still like to have something to express those relations between documents.

Idea

  • TextDocument has an ID (string|int) based on the source/content/entity/url/file
  • TextDocument gets split into multiple documents, each of which keeps the parent ID as metadata
  • VectorDocument has the same ID, keeps the parent ID as metadata

(Instead of a simple ID we could also think of a more sophisticated reference object or something, but I'd be lazy on that.)
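
The ID flow described in the idea above could look roughly like the following sketch. All class and function names here (SketchTextDocument, SketchVectorDocument, splitDocument, vectorize) are hypothetical stand-ins, not the library's classes; the '#index' suffix for chunk IDs and the one-element placeholder vector are assumptions made purely for illustration.

```php
<?php

// Hypothetical stand-ins for illustration only; not the library's classes.
final class SketchTextDocument
{
    /** @param array<string, mixed> $metadata */
    public function __construct(
        public readonly string|int $id,
        public readonly string $content,
        public readonly array $metadata = [],
    ) {
    }
}

final class SketchVectorDocument
{
    /**
     * @param list<float>          $vector
     * @param array<string, mixed> $metadata
     */
    public function __construct(
        public readonly string|int $id,
        public readonly array $vector,
        public readonly array $metadata = [],
    ) {
    }
}

/**
 * Splits a document into fixed-size byte chunks. Each chunk gets an ID
 * derived from the parent and keeps the parent ID as metadata.
 *
 * @return list<SketchTextDocument>
 */
function splitDocument(SketchTextDocument $document, int $chunkSize): array
{
    $chunks = [];
    foreach (str_split($document->content, $chunkSize) as $index => $text) {
        $chunks[] = new SketchTextDocument(
            $document->id.'#'.$index,
            $text,
            // The left operand wins on key collisions, so parent_id is always set.
            ['parent_id' => $document->id] + $document->metadata,
        );
    }

    return $chunks;
}

/** Placeholder embedding; a real vectorizer would call a model here. */
function vectorize(SketchTextDocument $chunk): SketchVectorDocument
{
    $vector = [(float) strlen($chunk->content)];

    // Same ID as the text chunk; the parent ID travels along in the metadata.
    return new SketchVectorDocument($chunk->id, $vector, $chunk->metadata);
}
```

With this shape, a store can always trace a vector back to its chunk (same ID) and to the original resource (parent_id), without relying on a UUID that carries no meaning.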

Now
If a store, like Chroma, supports storing or referencing the original documents, its implementation can support that by promoting a value from the metadata.
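
As a rough sketch of what "promoting something from metadata" might mean inside a store: the function below builds the parallel ids / embeddings / metadatas / documents lists that a Chroma-style batch insert expects. The function name buildChromaPayload, the input array shape, and the default 'text' key are assumptions for illustration; it does not call any real PHP client, and whether a given client accepts null entries in documents would need to be checked.

```php
<?php

/**
 * Builds parallel lists for a Chroma-style batch insert (hypothetical helper).
 *
 * @param list<array{id: string, vector: list<float>, metadata: array<string, mixed>}> $vectorDocuments
 *
 * @return array{ids: list<string>, embeddings: list<list<float>>, metadatas: list<array<string, mixed>>, documents: list<?string>}
 */
function buildChromaPayload(array $vectorDocuments, string $contentKey = 'text'): array
{
    $payload = ['ids' => [], 'embeddings' => [], 'metadatas' => [], 'documents' => []];

    foreach ($vectorDocuments as $document) {
        $metadata = $document['metadata'];

        // Promote the original content out of the metadata, if present,
        // so it is not stored twice.
        $content = $metadata[$contentKey] ?? null;
        unset($metadata[$contentKey]);

        $payload['ids'][] = $document['id'];
        $payload['embeddings'][] = $document['vector'];
        $payload['metadatas'][] = $metadata;
        $payload['documents'][] = is_string($content) ? $content : null;
    }

    return $payload;
}
```

The trade-off is visible here: the store has to know which metadata key carries the content, which is exactly the implicit coupling a typed property or a shared enum would make explicit.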

@dorrogeray
Contributor Author

> Is the documents field in ChromaDB more like a document reference (name?) or the original content?

It's the original content, the string from which the embedding vectors are generated.

> Other than that, with #57 I implemented a source field that references the "original" while splitting. Maybe we can adopt that if we're also talking about a reference here - maybe that also has some synergy with #158

Looking at #57, I can see that the text field is added to the metadata, which is then passed on to the VectorDocument, so the needed information gets delivered to the store implementation.

The text metadata field on the TextDocument seems to be a duplicate, since there already is TextDocument::$content, but because it's there, it gets passed into the VectorDocument.

What concerns me a bit is that kind-of standardized metadata fields are now being introduced, and individual store implementations will have to rely on them to implement certain features (for example, taking VectorDocument->metadata->text and using it to populate documents when batch-storing data into Chroma DB).

The Metadata approach is flexible, but if there are going to be some standardized fields like:

  • text
  • parent_id
  • source

Then those should probably be enumerated in some MetadataFields enum and referenced in the codebase through this enum, so that finding where these fields are relied on is simple in IDEs?
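
A minimal sketch of such an enum, assuming PHP 8.1+ backed enums. The name MetadataFields comes from the suggestion above; the case names and the readMetadata helper are hypothetical and only show how a store could look up a field through the enum instead of a bare string.

```php
<?php

// Hypothetical enum; the backing values mirror the field names listed above.
enum MetadataFields: string
{
    case Text = 'text';
    case ParentId = 'parent_id';
    case Source = 'source';
}

/**
 * Reads a standardized field from a plain metadata array.
 *
 * @param array<string, mixed> $metadata
 */
function readMetadata(array $metadata, MetadataFields $field): mixed
{
    return $metadata[$field->value] ?? null;
}
```

Every usage of MetadataFields::Text then shows up in "find usages", which a scattered 'text' string literal would not.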

The alternative would be to have these (or some of these) as actual fields on the VectorDocument, like:

  • VectorDocument::$content (nullable)
  • VectorDocument::$parentId (nullable)
  • VectorDocument::$source (nullable)

That way they would be more strictly typed, keeping the Metadata clean for more implementation-specific fields.
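
For comparison, the promoted-property variant might look like the sketch below. VectorDocumentSketch is a hypothetical class written for this discussion, not the current VectorDocument; the string ID and the plain float list for the vector are simplifying assumptions.

```php
<?php

// Hypothetical shape for discussion; not the current VectorDocument class.
final class VectorDocumentSketch
{
    /**
     * @param list<float>          $vector
     * @param array<string, mixed> $metadata implementation-specific extras only
     */
    public function __construct(
        public readonly string $id,
        public readonly array $vector,
        public readonly array $metadata = [],
        public readonly ?string $content = null,
        public readonly ?string $parentId = null,
        public readonly ?string $source = null,
    ) {
    }
}

// Named arguments keep call sites readable when most fields stay null.
$document = new VectorDocumentSketch(
    id: 'doc-1#0',
    vector: [0.12, 0.34],
    content: 'The text the embedding was generated from.',
    parentId: 'doc-1',
);
```

Because all three new parameters are nullable with a null default, existing call sites that pass only the ID, vector and metadata would keep working under this shape.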

I am not sure which approach is better overall, but VectorDocument::$content seems generic enough to me to be promoted to a property. On the other hand, I am not sure why the text metadata field is needed if it's already in TextDocument::$content (unless the reason is to pass it to VectorDocument via metadata): https://github.com/symfony/ai/pull/57/files#diff-b16e907e2249b7da92861730a93ea33b39b43cef67d7f937fea56ba771a8cd19R61

@dorrogeray
Contributor Author

@chr-hertel Should I go ahead and add a private ?string $content = null field to the VectorDocument? Would that be acceptable?


3 participants