[wip] use html_text to get element text content in microdata #114

Merged 1 commit on Jul 19, 2019

@kmike (Member) commented on Jun 4, 2019

This is an unfinished fix for #113; please don't merge. I've opened the PR just to share the initial code. It has these problems:

  • Cleaning can be too aggressive: e.g. <style> elements are removed. This may cause missing extractions if microdata is set on these elements.
  • Tests are failing (I haven't looked at them; probably the fix is not correct at all :)
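
To illustrate the first problem, here is a minimal stdlib-only sketch. It is not the PR's code: `drop_tags` is a hypothetical stand-in for an aggressive cleaner, and the sample markup is made up for illustration. It only shows how an `itemprop` can disappear when the element carrying it is removed before extraction.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one property sits on a <style> element.
DOC = """
<div itemscope="" itemtype="http://schema.org/Thing">
  <span itemprop="name">Example</span>
  <style itemprop="description">.x { color: red }</style>
</div>
"""

def itemprops(root):
    """Collect itemprop attribute values in document order."""
    return [el.get("itemprop") for el in root.iter()
            if el.get("itemprop") is not None]

def drop_tags(root, tags):
    """Remove every element whose tag is in `tags` (stand-in for a cleaner)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in tags:
                parent.remove(child)

root = ET.fromstring(DOC)
before = itemprops(root)
drop_tags(root, {"style", "script"})
after = itemprops(root)

print(before)  # ['name', 'description']
print(after)   # ['name']
```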

The problem with not cleaning the HTML tree once, up front, is that it could make the algorithm O(N^2), but maybe that's fine.
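
A rough way to see where the O(N^2) could come from, assuming the alternative is to clean each element's subtree separately: with nested elements, inner nodes get revisited once per ancestor. The sketch below is stdlib-only and not the PR's code; the function names are hypothetical, and it counts node visits rather than doing any real cleaning.

```python
import xml.etree.ElementTree as ET

def build_nested(depth):
    """Build a chain of `depth` nested <div> elements (a worst case)."""
    root = ET.Element("div")
    node = root
    for _ in range(depth - 1):
        node = ET.SubElement(node, "div")
    return root

def visits_clean_once(root):
    """Clean the whole tree once: each node is visited a single time."""
    return sum(1 for _ in root.iter())

def visits_clean_per_element(root):
    """Clean every element's subtree separately: nested nodes are revisited."""
    return sum(sum(1 for _ in el.iter()) for el in root.iter())

root = build_nested(100)
print(visits_clean_once(root))         # 100
print(visits_clean_per_element(root))  # 5050
```

For a chain of N nested elements, the per-element variant visits N(N+1)/2 nodes. Flat documents with little nesting stay close to linear, which may be why the quadratic case is acceptable in practice.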
