[wip] use html_text to get element text content in microdata #114

kmike · 2019-06-04T07:30:59Z

This is an unfinished fix for #113; please don't merge. I've opened the PR just to share the initial code. It has these problems:

- Cleaning can be too aggressive: for example, <style> elements are removed. This may cause missed extractions if microdata is set on those elements.
- Tests are failing (I haven't looked at them; the fix is probably not correct at all :)
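
To illustrate the first problem, here is a minimal sketch using only the standard library. It is a toy model, not the html_text cleaner: `toy_clean`, `DROPPED_TAGS` and the `css` property name are hypothetical and exist only for this example. It assumes the cleaner drops whole <style> elements, as described above, and shows that any itemprop set on such an element disappears with it.

```python
import copy
import xml.etree.ElementTree as ET

# Assumption: the cleaner removes these elements entirely.
DROPPED_TAGS = {"script", "style"}


def toy_clean(root):
    """Return a deep copy of `root` with DROPPED_TAGS elements removed."""
    cleaned = copy.deepcopy(root)
    for parent in list(cleaned.iter()):
        for child in list(parent):
            if child.tag in DROPPED_TAGS:
                parent.remove(child)
    return cleaned


def itemprops(root):
    """List the itemprop values found in document order."""
    return [el.get("itemprop") for el in root.iter() if el.get("itemprop")]


doc = ET.fromstring(
    '<div itemscope="">'
    '<span itemprop="name">Widget</span>'
    '<style itemprop="css">p { color: red }</style>'
    '</div>'
)

before = itemprops(doc)            # properties visible before cleaning
after = itemprops(toy_clean(doc))  # the one on <style> is gone
print(before, after)
```

In this toy model the property on the <style> element is present before cleaning and missing afterwards, which is the kind of lost extraction the first bullet warns about.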

The problem with not cleaning the HTML tree just once (i.e. cleaning it separately for each element) is that it could make the algorithm O(N^2), but maybe that's fine.
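
A rough sketch of where the O(N^2) could come from, using only the standard library. This is a cost model under stated assumptions, not a measurement of html_text: it assumes cleaning an element costs time proportional to the size of its subtree, and all function names here are hypothetical. The worst case is deeply nested elements that each carry an itemprop.

```python
import xml.etree.ElementTree as ET


def nested_chain(depth):
    """Build <div itemprop="p0"><div itemprop="p1">...</div></div>."""
    root = ET.Element("div", {"itemprop": "p0"})
    node = root
    for i in range(1, depth):
        node = ET.SubElement(node, "div", {"itemprop": f"p{i}"})
    return root


def subtree_size(el):
    """Number of elements in the subtree rooted at `el` (including `el`)."""
    return sum(1 for _ in el.iter())


def visits_clean_per_element(root):
    """Nodes touched if every itemprop element's subtree is cleaned separately."""
    return sum(subtree_size(el) for el in root.iter() if el.get("itemprop"))


def visits_clean_once(root):
    """Nodes touched if the whole tree is cleaned a single time."""
    return subtree_size(root)


chain = nested_chain(100)
per_element = visits_clean_per_element(chain)  # 100 + 99 + ... + 1
once = visits_clean_once(chain)
print(per_element, once)
```

For a chain of N nested elements, per-element cleaning touches N(N+1)/2 nodes under this model, versus N for a single pass. For a flat document, where itemprop elements are siblings rather than ancestors of each other, the per-element cost stays roughly linear, which may be why the quadratic case is acceptable in practice.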

[wip] use html_text to get element text content in microdata

7e952b0

jakubwasikowski mentioned this pull request Jul 17, 2019

Fix incorrectly formatted description property #119

Merged

jakubwasikowski merged commit 7e952b0 into master Jul 19, 2019
