
HTML parser incorrectly normalizes XML tags


Dear all,

Hope you can help me with the following.

When I use the HttpRetriever to request information from an API server, I sometimes receive an "empty" (self-closing) XML tag that acts as both the opening and the closing tag, for example <prism:pageRange /> (see point 1 below).

It seems that the HTML parser nodes in KNIME "normalize" these "empty" XML tags, but the current HtmlParser does not always do this correctly: it treats the empty tag as the parent of the next tag (see point 2 below). The old, deprecated NekoHtmlParser seems to have no problem normalizing these "empty" XML tags correctly (see point 3 below).

Why does the HtmlParser node cause this problem, and how can I best solve it? Should I simply use the NekoHtmlParser instead?

Many thanks in advance,

Ruben

 

1. Retrieved result via Web Browser (Chrome):

    <entry>
      <prism:url>***</prism:url>
      <dc:title>***</dc:title> 
      <prism:pageRange /> 
      <prism:doi>***</prism:doi> 
    </entry>

2. Parsed result via HtmlParser (Palladian for KNIME 1.6.100.v201607071900):

    <entry ...>
        <prismU00003Aurl>***</prismU00003Aurl>
        <dcU00003Atitle>***</dcU00003Atitle>
        <prismU00003Apagerange>
            <prismU00003Adoi>***</prismU00003Adoi>
        </prismU00003Apagerange>
    </entry>

3. Parsed result via NekoHtmlParser:

    <entry ...>
        <prism:url>***</prism:url>
        <dc:title>***</dc:title>
        <prism:pagerange>
        </prism:pagerange>

        <prism:doi>***</prism:doi>
    </entry>
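
For comparison, here is a minimal sketch of how a strict XML parser (rather than an HTML parser) handles the same snippet. It uses Python's standard library, not a KNIME node, so it only illustrates the structure I would expect. The namespace URIs and the element values are placeholders I made up, because the real response is masked above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs and values: the real ones are not shown in the post.
NS = {
    "prism": "http://example.org/prism",
    "dc": "http://example.org/dc",
}

xml_text = """
<entry xmlns:prism="http://example.org/prism" xmlns:dc="http://example.org/dc">
  <prism:url>u</prism:url>
  <dc:title>t</dc:title>
  <prism:pageRange />
  <prism:doi>d</prism:doi>
</entry>
"""

root = ET.fromstring(xml_text.strip())

# Local names of the direct children of <entry>, in document order.
children = [child.tag.split("}")[1] for child in root]

# The self-closing tag is parsed as an element with no children and no text.
page_range = root.find("prism:pageRange", NS)

print(children)
print(len(page_range), page_range.text)
```

With an XML parser, prism:doi stays a sibling of prism:pageRange instead of becoming its child, and the original casing of pageRange is kept.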

 

