RSS/Atom: Unescaped characters in label attribute break certain XML parsers

I'm not sure whether this is something cohost should fix or something RSS reader implementations should handle, but I'll report it here on the assumption that it's an easier fix on cohost's side.


The XML for cohost Atom feeds represents post tags as category elements with a label attribute, like so:

<category label="css" term="css"/>

While troubleshooting why the RSS extension I use had stopped fetching the feeds of certain users (I checked that their posts were public), I found that the extension calls DOMParser's parseFromString function, which was returning "not well-formed" XML parsing errors. The errors pointed to label attributes containing unescaped characters such as the ampersand and the less-than sign (found in posts tagged "D&D", "q&a", ">_<", etc.).


Normally, the &, ", ', <, and > characters should be escaped inside of XML. I see that this is done in other parts of the Atom feed, but not for the values of these label attributes.
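
To illustrate what escaping these attribute values could look like on the generating side, here is a minimal Python sketch. I don't know how cohost actually builds its feeds, so the category_element helper and its parameters are hypothetical; only quoteattr (from Python's standard library) is a real function.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET


def category_element(tag: str, term: str) -> str:
    """Hypothetical helper: build a <category> element with escaped attributes.

    quoteattr() escapes &, < and > and wraps the value in a suitable pair of
    quotes, escaping double quotes only when both quote styles appear.
    """
    return f"<category label={quoteattr(tag)} term={quoteattr(term)}/>"


# The tags below come from this report; the term values are placeholders.
for tag in ["css", "D&D", "q&a", ">_<"]:
    xml_text = category_element(tag, "placeholder")
    # Parsing succeeds and the original tag text round-trips unchanged.
    parsed = ET.fromstring(xml_text)
    print(xml_text, "->", parsed.get("label"))
```

The point is only that the escaping happens once, at the place where the attribute is written, so every reader gets well-formed XML.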


As of writing, I've resolved the error on my end by modifying the extension code with a bit of hacky regex-ing to escape characters in this specific edge case. It would be more convenient if this were already handled when the Atom feeds are generated on cohost's servers, though. I'm no expert in XML, so I can't say with confidence whether escaping attribute data is common practice, but this is probably worth looking into.
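
For anyone who needs a similar stopgap, here is a rough Python sketch of that kind of regex workaround. It is not the extension's actual code (which is JavaScript), the repair_feed and escape_label names are made up for illustration, and it assumes the label attribute always comes directly before term, as in the examples in this report. A label containing an unescaped double quote would still break it.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: attributes appear as <category label="..." term="...">.
LABEL_RE = re.compile(r'(<category\s+label=")(.*?)("\s+term=")')
# An ampersand that does not already start an entity or character reference.
BARE_AMP_RE = re.compile(r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_label(match: re.Match) -> str:
    """Escape bare &, < and > inside one matched label value."""
    value = BARE_AMP_RE.sub("&amp;", match.group(2))
    value = value.replace("<", "&lt;").replace(">", "&gt;")
    return match.group(1) + value + match.group(3)


def repair_feed(xml_text: str) -> str:
    """Hypothetical helper: patch label attributes before parsing the feed."""
    return LABEL_RE.sub(escape_label, xml_text)


broken = (
    '<feed><category label="sonic & knuckles" term="sonic%20%26%20knuckles"/>'
    '<category label=">_<" term="placeholder"/></feed>'
)
repaired = ET.fromstring(repair_feed(broken))
print([c.get("label") for c in repaired.findall("category")])
```

This only papers over the problem in one client, which is why a server-side fix would still be preferable.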


(In case it's needed: I encountered this issue on Firefox 113.0.2)



Chiming in with confirmation that this is definitely a cohost/server-side problem, as feeds containing unescaped symbol characters are not valid XML, which is what feed readers need.


As an example, the feed https://cohost.org/MercuryCDX/rss/public.atom contains post entries with the tag "sonic & knuckles" (tags being the source for the category.@label attribute), which fails validation, as shown in the link below.


https://validator.w3.org/feed/check.cgi?url=https%3A%2F%2Fcohost.org%2FMercuryCDX%2Frss%2Fpublic.atom#l434


This feed also fails to parse with xmllint, which reports the following error:

-:434: parser error : xmlParseEntityRef: no name
        <category label="sonic & knuckles" term="sonic%20%26%20knuckles"/>
                                ^

For more information on the error, please read https://validator.w3.org/feed/docs/error/SAXError.html.
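
The failure is not specific to xmllint or the W3C validator. As a minimal sketch, the offending line quoted above can be fed on its own to Python's standard-library XML parser, which should reject it as well. The exact error wording differs between parsers, so none is shown here.

```python
import xml.etree.ElementTree as ET

# The category element exactly as it appears in the xmllint output above.
snippet = '<category label="sonic & knuckles" term="sonic%20%26%20knuckles"/>'

try:
    ET.fromstring(snippet)
    error = None
except ET.ParseError as exc:
    # A conforming parser must stop here: "&" followed by a space is not
    # the start of a valid entity reference.
    error = exc

print("parse failed:", error is not None)
if error is not None:
    # position is a (line, column) tuple pointing at the offending character.
    print(error, error.position)
```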

I can also confirm that this is an issue: a page I follow won't load in my rss app due to this bug.

I have also run into this issue with a number of Cohost feeds. Definitely needs to be fixed!
