Can't extract summary with nested div #141

Open
swiftcoder opened this issue Oct 13, 2024 · 2 comments
Comments

@swiftcoder

I'm trying to parse an atom feed that has a div tag nested inside the summary element:

 <summary type='xhtml'><div xmlns='http://www.w3.org/1999/xhtml'>Old software joke: “After the apocalypse, all that’ll be left will be cockroaches, Keith Richards, and markup characters     that have been escaped (or unescaped) one too many (or few) times.</div></summary>

The summary object ends up looking like this:

{div: {#text: "Old software joke: “After the apocalypse, all that…", @_xmlns: "http://www.w3.org/1999/xhtml"}, @_type: "xhtml"}

It appears that getText() only checks for a #text property on the root object, so it returns an empty string here, because the #text property is on the nested div element.
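To make the failure mode concrete, here is a minimal sketch. `rootOnlyText` is a hypothetical stand-in for the behavior described above, not the library's actual getText() source, and the `summary` object mirrors the parsed shape shown earlier with the text shortened for illustration.

```typescript
// Parsed shape of the <summary> element, as reported above (text shortened).
const summary = {
  div: {
    "#text": "Old software joke",
    "@_xmlns": "http://www.w3.org/1999/xhtml",
  },
  "@_type": "xhtml",
};

// Hypothetical stand-in: looks for "#text" on the root object only.
function rootOnlyText(node: unknown): string {
  if (typeof node === "string") return node;
  if (node !== null && typeof node === "object") {
    const text = (node as Record<string, unknown>)["#text"];
    return typeof text === "string" ? text : "";
  }
  return "";
}

// The root has no "#text" key, so the lookup comes back empty.
console.log(JSON.stringify(rootOnlyText(summary)));
// The nested div does carry "#text", so the same lookup succeeds one level down.
console.log(JSON.stringify(rootOnlyText(summary.div)));
```

A lookup that stops at the root cannot see text that the parser attached to a child element, which is what the xhtml-typed summary produces.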

@swiftcoder
Author

swiftcoder commented Oct 13, 2024

As a workaround for this on my side, I've written a little helper function to recursively find the first text node:

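// Recursively find the first non-empty text node in a parsed XML value.
function recursiveFindText(input: any): string | undefined {
  // Missing fields (undefined or null) have no text; indexing them would throw.
  if (input === undefined || input === null) {
    return undefined;
  }
  // Text keys used by different XML-to-JS converters.
  let text =
    input["_text"] ??
    input["#text"] ??
    input["_cdata"] ??
    input["$t"] ??
    (typeof input === "string" ? input : undefined);
  if (!text) {
    // No text at this level: search the child properties depth-first.
    for (const property in input) {
      text = recursiveFindText(input[property]);
      if (text) {
        break;
      }
    }
  }
  return text;
}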

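// Strip markup from the first text node and cap the result at 250 characters.
// Assumes bellajs is already in scope under that name.
function extractDescription(input: any): string | undefined {
  const text = recursiveFindText(input);
  if (text) {
    return bellajs.truncate(bellajs.stripTags(text), 250);
  }
  return undefined;
}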

And I'm applying it in getExtraEntryFields to overwrite the blank value:

        getExtraEntryFields: (feedEntry: any) => {
          return {
            description:
              extractDescription(feedEntry.description) ??
              extractDescription(feedEntry.summary) ??
              extractDescription(feedEntry.content),
          };
        },
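A self-contained sketch of how that fallback chain behaves on an Atom-style entry that has a summary but no description. The `entry` object, `findFirstText`, and `plainDescription` are illustrative names only. The regex-based tag removal and `slice` are simple stand-ins for bellajs.stripTags and bellajs.truncate, not their implementations. This variant also skips keys starting with `@_`, on the assumption that the parser prefixes attributes that way (as in the object shown earlier), so an attribute value is never mistaken for text.

```typescript
type XmlNode = string | number | null | undefined | { [key: string]: XmlNode };

// Depth-first search for the first non-empty string, ignoring "@_" attribute keys.
function findFirstText(node: XmlNode): string | undefined {
  if (typeof node === "string") return node.trim() ? node : undefined;
  if (node === null || typeof node !== "object") return undefined;
  for (const key of Object.keys(node)) {
    if (key.startsWith("@_")) continue; // assumed attribute prefix
    const found = findFirstText(node[key]);
    if (found) return found;
  }
  return undefined;
}

// Stand-in for stripTags + truncate: crude tag removal, then a hard cut.
function plainDescription(node: XmlNode, maxLen = 250): string | undefined {
  const text = findFirstText(node);
  return text ? text.replace(/<[^>]*>/g, "").trim().slice(0, maxLen) : undefined;
}

// Hypothetical Atom-style entry: attribute listed first, no description or content.
const entry: { [key: string]: XmlNode } = {
  summary: {
    "@_type": "xhtml",
    div: {
      "@_xmlns": "http://www.w3.org/1999/xhtml",
      "#text": "Old <b>software</b> joke",
    },
  },
};

// Missing fields yield undefined, so ?? falls through to the next candidate.
const description =
  plainDescription(entry.description) ??
  plainDescription(entry.summary) ??
  plainDescription(entry.content);

console.log(description);
```

Because each helper returns undefined rather than an empty string when nothing is found, the `??` chain only stops at a field that actually produced text.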

@ndaidong
Collaborator

@swiftcoder thank you for noting this issue. Yes, your solution is great and it's the best way to fix this RSS source.
