How do I read a node that has a text value and child elements? #30

Open
wattsjp2 opened this issue Nov 25, 2019 · 4 comments

Comments

@wattsjp2

I have an XML element that looks like

<Card>
    4111111111111111
    <Type>VISA</Type>
</Card>

that I'm trying to write an XmlReader for. My case class looks like:

case class Card(
    number: String,
    cardType: String
)

The problem I'm having is extracting the card number. I've played around in the REPL and looked at the source code. My first thought was that I could just read the root element:

scala> val xml = <Card>
     |   4111111111111111
     |   <Type>VISA</Type>
     | </Card>
xml: scala.xml.Elem =
<Card>
	4111111111111111
	<Type>VISA</Type>
</Card>

scala> val path = __
path: com.lucidchart.open.xtract.XPath.type =

scala> path(xml)
res0: scala.xml.NodeSeq =
<Card>
	4111111111111111
	<Type>VISA</Type>
</Card>

scala> val reader = path.read[String]
reader: com.lucidchart.open.xtract.XmlReader[String] = com.lucidchart.open.xtract.XmlReader$$anon$1@52f47f0c

scala> reader.read(xml)
res6: com.lucidchart.open.xtract.ParseResult[String] =
ParseSuccess(
	4111111111111111
	VISA
)

This reads all the text under the root node, though, including the text of the child element. My next thought was that maybe I could loop through the child nodes:

scala> path.children(xml)
res8: scala.xml.NodeSeq = NodeSeq(<Type>VISA</Type>)

but that doesn't return the text node. My last thought was: what if <Card> wasn't the root element? Would that change anything?

scala> val xml = <Root>
     |   <Card>
     |           4111111111111111
     |           <Type>VISA</Type>
     |   </Card>
     | </Root>
xml: scala.xml.Elem =
<Root>
	<Card>
		4111111111111111
		<Type>VISA</Type>
	</Card>
</Root>

scala> val path = (__ \ "Card")
path: com.lucidchart.open.xtract.XPath = /Card

scala> path(xml)
res10: scala.xml.NodeSeq =
NodeSeq(<Card>
		4111111111111111
		<Type>VISA</Type>
	</Card>)

scala> path.read[String].read(xml)
res11: com.lucidchart.open.xtract.ParseResult[String] =
ParseSuccess(
		4111111111111111
		VISA
	)

So that gives the same behavior. It looks like, under the hood, stringReader uses the text function on NodeSeq:

  /**
   * [[XmlReader]] matches the text of a single node.
   */
  implicit val stringReader: XmlReader[String] = XmlReader { xml =>
    getNode(xml).map(_.text)
  }

The behavior seems to come from that call to text:

scala> val xml: NodeSeq = <Card>
     |   4111111111111111
     |   <Type>VISA</Type>
     | </Card>
xml: scala.xml.NodeSeq =
<Card>
  4111111111111111
  <Type>VISA</Type>
</Card>

scala> xml.text
res16: String =
"
  4111111111111111
  VISA
"
@tmccombs
Contributor

Possibly related to #24?

@wattsjp2
Author

I think the root problem is the way stringReader handles nodes that have both a text node and child elements, which I think is tied to how NodeSeq.text works. I think #24 is thwarting my efforts to work around the problem, so in that way they're related.

@tmccombs
Contributor

But what should the behaviour be? Should it use the concatenation of all text nodes that are direct children (but not descendants), use just the first text node, fail if there are element node children, or use a concatenation of all descendant text nodes (as is currently done)?

Changing the default away from the current behaviour is probably a breaking change, although I'm not sure whether any existing usages rely on the current behaviour in the presence of child elements. I'm not necessarily opposed to changing the default way of parsing text. But to be honest, failing if there are child elements makes the most sense to me as the default. I pointed at #24 because it is a more general solution, though perhaps not as convenient in your case.

I could probably add something to the XPath and/or XmlReader API to extract the first text node or all direct child text nodes, although I'm not sure what good names for those methods would be. Would that work for you?
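
For reference, the four candidate behaviours listed above can be sketched directly against scala.xml, without touching xtract. This is only an illustration under stated assumptions: the object name TextStrategies and its method names are hypothetical, not part of xtract or any proposed API, and the snippet assumes scala-xml is on the classpath (the first line is a scala-cli directive and can be dropped in other build setups).

```scala
//> using dep org.scala-lang.modules::scala-xml:2.1.0
import scala.xml.{Elem, Node, Text}

// Hypothetical helper names, for illustration only (not xtract API).
object TextStrategies {
  // Trimmed, non-empty text nodes that are direct children of `node`.
  def directTexts(node: Node): Seq[String] =
    node.child.collect { case t: Text => t.text.trim }.filter(_.nonEmpty).toSeq

  // Option 1: concatenate direct child text nodes, ignoring descendants.
  def directText(node: Node): String = directTexts(node).mkString

  // Option 2: use only the first direct child text node, if there is one.
  def firstText(node: Node): Option[String] = directTexts(node).headOption

  // Option 3: refuse to read text from a node that has element children.
  def strictText(node: Node): Either[String, String] =
    if (node.child.exists(_.isInstanceOf[Elem]))
      Left(s"<${node.label}> has child elements")
    else
      Right(node.text)

  // Option 4 (what stringReader does today via .text): all descendant text.
  def allText(node: Node): String = node.text
}
```

For the <Card> element from the original report, directText and firstText would yield only the card number, strictText would return a Left, and allText would include VISA as well.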

@wattsjp2
Author

You know, I may have forgotten that an XML element can have multiple text nodes. I think you're making a lot of sense then. Trying to read a node with children as a string is probably undefined behaviour. I think a solution that returns either the direct child text nodes or all child nodes makes sense; the user can then do a collectFirst or mkString or whatever, based on their use case.
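
Until something like that exists in xtract, a workaround along these lines is possible with plain scala.xml. This is a minimal sketch, not an XmlReader: Card mirrors the case class from the original report, CardParser is a hypothetical name, and it assumes the card number is the first non-blank text node directly under <Card>. scala-xml must be on the classpath.

```scala
//> using dep org.scala-lang.modules::scala-xml:2.1.0
import scala.xml.{Node, Text}

final case class Card(number: String, cardType: String)

// Hypothetical helper, for illustration only (not xtract API).
object CardParser {
  def parse(card: Node): Option[Card] = {
    // First non-blank text node that is a direct child of <Card>.
    val number = card.child.collectFirst {
      case t: Text if t.text.trim.nonEmpty => t.text.trim
    }
    // Text of the first <Type> child element, if present.
    val cardType = (card \ "Type").headOption.map(_.text.trim)
    for {
      n <- number
      t <- cardType
    } yield Card(n, t)
  }
}
```

Returning an Option keeps the sketch small; a real reader would more likely report which part was missing.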
