How to use XPath expressions in shell scripting using xmllint

This is a minor tip I want to share: a small example of a nice software feature that made my day.

I've been messing with HTML scraping and took a look at xmllint's (maybe new) features. My goal was to extract a particular pattern, for which the --xpath option looked like a good fit. I've never been very good at tuning XPath expressions, so I searched for how to approach this and found an amazing feature of xmllint's shell mode. Here is the workflow I used:

  • get your document, I used an HTML one
  • I haven't tested this with broken HTML, but you can check how your document parses with xmllint --html
  • get into shell: xmllint --html --shell [document], keep in mind [document] can be a remote URI.
  • in the shell mode you can search for a precise string, in my case I chose the one inside the desired pattern: grep [string]
  • here is where the magic happens: for each match, xmllint answers with the path of the matching node, which you can use as an XPath expression in a query
  • exit the shell
  • copy the extracted XPath expression to the CLI: xmllint --html --xpath [xpath] [document]
  • and there is your result.
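The steps above can be sketched non-interactively as follows. This is only a minimal illustration: the sample document and the search string "Price" are made up for the example, and the path /html/body/div/p is the one I would expect the shell to report for this particular document, so check what grep prints for yours. The shell reads its commands from standard input, which makes it possible to pipe them in.

```shell
# Create a small sample document (hypothetical content, for illustration only).
doc=$(mktemp)
cat > "$doc" <<'EOF'
<html><body>
<div id="main"><p class="price">Price: 42 EUR</p></div>
</body></html>
EOF

# Step 1: send the grep command to the xmllint shell on stdin.
# Each match is reported together with the path of the node containing it.
printf 'grep Price\nbye\n' | xmllint --html --shell "$doc"

# Step 2: reuse the reported path as an XPath query.
# /html/body/div/p is assumed to be the path printed in step 1.
result=$(xmllint --html --xpath '/html/body/div/p' "$doc")
echo "$result"
```

Working interactively is the same thing, except that you type grep and bye at the shell prompt yourself.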

You can tune your expressions by adding predicates, for example to match specific attributes, or by extracting the text() node, etc.
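As a rough sketch of that kind of tuning, the example below narrows a query with an attribute predicate and then asks for the text() node only. The sample document and the class names "price" and "note" are invented for the illustration; adapt the expressions to whatever the shell reported for your document.

```shell
# Sample document with two similar paragraphs (hypothetical content).
doc=$(mktemp)
cat > "$doc" <<'EOF'
<html><body>
<div id="main"><p class="price">42 EUR</p><p class="note">in stock</p></div>
</body></html>
EOF

# Attribute predicate: select only the <p> whose class attribute is "price".
xmllint --html --xpath '//p[@class="price"]' "$doc"

# text(): select the text node inside that element instead of the element itself.
price=$(xmllint --html --xpath '//p[@class="price"]/text()' "$doc")
echo "$price"
```

Capturing the output in a variable, as in the last step, is what makes this handy inside a shell script.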

Enjoy.


