Python Web Scraping Cookbook
上QQ阅读APP看书,第一时间看更新

How it works

XPath is a element of the XSLT (eXtensible Stylesheet Language Transformation) standard and provides the ability to select nodes in an XML document. HTML is a variant of XML, and hence XPath can work on on HTML document (although HTML can be improperly formed and mess up XPath parsing in those cases).

XPath itself is designed to model the structure of XML nodes, attributes, and properties. The syntax provides means of finding items in the XML that match the expression. This can include matching or logical comparison of any of the nodes, attributes, values, or text in the XML document.

XPath expressions can be combined to form very complex paths within the document. It is also possible to navigate the document based upon relative positions, which helps greatly in finding data based upon relative positions instead of absolute positions within the DOM.

Understanding XPath is essential for knowing how to parse HTML and perform web scraping. And as we will see, it underlies, and provides an implementation for, many of the higher level libraries such as lxml.