How to do it...
The first thing we do is load the HTML into an lxml element tree ("etree"), which is lxml's representation of the DOM.
In [2]: tree = html.fromstring(page_html)
The tree variable is now an lxml representation of the DOM which models the HTML content. Let's now examine how to use it and XPath to select various elements from the document.
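To experiment with the queries in this recipe without the original page, the following minimal sketch builds a tree from an inline string. The page_html content here is a hypothetical, trimmed-down stand-in for the planets page (two planets, three made-up columns), not the real file.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
page_html = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

# Because the string is a full document, the returned element is the root <html>.
tree = html.fromstring(page_html)
print(tree.tag)

# xpath() always returns a list, even when only one node matches.
divs = tree.xpath("/html/body/div")
print([div.get("id") for div in divs])
```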
Our first XPath example will be to find all the <tr> elements below the <table> element.
In [3]: [tr for tr in tree.xpath("/html/body/div/table/tr")]
Out[3]:
[<Element tr at 0x10cfd1408>,
<Element tr at 0x10cfd12c8>,
<Element tr at 0x10cfd1728>,
<Element tr at 0x10cfd16d8>,
<Element tr at 0x10cfd1458>,
<Element tr at 0x10cfd1868>,
<Element tr at 0x10cfd1318>,
<Element tr at 0x10cfd14a8>,
<Element tr at 0x10cfd10e8>,
<Element tr at 0x10cfd1778>,
<Element tr at 0x10cfd1638>]
This XPath navigates by tag name from the root of the document down to the <tr> elements. This example looks similar to the property notation from Beautiful Soup, but ultimately it is significantly more expressive. Notice one difference in the result: all the <tr> elements were returned, not just the first. As a matter of fact, the tags at each level of this path will return multiple items if they are available. If there were multiple <div> elements just below <body>, then the search for table/tr would be executed on all of those <div> elements.
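The fan-out at each level of the path can be seen in a small sketch. The SAMPLE_HTML string is a hypothetical stand-in with two <div> elements, each holding a table; reading the id attribute with get() is just one convenient way to see which rows matched.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# The div step matches both <div> elements, so rows from both tables come back,
# in document order.
rows = tree.xpath("/html/body/div/table/tr")
row_ids = [tr.get("id") for tr in rows]
print(row_ids)  # ['planetHeader', 'planet1', 'planet3', 'footerRow']
```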
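The actual result is a list of lxml element objects. The following gets the HTML associated with each element by using etree.tostring(). Note that the result is a byte string in which non-ASCII characters are encoded as character references, and that we keep only the first 50 bytes of each: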
In [4]: from lxml import etree
...: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr")]
Out[4]:
[b'<tr id="planetHeader"> \n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ',
b'<tr id="footerRow"> \n <td> ']
Now let's look at using XPath to select only the <tr> elements that are planets.
In [5]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr[@class='planet']")]
Out[5]:
[b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ']
The [] next to a tag states that we want to filter the elements at that step by some criterion. The @ states that we want to examine an attribute of the tag, and in this case we want to select the tags whose class attribute is equal to "planet".
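An attribute predicate can be tried on a small sample as follows. This is a sketch against a hypothetical stand-in document; it also shows that a trailing /@name step returns the attribute values themselves instead of the elements.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# Keep only the rows whose class attribute is exactly "planet".
planet_rows = tree.xpath("/html/body/div/table/tr[@class='planet']")
planet_names = [tr.get("name") for tr in planet_rows]
print(planet_names)  # ['Mercury', 'Earth']

# Ending the path with @name selects the attribute values directly.
names_direct = [str(n) for n in
                tree.xpath("/html/body/div/table/tr[@class='planet']/@name")]
print(names_direct)  # ['Mercury', 'Earth']
```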
There is another point to be made about the earlier query that returned 11 <tr> rows. As stated earlier, the XPath runs the navigation on all the nodes found at each level. There are two tables in this document, each a child of a different <div>, and both of those <div> elements are children of the <body> element. The row with id="planetHeader" came from our desired target table; the other, with id="footerRow", came from the second table.
Previously we solved this by selecting the <tr> elements with class="planet", but there are other ways worth a brief mention. The first is that we can use [] to select a specific element by position at each step of the XPath, much like indexing an array. Take the following:
In [6]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[1]/table/tr")]
Out[6]:
[b'<tr id="planetHeader"> \n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ']
Positions in XPath start at 1 instead of 0 (a common source of error), so [1] selected the first <div>. Changing it to [2] selects the second <div> and hence only the rows of the second <table>:
In [7]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[2]/table/tr")]
Out[7]: [b'<tr id="footerRow"> \n <td> ']
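The 1-based indexing is easy to check on a small sample. The sketch below uses a hypothetical stand-in document; it also shows that an index of 0 matches nothing and returns an empty list instead of raising an error.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# div[1] is the first <div> under <body>, div[2] the second.
first_div_ids = [tr.get("id") for tr in tree.xpath("/html/body/div[1]/table/tr")]
second_div_ids = [tr.get("id") for tr in tree.xpath("/html/body/div[2]/table/tr")]
print(first_div_ids)   # ['planetHeader', 'planet1', 'planet3']
print(second_div_ids)  # ['footerRow']

# There is no position 0, so this silently matches nothing.
zero_index = tree.xpath("/html/body/div[0]/table/tr")
print(zero_index)      # []
```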
The first <div> in this document also has an id attribute:
<div id="planets">
This can be used to select this <div>:
In [8]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr")]
Out[8]:
[b'<tr id="planetHeader"> \n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ']
Earlier we selected the planet rows based upon the value of the class attribute. We can also exclude rows:
In [9]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']")]
Out[9]:
[b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ']
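One subtlety of excluding with != is that it only matches elements that actually have the attribute. The following sketch, using a hypothetical snippet in which one row has no id, contrasts @id!='planetHeader' with not(@id='planetHeader'), which also keeps rows lacking an id.

```python
from lxml import html

# Hypothetical snippet: the last row deliberately has no id attribute.
SNIPPET = """<html><body><table>
<tr id="planetHeader"><th>Name</th></tr>
<tr id="planet1"><td>Mercury</td></tr>
<tr><td>row without an id</td></tr>
</table></body></html>"""

tree = html.fromstring(SNIPPET)

# != compares against an existing id value, so the id-less row is dropped too.
with_neq = [tr.text_content().strip()
            for tr in tree.xpath("/html/body/table/tr[@id!='planetHeader']")]
print(with_neq)  # ['Mercury']

# not(...) negates the whole test, so the id-less row is kept.
with_not = [tr.text_content().strip()
            for tr in tree.xpath("/html/body/table/tr[not(@id='planetHeader')]")]
print(with_not)  # ['Mercury', 'row without an id']
```

In the planets table every row has an id, so both forms give the same result there.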
Suppose that neither the planet rows nor the header row had attributes. We could then select the planet rows by position, skipping the first row:
In [10]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[position() > 1]")]
Out[10]:
[b'<tr id="planet1" class="planet" name="Mercury">',
b'<tr id="planet2" class="planet" name="Venus"> ',
b'<tr id="planet3" class="planet" name="Earth"> ',
b'<tr id="planet4" class="planet" name="Mars"> \n',
b'<tr id="planet5" class="planet" name="Jupiter">',
b'<tr id="planet6" class="planet" name="Saturn">\n',
b'<tr id="planet7" class="planet" name="Uranus">\n',
b'<tr id="planet8" class="planet" name="Neptune">',
b'<tr id="planet9" class="planet" name="Pluto"> ']
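Positions are counted separately within each parent, which the following sketch illustrates on a hypothetical stand-in document. It also uses last(), the standard XPath function for the final position, as a companion to position().

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# Skip the first row of every table; the footer table's only row is row 1,
# so it disappears from the result.
skipped = [tr.get("id")
           for tr in tree.xpath("/html/body/div/table/tr[position() > 1]")]
print(skipped)  # ['planet1', 'planet3']

# last() selects the final row of the planets table.
last_row = [tr.get("id")
            for tr in tree.xpath("/html/body/div[@id='planets']/table/tr[last()]")]
print(last_row)  # ['planet3']
```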
It is possible to navigate to the parent of a node using parent::*:
In [11]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::*")]
Out[11]:
[b'<table id="planetsTable" border="1"> \n ',
b'<table id="footerTable"> \n <tr id="']
This returned two parents because, remember, this XPath returns the rows from two tables, so the parents of all those rows are found. The * is a wildcard that matches a parent element with any tag name. In this case the two parents are both tables, but in general the result can contain any kind of HTML element. The following has the same result, but if the two parents were different HTML tags then it would return only the <table> elements:
In [12]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table")]
Out[12]:
[b'<table id="planetsTable" border="1"> \n ',
b'<table id="footerTable"> \n <tr id="']
It is also possible to specify a specific parent by position or attribute. The following selects the parent with id="footerTable":
In [13]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/parent::table[@id='footerTable']")]
Out[13]: [b'<table id="footerTable"> \n <tr id="']
A shortcut for the parent is .. (similarly, . represents the current node):
In [14]: [etree.tostring(tr)[:50] for tr in tree.xpath("/html/body/div/table/tr/..")]
Out[14]:
[b'<table id="planetsTable" border="1"> \n ',
b'<table id="footerTable"> \n <tr id="']
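The parent forms shown above can be compared side by side in a short sketch on a hypothetical stand-in document. It also uses getparent(), lxml's element method for reaching the parent from Python once you already hold an element.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td>5.97</td></tr>
  </table>
</div>
<div id="footer">
  <table id="footerTable">
    <tr id="footerRow"><td>Footer text</td></tr>
  </table>
</div>
</body></html>"""

tree = html.fromstring(SAMPLE_HTML)
base = "/html/body/div/table/tr"

# Four rows share two parents; each parent appears only once in the result.
via_axis = [t.get("id") for t in tree.xpath(base + "/parent::*")]
via_dots = [t.get("id") for t in tree.xpath(base + "/..")]
print(via_axis, via_dots)

# A predicate narrows the parents down to a single table.
footer_only = [t.get("id")
               for t in tree.xpath(base + "/parent::table[@id='footerTable']")]

# From Python, getparent() walks up one level from an element.
earth_row = tree.xpath("/html/body/div/table/tr[@name='Earth']")[0]
earth_table_id = earth_row.getparent().get("id")
print(footer_only, earth_table_id)
```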
And the last example finds the mass of Earth:
In [15]: mass = tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()
...: mass
Out[15]: '5.97'
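The trailing portion of this XPath, /td[3]/text()[1], selects the third <td> element in the row, then the text nodes of that element (text() yields all the text nodes directly inside the element), and finally the first of those, which contains the mass. Because xpath() returns a list, [0] takes the single result and strip() removes the surrounding whitespace.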
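Indexing the result with [0] raises an IndexError when nothing matches, so a small guard can be useful. The sketch below runs against a hypothetical stand-in document with made-up columns; first_text is an illustrative helper, not part of lxml.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the planets page (not the real file).
SAMPLE_HTML = """<html><body>
<div id="planets">
  <table id="planetsTable" border="1">
    <tr id="planetHeader"><th>No.</th><th>Name</th><th>Mass</th></tr>
    <tr id="planet1" class="planet" name="Mercury"><td>1</td><td>Mercury</td><td>0.330</td></tr>
    <tr id="planet3" class="planet" name="Earth"><td>3</td><td>Earth</td><td> 5.97 </td></tr>
  </table>
</div>
</body></html>"""


def first_text(root, path):
    """Illustrative helper: first text result of an XPath, stripped, or None."""
    results = root.xpath(path)
    return results[0].strip() if results else None


tree = html.fromstring(SAMPLE_HTML)

# The text comes back as a string, so convert it before doing arithmetic.
earth_mass = float(first_text(
    tree, "/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]"))
print(earth_mass)  # 5.97

# A row that does not exist yields None instead of an IndexError.
missing = first_text(
    tree, "/html/body/div[1]/table/tr[@name='Pluto']/td[3]/text()[1]")
print(missing)  # None
```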