Python Web Scraping Cookbook

Working with CSV and JSON data

Extracting data from HTML pages is done using the techniques in the previous chapter, primarily using XPath through various tools and also with Beautiful Soup. While we will focus primarily on HTML, HTML is closely related to XML (eXtensible Markup Language). XML was once the most popular way of expressing data on the web, but other formats have become popular and have even exceeded XML in popularity.

Two common formats that you will see are JSON (JavaScript Object Notation) and CSV (Comma Separated Values). CSV is easy to create and is a common format for many spreadsheet applications, so many websites provide data in that format, or you will need to convert scraped data to that format for further storage or collaboration. JSON has become the preferred format due to its ease of use within programming languages such as JavaScript (and Python), and many databases now support it as a native data format.

In this recipe, let's examine converting scraped data to CSV and JSON, as well as writing the data to files and reading those data files from remote servers. The tools we will examine are Python's csv and json libraries. We will also examine using pandas for these techniques.
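
As a first taste of what this looks like, here is a minimal sketch that writes a small list of records to CSV and JSON files with the standard library and reads them back. The `items` records, the field names, and the file names are hypothetical placeholders standing in for scraped data, not values from the recipe itself.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical scraped records; titles and prices are illustrative only.
items = [
    {"title": "Example A", "price": 9.99},
    {"title": "Example B", "price": 19.5},
]


def write_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file back into a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def write_json(rows, path):
    """Write a list of dicts to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)


def read_json(path):
    """Read a JSON file back into Python objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Use a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "items.csv"
    json_path = Path(tmp) / "items.json"

    write_csv(items, csv_path)
    write_json(items, json_path)

    rows_from_csv = read_csv(csv_path)
    rows_from_json = read_json(json_path)
```

Note that the round trip is not symmetric: CSV carries no type information, so every value in `rows_from_csv` comes back as a string, while `rows_from_json` preserves the numbers.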


Also implicit in these examples is the conversion of XML data to CSV and JSON, so we won't have a dedicated section for those examples.
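
To see why the XML case needs no special treatment, consider this rough sketch: once the XML elements are flattened into dictionaries, the same csv and json calls apply unchanged. The XML payload, its element names, and the `xml_to_rows` helper are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical XML payload; element and attribute names are illustrative.
xml_text = """
<items>
  <item id="1"><title>Example A</title><price>9.99</price></item>
  <item id="2"><title>Example B</title><price>19.50</price></item>
</items>
"""


def xml_to_rows(text):
    """Flatten each <item> element into a dict of strings."""
    root = ET.fromstring(text.strip())
    return [
        {
            "id": item.get("id"),
            "title": item.findtext("title"),
            "price": item.findtext("price"),
        }
        for item in root.findall("item")
    ]


rows = xml_to_rows(xml_text)

# From here on, the rows are handled like any other scraped data.
json_text = json.dumps(rows, indent=2)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "title", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Everything extracted from XML this way is text, so any numeric conversion (for example, turning `price` into a float) would have to be done explicitly before serializing.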