
Data formats
When we work with data meant for human consumption, the easiest way to store it is in text files. In this section, we present parsing examples for the most common formats: CSV, JSON, XML, and YAML. These examples will be useful in the following chapters.
Tip
The dataset used for these examples is a list of Pokemon by National Pokedex number, obtained from:
http://bulbapedia.bulbagarden.net/
All the scripts and dataset files can be found in the author's GitHub repository:
https://github.com/hmcuesta/PDA_Book/tree/master/Chapter3
CSV
CSV (comma-separated values) is a simple, common, open format for table-like data that most data analysis tools can import and export. CSV is a plain text format: the file is a sequence of characters, so it contains no binary-encoded values that need special interpretation.
There are many ways to parse a CSV file in Python; here we discuss two: the csv module from the standard library and the NumPy genfromtxt function.
The first eight records of the CSV file (pokemon.csv) look like this:
id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
003, Poison, Venusaur, Grass
006, Flying, Charizard, Fire
012, Flying, Butterfree, Bug
013, Poison, Weedle, Bug
014, Poison, Kakuna, Bug
015, Poison, Beedrill, Bug
. . .
Parsing a CSV file with the CSV module
First, we need to import the csv module:
import csv
Then, we open the pokemon.csv file and parse it with the csv.reader(f) function:
with open("pokemon.csv", newline="") as f:
    data = csv.reader(f)
    # Now we just iterate over the reader
    for line in data:
        print(" id: {0} , typeTwo: {1}, name: {2}, type: {3}"
              .format(line[0], line[1], line[2], line[3]))
Output:
id: id , typeTwo: typeTwo, name: name, type: type
id: 001 , typeTwo: Poison, name: Bulbasaur, type: Grass
id: 002 , typeTwo: Poison, name: Ivysaur, type: Grass
id: 003 , typeTwo: Poison, name: Venusaur, type: Grass
id: 006 , typeTwo: Flying, name: Charizard, type: Fire
. . .
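The loop above treats the header as just another row and keeps the blank that follows each comma. As a minimal sketch of one way around both, the csv module also offers DictReader, which uses the first row as field names, and the skipinitialspace option, which drops whitespace after the delimiter. The SAMPLE string and the read_records helper below are hypothetical stand-ins for pokemon.csv and are not part of the book's scripts.

```python
import csv
import io

# Hypothetical in-memory stand-in for the first rows of pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
003, Poison, Venusaur, Grass
"""


def read_records(f):
    """Return the rows of an open CSV file as a list of dictionaries."""
    # skipinitialspace=True drops the blank that follows each comma,
    # in the header as well as in the data rows.
    reader = csv.DictReader(f, skipinitialspace=True)
    return [dict(row) for row in reader]


records = read_records(io.StringIO(SAMPLE))
for record in records:
    print(record["id"], record["name"], record["type"], record["typeTwo"])
```

With the real file, the same helper could be called as read_records(f) inside a with open("pokemon.csv", newline="") as f: block.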
Parsing a CSV file using NumPy
First, we need to import the NumPy library:
import numpy as np
NumPy provides the genfromtxt function, which we call here with four arguments. First, we provide the name of the file (pokemon.csv). Then, we skip the first line, which is the header (skip_header). Next, we specify the data type (dtype); passing None lets NumPy infer the type of each column from its contents. Finally, we define the comma as the delimiter (delimiter):
data = np.genfromtxt("pokemon.csv",
                     skip_header=1,
                     dtype=None,
                     delimiter=',')
Then, just print the result:
print(data)
Output:
[(1, b' Poison', b' Bulbasaur', b' Grass')
 (2, b' Poison', b' Ivysaur', b' Grass')
 (3, b' Poison', b' Venusaur', b' Grass')
 (6, b' Flying', b' Charizard', b' Fire')
 (12, b' Flying', b' Butterfree', b' Bug')
 . . .]
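The call above discards the header and returns byte strings with a leading blank. As a hedged sketch, genfromtxt can instead keep the header as field names (names=True), strip surrounding whitespace (autostrip=True), and decode text (encoding="utf-8", available in recent NumPy releases), so that columns can be accessed by name. The SAMPLE string below is a hypothetical stand-in for pokemon.csv.

```python
import io

import numpy as np

# Hypothetical in-memory stand-in for the first rows of pokemon.csv.
SAMPLE = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
003, Poison, Venusaur, Grass
"""

data = np.genfromtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    names=True,        # use the first line as field names
    dtype=None,        # infer the type of each column
    encoding="utf-8",  # return str values instead of bytes
    autostrip=True,    # strip whitespace around each value
)

print(data.dtype.names)  # the column names taken from the header
print(data["name"])      # one column, selected by name
```

Note that the id column is inferred as an integer here, so the leading zeros of the Pokedex numbers are not preserved.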
JSON
JSON (JavaScript Object Notation) is a common format for exchanging data. Although it is derived from JavaScript, Python provides a library to parse JSON.
Parsing a JSON file using the json module
The first three records of the JSON file (pokemon.json) look like this:
[
    {
        "id": " 001",
        "typeTwo": " Poison",
        "name": " Bulbasaur",
        "type": " Grass"
    },
    {
        "id": " 002",
        "typeTwo": " Poison",
        "name": " Ivysaur",
        "type": " Grass"
    },
    {
        "id": " 003",
        "typeTwo": " Poison",
        "name": " Venusaur",
        "type": " Grass"
    },
    . . .
]
First, we need to import the json module and the pprint ("pretty-print") function from the pprint module:
import json
from pprint import pprint
Then, we open the pokemon.json file and parse its contents with the json.loads function:
with open("pokemon.json") as f:
    data = json.loads(f.read())
Finally, just print the result with the pprint function:
pprint(data)
Output:
[{'id': ' 001', 'name': ' Bulbasaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 002', 'name': ' Ivysaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 003', 'name': ' Venusaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 006', 'name': ' Charizard', 'type': ' Fire', 'typeTwo': ' Flying'},
 {'id': ' 012', 'name': ' Butterfree', 'type': ' Bug', 'typeTwo': ' Flying'},
 . . . ]
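The parsed result is an ordinary Python list of dictionaries, so it can be cleaned with the usual tools. The following is a minimal sketch, assuming every value is a string with a leading blank as in the sample above; the SAMPLE string and the clean helper are hypothetical and not part of the book's scripts. When reading from a file, json.load(f) can be used in place of json.loads(f.read()).

```python
import json

# Hypothetical in-memory stand-in for the first records of pokemon.json.
SAMPLE = """
[
    {"id": " 001", "typeTwo": " Poison", "name": " Bulbasaur", "type": " Grass"},
    {"id": " 002", "typeTwo": " Poison", "name": " Ivysaur", "type": " Grass"},
    {"id": " 003", "typeTwo": " Poison", "name": " Venusaur", "type": " Grass"}
]
"""


def clean(record):
    """Return a copy of the record with whitespace stripped from each value."""
    return {key: value.strip() for key, value in record.items()}


records = [clean(record) for record in json.loads(SAMPLE)]

# Build a lookup table from the numeric Pokedex id to the name.
name_by_id = {int(record["id"]): record["name"] for record in records}
print(name_by_id)
```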
XML
According to the World Wide Web Consortium (W3C):
"XML (Extensible Markup Language) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere."
The first three records of the XML file (pokemon.xml) look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<pokemon>
    <row>
        <id> 001</id>
        <typeTwo> Poison</typeTwo>
        <name> Bulbasaur</name>
        <type> Grass</type>
    </row>
    <row>
        <id> 002</id>
        <typeTwo> Poison</typeTwo>
        <name> Ivysaur</name>
        <type> Grass</type>
    </row>
    <row>
        <id> 003</id>
        <typeTwo> Poison</typeTwo>
        <name> Venusaur</name>
        <type> Grass</type>
    </row>
    . . .
</pokemon>
Parsing XML in Python using the XML module
First, we need to import the ElementTree module from the xml.etree package:
from xml.etree import ElementTree
Then, we open the pokemon.xml file and parse it with the ElementTree.parse function:
with open("pokemon.xml") as f:
    doc = ElementTree.parse(f)
Finally, we find each row element with the findall function and print its fields:
for node in doc.findall('row'):
    print("")
    print("id: {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name: {0}".format(node.find('name').text))
    print("type: {0}".format(node.find('type').text))
Output:

id: 001
typeTwo: Poison
name: Bulbasaur
type: Grass

id: 002
typeTwo: Poison
name: Ivysaur
type: Grass

id: 003
typeTwo: Poison
name: Venusaur
type: Grass
. . .
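Instead of printing each field, the rows can be collected into the same list-of-dictionaries shape that the JSON example produces. This is a minimal sketch, assuming every child of a row element holds plain text; the SAMPLE string is a hypothetical stand-in for pokemon.xml, and ElementTree.fromstring is used only so that the example does not need a file on disk.

```python
from xml.etree import ElementTree

# Hypothetical in-memory stand-in for the first records of pokemon.xml.
SAMPLE = """
<pokemon>
    <row>
        <id> 001</id>
        <typeTwo> Poison</typeTwo>
        <name> Bulbasaur</name>
        <type> Grass</type>
    </row>
    <row>
        <id> 002</id>
        <typeTwo> Poison</typeTwo>
        <name> Ivysaur</name>
        <type> Grass</type>
    </row>
</pokemon>
"""

root = ElementTree.fromstring(SAMPLE)

# Map each child tag of a row to its text, stripping the leading blank.
records = [
    {child.tag: child.text.strip() for child in row}
    for row in root.findall("row")
]

for record in records:
    print(record)
```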
YAML
YAML (YAML Ain't Markup Language) is a human-friendly data serialization format. It is not as popular as JSON or XML, but it was designed to map easily onto the data types common to most high-level languages. A Python parser called PyYAML is available from the PyPI repository, and its interface is very similar to that of the json module.
The first three records of the YAML file (pokemon.yaml) look like this:
Pokemon:
 - id : 001
   typeTwo : Poison
   name : Bulbasaur
   type : Grass
 - id : 002
   typeTwo : Poison
   name : Ivysaur
   type : Grass
 - id : 003
   typeTwo : Poison
   name : Venusaur
   type : Grass
 . . .
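To illustrate the similarity with the json module, here is a minimal sketch that parses such a document with PyYAML, assuming the package is installed (for example with pip install pyyaml). The SAMPLE string is a hypothetical stand-in for pokemon.yaml. The yaml.safe_load function is used because it only builds plain Python objects; with a real file, the open file object can be passed to it directly.

```python
import yaml  # provided by the third-party PyYAML package

# Hypothetical in-memory stand-in for the first records of pokemon.yaml.
SAMPLE = """
Pokemon:
  - id: 001
    typeTwo: Poison
    name: Bulbasaur
    type: Grass
  - id: 002
    typeTwo: Poison
    name: Ivysaur
    type: Grass
  - id: 003
    typeTwo: Poison
    name: Venusaur
    type: Grass
"""

data = yaml.safe_load(SAMPLE)

# The top-level mapping has a single key whose value is a list of records.
for record in data["Pokemon"]:
    print(record["name"], record["type"], record["typeTwo"])
```

Unquoted values such as 001 may be read as numbers rather than strings, so it is safer to quote them in the file if the leading zeros matter.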