Practical Data Analysis (Second Edition)

Data formats

When we work with data for human consumption, the easiest way to store it is in text files. In this section, we present parsing examples for the most common text formats: CSV, JSON, XML, and YAML. These examples will be very helpful in the following chapters.

Tip

The dataset used for these examples is a list of Pokemon by National Pokedex number, obtained from:

http://bulbapedia.bulbagarden.net/

All the scripts and dataset files can be found in the author's GitHub repository:

https://github.com/hmcuesta/PDA_Book/tree/master/Chapter3

CSV (comma-separated values) is a very simple and common open format for table-like data, which most data analysis tools can import and export. CSV is a plain text format: the file is a sequence of characters and contains no binary-encoded data that would need further interpretation.

There are many ways to parse a CSV file from Python, and here we will discuss two: the csv module from the standard library and the genfromtxt function from NumPy.

The first eight records of the CSV file (pokemon.csv) look like this:

 id, typeTwo, name, type 
 001, Poison, Bulbasaur, Grass 
 002, Poison, Ivysaur, Grass 
 003, Poison, Venusaur, Grass 
 006, Flying, Charizard, Fire 
 012, Flying, Butterfree, Bug 
 013, Poison, Weedle, Bug 
 014, Poison, Kakuna, Bug 
 015, Poison, Beedrill, Bug 
. . . 

Parsing a CSV file with the csv module

First, we need to import the csv module:

import csv 

Then, we open the pokemon.csv file and parse it with the csv.reader(f) function:

with open("pokemon.csv", newline="") as f:
    data = csv.reader(f)
    # Iterate over the reader; each line is a list of strings
    for line in data:
        print(" id: {0} , typeTwo: {1}, name:  {2}, type: {3}"
              .format(line[0], line[1], line[2], line[3]))

Output:

id: id , typeTwo: typeTwo, name: name, type: type
id: 001 , typeTwo: Poison, name: Bulbasaur, type: Grass
id: 002 , typeTwo: Poison, name: Ivysaur, type: Grass
id: 003 , typeTwo: Poison, name: Venusaur, type: Grass
id: 006 , typeTwo: Flying, name: Charizard, type: Fire
. . .
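The reader above returns each row as a list, so fields are addressed by position and the header row is treated like any other record. As a minimal sketch of an alternative, csv.DictReader uses the header row as keys, and skipinitialspace=True drops the blank that follows each comma. The read_pokemon helper and the inline SAMPLE_CSV string are assumptions made for illustration (the sample copies a few rows from the listing above so that the code runs without pokemon.csv):

```python
import csv
import io

# Inline copy of a few rows from pokemon.csv, so the sketch runs on its own.
SAMPLE_CSV = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""


def read_pokemon(f):
    """Return the rows of an open CSV file as a list of dictionaries."""
    # The header row supplies the keys; skipinitialspace=True removes
    # the blank that follows each comma in this dataset.
    reader = csv.DictReader(f, skipinitialspace=True)
    return [dict(row) for row in reader]


rows = read_pokemon(io.StringIO(SAMPLE_CSV))
for row in rows:
    print("id: {id}, typeTwo: {typeTwo}, name: {name}, type: {type}".format(**row))
```

With the real file, the same helper could be called as read_pokemon(open("pokemon.csv", newline="")) inside a with block. All values remain strings, so the leading zeros of the id column are preserved.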

Parsing a CSV file using NumPy

First, we need to import the NumPy library:

import numpy as np 

NumPy provides the genfromtxt function, which we call here with four arguments. First, we provide the name of the file, pokemon.csv. Then, we skip the first line, which is the header (skip_header=1). Next, we set the data type (dtype) to None so that NumPy infers the type of each column from its contents. Finally, we define the comma as the delimiter:

data = np.genfromtxt("pokemon.csv",
                     skip_header=1,
                     dtype=None,
                     delimiter=',')

Then, just print the result.

print(data) 

Output:

[(1, b' Poison', b' Bulbasaur', b' Grass')
 (2, b' Poison', b' Ivysaur', b' Grass')
 (3, b' Poison', b' Venusaur', b' Grass')
 (6, b' Flying', b' Charizard', b' Fire')
 (12, b' Flying', b' Butterfree', b' Bug')
 . . .]
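As a variation on the call above, the following sketch keeps the header instead of skipping it. It assumes a reasonably recent NumPy release that supports the encoding argument of genfromtxt; the inline SAMPLE_CSV string is an assumption that copies a few rows from the listing so that the code runs without pokemon.csv. With names=True the header row becomes the field names of a structured array, and autostrip=True removes the blank after each comma:

```python
import io

import numpy as np

# Inline copy of a few rows from pokemon.csv, so the sketch runs on its own.
SAMPLE_CSV = """id, typeTwo, name, type
001, Poison, Bulbasaur, Grass
002, Poison, Ivysaur, Grass
006, Flying, Charizard, Fire
"""

data = np.genfromtxt(
    io.StringIO(SAMPLE_CSV),
    delimiter=",",
    names=True,        # take the field names from the header row
    dtype=None,        # infer a type for each column
    autostrip=True,    # drop the blank that follows each comma
    encoding="utf-8",  # read text columns as str instead of bytes
)

print(data.dtype.names)   # field names taken from the header
print(data["name"])       # one column, selected by field name

# Boolean indexing on a column selects the matching records.
fire = data[data["type"] == "Fire"]
print(fire["name"])
```

Because the id column is inferred as an integer, its leading zeros are lost (001 becomes 1); if they matter, an explicit string dtype would have to be supplied for that column.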

JSON

JSON (JavaScript Object Notation) is a common format for exchanging data. Although it is derived from JavaScript, Python provides the json module in its standard library to parse it.

Parsing a JSON file using the json module

The first three records of the JSON file (pokemon.json) look like this:

 [ 
    { 
        "id": " 001", 
        "typeTwo": " Poison", 
        "name": " Bulbasaur", 
        "type": " Grass" 
    }, 
    { 
        "id": " 002", 
        "typeTwo": " Poison", 
        "name": " Ivysaur", 
        "type": " Grass" 
    }, 
    { 
        "id": " 003", 
        "typeTwo": " Poison", 
        "name": " Venusaur", 
        "type": " Grass" 
    }, 
. . .] 

First, we need to import the json module and the pprint ("pretty-print") module:

import json 
from pprint import pprint 

Then, we open the pokemon.json file, read its contents, and parse them with the json.loads function:

with open("pokemon.json") as f: 
    data = json.loads(f.read()) 

Finally, just print the result with the pprint function:

pprint(data) 

Output:

[{'id': ' 001', 'name': ' Bulbasaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 002', 'name': ' Ivysaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 003', 'name': ' Venusaur', 'type': ' Grass', 'typeTwo': ' Poison'},
 {'id': ' 006', 'name': ' Charizard', 'type': ' Fire', 'typeTwo': ' Flying'},
 {'id': ' 012', 'name': ' Butterfree', 'type': ' Bug', 'typeTwo': ' Flying'},
 . . . ]
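Every value in this dataset carries a leading blank, which is easy to overlook when comparing strings later. The following sketch shows one possible way to clean the records while loading them. The load_pokemon helper and the inline SAMPLE_JSON string are assumptions made for illustration (the sample copies two records from the listing above), and json.load is used to read directly from the open file, which is equivalent to json.loads(f.read()):

```python
import io
import json

# Inline copy of two records from pokemon.json, so the sketch runs on its own.
SAMPLE_JSON = """[
    {"id": " 001", "typeTwo": " Poison", "name": " Bulbasaur", "type": " Grass"},
    {"id": " 006", "typeTwo": " Flying", "name": " Charizard", "type": " Fire"}
]"""


def load_pokemon(f):
    """Parse a JSON array of records and strip blanks from every value."""
    records = json.load(f)  # reads and parses the open file in one step
    return [
        {key: value.strip() for key, value in record.items()}
        for record in records
    ]


pokemon = load_pokemon(io.StringIO(SAMPLE_JSON))

# json.dumps with indent is an alternative to pprint for readable output.
print(json.dumps(pokemon, indent=4))
```

The parsed result is an ordinary list of dictionaries, so the usual Python tools apply, for example [p["name"] for p in pokemon if p["type"] == "Fire"].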

XML

According to the World Wide Web Consortium (W3C):

"XML (Extensible Markup Language) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere."

The first three records of the XML file (pokemon.xml) look like this:

<?xml version="1.0" encoding="UTF-8" ?> 
<pokemon> 
    <row> 
        <id> 001</id> 
        <typeTwo> Poison</typeTwo> 
        <name> Bulbasaur</name> 
        <type> Grass</type> 
    </row> 
    <row> 
        <id> 002</id> 
        <typeTwo> Poison</typeTwo> 
        <name> Ivysaur</name> 
        <type> Grass</type> 
    </row> 
    <row> 
        <id> 003</id> 
        <typeTwo> Poison</typeTwo> 
        <name> Venusaur</name> 
        <type> Grass</type> 
    </row> 
. . . 
</pokemon> 

Parsing XML in Python using the xml module

First, we need to import the ElementTree module from the xml.etree package:

from xml.etree import ElementTree 

Then, we open the pokemon.xml file and with the ElementTree.parse function we parse the file:

with open("pokemon.xml") as f: 
    doc = ElementTree.parse(f) 

Finally, we find every row element with the findall function and print its fields:

for node in doc.findall('row'):
    print("")
    print("id: {0}".format(node.find('id').text))
    print("typeTwo: {0}".format(node.find('typeTwo').text))
    print("name: {0}".format(node.find('name').text))
    print("type: {0}".format(node.find('type').text))

Output:

id: 001
typeTwo: Poison
name: Bulbasaur
type: Grass

id: 002
typeTwo: Poison
name: Ivysaur
type: Grass

id: 003
typeTwo: Poison
name: Venusaur
type: Grass
. . .
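The loop above names each child element explicitly. As a minimal sketch of a more generic approach, each row can be turned into a dictionary by iterating over its children, which gives the same structure as the parsed JSON records. The rows_to_dicts helper and the inline SAMPLE_XML string are assumptions made for illustration (the sample copies two rows from the listing above and omits the XML declaration for brevity):

```python
from xml.etree import ElementTree

# Inline copy of two rows from pokemon.xml, so the sketch runs on its own.
SAMPLE_XML = """<pokemon>
    <row>
        <id> 001</id>
        <typeTwo> Poison</typeTwo>
        <name> Bulbasaur</name>
        <type> Grass</type>
    </row>
    <row>
        <id> 002</id>
        <typeTwo> Poison</typeTwo>
        <name> Ivysaur</name>
        <type> Grass</type>
    </row>
</pokemon>"""


def rows_to_dicts(root):
    """Map each <row> element to a dictionary of tag name -> stripped text."""
    return [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.findall("row")
    ]


# fromstring parses a string and returns the root element (<pokemon>).
root = ElementTree.fromstring(SAMPLE_XML)
records = rows_to_dicts(root)
for record in records:
    print(record)
```

With the file on disk, ElementTree.parse("pokemon.xml").getroot() would supply the same kind of root element.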

YAML

YAML (YAML Ain't Markup Language) is a human-friendly data serialization format. It's not as popular as JSON or XML, but it was designed to map easily to the data types common to most high-level languages. A Python parser implementation called PyYAML is available in the PyPI repository, and its interface is very similar to that of the json module.

The first three records of the YAML file (pokemon.yaml) look like this:

Pokemon:
  - id      : 001
    typeTwo : Poison
    name    : Bulbasaur
    type    : Grass
  - id      : 002
    typeTwo : Poison
    name    : Ivysaur
    type    : Grass
  - id      : 003
    typeTwo : Poison
    name    : Venusaur
    type    : Grass
. . .