Practical Data Analysis (Second Edition)

Data sources

A data source is a term for any technology related to the extraction and storage of data. A data source can be anything from a simple text file to a big database. The raw data can come from observation logs, sensors, transactions, or user behavior.

A dataset is a collection of data, usually presented in tabular form. Each column represents a particular attribute, and each row corresponds to a given member of the data, as shown in the following screenshot.

In this section, we will take a look at the most common forms for data sources and datasets.

Tip

The data in the preceding screenshot is from the classical Weather dataset of the UC Irvine Machine Learning Repository:

http://archive.ics.uci.edu/ml/

A dataset represents a logical implementation of a data source. The common features of a dataset are:

  • Dataset characteristics (multivariate and univariate)
  • Number of instances
  • Area (life, business, and many more)
  • Attribute characteristics (real, categorical, and nominal)
  • Number of attributes
  • Associated tasks (classification or clustering)
  • Missing values? (yes or no)
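As a rough sketch, these features can be inspected programmatically. The following Python snippet (with a few invented rows in the spirit of the Weather dataset) counts instances and attributes and checks for missing values, using only the standard library:

```python
# A few illustrative rows in the spirit of the Weather dataset
# (values invented for the example; None marks a missing value).
dataset = [
    {"outlook": "sunny", "temperature": 85, "humidity": 85, "play": "no"},
    {"outlook": "rain", "temperature": 70, "humidity": 96, "play": "yes"},
    {"outlook": "overcast", "temperature": None, "humidity": 86, "play": "yes"},
]

num_instances = len(dataset)                 # number of instances (rows)
attributes = list(dataset[0].keys())         # attribute names (columns)
num_attributes = len(attributes)             # number of attributes
has_missing = any(v is None                  # missing values? (yes or no)
                  for row in dataset for v in row.values())

print(num_instances, num_attributes, has_missing)  # 3 4 True
```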

Open data

Open data is data that can be used, reused, and redistributed freely by anyone for any purpose. This is a short list of repositories and databases for open data:

Note

Other interesting sources of data are data mining and knowledge discovery competitions, such as the ACM KDD Cup or the Kaggle platform. In most cases, the datasets remain available even after the competition has closed.

Text files

Text files are commonly used to store data because they are easy to transform into different formats, and it is often easier to recover and continue processing their contents than with other formats. Large amounts of data arrive in text format from logs, sensors, e-mails, and transactions. There are several text file formats, such as CSV (comma-separated values), TSV (tab-separated values), XML (Extensible Markup Language), and JSON (JavaScript Object Notation); see the Data formats section.
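As a small illustration (the records shown are invented for the example), the Python standard library can parse both CSV and JSON text directly:

```python
import csv
import io
import json

# CSV: comma-separated rows with a header line
csv_text = "outlook,temperature\nsunny,85\nrain,70\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["outlook"])  # sunny

# The same kind of record expressed as JSON
json_text = '[{"outlook": "sunny", "temperature": 85}]'
records = json.loads(json_text)
print(records[0]["temperature"])  # 85
```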

Excel files

MS-Excel is probably the most used, and also the most underrated, data analysis tool. Excel offers features such as filtering and aggregation functions, and with Visual Basic for Applications (VBA) you can run SQL-like queries against sheets or against an external database:

Excel has some disadvantages: missing values are handled inconsistently, and there is no record of how an analysis was accomplished. The Analysis ToolPak can only work with one sheet at a time.

SQL databases

A database is an organized collection of data. Structured Query Language (SQL) is a database language for managing and manipulating data in Relational Database Management Systems (RDBMS). The Database Management System (DBMS) is responsible for maintaining the integrity and security of stored data, and for recovering information if the system fails. The SQL language is split into two subsets of instructions: the Data Definition Language (DDL) and the Data Manipulation Language (DML).

The data is organized in schemas (databases) and divided into tables related by logical relationships; we can retrieve the data by making SQL queries against the main schema, as shown in the following screenshot:

DDL allows us to create, delete, and alter database tables. We can also define keys to specify relationships between tables and implement constraints between database tables:

  • CREATE TABLE: This creates a new table
  • ALTER TABLE: This alters a table
  • DROP TABLE: This deletes a table

DML is a language which enables users to access and manipulate data:

  • SELECT: This is used to retrieve data from the database
  • INSERT INTO: This is used to insert new data into the database
  • UPDATE: This is used to modify data in the database
  • DELETE: This is used to delete data from the database
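The DDL and DML commands above can be tried out without installing a database server, using Python's built-in sqlite3 module. The table and column names below are invented for the example:

```python
import sqlite3

# In-memory database; everything disappears when the connection closes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# DDL: create a table
cur.execute("CREATE TABLE weather (day TEXT, outlook TEXT, temperature INTEGER)")

# DML: insert, update, and delete rows
cur.execute("INSERT INTO weather VALUES ('d1', 'sunny', 85)")
cur.execute("INSERT INTO weather VALUES ('d2', 'rain', 70)")
cur.execute("UPDATE weather SET temperature = 68 WHERE day = 'd2'")
cur.execute("DELETE FROM weather WHERE outlook = 'sunny'")

# DML: retrieve the remaining data
rows = cur.execute("SELECT day, temperature FROM weather").fetchall()
print(rows)  # [('d2', 68)]

# DDL: drop the table when done
cur.execute("DROP TABLE weather")
conn.close()
```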

NoSQL databases

Not only SQL (NoSQL) is a term used for several technologies where the nature of the data does not require a relational model. NoSQL technologies can work with huge quantities of data while offering higher availability, scalability, and performance.

Tip

See Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce, for extended examples of document store database MongoDB.

The most common types of NoSQL data stores are:

  • Document store: Data is stored and organized as a collection of documents. The schema is flexible, and each collection can handle any number of fields. For example, MongoDB uses BSON (a binary form of JSON) documents and CouchDB uses JSON documents.
  • Key-value store: Data is stored as key-value pairs without a predefined schema, and values are retrieved by their keys. Examples include Apache Cassandra, Dynamo, HBase, and Amazon SimpleDB.
  • Graph-based store: Data is stored in graph structures with nodes, edges, and properties, applying graph theory to store and retrieve data. This kind of database is excellent for representing social network relationships; examples include Neo4j, InfoGrid, and Horton.
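As a rough illustration of the document-store model (not of any particular product's API), the following Python sketch keeps a "collection" of JSON-style documents as plain dictionaries with no fixed schema; the collection, field names, and query helper are invented for the example:

```python
import json

# A "collection" is just a list of schema-free documents (dicts).
people = [
    {"_id": 1, "name": "Ada", "languages": ["Python", "SQL"]},
    {"_id": 2, "name": "Grace", "rank": "Rear Admiral"},  # different fields are fine
]

# A simple query: find the documents whose fields match the given criteria.
def find(collection, **criteria):
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

result = find(people, name="Grace")
print(json.dumps(result))  # the matching document, serialized as JSON
```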

For more information about NoSQL see also:

http://nosql-database.org/

Multimedia

The increasing number of mobile devices makes it a priority for data analysis to acquire the ability to extract semantic information from multimedia data sources. These data sources include directly perceivable media such as audio, images, and video. Some applications for these kinds of data sources are:

  • Content-based image retrieval
  • Content-based video retrieval
  • Movie and video classification
  • Face recognition
  • Speech recognition
  • Audio and music classification

In Chapter 5, Similarity-Based Image Retrieval, we present a similarity-based image search engine using Caltech-256, an image dataset with over 30,600 images.

Web scraping

When we want to obtain data, a good place to start is the Web. Web scraping refers to an application that processes the HTML of a web page to extract data for manipulation. Web scraping applications simulate a person viewing a website with a browser. In the following example, we want to get the current gold price from the website www.gold.org, as shown in the following screenshot:

Then, we need to inspect the Gold Spot Price element on the website, where we will find the HTML tag:

<td class="value" id="spotpriceCellAsk">1,573.85</td> 

We can observe the ID spotpriceCellAsk in the td tag; this is the element we will retrieve with the following Python code.
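Before turning to BeautifulSoup, the idea can be sketched with Python's built-in html.parser module; the following minimal parser (written for this example) pulls the cell's text out of the tag shown above:

```python
from html.parser import HTMLParser

# A minimal parser that captures the text of the td element with a given id.
class SpotPriceParser(HTMLParser):
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.capture = False
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Start capturing when we enter the td with the matching id
        if tag == "td" and dict(attrs).get("id") == self.target_id:
            self.capture = True

    def handle_data(self, data):
        # The first text chunk inside the matched td is the price
        if self.capture:
            self.value = data
            self.capture = False

html = '<td class="value" id="spotpriceCellAsk">1,573.85</td>'
parser = SpotPriceParser("spotpriceCellAsk")
parser.feed(html)
print(parser.value)  # 1,573.85
```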

Tip

For this example, we will use the BeautifulSoup library, version 4. On Linux, we need to open a terminal and execute the following command to install it from the system package manager:

$ sudo apt-get install python3-bs4

On Windows, we need to download the library from:

http://crummy.com/software/BeautifulSoup/bs4/download/

To install it, just execute in the command line:

$ python setup.py install

First, we need to import the BeautifulSoup, urllib.request, time, and datetime libraries:

from bs4 import BeautifulSoup 
import urllib.request 
from time import sleep 
from datetime import datetime 

Then, we use the getGoldPrice function to retrieve the current price from the website. Inside the function, we define the url, make the request, and read the entire page:

req = urllib.request.urlopen(url) 
page = req.read() 

Finally, we use BeautifulSoup to parse the page into a navigable tree and ask for the td element with the ID spotpriceCellAsk:

scraping = BeautifulSoup(page, "html.parser") 
price = scraping.findAll("td", attrs={"id": "spotpriceCellAsk"})[0].text 

Now, we return the variable price with the current gold price; this value changes every minute on the website. In this case, we want all the values over an hour, so we call the getGoldPrice function in a for loop 60 times, making the script wait 59 seconds between calls:

for x in range(0, 60): 
    price = getGoldPrice() 
    sleep(59) 

Finally, we save the result in a file named goldPrice.out and include the current date and time in the format HH:MM:SS[AM or PM] (for example, 11:35:42PM), separated by a comma:

with open("goldPrice.out", "w") as f: 
    for x in range(0, 60): 
        sNow = datetime.now().strftime("%I:%M:%S%p") 
        f.write("{0}, {1}\n".format(sNow, getGoldPrice())) 
        sleep(59) 

The function datetime.now().strftime creates a string representing the time under the control of an explicit format string, "%I:%M:%S%p", where %I represents the hour as a zero-padded decimal number from 01 to 12, %M represents minutes as a decimal number from 00 to 59, %S represents seconds as a decimal number from 00 to 59, and %p represents either AM or PM.
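To see these directives in action, the following short sketch formats a fixed timestamp (the date chosen is arbitrary, so the output is reproducible):

```python
from datetime import datetime

# A fixed timestamp: 23:35:42 on an arbitrary date
t = datetime(2017, 1, 1, 23, 35, 42)

# %I = 12-hour clock, %M = minutes, %S = seconds, %p = AM/PM
stamp = t.strftime("%I:%M:%S%p")
print(stamp)  # 11:35:42PM
```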

A list of complete format directives can be found at the following link:

http://docs.python.org/3.4/library/datetime.html

Here is the full script:

from bs4 import BeautifulSoup 
import urllib.request 
from time import sleep 
from datetime import datetime 
def getGoldPrice(): 
    url = "http://gold.org" 
    req = urllib.request.urlopen(url) 
    page = req.read() 
    scraping = BeautifulSoup(page, "html.parser") 
    price = scraping.findAll("td", attrs={"id": "spotpriceCellAsk"})[0].text 
    return price 
 
with open("goldPrice.out","w") as f: 
    for x in range(0,60): 
        sNow = datetime.now().strftime("%I:%M:%S%p") 
        f.write("{0}, {1}\n".format(sNow, getGoldPrice())) 
        sleep(59) 

Tip

You can download the full script (WebScraping.py) from the author's GitHub repository:

https://github.com/hmcuesta/PDA_Book/tree/master/Chapter2

The output file goldPrice.out will look like this:

11:35:02AM, 1481.25 
11:36:03AM, 1481.26 
11:37:02AM, 1481.28 
11:38:04AM, 1481.25 
11:39:03AM, 1481.22 
...