Python: Advanced Predictive Analytics

Case 2 – reading a dataset using the open function of Python

pandas is a very robust and comprehensive library for reading, exploring, and manipulating a dataset. However, it might not give optimal performance with very big datasets, as it reads the entire dataset all at once and hence occupies the majority of the computer's memory. Instead, you can try one of Python's built-in file handling functions: open. With open, one can read the dataset line by line, or in chunks, by running a for loop over the rows and deleting each chunk from memory once it has been processed. Let us look at some use case examples of the open function.

Reading a dataset line by line

As you may be aware, while opening a file using the open function, we can specify a particular mode, that is, read, write, and so on. By default, the function opens a file in the read mode. This approach can be useful while reading a big dataset, as the data is read line by line (not all at once, unlike what pandas does). The dataset can also be read in chunks this way.

Let us now go ahead and open a file using the open function and count the number of rows and columns in the dataset:

data = open('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt', 'r')
cols = next(data).strip().split(',')   # read the header line and split it into column names
no_cols = len(cols)                    # the number of columns equals the length of the header

A couple of points about this snippet:

  • 'r' has been explicitly mentioned and hence the file will be opened in the read mode. To open it in the write mode, one needs to pass 'w' in place of 'r'.
  • The next function reads one line from the file, so the first call returns the header. The strip method removes all the trailing and leading blank spaces (including the newline character) from the line. The split method breaks the line into pieces separated by the argument provided to it; in this case, it is ','. Since cols now holds one entry per column, len(cols) gives the number of columns. See the short example after this list.
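
For instance, here is what strip and split do to a sample header line (the column names here are illustrative, not necessarily those of the actual file):

line = ' Account Length,VMail Message,Day Mins\n'
print(line.strip().split(','))
# ['Account Length', 'VMail Message', 'Day Mins']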

Finding the number of rows is a bit more tedious, but here lies the key trick to reading a huge file in chunks:

counter = 0        # will count the number of data rows

main_dict = {}     # maps each column name to the list of values in that column
for col in cols:
    main_dict[col] = []

Basically, we are doing the following two tasks in the preceding code snippet:

  • Defining a counter variable that is incremented by 1 for each line processed, so that at the end of the loop it holds the number of rows/lines in the dataset
  • Defining a dictionary called main_dict with the column names as its keys and the lists of values in the corresponding columns as its values

Now, we are all set to run a for loop over the lines in the dataset to determine the number of rows in the dataset:

for line in data:
    values = line.strip().split(',')
    for i in range(len(cols)):
        main_dict[cols[i]].append(values[i])
    counter += 1

print "The dataset has %d rows and %d columns" % (counter,no_cols)

The explanation of the code snippet is as follows:

  1. Running a for loop over the lines in the dataset and splitting each line into values by ','. These values are nothing but the entries contained in each column for that line (row).
  2. Running a second for loop over the columns for each line and appending the column values to the main_dict dictionary, which we defined in the previous snippet. So, for each key of the main_dict dictionary, all of that column's values are appended together. Each key of main_dict is a column name of the dataset, while the values of each key are the values in that column.
  3. Printing the number of rows and columns of the dataset, which are contained in counter and no_cols respectively.

The main_dict dictionary, in a way, contains all the information in the dataset; hence, it can be converted to a data frame. As we have already seen in this chapter, a dictionary can be converted to a data frame using the DataFrame method in pandas. Let us do that:

import pandas as pd
df = pd.DataFrame(main_dict)
print(df.head(5))
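
One thing to keep in mind: since every field was appended as a string, all the columns of this data frame will have the object dtype. A numeric column can be converted with pandas' to_numeric function; the column name below is illustrative, not necessarily one from the actual file:

df['Day Mins'] = pd.to_numeric(df['Day Mins'])   # convert one (hypothetical) column from strings to numbers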

For a large file, this process can be repeated after a certain number of lines, say 10,000 lines, so that the file is read in and processed in chunks, as shown in the following sketch.
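
A minimal sketch of this chunked approach, assuming a hypothetical process_chunk function that stands in for whatever work needs to be done on each batch of rows:

chunk_size = 10000                               # number of lines to hold in memory at a time

def process_chunk(chunk):
    # placeholder: compute statistics, write to a database, and so on
    print("processed %d rows" % len(chunk))

with open('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt', 'r') as data:
    cols = next(data).strip().split(',')         # consume the header line
    chunk = []
    for line in data:
        chunk.append(line.strip().split(','))
        if len(chunk) == chunk_size:
            process_chunk(chunk)
            chunk = []                           # free the processed rows from memory
    if chunk:                                    # process any leftover rows
        process_chunk(chunk)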

Changing the delimiter of a dataset

Earlier in this chapter, we said that juggling and managing delimiters is a great skill to master. Let us see one example of how we can change the delimiter of a dataset.

The Customer Churn Model.txt file has a comma (',') as its delimiter. It looks similar to the following screenshot:

Fig. 2.4: A chunk of Customer Churn Model.txt dataset with default delimiter comma (',')

Note that any special character can be a delimiter. Let us change the delimiter to a tab ('\t'):

infile = 'E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt'
outfile = 'E:/Personal/Learning/Datasets/Book/Tab Customer Churn Model.txt'
with open(infile) as infile1:
    with open(outfile, 'w') as outfile1:
        for line in infile1:
            fields = line.split(',')
            outfile1.write('\t'.join(fields))

This code snippet will generate a file called Tab Customer Churn Model.txt in the specified directory. The file will have a tab ('\t') delimiter and will look similar to the following screenshot:

Fig. 2.5: A chunk of Tab Customer Churn Model.txt with changed delimiter ('\t')

The code snippet can be explained as follows:

  1. Creating two variables called infile and outfile. The infile variable is the file whose delimiter we wish to change, and outfile is the one in which we will write the result after changing the delimiter.
  2. The infile is opened in the read mode, while outfile is opened in the write mode.
  3. Each line in the infile is split on the existing delimiter, that is, ',', and the resulting pieces are called fields. Each line will have several fields (equal to the number of columns); see the note after this list for a caveat about fields that contain commas.
  4. The lines of the outfile are created by joining the fields of each line with the new delimiter of our choice, that is, '\t'.
  5. The file is written into the directory specified in the definition of the outfile.
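
One caveat about this approach: a plain split(',') will mangle any field that itself contains a quoted comma. If your data may contain such fields, Python's standard csv module handles the quoting correctly. A minimal sketch, reusing the infile and outfile paths defined above:

import csv

with open(infile) as infile1:
    with open(outfile, 'w', newline='') as outfile1:
        reader = csv.reader(infile1)                     # parses quoted fields correctly
        writer = csv.writer(outfile1, delimiter='\t')    # writes tab-separated rows
        for row in reader:
            writer.writerow(row)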

As described earlier, the read_csv method can be used to read datasets that have a delimiter other than a comma. To demonstrate this, let us read the dataset with the '\t' delimiter that we just created:

import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Tab Customer Churn Model.txt', sep='\t')
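
As a quick sanity check, one can print the shape of the resulting data frame; it should match the row and column counts computed earlier:

print(data.shape)    # (number of rows, number of columns)
print(data.head())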