Case 2 – reading a dataset using Python's open function
pandas is a very robust and comprehensive library for reading, exploring, and manipulating a dataset. However, it might not give optimal performance with very big datasets, as it reads the entire dataset all at once and occupies the majority of the computer's memory. Instead, you can try Python's built-in file handling function, open. With it, one can read the dataset line by line or in chunks by running a for loop over the rows, deleting each chunk from memory once it has been processed. Let us look at some use-case examples of the open function.
Reading a dataset line by line
As you might be aware, while opening a file using the open function, we can specify a particular mode, such as read or write. By default, the function opens a file in the read mode. This can be useful while reading a big dataset, as the file is read line by line, not all at once, unlike what pandas does. You can also read the dataset in chunks using this approach.
Let us now go ahead and open a file using the open function and count the number of rows and columns in the dataset:
data = open('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt', 'r')
cols = next(data).strip().split(',')
no_cols = len(cols)
A couple of points about this snippet:
- 'r' has been explicitly mentioned and hence the file will be opened in the read mode. To open it in the write mode, one needs to pass 'w' in place of 'r'.
- The next function reads one line from the file, so the first call returns the header and leaves the file pointer at the first data row. The strip method is used to remove all the trailing and leading blank spaces from the line. The split method breaks the line down into chunks separated by the argument provided to it; in this case, ','. The number of columns, no_cols, is then simply the length of the resulting list.
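For instance, a header line like the following would be parsed as shown (the line itself is a made-up illustration, not the actual header of the dataset):

header = 'Account Length,Phone,Churn?\n'   # hypothetical header line
fields = header.strip().split(',')         # ['Account Length', 'Phone', 'Churn?']
print(len(fields))                         # 3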
Finding the number of rows is a bit more tedious, but here lies the key trick to reading a huge file in chunks:
counter = 0
main_dict = {}
for col in cols:
    main_dict[col] = []
Basically, we are doing the following two tasks in the preceding code snippet:
- Defining a counter variable that will increment its value by 1 on passing each line, and hence will count the number of rows/lines at the end of the loop
- Defining a dictionary called main_dict with the column names as its keys and the values in the columns as the values of the dictionary (see the illustration after this list)
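To make this concrete, here is what main_dict would look like after two rows of a hypothetical three-column file have been processed by the loop in the next snippet (the column names and values are made up for illustration):

main_dict = {
    'Account Length': ['128', '107'],
    'Phone': ['382-4657', '371-7191'],
    'Churn?': ['False.', 'False.'],
}

Note that every value is stored as a string; the open function performs no type inference, so any numeric conversion has to be done separately.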
Now, we are all set to run a for loop over the lines of the file to determine the number of rows in the dataset:
for line in data:
    values = line.strip().split(',')
    for i in range(len(cols)):
        main_dict[cols[i]].append(values[i])
    counter += 1
print("The dataset has %d rows and %d columns" % (counter, no_cols))
The explanation of the code snippet is as follows:
- Running a for loop over the lines in the dataset and splitting each line into values by ','. These values are nothing but the values contained in each column for that line (row).
- Running a second for loop over the columns for each line and appending the column values to the main_dict dictionary, which we defined in the previous step. So, for each key of the main_dict dictionary, all the column values are appended together. Each key of main_dict becomes a column name of the dataset, while the values of each key in the dictionary are the values in that column.
- Printing the number of rows and columns of the dataset, which are contained in counter and no_cols respectively.
The main_dict dictionary, in a way, contains all the information in the dataset; hence, it can be converted to a data frame. As we have already read in this chapter, a dictionary can be converted to a data frame using the DataFrame method in pandas. Let us do that:
import pandas as pd
df = pd.DataFrame(main_dict)
print(df.head(5))
For a large file, this process can be repeated after a certain number of lines, say 10,000, so that the file is read and processed in chunks.
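A minimal sketch of this idea follows; it assumes a freshly opened file object data (the one above has already been read to its end) and the cols list from the preceding snippets. The chunk size of 10,000 is arbitrary, and the processing step is a placeholder:

import pandas as pd

chunk_size = 10000
chunk_dict = {col: [] for col in cols}   # accumulates one chunk's values
counter = 0
for line in data:
    values = line.strip().split(',')
    for i in range(len(cols)):
        chunk_dict[cols[i]].append(values[i])
    counter += 1
    if counter % chunk_size == 0:
        df_chunk = pd.DataFrame(chunk_dict)
        # ... process df_chunk here, then discard it ...
        chunk_dict = {col: [] for col in cols}   # free the processed chunk
if chunk_dict[cols[0]]:                          # the last, possibly partial, chunk
    df_chunk = pd.DataFrame(chunk_dict)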
Changing the delimiter of a dataset
Earlier in this chapter, we said that juggling and managing delimiters is a great skill to master. Let us see one example of how we can change the delimiter of a dataset.
The Customer Churn Model.txt file has a comma (',') as its delimiter. It looks similar to the following screenshot:
Fig. 2.4: A chunk of Customer Churn Model.txt dataset with default delimiter comma (',')
Note that any character can serve as a delimiter. Let us change the delimiter to a tab ('\t'):
infile = 'E:/Personal/Learning/Datasets/Book/Customer Churn Model.txt'
outfile = 'E:/Personal/Learning/Datasets/Book/Tab Customer Churn Model.txt'
with open(infile) as infile1:
    with open(outfile, 'w') as outfile1:
        for line in infile1:
            fields = line.split(',')
            outfile1.write('\t'.join(fields))
This code snippet will generate a file called Tab Customer Churn Model.txt in the specified directory. The file will have a tab ('\t') delimiter and will look similar to the following screenshot:
Fig. 2.5: A chunk of the Tab Customer Churn Model.txt dataset with the changed delimiter ('\t')
The code snippet can be explained as follows:
- Creating two variables called infile and outfile. The infile variable holds the path of the file whose delimiter we wish to change, and outfile is the one to which we will write the result after changing the delimiter.
- The infile is opened in the read mode, while the outfile is opened in the write mode.
- The lines in the infile are split based on the existing delimiter, that is ',', and the chunks are called fields. Each line will have several fields (equal to the number of columns).
- The lines in the outfile are created by joining the fields of each line, separated by the new delimiter of our choice, that is '\t'.
- The file is written into the directory specified in the definition of outfile.
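One caveat, not covered in the snippet above: a plain split(',') will break if any field itself contains a comma inside quotes. If your data may contain such quoted fields, Python's standard csv module handles them correctly. A minimal sketch, assuming the same infile and outfile paths as before:

import csv

with open(infile) as infile1:
    with open(outfile, 'w', newline='') as outfile1:
        reader = csv.reader(infile1)                  # parses quoted commas correctly
        writer = csv.writer(outfile1, delimiter='\t')
        for fields in reader:
            writer.writerow(fields)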
To demonstrate this, the read_csv method, as described earlier, can be used to read datasets that have a delimiter other than a comma. Let us try to read the dataset with the '\t' delimiter that we just created:
import pandas as pd
data = pd.read_csv('E:/Personal/Learning/Datasets/Book/Tab Customer Churn Model.txt', sep='\t')
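To confirm that the tab-delimited file has been parsed correctly, a quick sanity check on the resulting data frame can be done, for example:

print(data.shape)     # should match the row and column counts found earlier
print(data.head(5))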