The read_csv method
The name of the method doesn't reveal its full might. It is something of a misnomer, in the sense that it makes us think it can be used to read only CSV files, which is not the case. Various kinds of files, including .txt files having delimiters of various kinds, can be read using this method.
Let's learn a little more about the various arguments of this method in order to assess its true potential. Although the read_csv method has close to 30 arguments, the ones listed in the next section are the most commonly used.
The general form of a read_csv statement is something like this:
pd.read_csv(filepath, sep=',', dtype=None, header='infer', skiprows=None, index_col=None, skip_blank_lines=True, na_filter=True)
Now, let us understand the significance and usage of each of these arguments one by one:
filepath: filepath is the complete address of the dataset or file that you are trying to read. The complete address includes the address of the directory in which the file is stored and the full name of the file with its extension. Remember to use a forward slash (/) in the directory address. Later in this chapter, we will see that the filepath can be a URL as well.
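For instance, a minimal sketch (the path and URL here are hypothetical):
import pandas as pd
# Read from a local path; note the forward slashes in the directory address
df = pd.read_csv('C:/data/sample.csv')
# The filepath can be a URL as well (hypothetical address)
df = pd.read_csv('http://example.com/data/sample.csv')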
sep: sep allows us to specify the delimiter for the dataset to be read. By default, the method assumes that the delimiter is a comma (,). The other commonly used delimiters are the blank space ( ) and the tab (\t); datasets using them are called space-delimited or tab-delimited datasets. This argument of the method also takes regular expressions as a value.
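As a sketch, reading a hypothetical tab-delimited .txt file would look like this:
import pandas as pd
# '\t' tells the method that the columns are separated by tabs
df = pd.read_csv('C:/data/sample.txt', sep='\t')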
dtype: Sometimes certain columns of the dataset need to be formatted to some other type in order to apply certain operations successfully. One example is date variables. Very often, they have a string type, which needs to be converted to a date type before we can apply date-related operations on them. The dtype argument is used to specify the data type of the columns of the dataset. Suppose two columns, a and b, of the dataset need to be formatted to the types float64 and int32; this can be achieved by passing {'a': np.float64, 'b': np.int32} as the value of dtype. If not specified, it will leave the columns in the same format as originally found.
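A minimal sketch, assuming a hypothetical file with columns a and b:
import numpy as np
import pandas as pd
# Force column 'a' to float64 and column 'b' to int32 while reading
df = pd.read_csv('C:/data/sample.csv', dtype={'a': np.float64, 'b': np.int32})
print(df.dtypes)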
header: The value of the header argument can be an integer or a list. Most of the time, datasets have a header containing the column names. The header argument is used to specify which row is to be used as the header. By default, the first row is the header, and this can be represented as header=0. If one doesn't specify the header argument, it is as good as specifying header=0. If one specifies header=None, the method will read the data without a header containing the column names.
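As a sketch (the file names here are hypothetical):
import pandas as pd
# The first row is used as the header; this is the default behaviour
df = pd.read_csv('C:/data/sample.csv', header=0)
# No header row in the file; pandas assigns the integer column names 0, 1, 2, ...
df = pd.read_csv('C:/data/sample_no_header.csv', header=None)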
names: The column names of a dataset can be passed as a list using this argument. This argument takes lists or arrays as its values. It is very helpful in cases where there are many columns and the column names are available as a list separately. We can pass the list of column names as a value of this argument, and the column names in the list will be applied.
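A minimal sketch, assuming a hypothetical headerless file and made-up column names:
import pandas as pd
column_names = ['id', 'age', 'income']
# header=None because the file itself carries no header row
df = pd.read_csv('C:/data/sample_no_header.csv', header=None, names=column_names)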
skiprows: The value of the skiprows argument can be an integer or a list. Using this argument, one can skip the number of rows specified as its value at the start of the read data; for example, skiprows=10 will read in the data from the 11th row onwards, and the rows before that will be ignored.
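For example (hypothetical file):
import pandas as pd
# Skip the first 10 rows; reading starts from the 11th row
df = pd.read_csv('C:/data/sample.csv', skiprows=10)
# A list skips exactly those row positions, here the 1st and 3rd rows of the file
df = pd.read_csv('C:/data/sample.csv', skiprows=[0, 2])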
index_col: The value of the index_col argument can be an integer or a sequence. By default, no row labels are applied. This argument allows one to use a column as the row labels for the rows of a dataset.
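As a sketch (the file and the column name id are hypothetical):
import pandas as pd
# Use the first column of the file as the row labels
df = pd.read_csv('C:/data/sample.csv', index_col=0)
# A column name works as well
df = pd.read_csv('C:/data/sample.csv', index_col='id')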
skip_blank_lines: The skip_blank_lines argument takes Boolean values only. If its value is set to True, the blank lines are skipped rather than being interpreted as NaN (Not a Number; the marker pandas uses for missing values, which we shall discuss in detail soon) values. By default, its value is set to True.
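A minimal sketch (hypothetical file):
import pandas as pd
# Keep the blank lines and import them as rows of NaN values
df = pd.read_csv('C:/data/sample.csv', skip_blank_lines=False)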
na_filter: The na_filter argument takes Boolean values only. If set to True (the default), it detects the markers for missing values (empty strings and NA values) and converts them to NaN. Setting it to False switches this detection off, which can make a significant performance difference while importing large datasets.