Python:Advanced Predictive Analytics
上QQ阅读APP看书,第一时间看更新

The read_csv method

The name of the method doesn't unveil its full might. It is a kind of misnomer in the sense that it makes us think that it can be used to read only CSV files, which is not the case. Various kinds of files, including .txt files having delimiters of various kinds can be read using this method.

Let's learn a little bit more about the various arguments of this method in order to assess its true potential. Although the read_csv method has close to 30 arguments, the ones listed in the next section are the ones that are most commonly used.

The general form of a read_csv statement is something similar to:

pd.read_csv(filepath, sep=', ', dtype=None, header=None, skiprows=None, index_col=None, skip_blank_lines=TRUE, na_filter=TRUE)

Now, let us understand the significance and usage of each of these arguments one by one:

  • filepath: filepath is the complete address of the dataset or file that you are trying to read. The complete address includes the address of the directory in which the file is stored and the full name of the file with its extension. Remember to use a forward slash (/) in the directory address. Later in this chapter, we will see that the filepath can be a URL as well.
  • sep: sep allows us to specify the delimiter for the dataset to read. By default, the method assumes that the delimiter is a comma (,). The various other delimiters that are commonly used are blank spaces ( ), tab (|), and are called space delimiter or tab demilited datasets. This argument of the method also takes regular expressions as a value.
  • dtype: Sometimes certain columns of the dataset need to be formatted to some other type, in order to apply certain operations successfully. One example is the date variables. Very often, they have a string type which needs to be converted to date type before we can use them to apply date-related operations. The dtype argument is to specify the data type of the columns of the dataset. Suppose, two columns a and b, of the dataset need to be formatted to the types int32 and float64; it can be achieved by passing {'a':np.float64, 'b'.np.int32} as the value of dtype. If not specified, it will leave the columns in the same format as originally found.
  • header: The value of a header argument can be an integer or a list. Most of the times, datasets have a header containing the column names. The header argument is used to specify which row to be used as the header. By default, the first row is the header and it can be represented as header =0. If one doesn't specify the header argument, it is as good as specifying header=0. If one specifies header=None, the method will read the data without the header containing the column names.
  • names: The column names of a dataset can be passed off as a list using this argument. This argument will take lists or arrays as its values. This argument is very helpful in cases where there are many columns and the column names are available as a list separately. We can pass the list of column names as a value of this argument and the column names in the list will be applied.
  • skiprows: The value of a skiprows argument can be an integer or a list. Using this argument, one can skip a certain number of rows specified as the value of this argument in the read data, for example skiprows=10 will read in the data from the 11th row and the rows before that will be ignored.
  • index_col: The value of an index_col argument can be an integer or a sequence. By default, no row labels will be applied. This argument allows one to use a column, as the row labels for the rows in a dataset.
  • skip_blank_lines: The value of a skip_blank_lines argument takes Boolean values only. If its value is specified as True, the blank lines are skipped rather than interpreting them as NaN (not allowed/missing values; we shall discuss them in detail soon) values. By default, its value is set to False.
  • na_filter: The value of a na-filter argument takes Boolean values only. It detects the markers for missing values (empty strings and NA values) and removes them if set to False. It can make a significant difference while importing large datasets.