Data preprocessing and data analysis
In this section, we will cover data preprocessing and data analysis. As part of data preprocessing, we prepare our training dataset. You may be wondering what kind of data preparation is needed, considering we already have the data. The reason is that we have two different, independent datasets, so we need to merge the DJIA dataset and the NYTimes news article dataset in order to get meaningful insights from them. Once we have prepared our training dataset, we can train models on it using different machine learning (ML) algorithms.
Now let's start the coding to prepare the training dataset. We will be using numpy, csv, json, and pandas as our dependency libraries. Here, our code is divided into two parts. First, we will prepare the DJIA index dataset, and then we will move to the next part, which is preparing the NYTimes news article dataset. During the preparation of the training dataset, we will code the basic data analysis steps as well.
Preparing the DJIA training dataset
You can see the code snippet in the following screenshot. You can find the code at this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
As you can see in the preceding code snippet, we read the csv file that we downloaded from the Yahoo Finance page earlier. After that, we convert the data into a list format and separate the header from the actual data. Once we have the data in list format, we convert it into a numpy array. We select only three columns from the DJIA dataset, as follows:
- Date
- Close price
- Adj close price
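The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the function name load_djia_columns, the column header names, and the sample rows are assumptions made up for the example, and an in-memory string stands in for the downloaded csv file.

```python
import csv
import io

import numpy as np

# Made-up sample rows; the header layout is assumed, not taken from the real file.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2007-01-03,100.0,102.0,99.0,101.0,101.0,1000
2007-01-04,101.0,103.0,100.0,102.5,102.5,1100
2007-01-05,102.5,103.5,98.0,99.0,99.0,1200
"""


def load_djia_columns(handle):
    """Read csv rows, split header from data, keep Date, Close, and Adj Close."""
    rows = list(csv.reader(handle))      # the whole file as a list of lists
    header, data = rows[0], rows[1:]     # separate the header from the data
    data = np.array(data)                # 2-D numpy array of strings
    wanted = ["Date", "Close", "Adj Close"]
    col_idx = [header.index(name) for name in wanted]
    return wanted, data[:, col_idx]


columns, selected = load_djia_columns(io.StringIO(SAMPLE_CSV))
print(columns)
print(selected)
```

With a real file, you would pass an open file handle instead of the io.StringIO object.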
You may have one question in mind: why have we considered only close price and Adj close price from the DJIA csv file? Let me clarify: the open price is usually close to the previous day's close price, so we haven't considered it. We haven't considered the high price and low price because we don't know at which particular timestamp these high and low prices occurred. For the first iteration, it is quite complicated to predict when the stock index reaches a high or low value, so, in the meantime, we ignore these two columns. We are mainly interested in the overall trend for the DJIA index. If we figure out the trend precisely, we can predict the high and low price values later on. Here, we restrict our goal to predicting the closing prices for the DJIA index for future trading days.
Now back to the coding part: we build the pandas dataframe in such a way that the date column acts as the index column, and close price and adj close price are the two other columns of the dataset. The dataframe is defined as the df variable in the code snippet given in Figure 2.5. You can see the output of dataframe df in the following figure:
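As a rough sketch of this step, the snippet below builds such a dataframe from a numpy array of strings. The array contents are made-up sample values, and the column labels "close" and "adj close" are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up rows in the order: date, close price, adj close price.
selected = np.array([
    ["2007-01-03", "101.0", "101.0"],
    ["2007-01-04", "102.5", "102.5"],
    ["2007-01-05", "99.0", "99.0"],
])

# The date column becomes the index; the two price columns become float columns.
df = pd.DataFrame(
    selected[:, 1:],
    columns=["close", "adj close"],
    index=pd.to_datetime(selected[:, 0]),
).astype(float)
df.index.name = "date"

print(df)
```

Converting the index with pd.to_datetime makes later date-based operations, such as filling in missing days, straightforward.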
Hopefully now you have a clear understanding of the kind of steps we have followed so far. We have created the basic dataframe, so now we will move on to the basic data analysis part for a DJIA dataset.
Basic data analysis for a DJIA dataset
In this section, we will perform basic data analysis on a DJIA dataset. This dataset has the date value, but if you look at the dates carefully, you will see that some of them are missing. For example, data is missing for 30-12-2006, 31-12-2006, 1-1-2007, and many other dates. In such cases, we will add the date values that are missing. You can refer to the code snippet given in Figure 2.7, and you can find the code on GitHub at https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
As you can see in the preceding figure, we come across another challenge after adding these missing date values. We have added the date value, but there is no close price or adj close price available corresponding to each of them, so we need to replace the NaN values logically, not randomly.
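One possible way to add the missing dates, shown here as a sketch with made-up prices, is to reindex the dataframe against a complete daily date range. Whether the notebook uses exactly this approach is an assumption. Every date added this way has NaN in both price columns.

```python
import pandas as pd

# Made-up prices; 30-12-2006, 31-12-2006, and 1-1-2007 are missing on purpose.
df = pd.DataFrame(
    {"close": [100.0, 104.0], "adj close": [100.0, 104.0]},
    index=pd.to_datetime(["2006-12-29", "2007-01-02"]),
)

# Build every calendar day between the first and last date, then reindex.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df_full = df.reindex(full_range)

print(df_full)
print(df_full["close"].isna().sum())  # number of newly added dates
```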
In order to replace the NaN values of close price and adj close price, we will use the pandas interpolation functionality. Many types of interpolation are available, but here we use linear interpolation to generate the missing values, which is defined as follows:
If the two known points are given by the coordinates (x_1, y_1) and (x_3, y_3), the linear interpolant is the straight line between these points. For a value x_2 in the interval (x_1, x_3), the value y_2 along this line is y_2 = y_1 + (x_2 - x_1)(y_3 - y_1) / (x_3 - x_1).
You can refer to the code snippet in the following screenshot:
The code for this is available on GitHub at https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
As you can see in the code snippet, we haven't specified which type of interpolation should be performed on our dataset; in this case, linear interpolation is performed by default. After applying linear interpolation, the NaN values are replaced with values that lie on the straight line between the neighboring known prices. We have also removed three records from the year 2006. So now, we have a total of 3653 records.
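The following sketch illustrates the interpolation step on made-up values. Calling interpolate() without arguments uses the linear method, which treats the rows as equally spaced; that matches a daily index with no gaps. The final slice that drops the 2006 rows is only one possible way to remove those records.

```python
import numpy as np
import pandas as pd

# Made-up prices with three missing days in between.
df_full = pd.DataFrame(
    {
        "close": [100.0, np.nan, np.nan, np.nan, 104.0],
        "adj close": [100.0, np.nan, np.nan, np.nan, 104.0],
    },
    index=pd.date_range("2006-12-29", periods=5, freq="D"),
)

# No method given, so pandas falls back to linear interpolation.
filled = df_full.interpolate()
print(filled["close"].tolist())  # [100.0, 101.0, 102.0, 103.0, 104.0]

# Keep only the records from 2007 onwards.
filled_2007 = filled.loc["2007-01-01":]
print(len(filled_2007))  # 2
```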
This is the kind of basic data preprocessing and data analysis we did for the DJIA index dataset. Now let's move on to the NYTimes news article dataset. We need to prepare the training dataset first, so let's begin with it.
Preparing the NYTimes news dataset
In this section, we will see how we can prepare the NYTimes news dataset. We have downloaded the whole news article dataset but we have not put in a filtering mechanism for choosing news article categories. Perform the following steps when preparing the NYTimes dataset:
- Converting publication date into the YYYY-MM-DD format.
- Filtering news articles by their category.
- Implementing the filter functionality and merging the dataset.
- Saving the merged dataset in the pickle file format.
So, let's start coding for each of these steps.
Converting publication date into the YYYY-MM-DD format
First, we will convert the publication date of the news articles into the YYYY-MM-DD format so that we can merge DJIA and NYTimes news article datasets later on. In order to achieve this, you can refer to the following code snippet:
Here, we have written a function that can parse and convert the publication date format into the necessary YYYY-MM-DD format. We will call this function later on when we read the JSON files in which we have stored the JSON response.
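A function of this kind might look like the sketch below. It assumes the publication date arrives as an ISO 8601-style timestamp string whose first ten characters are the date; the function name to_yyyy_mm_dd and the sample inputs are hypothetical.

```python
from datetime import datetime


def to_yyyy_mm_dd(pub_date):
    """Return the date part of an ISO 8601-style timestamp as YYYY-MM-DD."""
    date_part = pub_date.strip()[:10]
    # strptime raises ValueError if the prefix is not a valid calendar date.
    return datetime.strptime(date_part, "%Y-%m-%d").strftime("%Y-%m-%d")


print(to_yyyy_mm_dd("2007-01-03T00:00:00Z"))       # 2007-01-03
print(to_yyyy_mm_dd("2016-05-17T09:30:00+0000"))   # 2016-05-17
```

Validating the prefix with strptime, rather than just slicing the string, means malformed dates fail loudly instead of silently entering the dataset.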
Filtering news articles by category
The other thing that we are going to do here is filter our news article dataset by news category. We have downloaded all types of news articles, but for the stock market price prediction application, we need news articles that belong to specific news categories. So, we need to implement filters that will help us extract the necessary subset of news articles. You can refer to the following code snippet:
You can refer to the code provided at this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
As shown in the preceding figure, we are extracting news articles that belong to the following news categories:
- Business
- National
- World
- U.S.A.
- Politics
- Opinion
- Tech
- Science
- Health
- Foreign
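A category filter along these lines could be sketched as follows. The function name is_relevant is hypothetical, the article is assumed to be a dictionary with optional section_name and news_desk fields, and the exact label strings used in the real data may differ from the ones listed here.

```python
# Lower-cased category labels taken from the list above.
ALLOWED_CATEGORIES = {
    "business", "national", "world", "u.s.a.", "politics",
    "opinion", "tech", "science", "health", "foreign",
}


def is_relevant(article):
    """Return True if section_name or news_desk matches an allowed category."""
    for field in ("section_name", "news_desk"):
        value = article.get(field)
        if value and value.strip().lower() in ALLOWED_CATEGORIES:
            return True
    return False


print(is_relevant({"section_name": "Business"}))   # True
print(is_relevant({"news_desk": "Sports"}))        # False
```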
Implementing the filter functionality and merging the dataset
Now, we need to iterate over each of the JSON files and extract the news articles that have one of the news categories defined in the previous section. You can refer to the code snippet for the implementation of the filter functionality. In the upcoming code snippet, you can also find the implementation for merging the DJIA dataset and the NYTimes news articles dataset. To merge the two datasets, we add each of the news article headlines to the pandas dataframe, and from this we will generate our final training dataset. This functionality is shown in the following screenshot:
We have also coded a bit of exception handling functionality. This is done so that if any JSON response does not have a value for the data attributes section_name, news_desk, or type_of_material, the code will throw an exception. You can refer to the code snippet in the following screenshot:
We will consider news articles that have no section_name and news_desk as well. We will add all the news article headlines to our dataset and put them into the pandas dataframe. You can see the code snippet in the following screenshot:
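The merging logic described above might be sketched as follows. Everything here is illustrative: the function name merge_headlines, the "articles" column, the reduced category set, and the article layout (pub_date, a nested headline with a main field, section_name, news_desk) are assumptions, and the sample articles are made up.

```python
import pandas as pd

ALLOWED = {"business", "world", "science"}  # shortened list for the example


def merge_headlines(df, articles):
    """Append qualifying headlines to an 'articles' column, keyed by date."""
    merged = df.copy()
    merged["articles"] = ""
    for article in articles:
        try:
            date = pd.Timestamp(article["pub_date"][:10])
            headline = article["headline"]["main"]
        except (KeyError, TypeError):
            continue  # skip records without a usable date or headline
        try:
            labels = [article["section_name"], article["news_desk"]]
        except KeyError:
            labels = []  # missing attributes: treat the article as unlabelled
        labels = [value.lower() for value in labels if value]
        # Unlabelled articles are kept; labelled ones must match a category.
        keep = not labels or any(value in ALLOWED for value in labels)
        if keep and date in merged.index:
            merged.at[date, "articles"] += headline + ". "
    return merged


prices = pd.DataFrame(
    {"close": [101.0, 102.5]},
    index=pd.to_datetime(["2007-01-03", "2007-01-04"]),
)
sample_articles = [
    {"pub_date": "2007-01-03T00:00:00Z", "headline": {"main": "Stocks rise"},
     "section_name": "Business", "news_desk": "Business"},
    {"pub_date": "2007-01-03T00:00:00Z", "headline": {"main": "Cup final"},
     "section_name": "Sports", "news_desk": "Sports"},
    {"pub_date": "2007-01-04T00:00:00Z", "headline": {"main": "Untitled note"}},
    {"pub_date": "2007-01-04T00:00:00Z"},
]
merged = merge_headlines(prices, sample_articles)
print(merged)
```

Concatenating all headlines for a date into one string keeps a single row per trading day, which matches the one-row-per-date shape of the DJIA dataframe.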
You can see the final merged dataset in the form of the pandas dataframe, as shown in the following screenshot:
Here, each date is associated with all the news headlines that belong to the business, national, world, U.S.A., politics, opinion, technology, science, and health categories. We have downloaded 1,248,084 news articles, and from these articles, we have considered 461,738 news articles for our model.
You can access the code using this GitHub link: https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
Saving the merged dataset in the pickle file format
Once we merge the data, we need to save the data objects, so we will use the pickle module of Python. Pickle helps us serialize and de-serialize the data. The pickle dependency library is fast because the bulk of it is written in C, like the Python interpreter itself. Here, we save our training dataset in the .pkl file format. You can refer to the following code snippet:
We have saved the dataset as the pickled_ten_year_filtered_lead_para.pkl file. You can find the code on GitHub at https://github.com/jalajthanaki/stock_price_prediction/blob/master/datapreparation.ipynb.
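Saving and reloading a dataframe with pickle can be sketched as shown below. The dataframe contents and the file name example_merged_dataset.pkl are made up for the illustration, and a temporary directory is used so that the example leaves no files behind.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up stand-in for the merged training dataset.
df = pd.DataFrame(
    {"close": [101.0], "articles": ["Sample headline. "]},
    index=pd.to_datetime(["2007-01-03"]),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_merged_dataset.pkl")

    # Serialize the dataframe to disk in binary mode.
    with open(path, "wb") as handle:
        pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # De-serialize it again to confirm the round trip.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored.equals(df))  # True
```

Only unpickle files that you created yourself or that come from a trusted source, because loading a pickle can execute arbitrary code.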
In the next section, we will mainly focus on the feature engineering part. We will also perform some minor data cleaning steps. So let's jump to the next section.