Exploring Time Series Analysis: Techniques for Data Smoothing and Forecasting


Time series analysis is a statistical technique for analysing time-ordered data. It is widely used in various fields, including finance, economics, weather forecasting, and more. Understanding the components and methods of time series analysis can significantly enhance the accuracy of predictions and insights derived from data.

Components of Time Series

  1. Trend: The long-term movement in a time series. It shows the overall direction of the data over a prolonged period.
    • Example: The increasing trend in global temperatures over the years due to climate change.
  2. Seasonality: Regular patterns or cycles that repeat at fixed intervals within a time series.
    • Example: Retail sales spikes during holiday seasons like Christmas.
  3. Cycle: Fluctuations in a time series that occur over longer periods, often influenced by economic or business cycles.
    • Example: Economic booms and recessions.
  4. Variation: Random or irregular fluctuations that do not follow a pattern or cycle. These are often unpredictable.
    • Example: Sudden stock market crashes or unexpected events like natural disasters.

Use Cases:

  • Finance: Predicting stock prices or exchange rates.
  • Economics: Forecasting economic indicators like GDP or unemployment rates.
  • Weather Forecasting: Predicting temperature, rainfall, and other meteorological variables.
  • Healthcare: Monitoring patient vitals over time to predict health outcomes.

Practical Example: Analysing Passenger Data

Let’s walk through a practical example using Python, showcasing the essential steps and techniques for analysing and modelling time series data. We will use the classic AirPassengers dataset of monthly airline passenger counts.

Importing Libraries

Explanation: We import necessary libraries. pandas is for data manipulation, numpy for numerical operations, matplotlib for plotting, and warnings to ignore any warnings for cleaner output.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Loading the Data

Explanation: We load the dataset and display the first few rows to understand its structure. The data contains monthly passenger numbers.

data = pd.read_csv(r'D:\Passengers.csv')
data.head()
Month Passengers
1949-01 112
1949-02 118
1949-03 132
1949-04 129
1949-05 121

Preprocessing the Data

Explanation: We convert the ‘Month’ column to a datetime format and set it as the index for easier time series manipulation.

data['Month'] = pd.to_datetime(data['Month'], format='%Y-%m')
data = data.set_index(['Month'])
data.tail(5)

Month      Passengers
1960-08-01 606
1960-09-01 508
1960-10-01 461
1960-11-01 390
1960-12-01 432

Plotting the Data

Explanation: We plot the time series data to visualize trends, seasonality, and any patterns.

plt.figure(figsize=(20,10))
plt.xlabel("Month")
plt.ylabel("Number of Passengers")
plt.plot(data)

Output:

Rolling Mean and Standard Deviation

Rolling statistics are moving-window measures used to analyse time series data.

  • Rolling Mean (Moving Average): This is the average of a fixed number of data points, updated as you move through the dataset. It smooths out short-term fluctuations and highlights longer-term trends or cycles.
  • Rolling Standard Deviation: This measures the variability or volatility of the data over a fixed number of periods, updating as you move through the dataset.
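As a toy illustration of how a rolling window updates as it moves through the data (the values here are made up for demonstration):

```python
import pandas as pd

s = pd.Series([10, 12, 14, 16, 18, 20])

# window=3: each point is the mean/std of itself and the two previous points
roll_mean = s.rolling(window=3).mean()
roll_std = s.rolling(window=3).std()

print(roll_mean.tolist())  # [nan, nan, 12.0, 14.0, 16.0, 18.0]
```

The first two entries are NaN because a full window of three values is not yet available; after that, each value is the average of the current point and the two before it.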

Explanation: We calculate the rolling mean and standard deviation with a window of 12 months to smooth the time series and observe trends more clearly. Rolling mean helps to identify the long-term trend, while rolling standard deviation shows the variability over the window.

rolmean = data.rolling(window=12).mean()
rolstd = data.rolling(window=12).std()
print(rolmean, rolstd)

Plotting Rolling Statistics

Explanation: We plot the actual data, rolling mean, and rolling standard deviation to visualize how the rolling statistics smooth out the time series and highlight the trend.

plt.figure(figsize=(20,10))
plt.plot(data, color='red', label='Actual')
plt.plot(rolmean, color='green', label='Rolling Mean')
plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show(block=False)

Output:

Dickey-Fuller Test

This is a statistical test used to check for stationarity in a time series. A stationary time series has constant mean and variance over time, which is essential for many time series models.

  • Test Statistic: This value helps determine if the series is stationary.
  • p-value: If this value is less than 0.05, we reject the null hypothesis of a unit root, which suggests the series is stationary.
  • Critical Values: These are benchmark values for different confidence levels (1%, 5%, 10%).

Explanation: We run the Dickey-Fuller test on the passenger series. If the p-value is below 0.05, we reject the null hypothesis of a unit root and conclude that the series is stationary.

from statsmodels.tsa.stattools import adfuller
print('Dickey-Fuller Test: ')
dftest = adfuller(data['Passengers'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)

Output:

Log Transformation

This is applied to stabilize the variance of a time series. By compressing the range of values, log transformation can make a time series more stationary.

Explanation: We apply a log transformation to stabilize the variance of the time series. Log transformation compresses the range of values and makes the series more stationary.

plt.figure(figsize=(20,10))
data_log = np.log(data)
plt.plot(data_log)

Output:

Rolling Mean and Standard Deviation of Log Data

Explanation: We calculate and plot the rolling mean and standard deviation for the log-transformed data to observe the smoothed trends.

plt.figure(figsize=(20,10))
MAvg = data_log.rolling(window=12).mean()
MStd = data_log.rolling(window=12).std()
plt.plot(data_log)
plt.plot(MAvg, color='red', label='Rolling Mean')
plt.plot(MStd, color='black', label='Rolling Std')
plt.legend(loc='best')

Output:


Stationarity Function

This function calculates and plots rolling statistics and performs the Dickey-Fuller test to check for stationarity of the time series.

Explanation: This function calculates and plots the rolling statistics and performs the Dickey-Fuller test. It helps to check the stationarity of the time series. The rolling mean and standard deviation plot shows trends and variability, while the Dickey-Fuller test results indicate if the series is stationary.

def stationarity(timeseries):
    rolmean = timeseries.rolling(window=12).mean()
    rolstd = timeseries.rolling(window=12).std()
    plt.figure(figsize=(20,10))
    plt.plot(timeseries, color='blue', label='Actual')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)
    print('Dickey-Fuller Test: ')
    dftest = adfuller(timeseries['Passengers'], autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','Lags Used','No. of Obs'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

Output:

Exponential Moving Average for Smoothing

This technique uses exponential moving averages to smooth the time series, highlighting the overall trend while reducing short-term fluctuations.

Explanation: We apply an exponential moving average to the log-transformed data to smooth it. Exponential moving average gives more weight to recent data points, making it responsive to recent changes while smoothing short-term fluctuations.

expma_log = data_log.ewm(halflife=12).mean()
plt.figure(figsize=(20,10))
plt.plot(data_log)
plt.plot(expma_log, color='red')

Output:



Difference Between Log Data and Exponential Moving Average

Explanation: We calculate the difference between the log-transformed data and its exponential moving average to remove trends and make the series more stationary. We then check the stationarity of the differenced data using the stationarity function.

log_diff = data_log - expma_log
log_diff.dropna(inplace=True)
plt.figure(figsize=(20,10))
plt.plot(log_diff)
stationarity(log_diff)

Output:


Conclusion

Time series analysis provides essential insights into data that is sequentially ordered over time. By understanding and applying techniques like trend analysis, seasonality decomposition, and exponential moving averages, one can uncover valuable patterns and make accurate forecasts. The methods demonstrated here, from basic visualization to rolling statistics, stationarity testing, and smoothing transformations, are crucial steps in transforming raw time series data into actionable insights.

This exploration highlights the process of smoothing data and preparing it for modelling, particularly the importance of transforming and comparing different forms of data, such as log-transformed data and its exponential moving average. As we continue to develop more advanced models, maintaining clarity and focus in our analysis remains crucial to achieve reliable and actionable results.


Neha Vittal Annam
