首页 > 分享 > 季节性时间序列模型与多重季节性模型（代码+分析）

季节性时间序列模型与多重季节性模型（代码+分析）

花匠小妙招
2024-09-18 22:20

专题三季节性模型

3.1 季节性时间序列模型

季节性时间序列模型是一种用于处理时间序列数据中包含季节性模式的方法。它通常用于预测和分析在固定时间间隔（如每季度、每月或每周）内出现的重复性模式。

常见的季节性时间序列模型包括：

季节性自回归移动平均模型（Seasonal Autoregressive Integrated Moving Average，SARIMA）：它是ARIMA模型的季节性扩展。ARIMA模型是自回归（AR）和移动平均（MA）的组合，而SARIMA模型还考虑了季节性成分。它的参数包括季节性自回归阶数、季节性移动平均阶数、非季节性自回归阶数、非季节性移动平均阶数和差分阶数。

季节性分解模型（Seasonal Decomposition of Time Series，STL）：它通过将时间序列拆分成季节性、趋势和残差三个成分，来捕捉季节性模式。这种方法可以帮助我们理解时间序列数据中季节性和趋势的变化。

季节性指数平滑模型（Seasonal Exponential Smoothing，ETS）：它是指数平滑模型的季节性扩展。指数平滑模型用于对时间序列进行平滑和预测，而ETS模型还考虑了季节性因素。

季节性自回归集成移动平均模型（Seasonal Autoregressive Integrated Moving-Average Exogenous，SARIMAX）：它是SARIMA模型的扩展，可以包含外生变量（如其他影响因素）来改进预测性能。

季节性时间序列模型的选择取决于数据的特点、季节性模式的复杂性以及是否有外生变量等因素。在应用季节性时间序列模型之前，通常需要对数据进行季节性调整、平稳化处理，以及选择合适的模型参数。这些模型在处理季节性数据方面都有一定的优势和适用场景。

# 载入相关的库 from scipy import stats import statsmodels.api as sm # 统计相关的库 import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns sns.set(color_codes=True) #seaborn设置背景 from statsmodels.graphics.tsaplots import plot_acf as ACF #自相关图 from statsmodels.tsa.stattools import adfuller as ADF #平稳性检测 from statsmodels.graphics.tsaplots import plot_pacf as PACF #偏自相关图 from statsmodels.stats.diagnostic import acorr_ljungbox as LjungBox #白噪声检验 from statsmodels.tsa.arima.model import ARIMA 123456789101112131415

# 导入数据 data=pd.read_csv('AirPassengers.csv') data.head() 123 Unnamed: 0Month#Passengers001949-01-01112111949-02-01118221949-03-01132331949-04-01129441949-05-01121

del data['Unnamed: 0'] data=data.set_index(['Month']) data.head() 123 #PassengersMonth1949-01-011121949-02-011181949-03-011321949-04-011291949-05-01121

# 可视化时间序列 data.plot(figsize=(12,4)) plt.xlabel('Date') plt.ylabel('Number of air passengers') 12345

在这里插入图片描述

总体上可以看出可以看出序列有着季节性，我们在看看每年的情况

compare={} for index in data.index: year = index[:4] month = index[5:7] if year not in compare.keys(): compare.update({year:{month:data['#Passengers'][index]}}) else: compare[year].update({month:data['#Passengers'][index]}) compare = pd.DataFrame(compare) compare.plot(figsize=(20,6)) plt.legend(loc=0) 123456789101112

在这里插入图片描述

可以看出序列整体走势非常相似

3.1.1 季节指数

所谓季节指数就是用简单平均法计算的周期内各时期季节性影响的相对数。

x i j = x ˉ ⋅ S j + I i j large x_{ij} = bar x cdot S_j + I_{ij} xij=xˉ⋅Sj+Iij

首先计算周期内各期平均数 x ˉ k = ∑ i = 1 n x i k n , k = 1 , 2 ⋯ , m large bar x_k = frac {sum_{i=1}^n x_{ik}}{n} , k=1,2 cdots ,m xˉk=n∑i=1nxik,k=1,2⋯,m

然后计算总平均数 x ˉ = ∑ i = 1 n ∑ k = 1 m x i k n m large bar x = frac{sum_{i=1}^n sum_{k=1}^m x_{ik}}{nm} xˉ=nm∑i=1n∑k=1mxik

再计算季节指数 S k = x ˉ k x ˉ , k = 1 , 2 ⋯ , m large S_k = frac{bar x_k}{bar x},k=1,2 cdots ,m Sk=xˉxˉk,k=1,2⋯,m

季节指数反映了该季度与总平均值之间的一种比较稳定的关系：

如果比值大于1，说明该季度的值常常会高于总平均值；

如果比值小于1，说明该季度的值常常低于总平均值；

如果序列的季节指数都近似为1，就说明该序列没有明显的季节性。

我们来计算一下上面数据的季节指数：

ave_k = compare.mean(axis=1) ave = ave_k.mean() compare['S'] = ave_k/ave compare 1234 194919501951195219531954195519561957195819591960S011121151451711962042422843153403604170.862473021181261501801961882332773013183423910.838392031321411781932362352673173563624064190.963853041291351631812352272693133483483964610.952853051211251721832292342703183553634204720.969799061351491782182432643153744224354725351.111909071481701992302643023644134654915486221.253425081481701992422722933474054675055596061.252533091361581842092372593123554044044635081.078909101191331621912112292743063473594074610.951069111041141461721802032372713053103623900.830662121181401661942012292783063363374054320.934123

根据季节指数来看数据具有一定的季节性，各月份的季节指数存在差异

3.1.2 季节性差分

在有些应用中，季节性是次要的，我们需要把它从数据中消除，这个过程叫季节调整，其中季节性差分化是一种常见的方法。

在之前我们介绍过差分（正规差分化），其形式为： Δ x t = x t − x t − 1 large Delta x_t = x_t - x_{t-1} Δxt=xt−xt−1

我们将它推广，如果一个序列具有周期性，且周期为 s s s，则季节性差分化为： Δ s x t = x t − x t − s large Delta_s x_t = x_t - x_{t-s} Δsxt=xt−xt−s

dataDiff = data.diff(12)[12:] dataDiff.plot(figsize=(15,6)) 12

在这里插入图片描述

可以看出，季节性基本上被消除，但是序列不够平稳，我们不妨再做1阶差分看看

diff1 = dataDiff.diff()[1:] diff1.plot(figsize=(15,6)) 12

在这里插入图片描述

t = sm.tsa.stattools.adfuller(diff1) # ADF检验 print("p-value: ",t[1]) 12

p-value: 1.856511600123444e-28 1

经过ADF检验，我们发现p-value小于显著性水平0.05，序列是平稳的。

这里我也写了一个函数来进行季节性检验

#用于检测季节性的函数 def test_stationarity(timeseries): #设置滚动窗口 movingAverage = timeseries.rolling(window=12).mean() movingSTD = timeseries.rolling(window=12).std() #绘制滚动平均值和方差 plt.figure(figsize=(15,4)) # 使用range函数生成连续的数字列表作为横坐标 x_values = list(range(len(timeseries))) orig = plt.plot(x_values,timeseries, color='blue', label='Original') mean = plt.plot(x_values,movingAverage, color='red', label='Rolling Mean') std = plt.plot(x_values,movingSTD, color='black', label='Rolling Std') plt.legend(loc='best') plt.title('Rolling Mean & Standard Deviation') plt.show(block=False) #ADF检验: print('Results of Dickey Fuller Test:') dftest = ADF(timeseries['#Passengers'], autolag='AIC') dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used']) for key,value in dftest[4].items(): dfoutput['Critical Value (%s)'%key] = value print(dfoutput)

123456789101112131415161718192021222324252627'

diff1.dropna(inplace=True) test_stationarity(diff1) 12

在这里插入图片描述

Results of Dickey Fuller Test: Test Statistic -1.559562e+01 p-value 1.856512e-28 #Lags Used 0.000000e+00 Number of Observations Used 1.300000e+02 Critical Value (1%) -3.481682e+00 Critical Value (5%) -2.884042e+00 Critical Value (10%) -2.578770e+00 dtype: float64 123456789

移动平均值和标准差几乎平行于 x 轴，因此它们没有趋势性。这说明季节性已经被消除了。

3.2 多重季节性模型

上一节中，经过了季节性差分和正规差分后，序列成为了平稳时间序列，则我们可以用ARMA模型对多重差分后的模型建模。则我们有模型：
( 1 − B s ) ( 1 − B ) x t = a t − θ a t − 1 − Θ a t − s + θ Θ a t − s − 1 ∣ θ ∣ < 1 , ∣ Θ ∣ < 1 large (1-B^s)(1-B)x_t = a_t - theta a_{t-1} - Theta a_{t-s} + theta Theta a_{t-s-1} large |theta| < 1 ,|Theta| < 1 (1−Bs)(1−B)xt=at−θat−1−Θat−s+θΘat−s−1∣θ∣<1,∣Θ∣<1

其中， B B B为向后推移算子，代表前一时刻的值。 B B B的 s s s次方则为前 s s s时刻的值。 { a t a_t at}为白噪声序列， s s s为序列的周期。上述模型为航空模型。

另外，序列经过了季节性差分和正规差分消除序列相关性（序列平稳），则可以认为是间隔 s s s和间隔1的周期共同影响，因此该模型称为多重季节性ARMA模型

3.2.1 定阶

#ACF & PACF plots lag_acf = ACF(diff1, lags=30) lag_pacf = PACF(diff1, lags=30) plt.show() 12345

在这里插入图片描述

从图中可知，q的取值在9阶、23阶都有可能，p的取值在9阶、22阶都有可能

进一步结合信息准则来判断

print(sm.tsa.arma_order_select_ic(diff1,max_ar=10,max_ma=10,ic='aic')['aic_min_order']) 1

(6, 9) 1

3.2.2 建模和预测

from statsmodels.tsa.arima.model import ARIMA temp = np.array(dataDiff.values) # 载入passengers序列，使用去除季节性后的数据 model =ARIMA(temp,order=(6, 1, 9)) res = model.fit() print(res.summary()) plt.rcParams['font.sans-serif'] = ['simhei'] #字体为黑体 plt.rcParams['axes.unicode_minus'] = False #正常显示负号 plt.figure(figsize=(10,4)) plt.plot(dataDiff.values,'b',label='#passengers') plt.plot(res.fittedvalues[1:], 'r',label='SARIMA model') # plt.title('RSS: %.4f'%sum((res.fittedvalues - dataDiff['#Passengers'])**2)) plt.xticks(None) plt.legend() plt.show()

12345678910111213141516

SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 132 Model: ARIMA(6, 1, 9) Log Likelihood -492.804 Date: Sun, 06 Aug 2023 AIC 1017.607 Time: 13:36:26 BIC 1063.610 Sample: 0 HQIC 1036.300 - 132 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ar.L1 -0.5441 0.199 -2.738 0.006 -0.934 -0.155 ar.L2 -0.5765 0.177 -3.257 0.001 -0.923 -0.230 ar.L3 -0.0131 0.260 -0.051 0.960 -0.523 0.496 ar.L4 0.2455 0.203 1.210 0.226 -0.152 0.643 ar.L5 0.6091 0.200 3.045 0.002 0.217 1.001 ar.L6 0.7437 0.145 5.140 0.000 0.460 1.027 ma.L1 0.2867 42.163 0.007 0.995 -82.351 82.924 ma.L2 0.5228 30.148 0.017 0.986 -58.567 59.612 ma.L3 -0.1872 52.095 -0.004 0.997 -102.292 101.917 ma.L4 -0.4904 60.006 -0.008 0.993 -118.100 117.119 ma.L5 -0.6672 39.349 -0.017 0.986 -77.789 76.455 ma.L6 -0.8580 67.335 -0.013 0.990 -132.833 131.117 ma.L7 0.2039 31.195 0.007 0.995 -60.938 61.346 ma.L8 -0.1734 22.657 -0.008 0.994 -44.581 44.234 ma.L9 0.3646 15.300 0.024 0.981 -29.622 30.351 sigma2 96.9965 4077.036 0.024 0.981 -7893.847 8087.840 =================================================================================== Ljung-Box (L1) (Q): 0.23 Jarque-Bera (JB): 9.62 Prob(Q): 0.63 Prob(JB): 0.01 Heteroskedasticity (H): 2.23 Skew: 0.06 Prob(H) (two-sided): 0.01 Kurtosis: 4.32 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step).

12345678910111213141516171819202122232425262728293031323334353637

在这里插入图片描述

res.fittedvalues 1

array([ 0.00000000e+00, 2.99987745e+00, 6.63906357e+00, 8.47045456e+00, 6.43421418e+00, 3.96721019e+00, 1.16104199e+01, 1.95559745e+01, 1.98457537e+01, 1.79953720e+01, 1.64304319e+01, 1.27013626e+01, 1.69093980e+01, 2.55311769e+01, 2.63350764e+01, 3.15830421e+01, 2.97081039e+01, 4.10358898e+01, 2.82549533e+01, 2.60002427e+01, 2.95319863e+01, 3.20272994e+01, 2.09427277e+01, 3.13813568e+01, 2.49872644e+01, 3.19489266e+01, 2.55643820e+01, 1.63835304e+01, 1.46053515e+01, 1.75140003e+01, 3.26483739e+01, 3.74955701e+01, 3.24416747e+01, 2.37433037e+01, 3.44731413e+01, 2.36261594e+01, 2.68838810e+01, 2.11135456e+01, 2.24752008e+01, 3.52466452e+01, 5.65939181e+01, 3.89366952e+01, 2.56592326e+01, 2.96908778e+01, 3.69353872e+01, 2.77045383e+01, 1.39875950e+01, 1.15654045e+01, 1.81691062e+01, 7.18002453e+00, -9.15149400e+00, 4.82689410e-01, 4.88001848e-02, 7.29863475e+00, 1.74440408e+01, 2.96417253e+01, 2.38426846e+01, 2.63265181e+01, 1.14139002e+01, 2.48281715e+01, 2.24997723e+01, 3.57679424e+01, 4.62938499e+01, 3.97311106e+01, 2.16377247e+01, 3.94896840e+01, 4.90119607e+01, 6.13330608e+01, 5.29900976e+01, 4.36337683e+01, 4.01339935e+01, 4.24827977e+01, 4.35634260e+01, 4.38982085e+01, 3.87204841e+01, 4.58979163e+01, 4.75860322e+01, 4.77075795e+01, 5.08555424e+01, 4.69759556e+01, 5.48545596e+01, 4.72034517e+01, 3.18699450e+01, 2.98443004e+01, 2.95219985e+01, 3.03749871e+01, 3.31955506e+01, 3.15639656e+01, 3.41521806e+01, 3.74938971e+01, 4.06140179e+01, 5.58595243e+01, 5.85403370e+01, 4.25735494e+01, 3.80869760e+01, 3.85368106e+01, 3.17146301e+01, 2.54165139e+01, 1.75410261e+01, 8.10108958e+00, 9.64188329e+00, 8.27189345e+00, 1.14725762e+01, 2.58387040e+01, 3.64049736e+01, 1.07058552e+01, 7.00774219e+00, 1.94196267e+00, 8.52177415e+00, 1.86848853e+01, 2.79325668e+01, 3.60733657e+01, 4.97577773e+01, 4.16140939e+01, 4.43206968e+01, 5.45823638e+01, 4.54610159e+01, 5.62632173e+01, 4.80053929e+01, 4.77196774e+01, 6.64954835e+01, 6.03449924e+01, 3.36602549e+01, 2.73630528e+01, 5.33440593e+01, 5.79827318e+01, 6.09581713e+01, 5.48110579e+01, 6.02274489e+01, 4.55147931e+01, 4.48001212e+01, 2.47040159e+01])

123456789101112131415161718192021222324252627282930313233

a=dataDiff.values.tolist() one_dim_array = [item for sublist in a for item in sublist] delta = res.fittedvalues - one_dim_array # 残差 plt.figure(figsize=(10,6)) plt.plot(delta,'r',label=' residual error') plt.legend(loc=0) 123456

在这里插入图片描述

acf,q,p = sm.tsa.acf(delta,nlags=10,qstat=True) ## 计算自相关系数及p-value out = np.c_[range(1,11), acf[1:], q, p] output=pd.DataFrame(out, columns=['lag', "AC", "Q", "P-value"]) output = output.set_index('lag') output 12345 ACQP-valuelag1.0-0.0408260.2250470.6352212.0-0.0315120.3601550.8352063.0-0.0555890.7838610.8533234.0-0.0218180.8496390.9316725.0-0.0316300.9889800.9634506.00.0006640.9890420.9860247.00.0009010.9891570.9950028.00.0379451.1945390.9966969.0-0.0376571.3984590.99783210.0-0.0681142.0711030.995766

观察p-value可知，该序列可以认为没有相关性，近似得可以认为残差序列接近白噪声。说明模型是显著的。

score = 1 - delta.var()/temp.var() print(score) 12

0.6643583483988917 1

model =ARIMA(temp,order=(6, 1, 9)).fit() y = model.forecast(5) #使用.forecast()方法可以直接外推时间线预测（仅ARIMA模型可以） y 123

array([39.89301892, 32.95434062, 39.57820142, 36.28694024, 31.5818024 ]) 1

3.2.3 还原序列季节性

predictions_ARIMA_diff = pd.Series(res.fittedvalues[1:], copy=True) print(predictions_ARIMA_diff.head()) 12

0 2.999877 1 6.639064 2 8.470455 3 6.434214 4 3.967210 dtype: float64 123456

predictions_ARIMA = np.cumsum(predictions_ARIMA_diff.values) 1

# 使用步长12进行差分后，进行还原操作 restored_data = np.zeros(len(predictions_ARIMA_diff) + 12) # 创建一个与差分数据长度+12的数组 for i in range(12, len(restored_data)): restored_data[i] = restored_data[i - 12] + predictions_ARIMA_diff.values[i - 12] 12345

plt.plot(data.values,'b') plt.plot(restored_data,'orange') 12

在这里插入图片描述

3.3 SARIMA

SARIMA（Seasonal Autoregressive Integrated Moving Average）模型是一种时间序列预测模型，是ARIMA模型的扩展，用于处理具有季节性的时间序列数据。SARIMA模型在预测和分析季节性数据方面非常有用，它结合了自回归（AR）、差分（I）和移动平均（MA）的概念，以及季节性分析。

SARIMA模型由以下几个部分组成：

季节性成分（Seasonal Component）：SARIMA模型考虑了数据的季节性成分，该成分是在时间序列中以一定周期（例如每年、每月、每周）重复出现的模式。SARIMA模型使用季节性差分来处理这种季节性成分。

自回归成分（Autoregressive Component）：这部分和ARIMA模型中的自回归部分类似，表示当前时刻的观测值与过去若干时刻的观测值之间的关系。

差分成分（Integrated Component）：这部分和ARIMA模型中的差分部分类似，表示通过差分操作将时间序列转化为平稳序列，以便进行建模。

移动平均成分（Moving Average Component）：这部分和ARIMA模型中的移动平均部分类似，表示当前时刻的观测值与过去若干时刻的误差项之间的关系。

SARIMA模型的表示形式为 SARIMA(p, d, q)(P, D, Q)s，其中：

p: 自回归阶数（Autoregressive Order）d: 差分阶数（Differencing Order）q: 移动平均阶数（Moving Average Order）P: 季节性自回归阶数（Seasonal Autoregressive Order）D: 季节性差分阶数（Seasonal Differencing Order）Q: 季节性移动平均阶数（Seasonal Moving Average Order）s: 季节性周期（Seasonal Period）

SARIMA模型的建立通常涉及参数选择、模型拟合、残差分析和预测等步骤。它可以通过统计方法、优化算法等进行参数估计，然后利用已知数据进行模型拟合，最终用来预测未来的时间序列值。

import numpy as np import pandas as pd import matplotlib.pyplot as plt from statsmodels.tsa.statespace.sarimax import SARIMAX from statsmodels.graphics.tsaplots import plot_acf, plot_pacf # 示例数据（假设你有一个时间序列数据） np.random.seed(0) n = 200 time = np.arange(n) data = 10 + np.sin(2*np.pi*time/12) + np.random.normal(0, 1, n) # 创建时间序列 DataFrame df = pd.DataFrame({'data': data}, index=pd.date_range(start='2020-01-01', periods=n, freq='M')) # 可视化时间序列 plt.figure(figsize=(10, 4)) plt.plot(df.index, df['data']) plt.title('Time Series Data') plt.xlabel('Date') plt.ylabel('Value') plt.show()

1234567891011121314151617181920212223

在这里插入图片描述

# 绘制 ACF 和 PACF 图来帮助选择 SARIMA 参数 fig, axes = plt.subplots(1, 2, figsize=(10, 4)) plot_acf(df['data'], lags=20, ax=axes[0]) plot_pacf(df['data'], lags=20, ax=axes[1]) plt.show() 12345

在这里插入图片描述

# 创建并拟合 SARIMA 模型 model = SARIMAX(df['data'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)) results = model.fit() # 打印模型摘要 print(results.summary()) # 预测未来的数据 forecast_steps = 12 forecast = results.forecast(steps=forecast_steps) # 可视化原始数据和预测结果 plt.figure(figsize=(10, 4)) plt.plot(df.index, df['data'], label='Original Data') plt.plot(forecast.index, forecast, label='Forecast') # plt.fill_between(forecast.index, forecast.conf_int()['lower data'], forecast.conf_int()['upper data'], color='gray', alpha=0.2) plt.title('SARIMA Forecast') plt.xlabel('Date') plt.ylabel('Value') plt.legend() plt.show()

123456789101112131415161718192021

SARIMAX Results ========================================================================================== Dep. Variable: data No. Observations: 200 Model: SARIMAX(1, 1, 1)x(1, 1, 1, 12) Log Likelihood -280.985 Date: Sun, 06 Aug 2023 AIC 571.969 Time: 14:29:44 BIC 588.125 Sample: 01-31-2020 HQIC 578.515 - 08-31-2036 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ar.L1 0.0134 0.082 0.164 0.870 -0.146 0.173 ma.L1 -0.9094 0.041 -22.086 0.000 -0.990 -0.829 ar.S.L12 -0.0475 0.091 -0.523 0.601 -0.226 0.131 ma.S.L12 -0.9990 9.996 -0.100 0.920 -20.590 18.593 sigma2 0.9696 9.635 0.101 0.920 -17.915 19.855 =================================================================================== Ljung-Box (L1) (Q): 0.04 Jarque-Bera (JB): 1.27 Prob(Q): 0.84 Prob(JB): 0.53 Heteroskedasticity (H): 0.97 Skew: 0.07 Prob(H) (two-sided): 0.90 Kurtosis: 2.63 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step).

1234567891011121314151617181920212223242526

在这里插入图片描述