When you want to forecast the time series data in R, you typically would use a package called ‘forecast’, with which you can use models like ARIMA. But then, beginning of this year, a team at Facebook released ‘Prophet’, which utilizes a Bayesian based curve fitting method to forecast the time series data. The cool thing about Prophet is that it doesn’t require much prior knowledge or experience of forecasting time series data since it automatically finds seasonal trends beneath the data and offers a set of ‘easy to understand’ parameters. Hence, it allows non statisticians to start using it and get reasonably good results that are often equal or sometimes even better than the ones produced by the experts.
We have added a support of Prophet in Exploratory and written an introductory post about it before.
Then, you would ask, should we always use Prophet, and not use forecast or more precisely ARIMA or other type of forecasting models at all? Is Prophet always outperform forecast better?
Recently, we’ve actually got this question from one of our customers who found that ‘Prophet’ was not performing well for their data.
So, I decided to compare the two, Prophet vs. forecast.
I’m going to use Exploratory’s out-of-the-box Prophet-based time series forecasting feature for Prophet while I use the model extension framework in Exploratory to bring in ‘forecast’ package by writing an R script.
I have a monthly aggregated data of US airline flights from 2005 to 2007. The goal is to use the data for 2005 and 2006 to build the forecasting model and forecast the number of the flights for 12 months period in 2007, and compare the forecasted values against the actual 2007 data.
I’m going to work with ‘forecast’ package with these two relatively new packages called ‘sweep’ and ‘timetk’, which make it possible to work with forecast in a tidy way.
I needed to install the following 3 packages in Exploratory before I start.
- forecast : The time series forecasting package.
- sweep : Package to handle time series forecasting in a tidy way.
- timetk : We use this package to convert between regular data.frame and time series data format called ‘ts’.
From Exploratory’s Project List page, click R Package menu.
Click Install tab.
Type in “forecast” and click Install button, which will install the forecast package.
Install sweep and timetk packages the same way.
Let’s add a custom model script that adds ARIMA model to Exploratory through forecast package.
If you want to know more details on how to use the model extension framework please take a look at this document page.
Create a new custom R script inside a project by clicking + icon for Scripts menu.
Then type in a name for the script, and click Create button.
Here is the R script to bring forecast’s ARIMA model in with the model extension framework inside Exploratory.
# Time series forecasting by forecast
# as Exploratory Custom Model Function
# https://docs.exploratory.io/user-defined-model-function.htmllibrary(forecast)
library(sweep)
library(timetk)build_forecast_model <- function(formula, data, freq = 12, periods = 5, ...) { # default freq 12 for monthly data
training_data <- datalhs_cols <- all.vars(lazyeval::f_lhs(formula)) # predicted variable
rhs_cols <- all.vars(lazyeval::f_rhs(formula)) # date column
all_cols <- c(lhs_cols, rhs_cols)training_data <- training_data[colnames(training_data) %in% all_cols]
start_date <- training_data[[rhs_cols]][[1]]
start <- year(start_date) + yday(start_date)/365.25 # get start yearmon for the following tk_ts()
training_data_ts <- training_data %>% tk_ts(start=start, freq = freq, silent = TRUE)(Video) Forecasting at Scale: How and Why We Developed Prophet for Forecasting at Facebookfit_arima <- auto.arima(training_data_ts)
augment.forecast_model <- function(x, data = NULL, newdata = NULL, ...) {
# fit_arima <- Arima(training_data_ts, order=c(1,1,0), seasonal=list(order=c(1,0,0)), method="ML")
# return model and periods as one object
ret <- list(model = fit_arima, periods = periods, date_col = rhs_cols, value_col = lhs_cols)
class(ret) <- c("forecast_model")
ret
}
fcast <- forecast::forecast(x$model, h = x$periods)fcast_df <- as.data.frame(fcast) %>%
glance.forecast_model <- function(x, ...){
rownames_to_column("date") %>%
mutate(date = as.Date((as.numeric(date) - 1970) * 365.25, origin = "1970-01-01"), key="forecast") # back to Date class
colnames(fcast_df)[colnames(fcast_df) == "Point Forecast"] <- x$value_col
colnames(fcast_df)[colnames(fcast_df) == "date"] <- x$date_col
actual_df <- data %>% mutate(key = "actual")
ret <- bind_rows(actual_df, fcast_df)
ret
}
sw_glance(x$model)
}tidy.forecast_model <- function(x, ...){
sw_tidy(x$model)
}
Copy the above script into the script editor, and click Save button.
Now we are ready to perform time series forecasting with ARIMA inside Exploratory. Let’s try it!
Download the data
Download the following 2 csv files.
- Number of flights in the US in 2005, 2006
- Number of flights in the US in 2007
Import the downloaded CSV Data into Exploratory
Import 2005_2006_flights.csv into Exploratory.
Let’s take a look at this data on line chart. It looks like below, showing seasonal trend like August having most flights each year.
Import 2007_flights.csv too, so that we can compare our forecast with this actual 2007 data later.
Now, let’s go back to 2005_2006_flights data frame to build model.
The custom model script for forecast we just saved uses auto.arima function, which automatically tries to pick the best meta-parameters for ARIMA.
In order to call the R function registered in the script we want to select ‘Custom Command’ from the plus button and type the function name.
Then, type in the following custom command into the command form.
build_model(model_func=build_forecast_model, formula=count~year_month, periods=12)
Then, click the green Run button.
That will create an ARIMA model like below.
Let’s forecast with this model. Click “Predict on Training Data” menu.
Click Run button in the dialog.
That will append the forecasted data at the end of the original data.
Let’s visualize it with Line chart like following.
Also, let’s show prediction interval around the forecasted values.
Select “Range” from the Y Axis configuration menu.
Then in the dialog, check Show Range checkbox and select Hi 80/Lo 80 columns for Upper/Lower Limit. (By the way, I have been seeing more of 80% prediction intervals in the area of time series forecasting literatures, than 95%, which is most popular in other area of statistics. Prophet’s default prediction interval is 80% too. I guess this is indicative of forecasting time series with high confidence being very difficult, or just impossible in some cases.)
Hmm.. The forecasted 2007 part looks pretty flat, including the prediction interval. Is this really a good forecast? Anyway, let’s compare it with actual 2007 data.
If you look at the forecast result data, the rows with actual data up to the end of the 2006 has “key” column with the value “actual”. What we want is continuation of this kind of rows for 2007 with the actual data.
So, let’s go to 2007_flights data frame and add “key” column with the value of “actual” so that we can use this data along with 2005_2006_flights data frame. We can do this by mutate command.
The key column is added like the following.
Now let’s go back to 2005_2006_flights data frame, and merge the 2007 data to this data frame.
Click Merge menu.
Select 2007_flights data frame, and click Run button.
2007 actual data is merged to the data frame we have been working on.
After that, go back to the Line Chart, and set the “key” column for Color By. 2007 data is appended to the blue actual line, and forecast result by auto.arima is shown in orange line and band.
Well, this is kind of uninteresting result. auto.arima is just returning flat line with average value, and basically giving up on forecasting with further details like smaller up and down.
Let’s see how Prophet does compared to this.
From the Source step, create a branch so that we can work on forecasting with Prophet there.
In the branch, select “Run Time Series Forecast” menu.
In the dialog, set columns like the screenshot below. Also, set Forecasting Time Period to 12, so that we forecast for one year. You can leave Parameters on the right-hand side with defaults. Click Run button.
That will add columns with forecasted values like the following.
Let’s rename the forecasted_value column to “count”, just like the main 2005_2006_flights data frame so that we can show them on the same line chart. (Before that I needed to remove the original “count” column.)
Then let’s add “key” column with the value “prophet”, just like we did for actual 2007 data, so that this can be shown as a line with a different color on the line chart.
“key” column is added.
Prophet returns not only the forecasted data but also the past data, in this case, that is 2005 and 2006 data. But we are interested in only the forecasted data for the comparison, we can filter out the 2005 and 2006 before going back to the main data frame and merge this result to the one with ARIMA.
We can go back to the main branch of 2005_2006_flights, and merge the forecasting result by Prophet to the main.
Now we can compare the Prophet result on the same line chart. The prophet result is the green line.
So, Prophet is showing a reasonable seasonal trend unlike auto.arima, even though the absolute values are kind of off from the actual 2007 data.
This showcases why Prophet was developed by a team at Facebook in the first place. For algorithm like ARIMA, oftentimes, further configuration is needed to make it produce reasonable results, which is out of reach of many people who are not not professionally trained experts.
But, is this the best we can do with ARIMA? auto.arima does not reproduce the seasonal pattern at all.
Let’s see what ARIMA can do with a little more hint.
Instead of auto.arima, let’s manually specify the parameters for ARIMA model, and also give ARIMA a clue that this data has yearly seasonality.
Open the Script we saved before, and comment out line 17, which was calling auto.arima function, and instead uncomment line 18 to enable the line with Arima function.
Click Save button.
This is line 18, with Arima function instead of auto.arima function.
fit_arima <- Arima(training_data_ts, order=c(1,1,0), seasonal=list(order=c(1,0,0), period=12))
Numbers in order parameter are the ARIMA’s meta-parameters. I manually adjusted them so that this ARIMA model fits well with our data.
ARIMA, which stands for AutoRegressive Integrated Moving Average model, is a combination of 2 types of models, namely, AutoRegressive model and Moving Average model, and a processing called differencing. The first number (1) goes to its AutoRegressive model part, and the 2nd number (0) goes to differencing part, and the 3rd number (1) goes to the Moving Average model part. For more detail, take a look at this free great textbook.
seasonal parameter is about seasonality. period parameter in it tells that the data has seasonality with the period of 12 month. order parameter in it again consists of 3 numbers that controls seasonal part of AutoRegressive model, differencing, and Moving Average model. This section of the same textbook has more detail on it.
With this change, let’s rebuild the ARIMA model. Go back to Build Model step and click the round green Run button.
Then go back to the last Bind Rows step, and open the line chart.
Here is the result with the new ARIMA model.
It’s the orange line and band. Let me show the chart in full screen.
This looks like a pretty good forecasting result compared to the previous flat line. And it’s even better than the one with Prophet (Green line)!
With the newly configured ARIMA model, we can see that the prediction interval is much narrower, which means ARIMA is rather confident with this forecast, and that the actual values are fitting well within the prediction interval range.
With Model Extension Framework, you can quickly bring your favorite R package and build models inside Exploratory. This time, we brought in ‘forecast’ package’s ARIMA model and built forecasting models, one with Auto-ARIMA and another with ARIMA with fine tuned parameters. And we compared the results against the one from Prophet with no parameter setting, and found that the ARIMA with fine tuned parameters produced the best result for this data.
Now, is it always like that?
In the next post, I will try with other scenarios and compare the results to answer, so stay tuned!
If you don’t have Exploratory Desktop yet, you can sign up from here for a 30 days free trial. If you are currently a student or teacher, then it’s free!