A time series is a sequence of data points ordered by time. Time series analysis is a methodology for extracting useful and meaningful information from these data points. Scientifically oriented languages such as R support several time series forecasting models, including ARIMA, HoltWinters and ETS. In an earlier post, we demonstrated using these models to forecast sales from SalesForce data (SFDC). A major drawback of R is that it is single-threaded, which limits its scalability. In this post, we will explore other options: Python, Spark on Scala, and PySpark. For brevity, we will focus only on the ARIMA model.
Sales Forecasting using Python
Unlike R, Python does not support automatic detection of the best ARIMA model. Further, its ARIMA implementation predicts a single data point at a time. Our code handles both limitations. It tries out different combinations of the ARIMA parameters (p, d and q) and picks the best model, which for a given data set is the one with the lowest AIC (Akaike information criterion). To work around single-point prediction, we append the predicted point to the data set and predict again. This incremental prediction allows us to predict N points instead of the single point Python gives by default.
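The two workarounds described above can be sketched roughly as follows. This is a minimal illustration, not the code from the original post: `fit` is a hypothetical callable wrapping whichever ARIMA library is in use, and it is assumed to return an object exposing an `aic` attribute and a `predict_next()` method. Both names are assumptions made for this example.

```python
import itertools


def select_best_order(series, fit, p_values, d_values, q_values):
    """Return the (p, d, q) order with the lowest AIC, plus its fitted model."""
    best_order, best_model = None, None
    for order in itertools.product(p_values, d_values, q_values):
        try:
            model = fit(series, order)  # hypothetical wrapper around the ARIMA library
        except (ValueError, ArithmeticError):
            continue  # some combinations cannot be fitted; skip them
        if best_model is None or model.aic < best_model.aic:
            best_order, best_model = order, model
    return best_order, best_model


def forecast_n(series, order, fit, n_points):
    """Predict n_points values by refitting after each one-step forecast."""
    history = list(series)
    predictions = []
    for _ in range(n_points):
        next_value = fit(history, order).predict_next()  # hypothetical one-step forecast
        predictions.append(next_value)
        history.append(next_value)  # feed the prediction back in as if it were observed
    return predictions
```

Because each forecast is fed back in as if it were an observation, prediction errors can compound as N grows, so long horizons should be treated with some caution.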
The following code is similar to the R code illustrated before, which forecasts SFDC data. The code is self-explanatory and follows the same logic described before. Here we encapsulated loading the data from the database and saving the results in a custom util package. Training detects an ARIMA model with the following parameters, with a root mean square error of 1500959. The final model, fitted on the full data set, has the following parameters and gives the following predictions.
Sales Forecasting using Spark
Both R and Python are single-threaded, which makes them unsuitable for processing (or loading) large data sets. Previously, we relied on the database to retrieve, aggregate and sort the data. Here, we will use the power of Spark to handle this. SparkSQL provides a SQL interface that converts SQL queries into Spark jobs. Additionally, a third-party implementation of ARIMA is available at https://github.com/sryza/spark-timeseries/. In this section and the next, we will use Spark to model and forecast the data.
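As a rough illustration of moving the retrieve, aggregate and sort step into SparkSQL, the sketch below builds an aggregation query and runs it through a SparkSession passed in by the caller. The table name, column names and the monthly granularity are all assumptions made for this example, and the helper function names are hypothetical. The original data layout is not shown in this post.

```python
def build_monthly_sales_query(table, date_col, amount_col):
    """Build a SQL query that totals sales per month, sorted by month."""
    # Identifiers are interpolated directly, so they must come from trusted code.
    month = f"date_format({date_col}, 'yyyy-MM')"
    return (
        f"SELECT {month} AS month, SUM({amount_col}) AS total "
        f"FROM {table} GROUP BY {month} ORDER BY month"
    )


def load_monthly_sales(spark, table, date_col, amount_col):
    """Run the query through a SparkSession and return (month, total) pairs."""
    query = build_monthly_sales_query(table, date_col, amount_col)
    # Spark turns the query into a job; collect() brings the result rows to the driver.
    rows = spark.sql(query).collect()
    return [(row["month"], row["total"]) for row in rows]
```

Here `spark` would be an existing SparkSession, and the table would need to be registered with it beforehand, for example as a temporary view. Collecting to the driver is reasonable only because the aggregated series is small compared with the raw data.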
Sales Forecasting using PySpark
PySpark is a Python interface for Spark. We can reuse the Python implementation explained before, with the following changes:
- The util package, to make use of Spark
- The prediction method, to use Spark ARIMA instead of the default Python implementation
Unfortunately, the third-party implementation for Spark on Python is not native. It just delegates the calls to Java, using Py4J to wire Python to Java.