Price prediction of goods is a vital point of research due to how common e-commerce platforms are. There are several efforts conducted to forecast the price of items using classic machine learning algorithms and statistical models. These models can predict prices of various financial instruments,

Today, making a decision about buying or selling any assets or gifts is a very difficult process. A lot of factors and complexities can influence your decision, such as the best time for buying or selling the products, goods, or seasonal gifts. In the financial market, the shareholders need to know when to sell, and the customers need to get the products at good prices. Thus, the price prediction issue has arisen. Price forecasts for trading stocks and commodities were primarily based on intuition. As trading grew, individuals tried to find methods and tools which could accurately forecast the rates, increasing their gains and reducing their risk. Many methods such as fundamental analysis, technical analysis, information mining, statistical approaches, and machine learning techniques are used for price predication. The most hopeful results of forecast systems might be a shift in people’s state of minds [

As discussed above, several methods can be used to resolve the price forecast issues, but we will focus on only two: statistical models and machine learning models. There are plenty of machine leaning models for evaluating and forecasting costs. Machine learning models can gain from historic correlation and patterns in the data to make data-driven forecasts or choices [

Regression analysis is widely used for forecasting and prediction, where its use has a high degree of overlap with machine learning. Regression analysis is likewise used to find the function that maps the independent variables to the predicted (

Many researchers have explored the problem of price prediction issues with the aid of various machine learning approaches. For example, in [

Statistical methods for price prediction problems in [

In the literature, many studies have focused on the subject of price prediction for a variety of financial instruments (

Despite these efforts, there is no existing research providing a model for predicting the prices of seasonal goods, to our best knowledge. Thus, there is a vital need for models that can predict the prices of seasonal goods and the ability to compare the performance of the machine learning model against the statistical models. These proposed models can help the seller evaluate the proper seasonal goods pricing, which attracts clients and increases profits based on historical data. Historical data of seasonal goods pricing can guide the seller to select the most suitable prices.

In this vein, we framed the task of predicting the price of seasonal goods as a regression task. We proposed a system for predicting the prices of the seasonal items using both machine learning methods and statistical methods. The proposed system utilized a real-life dataset of Christmas gifts. In addition, the proposed work utilized a set of machine learning models and the ARIMA statistical model price. The proposed models are thoroughly evaluated on different evaluation metrics. The main contributions of this paper can be summarized as follows:

To the best of the authors’ knowledge, we proposed the first price prediction models of seasonal goods.

We provided a comparison between the machine learning-based model and the statistical-based model to learn which model type resolves the problem better.

The proposed models are evaluated on several performance evaluation metrics.

The rest of the paper is organized as follows. Section 2 presents a brief theoretical background, explaining the methods and algorithms used. Section 3 describes the related work of price prediction. We exposed the proposed work in Section 4. Section 5 shows and discusses the results. Finally, Section 6 concludes the contributions of this study.

This section provides an overview of the regression methods utilized in this article. The background discusses the theory behind the following algorithms: Linear regression, ridge regression, support vector regression (SVR), random forest regression (RF), and ARIMA algorithms.

The fundamental premise of this regression method is to forecast an interesting time-series

In

The observations in

The Ridge regression method is a widely used algorithm for resolving the linear regression model’s multicollinearity issue [

Being approximated by the solution of the classic linear least squares equation [

where

The ridge regression model works by framing the following equation an optimization problem, and the target is to minimize the objective function below:

With regard to

where

where

The regression is a widespread categorization of machine learning models, where the main difference between the regression task and other tasks is that the regression output is continuous. On the other hand, the output of the classification task is discrete, from a finite set. Generally speaking, an ongoing multivariate function is estimated using a regression model. Vector machine support solves difficulties of binary classification by presenting them as issues of convex optimization. The optimization challenge involves determining the largest margin between the hyperplane and properly categorizing as many workouts as possible. SVMs with support vectors represent this ideal hyperplane. SVM generalization to SVR is achieved with the introduction of an e-insensitive area, dubbed the e-tube, as shown in

SVR uses an e-insensitive reduction function to address regression problems. This functionality makes it possible for an endurance degree to make mistakes not more than

When employing the ɛ-insensitive loss function, the objective is to design a function that fits the present training data with a deviation smaller than or equal to ɛ, while remaining as flat as feasible. This implies that one desires a small weight vector w. One technique to accomplish this is to minimize the vector’s quadratic norm w. Due to the possibility that this task is infeasible, slack variables

where

The dual variables are denoted by

The random forest (RF) regression method is a kind of ensemble learning in which a large number of regression trees are combined. A regression tree is a collection of criteria or constraints that are grouped hierarchically and applied sequentially from the root to the leaf of the tree. The RF starts with a large number of bootstrap samples selected at random from the original training dataset. Each of the bootstrap samples is fitted using a regression tree. For each node in the tree, a tiny subset of the entire set of input variables is randomly picked for binary partitioning [

For forecasting, the ARIMA approach makes use of prior values in the Series. George Box and Gwilym Jenkins devised it, and it is a commonly used forecasting model [

Using the lag operator denoted L; Autoregressive AR terms are lagged values of the dependent variable and refer to it as the lag order p, which is the number of time delays. The following formula may be used to express a non-seasonal

Moving Average MA terms are lagged forecast errors in forecasts between actual past values and their predicted values, and they are referred to as the order of moving typical q. A non-seasonal MA(q) can be created as follows:

where

where q, d, and p are positive numbers greater than or equal to zero that denote the order of the model’s autoregressive, integrated, and moving average components. The number d specifies the degree of differentiation [

Auto-Correlation Functions (ACFs) and Partial Auto-Correlation Functions (PACFs) are typically utilized to identify the dependence of a time series variable on its history. The fixed time series’ autoregressive and moving average terms are identified by observing the patterns in the graphs of ACFs and PACFs. Autocorrelation is the connection between time series and the lag of the very same time-series, while partial autocorrelations are the correlation coefficients between the standard time series and the lag of the exact same time series, but without the effect of the members in between. The ACF, which is a bar chart of time series of coefficients of connection of time-series and lags of itself, and the PACF, which is a plot of partial coefficients of correlation of time-series and lags of itself, can be used to identify the variety of AR and MA terms [

In this section, we discuss the related work of price prediction using machine learning and deep learning method. First, we expose several works utilizing the classic machine learning methods such as SVR, ANN, and decision trees. Second, we discuss the methods utilizing the deep learning techniques for predicting product prices.

Regression analysis is commonly utilized for predictive and forecasting purposes. Regression analysis may be implemented in many different ways. When applied to smaller effects or observational data, regression techniques may produce deceptive results, especially in multiple applications. There is no universally agreed upon real form of the data-generating process, and assumptions may be tested if sufficient data are supplied. Even when the assumptions of a model are only somewhat violated, the regression models may still be beneficial in pricing prediction. Because incorrect conclusions are possible, use caution when using these models. For instance, in some circumstances, correlation might be of use even though it is not directly related to correlation. Regression analysis techniques will provide better results if you have random or semi-random data producing random or semi-random data.

The authors of [

In [

Data-driven methods for predicting natural gas prices were described in [

A 24-year time-series dataset was used to build the decision tree (DT) model to anticipate crude oil prices in [

In [

In [

In [

In [

The research study’s independent variable was the closing rate of Bitcoin in USD as figured out by the CoinDesk Bitcoin Price Index. Efficiency evaluations of designs were carried out in order to determine their effectiveness. Data referring to the blockchain, consisting of the mining difficulty and hash rate are provided. The aim of the research work was to define new performance measures that might be beneficial to traders. Deep learning models such as the RNN and LSTM are effective for Bitcoin prediction in USD. LSTM surpassed RNN, but not substantially. Information of this nature is currently being gathered from CoinDesk on a daily basis for future usage. The data are not offered in a useful or real-time setting for predicting the future. The performance benefits acquired from the parallelization of machine learning algorithms on a GPU appear with 70.7% performance improvement for training the LSTM. The real efficiency of the ARIMA-based model upon error was significantly worse than the NN-based models [

Stock exchange prediction is difficult due to its nonlinear, dynamic, and complex nature. A successful forecast has some intriguing benefits that generally affect the decision of a financial trader on the purchase or sale of an instrument. Four data mining strategies (

The proposed system of seasonal price prediction is divided into five stages, namely, 1) data collection and preparation, 2) model designing, 3) training the models, 4) testing the models, and 5) deploying the models.

In the first stage, we proposed identifying the data sources. Generating the dataset is a very import step in the proposed system. The quantity and quality of the collected data determines the efficiency of the output. The more data available, the more accurate the prediction will be. The prepared data are utilized in the training and test stages. Data was cleaned and converted from raw to a useable format. Dataset were split for training and tested in a random way.

In the second stage, several machine learning models and the ARIMA statistical models were designed and their hyperparameter were set. The aim of this step is to build the machine learning models to predict the goods’ prices. This stage starts with determination of

In the models training stage, the proposed models are trained on a portion of the datasets (

During the models testing stage, after training the machine learning models on a portion of the dataset, the models are tested to evaluate their performance. In this stage, the models’ accuracy rate is validated by feeding them a test set (

Finally, in the last stage, the trained and tested models are deployed for final utilization by the users. Now, the proposed models are ready to be moved to the production stage for real situations.

One of the goals of online retailers is to increase the desirability and the value of the products they sell. Offering promotions and special offers is an effective method for driving ancillary traffic to the site. The collected data are real data from an online retailer^{1}

In

In the proposed models, we made use of features number 2 to 4, as these three input features capture important information that can be utilized by the proposed model to predict the gift price accurately. For instance, feature number 2 (

Features number 5, 6, 13, and 14 are not utilized in the proposed models, as these features are date and time features. In regression problems, the model can handle only numerical data. In addition, considering these four features should change the problem from a regression problem to a time-series analysis problem, the latter is beyond the scope of the current work.

No. | Feature | Description |
---|---|---|

1 | gift_id | Unique ID of gift |

2 | gift_type | Type of gift (clothes/perfumes/etc.) |

3 | gift_category | Category to which the gift belongs under that gift type |

4 | gift_cluster | Type of industry the gift belongs |

5 | instock_date | Date of arrival of stock |

6 | stock_update_date | Date on which the stock was updated |

7–12 | lsg_1–lsg_6 | Anonymized variables related to gift |

13, 14 | uk_date1, uk_date2 | Buyer related dates |

15 | is_discounted | Shows whether the discounted is applicable on the gift |

16 | Price | The total price |

Six features (

One of the most active study areas in machine learning is loss functions. Loss functions are critical in the development of machine learning algorithms and their performance optimization [

The Mean Squared Error (MSE) is a model assessment statistic that is often used in conjunction with regression models. The mean squared error of a model in relation to a test set is equal to the average of the squared prediction errors over all occurrences in the test set. The prediction error is defined as the difference between the actual and predicted values for a certain occurrence [

The Mean Absolute Error (MAE) is a metric for comparing the distinct values of two continuous variables. It is an average/mean of the absolute error that is computed using the saw me scale as the data source. It cannot be used to compare series with varying scales [

The Root Mean Square Error (RMSE) is computed by increasing the residuals by their square root. It represents the design’s outright fit to the information, suggesting how close the observed data points are to the design’s expected worth at the moment. The R-squared value is a relative measure of the fit, while the RMSE value is an absolute value. RMSE might additionally be understood as the standard deviation of the unexplained variance, because it is revealed in the exact same units as the response variable. Lowered RMSE readings indicate a much better match. The root mean square error is a helpful sign of how successfully the design anticipates the reaction [

R-squared is a statistic that indicates a model’s quality of fit. It is a statistical measure of how closely the regression line approximates the real data in the context of regression. It is so critical when a statistical model is employed to forecast future events or to evaluate hypotheses. There are other variations (for further information, see the remark below); the one shown here is the most generally used:

The sum squared regression is equal to the sum of the residuals squared, while the total sum of squares is equal to the sum of the data’s deviations from the mean squared. Because it is a percentage, it can only accept values between 0 and 1 [

Due to its scale independence and interpretability, the mean absolute percentage error (MAPE) is one of the most extensively used metrics of prediction accuracy. Let

where

The proposed models are implemented in the Python programming language. We mainly used the Scikit-learn machine learning library. It is open-source and includes efficient implementations of several machine learning algorithms. In addition, we used pandas and

For the data analysis task, we handled the outliers of the dataset by applying the percentile function to find the

We proposed using the default parameters of the utilized machine learning models. The exception to this assumption is the Ridge model. We set the alpha parameter to ridgecv.alpha_value, and we set the normalize parameter to

For the statistical model, we applied the grid search technique to find out the best combination of the values of the parameters p, d, and q. The grid search function is considered a brute-force approach to obtain the best prediction accuracy rate based on selecting the best values of the model three parameters.

The experiments were conducted on a computer with an Intel(R) Core(TM) i5-7200U CPU@ 2.50 GHz. The utilized OS is 64-bit Windows 10. All implementations are written in the Python programming language. We used the full dataset of Christmas gifts from an online retailer to train and test the proposed statistical and machine learning models. We used the split 80% and 20% for the training and test data, respectively. All the machine learning models’ parameters are set to the default values of the Scikit-learn package. The ARIMA model parameters

The first point of comparison is the basic evaluation metric scores. We evaluated the proposed model on four different metrics, namely, 1) MAE, 2) RMSE, 3) MAPE, and 4)

The second point of comparison is the visual results. In

The obtained results outline that the statistical ARIMA model yielded a comparable result to the machine learning-based model. While the best results are produced by the random forest model, the ARIMA model was the second-best model. However, the performance gap between the random forest and ARIMA was not large. Thus, the recommendation of these results is that both machine learning and statistical models are suitable for the task of price prediction for retailer seasonal goods.

Model | MAE | RMSE | MAPE | |
---|---|---|---|---|

SVR | 34.56 | 46.81 | 2.38 | 13.0 |

Ridge | 38.13 | 48.29 | 3.95 | 7.3 |

Random forest | 20.46 | 31.31 | 1.63 | 61 |

Linear regression | 38.13 | 48.29 | 3.93 | 7.3 |

ARIMA | 34.44 | 44.35 | 3.95 | 7.8 |

Model | Time in seconds |
---|---|

SVR | 7.266 |

Ridge | 0.013 |

Random forest | 1.047 |

Linear regression | 0.362 |

ARIMA | 6.827 |

In this work, we studied the task of predicting pricing for retail goods. We focused on predicting the prices of seasonal Christmas items. We employed a real dataset of Christmas gifts from an online retailer. We utilized four machine learning-based models (

This study recommends the random forest machine learning-based model and the ARIMA statistical-based model to address the problem of predicting seasonal goods’ pricing. The future directions include building hybrid models for the sake of improving the prediction quality. The hybrid models can be achieved using ensemble learning techniques. Moreover, we proposed framing the problem as a time-series problem and to include the date and time input features in the proposed models. For instance, we can utilize the well-known recurrent neural network for predicting the seasonal goods prices as a time series problem.