Invoice document digitization is crucial for efficient document management in industry. Scanned invoice images are often noisy for various reasons, which degrades the detection accuracy of OCR (optical character recognition). In this paper, letter data obtained from images of invoices are denoised using a modified autoencoder-based deep learning method. A stacked denoising autoencoder (SDAE) is implemented with two hidden layers each in the encoder network and the decoder network. In order to capture the most salient features of the training samples, an undercomplete autoencoder is designed with non-linear encoder and decoder functions. This autoencoder is regularized for the denoising application using a combined loss function which considers both mean square error and binary cross entropy. A dataset consisting of 59,119 letter images, which contains both English alphabets (upper and lower case) and numbers (0 to 9), is prepared from many scanned invoice images and Windows true type (

Digitizing paper documents is a crucial step in business process automation. It helps industries manage large volumes of documents efficiently. The images obtained by scanning paper documents are converted into a digital format using OCR (optical character recognition) software. During the scanning process, noise can enter the images in the form of background noise, blurred and faded letters due to dirt on the paper or lens, watermarks, moisture on the lens, or physical handling of the papers. Transmission errors and compression methods also add noise to the images [

Image denoising techniques have attracted researchers for half a century, and denoising remains a challenging and open task [

Machine learning methods for image denoising include sparsity-based methods, dictionary learning methods, total variation regularization, gradient histogram estimation and preservation (GHEP), etc. [

Deep learning techniques are a subset of machine learning and have significant applications in many fields [

In this paper, a modified stacked denoising autoencoder is implemented and used for denoising receipt data. The proposed autoencoder design can capture the most salient features of the training samples. An undercomplete autoencoder is designed with non-linear encoder and decoder functions. This autoencoder is regularized for the denoising application using a combined loss function which considers both mean square error and binary cross entropy. Two-level stacking is done to increase the efficiency of the network. A dataset consisting of 59,119 letter images is prepared from different scanned invoice (receipt) images and Windows true type (

Autoencoders [ take an input vector x ∈ R^d, which can be a patch of an image. This input is mapped to a hidden representation y ∈ R^d′. Here d represents the dimensionality of the vector space. The mapping is given by

y = s(Wx + b)

where s is a nonlinear function, W is a weight matrix, and b is a bias vector. The hidden representation y is mapped back to a vector z ∈ R^d, in order to obtain the reconstructed input data. This reverse mapping is given by

z = s(W′y + b′)

The model parameters {W, W′, b, b′} are optimized to minimize the cost function, which is the average reconstruction error over the n training samples, given by

J = (1/n) Σᵢ L(xᵢ, zᵢ)

where L is a loss function, such as the squared error L(x, z) = ‖x − z‖².
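The encoder and decoder mappings can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized weights, not the trained network; the dimensions 2400 (= 60 × 40 flattened) and 128 are taken from the experimental setup described later, and all variable names are illustrative:

```python
import numpy as np

def sigmoid(a):
    # Nonlinear function s used for both encoder and decoder
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
d, d_hidden = 2400, 128                   # input (60x40 flattened) and code sizes

W  = rng.normal(0, 0.01, (d_hidden, d))   # encoder weight matrix W
b  = np.zeros(d_hidden)                   # encoder bias b
W2 = rng.normal(0, 0.01, (d, d_hidden))   # decoder weight matrix W'
b2 = np.zeros(d)                          # decoder bias b'

x = rng.random(d)                         # an input patch (flattened image)
y = sigmoid(W @ x + b)                    # hidden representation y = s(Wx + b)
z = sigmoid(W2 @ y + b2)                  # reconstruction       z = s(W'y + b')

mse = np.mean((x - z) ** 2)               # squared-error loss L(x, z)
```

Since the hidden dimension (128) is much smaller than the input dimension (2400), this forward pass is undercomplete, forcing the code y to keep only the most salient features.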

This network adapts itself to extract features from images, so hand-coded feature descriptors are not needed. Autoencoders can be used for classification and denoising applications. The autoencoder architecture is shown in

Denoising Autoencoder [

The mapping function for the denoising autoencoder is given by

y = s(W(x + r) + b)

where r is a random noise vector used to corrupt the clean input x, and the network is trained to reconstruct the clean x from the corrupted input.

The second term in the cost function is used to minimize correlations between input images.
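The corruption step can be sketched as follows, using additive Gaussian noise with a variance of 20% of the peak signal value as in the experiments later in the paper (the rest of the snippet is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random((60, 40))              # a clean letter image with values in [0, 1]

peak = x.max()
sigma = np.sqrt(0.20 * peak)          # noise variance = 20% of the peak signal value
r = rng.normal(0.0, sigma, x.shape)   # random noise vector r

x_noisy = np.clip(x + r, 0.0, 1.0)    # corrupted input fed to the encoder;
                                      # the training target remains the clean x
```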

A stacked autoencoder is obtained by stacking one layer of autoencoders after another [

The input is a vector x, which is passed through the hidden layers, and y is the decoded vector. The encoder maps the input data x into a hidden representation (code), and the decoder reconstructs the input data from this hidden representation. Here h₁ (the first hidden layer) represents the hidden encoder vector calculated from x, and h₂ (the second hidden layer) represents the second hidden encoder vector calculated from layer h₁. Similarly, h₃ and h₄ are the two hidden layers in the decoder section, which represent the hidden decoded vectors formed from the code generated by the encoder. The encoder hidden layers are computed as

hₙ = s(Wₙhₙ₋₁ + bₙ)

where Wₙ is the weight matrix of the nᵗʰ hidden layer, hₙ₋₁ is the output of the (n−1)ᵗʰ hidden layer, and bₙ is the bias vector of the nᵗʰ hidden layer. The decoder hidden layers are computed in the same way from the code, using the weight matrix and bias vector of each decoder hidden layer.

End-to-end pre-training and layer-wise pre-training are the two methods of training stacked autoencoders. After all the hidden layers are trained, the backpropagation algorithm is used to minimize the cost function and update the weights through an optimization process. The rectified linear unit (ReLU) activation function is used after each hidden layer vector calculation. ReLU does not suffer from the gradient diffusion (vanishing gradient) problem. The ReLU function is

f(x) = max(0, x)

The sigmoid activation function is used in the output layer.

The methodology of the work is shown in

The steps in the experiment are detailed in

Invoices are often generated by Windows-based systems, so it is logical to train the autoencoder with Windows fonts and letter sizes. A Python script is written to extract Windows true type (

Another Python script reads in scanned images of invoices; contours are drawn around letters, and these text boxes are separated, labelled, and added to the respective folders to augment the synthetic dataset, yielding 59,119 images. The pixel values of the images are converted to a Python array along with the labels, a controlled amount of noise is added, and the data are then pickled to form the training, test, and cross-validation sets for the autoencoder.
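The letter-extraction step can be illustrated with a simplified stand-in: instead of OpenCV contour detection, the sketch below crops the tight bounding box of the ink pixels of a binarized letter image and pickles it with a label. All names here are illustrative, not the paper's actual script:

```python
import numpy as np
import pickle

def crop_letter(img, threshold=0.5):
    """Crop the tight bounding box around dark (letter) pixels.

    img: 2-D array with values in [0, 1]; background ~1, ink ~0.
    """
    mask = img < threshold                 # letter (ink) pixels
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]    # first/last ink row
    c0, c1 = np.where(cols)[0][[0, -1]]    # first/last ink column
    return img[r0:r1 + 1, c0:c1 + 1]

# A toy "scanned" page: white background with a dark 5x3 blob as a letter
page = np.ones((20, 20))
page[4:9, 6:9] = 0.0

letter = crop_letter(page)
sample = pickle.dumps({"image": letter, "label": "A"})   # pickled with its label
```

In the actual pipeline each crop would be resized to the fixed 60 × 40 input size before being added to the dataset.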

The stacked denoising autoencoder is implemented in Python using the PyTorch deep learning library. Pickled noisy images of size 59,119 × 60 × 40 are fed to the input of the stacked denoising autoencoder. The Adam optimizer is used, with a learning rate of 10⁻³ and a batch size of 16. The network is trained for 100 epochs on an HPCC with NVIDIA Tesla K20M GPU hardware.

During each epoch, the mean square error and cross entropy error are calculated, and this loss score is backpropagated through the optimizer in order to update the weights of the network. Two hidden layers having 512 and 128 neurons respectively are included in the encoder section, and another two hidden layers having 128 and 512 neurons respectively are included in the decoder section. The ReLU activation function is used after each hidden layer, and the sigmoid activation function is used at the output layer. Additive Gaussian noise of zero mean and a variance of 20% of the peak signal value is used to obtain the noisy versions of the letter images.
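The described setup can be sketched in PyTorch as follows. This is a minimal reconstruction from the description above (2400-dimensional input for 60 × 40 images, encoder layers of 512 and 128 neurons, decoder layers of 128 and 512 neurons, ReLU in the hidden layers, sigmoid at the output, Adam with learning rate 10⁻³, batch size 16, and the combined MSE + BCE loss); everything not stated in the text, including the random batch, is illustrative:

```python
import torch
import torch.nn as nn

class SDAE(nn.Module):
    def __init__(self, d_in=60 * 40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, d_in), nn.Sigmoid(),   # outputs in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mse, bce = nn.MSELoss(), nn.BCELoss()

noisy = torch.rand(16, 2400)   # one batch of noisy inputs (batch size 16)
clean = torch.rand(16, 2400)   # corresponding clean targets

out = model(noisy)
loss = mse(out, clean) + bce(out, clean)   # combined loss backpropagated
opt.zero_grad()
loss.backward()
opt.step()
```

The sigmoid output layer keeps reconstructions in [0, 1], which is what allows the binary cross entropy term to be combined with the mean square error term.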

The system shown in

SDAE (stacked denoising autoencoder) outputs with the combined loss function for randomly selected letter images at 20%, 40% and 60% (of the peak signal value) noise variances are shown in

Autoencoder output for a set of input letter images corrupted by 20% noise variance is shown in

Autoencoder output for the same set of input letter images corrupted by 40% noise variance is shown in

Autoencoder output for input letter images corrupted by 60% noise variance is shown in

It is essential to compare the performance of other denoising filters on these letter images from invoices at different noise levels. Results of other filters for a randomly selected invoice image representing the number “4” corrupted by 20% noise variance are shown in

Results of other filters for the same image corrupted by 60% noise variance are shown in

The visual quality of the SDAE output is evaluated based on the following metrics:

Signal to Noise Ratio (SNR)

Peak Signal to Noise Ratio (PSNR)

Structural Similarity Index (SSIM)

Universal Image Quality Index (UQI)

The SNR is expressed as

SNR = 10 log₁₀(P_signal / P_noise)

where P_signal is the power of the clean signal and P_noise is the power of the noise.

Signal to noise ratio improvement of various denoising methods for Gaussian noise of zero mean and different noise variances is shown in

Peak signal to noise ratio (PSNR) is the ratio between the maximum possible power of an image and the power of the corrupting noise that affects the quality of its representation. PSNR is defined as

PSNR = 20 log₁₀(M / RMSE)

Here, M is the maximum possible intensity level (with the minimum intensity level taken to be 0) in an image, and RMSE is the root mean square error between the clean and denoised images.
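Under these definitions, SNR and PSNR can be computed as in the NumPy sketch below (M = 255 for 8-bit images; the clean and denoised arrays are illustrative stand-ins, not the paper's data):

```python
import numpy as np

def snr_db(clean, denoised):
    # Ratio of signal power to residual-noise power, in dB
    noise = clean - denoised
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def psnr_db(clean, denoised, M=255.0):
    # PSNR = 20 log10(M / RMSE), in dB
    rmse = np.sqrt(np.mean((clean - denoised) ** 2))
    return 20.0 * np.log10(M / rmse)

rng = np.random.default_rng(2)
clean = rng.uniform(0, 255, (60, 40))
denoised = clean + rng.normal(0, 5.0, clean.shape)   # small residual error

snr = snr_db(clean, denoised)
psnr = psnr_db(clean, denoised)
```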

Method | SNR (dB) at 20% noise | SNR (dB) at 40% noise | SNR (dB) at 60% noise
---|---|---|---
SDAE | 21.83 | 21.0 | 19.19
NLM filter | 11.59 | 8.55 | 6.73
AD filter | 12.31 | 9.82 | 7.77
Gaussian filter | 11.80 | 9.19 | 7.26
Mean filter | 12.56 | 10.39 | 8.63

Method | PSNR (dB) at 20% noise | PSNR (dB) at 40% noise | PSNR (dB) at 60% noise
---|---|---|---
SDAE | 76.85 | 75.77 | 72.51
NLM filter | 74.32 | 68.33 | 65.04
AD filter | 71.76 | 69.69 | 66.81
Gaussian filter | 72.12 | 69.27 | 66.08
Mean filter | 69.84 | 68.83 | 67.29

Structural similarity index (SSIM) [

The parameters

An SSIM comparison of five denoising methods (SDAE, NLM filter, Gaussian filter, mean filter and anisotropic diffusion filter) is shown in

UQI [
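As a reference point for this metric, a single-window NumPy implementation of the universal image quality index (the global version of Wang and Bovik's index, without the sliding window typically used in practice) might look like:

```python
import numpy as np

def uqi(x, y):
    """Universal image quality index over a single window.

    Combines correlation, luminance distortion and contrast
    distortion; equals 1.0 only for identical images.
    """
    x, y = x.astype(float).ravel(), y.astype(float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4.0 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))

rng = np.random.default_rng(3)
img = rng.uniform(0, 255, (60, 40))
assert abs(uqi(img, img) - 1.0) < 1e-9   # identical images give index 1
```

Distorted images score below 1, which is what makes the index useful for ranking denoising methods.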

The proposed stacked denoising autoencoder with the combined MSE and BCE loss function is compared with a standard stacked denoising autoencoder using a single binary cross entropy loss function, in terms of signal to noise ratio (SNR) and peak signal to noise ratio (PSNR). Comparison results at two different noise levels for a single selected letter ‘X’ are shown in

Method | SNR (dB) at 20% noise | SNR (dB) at 60% noise | PSNR (dB) at 20% noise | PSNR (dB) at 60% noise
---|---|---|---|---
Proposed SDAE | 20.42 | 19.58 | 76.74 | 73.08
Normal SDAE | 20.23 | 18.99 | 74.95 | 71.98

The proposed method for denoising letter images from invoice documents using a modified stacked denoising autoencoder (SDAE) is observed to have excellent signal to noise ratio, structural similarity index and universal quality index, even under extremely noisy conditions. A combined loss function, which considers both mean square error and binary cross entropy, is used to regularize the denoising function. The undercomplete representation of the autoencoder used in this denoising method has better feature extraction properties. A dataset consisting of 59,119 letter images, which contains both English alphabets (upper and lower case) and numbers (0 to 9), is prepared from many scanned invoice images and Windows true type (

This research was funded by the Deanship of Scientific Research at Princess Nourah bint Abdulrahman University through the Fast-track Research Funding Program.