An Efficient Video Inpainting Approach Using Deep Belief Network

The video inpainting process helps in several video editing and restoration processes like unwanted object removal, scratch or damage rebuilding, and retargeting. It intends to fill spatio-temporal holes with reasonable content in the video. Inspite of the recent advancements of deep learning for image inpainting, it is challenging to outspread the techniques into the videos owing to the extra time dimensions. In this view, this paper presents an efficient video inpainting approach using beetle antenna search with deep belief network (VIA-BASDBN). The proposed VIA-BASDBN technique initially converts the videos into a set of frames and they are again split into a region of 5*5 blocks. In addition, the VIABASDBN technique involves the design of optimal DBN model, which receives input features from Local Binary Patterns (LBP) to categorize the blocks into smooth or structured regions. Furthermore, the weight vectors of the DBN model are optimally chosen by the use of BAS technique. Finally, the inpainting of the smooth and structured regions takes place using the mean and patch matching approaches respectively. The patch matching process depends upon the minimal Euclidean distance among the extracted SIFT features of the actual and references patches. In order to examine the effective outcome of the VIA-BASDBN technique, a series of simulations take place and the results denoted the promising performance.


Introduction
Digital inpainting is widely employed in different applications such as object removal/image resolution and improvement restoration of images. The primary objective of digital inpainting is to complete the areas/ regions by lacking information that has been vanished/lost intentionally. The inpainting model was stimulated by the artist utilizing their individual capabilities and acquaintance for reconstructing/fixing the damage which appeared in sculptures/paintings. Currently, the ability of digitalizing different kinds of visual data creates the requirement for methods which repair digital damages, as made with painting [1]. Furthermore, in communication, audio/video applications, it is necessary to retrieve the signals which get corrupted by narrowband intrusions such as electric hum [2]. The intervention could be determined sparsely in frequency domain (natural form). Video inpainting is an approach for filling the lost gaps/regions, retrieve object in a video. Removing unwanted objects in the video could generate lost regions [3]. The aim is to fill the lost regions hence the outcomes are convincing visually in space and time. Video inpainting could assist many restoration and video editing processes like scratch or damage restoration, retargeting, and unwanted removal of objects. Most importantly, and excepting its traditional demand, video inpainting could be employed with Augmented Reality (AR) for great visual experiences; Eliminating present item provides greater opportunity beforehand overlaying novel elements in a scene [4]. Hence, as a Diminished Reality (DR) technique, it opens up novel opportunities to be paired with novel deep learning or realtime based AR technology. Furthermore, there are many semi-online streaming scenarios like visual privacy filtering and automatic content filtering. Only a smaller wait would results in significant latency, therefore make the speed themselves a significant challenge.
In spite of considerable advances on DL based inpainting of an individual image, still it is stimulating to expand this method to video domain because of the extra time dimensions. The complications that come from difficult motion and higher requirements on temporal consistency make video inpainting a complex challenge [5]. A direct approach for performing video inpainting is to employ image inpainting on all frames separately. But, this ignores motions regularity come from the video dynamics, and hence it is unable to estimate the nontrivial appearance change in image space over time [6]. Furthermore, this system inevitably brings temporal inconsistency and cause serious flickering artifact. In order to tackle the temporal consistency, various approaches were introduced for filling the lost motion field; with a greedy election of local spatio and temporal patches, for each iterative optimization/frame diffusion based method [7]. But, the initial 2 method treats flow assessment to be self-governing of color assessment and lastly based on time consuming optimization. This paper presents an efficient video inpainting approach using beetle antenna search with deep belief network (VIA-BASDBN). The proposed VIA-BASDBN technique initially converts the videos into a set of frames and they are again split into a region of 5*5 blocks. Followed by, the VIA-BASDBN technique involves the design of optimal DBN model, which receives input features from Local Binary Patterns (LBP) for the classification of blocks into smooth or structured regions. Moreover, the weight vectors of the DBN model are optimally chosen by the use of BAS technique. At last, the inpainting of the smooth and structured regions takes place using the mean and patch matching approaches respectively. For examining the superior performance of the VIA-BASDBN technique, a set of experimental analyses is carried out.

Literature Review
Patil et al. [8] proposed a novel inpainting method which depends on MS modeling, in which the original images are precisely attained by inpainting procedure in the masked image. Now, DWT method is employed to process is the digital images. Furthermore, to discover an optimum filter coefficient from DWT, a renowned optimization method called DA is employed. Furthermore, the smoothing of images is processed by regenerating kernel Hilbert smoothing models. Kim et al. [9] proposed a new DNN framework for faster video inpainting. Based on encoder-decoder models, this method is developed for collecting and refining data from neighboring frames and still synthesizes unknown regions. Simultaneously, the output is forced to be temporally reliable with a temporal memory and recurrent feedback model. Tudavekar et al. [10] proposed a new 2 phase architecture. Initially, subbands of wavelets of lower resolution images are attained by the dual tree complex WT. auto-regression and Criminisi algorithms are later employed to this sub-band for inpainting the lost region. The FL based histogram equalization is employed for additionally enhancing the image by maintaining the brightness of an image and enhance the local contrast. Next, the images are improved by a super resolution method. The procedure of inpainting, down-sampling and consequently improving the videos by the super resolution techniques reduce the video inpainting time. To decrease the computation complexity of image inpainting, an enhanced Criminisi algorithm is presented in [11] on the basis of BBO model. Based on characteristics of Criminisi, the novel methods use the mutation and migration of BBO model for searching optimal matching blocks and conquer the shortcoming of Criminisi algorithms through searching optimal matching blocks in the unbroken region in an image. Janardhana Rao et al. [12] aimed to improve new DL concepts for handling video inpainting. In the first step, motion tracing is carried out, i.e., the procedure of defining motion vector which describes the conversion from neighboring frame in a sequence video. Furthermore, the patches/regions of all the frames are classified by an enhanced RNN, where the regions are divided into a structure and smooth regions. This can be implemented by the texture features named grey level co-occurrence matrix. Now, the hybridization of 2 Meta heuristic approaches such as CS and MVO algorithms named CSMVO is employed for optimizing the patch matching and RNN model. Ding et al. [13] proposed a novel forensic refinement architecture for localizing the deep in painted region with the consideration of spatial temporal viewpoints. Initially, designed spatiotemporal convolutions for suppressing redundancy to highlight deep inpainting traces. Next, a detection model is made by 2 up sampling layers and 4 concatenated ResNet blocks, for achieving a rough localization map. Lastly, an adapted U-net based refinement modules are presented to the pixelwise location mapping. Zeng et al. [14] proposed to learn a joint STTN to video inpainting. Particularly, fill lost region in each input frame through self-attention, also proposed to enhance STTN through a spatial temporal adversarial loss. In order to exhibit the dominance of the presented approach, they performed qualitative and quantitative assessments with the aid of standard stationary mask.
Li et al. [15] presented a new context aggregation network for effectively exploiting short-term and long term frame data to video inpainting. In encoding phase, they proposed boundary aware short-term context aggregation, that align and aggregate, from neighboring frame, local region is related closely to the edge context of lost region to the targeting frames (represents the present input frames under inpainting.). Further, proposed a dynamic long term context aggregation for widely refining the feature map made in the encoding phase. Hirohashi et al. [16] proposed to adapt a newly presented video inpainting methods which are capable of restoring higher fidelity images in real time. It estimates optical flows with a CNN also utilizes them for matching occluded regions in the present frames to unconcluded regions in prior frames, restore the previous. Even though the direct applications doesn't lead to acceptable result because of the peculiarity of the SMC video, they demonstrate that 2 improvement makes it promising to attain better result. Zou et al. [17] proposed the Progressive Temporal Feature Alignment Network', that gradually enrich feature extracted from the present frames using the feature wrapped from neighboring frame with optical flows. This approach corrects the spatial misalignment in the temporal feature propagation phase, highly improves temporal consistency and visual quality of the inpainting video. Wu et al. [18] proposed a novel DAPC-Net model that utilizes temporal redundancy data amongst video sequences. Especially, create a DANet mode to align reference frames at the feature level. Afterward, alignment devised a PCNet for completing lost region of the targeting frames.

The Proposed Model
In this study, a novel VIA-BASDBN technique is derived for effective video inpainting process. Initially, the input videos are converted into frames and down-sampling process is carried out. Besides, the LBP technique extracts the features and fed into the ODBN model for the classification of regions into smooth/structure. Finally, the video inpainting of the smooth and structured regions is carried out by mean and patch matching approaches. Fig. 1 illustrates the overall block diagram of VIA-BASDBN model.

Frame Conversion and Down-Sampling
Primarily, the frame extraction process from the input video is carried out and the frames are down sampled to a factor of 2 to obtain low resolution images. The inpainting of the low-resolution image is not affected by noise and thereby the computation cost becomes low. When the low-resolution images are produced, the current image gets subtracted from the previous one. If the residual value results in 0 or less than predefined pixel value, it can be concluded that the inpainting process is not needed. Otherwise, the inpainting process takes place using the VIA-BASDBN technique.

Design of Optimal DBN Model for Smooth/Structure Region Classification
Once the videos are converted into a set of frames, every region of the frame is again split into 5*5 blocks which are then classified by ODBN model into smooth or structured regions. The ODBN model receives the features as input from the LBO model. At the time of training the DBN model, it retains the target as smooth or structure depending upon the standard deviation (SD) of the particular block. When the SD becomes lower than 5, the target is treated as smooth one. Otherwise, it is considered as the structure region, as given below.
In order to improve the performance of the DBN model, the weight vectors of the DBN model are optimally tuned using the BAS.

LBP Model
The LBP features are derived to provide input to the ODBN model for the classification of regions into smooth and structured regions. It offers a descriptor for the images by the use of grayscale values for every pixel. Traditionally, every pixel considers its neighboring eight pixels and creates a block of 3 × 3 pixels. Followed by, the relation with the intermediate pixel with other 8 neighboring pixels are determined [19]. When the grayscale value exceeds the intermediate pixel, it gets substituted by 1; else it becomes 0. The resultant binary pattern is transformed to a decimal number. The LBP operator can be represented as follows: (2) It is noticed that the neighboring pixel count and the radius are the variables which be altered or remain same.

DBN Based Classification Model
The DBN model receives the LBP features as input and categorizes the regions into smooth and structured ones. RBM consists of 2 two layers i.e., stochastic network: the visible layer and hidden layer. The collected data is denoted as visible layer, where the hidden layers attempt to learn features from visible layers focuses for demonstrating the probabilistic data distribution. The network is called restricted since the layer neuron comprises connection toward the neuron in other layers. Connections between layers are symmetric and bidirectional, allowing data transmission.
Eðv; hÞ ¼ À For each neuron pair in network, it is possible to assign probabilities through entropy function, in visible and hidden layers, provide a probabilistic distribution.
The vector probabilities from v as visible layers are provided as overall vector probabilities beyond the hidden layers pðvÞ ¼ P h e ÀEðv;hÞ P v;h e ÀEðv;hÞ Since the RBM doesn't comprise a connection between nearby neurons of related layers, the action is autonomous. It offers the limited probability calculation as The initial RBM version is constructed for resolving problems in binary data. Using the accurate distribution (8) & (9) is generated by Here, sigmoid function represents the sigm(x). The problem solving potentials are limited with RBM for binary data. The RBM forms are applied that enable subsequent data for enhancing the accuracy problems. Gauss types RBM is the popular type of RBM. The visible layer probable distribution in this RBM is adapted toward Gaussian distribution. This variance is called as Gaussian-Bernoulli RBM. The Gaussian-Bernoulli RBM probability distribution is given below The training RBM includes decreasing negative log-probability given as follows Learning rate is represented as e,〈▪〉 d &〈▪〉 m is used for denoting the model and desirable data rate respectively. For binary data, the expectation is drawn from Eqs (10) & (11) where, for constant data, the (14) & (15) gives expectation. The DBN has been generative graphical method which is a class of DNNs. Once the CD is established, Hinton [20] projected to stack trained RBM from the greedy approach for creating the called DBN. This is deep layer network with all layers being an RBM network stacked together for construction of a DBN.
The probabilities of bottom-up inference in the visible layer v to hidden layer h k , is determined as: where b k represents the bias to the layer k th Comparison, the top-down inference from the symmetric version of bottom-up inference that is expressed as [21]: where a k−1 signifies the bias to the layer (k − 1) th The trained of DBN is separated as to 2 phases: pre-training and fine-tuning utilizing BP. Pre-training subsequently fine-tuning has been great mechanism to train as DBN.

Optimal DBN Using BAS Technique
At the time of training the DBN model, the weights of the DBN are optimally selected by the use of BAS and thereby improves the classification performance. The major aim of the ODBN model is to minimize the error between the estimated and actual outcome, as defined below.
where CL E is the estimated output of the DBN model and CL O denotes the original target output. BAS algorithm is one of the smart optimization methods which simulate beetles forage behaviors. Once a beetle's forage, it would employ its right and left antennae for sensing the odor intensity of food. When the odor intensity obtained through the left antennae's are larger, it would fly to the left through the stronger odor intensity; or else, it would fly to the right. The modelling procedure of BAS algorithms are given in the following: b ¼ randsðDim; 1Þ krandsðDim; 1Þk (18) Whereas Dim represents the spatial dimension. The space coordinate of the beetles right and left sides and its antennae is made by Amongst other, x t represent the location of beetles' antennae in t-th iteration, x rt represent the location of beetles' right antennae in t-th iteration, x lt represent the location of beetles left antennae in t-th iteration, and d 0 represent the beetles' 2 locations [22]. Based on the elected FF, the corresponding fitness value of the right and left antennae are estimated, and the beetles move toward the antennae using a smaller fitness number. The position of beetles are iteratively upgraded by Amongst other, δ t denotes the step factor, sign indicates a sign function, and eta represents the variable step factor, that is generally 0.95.

Video Inpainting Process
At this stage, the video inpainting process takes place in two ways. When the classified region is smooth, the inpainting process takes place by taking the mean value of pixels related to the unmasked region of the patch. On the other hand, when the classification region is structured, the patch matching process takes place using SIFT and Euclidean distance.

Inpainting of Smooth Regions
In order to accomplish inpainting of every individual frame in the video, it is partitioned into a set of 28 × 28 blocks. Then, the masked regions are in black whose pixel is signified as '0'. They are given into the trained ODBN model to determine the particular block as smooth or structure. If no uniform texture data exist in the patch area, then it is treated as smooth. Then, the inpainting of smooth regions is carried out by placing the pixels of all blocks among the patches using the mean or average pixel values relevant to the unmasked region of the patch. For instance, in case of 28 × 28 patch, consider the unmasked pixel count as Nu and the pixel is defined by IN ue where ue = 1, 2, …, Nu. The mean of the pixel can be represented by Therefore, the ue th pixel IN ue has changed with novel mean pixel.

Inpainting of Structure Regions
At the time of inpainting the structure regions, the frames are split into 28 × 28 blocks. Furthermore, the SIFT features are derived for the respective blocks. It is an effective technique which is invariant to the modifications in the illumination and permits changes in the existence of occlusions, clutter, and noise. It is strong o affine the transformation, cluttering, and occlusion. When the SIFT features are determined for a specific block, it is needed to calculate the SIFT features of the residual patches in the image. Therefore, the patches get substituted with the patch holding lower Euclidean distance. This process gets iterated for every residual structure region that exists in the frame.

Performance Validation
It will speed up the review and typesetting process. This section investigates the performance of the VIA-BASDBN technique using MATLAB tool against four test videos. Videos 1 and 2 are gathered from [23] which consists of 104 frames. In addition, videos 3 and 4 are taken from [24], comprises 15 frames.      From the above mentioned tables and figures, it is apparent that the VIA-BASDBN technique has accomplished maximum video inpainting performance over the other recent techniques.

Conclusion
In this study, a novel VIA-BASDBN technique is derived for effective video inpainting process. Initially, the input videos are converted into frames and down-sampling process is carried out. Besides, the LBP technique extracts the features and fed into the ODBN model for the classification of regions into smooth/structure. Finally, the video inpainting of the smooth and structured regions is carried out by mean and patch matching approaches. For examining the superior performance of the VIA-BASDBN technique, a set of experimental analyses is carried out. The simulation results pointed out the supremacy of the VIA-BASDBN technique over the recent techniques interms of different aspects. Therefore, it can  be employed as an effective tool for video inpainting. In future, recent DL architectures with bioinspired algorithms can be used for patch matching process to improve the overall efficiency of the VIA-BASDBN technique.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.