<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">35442</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2023.035442</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Novel Capability of Object Identification and Recognition Based on Integrated mWMM</article-title>
<alt-title alt-title-type="left-running-head">A Novel Capability of Object Identification and Recognition Based on Integrated mWMM</alt-title>
<alt-title alt-title-type="right-running-head">A Novel Capability of Object Identification and Recognition Based on Integrated mWMM</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Zeeshan Sarwar</surname><given-names>M.</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Alatiyyah</surname><given-names>Mohammed Hamad</given-names></name><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Jalal</surname><given-names>Ahmad</given-names></name><xref ref-type="aff" rid="aff-1">1</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Shorfuzzaman</surname><given-names>Mohammad</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Alsufyani</surname><given-names>Nawal</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-6" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Park</surname><given-names>Jeongmin</given-names></name><xref ref-type="aff" rid="aff-4">4</xref><email>jmpark@tukorea.ac.kr</email></contrib>
<aff id="aff-1"><label>1</label><institution>Department of Computer Science, Air University</institution>, <addr-line>Islamabad</addr-line>, <country>Pakistan</country></aff>
<aff id="aff-2"><label>2</label><institution>Department of Computer Science, College of Sciences and Humanities in Aflaj, Prince Sattam Bin Abdulaziz University</institution>, <addr-line>Al-Kharj</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-3"><label>3</label><institution>Department of Computer Science, College of Computers and Information Technology, Taif University</institution>, <addr-line>Taif, 21944</addr-line>, <country>Saudi Arabia</country></aff>
<aff id="aff-4"><label>4</label><institution>Department of Computer Engineering, Tech University of Korea, 237 Sangidaehak-ro</institution>, <addr-line>Siheung-si, Gyeonggi-do, 15073</addr-line>, <country>Korea</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jeongmin Park. Email: <email>jmpark@tukorea.ac.kr</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic"><year>2023</year></pub-date>
<pub-date date-type="pub" publication-format="electronic"><day>24</day><month>1</month><year>2023</year></pub-date>
<volume>75</volume>
<issue>1</issue>
<fpage>959</fpage>
<lpage>976</lpage>
<history>
<date date-type="received"><day>21</day><month>8</month><year>2022</year></date>
<date date-type="accepted"><day>04</day><month>11</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Zeeshan Sarwar et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Zeeshan Sarwar et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_35442.pdf"></self-uri>
<abstract><p>In the last decade, there has been remarkable progress in object detection and recognition owing to the high-quality color images and accompanying depth maps provided by RGB-D cameras. They enable artificially intelligent machines to detect and recognize objects easily and to make real-time decisions according to the given scenario. Depth cues can improve the quality of object detection and recognition. The main purpose of this research study is to find an optimized way of performing object detection and identification, and we propose object detection techniques evaluated on two RGB-D datasets. The proposed methodology extracts image normals from depth maps and then performs clustering using the Modified Watson Mixture Model (mWMM). mWMM is challenging to handle when the quality of the image is not good. Hence, the proposed RGB-D-based system uses depth cues for segmentation with the help of mWMM. It then extracts multiple features from the segmented images. The selected features are fed to an Artificial Neural Network (ANN) and a Convolutional Neural Network (CNN) for detecting objects. We achieved a mean accuracy of 92.13&#x0025; on the NYUv1 dataset and 90.00&#x0025; on the Redweb_v1 dataset. Finally, the results are compared, and the proposed model with CNN outperforms other state-of-the-art methods. The proposed architecture can be used in autonomous cars, traffic monitoring, and sports scenes.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Artificial intelligence</kwd>
<kwd>convolutional neural network</kwd>
<kwd>depth images</kwd>
<kwd>interactive-object detection</kwd>
<kwd>machine learning</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1"><label>1</label><title>Introduction</title>
<p>The majority of current RGB-D datasets were gathered with depth sensors such as Kinect or LiDAR. Kinect can only be used for indoor scenes, whereas LiDAR is frequently employed for outdoor scenes. Because scenes vary so widely, it is challenging to achieve good results in the wild when training on outdoor scene datasets. The availability of public datasets such as ImageNet [<xref ref-type="bibr" rid="ref-1">1</xref>], NYU v1 [<xref ref-type="bibr" rid="ref-2">2</xref>], and Streetview, as well as classification benchmark datasets such as Caltech 101 [<xref ref-type="bibr" rid="ref-3">3</xref>], has opened the path for significant advances in object detection in recent years. With RGB-D (Kinect) cameras, we have seen the beginning of a new generation of detection systems that can produce high-quality colour and depth images. These systems significantly improve the object detection capabilities of robots. Because extracted features can improve the quality of object identification, this article conducts experiments with depth images. With the easy availability of RGB-D sensors and datasets, which capture both colour and depth, it is hoped that occlusion and lighting issues can be resolved successfully. In some circumstances, thin, dark, or foggy conditions can make object identification more difficult. In [<xref ref-type="bibr" rid="ref-4">4</xref>], the authors suggested a methodology that enhanced the effectiveness of 3D feature-based object identification. Utilizing 3D features allows for high precision [<xref ref-type="bibr" rid="ref-5">5</xref>]. Furthermore, depth information in such situations is crucial for a variety of applications, including security monitoring, medicine, the military, 3D interactive games, autonomous driving, and mapping.
A model presented in [<xref ref-type="bibr" rid="ref-6">6</xref>] performs image classification using kernel features. The authors of [<xref ref-type="bibr" rid="ref-7">7</xref>] described an architecture that performs localization and object recognition on RGB-D images by subtracting the background and segmenting the subjects. These approaches enhance the quality of object identification and recognition; hence, we conduct our studies with depth images. The suggested architecture integrates a Convolutional Neural Network (CNN) and convolution filters with shape context-specific properties.</p>
<p>To develop robots capable of perceiving the environment as humans do, researchers have focused heavily on Scene Semantic Recognition (SSR), the automated analysis of object placements [<xref ref-type="bibr" rid="ref-8">8</xref>], and the structural relationships between multiple objects in scene images. Significant obstacles remain in improving the dependability of SSR in practical applications such as security navigation to automatically detect suspicious or violent situations [<xref ref-type="bibr" rid="ref-9">9</xref>], recognition of social interaction types in public settings, differentiation between diverse sports scenarios, and remote sensing.</p>
<p>One difficulty encountered in 3D image processing is the mathematically consistent merging of point clouds gathered from multiple angles. Various object detection techniques have been proposed for RGB-D images and videos. Some systems localize objects based on depth maps, for example by establishing a Conditional Random Field (CRF) model and a system to understand indoor scenes. Utilizing 3D features yields high-precision results. One method extracts a fusion of features such as depth edges, 3D shapes, and size features; using this fusion, the authors obtained considerably high performance on RGB-D images [<xref ref-type="bibr" rid="ref-10">10</xref>]. Image normal estimation is conceptually comparable to fitting a plane to a local point cloud in three-dimensional space [<xref ref-type="bibr" rid="ref-11">11</xref>]. Using a multilayer perceptron for scene understanding, [<xref ref-type="bibr" rid="ref-12">12</xref>] proposes a system based on a hybrid of Histogram of Gradients (HOG) and local geometric features for multi-object identification and recognition. Another study [<xref ref-type="bibr" rid="ref-13">13</xref>] detects circles using the Hough transform and uses basic visual descriptors such as shape, colour, and texture, together with their feature fusion, for object recognition. In a separate work, the author of [<xref ref-type="bibr" rid="ref-14">14</xref>] described an efficient shape-matching procedure that makes use of shape contexts. The similarity between the shapes of the target objects is estimated by finding a transformation between shape points.</p>
<p>An approach based on multi-object categorization is proposed to perform scene classification on a variety of benchmark datasets and to circumvent the issues inherent in scene classification. The suggested approach first preprocesses the images. In the second stage, the modified Watson Mixture Model (mWMM) technique is used to cluster the data and generate efficient segmentation results. In the third stage, multiple features are extracted, including 3D point clouds, shape features, and a bag of words. In the final stage, the features are provided to two distinct architectures, an Artificial Neural Network (ANN) and a CNN [<xref ref-type="bibr" rid="ref-15">15</xref>], in order to classify images from two difficult datasets. The following are the key contributions of this research:
<list list-type="bullet">
<list-item><p>An approach for multi-object detection and scene understanding based on the modified WMM, ANN, and CNN (VGG-16) is proposed.</p></list-item>
<list-item><p>Improved segmentation for the detection of multiple regions of different objects, using the modified WMM and 3D geometric features, is a main contribution of this work.</p></list-item>
<list-item><p>Novel 3D geometric features for scene understanding refine the scene recognition accuracy with both the ANN and CNN architectures.</p></list-item>
<list-item><p>The proposed model&#x2019;s efficiency and effectiveness are validated on two different publicly available datasets.</p></list-item>
<list-item><p>The suggested method&#x2019;s outcomes are compared with those of other state-of-the-art approaches.</p></list-item>
</list></p>
<p>The remainder of the paper is organized as follows. Section&#x00A0;2 discusses related work. Section&#x00A0;3 describes the approach and the suggested scene categorization system in depth. Section&#x00A0;4 analyses the experimental outcomes and discusses them in detail. Section&#x00A0;5 contains the paper&#x2019;s summary.</p>
</sec>
<sec id="s2"><label>2</label><title>Related Work</title>
<p>Our technique is related to a large body of work on CNNs for both fusion and object detection. In addition, we briefly review the suitability of CNNs for depth estimation. A comprehensive literature review of CNNs for these three topics is beyond the scope of this research; we therefore present a brief summary of the existing studies, with a focus on more recent publications.</p>
<sec id="s2_1"><label>2.1</label><title>CNN for Fusion</title>
<p>Reference [<xref ref-type="bibr" rid="ref-16">16</xref>] proposes a system based on convolutional and recursive neural networks for learning and classifying RGB-D image features. The recursive neural networks compose high-level features, whereas the convolutional layers extract low-level features. This is extended by [<xref ref-type="bibr" rid="ref-17">17</xref>], which recommends a semi-supervised method that uses less labelled data but achieves results comparable to the state of the art. Another publication [<xref ref-type="bibr" rid="ref-18">18</xref>] illustrates how to employ CNNs for fusing different sensor inputs, including infrared and RGB images. The results of these studies suggest that the effectiveness of person detection can be improved by combining the data from both cameras, compared with employing only one camera.</p>
<p>Early fusion [<xref ref-type="bibr" rid="ref-19">19</xref>], intermediate fusion [<xref ref-type="bibr" rid="ref-16">16</xref>,<xref ref-type="bibr" rid="ref-18">18</xref>], and late fusion [<xref ref-type="bibr" rid="ref-18">18</xref>,<xref ref-type="bibr" rid="ref-20">20</xref>] are the three basic categories of existing fusion techniques, distinguished by the level of data abstraction at which the combination takes place. In early fusion, also known as pixel-level fusion, the raw sensor readings are combined to create fused data before any information extraction technique is applied. CNN performance in feature extraction, depth estimation, and eye shadow appreciation has been significantly improved by pixel-level fusion approaches [<xref ref-type="bibr" rid="ref-21">21</xref>]. Intermediate fusion, also referred to as feature fusion, combines the features extracted from each raw data source. In late fusion, also known as decision-level fusion, detectors are applied individually to each sensor, and their results are then combined to determine the final detection. For instance, [<xref ref-type="bibr" rid="ref-22">22</xref>] devised a late fusion scheme that combines depth and RGB information to improve the performance of object recognition. This approach to information fusion uses a separate CNN for each of the two data modalities and fuses the features only after the networks have been run. Our proposed fusion architecture generates depth information from RGB rather than relying on an RGB-D sensor such as the Microsoft Kinect.</p>
</sec>
<sec id="s2_2"><label>2.2</label><title>CNN Detection</title>
<p>It has been demonstrated that CNN-based techniques perform better than classical hand-crafted methods such as HOG [<xref ref-type="bibr" rid="ref-23">23</xref>] and SIFT [<xref ref-type="bibr" rid="ref-24">24</xref>]. In [<xref ref-type="bibr" rid="ref-25">25</xref>], the use of Region-based Convolutional Neural Networks (R-CNN) significantly enhanced the accuracy of object detection. R-CNN first identifies region proposals (i.e., regions of interest that are likely to contain objects) and then uses a CNN to classify these regions as object classes or background. R-CNN has the limitation of running the CNN separately for each region proposal, which is costly in both time and energy. To improve performance and scalability, Faster R-CNN [<xref ref-type="bibr" rid="ref-26">26</xref>] omits the selective search strategy for generating region proposals.</p>
<p>The majority of CNN-based detectors are either one-stage methods (e.g., SSD [<xref ref-type="bibr" rid="ref-27">27</xref>], YOLO [<xref ref-type="bibr" rid="ref-28">28</xref>]) or two-stage methods (e.g., R-CNN [<xref ref-type="bibr" rid="ref-25">25</xref>], Fast/Faster R-CNN [<xref ref-type="bibr" rid="ref-26">26</xref>,<xref ref-type="bibr" rid="ref-29">29</xref>], R-FCN [<xref ref-type="bibr" rid="ref-30">30</xref>]). Two-stage detection methods are slower than one-stage detectors because an external module is needed to generate candidate object locations. Nevertheless, their classification performance is enhanced by the careful treatment of hard examples, that is, the instances for which the model makes poor predictions. In comparison, one-stage detection methods produce a dense sample of possible object locations more quickly and directly by omitting the second per-region classification step and merely predicting anchor boxes and their associated classes. None of these methods considered combining multiple sources of data to improve detection and recognition.</p>
</sec>
<sec id="s2_3"><label>2.3</label><title>CNN for Depth Estimation</title>
<p>Because most images include numerous objects, extensive textural variations, and intricate geometric elements, depth estimation is a significant obstacle in image interpretation. Diverse depth estimation strategies employing supervised [<xref ref-type="bibr" rid="ref-31">31</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>] and unsupervised [<xref ref-type="bibr" rid="ref-33">33</xref>&#x2013;<xref ref-type="bibr" rid="ref-35">35</xref>] learning methods have been developed to address this problem. Recent supervised learning techniques [<xref ref-type="bibr" rid="ref-31">31</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>] use CNNs for 3D reconstruction to avoid hand-crafted features and computationally demanding test-time optimization. DeMoN [<xref ref-type="bibr" rid="ref-31">31</xref>] also addresses depth learning from stereo data. Using two unconstrained frames, the method produces reliable depth predictions. This approach utilises several forms of supervision, including depth and optical flow.</p>
<p>CNNs show promising effectiveness for this task, but supervised techniques require costly and time-consuming datasets with extensive labelling. To overcome this issue, various studies apply a self-supervised learning approach to estimate depth maps from unlabeled video sequences. Self-supervised learning approaches overcome the difficulty of background subtraction by training a network to predict the appearance of a target image from the viewpoint of another image. DF-Net [<xref ref-type="bibr" rid="ref-33">33</xref>] is an unsupervised learning approach for simultaneously training depth estimation and optical flow estimation [<xref ref-type="bibr" rid="ref-34">34</xref>]. It generates 2D optical flow by backprojecting the 3D scene flow produced from the predicted scene depth and camera motion. SfMLearner [<xref ref-type="bibr" rid="ref-35">35</xref>] is another learning approach of this kind that uses only monocular video sequences for training. It learns depth by utilising the geometric relationship between depth and camera pose.</p>
</sec>
</sec>
<sec id="s3"><label>3</label><title>Proposed Methodology</title>
<p>In this section, the suggested architecture for object detection is discussed. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> gives a general overview of the proposed architecture. Image normals are acquired from the depth image, and clustering is then performed using the Watson mixture model to assist the segmentation phase. After that, scale-dependent 3D geometric features are computed along with the color features that will be used to detect objects in the later phase of the architecture. Finally, ANN and CNN are applied for object detection.</p>
<fig id="fig-1"><label>Figure 1</label><caption><title>Block diagram of the proposed system for multiple objects recognition</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-1.tif"/></fig>
<sec id="s3_1"><label>3.1</label><title>Data Acquisition and Pre-Processing</title>
<p>A number of researchers have used surface normals in their recent work. In depth images, image normals are unit vectors with 3D properties that describe the orientation at each pixel. The most common method used to compute normals is the plane fitting method [<xref ref-type="bibr" rid="ref-7">7</xref>]. In [<xref ref-type="bibr" rid="ref-36">36</xref>], a method was proposed that uses repeating shape patterns and appearance primitives in indoor RGB-D image data and compares those primitives with new images to obtain a normal map from depth images. Acquiring image normals is usually considered a computationally expensive process because of 3D point cloud fitting. Since normal space selection is helpful for point cloud and image normal registration, different approaches for estimating surface normals from a 3D point cloud have been proposed in the literature. In the proposed approach, depth maps are used to perform clustering using the image normal technique. To carry a depth image into the normal clustering space, a 3D point (<italic>J</italic>, <italic>K</italic>, <italic>L</italic>) in the camera coordinate system is projected onto a pixel (<italic>x</italic>, <italic>y</italic>) as
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mfrac><mml:mi>J</mml:mi><mml:mi>L</mml:mi></mml:mfrac><mml:mo>+</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mfrac><mml:mi>K</mml:mi><mml:mi>L</mml:mi></mml:mfrac><mml:mo>+</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <italic>f</italic> is the focal length and (<italic>C<sub>j</sub></italic>, <italic>C<sub>k</sub></italic>) is the optical centre of the depth camera; both are obtained during the camera calibration procedure. A 3D point (<italic>J</italic>, <italic>K</italic>, <italic>L</italic>) is then parameterized as a function of the pixel (<italic>x</italic>, <italic>y</italic>):
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>J</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>f</mml:mi></mml:mfrac><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula>
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>K</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>y</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mi>f</mml:mi></mml:mfrac><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>With a depth sensor, <italic>L</italic> (<italic>x</italic>, <italic>y</italic>) can be obtained as the pixel value of the depth image. Note that <italic>L</italic> and <italic>f</italic> are calibrated in millimeters beforehand.</p>
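<p>The mapping in Eqs. (1)&#x2013;(4) can be illustrated with a minimal sketch, reading Eq. (1) as x = f J / L + C_j and Eq. (4) with y in the numerator. The function names, the NumPy array layout (rows indexed by y, columns by x), and the sample calibration values are assumptions made for illustration and are not taken from the paper.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def backproject_depth(depth, f, cx, cy):
    """Map every pixel (x, y) of a depth image to a 3D point (J, K, L).

    Follows Eqs. (3) and (4): J = (x - C_j) / f * L and K = (y - C_k) / f * L.
    depth: 2D array holding L(x, y); f: focal length; (cx, cy): optical centre.
    """
    depth = np.asarray(depth, dtype=float)
    h, w = depth.shape
    # x indexes the columns and y indexes the rows of the depth image.
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    J = (x - cx) / f * depth
    K = (y - cy) / f * depth
    return np.stack([J, K, depth], axis=-1)  # shape (h, w, 3)


def project_point(J, K, L, f, cx, cy):
    """Inverse mapping of Eqs. (1) and (2): 3D point to pixel coordinates."""
    return f * J / L + cx, f * K / L + cy


# Hypothetical calibration values, chosen only to exercise the functions.
points = backproject_depth(np.full((4, 6), 1000.0), f=500.0, cx=3.0, cy=2.0)
```
]]></code>
<p>Projecting any back-projected point with project_point returns the pixel it came from, which is a quick consistency check between the two pairs of equations.</p>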
<p>The diagram presented in <xref ref-type="fig" rid="fig-2">Fig. 2</xref> provides a visual illustration of this concept. Furthermore, the gradients of the depth image are obtained, and a normal vector is generated for each pixel based on these differences. Normal averaging is then applied to the normal map derived from the depth image.</p>
<fig id="fig-2"><label>Figure 2</label><caption><title>Image normal extraction through depth image</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-2.tif"/></fig>
</sec>
<sec id="s3_2"><label>3.2</label><title>Data Segmentation</title>
<p>During data segmentation, the surface normals of the depth images are computed. A normal is produced for each pixel by choosing neighboring pixels within a depth threshold and estimating a least-squares plane. Then, the modified WMM is used to cluster the normals. Each cluster in the output is a collection of pixels from the same region. <xref ref-type="fig" rid="fig-3">Figs. 3a</xref>&#x2013;<xref ref-type="fig" rid="fig-3">3d</xref> show the depth images alongside the image normals acquired from them.</p>
<fig id="fig-3"><label>Figure 3</label><caption><title>Segmentation approach. (a) Depth map from NYU V1; (b) acquired image normal against NYU V1 depth map; (c) depth map from RedWeb V1; (d) acquired image normal RedWeb V1 depth&#x00A0;map</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-3.tif"/></fig>
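<p>The per-pixel least-squares plane fit described in this section can be sketched as follows. This is an illustrative reading, not the authors&#x2019; implementation: the window radius, the depth tolerance, the sign convention for the normals, and the function names are all assumptions, and the input is assumed to be an (h, w, 3) map of 3D points.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def plane_normal(points):
    """Least-squares plane normal of an (n, 3) array of 3D points."""
    centred = points - points.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, i.e. the normal of the best-fit plane.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]
    # n and -n describe the same plane; as an assumed convention, flip the
    # sign so that the depth component of the normal is never positive.
    return -normal if normal[2] > 0 else normal


def normal_map(points, radius=2, depth_tol=50.0):
    """Per-pixel normals from an (h, w, 3) map of 3D points.

    radius and depth_tol (same unit as the depth values) are illustrative
    defaults, not values taken from the paper.
    """
    h, w, _ = points.shape
    normals = np.zeros((h, w, 3), dtype=float)
    for r in range(h):
        for c in range(w):
            window = points[max(r - radius, 0):r + radius + 1,
                            max(c - radius, 0):c + radius + 1].reshape(-1, 3)
            # Keep only neighbours whose depth is close to the centre pixel.
            near = window[np.abs(window[:, 2] - points[r, c, 2]) <= depth_tol]
            if len(near) >= 3:
                normals[r, c] = plane_normal(near)
    return normals
```
]]></code>
<p>The explicit double loop keeps the sketch readable; a practical implementation would vectorize the neighbourhood gathering or work from depth gradients instead.</p>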
</sec>
<sec id="s3_3"><label>3.3</label><title>Modified Watson Mixture Model (mWMM)</title>
<p>The Modified Watson Mixture Model is a generative model, which assumes that the data samples are drawn from a mixture of multivariate Watson distributions (mWDs) [<xref ref-type="bibr" rid="ref-37">37</xref>]. Usually, the Watson distribution is used for modeling data. In the proposed system, however, it is used to achieve segmentation and to assist the multi-object detection phase. In [<xref ref-type="bibr" rid="ref-18">18</xref>], researchers proposed an architecture that begins with the Bregman Soft Clustering technique for the mWMM.</p>
<p>After soft clustering, a set of mWMMs is produced using hierarchical agglomerative clustering. Finally, a model selection method is applied to select the optimal mWMM. In the proposed model, more than one mWMM is generated for a single depth map, and the mWMM that gives the best clusters is picked. The proposed technique retains the distributional information acquired from the depth gradients and needs no iterative numerical computation during the optimization procedure. Moreover, the approach provides a lower bound on the marginal likelihood. The results are shown in <xref ref-type="fig" rid="fig-4">Figs. 4a</xref> and <xref ref-type="fig" rid="fig-4">4b</xref>. The first two images show the RGB representations of the depth images to which the Watson distribution was applied. The next two images show the results of the segmentation process.</p>
<fig id="fig-4"><label>Figure 4</label><caption><title>Multiple objects segmentation using mWMM based on depth information (a) represents the RGB object image (b) represents selected mWMM segmentation</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-4.tif"/></fig>
<p>The segmentation step starts with the initialization of the variational parameters. The optimization of the variational posterior distribution alternates between two steps. First, the current distributions over the model parameters are used to evaluate the responsibilities. Next, these responsibilities are used to re-estimate the posterior distribution over the parameters. The process is guaranteed to converge because the lower bound never decreases from one iteration to the next. A summary of the process is presented in Algorithm 1.</p>
<fig id="fig-10">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-10.tif"/>
</fig>
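<p>The alternation between evaluating responsibilities and re-estimating parameters can be illustrated with a deliberately simplified sketch. It is plain expectation-maximization for a Watson mixture with a single fixed concentration kappa shared by all components, not the variational mWMM procedure of Algorithm 1. The function name, the default kappa, and the farthest-axis initialization are assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
import numpy as np


def fit_watson_mixture(normals, n_clusters, kappa=20.0, n_iter=50):
    """Simplified EM for a Watson mixture with a shared, fixed kappa > 0.

    normals: (n, 3) array of unit vectors. Returns (mu, pi, resp), where
    mu holds one mean axis per cluster and resp the soft assignments.
    """
    # Deterministic initialisation: start from the first sample, then keep
    # adding the sample that is least aligned with the axes chosen so far.
    mu = [normals[0]]
    for _ in range(1, n_clusters):
        align = np.max((normals @ np.array(mu).T) ** 2, axis=1)
        mu.append(normals[np.argmin(align)])
    mu = np.array(mu, dtype=float)
    pi = np.full(n_clusters, 1.0 / n_clusters)
    for _ in range(n_iter):
        # E-step: responsibilities. With a shared kappa the Watson
        # normalising constant is the same for every component and cancels.
        log_r = np.log(pi + 1e-12) + kappa * (normals @ mu.T) ** 2
        log_r -= log_r.max(axis=1, keepdims=True)
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mixing weights and mean axes.
        pi = resp.mean(axis=0)
        for k in range(n_clusters):
            scatter = (normals * resp[:, k, None]).T @ normals
            # For kappa > 0 the optimal axis is the dominant eigenvector
            # of the responsibility-weighted scatter matrix.
            _, vecs = np.linalg.eigh(scatter)
            mu[k] = vecs[:, -1]
    return mu, pi, resp
```
]]></code>
<p>Because the density depends on the squared inner product, a normal and its negation receive the same responsibilities, which is why the Watson distribution suits surface normals whose sign is ambiguous. Taking the arg-max of resp along the cluster axis turns the soft assignment into a segmentation label per pixel.</p>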
</sec>
<sec id="s3_4"><label>3.4</label><title>Features Descriptors</title>
<p>There are various feature extraction techniques, including spatiotemporal motion variation features [<xref ref-type="bibr" rid="ref-38">38</xref>], the hybrid features approach [<xref ref-type="bibr" rid="ref-39">39</xref>], ECG and GMM features [<xref ref-type="bibr" rid="ref-40">40</xref>], body joint features [<xref ref-type="bibr" rid="ref-41">41</xref>], and Ridge body parts features [<xref ref-type="bibr" rid="ref-42">42</xref>]. We have used three different feature descriptors: geometric features, shape features, and bag-of-words features.</p>
<sec id="s3_4_1"><label>3.4.1</label><title>Geometric Features</title>
<p>Some geometric features that characterize a given 3D mesh model lie on the model&#x2019;s surface. 3D geometric information plays a significant role in many problems related to computer vision applications.</p>
<p>However, their scale-dependent nature, that is, the relative variation in the spatial extents of local geometric structures, must be taken into account. For this reason, a scale-space representation is constructed that faithfully encodes the scale variability of the surface geometry. The given geometry is represented by its surface normals, and a dense and regular 2D representation of it is computed by unwrapping the surface onto a 2D plane. A scale-space of this surface normal field is then built by deriving and applying a scale-space operator that correctly accounts for the geodesic distances on the surface. Given a 3D mesh model, the 2D representation of the 3D geometry is obtained by first unwrapping the surface of the model onto a 2D plane. A rich set of scale-dependent features can be extracted from the resulting normal scale-space representation.</p>
<p>Geometric edges and corners are extracted at various scales. To detect them, the first- and second-order derivatives of the depth map are computed. The outcome is a set of scale-dependent 3D geometric features that provide a rich and unique basis for the representation of the 3D geometry. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> shows some visual examples of geometric corners acquired from the depth properties of the given images.
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo>|</mml:mo><mml:mi>y</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mfrac><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="bold">v</mml:mi></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mi mathvariant="normal">y</mml:mi></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mfrac><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mi>v</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x2205;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>i</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mi>&#x2205;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mi>v</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
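<p>As a rough illustration of how a quantity of the form of Eq. (5) can be evaluated, the following sketch averages, over a neighbourhood, the ratio between the distance of two points and the distance of their mapped counterparts. The symbols are not defined in the text, so the reading used here is an assumption: <italic>y</italic>(<italic>i</italic>) is taken to be the set of neighbours of point <italic>i</italic>, and &#x2205; a mapping between the 2D plane and the 3D surface. The function name, the mapping <monospace>phi</monospace>, and the sample points are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def mean_distance_ratio(i, neighbours, phi):
    """Average of ||i - v|| / ||phi(i) - phi(v)|| over all v in `neighbours`."""
    if not neighbours:
        raise ValueError("the neighbourhood y(i) must not be empty")
    total = 0.0
    for v in neighbours:
        # Assumes phi maps distinct points to distinct points (no zero division).
        total += math.dist(i, v) / math.dist(phi(i), phi(v))
    return total / len(neighbours)


def phi(p):
    """Hypothetical mapping: place a 2D point on the plane z = 0, scaled by 2."""
    return (2.0 * p[0], 2.0 * p[1], 0.0)


point = (0.0, 0.0)
ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
epsilon = mean_distance_ratio(point, ring, phi)
print(epsilon)
```
]]></code>
<p>With this toy mapping every mapped distance is twice the planar distance, so each ratio, and therefore the average, is one half.</p>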
</sec>
<sec id="s3_4_2"><label>3.4.2</label><title>Shape Features</title>
<p>In shape feature extraction, specific geometric forms, such as cylindrical and rectangular shapes, are used to extract significant features from an image. In the proposed approach, shape features computed from the depth values of the object shapes are used. From depth information, a shape can be estimated by fitting a quantized model based on boundaries, shape priors, and detected spatial properties. A contour-based approach often has limited applicability to specific shapes, such as generalized cylinders. In some scenarios, objects in a cluttered environment are reduced to region-of-interest (ROI) parts. In this section, a straight-line strategy is applied to detect the shape of the regions of interest. First, the region contour is extracted with a boundary extraction technique. The distance between two contour points is then computed as:</p><p>
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>D</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>1</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mn>2</mml:mn><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msqrt></mml:math></disp-formula>where <bold><italic>w<sub>D</sub></italic></bold> represents the distance and (<italic>x<sub>1i</sub>, y<sub>1j</sub></italic>) and (<italic>x<sub>2i</sub>, y<sub>2j</sub></italic>) are the <italic>x</italic>, <italic>y</italic> coordinates of the first and second points, respectively. Some visual examples computed over gradient properties are shown in <xref ref-type="fig" rid="fig-5">Figs. 5a</xref> and <xref ref-type="fig" rid="fig-5">5b</xref>.</p>
<fig id="fig-5"><label>Figure 5</label><caption><title>Shape features extraction over NYU V1 Dataset (a) RGB image and (b) shape features extracted</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-5.tif"/></fig>
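<p>The point distance of Eq. (6) can be used to test whether a run of contour points is close to a straight line. The sketch below is one possible reading of the straight-line strategy, not necessarily the exact procedure used here: it compares the length of the path along the contour with the length of the chord between its end points. The function names, the tolerance value, and the sample contours are assumptions made for illustration.</p>
<code language="python"><![CDATA[
```python
import math


def point_distance(p, q):
    """Euclidean distance between two 2D points, as in Eq. (6)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def is_straight(contour, tolerance=0.02):
    """True when the path along the contour is at most `tolerance`
    (relative) longer than the chord joining its first and last point."""
    path = sum(point_distance(a, b) for a, b in zip(contour, contour[1:]))
    chord = point_distance(contour[0], contour[-1])
    if path == 0.0:
        return True  # a single repeated point is trivially straight
    return (path - chord) / path <= tolerance


edge = [(0, 0), (1, 1), (2, 2), (3, 3)]   # points on one line
corner = [(0, 0), (2, 0), (2, 2)]         # an L-shaped run of points
print(is_straight(edge), is_straight(corner))
```
]]></code>
<p>Contour runs that pass this test can be treated as straight segments; runs that fail it contain a corner or a curve and would need to be split further.</p>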
</sec>
<sec id="s3_4_3"><label>3.4.3</label><title>Bag of Words Features</title>
<p>A bag-of-words representation of the image features is also used for better scene categorization. Since this algorithm does not account for spatial relationships between the features, it may misclassify some scenes. The procedure is as follows:
<list list-type="bullet">
<list-item><p>First, features are extracted from the training set. These features are used to build a vocabulary that supports image classification.</p></list-item>
<list-item><p>Next, all the features found in the images are clustered, and the cluster centers are used as the words of the feature vocabulary.</p></list-item>
<list-item><p>Each feature extracted from an image is then assigned to the closest word in the vocabulary. In this way, a bag-of-words representation is built.</p></list-item>
<list-item><p>For each bag of words, a histogram is constructed.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mi>S</mml:mi><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>&#x2208;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mi>x</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>z</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula>
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msup><mml:mi>A</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>B</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mfrac><mml:mn>1</mml:mn><mml:mi>M</mml:mi></mml:mfrac><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>F</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>g</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mi>B</mml:mi><mml:mrow><mml:mi>F</mml:mi><mml:mo>,</mml:mo><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>A</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></disp-formula></p></list-item>
</list></p>
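<p>The steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the clustering is plain k-means written with the standard library, the descriptors are toy 2D points, and all function names are hypothetical. The last helper computes a within-cluster sum of squared distances, which is how Eq. (7) is read here.</p>
<code language="python"><![CDATA[
```python
import math
import random


def nearest(feature, vocabulary):
    """Index of the vocabulary word closest to `feature`."""
    return min(range(len(vocabulary)),
               key=lambda j: math.dist(feature, vocabulary[j]))


def build_vocabulary(features, k, iterations=20, seed=0):
    """Cluster the features with plain k-means; the centres are the words."""
    rng = random.Random(seed)
    vocabulary = rng.sample(features, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest(f, vocabulary)].append(f)
        for j, group in enumerate(groups):
            if group:  # keep the previous centre if a cluster is empty
                vocabulary[j] = tuple(sum(c) / len(group) for c in zip(*group))
    return vocabulary


def bow_histogram(image_features, vocabulary):
    """Normalised count of how often each word is the nearest one."""
    counts = [0] * len(vocabulary)
    for f in image_features:
        counts[nearest(f, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def within_cluster_sd(features, vocabulary):
    """Sum of squared distances from each feature to its nearest centre."""
    return sum(math.dist(f, vocabulary[nearest(f, vocabulary)]) ** 2
               for f in features)


# Toy 2D descriptors forming two well-separated groups (illustrative only).
train = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
vocab = build_vocabulary(train, k=2)
hist = bow_histogram([(0.05, 0.05), (4.9, 5.0), (5.2, 5.1)], vocab)
print(vocab, hist, within_cluster_sd(train, vocab))
```
]]></code>
<p>The histogram, rather than the raw descriptors, is what a classifier receives, which is why the representation discards the spatial arrangement of the features.</p>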
</sec>
</sec>
<sec id="s3_5"><label>3.5</label><title>Objects Classification</title>
<p>Two methods are used for multi-object recognition and classification: an artificial neural network (ANN) and a convolutional neural network (CNN). Both are robust in multi-object and scene classification problems.</p>
<sec id="s3_5_1"><label>3.5.1</label><title>Artificial Neural Network (ANN)</title>
<p>An ANN is a computational model used for non-linear statistical data modelling. It is inspired by the nervous system and mimics the learning process of the biological brain. An ANN discovers correlations in the data, or input-output relationships, using artificial neurons. <xref ref-type="fig" rid="fig-6">Fig. 6</xref> depicts an ANN with an input layer, an output layer, and one or more hidden layers. The hidden layers apply activation functions to transform the input. Numerous methods may be used to build an ANN; in the proposed method, a feed-forward multi-layer perceptron (MLP) is employed to recognize different object classes.</p>
<fig id="fig-6"><label>Figure 6</label><caption><title>Block diagram of the artificial neural network for multiple object recognition</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-6.tif"/></fig>
<fig id="fig-11">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-11.tif"/>
</fig>
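<p>To make the layer structure concrete, the following sketch runs one forward pass through a very small multi-layer perceptron. The layer sizes, the weights, and the choice of ReLU and softmax activations are assumptions made for illustration; the network trained in this work is not specified at this level of detail.</p>
<code language="python"><![CDATA[
```python
import math


def relu(values):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to one."""
    shift = max(values)  # subtracting the maximum avoids overflow
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the weights of neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(features, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer (ReLU) -> output layer (softmax)."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical weights for a 3-input, 2-hidden-unit, 2-class network.
hidden_w = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 0.0], [0.0, 1.0]]
out_b = [0.0, 0.0]

probs = mlp_forward([2.0, 0.5, 1.0], hidden_w, hidden_b, out_w, out_b)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```
]]></code>
<p>In practice the input vector would be the concatenated feature descriptors of a segmented object, and the weights would be learned by backpropagation rather than set by hand.</p>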
</sec>
<sec id="s3_5_2"><label>3.5.2</label><title>Convolutional Neural Network (CNN)</title>
<p>A pre-trained VGG-16 CNN model is also used for object classification in the proposed system. <xref ref-type="fig" rid="fig-7">Fig. 7</xref> shows the input to the first convolutional layer of VGG-16. VGG-16 comprises 13 convolutional layers and three fully connected layers, 16 weight layers in total; this depth leads the model toward high classification accuracy. The proposed approach uses the VGG-16 model only for classification, with the help of our pre-acquired features: bag-of-words, 3D geometric, and shape features.</p>
<fig id="fig-7"><label>Figure 7</label><caption><title>Sequence of steps for multiple object recognition on an image from the NYU V1 dataset using the convolutional neural network</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-7.tif"/></fig>
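<p>The layer count of VGG-16 can be checked from its standard published layout. The sketch below only counts layers and tracks the spatial size of the feature map; it does not build or run a network. It assumes the usual 224 by 224 input, 3 by 3 convolutions that preserve the spatial size, and 2 by 2 max pooling with stride 2. The variable and function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
# Standard VGG-16 layout: integers are the output channels of a 3x3
# convolution, "M" marks a 2x2 max-pooling step with stride 2.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
# Fully connected layer sizes; the last one is normally replaced by the
# number of target classes when the model is reused for another task.
VGG16_FC = [4096, 4096, 1000]


def summarize(conv_cfg, fc_cfg, input_size=224):
    """Count the weight layers and the size of the flattened feature map."""
    conv_layers = sum(1 for item in conv_cfg if item != "M")
    size = input_size
    for item in conv_cfg:
        if item == "M":
            size //= 2  # pooling halves each spatial dimension
    channels = [item for item in conv_cfg if item != "M"][-1]
    return {
        "conv_layers": conv_layers,
        "fc_layers": len(fc_cfg),
        "weight_layers": conv_layers + len(fc_cfg),
        "flattened_features": channels * size * size,
    }


summary = summarize(VGG16_CONV, VGG16_FC)
print(summary)
```
]]></code>
<p>The 16 in the model name counts the convolutional and fully connected layers together; the pooling layers carry no weights and are not counted.</p>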
</sec>
</sec>
</sec>
<sec id="s4"><label>4</label><title>Experimental Setup and Results</title>
<p>This section describes the experimental setup and the evaluation procedure in detail.</p>
<sec id="s4_1"><label>4.1</label><title>Dataset Descriptions</title>
<p>Two datasets are used to test the proposed methodology. They comprise various scenes with multiple objects and various classes. Each dataset is described below.</p>
<sec id="s4_1_1"><label>4.1.1</label><title>ReDWeb Dataset</title>
<p>ReDWeb V1 is a comprehensive dataset comprising a variety of images and their corresponding depth maps. The ReDWeb V1 [<xref ref-type="bibr" rid="ref-2">2</xref>] dataset includes 3,600 RGB-D images of residential and commercial settings. The collection includes synchronised RGB-D frames from a Kinect v2 and a ZED stereo camera. Disparity maps are built with a stereo matching approach and then converted using the calibration parameters. A per-pixel confidence map of the disparity is also included. The images were captured in several locations, such as workplaces, rooms, dormitories, exhibition centres, streets, and roads. The collection includes 200 distinct scenes with varied objects. We selected 500 sharp images: 450 are used for training and the remaining 50 for testing. We evaluated the reliability of our evaluation system on NYU V1 and DIML. <xref ref-type="fig" rid="fig-8">Fig. 8</xref> shows sample images from ReDWeb V1.</p>
<fig id="fig-8"><label>Figure 8</label><caption><title>A set of sample images from the ReDWeb V1 dataset with corresponding depth images</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-8.tif"/></fig>
</sec>
<sec id="s4_1_2"><label>4.1.2</label><title>NYU V1 Dataset</title>
<p>The NYU V1 [<xref ref-type="bibr" rid="ref-4">4</xref>] dataset comprises image sequences captured by the depth and RGB cameras of the motion capture system in a range of indoor environments. It consists of seven distinct scene types covering 64 distinct indoor scenes. The collection contains many views of the same and of different objects as RGB and depth images [<xref ref-type="bibr" rid="ref-4">4</xref>], with a variety of everyday objects seen from various perspectives and over a thousand distinct classes. To assess performance on the object recognition and scene classification tasks, the geometric features, 3D point sets, and remaining features were used to train Random Forest (RF) and neural-network-based classifiers. RF was trained on the selected features, and the ANN was trained on a combination of both feature sets. The experiments used a random split of 65&#x0025; for training and 35&#x0025; for testing. Selected RGB and depth images from the NYU V1 dataset are shown in <xref ref-type="fig" rid="fig-9">Fig. 9</xref>.</p>
<fig id="fig-9"><label>Figure 9</label><caption><title>A set of sample images from the NYU V1 dataset with corresponding depth images</title></caption><graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_35442-fig-9.tif"/></fig>
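<p>A random 65/35 split of the kind described above can be produced as follows. This is a minimal sketch: the number of images, the integer identifiers, and the fixed seed are placeholders chosen for illustration, and the function name is hypothetical.</p>
<code language="python"><![CDATA[
```python
import random


def split_dataset(items, train_fraction, seed=0):
    """Shuffle a copy of `items` and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


image_ids = list(range(200))  # placeholder identifiers, not the real image count
train_ids, test_ids = split_dataset(image_ids, 0.65)
print(len(train_ids), len(test_ids))
```
]]></code>
<p>Fixing the seed makes the split reproducible, so that all classifiers are trained and tested on exactly the same images.</p>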
</sec>
</sec>
<sec id="s4_2"><label>4.2</label><title>Experimental Results</title>
<p>This section presents the experimental results in detail. Classification accuracy and the comparison with established state-of-the-art approaches were evaluated on all indoor images. Owing to the strong object segmentation (mWMM), which improves object identification with both the ANN and CNN architectures, the proposed system produced consistent results.</p>
<sec id="s4_2_1"><label>4.2.1</label><title>Experiment 1: Using the ReDWeb V1 Dataset</title>
<p>The proposed system was applied to the ReDWeb V1 dataset to measure scene classification accuracy. <xref ref-type="table" rid="table-1">Tables 1</xref> and <xref ref-type="table" rid="table-2">2</xref> show that the major object classes of this dataset are classified with high accuracy.</p>
<table-wrap id="table-1"><label>Table 1</label><caption><title>Confusion matrix of individual object class accuracies over the ReDWeb V1 dataset using ANN</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Objects</th>
<th align="left">WH</th>
<th align="left">CF</th>
<th align="left">CR</th>
<th align="left">CHR</th>
<th align="left">CMP</th>
<th align="left">MD</th>
<th align="left">LIB</th>
<th align="left">LAB</th>
<th align="left">BS</th>
<th align="left">COR</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">WH</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.01</td>
<td align="left"><bold>0.89</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CR</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.88</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CHR</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CMP</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.00</td>
<td align="left"><bold>0.89</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">MD</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left"><bold>0.86</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">LIB</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left"><bold>0.84</bold></td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">LAB</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.02</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">BS</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left"><bold>0.85</bold></td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">COR</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.87</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn1_1"><p>Note: WH &#x003D; warehouse, CR &#x003D; computer room, CHR &#x003D; chair, CMP &#x003D; computer, MD &#x003D; mobile device, LAB &#x003D; laboratory, COR &#x003D; corridor.</p></fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-2"><label>Table 2</label><caption><title>Confusion matrix of individual object class accuracies over the ReDWeb V1 dataset using CNN</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Objects</th>
<th align="left">WH</th>
<th align="left">CF</th>
<th align="left">CR</th>
<th align="left">CHR</th>
<th align="left">CMP</th>
<th align="left">MD</th>
<th align="left">LIB</th>
<th align="left">LAB</th>
<th align="left">BS</th>
<th align="left">COR</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">WH</td>
<td align="left"><bold>0.91</bold></td>
<td align="left">0.00</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.01</td>
<td align="left"><bold>0.90</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CR</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.89</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">CHR</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left"><bold>0.92</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CMP</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left"><bold>0.88</bold></td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">MD</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left"><bold>0.92</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">LIB</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left"><bold>0.90</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">LAB</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">BS</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.90</bold></td>
<td align="left">0.00</td>
</tr>
<tr>
<td align="left">COR</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left"><bold>0.88</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn2_1"><p>Note: WH &#x003D; warehouse, CR &#x003D; computer room, CHR &#x003D; chair, CMP &#x003D; computer, MD &#x003D; mobile device, LAB &#x003D; laboratory, COR &#x003D; corridor.</p></fn>
</table-wrap-foot>
</table-wrap>
</sec>
<sec id="s4_2_2"><label>4.2.2</label><title>Experiment 2: Using the NYU V1 Dataset</title>
<p>In the experiments with the NYU V1 dataset, the proposed system achieved a classification accuracy of 89.1&#x0025; using CNN and 78&#x0025; using ANN. The ANN confusion matrix is given in <xref ref-type="table" rid="table-3">Table 3</xref> and the CNN confusion matrix in <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
<table-wrap id="table-3"><label>Table 3</label><caption><title>Confusion matrix of individual object class accuracies over the NYU V1 dataset using ANN</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Objects</th>
<th align="left">BR</th>
<th align="left">BD</th>
<th align="left">BS</th>
<th align="left">CF</th>
<th align="left">KIT</th>
<th align="left">LR</th>
<th align="left">OFF</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">BR</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.02</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
</tr>
<tr>
<td align="left">BD</td>
<td align="left">0.03</td>
<td align="left"><bold>0.85</bold></td>
<td align="left">0.02</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
</tr>
<tr>
<td align="left">BS</td>
<td align="left">0.02</td>
<td align="left">0.02</td>
<td align="left"><bold>0.88</bold></td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.03</td>
<td align="left">0.04</td>
<td align="left">0.03</td>
<td align="left"><bold>0.84</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
</tr>
<tr>
<td align="left">KIT</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left"><bold>0.87</bold></td>
<td align="left">0.03</td>
<td align="left">0.03</td>
</tr>
<tr>
<td align="left">LR</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.03</td>
<td align="left"><bold>0.85</bold></td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">OFF</td>
<td align="left">0.04</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.02</td>
<td align="left"><bold>0.86</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn3_1"><p>Note: BR &#x003D; bar room, BD &#x003D; bedroom, BS &#x003D; bicycle, KIT &#x003D; kitchen, LR &#x003D; laboratory, OFF &#x003D; office.</p></fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-4"><label>Table 4</label><caption><title>Confusion matrix of individual object class accuracies over the NYU V1 dataset using CNN</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Objects</th>
<th align="left">BR</th>
<th align="left">BD</th>
<th align="left">BS</th>
<th align="left">CF</th>
<th align="left">KIT</th>
<th align="left">LR</th>
<th align="left">OFF</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">BR</td>
<td align="left"><bold>0.91</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">BD</td>
<td align="left">0.01</td>
<td align="left"><bold>0.92</bold></td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">BS</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.92</bold></td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.92</bold></td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
</tr>
<tr>
<td align="left">KIT</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.01</td>
<td align="left"><bold>0.90</bold></td>
<td align="left">0.03</td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">LR</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left"><bold>0.89</bold></td>
<td align="left">0.01</td>
</tr>
<tr>
<td align="left">OFF</td>
<td align="left">0.01</td>
<td align="left">0.02</td>
<td align="left">0.03</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.02</td>
<td align="left"><bold>0.91</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn4_1"><p>Note: BR &#x003D; bar room, BD &#x003D; bedroom, BS &#x003D; bicycle, KIT &#x003D; kitchen, LR &#x003D; laboratory, OFF &#x003D; office.</p></fn>
</table-wrap-foot>
</table-wrap>
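<p>A per-class mean accuracy can be read off a row-normalised confusion matrix as the average of its diagonal. The sketch below does this for the values printed in Table 3; the function names are hypothetical. A per-class average weights every class equally, so it can differ from an accuracy computed over all test images.</p>
<code language="python"><![CDATA[
```python
LABELS = ["BR", "BD", "BS", "CF", "KIT", "LR", "OFF"]

# Rows are true classes, columns are predicted classes (values from Table 3).
TABLE_3 = [
    [0.87, 0.02, 0.02, 0.01, 0.03, 0.02, 0.03],
    [0.03, 0.85, 0.02, 0.02, 0.03, 0.02, 0.03],
    [0.02, 0.02, 0.88, 0.03, 0.02, 0.01, 0.02],
    [0.03, 0.04, 0.03, 0.84, 0.01, 0.02, 0.03],
    [0.03, 0.01, 0.01, 0.02, 0.87, 0.03, 0.03],
    [0.03, 0.02, 0.02, 0.03, 0.03, 0.85, 0.02],
    [0.04, 0.02, 0.03, 0.01, 0.02, 0.02, 0.86],
]


def per_class_accuracy(matrix, labels):
    """Diagonal entry of each row, keyed by class label."""
    return {label: matrix[i][i] for i, label in enumerate(labels)}


def mean_accuracy(matrix):
    """Average of the diagonal of a row-normalised confusion matrix."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


def rows_are_normalised(matrix, tolerance=1e-9):
    """Check that every row sums to one, as a sanity test on the table."""
    return all(abs(sum(row) - 1.0) <= tolerance for row in matrix)


accuracies = per_class_accuracy(TABLE_3, LABELS)
mean_acc = mean_accuracy(TABLE_3)
print(accuracies, mean_acc, rows_are_normalised(TABLE_3))
```
]]></code>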
<p><xref ref-type="table" rid="table-5">Table 5</xref> compares the proposed model with several state-of-the-art methods on both the NYU V1 and ReDWeb V1 datasets.</p>
<table-wrap id="table-5"><label>Table 5</label><caption><title>Comparison with state-of-the-art techniques on the NYU V1 and ReDWeb V1 datasets</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="2">Methods</th>
<th align="left">NYU V1</th>
<th align="left">Redweb V1</th>
</tr>
<tr>
<th align="left">Mean accuracy (&#x0025;)</th>
<th align="left">Mean accuracy (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">K. Chen&#x00A0;[<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td align="left">&#x2013;</td>
<td align="left">60.5</td>
</tr>
<tr>
<td align="left">S. Gupta&#x00A0;[<xref ref-type="bibr" rid="ref-5">5</xref>]</td>
<td align="left">&#x2013;</td>
<td align="left">65.0</td>
</tr>
<tr>
<td align="left">A. Zeng&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-6">6</xref>]</td>
<td align="left">&#x2013;</td>
<td align="left">78.1</td>
</tr>
<tr>
<td align="left">Silberman&#x00A0;et&#x00A0;al.&#x00A0;[<xref ref-type="bibr" rid="ref-2">2</xref>]</td>
<td align="left">70</td>
<td align="left">&#x2013;</td>
</tr>
<tr>
<td align="left">Multiscale convnet [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td align="left">51.1</td>
<td align="left">&#x2013;</td>
</tr>
<tr>
<td align="left"><bold>Proposed ANN Approach</bold></td>
<td align="left"><bold>88.0</bold></td>
<td align="left"><bold>89.1</bold></td>
</tr>
<tr>
<td align="left"><bold>Proposed CNN Approach</bold></td>
<td align="left"><bold>92.13</bold></td>
<td align="left"><bold>90.0</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The classification results of the three classifiers, i.e., Random Forest, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN), on the NYU V1 and ReDWeb V1 datasets are reported in <xref ref-type="table" rid="table-6">Tables 6</xref> and <xref ref-type="table" rid="table-7">7</xref>. All three classifiers were trained on the training set. The classification results in <xref ref-type="table" rid="table-6">Table 6</xref> were obtained on the testing set. As shown in <xref ref-type="table" rid="table-7">Table 7</xref>, the classification results of the proposed system are better for both datasets, with higher F-measure, precision, and recall scores than those obtained with the other classifiers. Overall, the proposed method using CNN achieved better performance than the other state-of-the-art methods.</p>
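<p>For reference, precision, recall, and F-measure for a single class follow from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions, with the F-measure taken as the harmonic mean of precision and recall; the counts are hypothetical and are not taken from the experiments, and the way the reported scores were aggregated is not stated in the text.</p>
<code language="python"><![CDATA[
```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) for one class.

    tp: samples of the class that were predicted as the class
    fp: samples of other classes that were predicted as the class
    fn: samples of the class that were predicted as something else
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for one class.
p, r, f = precision_recall_f(tp=80, fp=20, fn=10)
print(p, r, f)
```
]]></code>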
<table-wrap id="table-6"><label>Table 6</label><caption><title>The classification results of three classifiers on the ReDWeb V1 dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="2">Multi Objects</th>
<th align="center" colspan="3">Random Forest</th>
<th align="center" colspan="3">ANN</th>
<th align="center" colspan="3">CNN</th>
</tr>
<tr>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">WH</td>
<td align="left">0.781</td>
<td align="left">0.770</td>
<td align="left">0.762</td>
<td align="left">0.879</td>
<td align="left">0.862</td>
<td align="left">0.874</td>
<td align="left">0.891</td>
<td align="left">0.890</td>
<td align="left">0.885</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.750</td>
<td align="left">0.751</td>
<td align="left">0.759</td>
<td align="left">0.864</td>
<td align="left">0.859</td>
<td align="left">0.854</td>
<td align="left">0.889</td>
<td align="left">0.890</td>
<td align="left">0.885</td>
</tr>
<tr>
<td align="left">CR</td>
<td align="left">0.756</td>
<td align="left">0.749</td>
<td align="left">0.753</td>
<td align="left">0.880</td>
<td align="left">0.872</td>
<td align="left">0.880</td>
<td align="left">0.883</td>
<td align="left">0.880</td>
<td align="left">0.890</td>
</tr>
<tr>
<td align="left">CHR</td>
<td align="left">0.752</td>
<td align="left">0.751</td>
<td align="left">0.750</td>
<td align="left">0.879</td>
<td align="left">0.866</td>
<td align="left">0.876</td>
<td align="left">0.878</td>
<td align="left">0.875</td>
<td align="left">0.872</td>
</tr>
<tr>
<td align="left">CMP</td>
<td align="left">0.739</td>
<td align="left">0.745</td>
<td align="left">0.749</td>
<td align="left">0.880</td>
<td align="left">0.878</td>
<td align="left">0.878</td>
<td align="left">0.890</td>
<td align="left">0.885</td>
<td align="left">0.880</td>
</tr>
<tr>
<td align="left">MR</td>
<td align="left">0.748</td>
<td align="left">0.740</td>
<td align="left">0.742</td>
<td align="left">0.879</td>
<td align="left">0.874</td>
<td align="left">0.875</td>
<td align="left">0.878</td>
<td align="left">0.875</td>
<td align="left">0.870</td>
</tr>
<tr>
<td align="left">LIB</td>
<td align="left">0.749</td>
<td align="left">0.748</td>
<td align="left">0.7750</td>
<td align="left">0.880</td>
<td align="left">0.879</td>
<td align="left">0.879</td>
<td align="left">0.889</td>
<td align="left">0.880</td>
<td align="left">0.887</td>
</tr>
</tbody>
</table>
<table-wrap-foot><fn id="tfn6_1"><p>Note: WH &#x003D; warehouse, CR &#x003D; computer room, CH &#x003D; chair, CMP &#x003D; computer, MD &#x003D; mobile device, LAB &#x003D; laboratory, COR &#x003D; corridor.</p></fn>
</table-wrap-foot>
</table-wrap><table-wrap id="table-7"><label>Table 7</label><caption><title>The classification results of three classifiers on the NYU V1 dataset</title></caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left" rowspan="2">Multi Objects</th>
<th align="center" colspan="3">Random Forest</th>
<th align="center" colspan="3">ANN</th>
<th align="center" colspan="3">CNN</th>
</tr>
<tr>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
<th align="left">Precision</th>
<th align="left">Recall</th>
<th align="left">F-Measures</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">BR</td>
<td align="left">0.840</td>
<td align="left">0.839</td>
<td align="left">0.843</td>
<td align="left">0.890</td>
<td align="left">0.890</td>
<td align="left">0.889</td>
<td align="left">0.911</td>
<td align="left">0.910</td>
<td align="left">0.899</td>
</tr>
<tr>
<td align="left">BD</td>
<td align="left">0.850</td>
<td align="left">0.848</td>
<td align="left">0.847</td>
<td align="left">0.889</td>
<td align="left">0.887</td>
<td align="left">0.886</td>
<td align="left">0.899</td>
<td align="left">0.900</td>
<td align="left">0.886</td>
</tr>
<tr>
<td align="left">BS</td>
<td align="left">0.843</td>
<td align="left">0.840</td>
<td align="left">0.842</td>
<td align="left">0.890</td>
<td align="left">0.886</td>
<td align="left">0.888</td>
<td align="left">0.901</td>
<td align="left">0.892</td>
<td align="left">0.898</td>
</tr>
<tr>
<td align="left">CF</td>
<td align="left">0.851</td>
<td align="left">0.850</td>
<td align="left">0.849</td>
<td align="left">0.889</td>
<td align="left">0.887</td>
<td align="left">0.890</td>
<td align="left">0.910</td>
<td align="left">0.910</td>
<td align="left">0.900</td>
</tr>
<tr>
<td align="left">KIT</td>
<td align="left">0.849</td>
<td align="left">0.850</td>
<td align="left">0.846</td>
<td align="left">0.890</td>
<td align="left">0.888</td>
<td align="left">0.882</td>
<td align="left">0.900</td>
<td align="left">0.887</td>
<td align="left">0.889</td>
</tr>
<tr>
<td align="left">LR</td>
<td align="left">0.851</td>
<td align="left">0.847</td>
<td align="left">0.845</td>
<td align="left">0.890</td>
<td align="left">0.884</td>
<td align="left">0.890</td>
<td align="left">0.910</td>
<td align="left">0.889</td>
<td align="left">0.899</td>
</tr>
</tbody>
</table>
</table-wrap>
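<p>As a minimal sketch of how per-class Precision, Recall, and F-measure scores such as those in the tables above can be computed, the following Python snippet derives them from true and predicted labels. The function name <monospace>per_class_scores</monospace> and the toy labels are assumptions made for illustration only; they are not the evaluation code or the data used in this study, and the F-measure here is taken to be the harmonic mean of precision and recall.</p>
<preformat preformat-type="code">
```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f_measure)} for every label seen."""
    tp = Counter()  # true positives per label
    fp = Counter()  # false positives per label
    fn = Counter()  # false negatives per label
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted label was wrong
            fn[truth] += 1  # true label was missed

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]  # times the label was predicted
        actual = tp[label] + fn[label]     # times the label truly occurred
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        # F-measure assumed to be the harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores


# Toy labels, invented for illustration only (not data from the paper).
y_true = ["BR", "BR", "KIT", "KIT", "LR", "LR"]
y_pred = ["BR", "KIT", "KIT", "KIT", "LR", "BR"]
for label, (p, r, f) in per_class_scores(y_true, y_pred).items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```
</preformat>
<p>In practice, the label lists would hold the ground-truth and predicted object classes of the testing set, and the same routine would be run once per classifier to fill one row group of a results table.</p>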
</sec>
</sec>
</sec>
<sec id="s5"><label>5</label><title>Conclusion</title>
<p>This paper presents a novel and effective approach for the segmentation and classification of single and heterogeneous objects. Objects were segmented using the Watson Hybrid Concept algorithm, and multiple features were then extracted from both datasets. The experiments indicate that the proposed scheme outperforms previous state-of-the-art methods in terms of computational cost, segmentation results, and precision. In future work, we plan to conduct an in-depth analysis of outdoor scene images in order to increase the accuracy of semantic segmentation and to reduce its computational cost.</p>
</sec>
</body>
<back>
<sec><title>Funding Statement</title>
<p>This study was funded by the <funding-source>MSIT (Ministry of Science and ICT) of Korea</funding-source> through the <funding-source>ITRC (Information Technology Research Center) funding programme</funding-source> (<award-id>IITP-2023-2018-0-01426</award-id>) under the direction of the <funding-source>IITP (Institute for Information &#x0026; Communications Technology Planning and Evaluation)</funding-source>.</p></sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p></sec>
<ref-list content-type="authoryear"><title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B. C.</given-names> <surname>Russell</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Torralba</surname></string-name>, <string-name><given-names>K. P.</given-names> <surname>Murphy</surname></string-name> and <string-name><given-names>W. T.</given-names> <surname>Freeman</surname></string-name></person-group>, &#x201C;<article-title>LabelMe: A database and web-based tool for image annotation</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>77</volume>, no. <issue>3</issue>, pp. <fpage>157</fpage>&#x2013;<lpage>173</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Silberman</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Hoiem</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kohli</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Fergus</surname></string-name></person-group>, &#x201C;<article-title>Indoor segmentation and support inference from rgbd images</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on European Conf. on Computer Vision</conf-name>, <conf-loc>Florence, Italy</conf-loc>, pp. <fpage>746</fpage>&#x2013;<lpage>760</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Yaguchi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Dong</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Oka</surname></string-name></person-group>, &#x201C;<article-title>Image classification based on segmentation-free object recognition</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Image Processing</conf-name>, <conf-loc>Hong Kong, China</conf-loc>, pp. <fpage>2157</fpage>&#x2013;<lpage>2160</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yao</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Guo</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Yu</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities</article-title>,&#x201D; <source>ACM Computing Surveys</source>, vol. <volume>54</volume>, no. <issue>4</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>40</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Gupta</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Arbel&#x00E1;ez</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Malik</surname></string-name></person-group>, &#x201C;<article-title>Learning rich features from rgb-d images for object detection and segmentation</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on European Conf. on Computer Vision</conf-name>, <conf-loc>Zurich, Switzerland</conf-loc>, pp. <fpage>345</fpage>&#x2013;<lpage>360</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Zeng</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Nie&#x00DF;ner</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Fisher</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Xiao</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>3dmatch: Learning local geometric descriptors from rgb-d reconstructions</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on IEEE Conf. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Hawaii, USA</conf-loc>, pp. <fpage>1802</fpage>&#x2013;<lpage>1811</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Fan</surname></string-name> and <string-name><given-names>T. W. S.</given-names> <surname>Chow</surname></string-name></person-group>, &#x201C;<article-title>Exactly robust kernel principal component analysis</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks and Learning Systems</source>, vol. <volume>31</volume>, no. <issue>3</issue>, pp. <fpage>749</fpage>&#x2013;<lpage>761</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>M. Z.</given-names> <surname>Sarwar</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>RGB-D images for objects recognition using 3D point clouds and RANSAC plane fitting</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Applied Science and Technology</conf-name>, <conf-loc>Bhurban, Pakistan</conf-loc>, pp. <fpage>518</fpage>&#x2013;<lpage>523</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Fei-Fei</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Fergus</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Perona</surname></string-name></person-group>, &#x201C;<article-title>One-shot learning of object categories</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>28</volume>, no. <issue>4</issue>, pp. <fpage>594</fpage>&#x2013;<lpage>611</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Locality-sensitive deconvolution networks with gated fusion for rgb-d indoor semantic segmentation</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Computer Vision Pattern Recognition</conf-name>, <conf-loc>Hawaii, USA</conf-loc>, pp. <fpage>1475</fpage>&#x2013;<lpage>1483</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Farabet</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Couprie</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Najman</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>LeCun</surname></string-name></person-group>, &#x201C;<article-title>Learning hierarchical features for scene labeling</article-title>,&#x201D; <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>, vol. <volume>35</volume>, no. <issue>8</issue>, pp. <fpage>1915</fpage>&#x2013;<lpage>1929</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Akhter</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Pose estimation and detection for event recognition using sense-aware features and adaboost classifier</article-title>,&#x201D; in <conf-name>Proc. of. Conf. on Applied Sciences and Technologies (IBCAST)</conf-name>, <conf-loc>Islamabad, Pakistan</conf-loc>, pp. <fpage>500</fpage>&#x2013;<lpage>505</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Ghadi</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Akhter</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Alarfaj</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Syntactic model-based human body 3D reconstruction and event classification via association based features mining and deep learning</article-title>,&#x201D; <source>PeerJ Computer Science</source>, vol. <volume>7</volume>, pp. <fpage>e764</fpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="thesis"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Akhter</surname></string-name></person-group>, &#x201C;<article-title>Automated posture analysis of gait event detection via a hierarchical optimization algorithm and pseudo 2D stick-model</article-title>,&#x201D; M.S. Thesis, <publisher-name>Dept. of Computer Science, Air University</publisher-name>, <publisher-loc>Islamabad, Pakistan</publisher-loc>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Tsoi</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Lo</surname></string-name></person-group>, &#x201C;<article-title>Scale invariant feature transform flow trajectory approach with applications to human action recognition</article-title>,&#x201D; in <conf-name>Proc. Int. Joint Conf. on Neural Networks</conf-name>, <conf-loc>Beijing, China</conf-loc>, pp. <fpage>1197</fpage>&#x2013;<lpage>1204</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Socher</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Huval</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Bhat</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Manning</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Ng</surname></string-name></person-group>, &#x201C;<article-title>Convolutional-recursive deep learning for 3d object classification</article-title>,&#x201D; in <conf-name>Proc. Advances in Neural Information Processing Systems</conf-name>, <conf-loc>Harrahs and Harveys, Lake Tahoe</conf-loc>, pp. <fpage>656</fpage>&#x2013;<lpage>664</lpage>, <year>2012</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Tan</surname></string-name></person-group>, &#x201C;<article-title>Semi-supervised learning and feature evaluation for rgb-d object recognition</article-title>,&#x201D; <source>Computer Vision and Image Understanding</source>, vol. <volume>139</volume>, pp. <fpage>149</fpage>&#x2013;<lpage>160</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Farahnakian</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Poikonen</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Laurinen</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Heikkonen</surname></string-name></person-group>, &#x201C;<article-title>Deep convolutional neural network-based fusion of RGB and IR images in marine environment</article-title>,&#x201D; in <conf-name>Proc. Intelligent Transportation Systems Conf.</conf-name>, <conf-loc>Auckland, New Zealand</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>26</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Feris</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Vasconcelos</surname></string-name></person-group>, &#x201C;<article-title>A unified multi-scale deep convolutional neural network for fast object detection</article-title>,&#x201D; in <conf-name>Proc. European Conf. on Computer Vision</conf-name>, <conf-loc>Amsterdam, Netherlands</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Farahnakian</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Poikonen</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Laurinen</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Heikkonen</surname></string-name></person-group>, &#x201C;<article-title>Deep convolutional neural network-based fusion of rgb and ir images in marine environment</article-title>,&#x201D; in <conf-name>Proc. IEEE Intelligent Transportation Systems Conf.</conf-name>, <conf-loc>Auckland, New Zealand</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>26</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Tang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Sebe</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Structured attention guided convolutional neural fields for monocular depth estimation</article-title>,&#x201D; in <conf-name>Proc. Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Schwarz</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Schulz</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Behnke</surname></string-name></person-group>, &#x201C;<article-title>Rgb-d object recognition and pose estimation based on pre-trained convolutional neural network features</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Robotics and Automation</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, pp. <fpage>1329</fpage>&#x2013;<lpage>1335</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Dalal</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Triggs</surname></string-name></person-group>, &#x201C;<article-title>Histograms of oriented gradients for human detection</article-title>,&#x201D; in <conf-name>Proc. Computer Vision and Pattern Recognition</conf-name>, <conf-loc>San Diego, CA, USA</conf-loc>, pp. <fpage>886</fpage>&#x2013;<lpage>893</lpage>, <year>2005</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Lowe</surname></string-name></person-group>, &#x201C;<article-title>Distinctive image features from scale-invariant keypoints</article-title>,&#x201D; <source>International Journal of Computer Vision</source>, vol. <volume>60</volume>, pp. <fpage>91</fpage>&#x2013;<lpage>110</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Darrell</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Malik</surname></string-name></person-group>, &#x201C;<article-title>Region-based convolutional networks for accurate object detection and segmentation</article-title>,&#x201D; <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, vol. <volume>38</volume>, pp. <fpage>142</fpage>&#x2013;<lpage>158</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Faster r-cnn: Towards real-time object detection with region proposal networks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>28</volume>, pp. <fpage>91</fpage>&#x2013;<lpage>99</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Anguelov</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Reed</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Ssd: Single shot multibox detector</article-title>,&#x201D; in <conf-name>Proc. European Conf. on Computer Vision</conf-name>, <conf-loc>Amsterdam, The Netherlands</conf-loc>, pp. <fpage>21</fpage>&#x2013;<lpage>37</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Divvala</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name></person-group>, &#x201C;<article-title>You only look once: Unified, real-time object detection</article-title>,&#x201D; in <conf-name>Proc. on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, pp. <fpage>779</fpage>&#x2013;<lpage>788</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name></person-group>, &#x201C;<article-title>Fast r-cnn</article-title>,&#x201D; in <conf-name>Proc. on Computer Vision</conf-name>, <conf-loc>Washington, DC, USA</conf-loc>, pp. <fpage>1440</fpage>&#x2013;<lpage>1448</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>K.</given-names> <surname>He</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>R-Fcn: Object detection via region-based fully convolutional networks</article-title>,&#x201D; <source>Advances in Neural Information Processing Systems</source>, vol. <volume>29</volume>, pp. <fpage>379</fpage>&#x2013;<lpage>387</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Ummenhofer</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Uhrig</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Mayer</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Ilg</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Demon: Depth and motion network for learning monocular stereo</article-title>,&#x201D; in <conf-name>Proc. Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, HI, USA</conf-loc>, pp. <fpage>456</fpage>&#x2013;<lpage>464</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Dai</surname></string-name> and <string-name><given-names>M.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>Monocular depth estimation with hierarchical fusion of dilated cnns and soft-weighted-sum inference</article-title>,&#x201D; <source>Pattern Recognition</source>, vol. <volume>83</volume>, pp. <fpage>328</fpage>&#x2013;<lpage>339</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Zou</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Luo</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Df-net: Unsupervised joint learning of depth and flow using cross-task consistency</article-title>,&#x201D; in <conf-name>Proc. European Conf. Computer Vision</conf-name>, <conf-loc>Munich, Germany</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Godard</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Mac Aodha</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Brostow</surname></string-name></person-group>, &#x201C;<article-title>Digging into self-supervised monocular depth estimation</article-title>,&#x201D; in <conf-name>Proc. Computer Vision</conf-name>, <conf-loc>Seoul, Korea</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Zhou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Brown</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Snavely</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Lowe</surname></string-name></person-group>, &#x201C;<article-title>Unsupervised learning of depth and ego-motion from video</article-title>,&#x201D; in <conf-name>Proc. Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Honolulu, Hawaii</conf-loc>, pp. <fpage>234</fpage>&#x2013;<lpage>240</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Cohen-Or</surname></string-name>, <string-name><given-names>P.-A.</given-names> <surname>Heng</surname></string-name> and <string-name><given-names>H.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<article-title>Cascaded feature network for semantic segmentation of RGB-D images</article-title>,&#x201D; in <conf-name>Proc. Int. Conf. on Computer Vision</conf-name>, <conf-loc>Venice, Italy</conf-loc>, pp. <fpage>1320</fpage>&#x2013;<lpage>1328</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kamal</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Farooq</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>A spatiotemporal motion variation features extraction approach for human tracking and pose-based action recognition</article-title>,&#x201D; in <conf-name>Proc. ICIE</conf-name>, <conf-loc>Fukuoka, Japan</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. Y.</given-names> <surname>Ghadi</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Akhter</surname></string-name>, <string-name><given-names>S. A.</given-names> <surname>Alsuhibany</surname></string-name>, <string-name><given-names>T.</given-names> <surname>al Shloul</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Multiple events detection using context-intelligence features</article-title>,&#x201D; <source>Intell. Autom. &#x0026; Soft Comput.</source>, vol. <volume>34</volume>, no. <issue>3</issue>, pp. <fpage>1455</fpage>&#x2013;<lpage>1471</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>I.</given-names> <surname>Akhter</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Adaptive pose estimation for gait event detection using context-aware model and hierarchical optimization</article-title>,&#x201D; <source>J. Electr. Eng. &#x0026; Technol.</source>, vol. <volume>310</volume>, pp. <fpage>1</fpage>&#x2013;<lpage>9</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Batool</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Tahir</surname></string-name></person-group>, &#x201C;<article-title>Markerless sensors for physical health monitoring system using ECG and GMM feature extraction</article-title>,&#x201D; in <conf-name>Proc. on Applied Sciences and Technologies</conf-name>, <conf-loc>Islamabad, Pakistan</conf-loc>, pp. <fpage>340</fpage>&#x2013;<lpage>345</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kamal</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Depth map-based human activity tracking and recognition using body joints features and self-organized map</article-title>,&#x201D; in <conf-name>Proc. on Computing, Communications and Networking Technologies</conf-name>, <conf-loc>Hefei, China</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Jalal</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Kim</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Ridge body parts features for human pose estimation and recognition from RGB-D video data</article-title>,&#x201D; in <conf-name>Proc. of Fifth Int. Conf. on Computing, Communications and Networking Technologies (ICCCNT)</conf-name>, <conf-loc>Hefei, China</conf-loc>, pp. <fpage>1</fpage>&#x2013;<lpage>6</lpage>, <year>2014</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>