<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">IASC</journal-id>
<journal-id journal-id-type="nlm-ta">IASC</journal-id>
<journal-id journal-id-type="publisher-id">IASC</journal-id>
<journal-title-group>
<journal-title>Intelligent Automation &#x0026; Soft Computing</journal-title>
</journal-title-group>
<issn pub-type="epub">2326-005X</issn>
<issn pub-type="ppub">1079-8587</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">30298</article-id>
<article-id pub-id-type="doi">10.32604/iasc.2023.030298</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Robust Symmetry Prediction with Multi-Modal Feature Fusion for Partial Shapes</article-title><alt-title alt-title-type="left-running-head">Robust Symmetry Prediction with Multi-Modal Feature Fusion for Partial Shapes</alt-title><alt-title alt-title-type="right-running-head">Robust Symmetry Prediction with Multi-Modal Feature Fusion for Partial Shapes</alt-title>
</title-group>
<contrib-group content-type="authors">
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Xi</surname><given-names>Junhua</given-names></name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zheng</surname><given-names>Kouquan</given-names></name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Zhong</surname><given-names>Yifan</given-names></name>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Li</surname><given-names>Longjiang</given-names></name>
<xref ref-type="aff" rid="aff-3">3</xref>
</contrib>
<contrib id="author-5" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Cai</surname><given-names>Zhiping</given-names></name>
<xref ref-type="aff" rid="aff-1">1</xref><email>zpcai@nudt.edu.cn</email>
</contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Chen</surname><given-names>Jinjing</given-names></name>
<xref ref-type="aff" rid="aff-4">4</xref>
</contrib>
<aff id="aff-1"><label>1</label><institution>National University of Defense Technology</institution>, <addr-line>Changsha, Hunan</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Jiangxi University of Finance and Economics</institution>, <addr-line>Jiangxi</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>Unit 78111 of Chinese People&#x2019;s Liberation Army</institution>, <addr-line>Chengdu, Sichuan</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>Sungkyunkwan University</institution>, <country>Korea</country></aff>
</contrib-group><author-notes><corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Zhiping Cai. Email: <email>zpcai@nudt.edu.cn</email></corresp></author-notes>
<pub-date pub-type="epub" date-type="pub" iso-8601-date="2022-08-08"><day>08</day>
<month>08</month>
<year>2022</year></pub-date>
<volume>35</volume>
<issue>3</issue>
<fpage>3099</fpage>
<lpage>3111</lpage>
<history>
<date date-type="received"><day>23</day><month>3</month><year>2022</year></date>
<date date-type="accepted"><day>28</day><month>4</month><year>2022</year></date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2023 Xi et al.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Xi et al.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_IASC_30298.pdf"></self-uri>
<abstract>
<p>In geometry processing, symmetry research benefits from the global geometric features of complete shapes, but the shape of an object captured in real-world applications is often incomplete due to limited sensor resolution, a single viewpoint, and occlusion. Unlike existing works that predict symmetry from complete shapes, we propose a learning approach for symmetry prediction based on a single RGB-D image. Instead of directly predicting symmetry from the incomplete shape, our method consists of two modules, i.e., a multi-modal feature fusion module and a detection-by-reconstruction module. First, we build a channel-transformer network (CTN) as the multi-modal feature fusion module to extract cross-fusion features from the RGB-D input, which lets us aggregate features from the color and depth modalities separately. Then, our self-reconstruction network, based on a 3D variational auto-encoder (3D-VAE), takes the global geometric features as input and is followed by a symmetry prediction network that detects the symmetry. Experiments conducted on three public datasets (ShapeNet, YCB, and ScanNet) demonstrate that our method produces reliable and accurate results.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Symmetry prediction</kwd>
<kwd>multi-modal feature fusion</kwd>
<kwd>partial shapes</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Detecting the symmetry of 3D objects is a critical and fundamental problem for many computer vision applications. In geometric processing, finding symmetries in geometric data, such as point clouds, polygon meshes, and voxels, helps numerous applications exploit symmetry information to solve their tasks or improve their algorithms, e.g., shape matching [<xref ref-type="bibr" rid="ref-1">1</xref>], segmentation [<xref ref-type="bibr" rid="ref-2">2</xref>], and completion [<xref ref-type="bibr" rid="ref-3">3</xref>]. Among all symmetry types, the most common and important one is planar reflective symmetry. Traditional planar symmetry detection methods are usually based on the observation that all points on the shape coincide with the original shape after a mirror transformation across the symmetry plane; e.g., a shape can be aligned to its principal axes, and the planes formed by pairs of principal axes can then be checked to see whether they are symmetry planes. Recently, deep learning-based methods have leveraged neural networks to extract global features of the shape, which are used to capture possible symmetries [<xref ref-type="bibr" rid="ref-4">4</xref>,<xref ref-type="bibr" rid="ref-5">5</xref>].</p>
<p>However, these methods all perform symmetry detection on complete 3D shapes. The shape of an object captured in real-world applications is often incomplete due to limited sensor resolution, a single viewpoint, and occlusion; in such cases, the basic assumption above, which holds for complete 3D shapes, is broken.</p>
<p>For this reason, we propose to perform symmetry detection directly from partially observed objects/point clouds. A common application scenario is estimating the symmetries of 3D shapes from a single-view RGB-D image. Partial observation and object occlusion pose special challenges that are beyond the reach of purely geometric detection. For example, the missing global geometric information makes it very difficult to find local symmetry correspondences supported by a mirror transformation. We therefore try to extract more latent information through a deep neural network, just as humans can infer symmetry from a large amount of learned knowledge when they see a new object.</p>
<p>In this work, we propose a learning approach for symmetry prediction based on a single RGB-D image. Our main motivation is to transform an incomplete shape into a complete one. Our method consists of two modules, i.e., a multi-modal feature fusion module and a detection-by-reconstruction module. We build a channel-transformer network (CTN) as the multi-modal feature fusion module to extract cross-fusion features from the RGB-D input. Then, our self-reconstruction network, based on a 3D variational auto-encoder (3D-VAE), takes the cross-modal features as input and is followed by a symmetry prediction network that detects the symmetry. <xref ref-type="fig" rid="fig-1">Fig. 1</xref> shows the pipeline of our method. Thanks to our reconstruction network, the detection network is trained in an unsupervised manner without any symmetry annotations. Extensive experiments on several benchmarks demonstrate the effectiveness of our method. To summarize, our main contributions are three-fold:<list list-type="bullet"><list-item>
<p>We propose a channel-transformer network for single-view RGB-D images to extract multimodal features.</p></list-item><list-item>
<p>We use a 3D-VAE network to perform the self-reconstruction, replacing the MLPs with a 3D CNN in the latent layer.</p></list-item><list-item>
<p>We propose an end-to-end multi-feature network for symmetry prediction.</p></list-item></list></p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The architecture of our pipeline. First, taking an RGB image and a depth image as input, cross-modal features are extracted by the CTN, followed by the self-reconstruction network based on the 3D-VAE. The cross-modal features and the enhanced geometric features are then aggregated to predict symmetry through the symmetry prediction network</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-1.png"/>
</fig>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Multimodal Fusion</title>
<p>Multimodal fusion can take advantage of data obtained from different sources/structures for classification or regression, so it has become a central problem in machine learning [<xref ref-type="bibr" rid="ref-6">6</xref>]. A variety of works address deep multimodal fusion [<xref ref-type="bibr" rid="ref-7">7</xref>]. A simple way is aggregation-based fusion, which employs a certain operation (e.g., averaging [<xref ref-type="bibr" rid="ref-8">8</xref>], concatenation [<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>], or self-attention [<xref ref-type="bibr" rid="ref-11">11</xref>]) to combine the multimodal sub-networks into a single network. Alternatively, alignment-based fusion [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>] adopts a regularization loss to align the embeddings of all sub-networks while keeping full propagation for each of them. However, aggregation-based fusion is prone to underestimating intra-modal propagation once the multimodal sub-networks have been aggregated, and alignment-based fusion often delivers ineffective inter-modal fusion because message exchange is weak when only the alignment loss is trained. To balance the two approaches, [<xref ref-type="bibr" rid="ref-14">14</xref>] proposed a parameter-free channel-exchanging method that dynamically exchanges channels between the sub-networks of different modalities. However, exchanging a channel directly based on whether its BN scaling factor approaches zero leaves no attention between the corresponding channels, which may cause information loss. Our method therefore optimizes the feature handling process by introducing an attention mechanism between the corresponding channels.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Learning-based Symmetry Detection</title>
<p>In the pattern recognition and computer vision literature, a significant number of papers are dedicated to finding symmetries in images [<xref ref-type="bibr" rid="ref-15">15</xref>], two-dimensional shapes [<xref ref-type="bibr" rid="ref-16">16</xref>&#x2013;<xref ref-type="bibr" rid="ref-18">18</xref>], and three-dimensional shapes [<xref ref-type="bibr" rid="ref-19">19</xref>&#x2013;<xref ref-type="bibr" rid="ref-21">21</xref>]. For symmetry detection on 3D shapes, the Euclidean distance between points is usually used to measure the symmetry of a shape, and planar reflective symmetry is the most fundamental type. Early methods used entropy features to help find reflective symmetry [<xref ref-type="bibr" rid="ref-22">22</xref>] or designed different metrics to detect intrinsic symmetry [<xref ref-type="bibr" rid="ref-23">23</xref>]. More recently, deep learning has been adopted for 3D symmetry detection. The work in [<xref ref-type="bibr" rid="ref-24">24</xref>] utilizes structured random forests to detect curved reflectional symmetries, and [<xref ref-type="bibr" rid="ref-25">25</xref>] develops an unsupervised deep learning approach for effective real-time global planar reflective symmetry detection, which demonstrates excellent results. However, like PRS-Net proposed by Gao et al. [<xref ref-type="bibr" rid="ref-25">25</xref>], most of this work targets symmetry detection on complete 3D shapes and is not applicable to incomplete ones. To date, existing works for incomplete 3D shapes mainly focus on learning-based shape descriptors [<xref ref-type="bibr" rid="ref-26">26</xref>], which makes it difficult to detect symmetry directly when large parts of the shape are missing (e.g., rendered RGB-D). Our network aims to use a generative model to reconstruct the global geometric features of incomplete 3D shapes as far as possible, which helps in detecting symmetry.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Method</title>
<p>The symmetry of a 3D object is easy to measure when its shape is complete: global geometric features carry enough information for conventional symmetry detection pipelines to find the correct symmetry plane. However, shapes captured in real-world applications are often incomplete due to limited sensor resolution, a single viewpoint, and occlusion. We therefore take a single RGB-D image as input and predict the symmetry of the incomplete shape.</p>
<p>As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, our solution consists of three major components. 1) Given an RGB-D image, a channel-transformer network (CTN) extracts cross-modal features by fusing information from the RGB image and the depth image. 2) A self-reconstruction network (3D-VAE) predicts the complete geometric shape from the voxelized cross-modal features. 3) A symmetry prediction network takes the cross-modal features from the CTN and the enhanced geometric features from the 3D-VAE as multi-feature input and predicts the symmetry.</p>
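The three-stage dataflow above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names (ctn_fuse, vae_reconstruct, predict_symmetry), their placeholder bodies, and the feature sizes are all hypothetical, and only mirror the shape of the pipeline, with the symmetry plane represented by a unit normal and an offset.

```python
import numpy as np

def ctn_fuse(rgb, depth):
    """Stand-in for the channel-transformer network: fuse the two
    modalities into one cross-modal feature map (placeholder logic)."""
    return (rgb.mean(axis=-1, keepdims=True) + depth) / 2.0

def vae_reconstruct(features):
    """Stand-in for the 3D-VAE: map fused features to an enhanced
    global geometric feature vector (placeholder logic)."""
    return features.reshape(-1)[:128]

def predict_symmetry(cross_feat, geo_feat):
    """Stand-in for the symmetry prediction head: regress a plane
    (unit normal n, offset d) from the concatenated multi-features."""
    z = np.concatenate([cross_feat.reshape(-1)[:128], geo_feat])
    n = z[:3] / (np.linalg.norm(z[:3]) + 1e-8)  # unit plane normal
    d = float(z[3])                              # plane offset
    return n, d

rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))    # single-view RGB image
depth = rng.random((64, 64, 1))  # aligned depth map
fused = ctn_fuse(rgb, depth)
geo = vae_reconstruct(fused)
normal, offset = predict_symmetry(fused, geo)
```

The point of the sketch is the dataflow: the symmetry head consumes both the cross-modal features and the reconstructed geometric features, rather than the incomplete observation alone.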
<sec id="s3_1">
<label>3.1</label>
<title>Channel-Transformer Network for RGB-D Images</title>
<p>Traditional methods often treat an RGB-D image as a single 4-channel feature, but [<xref ref-type="bibr" rid="ref-14">14</xref>] found that RGB and depth are two distinct modalities: RGB carries texture information, while depth contributes contour information. Aggregation-based methods [<xref ref-type="bibr" rid="ref-7">7</xref>,<xref ref-type="bibr" rid="ref-9">9</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>] and alignment-based methods [<xref ref-type="bibr" rid="ref-12">12</xref>,<xref ref-type="bibr" rid="ref-13">13</xref>] therefore process them separately, but internal feature communication is ignored.</p>
<p>Therefore, we set up two sub-networks to handle RGB and depth respectively, and propose the Channel-Transformer Network (CTN) to achieve internal feature communication. Instead of using aggregation, alignment, or channel-exchanging methods as before, the CTN dynamically transforms channels between the sub-networks for fusion. Inspired by [<xref ref-type="bibr" rid="ref-14">14</xref>], we utilize the scaling factor (i.e., <italic>&#x03B3;</italic>) of Batch Normalization (BN) [<xref ref-type="bibr" rid="ref-27">27</xref>] as the importance measure of each corresponding channel and apply a transformer between the channels associated with close-to-zero factors in each modality. Such message transfer is self-adaptive, as it is dynamically controlled by the scaling factors, which are determined by the training itself.</p>
<p>We conduct multi-modal fusion on RGB-D images corresponding to the same image content. In this scenario, all modalities are homogeneous in the sense that they are simply different views of the same input. Thus, all parameters of the CTN except the BN layers are shared between the sub-networks. By using private BNs, we can determine the channel importance for each individual modality; by sharing convolutional filters, the corresponding channels of different modalities are embedded with the same mapping, which yields promising expressive power.</p>
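A small NumPy sketch may make the "shared filters, private BNs" idea concrete: one weight bank serves both modalities, while each modality keeps its own scaling factors &#x03B3; and offsets &#x03B2;. The 1&#x00D7;1 convolution, the channel counts, and the parameter values below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)
    return x @ w

def batch_norm(x, gamma, beta, eps=1e-5):
    # Per-channel normalization over all pixel locations, as in Eq. (2)
    mu = x.mean(axis=(0, 1))
    var = x.var(axis=(0, 1))
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
w_shared = rng.standard_normal((3, 4))          # one filter bank shared by both branches
gamma_rgb, beta_rgb = np.ones(4), np.zeros(4)   # private BN of the RGB branch
gamma_d, beta_d = np.full(4, 0.5), np.zeros(4)  # private BN of the depth branch

x_rgb = rng.random((8, 8, 3))  # toy RGB feature map
x_d = rng.random((8, 8, 3))    # toy depth feature map (channel count illustrative)
y_rgb = batch_norm(conv1x1(x_rgb, w_shared), gamma_rgb, beta_rgb)
y_d = batch_norm(conv1x1(x_d, w_shared), gamma_d, beta_d)
```

Reading off gamma_rgb and gamma_d separately is what lets each modality rank its own channels, while w_shared keeps the channel semantics aligned across branches.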
<p>Suppose we have the <italic>i</italic>-th input data of <italic>M</italic> (<italic>M</italic>&#x2009;&#x003D;&#x2009;2) modalities, <inline-formula id="ieqn-1">
<mml:math id="mml-ieqn-1"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>C</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>H</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>W</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msubsup></mml:math>
</inline-formula>, where C denotes the number of channels, and H and W denote the height and width of the feature map. We first process each modality with a 2D U-Net sub-network <italic>f</italic><sub><italic>m</italic></sub>(<italic>x</italic>); the sub-networks share all parameters, including convolutional filters, except the BN layers. The output of the sub-network is <italic>y</italic><sub><italic>m</italic></sub>:<disp-formula id="eqn-1"><label>(1)</label>
<mml:math id="mml-eqn-1" display="block"><mml:msub><mml:mi>y</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>m</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mspace width="1em" /><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mi>r</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>g</mml:mi><mml:mi>b</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mi>w</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>R</mml:mi><mml:mi>G</mml:mi><mml:mi>B</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mi>w</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>t</mml:mi><mml:mi>h</mml:mi><mml:mi>e</mml:mi><mml:mspace width="thickmathspace" 
/><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mspace width="thickmathspace" /><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>p</mml:mi><mml:mi>t</mml:mi><mml:mi>h</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>Prior to introducing the channel transformer process, we first review the BN layer [<xref ref-type="bibr" rid="ref-27">27</xref>], which is used widely in deep learning to eliminate covariate shift and improve generalization. We denote by <italic>x</italic><sub><italic>m</italic>,<italic>l</italic></sub> the <italic>l</italic>-th layer feature maps of the <italic>m</italic>-th sub-network, and by <italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> the <italic>c</italic>-th channel. The BN layer performs a normalization of <italic>x</italic><sub><italic>m</italic>,<italic>l</italic></sub> followed by an affine transformation, namely,<disp-formula id="eqn-2"><label>(2)</label>
<mml:math id="mml-eqn-2" display="block"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mstyle></mml:math>
</disp-formula></p>
<p><italic>&#x03BC;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> and <italic>&#x03C3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> denote the mean and the standard deviation, respectively, of all activations over all pixel locations (H and W) for the current mini-batch; <italic>&#x03B3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> and <italic>&#x03B2;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> are the trainable scaling factor and offset, respectively; <inline-formula id="ieqn-400">
<mml:math id="mml-ieqn-400"><mml:mtext>&#x03B5;</mml:mtext></mml:math></inline-formula> is a small constant to avoid divisions by zero. The (<italic>l</italic> &#x002B; 1)-th layer takes <inline-formula id="ieqn-2">
<mml:math id="mml-ieqn-2"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mrow><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mi>c</mml:mi></mml:msub></mml:math>
</inline-formula> as input after a non-linear function.</p>
<p>The factor <italic>&#x03B3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> evaluates the correlation between the input <italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> and the output <inline-formula id="ieqn-3">
<mml:math id="mml-ieqn-3"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math>
</inline-formula> during training. The gradient of the loss w.r.t. <italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> will approach 0 if <italic>&#x03B3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub>&#x2009;&#x2192;&#x2009;0, implying that <italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> will lose its influence on the final prediction and thereby become redundant. In other words, once the current channel <italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> becomes redundant because <italic>&#x03B3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub>&#x2009;&#x2192;&#x2009;0 at a certain training step, it will almost certainly remain so thereafter.</p>
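This redundancy argument can be checked numerically: since the BN output of a channel is &#x03B3; times a normalized signal (plus &#x03B2;), the effect of any input perturbation on the output scales linearly with &#x03B3; and vanishes entirely at &#x03B3;&#x2009;&#x003D;&#x2009;0. The NumPy sketch below is only an illustration of this point; the batch size and the perturbation are arbitrary.

```python
import numpy as np

def bn_channel(x, gamma, beta=0.0, eps=1e-5):
    # Batch normalization of a single channel, statistics over the batch (Eq. (2))
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

rng = np.random.default_rng(1)
x = rng.random(256)      # activations of one channel
delta = np.zeros_like(x)
delta[0] = 1e-3          # perturb a single activation

changes = {}
for gamma in (1.0, 0.1, 0.0):
    changes[gamma] = np.abs(bn_channel(x + delta, gamma) - bn_channel(x, gamma)).max()
# the output change scales linearly with gamma and is exactly 0 at gamma = 0
```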
<p>Thus, we pick out the channels with small scaling factors, together with the corresponding channels of the other sub-network, feed them into the CTN, and then replace those channels with the features produced by the CTN. The CTN is presented in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. Specifically, it contains three cross-attention layers, each followed by two AddNorm layers and one feed-forward layer [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>]. Suppose <italic>X</italic>&#x2009;&#x003D;&#x2009;<italic>x</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub> and the corresponding channel of the other sub-network is <inline-formula id="ieqn-4">
<mml:math id="mml-ieqn-4"><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math>
</inline-formula>. The cross-attention layer of CTN is:</p>
<p><disp-formula id="eqn-3"><label>(3)</label>
<mml:math id="mml-eqn-3" display="block"><mml:mi>S</mml:mi><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mo>,</mml:mo><mml:mi>K</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:msup><mml:mi>K</mml:mi><mml:mi>T</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mi>V</mml:mi></mml:math>
</disp-formula></p>
<p>where <italic>Q</italic>&#x2009;&#x003D;&#x2009;<italic>XW</italic><sup><italic>Q</italic></sup>, <italic>K</italic>&#x2009;&#x003D;&#x2009;<italic>YW</italic><sup><italic>K</italic></sup>, <italic>V</italic>&#x2009;&#x003D;&#x2009;<italic>YW</italic><sup><italic>V</italic></sup> are the query vector, the key vector, and the value vector respectively. <italic>W</italic><sup><italic>Q</italic></sup>, <italic>W</italic><sup><italic>K</italic></sup>, <italic>W</italic><sup><italic>V</italic></sup> are the learned weights. The AddNorm layer is:<disp-formula id="eqn-4"><label>(4)</label>
<mml:math id="mml-eqn-4" display="block"><mml:mi>Z</mml:mi><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>d</mml:mi><mml:mi>d</mml:mi><mml:mi>N</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>X</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mi>a</mml:mi><mml:mi>y</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>N</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>+</mml:mo><mml:mi>S</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula>where LayerNorm(&#x22C5;) is the layer normalization operation. The output of the CTN is the attention-aware denoised feature <inline-formula id="ieqn-5">
<mml:math id="mml-ieqn-5"><mml:mi>Z</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup></mml:math>
</inline-formula>.</p>
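Eqs. (3) and (4) can be written out as a compact NumPy sketch. The token count and feature width below are arbitrary stand-ins, and the single-head attention follows the formula as printed, without the 1/&#x221A;d scaling used in some transformer variants.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def cross_attention(X, Y, Wq, Wk, Wv):
    # Eq. (3): queries from one modality, keys/values from the other
    Q, K, V = X @ Wq, Y @ Wk, Y @ Wv
    return softmax(Q @ K.T) @ V

def add_norm(X, S):
    # Eq. (4): residual connection followed by layer normalization
    return layer_norm(X + S)

rng = np.random.default_rng(2)
n, d = 16, 8                     # tokens (pixel locations) and width, illustrative
X = rng.standard_normal((n, d))  # channel features from one modality
Y = rng.standard_normal((n, d))  # corresponding channel of the other modality
Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
Z = add_norm(X, cross_attention(X, Y, Wq, Wk, Wv))
```

Because the queries come from one modality and the keys/values from the other, the output Z mixes information across modalities before replacing the weak channel.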
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>The framework of the CTN. If a channel of a BN layer has a small scaling factor (<italic>&#x03B3;</italic><sub><italic>m</italic>,<italic>l</italic>,<italic>c</italic></sub>&#x2009;&#x003C;&#x2009;<italic>&#x03B8;</italic>), the corresponding channels in the feature map before the BN layer are picked out from all sub-networks (shown as the red line in the figure). The features in these channels are then fed into the CTN to obtain new features</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-2.png"/>
</fig>
<p>Thus,<disp-formula id="eqn-5"><label>(5)</label>
<mml:math id="mml-eqn-5" display="block"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:msqrt></mml:mrow></mml:mfrac></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x03B2;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mi>i</mml:mi><mml:mi>f</mml:mi><mml:mspace width="thickmathspace" 
/><mml:msub><mml:mi>&#x03B3;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>&#x003E;</mml:mo><mml:mi>&#x03B8;</mml:mi></mml:mstyle></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mrow><mml:mi>C</mml:mi><mml:mi>T</mml:mi><mml:mi>N</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mi>l</mml:mi><mml:mo>,</mml:mo><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:mi>e</mml:mi><mml:mi>l</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:math>
</disp-formula></p>
<p>In a nutshell, if a channel of one modality has little impact on the final prediction, we replace it with a new channel from the channel-transformer network. We apply <xref ref-type="disp-formula" rid="eqn-5">Eq. (5)</xref> to each modality before feeding the features into the nonlinear activation followed by the convolutions in the next layer.</p>
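<p>The per-channel decision of Eq. (5) can be sketched as follows for a single sub-network. This is a minimal numpy illustration, not the authors' implementation: the <monospace>ctn</monospace> callable is a stand-in for the trained channel transformer, and the BN statistics are assumed given.</p>

```python
import numpy as np

def bn_or_ctn(x, gamma, beta, mu, var, theta, ctn, eps=1e-5):
    """Eq. (5) for one modality: channels whose BN scaling factor
    exceeds theta keep their batch-normalized features; the remaining
    (low-importance) channels are replaced by the CTN's output.

    x           : (C, H, W) pre-BN feature map of one sub-network
    gamma, beta : (C,) BN scale and shift
    mu, var     : (C,) per-channel mean and variance
    theta       : pruning threshold
    ctn         : placeholder callable for the real channel transformer
    """
    strong = gamma > theta
    out = np.empty_like(x)
    # Standard BN transform for the important channels.
    out[strong] = (gamma[strong][:, None, None]
                   * (x[strong] - mu[strong][:, None, None])
                   / np.sqrt(var[strong][:, None, None] + eps)
                   + beta[strong][:, None, None])
    # Weak channels are handed to the channel transformer instead.
    out[~strong] = ctn(x[~strong])
    return out

# Toy check: one strong channel passes through BN, one weak channel is
# replaced by a dummy "CTN" that zeroes it out.
x = np.ones((2, 4, 4))
out = bn_or_ctn(x,
                gamma=np.array([1.0, 0.01]), beta=np.array([0.5, 0.0]),
                mu=np.array([0.0, 0.0]), var=np.array([1.0, 1.0]),
                theta=0.05, ctn=lambda c: np.zeros_like(c))
```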
<p>It has been shown in [<xref ref-type="bibr" rid="ref-29">29</xref>,<xref ref-type="bibr" rid="ref-30">30</xref>] that private BN layers can characterize the traits of different domains or modalities. In our method, specifically, the scaling factors (<xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>) evaluate the importance of the channels of the different modalities, so they should be decoupled. Apart from the BN layers, all sub-networks <italic>f</italic><sub><italic>m</italic></sub> share all parameters with each other, including the convolutional filters.</p>
<p>Then we combine all the outputs via an aggregation operation followed by a global mapping. Formally, the output is computed as<disp-formula id="eqn-6"><label>(6)</label>
<mml:math id="mml-eqn-6" display="block"><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mi>f</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>x</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>g</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>f</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>x</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula>where <italic>h</italic> is the global network and <italic>Agg</italic> is the aggregation function. The aggregation can be implemented as averaging [<xref ref-type="bibr" rid="ref-4">4</xref>], concatenation [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-10">10</xref>], or self-attention [<xref ref-type="bibr" rid="ref-11">11</xref>]; here we simply use concatenation. Finally, we voxelize the point clouds together with their color features <italic>x</italic><sub><italic>rgb</italic></sub> and depth features <italic>x</italic><sub><italic>d</italic></sub> from the CTN.</p>
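<p>Eq. (6) amounts to a concatenate-then-map pipeline. A minimal sketch, in which the identity sub-networks and the mean-pooling "global network" are stand-ins for the learned <italic>f</italic><sub><italic>rgb</italic></sub>, <italic>f</italic><sub><italic>d</italic></sub>, and <italic>h</italic>:</p>

```python
import numpy as np

def fuse(x_rgb, x_d, f_rgb, f_d, h):
    """Eq. (6): run each modality through its sub-network, concatenate
    the outputs along the channel axis, and apply the global network h.
    All callables here are placeholders for the real (learned) networks."""
    fused = np.concatenate([f_rgb(x_rgb), f_d(x_d)], axis=0)
    return h(fused)

# Toy run: 8-channel color and depth features fuse into 16 channels.
y = fuse(np.zeros((8, 4, 4)), np.ones((8, 4, 4)),
         f_rgb=lambda x: x, f_d=lambda x: x, h=lambda f: f.mean())
```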
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Voxel-Based 3D Variational Autoencoders</title>
<p>Due to partial observations, methods for symmetry prediction of complete shapes no longer work: the missing global geometric information makes it difficult to find the local symmetry correspondences supported by the mirror transformation. We therefore use a generative network, a 3D Variational Autoencoder (3D-VAE), to reconstruct the complete 3D shape from an incomplete one, providing enhanced geometric features for the subsequent symmetry prediction network.</p>
<p>We take the cross-modal features from the CTN as input and reconstruct the corresponding complete shape, which carries richer geometric features than the initial features obtained from depth alone. As shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>, the 3D-VAE network consists of an encoder network, a latent layer, and a decoder network. We adopt an approximate 3D U-Net architecture for the encoder and decoder networks, except for the latent layer. Differently from many other works [<xref ref-type="bibr" rid="ref-31">31</xref>,<xref ref-type="bibr" rid="ref-32">32</xref>], we design the latent layer as a 3D network to preserve global geometric features as much as possible. We encode the cross-modal features into a latent space that follows a normal distribution (we regress the two parameters <italic>&#x03BC;</italic><sup>(<italic>i</italic>)</sup> and <italic>&#x03C3;</italic><sup>(<italic>i</italic>)</sup> that define it), and then sample from this space for reconstruction through the decoder network.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>The 3D variational autoencoders</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-3.png"/>
</fig>
<p>We use a 3D down-sampling network for the probabilistic encoder <inline-formula id="ieqn-6">
<mml:math id="mml-ieqn-6"><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">&#x2205;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula> (the approximation to the posterior of the generative model <inline-formula id="ieqn-7">
<mml:math id="mml-ieqn-7"><mml:msub><mml:mi>p</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula>), where the parameters <inline-formula id="ieqn-8">
<mml:math id="mml-ieqn-8"><mml:mi mathvariant="normal">&#x2205;</mml:mi></mml:math>
</inline-formula> and <italic>&#x03B8;</italic> are optimized jointly with the VAE algorithm.</p>
<p>Let the prior over the latent variables be the centered isotropic multivariate Gaussian <inline-formula id="ieqn-9">
<mml:math id="mml-ieqn-9"><mml:msub><mml:mi>p</mml:mi><mml:mi>&#x03B8;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>z</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mo>;</mml:mo><mml:mspace width="thickmathspace" /><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>I</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula>. We let <inline-formula id="ieqn-10">
<mml:math id="mml-ieqn-10"><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">&#x2205;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:msup><mml:mi>x</mml:mi><mml:mi>i</mml:mi></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula> be a multivariate Gaussian (in the case of real-valued data) or Bernoulli (in the case of binary data) whose distribution parameters are computed from <italic>x</italic> with a 3D network (another 3D layer following the encoder network). In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:<disp-formula id="eqn-7"><label>(7)</label>
<mml:math id="mml-eqn-7" display="block"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">&#x2205;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mo>;</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>&#x03BC;</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>I</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula>where the mean and standard deviation of the approximate posterior, <italic>&#x03BC;</italic> and <inline-formula id="ieqn-10a">
<mml:math id="mml-ieqn-10a"><mml:mi>&#x03C3;</mml:mi></mml:math>
</inline-formula>, are outputs of the last 3D layer following the encoder network, which consists of three 3D convolutional layers leading into the latent layer.</p>
<p>Then, we introduce the latent layer. All the outputs here are 3D tensors with one channel. We sample from the posterior <inline-formula id="ieqn-11">
<mml:math id="mml-ieqn-11"><mml:mspace width="thickmathspace" /><mml:msup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x223C;</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mi mathvariant="normal">&#x2205;</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>z</mml:mi><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula> using <inline-formula id="ieqn-12">
<mml:math id="mml-ieqn-12"><mml:msup><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>&#x03BC;</mml:mi><mml:mo>+</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo>&#x2299;</mml:mo><mml:mi>&#x03B5;</mml:mi></mml:math>
</inline-formula> where <inline-formula id="ieqn-13">
<mml:math id="mml-ieqn-13"><mml:mi>&#x03B5;</mml:mi><mml:mo>&#x223C;</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="normal">N</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mi>I</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</inline-formula>. With <inline-formula id="ieqn-14">
<mml:math id="mml-ieqn-14"><mml:mo>&#x2299;</mml:mo></mml:math>
</inline-formula> we denote an element-wise product; <italic>z</italic><sup>&#x2032;</sup> is the input of the decoder network.</p>
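<p>The sampling step above is the standard reparameterization trick, which keeps the sampling differentiable with respect to the regressed parameters. A minimal sketch:</p>

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterization: z' = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and sigma during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 4, 4))          # 3D latent-layer outputs, one channel
z = sample_latent(mu, np.full(mu.shape, 0.1), rng)
```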
<p>In this case, the KL divergence can be computed and differentiated:<disp-formula id="eqn-8"><label>(8)</label>
<mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:mrow><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>d</mml:mi></mml:munderover><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mi>&#x03BC;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:msubsup><mml:mi>&#x03C3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msubsup><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mstyle></mml:math>
</disp-formula></p>
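<p>For a diagonal Gaussian posterior and a standard normal prior, the KL term of Eq. (8) has this closed form, sketched below; it is zero exactly when the posterior equals the prior.</p>

```python
import numpy as np

def kl_term(mu, sigma):
    """Eq. (8): KL divergence between N(mu, sigma^2 I) and the
    standard normal prior, summed over the d latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

# Matching the prior (mu = 0, sigma = 1) gives zero divergence.
kl_zero = kl_term(np.zeros(8), np.ones(8))
```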
<p>The decoder network is a 3D up-sampling network for reconstruction; its architecture mirrors the encoder's, but its weights are not tied to the encoder's. The output of the decoder network is the reconstruction result, for which we use a specialized form of Binary Cross-Entropy (BCE). The standard BCE loss is:<disp-formula id="eqn-9"><label>(9)</label>
<mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">g</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>o</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula>where <italic>t</italic> is the target value in &#x007B;0, 1&#x007D; and <italic>o</italic> is the output of the network in (0, 1) at each output element. The derivative of the BCE with respect to <italic>o</italic> diminishes severely as <italic>o</italic> approaches <italic>t</italic>, which can result in vanishing gradients during training. Additionally, the standard BCE weights false positives and false negatives equally; because over 95&#x0025; of the voxel grid in the training data is empty, the network can easily fall into a local optimum of the standard BCE by outputting all negatives. We therefore re-weight the two terms:<disp-formula id="eqn-10"><label>(10)</label>
<mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mi>t</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>o</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03B3;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi mathvariant="normal">l</mml:mi><mml:mi mathvariant="normal">o</mml:mi><mml:mi mathvariant="normal">g</mml:mi></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>o</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula></p>
<p>During training, we set <italic>&#x03B3;</italic> to 0.95, strongly penalizing false negatives while reducing the penalty for false positives. Setting <italic>&#x03B3;</italic> too high results in noisy reconstructions, while setting it too low results in reconstructions that neglect salient object details and structure. Thus, the loss of the 3D-VAE network is defined as<disp-formula id="eqn-11"><label>(11)</label>
<mml:math id="mml-eqn-11" display="block"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>w</mml:mi><mml:msub><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x03BC;</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>&#x03C3;</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:msub></mml:math>
</disp-formula>where <italic>w</italic> is the weight for balancing the reconstruction loss and the KL divergence.</p>
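<p>Eqs. (10) and (11) can be sketched as follows. This is an illustration rather than the authors' code: averaging the per-voxel loss over the grid and the clipping guard on the logarithms are assumptions.</p>

```python
import numpy as np

def weighted_bce(t, o, gamma=0.95, eps=1e-7):
    """Eq. (10): BCE with false negatives weighted by gamma and false
    positives by (1 - gamma), averaged over all voxels (assumed)."""
    o = np.clip(o, eps, 1.0 - eps)            # keep the logs finite
    return np.mean(-gamma * t * np.log(o)
                   - (1.0 - gamma) * (1.0 - t) * np.log(1.0 - o))

def vae_loss(t, o, mu, sigma, w=1.0):
    """Eq. (11): reconstruction loss plus the w-weighted KL term."""
    kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)
    return weighted_bce(t, o) + w * kl

# With gamma = 0.95, missing an occupied voxel (false negative) costs
# 19x more than predicting a spurious one (false positive).
fn = weighted_bce(np.array([1.0]), np.array([0.5]))
fp = weighted_bce(np.array([0.0]), np.array([0.5]))
```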
<p>It should be noted that each element of the final layer's output can be interpreted as the predicted probability that a voxel is present at the corresponding location. Down-sampling in the encoder network is accomplished via strided convolutions (as opposed to pooling) in every second layer. Up-sampling in the decoder network is accomplished via fractionally strided convolutions, also in every second layer.</p>
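<p>The strided and fractionally strided layers are mirror images in terms of spatial resolution. The shape arithmetic is sketched below; the kernel size 4, stride 2, and padding 1 are illustrative assumptions, since the text only specifies that every second layer changes resolution.</p>

```python
def conv_out(size, kernel, stride, padding):
    """Spatial size after a strided convolution (per axis)."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Spatial size after a fractionally strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel

# A stride-2 layer halves the grid; its transposed counterpart undoes it.
down = conv_out(32, kernel=4, stride=2, padding=1)    # encoder step
up = deconv_out(down, kernel=4, stride=2, padding=1)  # decoder step
```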
<p>The network is initialized with Glorot Initialization [<xref ref-type="bibr" rid="ref-33">33</xref>], and all the output layers are Batch Normalized [<xref ref-type="bibr" rid="ref-27">27</xref>]. The variance and mean parameters of the latent layer are Batch Normalized individually, such that the output of the latent layer during training remains stochastic under the VAE reparameterization trick.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Multi-feature for Symmetry Prediction Network</title>
<p>In this section, we propose the symmetry prediction network by combining the cross-modal features from CTN and geometric features from 3D-VAE. The overall network is presented in <xref ref-type="fig" rid="fig-4">Fig. 4</xref>.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>The architecture of the symmetry prediction network</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-4.png"/>
</fig>
<p>Thanks to the 3D-VAE, we have a relatively complete shape with stronger geometric features than the initial depth, which gives us the opportunity to learn from symmetry detection methods [<xref ref-type="bibr" rid="ref-34">34</xref>,<xref ref-type="bibr" rid="ref-35">35</xref>] that have proved effective. Inspired by PRS-Net [<xref ref-type="bibr" rid="ref-25">25</xref>], we define a Convolutional Neural Network (CNN) to predict a fixed number (three in practice) of symmetry planes, not all of which may be valid. Duplicate or invalid symmetry planes are removed in the validation stage.</p>
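<p>The duplicate-removal step in the validation stage can be sketched as a normal-angle check between predicted planes. This is a simplified assumption about the criterion: the 30&#x00B0; threshold is illustrative, and a PRS-Net-style validation would additionally discard planes with a high symmetry-distance error, which is omitted here.</p>

```python
import numpy as np

def dedup_planes(planes, angle_thresh_deg=30.0):
    """Keep a plane only if its normal is not close (up to sign) to an
    already-kept normal. Planes are (n, d) pairs for the plane
    n . p + d = 0 with unit normal n; the threshold is an assumption."""
    cos_thresh = np.cos(np.radians(angle_thresh_deg))
    kept = []
    for n, d in planes:
        n = np.asarray(n, dtype=float)
        # abs() treats n and -n as the same plane orientation.
        if not any(abs(n @ m) > cos_thresh for m, _ in kept):
            kept.append((n, d))
    return kept

planes = [(np.array([1.0, 0.0, 0.0]), 0.0),
          (np.array([0.98, 0.199, 0.0]), 0.1),   # near-duplicate of the first
          (np.array([0.0, 1.0, 0.0]), 0.0)]
unique = dedup_planes(planes)
```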
<p>From the above sections we obtain a variety of features: cross-modal features from the CTN and enhanced geometric features from the 3D-VAE. How do we aggregate these features to predict symmetry? A simple way is to directly use the self-reconstruction result from the 3D-VAE, which is close to the complete shape. This has proved valuable, as it transforms the symmetry detection of an incomplete shape into that of a complete shape by providing global geometric features for the network. However, there are two problems. First, the result of the VAE is only relatively complete, not absolutely complete, and much detailed information is inevitably lost through the multi-layer network and random sampling. Second, we would not make full use of the additional information we already have, such as the color features and fine depth features from RGB-D: the former carry rich texture information, while the latter contain many contour details. Although there is only one view, these features have been effectively extracted by the CTN.</p>
<p>Thus, we place these existing features into the corresponding voxels according to their respective characteristics to form the final voxel features. Suppose the cross-modal features <italic>F</italic><sub><italic>c</italic></sub> from the CTN are high-dimensional. The geometric features from the 3D-VAE are binary (0&#x2013;1) features indicating whether a voxel lies on the surface. For dimensional consistency, we take the features of the penultimate layer of the VAE as the geometric features <italic>F</italic><sub><italic>g</italic></sub>.<disp-formula id="eqn-12"><label>(12)</label>
<mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mi>F</mml:mi><mml:mi>A</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>F</mml:mi><mml:mi>c</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mspace width="thickmathspace" /><mml:mspace width="thickmathspace" /><mml:msub><mml:mi>F</mml:mi><mml:mi>g</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math>
</disp-formula></p>
<p>Then, we feed <italic>F</italic><sub><italic>A</italic></sub> into the symmetry prediction network, which has six 3D convolution layers with kernel size 3, padding 1, and stride 1. Each 3D convolution is followed by a max-pooling of kernel size 2 and a leaky ReLU [<xref ref-type="bibr" rid="ref-36">36</xref>] activation. These are followed by fully connected layers that predict the parameters of the symmetry planes.</p>
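<p>The spatial bookkeeping of this head can be traced as follows; the 64<sup>3</sup> input resolution is an assumption (the text does not state the voxel grid size), chosen so that six halvings reach a 1<sup>3</sup> feature before the fully connected layers.</p>

```python
def head_spatial_size(size=64, num_stages=6):
    """Trace the spatial resolution through the six conv(k3, s1, p1)
    + max-pool(k2) stages of the symmetry prediction head."""
    for _ in range(num_stages):
        size = (size + 2 * 1 - 3) // 1 + 1    # k3/s1/p1 conv keeps the size
        size = size // 2                      # k2 max-pool halves it
    return size
```

With a 64<sup>3</sup> grid the head collapses the volume to a single spatial cell, so the subsequent fully connected layers see a plain channel vector.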
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Results and Applications</title>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Datasets</title>
<p>Our experiments are conducted on three public datasets: ShapeNet [<xref ref-type="bibr" rid="ref-37">37</xref>], YCB [<xref ref-type="bibr" rid="ref-38">38</xref>], and ScanNet [<xref ref-type="bibr" rid="ref-39">39</xref>].</p>
<p>ShapeNet: We use the train set, which contains 100,000 training RGB-D images, and split the test set into two subsets: holdout view and holdout instance.</p>
<p>YCB: We follow the original train/test split established in [<xref ref-type="bibr" rid="ref-38">38</xref>].</p>
<p>ScanNet: We use the train set, which contains 13,126 training RGB-D images, and split the test set into two subsets: holdout view and holdout scene; we test only on holdout view.</p>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Evaluation Metric</title>
<p>In order to determine whether a predicted symmetry is a true positive or a false positive, we compute a dense symmetry error from the difference between the predicted symmetry and the ground-truth symmetry. For a reflectional symmetry, we compute the dense symmetry error as:<disp-formula id="eqn-13"><label>(13)</label>
<mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mrow><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mi>i</mml:mi><mml:mi>N</mml:mi></mml:munderover><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mi>T</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mi>&#x03C1;</mml:mi></mml:mfrac></mml:mrow></mml:mstyle></mml:mstyle></mml:math>
</disp-formula>where <italic>T</italic><sub><italic>ref</italic></sub> and <inline-formula id="ieqn-15">
<mml:math id="mml-ieqn-15"><mml:msub><mml:mrow><mml:mover><mml:mi>T</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math>
</inline-formula> are the symmetric transformations of the predicted symmetry and the ground-truth symmetry of a complete shape with points <italic>P</italic>&#x2009;&#x003D;&#x2009;&#x007B;<italic>P</italic><sub><italic>i</italic></sub>&#x007D;, <italic>i</italic>&#x2009;&#x2208;&#x2009;[1, <italic>N</italic>], and <italic>&#x03C1;</italic> is the max distance from the points in <italic>P</italic> to the symmetric plane of <inline-formula id="ieqn-16">
<mml:math id="mml-ieqn-16"><mml:msub><mml:mrow><mml:mover><mml:mi>T</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi></mml:mrow></mml:msub></mml:math>
</inline-formula>. For rotational symmetry, we compute the dense symmetry error as:</p>
<p><disp-formula id="eqn-14"><label>(14)</label>
<mml:math id="mml-eqn-14" display="block"><mml:msub><mml:mi>&#x03B5;</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow><mml:mo fence="false" stretchy="false">|</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac></mml:mrow><mml:munder><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x03B3;</mml:mi><mml:mi>&#x03F5;</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x0393;</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mo>&#x2061;</mml:mo><mml:munderover><mml:mrow><mml:mo movablelimits="false">&#x2211;</mml:mo></mml:mrow><mml:mi>i</mml:mi><mml:mi>N</mml:mi></mml:munderover><mml:mstyle displaystyle="true" scriptlevel="0"><mml:mrow><mml:mfrac><mml:mrow><mml:mo fence="false" stretchy="false">&#x2016;</mml:mo><mml:msub><mml:mi>T</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mover><mml:mi>T</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x03B3;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mo fence="false" 
stretchy="false">&#x2016;</mml:mo><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mi>&#x03C1;</mml:mi></mml:mfrac></mml:mrow></mml:mstyle></mml:mstyle></mml:mstyle></mml:math>
</disp-formula>where <italic>T</italic><sub><italic>rot</italic>,<italic>&#x03B3;</italic></sub> is the rotational transformation of the predicted symmetry with a rotation angle of <italic>&#x03B3;</italic>. The set of rotation angles is &#x0393;&#x2009;&#x003D;&#x2009;&#x007B;&#x03BA;&#x2009;&#x22C5;&#x2009;&#x03C0;/8&#x007D;<sub>&#x03BA;&#x003D;1, &#x2026;, 16</sub>, and <italic>&#x03C1;</italic> is the max distance from the points in <italic>P</italic> to the rotational axis of <inline-formula id="ieqn-17">
<mml:math id="mml-ieqn-17"><mml:msub><mml:mrow><mml:mover><mml:mi>T</mml:mi><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math>
</inline-formula>. We set the dense symmetry error threshold to be 0.25 for both reflectional and rotational symmetries.</p>
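<p>For the reflectional case, Eq. (13) can be sketched as below. The (<italic>n</italic>, <italic>d</italic>) plane encoding (unit normal <italic>n</italic>, offset <italic>d</italic> for the plane <italic>n</italic>&#x00B7;<italic>p</italic>&#x2009;+&#x2009;<italic>d</italic>&#x2009;=&#x2009;0) is an assumption about the representation, not stated in the text.</p>

```python
import numpy as np

def reflect(points, n, d):
    """Mirror points through the plane n . p + d = 0 (unit normal n)."""
    dist = points @ n + d
    return points - 2.0 * dist[:, None] * n

def dense_ref_error(points, pred, gt):
    """Eq. (13): mean distance between the predicted and ground-truth
    reflections of each point, normalized by rho, the max point-to-plane
    distance for the ground-truth plane."""
    n_p, d_p = pred
    n_g, d_g = gt
    rho = np.max(np.abs(points @ n_g + d_g))
    diff = np.linalg.norm(reflect(points, n_p, d_p)
                          - reflect(points, n_g, d_g), axis=1)
    return diff.mean() / rho

# A perfect prediction has zero error, well under the 0.25 threshold.
P = np.random.default_rng(1).standard_normal((100, 3))
gt_plane = (np.array([1.0, 0.0, 0.0]), 0.2)
err_perfect = dense_ref_error(P, gt_plane, gt_plane)
```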
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Ablation Studies for Symmetry Prediction</title>
<p>To study the importance of each component of our method, we compare our full method against two variants. A specific part of the pipeline is taken out for each variant, as follows:</p>
<p><italic>No 3D-VAE</italic>: Without the reconstruction part in our method, we predict symmetry only from the incomplete shape, which lacks a lot of geometric information.</p>
<p><italic>No feature fusion for symmetry prediction</italic>: Without the aggregation of color features, depth features and geometric features, we use the result of 3D VAE to predict symmetry directly.</p>
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows the results of our ablation study for reflectional symmetry detection. The full method outperforms the simpler variants in all cases. The self-reconstruction part (3D-VAE) is crucial to our method, perhaps because it provides more geometric information for symmetry detection. In addition, comparing the results of using only geometric features with those of the fused features (color, depth, geometry), we see that in most cases more features help the final symmetry prediction, though not by much, which indicates that the global geometric features are the most important for symmetry detection.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Ablation studies: comparison of symmetry prediction performance on two subsets of ShapeNet between our full proposed method (red) and its several variants (blue: without 3D VAE; green: without feature fusion)</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-5.png"/>
</fig>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Studies for Self-Reconstruction</title>
<p>To study the importance of the CTN to the 3D-VAE, we compare our self-reconstruction results with specific components removed: <italic>w/o independent network for color and depth</italic>, <italic>w/o channel transformer</italic>, <italic>w/o 3D latent layer</italic>. Without independent networks for color and depth, we simply use a single 3D U-Net to extract the RGB-D features.</p>
<p><xref ref-type="table" rid="table-1">Tab. 1</xref> shows the self-reconstruction accuracy of these ablation studies. It is worth noting that independent networks are very useful for feature extraction, perhaps because color and depth are essentially different modalities: the former provides more detailed information, the latter more contour information, and treating them jointly may waste information. By analyzing the data in <xref ref-type="table" rid="table-1">Tab. 1</xref>, we can see that the 3D CNN in the latent layer also contributes to the final accuracy, which shows that global information is also useful when fitting the mean and variance parameters.</p>
<table-wrap id="table-1"><label>Table 1</label>
<caption>
<title>Ablation studies: Comparison of self-reconstruction performance</title></caption>
<table><colgroup><col align="left"/><col align="left"/>
</colgroup>
<thead>
<tr>
<th align="left">Method</th>
<th align="left">Accuracy (&#x0025;)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">w/o independent networks for color and depth</td>
<td align="left">75</td>
</tr>
<tr>
<td align="left">w/o channel transformer</td>
<td align="left">81</td>
</tr>
<tr>
<td align="left">w/o 3D latent layer</td>
<td align="left">83</td>
</tr>
</tbody>
</table>
</table-wrap>
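<p>The latent step the ablation refers to can be sketched as follows. This is a minimal illustrative example under assumed details: a VAE latent layer predicts a mean and a log-variance per latent dimension, draws a sample with the reparameterization trick, and is regularized by a KL divergence toward a standard normal; the dimensions and toy values are assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

mu = np.zeros(16)        # toy predicted mean
log_var = np.zeros(16)   # toy predicted log-variance (sigma = 1)
z = sample_latent(mu, log_var, rng)
print(z.shape, kl_divergence(mu, log_var))  # KL is 0 for a standard normal
```

<p>The point relevant to the ablation is that both parameters, mean and variance, must be fitted per latent dimension, which is where a 3D CNN with a global receptive field can help.</p>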
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Comparison to Baselines</title>
<p>We compare our method with two baselines, RGB-D Retrieval [<xref ref-type="bibr" rid="ref-40">40</xref>] and Geometric Fitting [<xref ref-type="bibr" rid="ref-41">41</xref>]. The comparisons are plotted in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>; our method achieves the best score on all data subsets.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Comparisons with baselines on the performance of predicting reflectional symmetry</title></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="IASC_30298-fig-6.png"/>
</fig>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>We have presented a robust symmetry prediction method that benefits from global geometric features reconstructed from a single RGB-D image. A channel-transformer network is designed to aggregate features from color and depth separately, which substantially aids feature extraction, and the 3D VAE network yields a significant performance boost in recovering global geometric features. A limitation of our method is that it does not consider rotational symmetry, which is also common. An interesting future direction is to exploit the detection-by-reconstruction module for the single-view reconstruction task itself, preserving more geometric information through symmetry reflection rather than relying solely on deep learning networks.</p>
</sec>
</body>
<back><fn-group>
<fn fn-type="other">
<p><bold>Funding Statement:</bold> The authors received no specific funding for this study.</p>
</fn>
<fn fn-type="conflict">
<p><bold>Conflicts of Interest:</bold> The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</fn>
</fn-group>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Kazhdan</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Funkhouser</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Rusinkiewicz</surname></string-name></person-group>, &#x201C;<article-title>Symmetry descriptors and 3D shape matching</article-title>,&#x201D; in <conf-name>Proceedings of the Symposium on Geometry Processing</conf-name>, pp. <fpage>115</fpage>&#x2013;<lpage>123</lpage>, <year>2004</year>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Podolak</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Shilane</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Golovinskiy</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Rusinkiewicz</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Funkhouser</surname></string-name></person-group>, &#x201C;<article-title>A planar-reflective symmetry transform for 3D shapes</article-title>,&#x201D; <source>ACM Transactions on Graphics</source>, vol. <volume>25</volume>, no. <issue>3</issue>, pp. <fpage>549</fpage>&#x2013;<lpage>559</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Simari</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kalogerakis</surname></string-name> and <string-name><given-names>K.</given-names> <surname>Singh</surname></string-name></person-group>, &#x201C;<article-title>Folding meshes: Hierarchical mesh segmentation based on planar symmetry</article-title>,&#x201D; <source>Symposium on Geometry Processing</source>, vol. <volume>256</volume>, pp. <fpage>111</fpage>&#x2013;<lpage>119</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Choi</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Hong</surname></string-name></person-group>, &#x201C;<article-title>Volumetric object modeling using internal shape preserving constraint in unity 3D</article-title>,&#x201D; <source>Intelligent Automation &#x0026; Soft Computing</source>, vol. <volume>32</volume>, no. <issue>3</issue>, pp. <fpage>1541</fpage>&#x2013;<lpage>1556</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Lutsiv</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Maksymyuk</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Beshley</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Lavriv</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Andrushchak</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deep semisupervised learning-based network anomaly detection in heterogeneous information systems</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>70</volume>, no. <issue>1</issue>, pp. <fpage>413</fpage>&#x2013;<lpage>431</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Mehmood</surname></string-name>, <string-name><given-names>A. W.</given-names> <surname>Khan</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Aslam</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ahmad</surname></string-name>, <string-name><given-names>A. M.</given-names> <surname>El-Sherbeeny</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Requirement design for software configuration and system modeling</article-title>,&#x201D; <source>Intelligent Automation &#x0026; Soft Computing</source>, vol. <volume>32</volume>, no. <issue>1</issue>, pp. <fpage>441</fpage>&#x2013;<lpage>454</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Ramachandram</surname></string-name> and <string-name><given-names>G. W.</given-names> <surname>Taylor</surname></string-name></person-group>, &#x201C;<article-title>Deep multimodal learning: A survey on recent advances and trends</article-title>,&#x201D; <source>IEEE Signal Processing Magazine</source>, vol. <volume>34</volume>, no. <issue>6</issue>, pp. <fpage>96</fpage>&#x2013;<lpage>108</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Hazirbas</surname></string-name>, <string-name><given-names>L. N.</given-names> <surname>Ma</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Domokos</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Cremers</surname></string-name></person-group>, &#x201C;<article-title>FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture</article-title>,&#x201D; in <conf-name>Asian Conf. on Computer Vision</conf-name>, <conf-loc>Asian Conference on Computer Vision (ACCV)</conf-loc>, pp. <fpage>213</fpage>&#x2013;<lpage>228</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Ngiam</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Khosla</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Kim</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Nam</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Lee</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Multimodal deep learning</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <conf-loc>International Conference on Machine Learning</conf-loc>, <year>2011</year>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Zeng</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Tong</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Yan</surname></string-name>, <string-name><given-names>W. X.</given-names> <surname>Sun</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<chapter-title>Deep surface normal estimation with hierarchical RGB-D fusion</chapter-title>,&#x201D; in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>6153</fpage>&#x2013;<lpage>6162</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Zhou</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Enhancing Chinese character representation with lattice-aligned attention</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks and Learning Systems</source>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, 2021. <uri xlink:href="https://doi.org/10.1109/TNNLS.2021.3114378">https://doi.org/10.1109/TNNLS.2021.3114378</uri>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>Y. H.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Cai</surname></string-name>, <string-name><given-names>Z. W.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhao</surname></string-name> and <string-name><given-names>K. Q.</given-names> <surname>Huang</surname></string-name></person-group>, &#x201C;<chapter-title>Locality-sensitive deconvolution networks with gated fusion for RGB-D indoor semantic segmentation</chapter-title>,&#x201D; in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>3029</fpage>&#x2013;<lpage>3037</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S. J.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>J. Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>Y. H.</given-names> <surname>Li</surname></string-name> and <string-name><given-names>Z. M.</given-names> <surname>Guo</surname></string-name></person-group>, &#x201C;<article-title>Modality compensation network: Cross-modal adaptation for action recognition</article-title>,&#x201D; <source>IEEE Transactions on Image Processing</source>, vol. <volume>29</volume>, pp. <fpage>3957</fpage>&#x2013;<lpage>3969</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. K.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W. B.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>F. C.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>T. Y.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Rong</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deep multimodal fusion by channel exchanging</article-title>,&#x201D; <source>Neural Information Processing Systems</source>, vol. <volume>33</volume>, pp. <fpage>4835</fpage>&#x2013;<lpage>4845</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Marola</surname></string-name></person-group>, &#x201C;<article-title>On the detection of axes of symmetry of symmetric and almost symmetric planar images</article-title>,&#x201D; <source>Pattern Analysis and Machine Intelligence</source>, vol. <volume>11</volume>, no. <issue>1</issue>, pp. <fpage>239</fpage>&#x2013;<lpage>245</lpage>, <year>1989</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Wolter</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Woo</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Volz</surname></string-name></person-group>, &#x201C;<article-title>Optimal algorithms for symmetry detection in two and three dimensions</article-title>,&#x201D; <source>The Visual Computer</source>, vol. <volume>1</volume>, pp. <fpage>37</fpage>&#x2013;<lpage>48</lpage>, <year>1985</year>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. J.</given-names> <surname>Atallah</surname></string-name></person-group>, &#x201C;<article-title>On symmetry detection</article-title>,&#x201D; <source>IEEE Transactions on Computers</source>, vol. <volume>34</volume>, no. <issue>7</issue>, pp. <fpage>663</fpage>&#x2013;<lpage>666</lpage>, <year>1984</year>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Alt</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Mehlhorn</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Wagener</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Welzl</surname></string-name></person-group>, &#x201C;<article-title>Congruence, similarity, and symmetries of geometric objects</article-title>,&#x201D; <source>Discrete &#x0026; Computational Geometry</source>, vol. <volume>3</volume>, pp. <fpage>237</fpage>&#x2013;<lpage>256</lpage>, <year>1988</year>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>J.</given-names> <surname>Sherrah</surname></string-name></person-group>, &#x201C;<article-title>3D symmetry detection using the extended Gaussian image</article-title>,&#x201D; <source>Pattern Analysis and Machine Intelligence</source>, vol. <volume>19</volume>, no. <issue>2</issue>, pp. <fpage>164</fpage>&#x2013;<lpage>168</lpage>, <year>1997</year>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Kazhdan</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Chazelle</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Dobkin</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Funkhouser</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Rusinkiewicz</surname></string-name></person-group>, &#x201C;<article-title>A reflective symmetry descriptor for 3D models</article-title>,&#x201D; <source>Algorithmica</source>, vol. <volume>38</volume>, no. <issue>1</issue>, pp. <fpage>201</fpage>&#x2013;<lpage>225</lpage>, <year>2003</year>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>N. J.</given-names> <surname>Mitra</surname></string-name>, <string-name><given-names>L. J.</given-names> <surname>Guibas</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Pauly</surname></string-name></person-group>, &#x201C;<chapter-title>Partial and approximate symmetry detection for 3D geometry</chapter-title>,&#x201D; in <source>Special Interest Group on GRAPHics and Interactive Techniques</source>, <comment>ACM Trans. on Graphics (Proc. SIGGRAPH)</comment>, vol. <volume>25</volume>, no. <issue>3</issue>, pp. <fpage>560</fpage>&#x2013;<lpage>568</lpage>, <year>2006</year>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Johan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Ye</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Lu</surname></string-name></person-group>, &#x201C;<article-title>Efficient 3D reflection symmetry detection: A view-based approach</article-title>,&#x201D; <source>Graphical Models</source>, vol. <volume>83</volume>, pp. <fpage>2</fpage>&#x2013;<lpage>14</lpage>, <year>2016</year>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Ovsjanikov</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Guibas</surname></string-name></person-group>, &#x201C;<article-title>Global intrinsic symmetries of shapes</article-title>,&#x201D; <source>Computer Graphics Forum</source>, vol. <volume>27</volume>, no. <issue>5</issue>, pp. <fpage>1341</fpage>&#x2013;<lpage>1348</lpage>, <year>2008</year>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. L.</given-names> <surname>Teo</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Fermuller</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Aloimonos</surname></string-name></person-group>, &#x201C;<article-title>Detection and segmentation of 2D curved reflection symmetric structures</article-title>,&#x201D; in <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>, pp. <fpage>1644</fpage>&#x2013;<lpage>1652</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Gao</surname></string-name>, <string-name><given-names>L. X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>H. Y.</given-names> <surname>Meng</surname></string-name>, <string-name><given-names>Y. H.</given-names> <surname>Ren</surname></string-name>, <string-name><given-names>Y. K.</given-names> <surname>Lai</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>PRS-net: Planar reflective symmetry detection net for 3D models</article-title>,&#x201D; <source>IEEE Transactions on Visualization and Computer Graphics</source>, vol. <volume>27</volume>, no. <issue>6</issue>, pp. <fpage>3007</fpage>&#x2013;<lpage>3018</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H. B.</given-names> <surname>Huang</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Kalogerakis</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Chaudhuri</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Ceylan</surname></string-name>, <string-name><given-names>V. G.</given-names> <surname>Kim</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Learning local shape descriptors from part correspondences with multiview convolutional networks</article-title>,&#x201D; <source>ACM Transactions on Graphics</source>, vol. <volume>37</volume>, no. <issue>1</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>14</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Ioffe</surname></string-name> and <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name></person-group>, &#x201C;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201D; in <conf-name>Int. Conf. on Machine Learning</conf-name>, <conf-loc>International Conference on Machine Learning</conf-loc>, pp. <fpage>448</fpage>&#x2013;<lpage>456</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hu</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Cai</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Dynamic modeling cross-modal interactions in two-phase prediction for entity-relation extraction</article-title>,&#x201D; <source>IEEE Transactions on Neural Networks and Learning Systems</source>, pp. <fpage>1</fpage>&#x2013;<lpage>10</lpage>, <year>2021</year>. <uri xlink:href="https://doi.org/10.1109/TNNLS.2021.3104971">https://doi.org/10.1109/TNNLS.2021.3104971</uri>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>W. G.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>You</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Seo</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Kwak</surname></string-name> and <string-name><given-names>B.</given-names> <surname>Han</surname></string-name></person-group>, &#x201C;<chapter-title>Domain-Specific Batch Normalization for Unsupervised Domain Adaptation</chapter-title>,&#x201D; in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>7354</fpage>&#x2013;<lpage>7362</lpage>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. K.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>F. C.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lu</surname></string-name> and <string-name><given-names>A. B.</given-names> <surname>Yao</surname></string-name></person-group>, &#x201C;<article-title>Learning deep multimodal feature representation with asymmetric multi-layer fusion</article-title>,&#x201D; in <conf-name>Proceedings of the 28th ACM International Conference on Multimedia</conf-name>, pp. <fpage>3902</fpage>&#x2013;<lpage>3910</lpage>, <year>2020</year>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. S.</given-names> <surname>Almasoud</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Abdalla</surname></string-name>, <string-name><given-names>F. N.</given-names> <surname>Al-Wesabi</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Elsafi</surname></string-name>, <string-name><given-names>M. A.</given-names> <surname>Duhayyim</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Parkinson&#x2019;s detection using RNN-graph-LSTM with optimization based on speech signals</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>72</volume>, no. <issue>1</issue>, pp. <fpage>871</fpage>&#x2013;<lpage>886</lpage>, <year>2022</year>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X. R.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Sun</surname></string-name> and <string-name><given-names>X. Z.</given-names> <surname>He</surname></string-name></person-group>, &#x201C;<article-title>Vehicle re-identification model based on optimized densenet121 with joint loss</article-title>,&#x201D; <source>Computers, Materials &#x0026; Continua</source>, vol. <volume>67</volume>, no. <issue>3</issue>, pp. <fpage>3933</fpage>&#x2013;<lpage>3948</lpage>, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Glorot</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name></person-group>, &#x201C;<article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>,&#x201D; in <conf-name>International Conference on Artificial Intelligence and Statistics</conf-name>, vol. <volume>9</volume>, pp. <fpage>249</fpage>&#x2013;<lpage>256</lpage>, <year>2010</year>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C.</given-names> <surname>Funk</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Beyond planar symmetry: Modeling human perception of reflection and rotation symmetries in the wild</article-title>,&#x201D; in <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>, pp. <fpage>793</fpage>&#x2013;<lpage>803</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. K.</given-names> <surname>Vasudevan</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Dyck</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Ziatdinov</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Jesse</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Laanait</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Deep convolutional neural networks for symmetry detection</article-title>,&#x201D; <source>Microscopy and Microanalysis</source>, vol. <volume>24</volume>, no. <issue>S1</issue>, pp. <fpage>112</fpage>&#x2013;<lpage>113</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. L.</given-names> <surname>Maas</surname></string-name>, <string-name><given-names>A. Y.</given-names> <surname>Hannun</surname></string-name> and <string-name><given-names>A. Y.</given-names> <surname>Ng</surname></string-name></person-group>, &#x201C;<article-title>Rectifier nonlinearities improve neural network acoustic models</article-title>,&#x201D; <source>International Conference on Machine Learning</source>, vol. <volume>30</volume>, no. <issue>1</issue>, pp. <fpage>3</fpage>&#x2013;<lpage>8</lpage>, <year>2013</year>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>A. X.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Funkhouser</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Guibas</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Hanrahan</surname></string-name>, <string-name><given-names>Q. X.</given-names> <surname>Huang</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<article-title>Shapenet: An information-rich 3D model repository</article-title>,&#x201D; arXiv preprint arXiv:1512.03012, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Calli</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Singh</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Walsman</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Srinivasa</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Abbeel</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<chapter-title>The YCB object and model set: Towards common benchmarks for manipulation research</chapter-title>,&#x201D; in <source>2015 International Conference on Advanced Robotics (ICAR)</source>, pp. <fpage>510</fpage>&#x2013;<lpage>517</lpage>, <year>2015</year>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Dai</surname></string-name>, <string-name><given-names>A. X.</given-names> <surname>Chang</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Savva</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Halber</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Funkhouser</surname></string-name> <etal>et al.,</etal></person-group> &#x201C;<chapter-title>Scannet: Richly-annotated 3D reconstructions of indoor scenes</chapter-title>,&#x201D; in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>5828</fpage>&#x2013;<lpage>5839</lpage>, <year>2017</year>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Ecins</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Ferm&#x00FC;ller</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Aloimonos</surname></string-name></person-group>, &#x201C;<chapter-title>Seeing behind the scene: Using symmetry to reason about objects in cluttered environments</chapter-title>,&#x201D; in <source>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>, pp. <fpage>7193</fpage>&#x2013;<lpage>7200</lpage>, <year>2018</year>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>Y. Q.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>Y. R.</given-names> <surname>Shen</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Tian</surname></string-name></person-group>, &#x201C;<chapter-title>Foldingnet: Point cloud autoencoder via deep grid deformation</chapter-title>,&#x201D; in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, pp. <fpage>206</fpage>&#x2013;<lpage>215</lpage>, <year>2018</year>.</mixed-citation></ref>
</ref-list>
</back>
</article>