<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">48104</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.048104</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Enhancing Cross-Lingual Image Description: A Multimodal Approach for Semantic Relevance and Stylistic Alignment</article-title>
<alt-title alt-title-type="left-running-head">Enhancing Cross-Lingual Image Description: A Multimodal Approach for Semantic Relevance and Stylistic Alignment</alt-title>
<alt-title alt-title-type="right-running-head">Enhancing Cross-Lingual Image Description: A Multimodal Approach for Semantic Relevance and Stylistic Alignment</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Al-Buraihy</surname><given-names>Emran</given-names></name></contrib>
<contrib id="author-2" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Wang</surname><given-names>Dan</given-names></name><email>wangdan@bjut.edu.cn</email></contrib>
<aff>
<institution>Faculty of Information Technology, Beijing University of Technology</institution>, <addr-line>Beijing, 100124</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Dan Wang. Email: <email>wangdan@bjut.edu.cn</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2024</year></pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>20</day>
<month>6</month>
<year>2024</year></pub-date>
<volume>79</volume>
<issue>3</issue>
<fpage>3913</fpage>
<lpage>3938</lpage>
<history>
<date date-type="received">
<day>28</day>
<month>11</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>28</day>
<month>2</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2024 Al-Buraihy and Wang</copyright-statement>
<copyright-year>2024</copyright-year>
<copyright-holder>Al-Buraihy and Wang</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_48104.pdf"></self-uri>
<abstract>
<p>Cross-lingual image description, the task of generating image captions in a target language from images and descriptions in a source language, is addressed in this study through a novel approach that combines neural network models and semantic matching techniques. Experiments conducted on the Flickr8k and AraImg2k benchmark datasets, featuring images and descriptions in English and Arabic, showcase remarkable performance improvements over state-of-the-art methods. Our model, equipped with the Image &#x0026; Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, significantly enhances the semantic relevance of generated image descriptions. For English-to-Arabic and Arabic-to-English cross-language image description, our approach achieves CIDEr scores of 87.9% for English and 81.7% for Arabic, underscoring the substantial contributions of our methodology. Comparative analyses with previous works further affirm the superior performance of our approach, and visual results show that our model generates image captions that are both semantically accurate and stylistically consistent with the target language. In summary, this study advances the field of cross-lingual image description, offering an effective solution for generating image captions across languages, with the potential to impact multilingual communication and accessibility. Future research directions include expanding to more languages and incorporating diverse visual and textual data sources.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Cross-language image description</kwd>
<kwd>multimodal deep learning</kwd>
<kwd>semantic matching</kwd>
<kwd>reward mechanisms</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>In the digital age, we are amidst an unprecedented era of visual information exchange, fueled by the proliferation of multimedia content on the internet [<xref ref-type="bibr" rid="ref-1">1</xref>]. Among the vast array of media at our disposal, images stand out as a universal language that transcends linguistic barriers, serving as a vital medium for communication and information dissemination [<xref ref-type="bibr" rid="ref-2">2</xref>]. In today&#x2019;s digital landscape, images have become the lingua franca, effortlessly conveying ideas, experiences, and emotions across the global online community [<xref ref-type="bibr" rid="ref-3">3</xref>].</p>
<p>The field of image captioning, which involves automatically generating descriptive captions for images, has emerged as a critical research area with a wide range of applications. It extends its reach from assisting the visually impaired to enriching content retrieval and enhancing user engagement on multimedia platforms [<xref ref-type="bibr" rid="ref-4">4</xref>]. Remarkable progress in this field has been driven by cutting-edge technologies such as object detection, relationship reasoning, and language sequence generation [<xref ref-type="bibr" rid="ref-5">5</xref>].</p>
<p>Yet, the mosaic of languages spoken worldwide presents a formidable challenge for image captioning systems [<xref ref-type="bibr" rid="ref-6">6</xref>]. The task of generating precise, coherent, and culturally relevant image descriptions in multiple languages necessitates a nuanced understanding of both the visual content and the linguistic subtleties inherent to each target language [<xref ref-type="bibr" rid="ref-7">7</xref>]. Conventional image captioning models often falter in capturing these intricacies, leading to translations that lack fluency, coherence, and context, ultimately failing to resonate with speakers of the target language [<xref ref-type="bibr" rid="ref-8">8</xref>].</p>
<p>These multilingual challenges underscore the pressing need for cross-lingual image captioning solutions that bridge linguistic and cultural divides [<xref ref-type="bibr" rid="ref-9">9</xref>]. This need has grown in significance as individuals and communities with diverse linguistic backgrounds increasingly seek access to and comprehension of content from different cultural spheres and regions [<xref ref-type="bibr" rid="ref-10">10</xref>]. Cross-lingual image description tasks, such as the transfer of descriptions from English to Arabic, have become pivotal in this evolving research landscape [<xref ref-type="bibr" rid="ref-11">11</xref>].</p>
<p>The challenge involves creating descriptive image captions in a language different from the original image label, posing a complex issue when dealing with substantial linguistic and cultural differences [<xref ref-type="bibr" rid="ref-12">12</xref>]. Conventional image captioning methods, relying on single-language models, fall short in delivering accurate and culturally resonant descriptions across multiple languages [<xref ref-type="bibr" rid="ref-13">13</xref>].</p>
<p>The primary research problem we tackle in this study revolves around enabling accurate and culturally apt cross-lingual image captioning between Arabic and English. Arabic, a language steeped in rich cultural and linguistic heritage, poses unique challenges due to its complex script and diverse dialects [<xref ref-type="bibr" rid="ref-14">14</xref>]. In contrast, English stands as a widely spoken global language [<xref ref-type="bibr" rid="ref-15">15</xref>]. The challenge lies not only in precisely translating captions between these languages but also in ensuring that the resulting descriptions are semantically coherent, culturally pertinent, and contextually accurate [<xref ref-type="bibr" rid="ref-16">16</xref>].</p>
<p>As shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, there are language-style differences between the Arabic and English descriptions. The source English description of the image is &#x201C;A person in a blue jacket follows two donkeys along a mountain trail&#x201D; (a short descriptive sentence), while the target-domain Arabic description is a more elaborate sentence that translates as <inline-graphic xlink:href="CMC_48104-inline-1.tif"/> (A man wearing a blue jacket and jeans, with a backpack on his back). Furthermore, the semantic emphasis also differs: although both sentences mention &#x201C;a blue jacket,&#x201D; the Arabic description centers on &#x201C;the man,&#x201D; while the English description centers on &#x201C;a person.&#x201D;</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The task of cross-lingual image captioning and our solution</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-1.tif"/>
</fig>
<p>The field of cross-lingual image captioning faces notable limitations, especially in dataset diversity. Many existing studies utilize English datasets with translations that often lack cross-cultural relevance [<xref ref-type="bibr" rid="ref-17">17</xref>]. Additionally, the reliance on machine translation in prior models raises issues of accuracy and cultural sensitivity [<xref ref-type="bibr" rid="ref-18">18</xref>]. Addressing these gaps, our work introduces the AraImg2k Dataset, a comprehensive collection of 2000 images embodying Arab culture, each paired with five carefully crafted captions in Modern Standard Arabic (MSA). This curated dataset aims to authentically represent the rich diversity and nuances of Arab culture.</p>
<p>Previous methods in cross-lingual image captioning struggled with accurately capturing the semantic relationship between images and their captions [<xref ref-type="bibr" rid="ref-19">19</xref>]. To address this, our study introduces a multimodal semantic matching module. This module improves the accuracy of semantic consistency between images and captions across different languages, utilizing multimodal visual-semantic embeddings. This ensures that the generated captions more accurately reflect the original images, enhancing the quality of cross-lingual image descriptions.</p>
<p>Previous methods in cross-lingual captioning often overlooked linguistic subtleties and cultural context [<xref ref-type="bibr" rid="ref-20">20</xref>]. Our research counters this by introducing a language evaluation module. This module adapts to the target language&#x2019;s distribution and style, enabling the creation of captions that are more aligned with linguistic nuances and cultural norms, thereby producing more natural and culturally attuned image descriptions.</p>
<p>Earlier studies in cross-lingual image captioning often lacked comprehensive evaluation metrics, hindering performance assessment. Our research addresses this by employing a range of evaluation metrics, including BLEU [<xref ref-type="bibr" rid="ref-21">21</xref>], ROUGE [<xref ref-type="bibr" rid="ref-22">22</xref>], METEOR [<xref ref-type="bibr" rid="ref-23">23</xref>], CIDEr [<xref ref-type="bibr" rid="ref-24">24</xref>], and SPICE [<xref ref-type="bibr" rid="ref-25">25</xref>]. This allows for a rigorous comparison with previous works and a more detailed evaluation of our approach&#x2019;s effectiveness and superiority in the field.</p>
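<p>To make the n-gram overlap family of metrics concrete, the following is a minimal sketch of BLEU's modified (clipped) n-gram precision, the core quantity behind BLEU-N; the candidate and reference captions below are hypothetical examples, not drawn from the datasets used in this study.</p>

```python
from collections import Counter

def modified_ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    its maximum count in any single reference (the BLEU clipping rule)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate, n)
    # Maximum count of each n-gram across all reference captions.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# Hypothetical tokenized captions.
candidate = "a man in a blue jacket walks a donkey".split()
references = ["a person in a blue jacket follows two donkeys along a mountain trail".split()]
p1 = modified_ngram_precision(candidate, references, n=1)  # 6 of 9 unigrams matched
```

<p>The full BLEU score combines such precisions for n = 1 to 4 with a brevity penalty; CIDEr likewise builds on n-gram overlap but weights n-grams by TF-IDF computed over the reference corpus.</p>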
<p>In light of these advancements and contributions, our research seeks to bridge the gap between languages, cultures, and communities by enhancing the quality and cultural relevance of cross-lingual image descriptions. Through meticulous dataset creation, improved translation techniques, advanced semantic matching, and comprehensive evaluation, we aim to significantly advance the field of cross-lingual image captioning, ultimately fostering more effective cross-cultural understanding and communication.</p>
<p>Therefore, this study presents a comprehensive approach to cross-lingual image captioning, leveraging semantic matching and language evaluation techniques to address the aforementioned challenges. The key contributions of this study can be summarized as follows:
<list list-type="bullet">
<list-item>
<p>Cross-Lingual Image Captioning (Arabic and English): We address the challenge of accurate and culturally relevant cross-lingual captioning between Arabic and English. Our method ensures precise translations, semantic coherence, and cultural relevance, bridging the linguistic and cultural divide.</p></list-item>
<list-item>
<p>AraImg2k Dataset: To overcome the limitations in existing datasets, we introduce AraImg2k, a dataset of 2000 images representing Arab culture, each with five detailed captions in Modern Standard Arabic, reflecting the cultural diversity of the Arab world.</p></list-item>
<list-item>
<p>Multimodal Semantic Matching Module: Our novel module captures the semantic relationship between images and captions in cross-lingual contexts using multimodal visual-semantic embeddings, ensuring captions accurately reflect the image content.</p></list-item>
<list-item>
<p>Language Evaluation Module: This module focuses on understanding the target language&#x2019;s nuances and cultural context, aiding in producing captions that are linguistically and culturally aligned, enhancing the naturalness and relevance of our cross-lingual descriptions.</p></list-item>
<list-item>
<p>Comprehensive Evaluation Metrics: Setting our research apart from previous studies, we employ diverse evaluation metrics like BLEU, ROUGE, METEOR, CIDEr, and SPICE, allowing for a detailed comparison with prior works and demonstrating the effectiveness of our approach.</p></list-item>
</list></p>
<p><xref ref-type="sec" rid="s2">Section 2</xref> is dedicated to an extensive examination of prior research within the same domain. <xref ref-type="sec" rid="s3">Section 3</xref> provides a thorough explication of the essential components of the proposed framework. <xref ref-type="sec" rid="s4">Section 4</xref> presents empirical findings along with a comparative analysis against preceding studies. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> offers a conclusive summary, accompanied by suggestions for prospective research endeavors.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Literature Review</title>
<p>This study delves into the dynamic and diverse landscape of image captioning and the emerging field of cross-lingual image caption generation. Image captioning, situated at the intersection of computer vision and natural language processing, has witnessed significant advancements in recent years, bridging the gap between visual content and human language [<xref ref-type="bibr" rid="ref-26">26</xref>]. In this section, we explore the foundational concepts, methodologies, and key research papers in both image captioning and the evolving domain of cross-lingual image captioning. By examining pioneering work in Arabic and English, we pave the way for our own cross-lingual image captioning approach. This literature review serves as a guiding beacon, illuminating the path of prior research and informing the innovative contributions in subsequent sections.</p>
<sec id="s2_1">
<label>2.1</label>
<title>Image Captioning</title>
<p>Image captioning, a multidisciplinary research area at the confluence of computer vision and natural language processing, centers around the task of automatically generating descriptive text for images [<xref ref-type="bibr" rid="ref-27">27</xref>]. Its significance extends beyond enhancing human-computer interaction and content retrieval; it also plays a pivotal role in enabling visually impaired individuals to comprehend visual content [<xref ref-type="bibr" rid="ref-4">4</xref>]. The evolution of image captioning has been propelled by remarkable progress driven by deep learning techniques and the availability of extensive image-text datasets [<xref ref-type="bibr" rid="ref-28">28</xref>]. In this subsection, we delve into the fundamentals of image captioning and provide an overview of key research papers in both Arabic and English domains.</p>
<sec id="s2_1_1">
<label>2.1.1</label>
<title>Arabic Image Captioning</title>
<p>In the domain of Arabic image captioning, researchers in [<xref ref-type="bibr" rid="ref-29">29</xref>] proposed an innovative approach tailored to generating image captions specifically for clothing images. Leveraging deep learning techniques, their model proficiently generates Arabic captions describing clothing items, enhancing accessibility to fashion-related visual content. Meanwhile, researchers in [<xref ref-type="bibr" rid="ref-30">30</xref>] investigated Arabic image captioning, focusing on the impact of text pre-processing on attention weights and BLEU-N scores. Their work sheds light on optimizing the caption generation process in Arabic, taking into account the nuances of text pre-processing. Authors in [<xref ref-type="bibr" rid="ref-31">31</xref>] ventured into automatic Arabic image captioning using a combination of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) language models, alongside Convolutional Neural Networks (CNNs). Their model demonstrated the feasibility of generating Arabic image captions, marking a significant milestone in the field of Arabic image description.</p>
</sec>
<sec id="s2_1_2">
<label>2.1.2</label>
<title>English Image Captioning</title>
<p>In the realm of English image captioning, researchers in [<xref ref-type="bibr" rid="ref-32">32</xref>] proposed a novel approach for automatic caption generation for news images. They employ a multimodal approach that integrates both image content and associated news articles to create coherent and informative image captions. Researchers in [<xref ref-type="bibr" rid="ref-33">33</xref>] introduced a compact image captioning model with an attention mechanism, focusing on the efficiency of caption generation. Their research contributes to streamlining image captioning models for practical applications. In addition, researchers in [<xref ref-type="bibr" rid="ref-34">34</xref>] provided valuable insights from lessons learned during the 2015 MSCOCO Image Captioning Challenge, highlighting key takeaways and challenges in image captioning.</p>
</sec>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Cross-Lingual Image Captioning</title>
<p>Cross-lingual image caption generation is an emerging research area that addresses the challenge of automatically generating descriptive captions for images in different languages [<xref ref-type="bibr" rid="ref-35">35</xref>]. It plays a pivotal role in facilitating multilingual communication and cross-cultural understanding, enabling individuals from diverse linguistic backgrounds to access and comprehend visual content [<xref ref-type="bibr" rid="ref-36">36</xref>]. In this subsection, we delve into cross-lingual image caption generation and provide summaries of key research papers in Arabic-English, Chinese-English, German-English, and Japanese-English domains.</p>
<sec id="s2_2_1">
<label>2.2.1</label>
<title>Arabic-English Cross-Lingual Image Captioning</title>
<p>Researchers in [<xref ref-type="bibr" rid="ref-37">37</xref>] introduced a novel approach, &#x201C;Wikily&#x201D; Supervised Neural Translation, tailored to Arabic-English cross-lingual tasks, including image captioning. Their model leverages Wikipedia as a resource for cross-lingual supervision, showcasing its efficacy in generating accurate image captions across the Arabic-English language barrier. In addition, researchers in [<xref ref-type="bibr" rid="ref-38">38</xref>] contributed to Arabic-English cross-lingual image captioning by providing valuable resources and end-to-end neural network models, enriching the accessibility and understanding of visual content for Arabic speakers.</p>
</sec>
<sec id="s2_2_2">
<label>2.2.2</label>
<title>Chinese-English Cross-Lingual Image Captioning</title>
<p>Researchers in [<xref ref-type="bibr" rid="ref-39">39</xref>] introduced COCO-CN, a resource for cross-lingual image tagging, captioning, and retrieval tasks involving Chinese and English. Their work underscores the significance of bridging the linguistic gap between these two languages in the context of visual content. Additionally, researchers in [<xref ref-type="bibr" rid="ref-40">40</xref>] explored fluency-guided cross-lingual image captioning, particularly focusing on Chinese-English pairs. Their approach highlights the importance of fluency in generating high-quality image captions that resonate with speakers of both languages.</p>
</sec>
<sec id="s2_2_3">
<label>2.2.3</label>
<title>German-English Cross-Lingual Image Captioning</title>
<p>Researchers in [<xref ref-type="bibr" rid="ref-41">41</xref>] presented multimodal pivots for image caption translation, addressing the German-English cross-lingual challenge. Their work explores strategies for effectively translating image captions between these languages using multimodal approaches. Furthermore, researchers in [<xref ref-type="bibr" rid="ref-42">42</xref>] contributed to the field with the creation of Multi30k, a multilingual English-German image description dataset. Their work serves as a valuable resource for cross-lingual image captioning research, fostering improved understanding and communication between German and English speakers.</p>
</sec>
<sec id="s2_2_4">
<label>2.2.4</label>
<title>Japanese-English Cross-Lingual Image Captioning</title>
<p>Researchers in [<xref ref-type="bibr" rid="ref-43">43</xref>] presented the STAIR captions dataset, a substantial resource for Japanese image captioning. Their work advances the availability of image description data for Japanese speakers, contributing to the field&#x2019;s progress in Japanese-English cross-lingual image captioning. Moreover, researchers in [<xref ref-type="bibr" rid="ref-44">44</xref>] delved into cross-lingual image caption generation with a focus on Japanese-English pairs. Their work explores techniques to generate image captions that transcend language barriers, enhancing cross-cultural communication.</p>
<p>In generating <xref ref-type="table" rid="table-1">Table 1</xref>, we employed a meticulous and systematic literature review process to ensure the precision and comprehensiveness of the information presented. This process entailed the following steps:</p>
<p><list list-type="alpha-lower">
<list-item><p>Keyword-Based Search: We initiated our literature review with a keyword-based search in major academic databases. The keywords were carefully chosen to encompass the core themes of our study, namely &#x2018;cross-lingual image captioning&#x2019;, &#x2018;multimodal learning&#x2019;, and &#x2018;semantic matching&#x2019;.</p></list-item>
<list-item><p>Selection Criteria: Upon retrieving a preliminary set of papers, we applied specific selection criteria to filter out the most relevant studies. These criteria included the recency of publication, relevance to our study&#x2019;s focus on cross-lingual and multimodal aspects, and the academic credibility of the sources.</p></list-item>
<list-item><p>Data Extraction and Synthesis: For each selected paper, we extracted key information such as methodologies used, datasets employed, and evaluation metrics applied. This data was then critically analyzed and synthesized to present a comprehensive view of the current research landscape.</p></list-item>
<list-item><p>Tabulation and Cross-Verification: The synthesized data was tabulated in <xref ref-type="table" rid="table-1">Table 1</xref>, ensuring that each entry accurately reflected the corresponding study&#x2019;s contributions and findings. We cross-verified each entry for accuracy and completeness.</p>
</list-item>
<list-item><p>Continuous Updating: Recognizing the dynamic nature of the field, we maintained an ongoing process of updating the table to include the latest significant contributions up until the finalization of our manuscript.</p></list-item>
</list></p>
<p>Through this rigorous process, <xref ref-type="table" rid="table-1">Table 1</xref> was crafted to provide a detailed and accurate summary of existing literature in the field of cross-lingual image captioning, serving as a foundational reference for our study and future research in this domain.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Summary of the literature</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th>Research area</th>
<th>Ref.</th>
<th>Dataset</th>
<th>Data source</th>
<th>Language<break/>(s)</th>
<th>App/Tech</th>
<th>Evaluation metrics</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="6"><bold>Image captioning</bold></td>
<td>[<xref ref-type="bibr" rid="ref-29">29</xref>]</td>
<td>Arabic<break/>Fashion<break/>Data</td>
<td>DeepFashion dataset [<xref ref-type="bibr" rid="ref-45">45</xref>] InFashAIv1 [<xref ref-type="bibr" rid="ref-46">46</xref>]</td>
<td>Arabic</td>
<td>Image captioning, attention mechanism</td>
<td>BLEU</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>&#x2013;</td>
<td>Arabic-Flickr8 [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>Arabic</td>
<td>Attention mechanism, beam search</td>
<td>BLEU, THUMB [<xref ref-type="bibr" rid="ref-47">47</xref>]</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>Arabic corpus</td>
<td>MS-COCO dataset [<xref ref-type="bibr" rid="ref-48">48</xref>], Flickr8k [<xref ref-type="bibr" rid="ref-49">49</xref>]</td>
<td>Arabic</td>
<td>Crowd-Flower crowdsourcing, commercial cloud server FloydHub</td>
<td>BLEU</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>&#x2013;</td>
<td>News images</td>
<td>English</td>
<td>Unsupervised fashion, news media and journalism</td>
<td>BLEU, ROUGE, METEOR, CIDEr, SPICE</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>&#x2013;</td>
<td>MSCOCO, InstaPIC-1.1M [<xref ref-type="bibr" rid="ref-50">50</xref>]</td>
<td>English</td>
<td>Attention mechanism, streamlining image captioning</td>
<td>BLEU, ROUGE, METEOR, CIDEr, SPICE</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>&#x2013;</td>
<td>MS-COCO</td>
<td>English</td>
<td>Qualitatively and quantitatively, probability of the correct description</td>
<td>BLEU, ROUGE, METEOR</td>
</tr>
<tr>
<td rowspan="8"><bold>Cross-lingual Image captioning</bold></td>
<td>[<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>&#x2013;</td>
<td>Wikipedia</td>
<td>Arabic-English</td>
<td>Supervised neural, image captioning</td>
<td>BLEU</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>Arabic-Flickr8</td>
<td>Flickr8k</td>
<td>Arabic-English</td>
<td>End-to-end</td>
<td>BLEU</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>COCO-CN</td>
<td>MS-COCO</td>
<td>Chinese-English</td>
<td>Image captioning, tagging, retrieval, recommendation-assisted annotation system</td>
<td>Precision, Recall, F-measure, BLEU, METEOR, ROUGE-L, CIDEr</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>Flickr30k-CN</td>
<td>Flickr30k [<xref ref-type="bibr" rid="ref-51">51</xref>]</td>
<td>Chinese-English</td>
<td>Image captioning, fluency-guided learning framework</td>
<td>BLEU</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-41">41</xref>]</td>
<td>Bilingual caption</td>
<td>MS-COCO</td>
<td>German-English</td>
<td>Image retrieval</td>
<td>BLEU, METEOR, TER</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-42">42</xref>]</td>
<td>Multi30k</td>
<td>Flickr30k</td>
<td>German-English</td>
<td>Crowdsourced platform</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-43">43</xref>]</td>
<td>STAIR captions</td>
<td>MS-COCO</td>
<td>Japanese-English</td>
<td>A web system for caption annotation, quantitatively and qualitatively</td>
<td>BLEU, ROUGE, CIDEr</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-44">44</xref>]</td>
<td>YJ captions</td>
<td>MS-COCO</td>
<td>Japanese-English</td>
<td>Translation models, multilingual adaptation</td>
<td>BLEU, ROUGE, METEOR, CIDEr, Cross-lingual metrics</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Methodology</title>
<p>As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the cross-lingual image description model proposed in this study consists of three components.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Cross-lingual image captioning model</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-2.tif"/>
</fig>
<p>Naive Image Encoder-Sentence Decoder (Image Description Generation Module): This module is responsible for generating descriptive sentences. It encodes the image into a feature representation and decodes that representation into sentences.</p>
<p>Image &#x0026; Source Language Domain Semantic Matching Module: This module is responsible for providing semantic matching rewards and optimization. It takes into account the semantic information from the source-domain image and the pivot language, mapping them into a common embedding space for semantic matching calculations.</p>
<p>Target Language Domain Evaluation Module: This module is designed to provide language evaluation rewards. It incorporates knowledge about the data distribution in the target language domain for language evaluation constraints.</p>
<p>The first module is responsible for sentence generation, while the latter two modules guide the model to learn semantic matching constraints and language knowledge optimization. This helps the model generate more fluent and semantically rich descriptions.</p>
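<p>As an illustrative sketch only (the weighting scheme, embeddings, and the name <monospace>total_reward</monospace> are hypothetical, not this paper's exact formulation), one way the two auxiliary modules can guide generation is to combine a semantic-matching reward, computed as the cosine similarity of image and caption embeddings in a common space, with a language-evaluation reward:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def total_reward(image_emb, caption_emb, language_score, alpha=0.7):
    """Weighted combination of the semantic-matching reward and the
    target-language evaluation reward; alpha is an assumed hyperparameter."""
    r_match = cosine(image_emb, caption_emb)
    return alpha * r_match + (1 - alpha) * language_score

# Toy 2-d embeddings: a perfectly matched image-caption pair with a
# mid-range language-evaluation score.
r = total_reward([1.0, 0.0], [1.0, 0.0], language_score=0.5)  # 0.7*1.0 + 0.3*0.5 = 0.85
```

<p>A caption that matches the image but reads unnaturally in the target language, or vice versa, would thus receive a lower combined reward than one that satisfies both constraints.</p>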
<sec id="s3_1">
<label>3.1</label>
<title>Data Collection and Preparation</title>
<p>The first step in our methodology was to collect 2000 images representing Arab culture from authentic websites. We selected images from a variety of sources, including museums, cultural institutions, and travel websites, to ensure a diverse and representative set. We then manually wrote five captions for each image in Modern Standard Arabic (MSA), ensuring a variety of descriptions that capture different aspects of each image. The resulting dataset serves as a valuable resource for cross-lingual image captioning research and richly reflects the diversity of Arab culture.</p>
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Image Encoder-Sentence Decoder Module</title>
<p>A naive image encoder-sentence decoder framework is used to generate descriptive sentences. It employs a pre-trained neural network model, ResNet-101 [<xref ref-type="bibr" rid="ref-52">52</xref>], and a fully-connected layer (referred to as <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mi>F</mml:mi><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) to extract features (<inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> from the image (I). A single-layer Long Short-Term Memory (LSTM) network, denoted as <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, is used to decode (<inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) and generate the word at the current time step. The source-domain description (<inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) of the image I is translated into a target-domain pseudo-sentence label (<inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) using Google Translate to initialize this module.
During initialization training, the pre-trained ResNet-101 model is frozen and excluded from optimization, while the fully-connected layer (<inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:mi>F</mml:mi><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and (<inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) are optimized. The optimization objective is to minimize the negative log probability of the correct words in the sentence.</p>
<p>In addition to the methodologies described previously, it is pertinent to elaborate on the translation module employed in our study, particularly for the initial translation between English and Arabic languages. We utilized Google Translate for this purpose, leveraging its capabilities to generate pseudo-sentence labels in the target domain from the source language descriptions. This step was crucial for initializing the model with a basic understanding of cross-lingual semantic structures. It is important to note that these machine-generated translations were primarily used as a starting point. The subsequent modules, namely the Image &#x0026; Source Language Domain Semantic Matching Module and the Target Language Domain Evaluation Module, were designed to refine these translations, ensuring their semantic accuracy and cultural relevance.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mrow><mml:mtext>L</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>G</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mtext>N</mml:mtext></mml:mrow></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mtext>p</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>G</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mrow><mml:mtext>i</mml:mtext></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p><xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref> is derived based on standard practices in neural network training for language modeling, particularly in image captioning tasks. It calculates the negative log probability of the correct word sequence in a generated caption, given an image and the preceding words. This approach is consistent with methodologies adopted in neural network-based natural language processing, as detailed in foundational works such as [<xref ref-type="bibr" rid="ref-34">34</xref>].</p>
<p>In the equation, (N) represents the length of the sentence <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The word (<inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>T</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>) is set as the start symbol (<inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mo fence="false" stretchy="false">&#x27E8;</mml:mo><mml:mi>b</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mo fence="false" stretchy="false">&#x27E9;</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> represents the learning parameters of this module, including (<inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>F</mml:mi><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), (<inline-formula id="ieqn-13"><mml:math 
id="mml-ieqn-13"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>).</p>
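<p>As an illustrative sketch rather than the authors&#x2019; implementation, the objective in Eq. (1) can be computed from the decoder&#x2019;s per-step softmax outputs as follows; the array layout and function name are our own assumptions:</p>

```python
import numpy as np

def caption_nll(step_probs, target_ids):
    """Negative log-likelihood of the correct caption words (cf. Eq. (1)).

    step_probs: (N, V) array; row i is the decoder's softmax over the
                vocabulary at step i, conditioned on the image feature
                v^I and the previous words w_0..w_{i-1}.
    target_ids: length-N sequence of indices of the correct words.
    """
    # Pick out the probability assigned to each correct word.
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    # Sum of -log p over the sentence, as in Eq. (1).
    return float(-np.sum(np.log(picked)))
```

<p>Minimizing this quantity over the training pairs is equivalent to maximum-likelihood training of the decoder.</p>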
<p>In selecting LSTM over Transformer-based models for our cross-lingual image captioning research, we considered several pivotal factors unique to our study&#x2019;s context. LSTM networks, recognized for their efficiency in sequential data processing and less demanding computational requirements, aligned well with our resource constraints and the exploratory nature of our work. This was particularly pertinent given the complexity and specific linguistic characteristics of our primary dataset, AraImg2k, which includes the nuanced morphological features of Arabic. LSTMs&#x2019; proven track record in language modeling provided a solid and interpretable foundation for initial experiments. While we acknowledge the advanced capabilities of Transformers in handling long-range dependencies and their parallel processing strengths, our initial focus was to establish a robust baseline model that effectively balances computational efficiency with the linguistic intricacies of our dataset. Moving forward, we plan to explore the integration of Transformer models to further advance our approach, leveraging their benefits in subsequent phases of our research.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Image &#x0026; Source Language Domain Semantic Matching Module</title>
<p>After the initialization described in <xref ref-type="sec" rid="s3_2">Section 3.2</xref>, the descriptions generated by the model exhibit certain characteristics, such as simple imitation of pseudo-labels, repetitive combinations of high-frequency vocabulary, or a lack of relevance to the content of the image. Manually annotated source language descriptions typically possess rich semantics and provide concrete descriptions of the image content. The source language and the image should contain consistent semantic information.</p>
<p>To address this issue and enhance the semantic relevance of the generated descriptions, the study introduces a multi-modal semantic matching module. This module leverages both the semantic information from the image and the source language to impose constraints on semantic similarity. The aim is to ensure that the generated descriptions are semantically aligned with both the image and the source language, resulting in more meaningful and contextually relevant descriptions.</p>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Cross-Modal Semantic Matching</title>
<p>For heterogeneous images and sentences, the first step is to map the images and sentences into a common embedding space and measure their semantic relatedness. As shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>, the image&#x2019;s semantic embedding network, denoted as (<inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), consists of a CNN encoder (using the pre-trained ResNet-101 model) and a fully connected layer (referred to as <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:mi>F</mml:mi><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>). The text&#x2019;s semantic embedding network, denoted as (<inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is composed of a single-layer LSTM (denoted as <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>). The final hidden vector of (<inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) at the last time step defines the semantic vector of the input sentence in the common embedding space.</p>
<p>By inputting image-sentence pairs (<inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:mrow><mml:mo>(</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), we obtain the image&#x2019;s feature embedding, (<inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>), and the sentence&#x2019;s feature embedding, <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, in the common semantic space. For matching pairs (<inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:mrow><mml:mo>(</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), negative examples are found within the same batch.
Specifically, sentences (<inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula>) that do not match with (I) and images (<inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msup><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) that do not match with (<inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) within the same batch are identified. The pretraining process involves minimizing a bidirectional ranking loss in the common semantic space.</p>
<p>The goal of this process is to align the semantics of images and sentences in a shared embedding space and measure their semantic relatedness.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="left left" rowspacing="4pt" columnspacing="1em"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo></mml:mtd><mml:mtd><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>I</mml:mi></mml:mrow></mml:munder><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>I</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>I</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mo>&#x2212;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p><xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref> describes a bidirectional ranking loss, a common approach in cross-modal semantic matching. It is designed to fine-tune the semantic alignment between images and their corresponding textual descriptions, following a methodology widely used in multimodal learning tasks. For further theoretical background and application of similar loss functions, readers are referred to [<xref ref-type="bibr" rid="ref-53">53</xref>].</p>
<p>In the equation, (<inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:mi mathvariant="normal">&#x0394;</mml:mi></mml:math></inline-formula>) represents a boundary hyperparameter, and (<inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03BC;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) represents the learning parameters for the (<inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>F</mml:mi><mml:msub><mml:mi>C</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and (<inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) layers in this module.</p>
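<p>A minimal sketch of the bidirectional ranking loss in Eq. (2), assuming dot-product similarity between embedded vectors, in-batch negatives, and a margin &#x0394;; the function name and batch layout are illustrative assumptions, not the authors&#x2019; code:</p>

```python
import numpy as np

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge ranking loss with in-batch negatives (cf. Eq. (2)).

    img_emb, txt_emb: (B, D) arrays; row i of each forms a matching
    image-sentence pair, and every other row serves as a negative.
    """
    scores = img_emb @ txt_emb.T            # (B, B) similarity matrix
    pos = np.diag(scores)                   # scores of the matching pairs
    # Image anchored against negative sentences S^T' (vary the column).
    cost_s = np.maximum(0.0, margin - pos[:, None] + scores)
    # Sentence anchored against negative images I' (vary the row).
    cost_i = np.maximum(0.0, margin - pos[None, :] + scores)
    mask = 1.0 - np.eye(scores.shape[0])    # exclude the positive pair itself
    return float((cost_s * mask).sum() + (cost_i * mask).sum())
```

<p>When every positive pair outscores all in-batch negatives by at least the margin, the loss is zero and the gradients vanish, which is the desired alignment in the common space.</p>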
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Cross-Lingual Semantic Matching</title>
<p>In addition, this study also has pairs of axis language sentences and pseudo-label sentences (<inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), which provide data support for measuring the semantic similarity between target language sentences and axis language sentences. In this section, cross-lingual semantic matching is introduced to enhance the semantic relevance of sentences, using a semantic embedding network mechanism similar to that in <xref ref-type="sec" rid="s3_3_1">Section 3.3.1</xref> to align the embedding vectors of the target language and the axis language. Both the target language encoder and the axis language encoder use a single-layer BGRU (Bidirectional Gated Recurrent Unit), with the hidden vector at the final time step of the BGRU used as the sentence feature vector. <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the axis language feature mapper (<inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>P</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), and <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the target language feature mapper (<inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>). 
Similarly, pretraining is performed by minimizing bidirectional ranking loss to align the common semantic space.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03C1;</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mspace width="1em" /><mml:mrow><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder><mml:mo movablelimits="true" form="prefix">max</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">&#x0394;</mml:mi><mml:mo>&#x2212;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mspace width="1em" /><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>p</mml:mi><mml:mrow><mml:mi 
mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>In the equation, for matching pairs (<inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>T</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> is the negative example from the pseudo-label sentence set in the same batch that does not match (<inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>), and (<inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:msup><mml:mi>P</mml:mi><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula>) is the negative example from the axis language sentence set in the same batch that does not match (<inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>). 
(<inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) represents the learning parameters for the (<inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>P</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) and (<inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) layers in this module.</p>
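<p>To make the sentence encoders concrete, the following sketch shows how a single-layer bidirectional GRU can produce the sentence feature vector described above, with the final hidden states of the forward and backward passes concatenated. This is a minimal numpy illustration under assumed parameter shapes (PyTorch-style gate equations), not the trained encoder itself:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step. W (3H, D), U (3H, H), b (3H,) stack the
    update / reset / candidate blocks; h' = (1 - z) * n + z * h."""
    H = h.shape[0]
    z = sigmoid(W[:H] @ x + U[:H] @ h + b[:H])              # update gate
    r = sigmoid(W[H:2*H] @ x + U[H:2*H] @ h + b[H:2*H])     # reset gate
    n = np.tanh(W[2*H:] @ x + U[2*H:] @ (r * h) + b[2*H:])  # candidate
    return (1 - z) * n + z * h

def bigru_sentence_vector(embeddings, params_fwd, params_bwd, hidden=4):
    """Run a 1-layer bidirectional GRU over word embeddings and return
    the concatenated final hidden states as the sentence vector."""
    h_f = np.zeros(hidden)
    for x in embeddings:                 # forward pass over the sentence
        h_f = gru_step(x, h_f, *params_fwd)
    h_b = np.zeros(hidden)
    for x in embeddings[::-1]:           # backward pass over the sentence
        h_b = gru_step(x, h_b, *params_bwd)
    return np.concatenate([h_f, h_b])    # vector in R^{2 * hidden}
```

<p>Applying the same ranking-loss pretraining as in Eq. (3) to these sentence vectors aligns the axis-language and target-language embeddings in the shared space.</p>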
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Target Language Domain Evaluation Module</title>
<p>Because the currently generated descriptions have little association with the target corpus, they often differ markedly in language style from real target-language sentences. To improve the quality of the description language, this section introduces a module, trained on the target language dataset, that provides language evaluation rewards by scoring input words. This module employs an LSTM (referred to as <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), inputs words sequentially into (<inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), and uses (<inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) to predict the probability of the current input word. 
Using sentences of length (N) from the target corpus as input, represented as <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, the pretraining objective is to minimize the negative log probability of correct words in the sentence, as shown in the equation.
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mi>L</mml:mi><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Here, (<inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) represents the learning parameters for the (<inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) module in this section.</p>
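<p>As a concrete illustration (not the authors' code), the objective in Eq. (4) can be sketched with NumPy; the array names here are hypothetical.</p>

```python
import numpy as np

def pretraining_nll(step_probs, target_ids):
    """Eq. (4): negative log-likelihood of a target-corpus sentence.

    step_probs: (N, V) array; row i is the model's predicted distribution
    over the vocabulary given the prefix w_0 .. w_{i-1}.
    target_ids: the N gold word indices w_1 .. w_N.
    """
    probs = np.asarray(step_probs, dtype=float)
    # Pick out the probability the model assigned to each correct word.
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.sum(np.log(picked)))
```

Minimizing this quantity over the target corpus drives the parameters θ<sub>ω</sub> of the LSTM<sub>L</sub> module toward the target-language distribution.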
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Model Optimization Based on Semantic Matching and Language Rewards</title>
<p>After the self-supervised pretraining of the three modules described above, the Image Encoder-Sentence Decoder module of <xref ref-type="sec" rid="s3_2">Section 3.2</xref> is optimized jointly with them. Specifically, the semantic matching rewards from <xref ref-type="sec" rid="s3_3">Section 3.3</xref> and the language evaluation rewards from <xref ref-type="sec" rid="s3_4">Section 3.4</xref> are used to optimize the module in <xref ref-type="sec" rid="s3_2">Section 3.2</xref>.</p>
<p>Image-Sentence Matching Reward: The image I is mapped through the visual semantic embedding network (<inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), and the sentence (<inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) is mapped through the text semantic embedding network (<inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) to the common embedding space. The cross-modal semantic matching reward can be defined as:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mtext>r|v</mml:mtext></mml:mrow><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>I</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>I</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Cross-Language Sentence Matching Reward: Similarly, the source domain sentence (<inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>) is mapped through the axis language feature mapper (<inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), and the sentence (<inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) is mapped through the target language feature mapper (<inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>). The cross-language semantic matching reward can be defined as:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mtext>r|v</mml:mtext></mml:mrow><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow></mml:msubsup><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mi>P</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mtext>&#x00A0;</mml:mtext><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><xref ref-type="disp-formula" rid="eqn-6">Eq. (6)</xref> defines the cross-language semantic matching reward using cosine similarity, a standard measure in natural language processing for assessing the semantic closeness of high-dimensional vectors. This approach aligns with established practice in cross-lingual semantic analysis, where preserving semantic integrity across languages is crucial; for a foundational reference on cosine similarity in cross-lingual contexts, see [<xref ref-type="bibr" rid="ref-54">54</xref>].</p>
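<p>Both rewards in Eqs. (5) and (6) reduce to a cosine similarity between two vectors in the shared embedding space. A minimal sketch (illustrative only; the function name is not from the paper):</p>

```python
import numpy as np

def cosine_reward(u, v):
    """Cosine similarity between two embeddings, as used for the
    image-sentence reward (Eq. (5)) and the cross-language reward (Eq. (6))."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

For Eq. (5) the arguments would be the embedded image f<sub>l</sub>(I) and the embedded sentence f<sub>S</sub>(S*); for Eq. (6), the mapped source sentence f<sub>P</sub>(S<sup>p</sup>) and mapped target sentence f<sub>T</sub>(S*).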
<p>Target Domain Sentence Language Evaluation Reward: Each word of the sentence (<inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) is iteratively input into the (<inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) module trained on the target language domain in <xref ref-type="sec" rid="s3_4">Section 3.4</xref>. The language evaluation process is as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:mrow><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mrow><mml:mtext>q</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mtext>LSTM</mml:mtext></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>h</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>&#x03C9;</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:mi>N</mml:mi><mml:mo>}</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x22EF;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>, where (<inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>) is the starting symbol &#x201C;bos&#x201D;, (N) is the length of the sentence (<inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>), (<inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>) is the hidden vector at time step (<inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:mi>i</mml:mi></mml:math></inline-formula>), (<inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) is the probability vector over the vocabulary, with dimension equal to the vocabulary size, and (<inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>(<inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>)) represents the predicted probability of the word (<inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>) at time step (<inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>i</mml:mi></mml:math></inline-formula>).
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>q</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
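<p>The evaluation in Eq. (8) averages the log-probabilities the target-domain language model assigns to the sentence's words. A minimal sketch, assuming the per-step distributions q<sub>i</sub> have already been computed by the LSTM<sub>L</sub> module:</p>

```python
import numpy as np

def language_reward(step_probs, word_ids):
    """Eq. (8): mean log-probability assigned to the words of S* by the
    target-domain language model (step_probs row i is q_i over the vocab)."""
    probs = np.asarray(step_probs, dtype=float)
    logp = np.log(probs[np.arange(len(word_ids)), word_ids])
    return float(logp.mean())
```

Fluent, in-domain sentences receive log-probabilities closer to zero and thus a higher reward.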
<p>The total reward for the entire cross-lingual description model is defined as:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B2;</mml:mi><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>r</mml:mtext></mml:mrow><mml:mo>&#x2223;</mml:mo><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msubsup><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>r</mml:mtext></mml:mrow><mml:mo>&#x2223;</mml:mo><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula></p>
<p>In the equation, (<inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>), (<inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>), and (<inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>) are hyperparameters with values in the range [0, 1]. The values of (<inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>), (<inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>), and (<inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>) are set empirically; their optimal values are determined in <xref ref-type="sec" rid="s4_2">Section 4.2</xref>.</p>
<p>To reduce the variance of the expected gradient during model training, we follow a self-critical sequence training approach. The current model obtains sentences (<inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) by multinomial sampling and, as a baseline, obtains sentences (<inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:mi>S</mml:mi></mml:math></inline-formula>) by maximum-probability greedy decoding, with (<inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>(S)) serving as the baseline reward. The overall reward for a sentence (<inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>) is then (<inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (<inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>)) &#x2013; (<inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> (<inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:mi>S</mml:mi></mml:math></inline-formula>)): sentences with rewards above the baseline are encouraged, and those below it are discouraged. 
Through iterative reinforcement training, the model learns to generate sentences with higher semantic matching and language evaluation rewards. The final objective loss of the cross-lingual description model can therefore be defined as:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mtext>total</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd><mml:mtd><mml:mi></mml:mi><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:munderover><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mi>S</mml:mi><mml:mrow><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mrow><mml:mtext>total&#xA0;</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mi>S</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x00D7;</mml:mo><mml:mo fence="true" stretchy="true" symmetric="true"></mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd /><mml:mtd><mml:mspace width="2em" /><mml:mrow><mml:mo fence="true" stretchy="true" 
symmetric="true"></mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x2223;</mml:mo><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mtext>w</mml:mtext></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x003A;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>&#x2217;</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>(<inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) represents the parameters of the image description module.</p>
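<p>The reward combination in Eq. (9) and the self-critical loss in Eq. (10) can be outlined as follows. This is an illustrative sketch, not the training code; the default weights echo the values used in Section 4.2, and the argument names are hypothetical.</p>

```python
import numpy as np

def total_reward(r_lg, r_img, r_xl, alpha=1.0, beta=1.0, gamma=0.15):
    """Eq. (9): weighted sum of the language evaluation reward, the
    image-sentence matching reward, and the cross-language matching reward."""
    return alpha * r_lg + beta * r_img + gamma * r_xl

def scst_loss(logp_sampled_words, r_sampled, r_greedy):
    """Eq. (10): the advantage r_total(S*) - r_total(S) weights the summed
    log-probabilities of the sampled sentence's words; gradient descent on
    this loss raises the probability of sentences that beat the baseline."""
    advantage = r_sampled - r_greedy
    return float(-advantage * np.sum(logp_sampled_words))
```

In training, `r_sampled` and `r_greedy` would each be computed by `total_reward` on the multinomially sampled sentence S* and the greedily decoded baseline sentence S, respectively.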
<fig id="fig-7">
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-7.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Results Analysis</title>
<p>To validate the effectiveness of the model in cross-lingual image description tasks, this study conducted two sub-task experiments: generating image descriptions in English using Arabic as the pivot language, and generating image descriptions in Arabic using English as the pivot language.</p>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets and Evaluation Metrics</title>
<p>In this section, we present an overview of the datasets employed in our experiments and the evaluation metrics utilized to assess the performance of our cross-lingual image captioning model.</p>
<sec id="s4_1_1">
<label>4.1.1</label>
<title>Datasets</title>
<p>We utilized two benchmark datasets for our experiments, as outlined in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Statistics of the datasets used in our experiments</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th>Datasets</th>
<th>Language</th>
<th>Image no.</th>
<th>Caption no. per image</th>
<th>Training set</th>
<th>Validation set</th>
<th>Test set</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>Flickr8k</td>
<td>English</td>
<td>8092</td>
<td>5</td>
<td>6092</td>
<td>1000</td>
<td>1000</td>
</tr>
<tr>
<td>AraImg2k</td>
<td>Arabic</td>
<td>2000</td>
<td>5</td>
<td>1500</td>
<td>250</td>
<td>250</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Flickr8k (English Dataset): This dataset consists of 8092 images, with each image accompanied by five annotated English descriptions. To ensure data consistency, we divided the dataset into three sets: 6092 images for the training set, 1000 images for the validation set, and another 1000 images for the test set. English word segmentation was performed using the &#x201C;Stanford Parser&#x201D; tool (<ext-link ext-link-type="uri" xlink:href="https://stanfordnlp.github.io/CoreNLP/index.html">https://stanfordnlp.github.io/CoreNLP/index.html</ext-link>), retaining English words that appeared at least 5 times and truncating sentences exceeding 20 words in length.</p>
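<p>The vocabulary threshold and truncation rule described above can be sketched as a small helper (hypothetical code, for illustration; the actual tokenization is done by the Stanford Parser):</p>

```python
from collections import Counter

def preprocess_captions(tokenized_captions, min_count=5, max_len=20):
    """Mirror the Flickr8k preprocessing: keep words occurring at least
    `min_count` times and truncate captions to `max_len` tokens."""
    counts = Counter(w for cap in tokenized_captions for w in cap)
    vocab = {w for w, c in counts.items() if c >= min_count}
    truncated = [cap[:max_len] for cap in tokenized_captions]
    return vocab, truncated
```

Words below the frequency threshold would typically be mapped to an unknown-word token before training.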
<p>AraImg2k (Arabic Dataset): This dataset comprises 2000 images, each associated with five manually annotated Arabic descriptions. To maintain uniformity, we split this dataset into three subsets: 1500 images for training, 250 for validation, and 250 for testing. Arabic word segmentation followed the method proposed by [<xref ref-type="bibr" rid="ref-55">55</xref>], retaining Arabic words occurring at least 5 times. The segmentation data was extracted from the Arabic Treebank (ATB) and stored in text files, one sentence per file, with each sentence treated as a time-series instance.</p>
<p>It is important to note that the images and sentences in AraImg2k and Flickr8k are distinct from each other.</p>
</sec>
<sec id="s4_1_2">
<label>4.1.2</label>
<title>Evaluation Metrics</title>
<p>For evaluating the generated image descriptions, we employed semantic evaluation metrics commonly used to assess the quality of machine-generated text compared to human references. These metrics provide insights into the model&#x2019;s performance in generating accurate and fluent image descriptions. The following evaluation metrics were used:
<list list-type="bullet">
<list-item>
<p>BLEU (Bilingual Evaluation Understudy): Measures the quality of machine-generated text by comparing it to human references. We reported BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores.</p></list-item>
<list-item>
<p>ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Assesses text quality by comparing the overlap of machine-generated text with human references.</p></list-item>
<list-item>
<p>METEOR (Metric for Evaluation of Translation with Explicit ORdering): Measures the quality of machine-generated text by considering word choice, synonymy, stemming, and word order.</p></list-item>
<list-item>
<p>CIDEr (Consensus-Based Image Description Evaluation): Evaluates the diversity and quality of generated descriptions by computing consensus scores based on human references.</p></list-item>
<list-item>
<p>SPICE (Semantic Propositional Image Caption Evaluation): Evaluates the quality of generated descriptions by assessing their semantic content and structure.</p></list-item>
</list></p>
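<p>As a concrete illustration of the n-gram matching idea behind BLEU (the scores reported in Section 4.3 use the standard multi-reference implementations), BLEU-1 can be computed as clipped unigram precision with a brevity penalty:</p>

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Illustrative BLEU-1: clipped unigram precision times a brevity penalty.

    candidate: list of tokens; references: list of token lists."""
    cand = Counter(candidate)
    # Clip each candidate word's count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in cand.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - len(candidate)), L))
    bp = 1.0 if len(candidate) >= ref_len else \
        math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision
```

BLEU-2 through BLEU-4 extend the same idea to bigrams through 4-grams, combining the precisions by geometric mean.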
</sec>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Training Settings</title>
<p>In the following sections, we describe the specific training settings for our cross-lingual image captioning model:</p>
<sec id="s4_2_1">
<label>4.2.1</label>
<title>Image Encoder-Sentence Decoder Module</title>
<p>We utilized the pre-trained ResNet-101 model followed by a fully connected layer to extract image features <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>I</mml:mtext></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> of dimension d &#x003D; 512. These features were then used as the initial input to the decoder <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>G</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> at time step 0.</p>
</sec>
<sec id="s4_2_2">
<label>4.2.2</label>
<title>Cross-Modal Semantic Matching Module</title>
<p>The image semantic embedding network consists of the pre-trained ResNet-101 model and a fully connected layer. The target language encoder employed a single-layer <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> structure.</p>
</sec>
<sec id="s4_2_3">
<label>4.2.3</label>
<title>Cross-Lingual Semantic Matching Module</title>
<p>Both the axis language and target language encoders used single-layer <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>P</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mi>B</mml:mi><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:msub><mml:mi>U</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mi>E</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> frameworks with a hidden layer dimension of 512. The output dimension of BGRU was set to 1024.</p>
</sec>
<sec id="s4_2_4">
<label>4.2.4</label>
<title>Target Language Domain Evaluation Module</title>
<p>The language sequence model utilized a single-layer <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>L</mml:mi><mml:mi>S</mml:mi><mml:mi>T</mml:mi><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The hidden layer dimension and word embedding dimension for all LSTM structures in this study were set to d &#x003D; 512.</p>
<p>Throughout the model training process for both subtasks, dropout was set to 0.3, the batch size during pre-training was 128, and during reinforcement training, it was 256.</p>
<p>After the pre-training of the Semantic Matching module (<xref ref-type="sec" rid="s3_3">Section 3.3</xref>) and the Language Optimization module (<xref ref-type="sec" rid="s3_4">Section 3.4</xref>), the learning parameters <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x0B5;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C1;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi>&#x03B8;</mml:mi><mml:mrow><mml:mrow><mml:mi mathvariant="normal">&#x03C9;</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> remained fixed. Both of these modules provided rewards to guide the Image Description Generation module (<xref ref-type="sec" rid="s3_2">Section 3.2</xref>) in learning more source-domain semantic knowledge and target-domain language knowledge.</p>
<p>For the task of generating image descriptions in English using Arabic as the axis language, the learning rate for pre-training the Image Description Generation module is 1E-3. The learning rates for pre-training the source-domain semantic matching module and the target-language domain evaluation module are set to 2E-4. When training with language evaluation rewards and multi-modal semantic rewards, the learning rate for the Image Description Generation module is 4E-5. The values of <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are set to 1, 1, and 0.15, respectively.</p>
<p>For the reverse task of generating image descriptions in Arabic using English as the axis language, the learning rate for pre-training the Image Description Generation module is likewise 1E-3. The learning rates for pre-training the source-domain semantic matching module and the target-language domain evaluation module are set to 4E-4. When training with language evaluation rewards and multi-modal semantic matching rewards, the learning rate for the Image Description Generation module is 1E-5. The values of <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula>, <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>&#x03B2;</mml:mi></mml:math></inline-formula>, and <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are set to 1, 1, and 1, respectively.</p>
</sec>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Results Analysis</title>
<sec id="s4_3_1">
<label>4.3.1</label>
<title>Ablation Experiments</title>
<p>To assess the effectiveness of the Image &#x0026; Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, ablation experiments were conducted. <xref ref-type="table" rid="table-3">Table 3</xref> presents the results of ablation experiments for the task of cross-language image description from Arabic to English and from English to Arabic.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>The contributions of different rewards for cross-lingual English image captioning on Flickr8k test dataset and cross-lingual Arabic image captioning on AraImg2k test dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead valign="top">
<tr>
<th>Task</th>
<th><inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:mi mathvariant="bold-italic">L</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi mathvariant="bold-italic">&#x03B8;</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">G</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:msub><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi mathvariant="bold-italic">g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:msubsup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th><inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msubsup><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mrow><mml:mi mathvariant="bold-italic">r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi mathvariant="bold-italic">v</mml:mi></mml:mrow><mml:mrow><mml:mi mathvariant="bold-italic">p</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula></th>
<th align="center" colspan="8">Metrics</th>
</tr>
<tr>
<th/>
<th/>
<th/>
<th/>
<th/>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>ROUGE</th>
<th>METEOR</th>
<th>CIDEr</th>
<th>SPICE</th>
</tr>
</thead>
<tbody valign="middle">
<tr>
<td rowspan="5"><bold>Cross-language English image captioning</bold></td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>81.0</td>
<td>72.4</td>
<td>67.3</td>
<td>65.3</td>
<td>40.6</td>
<td>44.3</td>
<td>74.1</td>
<td>50.2</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>82.0</td>
<td>73.6</td>
<td>67.9</td>
<td>65.5</td>
<td>40.4</td>
<td>44.6</td>
<td>75.3</td>
<td>50.5</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>89.0</td>
<td>77.3</td>
<td>70.1</td>
<td>66.9</td>
<td>40.2</td>
<td>44.0</td>
<td>76.4</td>
<td>51.1</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>88.0</td>
<td>79.9</td>
<td>74.4</td>
<td>71.0</td>
<td>41.2</td>
<td>44.6</td>
<td>84.4</td>
<td>50.5</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td><bold>91.7</bold></td>
<td><bold>82.4</bold></td>
<td><bold>75.9</bold></td>
<td><bold>71.8</bold></td>
<td><bold>41.7</bold></td>
<td><bold>45.5</bold></td>
<td><bold>87.9</bold></td>
<td><bold>51.7</bold></td>
</tr>
<tr>
<td rowspan="5"><bold>Cross-language Arabic image captioning</bold></td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>85.5</td>
<td>79.6</td>
<td>74.3</td>
<td>70.7</td>
<td>41.8</td>
<td>52.0</td>
<td>76.8</td>
<td>54.6</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>86.4</td>
<td>80.2</td>
<td>74.9</td>
<td>71.4</td>
<td>42.0</td>
<td>52.6</td>
<td>78.2</td>
<td>55.0</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>&#x2014;</td>
<td>88.0</td>
<td>81.1</td>
<td>75.8</td>
<td>72.0</td>
<td>42.4</td>
<td>52.5</td>
<td>79.0</td>
<td>54.8</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2014;</td>
<td>91.0</td>
<td>82.5</td>
<td>76.5</td>
<td>72.2</td>
<td>42.9</td>
<td>52.9</td>
<td>80.4</td>
<td>55.1</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td><bold>91.7</bold></td>
<td><bold>83.9</bold></td>
<td><bold>77.9</bold></td>
<td><bold>73.6</bold></td>
<td><bold>43.2</bold></td>
<td><bold>54.0</bold></td>
<td><bold>81.7</bold></td>
<td><bold>55.5</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="table" rid="table-3">Table 3</xref>, the baseline model was trained with <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>, <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mrow><mml:mtext>L</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mrow><mml:mi mathvariant="normal">&#x03B8;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>G</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>, as its objective function. A model trained with the <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> reward indicates that the Cross-Modal Semantic Matching module (<xref ref-type="sec" rid="s3_3_1">Section 3.3.1</xref>) participated in training; the <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> reward indicates the Cross-Language Semantic Matching module (<xref ref-type="sec" rid="s3_3_2">Section 3.3.2</xref>); and the <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reward indicates the Target Language Domain Evaluation module (<xref ref-type="sec" rid="s3_4">Section 3.4</xref>). The model that jointly uses the rewards <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is also evaluated.</p>
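<p>The three rewards enter training as a single scalar signal weighted by the &#x03B1;, &#x03B2;, and &#x03B3; values given in Section 4.2. The sketch below assumes a simple weighted sum, which is one common way to aggregate such rewards; both the aggregation form and the mapping of weights to individual rewards are our assumptions, and the function and argument names are hypothetical.</p>

```python
def combined_reward(r_language: float, r_visual: float, r_crosslingual: float,
                    alpha: float = 1.0, beta: float = 1.0,
                    gamma: float = 0.15) -> float:
    """Illustrative aggregation of the target-language evaluation reward,
    the image-sentence (cross-modal) matching reward, and the cross-language
    matching reward into one scalar for policy-gradient training.
    The weighted-sum form is an assumption, not the paper's stated formula."""
    return alpha * r_language + beta * r_visual + gamma * r_crosslingual
```

With the English-direction defaults (&#x03B3; = 0.15), the cross-lingual matching term is deliberately down-weighted relative to the other two rewards.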
<p>The results of the ablation experiments shed light on the impact of various reward components on our model&#x2019;s performance for both cross-language English and Arabic image captioning tasks.</p>
<p><xref ref-type="fig" rid="fig-3">Figs. 3</xref> and <xref ref-type="fig" rid="fig-4">4</xref> illustrate the detailed evaluation metrics for cross-language English and Arabic image captioning, respectively, further elucidating the findings of our study.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Evaluation metrics for cross-language English image captioning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-3.tif"/>
</fig><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Evaluation metrics for cross-language Arabic image captioning</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-4.tif"/>
</fig>
<p>According to <xref ref-type="table" rid="table-3">Table 3</xref>, introducing the Multi-Modal Semantic Relevance Reward <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and Cross-Language Semantic Matching Reward <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> led to improvements in several performance metrics. Notably, the CIDEr scores increased for English and Arabic by 1.2% and 1.4%, respectively, compared to the baseline. These results indicate that the Image &#x0026; Cross-Language Semantic Matching module enhanced the semantic relevance of the generated sentences.</p>
<p>The Target Language Domain Evaluation Reward <inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> played a positive role in both cross-language English and Arabic image description tasks. For cross-language English and Arabic image captioning, CIDEr scores increased by 2.3% and 2.2%, respectively, compared to the baseline.</p>
<p>Furthermore, the combined effect of the Target Language Domain Reward <inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and Image-Sentence Semantic Matching Reward <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> resulted in substantial performance improvements in both tasks. For cross-language English and Arabic image descriptions, CIDEr scores increased by 10.3% and 3.6%, respectively, compared to the baseline. This indicates that combining these rewards results in descriptions that are more semantically consistent with the images.</p>
<p>Finally, when all rewards <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, and <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> were considered jointly, significant improvements were observed across all metrics. In comparison to the baseline, CIDEr scores increased for English and Arabic by 13.8% and 4.9%, respectively. This highlights the effectiveness of incorporating guidance from both the Image &#x0026; Cross-Language Domain and Target Language Domain in improving fluency and semantic relevance in generated sentences.</p>
<p>It is worth noting that, when comparing experiments involving only the <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> reward to those with both the <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">&#x2223;</mml:mo><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> rewards, CIDEr scores for English and Arabic increased by 8.0% and 1.4%, respectively. The differential impact of the <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> reward on the two subtasks is noticeable. This difference arises because, in the cross-language English image description subtask, the Flickr8k test set contains varied scenes, such as people, animals, and objects, making the visual semantics richer and more diverse. In this setting, the Image-Sentence Semantic Matching Reward <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msubsup><mml:mi>r</mml:mi><mml:mrow><mml:mi>r</mml:mi><mml:mo>&#x2223;</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> demonstrates a strong ability to complement visual semantics (yielding the 8.0% increase).</p>
<p>However, in the cross-language Arabic image description subtask, the test set AraImg2k primarily features images with a single visual scene (mostly focusing on people). Consequently, there is limited visual semantics to complement. Despite this, the method still improved performance by 1.4%.</p>
<p>The results in <xref ref-type="table" rid="table-3">Table 3</xref> were obtained from experiments implemented in Python and designed to evaluate the impact of the different reward mechanisms in our cross-lingual image captioning model. The experiments were carried out on the Flickr8k dataset for English captions and the AraImg2k dataset for Arabic captions, and they demonstrate the effectiveness of combining semantic matching and language evaluation rewards for cross-lingual image captioning.</p>
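<p>The percentage gains quoted in this subsection correspond to the absolute CIDEr-point differences over the baseline row of Table 3 and can be recomputed directly from the table. In the sketch below, the dictionary keys are shorthand labels of our own choosing for the ablation rows, not identifiers from the paper.</p>

```python
# CIDEr scores transcribed from Table 3; row labels are our own shorthand:
# "rl_rp" = r^l + r^p rewards, "rg" = r_|g only, "rg_rl" = r_|g + r^l,
# "all" = all three rewards jointly.
cider = {
    "en": {"baseline": 74.1, "rl_rp": 75.3, "rg": 76.4, "rg_rl": 84.4, "all": 87.9},
    "ar": {"baseline": 76.8, "rl_rp": 78.2, "rg": 79.0, "rg_rl": 80.4, "all": 81.7},
}

def gain(lang: str, setting: str) -> float:
    """Absolute CIDEr improvement of a reward setting over the baseline."""
    return round(cider[lang][setting] - cider[lang]["baseline"], 1)
```

For example, the full-reward model improves CIDEr by 13.8 points for English and 4.9 points for Arabic, matching the figures quoted above.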
</sec>
<sec id="s4_3_2">
<label>4.3.2</label>
<title>Cross-Language English Image Description Performance Analysis</title>
<p><xref ref-type="table" rid="table-4">Table 4</xref> provides a comparative analysis of various methods, including ours, on the cross-language English image description task. The reported metrics were obtained by systematically evaluating our model on the English Flickr8k dataset against established benchmarks in the field.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Performance comparison for English image description on Flickr8k dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col/>
<col/>
<col align="left"/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead valign="top">
<tr>
<th>Ref.</th>
<th>Year</th>
<th>Dataset</th>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>ROUGE</th>
<th>METEOR</th>
<th>CIDEr</th>
<th>SPICE</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>[<xref ref-type="bibr" rid="ref-56">56</xref>]</td>
<td>2022</td>
<td>Flickr8k</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.074</td>
<td>0.29</td>
<td>0.3</td>
<td>0.33</td>
<td>0.037</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>2022</td>
<td>Flickr8k</td>
<td>0.6126</td>
<td>0.4091</td>
<td>0.2762</td>
<td>0.1866</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-58">58</xref>]</td>
<td>2022</td>
<td>Flickr8k</td>
<td>85.0</td>
<td>78.4</td>
<td>70.3</td>
<td>48.3</td>
<td>69.2</td>
<td>35.4</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-59">59</xref>]</td>
<td>2023</td>
<td>Flickr8k</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.52</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-60">60</xref>]</td>
<td>2023</td>
<td>Flickr8k</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.1044</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-61">61</xref>]</td>
<td>2023</td>
<td>Flickr8k</td>
<td>0.6338</td>
<td>0.4825</td>
<td>0.3940</td>
<td>0.3275</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-62">62</xref>]</td>
<td>2023</td>
<td>Flickr8k</td>
<td>41.25</td>
<td>37.77</td>
<td>78.87</td>
<td>93.91</td>
<td>34.56</td>
<td>38.56</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Ours</td>
<td>2023</td>
<td>Flickr8k &#x0026; AraImg2k</td>
<td><bold>91.7</bold></td>
<td><bold>82.4</bold></td>
<td>75.9</td>
<td>71.8</td>
<td><bold>41.7</bold></td>
<td><bold>45.5</bold></td>
<td><bold>87.9</bold></td>
<td><bold>51.7</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>A comparison between our work and previous studies based on the data in <xref ref-type="table" rid="table-4">Table 4</xref> demonstrates our model&#x2019;s superior performance. Our approach consistently surpasses prior methods across diverse evaluation metrics. Notably, it excels in BLEU and CIDEr scores, signifying its improved accuracy and diversity in generating English image descriptions.</p>
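<p>For readers reimplementing the evaluation, the core of the BLEU scores used throughout these comparisons is a clipped n-gram precision. The following minimal sketch shows only the unigram (BLEU-1) counting step, omitting the brevity penalty and higher-order n-grams of full BLEU; it is illustrative, not the exact scorer used in our experiments.</p>

```python
from collections import Counter

def clipped_unigram_precision(candidate: list, reference: list) -> float:
    """Clipped unigram precision, the building block of BLEU-1: each
    candidate word is credited at most as many times as it appears in
    the reference caption."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / max(len(candidate), 1)
```

The clipping prevents a degenerate caption that repeats one correct word from scoring highly, which is why BLEU rewards both accuracy and coverage in the comparisons above.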
<p><xref ref-type="fig" rid="fig-5">Fig. 5</xref> shows qualitative results of our model on the cross-language English image description task using the Flickr8k test set. Red font marks semantic errors in the baseline model&#x2019;s output, while green font marks the correct semantics produced by our model. The figure illustrates that our model generates descriptions closer to the visual content of the images. For example, it identifies object attributes, replacing the incorrect &#x201C;women&#x201D; with &#x201C;men&#x201D;, and infers object relationships, correcting &#x201C;A red Jeep driving down in a mountainous area&#x201D; to &#x201C;driving down a rocky hill.&#x201D; In addition, the sentences our model generates show fewer stylistic differences from the target language: they tend to follow the target language&#x2019;s &#x201C;someone doing something somewhere&#x201D; pattern, whereas the baseline model prefers to attach attributive modifiers to objects.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Examples of the cross-lingual English image captioning from the Flickr8k test set</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-5.tif"/>
</fig>
</sec>
<sec id="s4_3_3">
<label>4.3.3</label>
<title>Cross-Language Arabic Image Description Performance Analysis</title>
<p>Similar to <xref ref-type="table" rid="table-4">Table 4</xref>, <xref ref-type="table" rid="table-5">Table 5</xref> presents a comparative analysis of various methods for cross-language Arabic image description tasks. The data was derived from testing our model on both the Arabic Flickr8k and AraImg2k datasets, providing a comprehensive performance evaluation across multiple metrics.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Performance comparison for Arabic image description on Arabic Flickr8k and AraImg2k datasets</title>
</caption>
<table frame="hsides">
<colgroup>
<col/>
<col/>
<col align="left"/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
<col/>
</colgroup>
<thead valign="top">
<tr>
<th>Ref.</th>
<th>Year</th>
<th>Dataset</th>
<th>BLEU-1</th>
<th>BLEU-2</th>
<th>BLEU-3</th>
<th>BLEU-4</th>
<th>ROUGE</th>
<th>METEOR</th>
<th>CIDEr</th>
<th>SPICE</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td>[<xref ref-type="bibr" rid="ref-63">63</xref>]</td>
<td>2018</td>
<td>Flickr8k</td>
<td>34.8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>2018</td>
<td>Flickr8k</td>
<td>46</td>
<td>26</td>
<td>19</td>
<td>8</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>2020</td>
<td>Flickr8k</td>
<td>33</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>6</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-64">64</xref>]</td>
<td>2021</td>
<td>Flickr8k</td>
<td>44.3</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>15.6</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-30">30</xref>]</td>
<td>2022</td>
<td>Flickr8k</td>
<td>39.10</td>
<td>25.13</td>
<td>13.96</td>
<td>8.29</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-57">57</xref>]</td>
<td>2022</td>
<td>Flickr8k</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>&#x2013;</td>
<td>0.062</td>
<td>0.29</td>
<td>0.31</td>
<td>0.31</td>
<td>0.037</td>
</tr>
<tr>
<td>[<xref ref-type="bibr" rid="ref-65">65</xref>]</td>
<td>2023</td>
<td>Flickr8k</td>
<td>0.59</td>
<td>0.39</td>
<td>0.30</td>
<td>0.16</td>
<td>0.24</td>
<td>0.21</td>
<td>0.16</td>
<td>&#x2013;</td>
</tr>
<tr>
<td>Ours</td>
<td>2023</td>
<td>Flickr8k &#x0026; AraImg2k</td>
<td><bold>91.7</bold></td>
<td><bold>83.9</bold></td>
<td><bold>77.9</bold></td>
<td><bold>73.6</bold></td>
<td><bold>43.2</bold></td>
<td><bold>54.0</bold></td>
<td><bold>81.7</bold></td>
<td><bold>55.5</bold></td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Comparing our work with previous studies based on the data in <xref ref-type="table" rid="table-5">Table 5</xref> reveals a significant advancement. Our model consistently outperforms existing methods across all evaluation metrics. Notably, it achieves remarkable improvements in BLEU, CIDEr, and SPICE scores, reflecting its superior accuracy, diversity, and linguistic quality in generating Arabic image descriptions.</p>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> indicates that the descriptions generated by the proposed model are more semantically relevant to the visual content. For example, the proposed model is capable of supplementing and correcting missing or incorrect visual information, resulting in more coherent and accurate sentences. Additionally, the sentences generated by the proposed model align more closely with the style of real descriptions, presenting a continuous and concise style.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Examples of the cross-lingual Arabic image captioning from the Flickr8k test set</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_48104-fig-6.tif"/>
</fig>
<p>In summary, the performance analysis of the cross-language Arabic image description task shows that the proposed model consistently outperforms baseline and state-of-the-art methods in various evaluation metrics. It generates descriptions that are not only more semantically accurate but also stylistically aligned with the target language, making it an effective solution for cross-language image description tasks.</p>
<p>In conclusion, this study represents a significant contribution to the field of cross-lingual image description. Our method&#x2019;s ability to generate culturally relevant and semantically coherent captions across languages is not just an academic advancement; it has practical implications for enhancing multilingual understanding and communication. The introduction of the AraImg2k dataset, along with our novel methodologies, sets a new benchmark in the field and lays the groundwork for future research in this area.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion</title>
<p>In this study, we presented a novel approach for cross-lingual image description generation, aiming to bridge the gap between different languages and facilitate the understanding of images across linguistic barriers. Our method combines state-of-the-art techniques in image analysis, natural language processing, and cross-lingual semantics, resulting in a robust and effective model for generating image descriptions in multiple languages.</p>
<sec id="s5_1">
<label>5.1</label>
<title>Key Contributions</title>
<p>Our research makes several key contributions to the field of cross-lingual image description:
<list list-type="bullet">
<list-item>
<p>Effective Cross-Lingual Image Description: We successfully developed a model capable of generating image descriptions in English using Arabic as the pivot language and vice versa. This achievement highlights the versatility and adaptability of our approach to handling diverse language pairs.</p></list-item>
<list-item>
<p>Semantic Relevance Enhancement: Through the Image &#x0026; Cross-Language Semantic Matching module, we demonstrated significant improvements in the semantic relevance of generated sentences. This enhancement contributes to more accurate and contextually appropriate image descriptions.</p></list-item>
<list-item>
<p>Stylistic Alignment: Our model not only excels in semantic accuracy but also exhibits a superior ability to align with the stylistic nuances of the target language. This results in image descriptions that are more fluent and natural, closing the gap between machine-generated and human-authored content.</p></list-item>
</list></p>
</sec>
<sec id="s5_2">
<label>5.2</label>
<title>Future Work</title>
<p>While our current research presents a substantial step forward in cross-lingual image description, there are several exciting avenues for future exploration:
<list list-type="bullet">
<list-item>
<p>Multimodal Enhancements: Incorporating additional modalities such as audio or video content into the image description process could lead to more comprehensive and context-aware descriptions, enabling applications in areas like multimedia indexing and retrieval.</p></list-item>
<list-item>
<p>Low-Resource Languages: Extending our model&#x2019;s capabilities to low-resource languages is a promising direction. This would require addressing the challenges of limited training data and language-specific complexities.</p></list-item>
<list-item>
<p>Fine-Grained Image Understanding: Future work can focus on improving the model&#x2019;s ability to capture fine-grained details within images, allowing for more precise and nuanced descriptions, especially in complex scenes.</p></list-item>
<list-item>
<p>User Interaction: Incorporating user feedback and preferences into the image description generation process can lead to personalized and user-specific descriptions, enhancing the user experience in various applications.</p></list-item>
<list-item>
<p>Real-Time Applications: Adapting our model for real-time applications, such as automatic translation during live events or real-time image description for the visually impaired, is an exciting area for future research and development.</p></list-item>
</list></p>
</sec>
</sec>
</body>
<back>
<ack><p>None.</p>
</ack>
<sec><title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec><title>Author Contributions</title>
<p>Study conception and design: Emran Al-Buraihy, Wang Dan; data collection: Emran Al-Buraihy; analysis and interpretation of results: Emran Al-Buraihy, Wang Dan; draft manuscript preparation: Emran Al-Buraihy, Wang Dan. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The code and the dataset will be available from the authors upon reasonable request.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare that they have no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Kumar</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Harnessing the power of big data: Challenges and opportunities in analytics</article-title>,&#x201D; <source>Tuijin Jishu/J. Propul. Tech.</source>, vol. <volume>44</volume>, no. <issue>2</issue>, pp. <fpage>681</fpage>&#x2013;<lpage>691</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.52783/tjjpt.v44.i2.193</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A. S.</given-names> <surname>George</surname></string-name>, <string-name><given-names>A. H.</given-names> <surname>George</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Baskar</surname></string-name></person-group>, &#x201C;<article-title>Emoji unite: Examining the rise of emoji as an international language bridging cultural and generational divides</article-title>,&#x201D; <source>Partners Univ. Int. Innov. J.</source>, vol. <volume>1</volume>, no. <issue>4</issue>, pp. <fpage>183</fpage>&#x2013;<lpage>204</lpage>, <month>Aug.</month> <year>2023</year>. doi: <pub-id pub-id-type="doi">10.5281/zenodo.8280356</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Kiaer</surname></string-name></person-group>, <source>Emoji Speak: Communication and Behaviours on Social Media</source>. <publisher-loc>London</publisher-loc>: <publisher-name>Bloomsbury Academic</publisher-name>, <year>2023</year>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Madake</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bhatlawande</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Solanke</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Shilaskar</surname></string-name></person-group>, &#x201C;<article-title>PerceptGuide: A perception driven assistive mobility aid based on self-attention and multi-scale feature fusion</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>11</volume>, pp. <fpage>101167</fpage>&#x2013;<lpage>101182</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2023.3314702</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Ni</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yin</surname></string-name>, and <string-name><given-names>B.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>Improving visual reasoning through semantic representation</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>9</volume>, pp. <fpage>91476</fpage>&#x2013;<lpage>91486</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3074937</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Cai</surname></string-name></person-group>, &#x201C;<article-title>Learning to collocate neural modules for image captioning</article-title>,&#x201D; in <conf-name>2019 IEEE/CVF Int. Conf. Comput. Vision (ICCV)</conf-name>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/iccv.2019.00435</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>H. M.</given-names> <surname>Kuzenko</surname></string-name></person-group>, <source>The Role of Audiovisual Translation in the Digital Age</source>. <publisher-loc>Riga, Latvia</publisher-loc>: <publisher-name>Baltija Publishing</publisher-name>, <month>Jun.</month> <day>27</day>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.30525/978-9934-26-319-4-14</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L. H.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Z. Y.</given-names> <surname>Dou</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Peng</surname></string-name>, and <string-name><given-names>K. W.</given-names> <surname>Chang</surname></string-name></person-group>, &#x201C;<article-title>DesCo: Learning object recognition with rich language descriptions</article-title>,&#x201D; <month>Jun.</month> <year>2023</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2306.14060</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Kocmi</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Mach&#x00E1;&#x010D;ek</surname></string-name>, and <string-name><given-names>O.</given-names> <surname>Bojar</surname></string-name></person-group>, &#x201C;<article-title>The reality of multi-lingual machine translation</article-title>,&#x201D; <month>Feb</month>. <year>2022</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2202.12814</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Francis</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Hollauer</surname></string-name>, <string-name><given-names>M. C.</given-names> <surname>Lawson</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Shaikh</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Cotsman</surname></string-name></person-group>, &#x201C;<article-title>Reliability of electric vehicle charging infrastructure: A cross-lingual deep learning approach</article-title>,&#x201D; <source>Commun. Trans. Res.</source>, vol. <volume>3</volume>, no. <issue>6</issue>, pp. <fpage>100095</fpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1016/j.commtr.2023.100095</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Sanders</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Etter</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Kriz</surname></string-name>, and <string-name><given-names>B. van</given-names> <surname>Durme</surname></string-name></person-group>, &#x201C;<article-title>MultiVENT: Multilingual videos of events with aligned natural text</article-title>,&#x201D; <month>Jul</month>. <year>2023</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2307.03153</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Amirian</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Rasheed</surname></string-name>, <string-name><given-names>T. R.</given-names> <surname>Taha</surname></string-name>, and <string-name><given-names>H. R.</given-names> <surname>Arabnia</surname></string-name></person-group>, &#x201C;<article-title>Automatic image and video caption generation with deep learning: A concise review and algorithmic overlap</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>8</volume>, pp. <fpage>218386</fpage>&#x2013;<lpage>218400</lpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3042484</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Liu</surname></string-name></person-group>, &#x201C;<article-title>Better understanding: Stylized image captioning with style attention and adversarial training</article-title>,&#x201D; <source>Sym.</source>, vol. <volume>12</volume>, no. <issue>12</issue>, pp. <fpage>1978</fpage>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.3390/sym12121978</pub-id>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Younis</surname></string-name></person-group>, <article-title>&#x201C;I-Arabic: Computational attempts and corpus issues in modern Arabic,&#x201D; <inline-graphic xlink:href="CMC_48104-inline-2.tif"/></article-title>, vol. <volume>3</volume>, no. <issue>3</issue>, pp. <fpage>301</fpage>&#x2013;<lpage>325</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.21608/mjoms.2023.299689</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P. S.</given-names> <surname>Rao</surname></string-name></person-group>, &#x201C;<article-title>The role of English as a global language</article-title>,&#x201D; <source>Res. J. Eng.</source>, vol. <volume>4</volume>, no. <issue>1</issue>, pp. <fpage>65</fpage>&#x2013;<lpage>79</lpage>, <month>Jan</month>. <year>2019</year>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Liu</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>On the cultural gap in text-to-image generation</article-title>,&#x201D; <month>Jul</month>. <year>2023</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2307.02971</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Lu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Jiang</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Chen</surname></string-name></person-group>, &#x201C;<article-title>TRAVL: Transferring pre-trained visual-linguistic models for cross-lingual image captioning</article-title>,&#x201D; in <source>Web and Big Data. APWeb-WAIM 2022</source>, <publisher-loc>Nanjing, China</publisher-loc>, <publisher-name>Springer</publisher-name>, <year>2022</year>, vol. <volume>13422</volume>, pp. <fpage>341</fpage>&#x2013;<lpage>355</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-25198-6_26</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Sharoff</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Rapp</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Zweigenbaum</surname></string-name></person-group>, &#x201C;<article-title>Building comparable corpora</article-title>,&#x201D; in <conf-name>Building Using Comp. Corpora Multi. Nat. Lang. Process.</conf-name>, <year>2023</year>, pp. <fpage>17</fpage>&#x2013;<lpage>37</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-31384-4_3</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Fan</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Huang</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Wei</surname></string-name></person-group>, &#x201C;<article-title>Unifying cross-lingual and cross-modal modeling towards weakly supervised multilingual vision-language pre-training</article-title>,&#x201D; in <conf-name>Proc. 61st Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Toronto, Canada</publisher-loc>, <year>2023</year>, pp. <fpage>5939</fpage>&#x2013;<lpage>5958</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/2023.acl-long.327</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Reynolds</surname></string-name> and <string-name><given-names>K.</given-names> <surname>McDonell</surname></string-name></person-group>, &#x201C;<article-title>Prompt programming for large language models: Beyond the few-shot paradigm</article-title>,&#x201D; in <conf-name>Ext. Abstr. 2021 CHI Conf. Human Factors Comput. Syst.</conf-name>, <year>2021</year>, pp. <fpage>1</fpage>&#x2013;<lpage>7</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3411763.3451760</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Papineni</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Roukos</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ward</surname></string-name>, and <string-name><given-names>W. J.</given-names> <surname>Zhu</surname></string-name></person-group>, &#x201C;<article-title>BLEU: a method for automatic evaluation of machine translation</article-title>,&#x201D; in <conf-name>Proc. 40th Annu. Meet. Assoc. Comput.</conf-name>, <year>2001</year>, pp. <fpage>311</fpage>&#x2013;<lpage>318</lpage>. doi: <pub-id pub-id-type="doi">10.3115/1073083.1073135</pub-id>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. Y.</given-names> <surname>Lin</surname></string-name></person-group>, &#x201C;<article-title>A package for automatic evaluation of summaries</article-title>,&#x201D; in <conf-name>Text Summar. Bran. Out</conf-name>, <month>Jul</month>. <year>2004</year>, pp. <fpage>74</fpage>&#x2013;<lpage>81</lpage>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Banerjee</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Lavie</surname></string-name></person-group>, &#x201C;<article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>,&#x201D; in <conf-name>Proc. ACL Workshop Intr. Extr. Eval. Meas. Mach. Trans. Summar.</conf-name>, <month>Jun</month>. <year>2005</year>, pp. <fpage>65</fpage>&#x2013;<lpage>72</lpage>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>R.</given-names> <surname>Vedantam</surname></string-name>, <string-name><given-names>C. L.</given-names> <surname>Zitnick</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Parikh</surname></string-name></person-group>, &#x201C;<article-title>CIDEr: Consensus-based image description evaluation</article-title>,&#x201D; in <conf-name>2015 IEEE Conf. Comput. Vision Pattern Recognit. (CVPR)</conf-name>, <publisher-loc>Boston, MA, USA</publisher-loc>, <year>2015</year>, pp. <fpage>4566</fpage>&#x2013;<lpage>4575</lpage>. doi: <pub-id pub-id-type="doi">10.1109/cvpr.2015.7299087</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Anderson</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Fernando</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Johnson</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Gould</surname></string-name></person-group>, &#x201C;<article-title>SPICE: Semantic propositional image caption evaluation</article-title>,&#x201D; <source>Computer Vision&#x2013;ECCV 2016</source>, vol. <volume>9909</volume>, no. <issue>12</issue>, pp. <fpage>382</fpage>&#x2013;<lpage>398</lpage>, <year>2016</year>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-46454-1_24</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="book"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Koul</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ganju</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Kasam</surname></string-name></person-group>, <source>Practical Deep Learning for Cloud, Mobile, and Edge: Real-World AI &#x0026; Computer-Vision Projects Using Python, Keras &#x0026; Tensorflow</source>. <publisher-loc>Sebastopol, CA</publisher-loc>: <publisher-name>O&#x2019;Reilly Media</publisher-name>, <year>2019</year>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Shafiq</surname></string-name> and <string-name><given-names>Z.</given-names> <surname>Gu</surname></string-name></person-group>, &#x201C;<article-title>Deep residual learning for image recognition: A survey</article-title>,&#x201D; <source>Appl. Sci.</source>, vol. <volume>12</volume>, no. <issue>18</issue>, pp. <fpage>8972</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.3390/app12188972</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Ding</surname></string-name></person-group>, &#x201C;<article-title>A systematic literature review on image captioning</article-title>,&#x201D; in <source>HCI International 2023 Posters, Communications in Computer and Information Science</source>, <person-group person-group-type="editor"><string-name><given-names>C.</given-names> <surname>Stephanidis</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Antona</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ntoa</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Salvendy</surname></string-name></person-group>, Eds., <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>, <year>2023</year>, vol. <volume>1836</volume>, pp. <fpage>396</fpage>&#x2013;<lpage>404</lpage>. doi: <pub-id pub-id-type="doi">10.1007/978-3-031-36004-6_54</pub-id>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. S.</given-names> <surname>Al-Malki</surname></string-name> and <string-name><given-names>A. Y.</given-names> <surname>Al-Aama</surname></string-name></person-group>, &#x201C;<article-title>Arabic captioning for images of clothing using deep learning</article-title>,&#x201D; <source>Sens.</source>, vol. <volume>23</volume>, no. <issue>8</issue>, pp. <fpage>3783</fpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.3390/s23083783</pub-id>; <pub-id pub-id-type="pmid">37112124</pub-id></mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M. T.</given-names> <surname>Lasheen</surname></string-name> and <string-name><given-names>N. H.</given-names> <surname>Barakat</surname></string-name></person-group>, &#x201C;<article-title>Arabic image captioning: The effect of text pre-processing on the attention weights and the BLEU-N scores</article-title>,&#x201D; <source>Int. J. Adv. Comput. Sci. App.</source>, vol. <volume>13</volume>, no. <issue>7</issue>, pp. <fpage>413</fpage>&#x2013;<lpage>422</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.14569/ijacsa.2022.0130751</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H. A.</given-names> <surname>Al-muzaini</surname></string-name>, <string-name><given-names>T. N.</given-names> <surname>Al-Yahya</surname></string-name>, and <string-name><given-names>H.</given-names> <surname>Benhidour</surname></string-name></person-group>, &#x201C;<article-title>Automatic arabic image captioning using RNN-LSTM-based language model and CNN</article-title>,&#x201D; <source>Int. J. Adv. Comput. Sci. App.</source>, vol. <volume>9</volume>, no. <issue>6</issue>, pp. <fpage>67</fpage>&#x2013;<lpage>72</lpage>, <year>2018</year>. doi: <pub-id pub-id-type="doi">10.14569/ijacsa.2018.090610</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Feng</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Lapata</surname></string-name></person-group>, &#x201C;<article-title>Automatic caption generation for news images</article-title>,&#x201D; <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, vol. <volume>35</volume>, no. <issue>4</issue>, pp. <fpage>797</fpage>&#x2013;<lpage>812</lpage>, <year>2013</year>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2012.118</pub-id>; <pub-id pub-id-type="pmid">22641700</pub-id></mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J. H.</given-names> <surname>Tan</surname></string-name>, <string-name><given-names>C. S.</given-names> <surname>Chan</surname></string-name>, and <string-name><given-names>J. H.</given-names> <surname>Chuah</surname></string-name></person-group>, &#x201C;<article-title>COMIC: Toward a compact image captioning model with attention</article-title>,&#x201D; <source>IEEE Trans. Multimed.</source>, vol. <volume>21</volume>, no. <issue>10</issue>, pp. <fpage>2686</fpage>&#x2013;<lpage>2696</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/TMM.2019.2904878</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Toshev</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Bengio</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name></person-group>, &#x201C;<article-title>Show and tell: Lessons learned from the 2015 MSCOCO Image captioning challenge</article-title>,&#x201D; <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>, vol. <volume>39</volume>, no. <issue>4</issue>, pp. <fpage>652</fpage>&#x2013;<lpage>663</lpage>, <year>2017</year>. doi: <pub-id pub-id-type="doi">10.1109/TPAMI.2016.2587640</pub-id>; <pub-id pub-id-type="pmid">28055847</pub-id></mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y. A.</given-names> <surname>Thakare</surname></string-name> and <string-name><given-names>K. H.</given-names> <surname>Walse</surname></string-name></person-group>, &#x201C;<article-title>A review of deep learning image captioning approaches</article-title>,&#x201D; <source>J. Integr. Sci. Tec.</source>, vol. <volume>12</volume>, no. <issue>1</issue>, pp. <fpage>712</fpage>, <year>2023</year>. <comment>Accessed: Dec. 16, 2023. [Online]</comment>. Available: <ext-link ext-link-type="uri" xlink:href="https://pubs.thesciencein.org/journal/index.php/jist/article/view/a712">https://pubs.thesciencein.org/journal/index.php/jist/article/view/a712</ext-link></mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Tianying</surname></string-name> and <string-name><given-names>Y. V.</given-names> <surname>Bogoyavlenskaya</surname></string-name></person-group>, &#x201C;<article-title>Semantic transformation and cultural adaptation of metaphor and multimodal metaphor in multilingual communication from the perspective of cognitive linguistics</article-title>,&#x201D; <source>Eurasian J. Appl. Linguist.</source>, vol. <volume>9</volume>, no. <issue>1</issue>, pp. <fpage>161</fpage>&#x2013;<lpage>189</lpage>, <month>Jun.</month> <day>11</day> <year>2023</year>. doi: <pub-id pub-id-type="doi">10.32601/ejal.901015</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>M. S.</given-names> <surname>Rasooli</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Callison-Burch</surname></string-name>, and <string-name><given-names>D. T.</given-names> <surname>Wijaya</surname></string-name></person-group>, &#x201C;<article-title>&#x201C;Wikily&#x201D; supervised neural translation tailored to cross-lingual tasks</article-title>,&#x201D; in <conf-name>Proc. 2021 Conf. Empir. Methods Nat. Lang. Process.</conf-name>, <year>2021</year>, pp. <fpage>1655</fpage>&#x2013;<lpage>1670</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/2021.emnlp-main.124</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>O.</given-names> <surname>ElJundi</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Dhaybi</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Mokadam</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Hajj</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Asmar</surname></string-name></person-group>, &#x201C;<article-title>Resources and end-to-end neural network models for Arabic image captioning</article-title>,&#x201D; in <conf-name>Proc. 15th Int. Joint Conf. Comput. Vision, Imag. Comput. Graph. Theory App.</conf-name>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.5220/0008881202330241</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Lan</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Jia</surname></string-name> and <string-name><given-names>G.</given-names> <surname>Yang</surname></string-name></person-group>, &#x201C;<article-title>COCO-CN for cross-lingual image tagging, captioning, and retrieval</article-title>,&#x201D; <source>IEEE Trans. Multimed.</source>, vol. <volume>21</volume>, no. <issue>9</issue>, pp. <fpage>2347</fpage>&#x2013;<lpage>2360</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/TMM.2019.2896494</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Lan</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Dong</surname></string-name></person-group>, &#x201C;<article-title>Fluency-guided cross-lingual image captioning</article-title>,&#x201D; in <conf-name>Proc. 25th ACM Int. Conf. Multimed.</conf-name>, <publisher-loc>California, USA</publisher-loc>, <year>2017</year>, pp. <fpage>1549</fpage>&#x2013;<lpage>1557</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3123266.3123366</pub-id>.</mixed-citation></ref>
<ref id="ref-41"><label>[41]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Hitschler</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Schamoni</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Riezler</surname></string-name></person-group>, &#x201C;<article-title>Multimodal pivots for image caption translation</article-title>,&#x201D; in <conf-name>Proc. 54th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Berlin, Germany</publisher-loc>, <year>2016</year>, pp. <fpage>2399</fpage>&#x2013;<lpage>2409</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/p16-1227</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-42"><label>[42]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Elliott</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Frank</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Sima&#x2019;an</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Specia</surname></string-name></person-group>, &#x201C;<article-title>Multi30k: Multilingual English-German image descriptions</article-title>,&#x201D; in <conf-name>Proc. 5th Workshop Vision Lang.</conf-name>, <publisher-loc>Berlin, Germany</publisher-loc>, <year>2016</year>, pp. <fpage>70</fpage>&#x2013;<lpage>74</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/w16-3210</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-43"><label>[43]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Yoshikawa</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Shigeto</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Takeuchi</surname></string-name></person-group>, &#x201C;<article-title>STAIR captions: Constructing a large-scale Japanese image caption dataset</article-title>,&#x201D; in <conf-name>Proc. 55th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Vancouver, Canada</publisher-loc>, <year>2017</year>, pp. <fpage>417</fpage>&#x2013;<lpage>421</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/p17-2066</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-44"><label>[44]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Miyazaki</surname></string-name> and <string-name><given-names>N.</given-names> <surname>Shimizu</surname></string-name></person-group>, &#x201C;<article-title>Cross-lingual image caption generation</article-title>,&#x201D; in <conf-name>Proc. 54th Annu. Meet. Assoc. Comput. Linguist.</conf-name>, <publisher-loc>Berlin, Germany</publisher-loc>, <year>2016</year>, pp. <fpage>1780</fpage>&#x2013;<lpage>1790</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/p16-1168</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-45"><label>[45]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Z.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Luo</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Qiu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>, and <string-name><given-names>X.</given-names> <surname>Tang</surname></string-name></person-group>, &#x201C;<article-title>DeepFashion: Powering robust clothes recognition and retrieval with rich annotations</article-title>,&#x201D; in <conf-name>2016 IEEE Conf. Compu. Vision Pattern Recognit. (CVPR)</conf-name>, <publisher-loc>Las Vegas, NV, USA</publisher-loc>, <year>2016</year>, pp. <fpage>1096</fpage>&#x2013;<lpage>1104</lpage>. doi: <pub-id pub-id-type="doi">10.1109/cvpr.2016.124</pub-id>.</mixed-citation></ref>
<ref id="ref-46"><label>[46]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Hacheme</surname></string-name> and <string-name><given-names>S.</given-names> <surname>Noureini</surname></string-name></person-group>, &#x201C;<article-title>Neural fashion image captioning: Accounting for data diversity</article-title>,&#x201D; <comment>arXiv preprint arXiv:2106.12154</comment>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.31730/osf.io/hwtpq</pub-id>.</mixed-citation></ref>
<ref id="ref-47"><label>[47]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Kasai</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Sakaguchi</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Dunagan</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Morrison</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Le Bras</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Choi</surname></string-name></person-group>, &#x201C;<article-title>Transparent human evaluation for image captioning</article-title>,&#x201D; in <conf-name>Proc. 2022 Conf. North Am. Chapter Assoc. Comput. Linguist.: Human Lang. Tech.</conf-name>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.18653/v1/2022.naacl-main.254</pub-id>; <pub-id pub-id-type="pmid">36568019</pub-id></mixed-citation></ref>
<ref id="ref-48"><label>[48]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T. Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Maire</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Belongie</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Hays</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Perona</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Ramanan</surname></string-name></person-group>, &#x201C;<article-title>Microsoft COCO: Common objects in context</article-title>,&#x201D; <source>Computer Vision&#x2013;ECCV 2014</source>, vol. <volume>8693</volume>, no. <issue>2</issue>, pp. <fpage>740</fpage>&#x2013;<lpage>755</lpage>, <year>2014</year>. doi: <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id>.</mixed-citation></ref>
<ref id="ref-49"><label>[49]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Hodosh</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Young</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Hockenmaier</surname></string-name></person-group>, &#x201C;<article-title>Framing image description as a ranking task: Data, models and evaluation metrics</article-title>,&#x201D; <source>J. Artif. Intell. Res.</source>, vol. <volume>47</volume>, pp. <fpage>853</fpage>&#x2013;<lpage>899</lpage>, <year>2013</year>. doi: <pub-id pub-id-type="doi">10.1613/jair.3994</pub-id>.</mixed-citation></ref>
<ref id="ref-50"><label>[50]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. C.</given-names> <surname>Park</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Kim</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Kim</surname></string-name></person-group>, &#x201C;<article-title>Attend to you: Personalized image captioning with context sequence memory networks</article-title>,&#x201D; in <conf-name>2017 IEEE Conf. Comput. Vision Pattern Recognit. (CVPR)</conf-name>, <publisher-loc>Honolulu, HI, USA</publisher-loc>, <year>2017</year>, pp. <fpage>6432</fpage>&#x2013;<lpage>6440</lpage>. doi: <pub-id pub-id-type="doi">10.1109/cvpr.2017.681</pub-id>.</mixed-citation></ref>
<ref id="ref-51"><label>[51]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Young</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Lai</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Hodosh</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Hockenmaier</surname></string-name></person-group>, &#x201C;<article-title>From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions</article-title>,&#x201D; <source>Trans. Assoc. Comput. Linguist.</source>, vol. <volume>2</volume>, no. <issue>1</issue>, pp. <fpage>67</fpage>&#x2013;<lpage>78</lpage>, <year>2014</year>. doi: <pub-id pub-id-type="doi">10.1162/tacl_a_00166</pub-id>.</mixed-citation></ref>
<ref id="ref-52"><label>[52]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>He</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Sun</surname></string-name></person-group>, &#x201C;<article-title>Deep residual learning for image recognition</article-title>,&#x201D; in <conf-name>2016 IEEE Conf. Comput. Vision Pattern Recognit. (CVPR)</conf-name>, <publisher-loc>Las Vegas, NV, USA</publisher-loc>, <year>2016</year>, pp. <fpage>770</fpage>&#x2013;<lpage>778</lpage>. doi: <pub-id pub-id-type="doi">10.1109/cvpr.2016.90</pub-id>.</mixed-citation></ref>
<ref id="ref-53"><label>[53]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Faghri</surname></string-name>, <string-name><given-names>D. J.</given-names> <surname>Fleet</surname></string-name>, <string-name><given-names>J. R.</given-names> <surname>Kiros</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Fidler</surname></string-name></person-group>, &#x201C;<article-title>VSE&#x002B;&#x002B;: Improving visual-semantic embeddings with hard negatives</article-title>,&#x201D; <comment>arXiv preprint arXiv:1707.05612</comment>, <month>Jul</month>. <year>2017</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1707.05612</pub-id>.</mixed-citation></ref>
<ref id="ref-54"><label>[54]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name>, <string-name><given-names>Q. V.</given-names> <surname>Le</surname></string-name>, and <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name></person-group>, &#x201C;<article-title>Exploiting similarities among languages for machine translation</article-title>,&#x201D; <comment>arXiv preprint arXiv:1309.4168</comment>, <month>Sep</month>. <year>2013</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.1309.4168</pub-id>.</mixed-citation></ref>
<ref id="ref-55"><label>[55]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Almuhareb</surname></string-name>, <string-name><given-names>W.</given-names> <surname>Alsanie</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Al-Thubaity</surname></string-name></person-group>, &#x201C;<article-title>Arabic word segmentation with long short-term memory neural networks and word embedding</article-title>,&#x201D; <source>IEEE Access</source>, vol. <volume>7</volume>, pp. <fpage>12879</fpage>&#x2013;<lpage>12887</lpage>, <year>2019</year>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2893460</pub-id>.</mixed-citation></ref>
<ref id="ref-56"><label>[56]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Emami</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Nugues</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Elnagar</surname></string-name>, and <string-name><given-names>I.</given-names> <surname>Afyouni</surname></string-name></person-group>, &#x201C;<article-title>Arabic image captioning using pre-training of deep bidirectional transformers</article-title>,&#x201D; in <conf-name>Proc. 15th Int. Conf. Nat. Lang. Gen.</conf-name>, <publisher-loc>Waterville, Maine, USA</publisher-loc>, <year>2022</year>, pp. <fpage>40</fpage>&#x2013;<lpage>51</lpage>. doi: <pub-id pub-id-type="doi">10.18653/v1/2022.inlg-main.4</pub-id>.</mixed-citation></ref>
<ref id="ref-57"><label>[57]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Shinde</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Hatzade</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Unhale</surname></string-name>, and <string-name><given-names>G.</given-names> <surname>Marwal</surname></string-name></person-group>, &#x201C;<article-title>Analysis of different feature extractors for image captioning using deep learning</article-title>,&#x201D; in <conf-name>2022 3rd Int. Conf. Emerg. Tech. (INCET)</conf-name>, <publisher-loc>Belgaum, India</publisher-loc>, <year>2022</year>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/incet54531.2022.9824294</pub-id>.</mixed-citation></ref>
<ref id="ref-58"><label>[58]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>D. E.</given-names> <surname>Popescu</surname></string-name>, and <string-name><given-names>J. D.</given-names> <surname>Hemanth</surname></string-name></person-group>, &#x201C;<article-title>Dual-modal transformer with enhanced inter-and intra-modality interactions for image captioning</article-title>,&#x201D; <source>Appl. Sci.</source>, vol. <volume>12</volume>, no. <issue>13</issue>, p. <fpage>6733</fpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.3390/app12136733</pub-id>.</mixed-citation></ref>
<ref id="ref-59"><label>[59]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Singh</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Shah</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Kumar</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Chaudhary</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Sharma</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Chaudhary</surname></string-name></person-group>, &#x201C;<article-title>Image captioning using Python</article-title>,&#x201D; in <conf-name>2023 Int. Conf. Power, Instrument., Energy Control (PIECON)</conf-name>, <publisher-loc>Aligarh, India</publisher-loc>, <year>2023</year>, pp. <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/piecon56912.2023.10085724</pub-id>.</mixed-citation></ref>
<ref id="ref-60"><label>[60]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Dixit</surname></string-name>, <string-name><given-names>G. R.</given-names> <surname>Pawar</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Gayakwad</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Joshi</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Mahajan</surname></string-name> and <string-name><given-names>S. V.</given-names> <surname>Chinchmalatpure</surname></string-name></person-group>, &#x201C;<article-title>Challenges and a novel approach for image captioning using neural network and searching techniques</article-title>,&#x201D; <source>Int. J. Intell. Syst. Appl. Eng.</source>, vol. <volume>11</volume>, no. <issue>3</issue>, pp. <fpage>712</fpage>&#x2013;<lpage>720</lpage>, <year>2023</year>. <comment>Accessed: Dec. 16, 2023. [Online]</comment>. Available: <ext-link ext-link-type="uri" xlink:href="https://ijisae.org/index.php/IJISAE/article/view/3277">https://ijisae.org/index.php/IJISAE/article/view/3277</ext-link></mixed-citation></ref>
<ref id="ref-61"><label>[61]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>P.</given-names> <surname>Singh</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Kumar</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Kumar</surname></string-name></person-group>, &#x201C;<article-title>Next-LSTM: A novel LSTM-based image captioning technique</article-title>,&#x201D; <source>Int. J. Syst. Assur. Eng. Manag.</source>, vol. <volume>14</volume>, no. <issue>4</issue>, pp. <fpage>1492</fpage>&#x2013;<lpage>1503</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s13198-023-01956-7</pub-id>.</mixed-citation></ref>
<ref id="ref-62"><label>[62]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B. S.</given-names> <surname>Revathi</surname></string-name> and <string-name><given-names>A. M.</given-names> <surname>Kowshalya</surname></string-name></person-group>, &#x201C;<article-title>Automatic image captioning system based on augmentation and ranking mechanism</article-title>,&#x201D; <source>Signal, Image Video Process.</source>, vol. <volume>18</volume>, no. <issue>1</issue>, pp. <fpage>265</fpage>&#x2013;<lpage>274</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s11760-023-02725-6</pub-id>.</mixed-citation></ref>
<ref id="ref-63"><label>[63]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>V.</given-names> <surname>Jindal</surname></string-name></person-group>, &#x201C;<article-title>A deep learning approach for Arabic caption generation using roots-words</article-title>,&#x201D; in <conf-name>Proc. AAAI Conf. Artif. Intell.</conf-name>, vol. <volume>31</volume>, no. <issue>1</issue>, <year>2017</year>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v31i1.11090</pub-id>.</mixed-citation></ref>
<ref id="ref-64"><label>[64]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>S. M.</given-names> <surname>Sabri</surname></string-name></person-group>, &#x201C;<article-title>Arabic image captioning using deep learning with attention</article-title>,&#x201D; <comment>Ph.D. dissertation</comment>, Univ. of Georgia, USA, <year>2021</year>.</mixed-citation></ref>
<ref id="ref-65"><label>[65]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Elbedwehy</surname></string-name> and <string-name><given-names>T.</given-names> <surname>Medhat</surname></string-name></person-group>, &#x201C;<article-title>Improved Arabic image captioning model using feature concatenation with pre-trained word embedding</article-title>,&#x201D; <source>Neural Comput. Appl.</source>, vol. <volume>35</volume>, no. <issue>26</issue>, pp. <fpage>19051</fpage>&#x2013;<lpage>19067</lpage>, <year>2023</year>. doi: <pub-id pub-id-type="doi">10.1007/s00521-023-08744-1</pub-id>.</mixed-citation></ref>
</ref-list>
</back></article>