<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">58438</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2024.058438</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>A Cross Attention Transformer-Mixed Feedback Video Recommendation Algorithm Based on DIEN</article-title>
<alt-title alt-title-type="left-running-head">A Cross Attention Transformer-Mixed Feedback Video Recommendation Algorithm Based on DIEN</alt-title>
<alt-title alt-title-type="right-running-head">A Cross Attention Transformer-Mixed Feedback Video Recommendation Algorithm Based on DIEN</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Zhang</surname><given-names>Jianwei</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><email>mailzjw@163.com</email></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Zhao</surname><given-names>Zhishang</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Cai</surname><given-names>Zengyu</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-4" contrib-type="author">
<name name-style="western"><surname>Feng</surname><given-names>Yuan</given-names></name><xref ref-type="aff" rid="aff-4">4</xref></contrib>
<contrib id="author-5" contrib-type="author">
<name name-style="western"><surname>Zhu</surname><given-names>Liang</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<contrib id="author-6" contrib-type="author">
<name name-style="western"><surname>Sun</surname><given-names>Yahui</given-names></name><xref ref-type="aff" rid="aff-3">3</xref></contrib>
<aff id="aff-1"><label>1</label><institution>College of Software Engineering, Zhengzhou University of Light Industry</institution>, <addr-line>Zhengzhou, 450000</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Faculty of Information Engineering, Xuchang Vocational Technical College</institution>, <addr-line>Xuchang, 461000</addr-line>, <country>China</country></aff>
<aff id="aff-3"><label>3</label><institution>College of Computer Science and Technology, Zhengzhou University of Light Industry</institution>, <addr-line>Zhengzhou, 450000</addr-line>, <country>China</country></aff>
<aff id="aff-4"><label>4</label><institution>College of Electronics and Information, Zhengzhou University of Light Industry</institution>, <addr-line>Zhengzhou, 450000</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Jianwei Zhang. Email: <email>mailzjw@163.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2025</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>03</day><month>1</month><year>2025</year>
</pub-date>
<volume>82</volume>
<issue>1</issue>
<fpage>977</fpage>
<lpage>996</lpage>
<history>
<date date-type="received">
<day>12</day>
<month>9</month>
<year>2024</year>
</date>
<date date-type="accepted">
<day>23</day>
<month>10</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_58438.pdf"></self-uri>
<abstract>
<p>The rapid development of short video platforms poses new challenges for traditional recommendation systems. Recommender systems typically depend on two types of user behavior feedback to construct user interest profiles: explicit feedback (interactive behavior), which significantly influences users&#x2019; short-term interests, and implicit feedback (viewing time), which substantially affects their long-term interests. However, previous models fail to distinguish between these two types of feedback and therefore predict only users&#x2019; overall preferences from extensive historical behavior sequences. Consequently, they cannot differentiate between users&#x2019; long-term and short-term interests, resulting in low accuracy in describing users&#x2019; interest states and predicting the evolution of their interests. This paper introduces a video recommendation model called CAT-MFRec (Cross Attention Transformer-Mixed Feedback Recommendation), designed to differentiate between explicit and implicit user feedback within the DIEN (Deep Interest Evolution Network) framework. The model learns the two types of behavioral feedback separately and integrates them through the cross-attention mechanism. Additionally, it leverages the ability of the Transformer to model long-sequence dependencies to construct user interest profiles and predict the evolution of user interests accurately. Experimental results indicate that CAT-MFRec significantly outperforms existing recommendation methods across various performance indicators. This advancement offers new theoretical and practical insights for the development of video recommendations, particularly in addressing complex and dynamic user behavior patterns.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Video recommendation</kwd>
<kwd>user interest</kwd>
<kwd>cross-attention</kwd>
<kwd>transformer</kwd>
</kwd-group>
<funding-group>
<award-group id="awg1">
<funding-source>National Natural Science Foundation of China</funding-source>
<award-id>62072416</award-id>
</award-group>
<award-group id="awg2">
<funding-source>Key Research and Development Special Project of Henan Province</funding-source>
<award-id>221111210500</award-id>
</award-group>
<award-group id="awg3">
<funding-source>Key Technologies R&#x0026;D Program of Henan Province</funding-source>
<award-id>232102211053</award-id>
<award-id>242102211071</award-id>
</award-group>
</funding-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>In recent years, short videos have gained immense popularity, with platforms like TikTok and Kuaishou amassing large user bases and significant traffic [<xref ref-type="bibr" rid="ref-1">1</xref>]. Given this user volume and traffic, recommendation systems (RS) have become a crucial technology in this domain [<xref ref-type="bibr" rid="ref-2">2</xref>]. Recently, researchers have developed various recommender system methods focused on building user profiles based on user feedback [<xref ref-type="bibr" rid="ref-3">3</xref>] to enhance recommendation quality and achieve specific platform objectives [<xref ref-type="bibr" rid="ref-4">4</xref>].</p>
<p>Early recommendation models evolved from methods based on Collaborative Filtering (CF) [<xref ref-type="bibr" rid="ref-5">5</xref>] and Matrix Factorization (MF) [<xref ref-type="bibr" rid="ref-6">6</xref>] to logistic regression algorithms that integrate multiple features, eventually developing into deep learning-based methods [<xref ref-type="bibr" rid="ref-7">7</xref>], such as the AFM model [<xref ref-type="bibr" rid="ref-8">8</xref>], which was the first to introduce the attention mechanism. Recently, researchers have proposed several new recommendation system models, including the SASRec model [<xref ref-type="bibr" rid="ref-9">9</xref>], BERT4Rec model [<xref ref-type="bibr" rid="ref-10">10</xref>], and Alibaba&#x2019;s models, such as the Deep Interest Network (DIN) [<xref ref-type="bibr" rid="ref-11">11</xref>] and the Deep Interest Evolution Network (DIEN) [<xref ref-type="bibr" rid="ref-12">12</xref>]. These recommendation models aim to achieve the objectives of short video platforms by analyzing user behaviors, including increasing user retention, enhancing user engagement, and extending viewing time. The DIEN model, in particular, has gained significant recognition among researchers for its ability to learn and predict the evolution of user interests. For instance, Xu et al. introduced a hierarchical attention network based on the DIEN model, proposing a deep interest prediction model that utilizes hierarchical attention networks [<xref ref-type="bibr" rid="ref-13">13</xref>]. Feng et al. integrated Bi-LSTM into the DIEN model, resulting in the development of the Deep Session Interest Network [<xref ref-type="bibr" rid="ref-14">14</xref>]. Shi et al. employed neural networks for continuous modeling of interest evolution, proposing a deep time-stream framework built on the DIEN model [<xref ref-type="bibr" rid="ref-15">15</xref>].</p>
<p>Existing recommender systems and several models, including SASRec, BERT4Rec, DIN, and DIEN, along with their improved variants, have applied deep learning to user behavior sequences, leading to the concept of interest evolution. This concept has become a crucial technical foundation for short video recommendation systems. However, a common issue in these models is that they learn users&#x2019; historical behavior information indiscriminately when constructing user interest profiles, neglecting the distinction between explicit and implicit feedback in reflecting user interests [<xref ref-type="bibr" rid="ref-16">16</xref>]. This undifferentiated approach may result in a misunderstanding of user interests, thereby affecting the accuracy of recommendations. Because explicit and implicit feedback each have distinct advantages, relying on a single type of feedback or learning both types simultaneously fails to adequately capture the full scope of users&#x2019; interest states [<xref ref-type="bibr" rid="ref-17">17</xref>]. To overcome these limitations, it is essential to develop a more effective strategy for separately learning and integrating explicit and implicit feedback, thereby balancing users&#x2019; long-term and short-term interests.</p>
<p>This paper proposes a Cross Attention Transformer-Mixed Feedback Recommendation (CAT-MFRec) model to address this issue. Unlike existing models, the CAT-MFRec model distinctly differentiates between explicit and implicit feedback during user interest modeling, utilizing Transformers to model each type of feedback separately. More importantly, the model integrates the two types of feedback through the cross-attention mechanism, maximizing the advantages of both explicit and implicit feedback, and ultimately capturing users&#x2019; long-term and short-term interest states more accurately. Additionally, the model significantly improves predictions of user interest evolution while balancing short-term engagement and long-term satisfaction, thereby enhancing personalized recommendations and achieving the objectives of short video platforms.</p>
<p>The primary contributions of this paper are as follows:
<list list-type="bullet">
<list-item>
<p>This paper highlights the differing performance of explicit and implicit feedback mechanisms concerning long-term and short-term interests, emphasizing the importance of separate learning. It effectively combines these mechanisms to construct user interest states.</p></list-item>
<list-item>
<p>We propose the CAT-MFRec model, an innovative approach based on the cross-attention mechanism and Transformer. This model facilitates separate learning and effective integration of the two types of feedback, accurately constructing users&#x2019; long-term and short-term interest states and thereby enhancing predictions of their future interests.</p></list-item>
<list-item>
<p>We validated the effectiveness of our work using a real-world dataset (KuaiRand), with results demonstrating the efficacy of our approach.</p></list-item>
</list></p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Work</title>
<sec id="s2_1">
<label>2.1</label>
<title>Research Status</title>
<p>Video recommendation technology has emerged as a significant development area in artificial intelligence and data science in recent years. Video recommendation systems primarily analyze users&#x2019; historical behaviors to determine their interest preferences, delivering personalized video content to enhance user experience and increase platform engagement [<xref ref-type="bibr" rid="ref-18">18</xref>]. However, traditional recommendation systems typically account only for users&#x2019; static interests. The integration of deep learning technology into recommendation systems has led to significant breakthroughs, including the introduction of the concept of interest evolution, which signifies that user interests change dynamically over time [<xref ref-type="bibr" rid="ref-19">19</xref>].</p>
<p>Users have diverse interests; at any given moment, a user may hold multiple interests, a situation referred to as the &#x201C;interest state.&#x201D; Furthermore, each interest is dynamic, undergoes its own evolutionary process, and exhibits specific causal relationships [<xref ref-type="bibr" rid="ref-20">20</xref>]. Additionally, interests can drift: a user may watch basketball-related videos for a time and then suddenly switch to calligraphy-related content. However, previous models are limited to predicting overall user preferences from a large number of historical behavior sequences [<xref ref-type="bibr" rid="ref-21">21</xref>] and are unable to predict the user&#x2019;s &#x201C;next&#x201D; preference. To accurately predict the user&#x2019;s next preference, we must thoroughly understand the user&#x2019;s interest state, which requires distinguishing between the two types of user feedback, as they affect interests differently; this distinction is discussed in the next section. Previous recommendation models do not differentiate between these feedback types. They train on all of a user&#x2019;s historical behaviors collectively, resulting in a generalized description of the user&#x2019;s interest state that fails to accurately balance long-term and short-term interests [<xref ref-type="bibr" rid="ref-22">22</xref>].</p>
<p>Therefore, the primary challenge lies in balancing the use of two feedback methods to construct a user&#x2019;s interest profile while effectively managing the relationship between long-term and short-term interests.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>User Feedback</title>
<p>In video recommendation systems, user feedback serves as a critical foundation for optimizing recommendation algorithms. Feedback is primarily categorized into explicit and implicit types, reflecting different dimensions of user interest and differing significantly in terms of timeliness and dependence. Understanding the characteristics of these two types of feedback and utilizing them effectively is essential for optimizing recommender systems [<xref ref-type="bibr" rid="ref-23">23</xref>].</p>
<p>Explicit feedback refers to preferences directly expressed by users through interactive actions, such as likes, comments, favorites, and shares, which provide immediate insights. This feedback arises directly from users&#x2019; active behaviors and reflects their attitudes toward the video content. For instance, a user who likes a video after viewing it provides strong positive feedback on the content. This short-term behavior allows the recommender system to rapidly capture users&#x2019; short-term interests.</p>
<p>Implicit feedback refers to behavioral preferences that are not explicitly stated, including viewing time, frequency, browsing path, and interaction duration. Although this type of feedback does not directly indicate user preferences, it allows for inferring long-term interests due to continuous observation of user behavior, often necessitating the analysis of historical data over extended periods. While implicit signals, such as viewing duration, do not directly reflect user attitudes, they can reveal trends in user preferences. For instance, a user who watches specific content for an extended duration may demonstrate a sustained interest in that content. These patterns typically reflect long-term interests rather than short-term fluctuations [<xref ref-type="bibr" rid="ref-24">24</xref>].</p>
<p>In general, explicit feedback allows the recommender system to rapidly adjust recommended content to align with users&#x2019; current interests, while implicit feedback reflects users&#x2019; long-term interests due to its stability, thereby enhancing the representation of long-term preferences [<xref ref-type="bibr" rid="ref-25">25</xref>].</p>
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Transformer</title>
<p>The Transformer is a deep learning model designed for processing sequence data, utilizing a self-attention mechanism that operates independently of Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).</p>
<p>The model input consists of word embedding vectors, to which positional information is added through positional encoding. The self-attention mechanism enables each position in the input sequence to attend to every other position by computing the relationships between queries, keys, and values to assess their influence. It applies a scaled dot product followed by the softmax function to generate attention weights. In the multi-head attention mechanism, attention is computed in parallel across multiple heads, with each head attending to the input in a different representation subspace. Each self-attention layer is followed by a feedforward neural network, which processes the vector at each position independently. Each encoder layer comprises a multi-head self-attention layer and a feedforward network, both equipped with residual connections and layer normalization to improve training stability and efficiency. The decoder layer resembles the encoder layer but incorporates an encoder-decoder attention layer, enabling it to focus on relevant portions of the encoder&#x2019;s output. Finally, the decoder output is processed through a linear layer followed by a softmax layer to generate a probability distribution [<xref ref-type="bibr" rid="ref-26">26</xref>].</p>
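<p>As a minimal illustration of the dot-product and softmax computation described above, the following Python sketch implements scaled dot-product attention on a toy sequence. It is a simplification under stated assumptions rather than the implementation used in this work: the learned query, key, and value projections and the multi-head split are omitted, the input values are invented for illustration, and the function names <monospace>softmax</monospace> and <monospace>attention</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors.

    Returns (outputs, weights); weights[t][s] is how strongly
    position t attends to position s.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([sum(ws * v[j] for ws, v in zip(w, values))
                        for j in range(width)])
    return outputs, weights


# Toy sequence of three 2-dimensional vectors (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries, keys, and values all come from the same sequence.
out, attn = attention(x, x, x)
```
]]></code>
<p>Each row of <monospace>attn</monospace> sums to one, so every output vector is a weighted average of the value vectors; positions whose keys align with the query receive the larger weights.</p>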
<p>These designs allow the Transformer to optimize parallel processing and manage long-range dependencies, making it suitable for complex tasks such as machine translation and text generation. In this study, we replace positional encoding with precise temporal information for two primary reasons. First, positional encoding based on sine and cosine primarily offers relative positional information within the sequence. In contrast, time series data encompasses not only the order of elements but also time intervals and durations, which positional encoding cannot accurately capture. Incorporating precise time information enables the model to learn actual time points and intervals, thereby modeling temporal dependencies more effectively. Second, when processing time series data, the model must comprehend the relative positions of data points and their time intervals. By integrating precise time information, the model can more effectively capture temporal dependencies and improve its capability to handle tasks with varying time spans [<xref ref-type="bibr" rid="ref-27">27</xref>].</p>
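<p>The difference between index-based positional encoding and timestamp-based features can be sketched as follows. This is an illustrative example only: the choice of an hour-of-day phase and the gap to the previous event as time features, the helper names, and the sample timestamps are all assumptions made for the sketch and are not taken from the model described in this paper.</p>
<code language="python"><![CDATA[
```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding that depends only on the index in the sequence."""
    enc = []
    for k in range(0, dim, 2):
        angle = position / (10000 ** (k / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc[:dim]


def time_features(timestamps):
    """Hypothetical features: cyclic hour-of-day plus gap to the previous event."""
    day = 24 * 3600
    feats = []
    for idx, ts in enumerate(timestamps):
        phase = 2 * math.pi * (ts % day) / day
        gap_hours = 0.0 if idx == 0 else (ts - timestamps[idx - 1]) / 3600
        feats.append([math.sin(phase), math.cos(phase), gap_hours])
    return feats


# Two sessions with the same event order but different spacing (in seconds).
dense = [0, 60, 120]
sparse = [0, 3600, 90000]

# Positional encoding sees only indices 0, 1, 2, so both sessions look alike.
same_positions = [positional_encoding(i, 4) for i in range(3)]

# Timestamp features differ between the sessions because the gaps differ.
dense_feats = time_features(dense)
sparse_feats = time_features(sparse)
```
]]></code>
<p>Both sessions receive identical positional encodings, whereas the timestamp features separate a one-minute gap from a one-day gap; the cyclic phase additionally maps events at the same time of day to the same values.</p>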
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Cross Attention</title>
<p>The cross-attention mechanism is derived from classical attention mechanisms. The most common form of attention is self-attention, which learns dependencies between elements of the same input sequence. In contrast, cross-attention integrates inputs from different sources, enhancing feature representation by focusing on the interactions among them. This matches our requirements, as our model must handle two distinct sources of information: explicit and implicit feedback. The cross-attention mechanism effectively captures the mutual relationships and dependencies between these two types of information, facilitating better integration to predict the user&#x2019;s interest state. It dynamically assigns attention weights, enabling more precise learning of how different signals influence user interests.</p>
<p>Compared with cross-attention, directly averaging or concatenating explicit and implicit feedback before sending it to the network lacks dynamic weight adjustment: the cross-attention mechanism learns the importance of each feedback type in different contexts, whereas averaging or concatenation offers no such flexibility. Compared with graph neural networks, the cross-attention mechanism offers significant advantages in parameter complexity and training efficiency when only two sources are present; graph neural networks would significantly increase model complexity and reduce training efficiency. Self-attention alone, in turn, cannot capture dependencies that arise from interactions among data from different sources [<xref ref-type="bibr" rid="ref-28">28</xref>].</p>
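<p>The contrast between cross-attention and static fusion can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions, not the implementation used in this work: learned projections are omitted, the two-dimensional embeddings are invented, and the names <monospace>cross_attention</monospace>, <monospace>explicit</monospace>, and <monospace>implicit</monospace> are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Queries come from one source; keys and values come from the other."""
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    fused, weights = [], []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        weights.append(w)
        fused.append([sum(ws * v[j] for ws, v in zip(w, values))
                      for j in range(width)])
    return fused, weights


# Hypothetical 2-d embeddings for three time steps of each feedback type.
explicit = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
implicit = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Explicit queries read from implicit keys and values, and the reverse.
explicit_ctx, w_ei = cross_attention(explicit, implicit, implicit)
implicit_ctx, w_ie = cross_attention(implicit, explicit, explicit)

# Static fusion for comparison: every step is mixed with the same fixed weights.
averaged = [[(a + b) / 2 for a, b in zip(e, d)]
            for e, d in zip(explicit, implicit)]
```
]]></code>
<p>In this sketch the rows of <monospace>w_ei</monospace> differ from one query to the next, so each explicit-feedback step draws on a different mixture of implicit-feedback steps, while the averaged fusion combines the two sources with the same weights at every step.</p>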
<p>In summary, the cross-attention mechanism improves the performance and adaptability of sequence-to-sequence models by allowing the decoder to dynamically focus on elements of the input sequence. It is a crucial component of contemporary deep learning models for processing complex sequence data [<xref ref-type="bibr" rid="ref-29">29</xref>].</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Method</title>
<p>In this section, we present a comprehensive overview of our proposed CAT-MFRec model. We begin with a foundational overview to facilitate deeper understanding, followed by a detailed examination of the complex design and setup of specific modules.</p>
<sec id="s3_1">
<label>3.1</label>
<title>CAT-MFRec Network Structure</title>
<p>The CAT-MFRec model mainly comprises three layers, from bottom to top: (1) the behavior sequence layer, (2) the interest extraction layer, and (3) the interest evolution layer, followed by a final decision output section. Its structure is shown in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>CAT-MFRec network structure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-1.tif"/>
</fig>
<p>As can be seen from <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, the structure of CAT-MFRec is as follows:
<list list-type="order">
<list-item><p><bold>Embedding Layer:</bold> This layer comprises four types of features: User Behavior, Platform Object, Context, and User Profile. In this layer, we process two types of user behavior separately: <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> represents the implicit feedback from the user&#x2019;s viewing time, while <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> represents the explicit feedback from the user&#x2019;s interactions. Both are fed into the embedding layer separately, producing two batches of embedding vectors: <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>.</p></list-item>
<list-item><p><bold>Interest Extractor Layer:</bold> The Interest Extractor layer aims to mine and extract the &#x201C;interest state&#x201D; hidden behind the user&#x2019;s behavior at each moment, which is the primary focus of our improvement efforts. The cross-attention mechanism is introduced to enhance the training and learning of mixed feedback. The adopted sequence model is the Transformer, which relies entirely on the self-attention mechanism to process the entire sequence, allowing simultaneous processing of all data points and significantly improving training speed.</p></list-item>
<list-item><p><bold>Interest Evolving Layer:</bold> As shown in the figure, a weight score is calculated through attention between the hidden state <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> obtained from the Interest Extractor layer and the platform target. This weight score is then combined with <inline-formula id="ieqn-6"><mml:math id="mml-ieqn-6"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> in the AUGRU (GRU with Attentional Update gate) to obtain the final user interest state.</p></list-item>
<list-item><p><bold>Subsequent Processing:</bold> The user&#x2019;s interest state is obtained through the three aforementioned layers and concatenated with the Platform Object, Context, and User Profile feature vectors. This combined input is then passed through multiple fully connected layers to generate the final recommendation prediction.</p></list-item>
</list></p>
<p>The loss function for the entire model is calculated as follows:
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-7"><mml:math id="mml-ieqn-7"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the cross-entropy loss following the fully connected layer, and the formula is:
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mtext>target</mml:mtext></mml:mrow></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mo>[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Here <inline-formula id="ieqn-8"><mml:math id="mml-ieqn-8"><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math></inline-formula> is the ground-truth label, <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:mrow><mml:mover><mml:msub><mml:mi>y</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">&#x005E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the predicted probability that the user is interested, <inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the auxiliary loss, and <inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>&#x03B1;</mml:mi></mml:math></inline-formula> is the hyperparameter that balances the two losses, as explained in a later section.</p>
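<p>As a worked illustration of Eqs. (1) and (2), the following Python sketch evaluates the combined loss for a handful of samples. It is a sketch under stated assumptions: the auxiliary loss is passed in as a plain number because its definition is given later, and the labels, predicted probabilities, auxiliary loss value, and balancing weight are invented for illustration. The function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def target_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy summed over N samples, following Eq. (2)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total


def total_loss(labels, probs, aux_loss, alpha):
    """Eq. (1): target loss plus the alpha-weighted auxiliary loss."""
    return target_loss(labels, probs) + alpha * aux_loss


# Illustrative values only: three samples, an assumed auxiliary loss of 0.5,
# and an assumed balancing weight of 0.1.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
loss = total_loss(labels, probs, aux_loss=0.5, alpha=0.1)
```
]]></code>
<p>Confident correct predictions contribute little to the target loss, while the balancing weight controls how strongly the auxiliary term influences the total.</p>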
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Embedding Layer</title>
<p>In the Embedding stage of the proposed model, we divide user behavior features into explicit feedback <inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, derived from user interaction behavior, and implicit feedback <inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mn>2</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, derived from user viewing time. Each sequence is embedded separately, giving the explicit feedback embedding vector <inline-formula id="ieqn-14"><mml:math id="mml-ieqn-14"><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> and the implicit feedback embedding vector <inline-formula id="ieqn-15"><mml:math id="mml-ieqn-15"><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>T</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>. By handling explicit and implicit feedback separately, our model retains the distinct information associated with each feedback type. Subsequently, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>d</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> are fed into two parallel Transformer encoders to learn patterns and dependencies specific to each feedback type.</p>
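<p>As a minimal sketch of the separate embedding step described above, the following example looks up explicit and implicit feedback sequences in two independent tables. The integer encodings of the feedback (interaction types and viewing-time buckets), the table sizes, and the random initialization are illustrative assumptions, not details given in this paper; in practice the tables would be learned.</p>

```python
import random

def make_embedding_table(vocab_size, dim, seed):
    """Build a lookup table of small random vectors (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(sequence, table):
    """Map a sequence of integer ids to its embedding vectors."""
    return [table[idx] for idx in sequence]

# Hypothetical encodings: explicit feedback ids (0 = none, 1 = like, 2 = share)
# and implicit feedback ids (viewing time bucketed into 4 bins).
explicit_table = make_embedding_table(vocab_size=3, dim=4, seed=0)
implicit_table = make_embedding_table(vocab_size=4, dim=4, seed=1)

b_i = [1, 0, 2, 1]   # explicit feedback sequence, T = 4
b_d = [3, 0, 2, 1]   # implicit feedback sequence, T = 4

e_i = embed(b_i, explicit_table)   # explicit feedback embedding sequence
e_d = embed(b_d, implicit_table)   # implicit feedback embedding sequence
```

<p>Because the two tables are independent, the same integer id maps to different vectors in each stream, which is one simple way to keep the two feedback types distinct before they reach the parallel encoders.</p>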
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Interest Extractor Layer</title>
<sec id="s3_3_1">
<label>3.3.1</label>
<title>Transformer and Cross Attention</title>
<p>To accurately extract the user&#x2019;s interest state at each moment from their behavior, we need a Transformer network capable of handling time series data. We also incorporate a cross-attention mechanism into the self-attention step of the Transformer by exchanging the generated <italic>Q</italic> matrices of the two feedback types. Its structure is shown in <xref ref-type="fig" rid="fig-2">Fig. 2</xref> below.</p>
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Transformers for time series</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-2.tif"/>
</fig>
<p>In the standard Transformer, the input is the embedding vector, which incorporates positional information through positional encoding. Here, we remove the positional encoding from the Transformer and replace it with a time series encoding.</p>
<p>Building on the Transformer model with time series encoding, the explicit and implicit feedback embedding vectors are processed as follows. In the encoder section of the Transformer model, accurate time information replaces the original positional encoding. First, time features are extracted from the timestamps of user behavior and periodically encoded. These time features are then projected to the dimensions of <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> through a linear transformation and added to them, yielding the embedding vectors <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> with time series information.
The queries <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mi>Q</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, keys <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, and values <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> for the explicit feedback embedding vector <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and the queries <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>, keys <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mi>K</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>, and values <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>V</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> for the implicit feedback embedding vector <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> are then derived through linear transformation.</p>
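<p>The time series encoding described above can be sketched as follows. The choice of periodic features (sine and cosine of hour-of-day and day-of-week) and the projection weights are assumptions made for illustration only; the paper does not specify which periodic features are used, and the weights would normally be learned.</p>

```python
import math
from datetime import datetime, timezone

def periodic_time_features(ts):
    """Encode hour-of-day and day-of-week of a UNIX timestamp as sin/cos pairs."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    hour_angle = 2 * math.pi * (dt.hour + dt.minute / 60) / 24
    day_angle = 2 * math.pi * dt.weekday() / 7
    return [math.sin(hour_angle), math.cos(hour_angle),
            math.sin(day_angle), math.cos(day_angle)]

def linear(x, weight, bias):
    """Compute y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def add_time_encoding(embeddings, timestamps, weight, bias):
    """Project each time feature to the embedding dimension and add it to the embedding."""
    out = []
    for e, ts in zip(embeddings, timestamps):
        t = linear(periodic_time_features(ts), weight, bias)
        out.append([a + b for a, b in zip(e, t)])
    return out

# Toy example: embedding dimension 2, illustrative (untrained) projection weights
# that only look at the hour-of-day sin/cos features.
W_time = [[0.5, 0.0, 0.0, 0.0],
          [0.0, 0.5, 0.0, 0.0]]
b_time = [0.0, 0.0]
e_i = [[1.0, 1.0], [1.0, 1.0]]
timestamps = [0, 6 * 3600]        # 00:00 and 06:00 UTC on 1970-01-01
Te_i = add_time_encoding(e_i, timestamps, W_time, b_time)
```

<p>Unlike a positional index, the encoding depends on the actual timestamp, so two behaviors that occur at the same time of day receive the same offset regardless of their position in the sequence.</p>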
<p>When the explicit feedback embedding vector <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> and the implicit feedback embedding vector <inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:mi>T</mml:mi><mml:msub><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> are passed into the Transformer model, the query matrix <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mi>Q</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> representing explicit feedback is exchanged with the query matrix representing implicit feedback. As a result, one Transformer processes <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> as inputs, while the other processes <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>Q</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>K</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>, and <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:msub><mml:mi>V</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>. The model thus accounts for both types of user behavior and their effects on long-term and short-term interests, as shown in <xref ref-type="fig" rid="fig-3">Fig. 3</xref> below.</p>
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Cross-attention mechanism</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-3.tif"/>
</fig>
<p>The subsequent calculation is the same in both Transformer models, so we take the explicit feedback embedding vector as an example. Each head of multi-head attention performs one attention calculation, and the attention output is computed in three steps: Scale, Softmax, and MatMul.</p>
<p>First, in the Scale operation, the dot product of <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> is scaled to prevent the numerical instability caused by large vector dimensions, so that the attention scores remain in a reasonable range. Next, Softmax (normalization) is applied to obtain the attention weights. Finally, MatMul (matrix multiplication) multiplies the attention weights by the value matrix <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> to obtain the attention output of each head. The formulas are summarized as follows:
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:msubsup><mml:mi>K</mml:mi><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mspace width="1em" /><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>W</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>S</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mi>S</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="1em" /><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>O</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi>p</mml:mi><mml:mi>u</mml:mi><mml:msub><mml:mi>t</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>W</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:msub><mml:mi>s</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></disp-formula></p>
<p>Combining these three steps gives the attention computation for each head:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:msubsup><mml:mi>K</mml:mi><mml:mi>i</mml:mi><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></disp-formula></p>
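<p>The three steps in Eqs. (3) and (4), together with the exchange of query matrices, can be sketched as below. This is a single-head illustration in plain Python with toy matrices chosen by us; it is not the authors&#x2019; implementation, and the helper names are hypothetical.</p>

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def attention(q, k, v):
    """Scale, Softmax, MatMul: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)
    return matmul(weights, v)

# Toy projections for the explicit (i) and implicit (d) streams.
Q_i = [[1.0, 0.0], [0.0, 1.0]]
K_i = [[1.0, 0.0], [0.0, 1.0]]
V_i = [[1.0, 2.0], [3.0, 4.0]]
Q_d = [[0.0, 0.0], [0.0, 0.0]]
K_d = [[1.0, 0.0], [0.0, 1.0]]
V_d = [[0.0, 1.0], [1.0, 0.0]]

# Exchanged queries: the explicit branch attends with Q_d, the implicit branch with Q_i.
out_explicit = attention(Q_d, K_i, V_i)
out_implicit = attention(Q_i, K_d, V_d)
```

<p>In this toy case the all-zero query <italic>Q</italic><sub>d</sub> produces uniform attention weights, so each row of the explicit branch output is the average of the rows of <italic>V</italic><sub>i</sub>.</p>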
<p>Multi-head attention then concatenates the outputs of all heads and maps them back to the original model dimension through a linear layer. The output of multi-head attention is then combined with the original input through a residual connection.</p>
<p>In the decoder section, the input embeddings help the decoder refer to the information of different candidate videos during prediction. First, the input is linearly transformed to generate the <italic>Q</italic>, <italic>K</italic>, and <italic>V</italic> vectors, which are sent to the masked multi-head attention calculation. This calculation is similar to the multi-head attention mechanism, except that a masking step is added after each Scale operation of self-attention. To ensure that the model focuses on the relevant content, we apply masks as needed to hide the parts that do not need attention. After that, the query <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>Q</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> obtained from the result of the masked multi-head attention calculation and the key <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mi>K</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> and value <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> obtained from the encoder output features are used for the multi-head cross-attention calculation between the encoder and decoder. The multi-head cross-attention calculation is consistent with the multi-head attention calculation above, and the overall formula is as follows:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mi>C</mml:mi><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi><mml:mi>A</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>Q</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>K</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mi>o</mml:mi><mml:mi>f</mml:mi><mml:mi>t</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>x</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>Q</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:msubsup><mml:mi>K</mml:mi><mml:mn>1</mml:mn><mml:mi>T</mml:mi></mml:msubsup></mml:mrow><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>V</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></disp-formula></p>
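<p>The masking step applied after the Scale operation in the decoder can be illustrated as follows. The paper does not state which positions are masked, so this sketch assumes a causal mask (each position may attend only to itself and earlier positions); the function names are hypothetical.</p>

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (assumed causal rule)."""
    return [[i >= j for j in range(n)] for i in range(n)]

def masked_softmax(scaled_scores, mask):
    """Row-wise softmax that gives zero weight to positions where mask is False."""
    out = []
    for row, mask_row in zip(scaled_scores, mask):
        kept = [s if keep else NEG_INF for s, keep in zip(row, mask_row)]
        mx = max(kept)
        exps = [math.exp(s - mx) for s in kept]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy scaled scores for a sequence of length 3 (all equal before masking).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
```

<p>The masked positions receive exactly zero weight, and the remaining weights in each row still sum to one, so the subsequent multiplication by the value matrix is unchanged in form.</p>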
<p>The output features of the multi-head cross-attention computation are combined with the results of the masked multi-head attention computation through a residual connection and Add&#x0026;Norm. They then pass through a feedforward neural network, whose output is likewise combined with its input through a residual connection and Add&#x0026;Norm. Finally, the output <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is obtained through a linear layer and a Softmax function.</p>
<p>Similarly, the corresponding computation for the implicit feedback embedding vector yields the output <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:msub><mml:mi>h</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula>. The two probability distributions <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:msub><mml:mi>h</mml:mi><mml:mi>d</mml:mi></mml:msub></mml:math></inline-formula> are then averaged with equal (one-to-one) weights to synthesize the intermediate state <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>.</p>
</sec>
<sec id="s3_3_2">
<label>3.3.2</label>
<title>Auxiliary Loss Design</title>
<p>As noted above, the overall loss of the whole network is <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The auxiliary term is needed because feeding the user&#x2019;s time series directly into the model introduces new problems. If user behavior is regarded as a sequence, there is little difference between the user behavior sequence &#x201C;jeans&#x2013;harem pants&#x2013;wide-leg pants&#x201D; and &#x201C;harem pants&#x2013;jeans&#x2013;wide-leg pants&#x201D;. That is, user behavior sequences are not very sensitive to order. The problem is that the hidden state <inline-formula id="ieqn-51"><mml:math id="mml-ieqn-51"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> only captures the dependencies within the user behavior sequence and cannot effectively reflect user interest. If we rely only on the loss after the final fully connected layer, we can only learn the user&#x2019;s final comprehensive interest, while the hidden state <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> receives no effective supervision signal.</p>
<p>Therefore, we introduce an auxiliary loss to guide the learning of the intermediate states. Its structure is shown in <xref ref-type="fig" rid="fig-4">Fig. 4</xref> below.</p>
<fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>Auxiliary loss</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-4.tif"/>
</fig>
<p>As shown in the figure, this is a binary classification model used to assess the accuracy of interest extraction. We use the user&#x2019;s actual behavior at the next time step <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> as the positive example and a negatively sampled behavior <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>e</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup></mml:math></inline-formula> as the negative example. The inner product of each example with the extracted interest h(t) is fed into the auxiliary network to obtain the prediction, and the auxiliary loss is computed with the <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The formula for the auxiliary loss is:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>N</mml:mi></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:munder><mml:mo>&#x2211;</mml:mo><mml:mi>t</mml:mi></mml:munder><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> denotes the inner product, and <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> is the Sigmoid function. This loss design forces the interest sequence feature <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> to fit the user&#x2019;s actual behavior, so as to better capture the user&#x2019;s interest. The more similar <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is to <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, the larger the inner product, so <inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula> approaches 1 and <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> tends to 0. For the negative example, <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> likewise tends to 0, so the auxiliary loss <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> tends to 0, which conforms to the objective of minimizing <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.
Conversely, if <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is not similar to <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> tends to negative infinity and <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> also tends to negative infinity. In that case the auxiliary loss <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> tends to positive infinity, so such states are heavily penalized under the minimization objective. The final loss function is therefore <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:mi>L</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B1;</mml:mi><mml:mo>&#x2217;</mml:mo><mml:msub><mml:mi>L</mml:mi><mml:mrow><mml:mi>a</mml:mi><mml:mi>u</mml:mi><mml:mi>x</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
<p>The purpose of this design is to supervise the interest state at every time step. If only the last interest state were supervised, all hidden states would serve only the last state, and the extracted hidden states would be distorted.</p>
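<p>The auxiliary loss in Eq. (6) can be sketched in a few lines. The nested-list data layout (one list of time steps per user) and the toy vectors are our assumptions for illustration; a practical implementation would also clip the Sigmoid output away from 0 and 1 to avoid taking the logarithm of zero.</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def auxiliary_loss(hidden, positives, negatives):
    """Eq. (6): hidden[j][t] is h_t for user j, positives[j][t] is the actual next
    behavior embedding, and negatives[j][t] is a negatively sampled embedding."""
    n_users = len(hidden)
    total = 0.0
    for h_seq, pos_seq, neg_seq in zip(hidden, positives, negatives):
        for h, pos, neg in zip(h_seq, pos_seq, neg_seq):
            total += math.log(sigmoid(dot(h, pos)))
            total += math.log(1.0 - sigmoid(dot(h, neg)))
    return -total / n_users

# One user, one time step. An uninformative state gives sigma = 0.5 for both terms.
hidden = [[[1.0, 0.0]]]
loss_uninformative = auxiliary_loss(hidden, [[[0.0, 0.0]]], [[[0.0, 0.0]]])
# A state aligned with the positive example and opposed to the negative one.
loss_aligned = auxiliary_loss(hidden, [[[5.0, 0.0]]], [[[-5.0, 0.0]]])
```

<p>The aligned case yields a loss close to zero, while the uninformative case yields 2&#x2009;ln&#x2009;2, matching the qualitative behavior described above.</p>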
</sec>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Interest Evolving Layer</title>
<p>In the interest evolving layer, we first calculate the attention score <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> between the interest state and the target, which represents the degree of correlation between the interest sequence feature <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and the current candidate advertisement at the current time step. The larger the value, the more closely the current <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is related to the platform goal and the more attention it receives.</p>
<p>Here, <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msub><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is the intermediate score used to calculate the attention score <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>, <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:mi>q</mml:mi></mml:math></inline-formula> is the embedding vector of the current recommended target, <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math></inline-formula> and <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math></inline-formula> are the projection weight matrices, <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:mi>v</mml:mi></mml:math></inline-formula> is the context vector used to map to a scalar, <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mi>tanh</mml:mi></mml:math></inline-formula> adds nonlinearity to capture complex relationships, and <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>b</mml:mi></mml:math></inline-formula> is the bias term. The scores are then normalized to obtain the attention weights. The calculation is summarized as follows:
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msub><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mi>v</mml:mi><mml:mi>T</mml:mi></mml:msup><mml:mo>&#x22C5;</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mi>q</mml:mi><mml:mo>+</mml:mo><mml:mi>b</mml:mi><mml:mo stretchy="false">)</mml:mo><mml:mspace width="2em" /><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>T</mml:mi></mml:munderover><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>e</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
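<p>The attention score in Eq. (7) can be sketched as follows. The parameter values are toy numbers chosen for illustration (in the model they are learned), and the function names are hypothetical.</p>

```python
import math

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]

def attention_scores(hidden_states, q, W1, W2, v, b):
    """Eq. (7): e_t = v^T tanh(W1 h_t + W2 q + b), followed by softmax over time."""
    projected_q = matvec(W2, q)
    e = []
    for h in hidden_states:
        z = [math.tanh(a + c + d) for a, c, d in zip(matvec(W1, h), projected_q, b)]
        e.append(sum(vi * zi for vi, zi in zip(v, z)))
    mx = max(e)                                  # subtract max for numerical stability
    exps = [math.exp(x - mx) for x in e]
    total = sum(exps)
    return [x / total for x in exps]

# Toy parameters: W1 is the identity, W2 is zero, so e_t depends only on h_t here.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[0.0, 0.0], [0.0, 0.0]]
v = [1.0, 1.0]
b = [0.0, 0.0]
hidden_states = [[0.0, 0.0], [1.0, 1.0]]         # h_1, h_2
q = [1.0, 0.0]                                   # target embedding
alpha = attention_scores(hidden_states, q, W1, W2, v, b)
```

<p>The resulting weights sum to one over the time steps, and the state with the larger intermediate score receives the larger weight.</p>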
<p>Once the attention score is obtained, it is embedded into the update gate of the AUGRU. This layer models the interest evolution path related to the platform goal in a more targeted way, so as to obtain the final interest. Its structure is shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref> below.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>AUGRU structure</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-5.tif"/>
</fig>
<p>In AUGRU, the update gate <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:msub><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is dynamically adjusted by the attention weight <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>, which allows the model to selectively update the user&#x2019;s interest state according to its relevance to the current target. This mechanism enables the model to give enough attention to the latest input or changes while retaining important historical information, so as to track and predict changes in user interest more precisely.</p>
<p>In the calculation, the vectors of the user&#x2019;s two behaviors at time <italic>t</italic> are concatenated to obtain <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>d</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The GRU update gate <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:msub><mml:mi>z</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> then determines how much historical information is retained from the previous state <inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to the current state <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula>, based on the current input <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and the user interest state <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> at the previous time step. Applying the attention score to the update gate gives:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>u</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>u</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mspace width="2em" /><mml:mrow><mml:mover><mml:msub><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mi>&#x03B1;</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x2217;</mml:mo><mml:msubsup><mml:mi>u</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup></mml:math></disp-formula></p>
<p>We also need a candidate hidden state <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mrow><mml:mover><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, the key intermediate quantity used to update the current hidden state of the network. It lets the network preserve both long-term dependencies and short-term correlations when processing sequential data, and is calculated as a gated combination of the current input <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> and the previous state <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>:
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:mrow><mml:mover><mml:msub><mml:mi>h</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mo>=</mml:mo><mml:mi>tanh</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>h</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:math></inline-formula> is the reset gate of GRU (Gated Recurrent Unit), which determines how much information should be kept from the previous state <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> when calculating the candidate state, and is calculated as follows:
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msub><mml:mi>r</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">]</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>r</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula>where <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula> represents the Sigmoid function, which maps the linear combinations inside the update and reset gates into the range [0, 1], and <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:mi>tanh</mml:mi></mml:math></inline-formula> represents the hyperbolic tangent activation function, which introduces nonlinearity and helps the network capture complex patterns and relationships.</p>
<p>Combining the formulas above, the AUGRU state update can be expressed as follows:
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:msubsup><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msup><mml:mrow><mml:mover><mml:msub><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2217;</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>u</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup><mml:mo>&#x2217;</mml:mo><mml:msubsup><mml:mrow><mml:mover><mml:mi>h</mml:mi><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup></mml:math></disp-formula></p>
<p>The result is a vector <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:msubsup><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x2032;</mml:mi></mml:msubsup></mml:math></inline-formula>, which represents a high-dimensional representation of the user&#x2019;s interest state at that particular point in time. This interest state sequence reflects how the user&#x2019;s interest changes over time based on their actions and interactions, and can be directly used as input to subsequent recommendation models to predict the probability of a user&#x2019;s click rate or other actions for a recommendation item.</p>
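<p>As a hedged illustration, one AUGRU step can be sketched in plain Python. The sketch assumes the standard GRU reset gate and the attention-scaled update gate used by DIEN; the function names, the dictionary layout of <monospace>params</monospace>, and the list-based vectors are illustrative choices and are not taken from the original implementation.</p>
<code language="python"><![CDATA[
```python
import math


def sigmoid(v):
    """Logistic function used by the update and reset gates."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, vec, b):
    """Return W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b_k for row, b_k in zip(W, b)]


def augru_step(x_t, h_prev, alpha_t, params):
    """One attention-scaled GRU update (illustrative sketch).

    x_t     : current input vector (list of floats)
    h_prev  : previous interest state (list of floats)
    alpha_t : scalar attention score for this time step
    params  : dict with hypothetical keys W_u, b_u, W_r, b_r, W_h, b_h
    """
    xh = x_t + h_prev  # concatenation [x_t, h_{t-1}]
    # Update gate, then scaled by the attention score.
    u = [sigmoid(v) for v in affine(params["W_u"], xh, params["b_u"])]
    u_att = [alpha_t * u_k for u_k in u]
    # Reset gate computed from the same concatenated input.
    r = [sigmoid(v) for v in affine(params["W_r"], xh, params["b_r"])]
    # Candidate state uses the reset-gated previous state.
    xrh = x_t + [r_k * h_k for r_k, h_k in zip(r, h_prev)]
    h_cand = [math.tanh(v) for v in affine(params["W_h"], xrh, params["b_h"])]
    # Blend previous state and candidate with the attention-scaled gate.
    return [(1.0 - u_k) * h_k + u_k * c_k
            for u_k, h_k, c_k in zip(u_att, h_prev, h_cand)]
```
]]></code>
<p>In this sketch, an attention score of 0 leaves the previous interest state unchanged, which matches the intuition that behaviors irrelevant to the target should not move the interest state.</p>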
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Subsequent Processing</title>
<p>In this section, we provide a detailed exploration of the subsequent processing steps for obtaining user interest representation in the CAT-MFRec model.</p>
<p>Based on prior research on video recommendation, we have selected three metrics as platform goals to guide our model&#x2019;s learning: user time, user engagement, and user retention. The first two metrics are derived primarily from total user viewing time and the number of user interactions, while the third metric is influenced by the user&#x2019;s interest in the recommended content.</p>
<p>First, we analyze the user&#x2019;s interaction history with specific content on the platform over time, including viewing time and retention, to characterize user preferences. Together with contextual features (time, location, etc.) and user profile features (age, gender, interest tags, etc.), these data are one-hot encoded, passed through the embedding layer, and then concatenated. The concatenated vectors are then fed into a fully connected layer for linear transformation. An activation function follows the fully connected layer to introduce non-linearity, enabling the model to learn more complex functions. In this model, we employ PReLU alongside the Dice activation function.</p>
<p>The final feature vector is passed through the Sigmoid function of the output classifier, which converts the output to a value between 0 and 1 that serves as the predicted probability.</p>
<p>Its flow chart is shown in <xref ref-type="fig" rid="fig-6">Fig. 6</xref>.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Flow chart</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_58438-fig-6.tif"/>
</fig>
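<p>The activation and output steps described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: Dice is written in the form commonly given for DIN (a data-dependent gate computed from mini-batch statistics), the slope <monospace>alpha = 0.25</monospace> is a placeholder for a learned parameter, and all function names are hypothetical.</p>
<code language="python"><![CDATA[
```python
import math


def sigmoid(v):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def prelu(s, alpha=0.25):
    """PReLU: identity for positive inputs, slope alpha otherwise.

    alpha is learned in practice; 0.25 is only a placeholder here.
    """
    return s if s > 0 else alpha * s


def dice(batch, alpha=0.25, eps=1e-8):
    """Dice activation over a mini-batch of scalar pre-activations.

    Each value is gated by p = sigmoid((s - mean) / sqrt(var + eps)),
    so the switch point adapts to the batch statistics.
    """
    mean = sum(batch) / len(batch)
    var = sum((s - mean) ** 2 for s in batch) / len(batch)
    out = []
    for s in batch:
        p = sigmoid((s - mean) / math.sqrt(var + eps))
        out.append(p * s + (1.0 - p) * alpha * s)
    return out


# Example: activate a small batch, then squash one value to a probability.
activated = dice([-1.0, 0.5, 2.0])
probability = sigmoid(activated[-1])
```
]]></code>
<p>Unlike PReLU, whose switch point is fixed at zero, the Dice gate moves with the mean and variance of the inputs, which is the usual motivation for using it on features with shifting distributions.</p>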
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiment</title>
<p>We will conduct experiments to address three questions: 1. How does our improvement perform on a real dataset compared to existing recommendation methods? 2. How does the improved component impact the experiment? 3. How do the hyperparameters in the experimental setup influence the results?</p>
<sec id="s4_1">
<label>4.1</label>
<title>Experimental Environment Setup</title>
<sec id="s4_1_1">
<label>4.1.1</label>
<title>Datasets</title>
<p>We conduct extensive experiments using the real-world dataset KuaiRand<xref ref-type="fn" rid="fn1"><sup>1</sup></xref>
<fn id="fn1"><label>1</label><p><ext-link ext-link-type="uri" xlink:href="https://kuairand.com/">https://kuairand.com/</ext-link> (accessed on 22 October 2024).</p></fn>, from the recommendation logs of the video-sharing mobile application Kuaishou. The dataset covers a time span from 08 April 2022, to 08 May 2022. The dataset includes explicit feedback on user interaction behaviors, such as likes and comments, as well as implicit feedback based on viewing time. This combination enables effective capture of both short-term and long-term user interests, while the dataset&#x2019;s large scale supports model training. The dataset is derived from an actual short video platform, providing a high-fidelity research foundation for studying user interest evolution.</p>
<p>We divide the dataset into three parts: the first 15 days constitute the training set, the subsequent 10 days comprise the test set, and the final 5 days serve as the validation set. To enhance explicit feedback, we combine six types of explicit feedback (is-click, is-like, is-follow, is-comment, is-forward, is-hate) into a single label, &#x201C;clfcf.&#x201D; If &#x201C;is-hate&#x201D; equals 1, then &#x201C;clfcf&#x201D; is set to 0, indicating that the user dislikes the content. Conversely, if &#x201C;is-hate&#x201D; equals 0 and at least two of the first five items equal 1, &#x201C;clfcf&#x201D; is set to 1, indicating that the user likes the content. <xref ref-type="table" rid="table-1">Table 1</xref> lists the statistics of the preprocessed dataset.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Statistical details of the evaluation dataset</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Dataset</th>
<th>Users</th>
<th>Item</th>
<th>Interaction</th>
<th>Interaction density</th>
</tr>
</thead>
<tbody>
<tr>
<td>KuaiRand</td>
<td>27,284</td>
<td>7314</td>
<td>694,180</td>
<td>0.0805</td>
</tr>
</tbody>
</table>
</table-wrap>
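<p>The construction of the combined &#x201C;clfcf&#x201D; label can be sketched as below. The field names are written with underscores instead of hyphens, and the <monospace>default</monospace> argument covers the case the text leaves open (no &#x201C;is-hate&#x201D; flag and fewer than two positive signals); treating that case as 0 is an assumption of this sketch, not a rule stated above.</p>
<code language="python"><![CDATA[
```python
# The five positive explicit-feedback signals named in the text.
POSITIVE_FIELDS = ("is_click", "is_like", "is_follow", "is_comment", "is_forward")


def clfcf(row, default=0):
    """Combine six explicit-feedback flags into one binary label.

    row     : mapping from field name to 0/1
    default : label for the case not specified in the text (assumption)
    """
    if row["is_hate"] == 1:
        return 0  # an explicit dislike overrides every other signal
    if sum(row[field] for field in POSITIVE_FIELDS) >= 2:
        return 1  # at least two positive signals count as a like
    return default


example = {"is_click": 1, "is_like": 1, "is_follow": 0,
           "is_comment": 0, "is_forward": 0, "is_hate": 0}
label = clfcf(example)  # two positive signals and no dislike
```
]]></code>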
</sec>
<sec id="s4_1_2">
<label>4.1.2</label>
<title>Evaluation Index</title>
<p>To thoroughly assess the compared methods, we evaluate model performance in terms of recommendation accuracy, user retention, engagement, and playtime using several metrics. First, to measure recommendation accuracy, we use the NDCG@k metric, which is widely adopted in recommendation systems to assess ranking quality [<xref ref-type="bibr" rid="ref-30">30</xref>]. NDCG values typically range from 0 to 1, where values close to 1 indicate a recommendation list of very high quality, while values close to 0 indicate poor quality. NDCG is calculated as follows:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:mi>N</mml:mi><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:msub><mml:mi>G</mml:mi><mml:mi>K</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:msub><mml:mi>G</mml:mi><mml:mi>K</mml:mi></mml:msub></mml:mrow><mml:mrow><mml:mi>I</mml:mi><mml:mi>D</mml:mi><mml:mi>C</mml:mi><mml:msub><mml:mi>G</mml:mi><mml:mi>K</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
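<p>A minimal sketch of the NDCG@k computation is given below. The text defines only the ratio of DCG to IDCG, so the logarithmic position discount and the linear gain used here are the common textbook choices and should be read as assumptions; the function names are illustrative.</p>
<code language="python"><![CDATA[
```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items.

    Position i (0-based) is discounted by log2(i + 2), so the top
    item has discount log2(2) = 1.
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the model ranked the items.
score = ndcg_at_k([0, 1, 1, 0, 1], k=3)
```
]]></code>
<p>A list that already places all relevant items first scores exactly 1, and a list with no relevant items is assigned 0 by convention in this sketch.</p>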
<p>To represent the three platform goals: user time, user engagement, and user retention, we formulated three metrics: <italic>DS</italic> (Duration Standard), <italic>IR</italic> (Interaction Ratio), and <italic>PTR</italic> (Play Time Ratio).</p>
<p>The user retention metric, Duration Std@k (DS@k), is the standard deviation of the durations of the top-k recommended items. It measures how much the recommended durations differ, and thus whether the recommender system can meet users&#x2019; varied time needs. A high value of DS@k indicates significant differences in the duration of recommended content, which may better satisfy diverse user needs. For the user&#x2019;s playback time, we use the ratio of video playback time in the user&#x2019;s <inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>k</mml:mi></mml:math></inline-formula> recommendation list to the total playback time. The baseline is set at 80%, so a ratio equal to the baseline corresponds to a value of 1. A value greater than 1 indicates that the playback time of the recommended content exceeds the baseline, which suggests that the user is interested in these recommendations; we label this metric Play Time Ratio@k (PTR@k). For engagement, we focus on the proportion of interactions between users and videos in the top-k recommendation list. The same 80% baseline is normalized to 1, so a value greater than 1 indicates that the recommended content achieves higher interaction than the baseline, signaling good user acceptance. We label this measure Interaction Ratio@k (IR@k). The formulas are as follows:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>K</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msqrt><mml:mfrac><mml:mn>1</mml:mn><mml:mi>K</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>d</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mover><mml:mi>d</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mn>2</mml:mn></mml:msup></mml:msqrt><mml:mspace width="2em" /><mml:mi>P</mml:mi><mml:mi>T</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>@</mml:mo></mml:mrow><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>U</mml:mi><mml:mi>p</mml:mi></mml:msub><mml:mrow><mml:mn>0.8</mml:mn><mml:msub><mml:mi>A</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:mfrac><mml:mspace width="2em" /><mml:mi>I</mml:mi><mml:mi>R</mml:mi><mml:mrow><mml:mo>@</mml:mo></mml:mrow><mml:mi>K</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:msub><mml:mi>U</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mrow><mml:mn>0.8</mml:mn><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:msub><mml:mi>&#x03C3;</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:math></inline-formula> denotes the standard deviation of the durations of the <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>k</mml:mi></mml:math></inline-formula> recommendation items, <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mi>d</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math></inline-formula> represents the duration of the <italic>i</italic>-th recommendation item, and <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mrow><mml:mover><mml:mi>d</mml:mi><mml:mo stretchy="false">&#x00AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is the average duration of the <inline-formula id="ieqn-104"><mml:math id="mml-ieqn-104"><mml:mi>T</mml:mi><mml:mi>o</mml:mi><mml:mi>p</mml:mi><mml:mi>k</mml:mi></mml:math></inline-formula> recommendations.</p>
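<p>The three platform metrics can be sketched as follows. The symbols for playback time and interactions are not defined in detail in the text, so the argument names below (the top-k quantity and the overall quantity) are assumptions of this sketch; the population standard deviation is used because the formula divides by K.</p>
<code language="python"><![CDATA[
```python
from statistics import pstdev


def ds_at_k(durations):
    """DS@k: population standard deviation of the top-k item durations."""
    return pstdev(durations)


def ratio_at_k(top_k_value, overall_value, baseline=0.8):
    """PTR@k or IR@k: top-k quantity relative to 80% of the overall quantity.

    top_k_value   : playback time (PTR) or interaction count (IR) in the top-k list
    overall_value : the corresponding total (assumed meaning of the denominator)
    """
    return top_k_value / (baseline * overall_value)


ds = ds_at_k([10.0, 20.0, 30.0])   # durations in seconds, illustrative values
ptr = ratio_at_k(96.0, 100.0)      # above 1: playback exceeds the baseline
ir = ratio_at_k(72.0, 100.0)       # below 1: interaction under the baseline
```
]]></code>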
</sec>
<sec id="s4_1_3">
<label>4.1.3</label>
<title>Implementation Details</title>
<p>In our experiments, the dimension of the embedding layer in CAT-MFRec is set to 128, which is then increased to 256 in the linear layer of the Transformer. The number of Transformer layers is set to 2, with 8 heads, each having a dimension of 32, and the feedforward neural network dimension is 1024. The hidden layer dimension in DIEN is also set to 128, consistent with the model dimension. Adam is chosen as the optimizer for DIEN; it adjusts the model&#x2019;s weights and parameters to minimize the loss function, so the model gradually improves its performance during training. The search ranges are [1e-2, 1e-4, 1e-6, 0] for the auxiliary loss weight decay, [1e-1, 1e-2, 1e-3, 1e-4, 1e-5] for the learning rate, and [512, 1024, 2048] for the micro-batch size, and the remaining tunable parameter is adapted within the range [0.0&#x2013;1.0, default &#x003D; 0.5]. Training runs for 10 epochs with an early stopping policy with a patience of 5. Additionally, the hyperparameter k for the top-k list metric is set to 10.</p>
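<p>For reference, the settings listed above can be collected in a small configuration sketch. The dictionary keys are hypothetical names, and enumerating the search ranges as a full grid is an assumption, since the text does not state how the ranges were searched.</p>
<code language="python"><![CDATA[
```python
from itertools import product

# Fixed settings reported in the text (key names are illustrative).
CONFIG = {
    "embedding_dim": 128,
    "transformer_dim": 256,
    "num_layers": 2,
    "num_heads": 8,
    "head_dim": 32,
    "ffn_dim": 1024,
    "dien_hidden_dim": 128,
    "max_epochs": 10,   # number of training rounds
    "patience": 5,      # early stopping patience
    "top_k": 10,
}

# Search ranges reported in the text.
SEARCH_SPACE = {
    "weight_decay": [1e-2, 1e-4, 1e-6, 0],
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5],
    "batch_size": [512, 1024, 2048],
}


def grid(space):
    """Enumerate every combination of the search ranges."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[key] for key in keys))]


# The 8 heads of dimension 32 together span the 256-dimensional layer.
assert CONFIG["num_heads"] * CONFIG["head_dim"] == CONFIG["transformer_dim"]
candidates = grid(SEARCH_SPACE)
```
]]></code>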
</sec>
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Performance Comparison</title>
<p>We evaluated the overall performance of currently popular recommender system models under the same experimental setup; the results are summarized in <xref ref-type="table" rid="table-2">Table 2</xref>.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Model performance comparison</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>NDCG@10</th>
<th>DS@10</th>
<th>PTR@10</th>
<th>IR@10</th>
</tr>
</thead>
<tbody>
<tr>
<td>CAT-MFRec</td>
<td>0.8143</td>
<td>77.2623</td>
<td>1.152</td>
<td>1.135</td>
</tr>
<tr>
<td>DIEN</td>
<td>0.7814</td>
<td>75.4512</td>
<td>1.088</td>
<td>1.083</td>
</tr>
<tr>
<td>DIN</td>
<td>0.7684</td>
<td>70.1514</td>
<td>1.069</td>
<td>1.058</td>
</tr>
<tr>
<td>SASRec</td>
<td>0.7325</td>
<td>66.51</td>
<td>1.004</td>
<td>1.003</td>
</tr>
<tr>
<td>BERT4Rec</td>
<td>0.6901</td>
<td>65.23</td>
<td>1.006</td>
<td>0.992</td>
</tr>
<tr>
<td>AFM</td>
<td>0.6425</td>
<td>42.33</td>
<td>0.978</td>
<td>0.966</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The CAT-MFRec model demonstrates superior performance across all evaluated aspects, achieving an NDCG@10 score of 0.8143, the highest among all models. This indicates that our model excels in the relevance and ranking quality of the recommendation list. A high NDCG value indicates that the recommender system effectively ranks the content most likely to interest the user at the top of the list. This enhances user satisfaction, as users are more likely to quickly find relevant content.</p>
<p>For the other three metrics, CAT-MFRec achieves a score of 77.2623 on DS@10, which reflects the diversity of recommended content durations. This high score indicates its ability to provide users with a wide range of content options to accommodate various viewing preferences. By offering diverse content, the platform can keep users engaged for longer periods, which is essential for enhancing overall user retention.</p>
<p>The score for IR@10 is 1.135, which surpasses that of the other models and highlights CAT-MFRec&#x2019;s ability to stimulate user interaction. Higher interaction rates are typically linked to better content quality and user satisfaction, suggesting that CAT-MFRec excels in personalized content matching. Frequent interaction with recommended content can foster greater user loyalty, which is essential to building a long-term relationship between users and the platform.</p>
<p>CAT-MFRec also achieves the highest performance on the PTR@10 metric, with a score of 1.152. This indicates that CAT-MFRec effectively understands and meets users&#x2019; deeper interests, which is crucial for enhancing user engagement and platform retention. As users invest more time on the platform, they become more integrated into the ecosystem, reducing the likelihood of churn. Furthermore, extended viewing times can lead to increased monetization opportunities, including higher ad impressions and improved subscription retention.</p>
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Ablation Experiment</title>
<p>To study the impact of each component, we introduce the following variants: G-C: replace the Transformer with a GRU and keep Cross-Attention; T-FC: keep the Transformer but disable Cross-Attention; G-FC: replace the Transformer with a GRU and disable Cross-Attention; RD: remove the viewing duration from the input; RC: remove explicit feedback from the input. The results are shown in <xref ref-type="table" rid="table-3">Table 3</xref>.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Results of ablation experiments</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Model</th>
<th>NDCG@10</th>
<th>DS@10</th>
<th>PTR@10</th>
<th>IR@10</th>
<th>Epoch time</th>
</tr>
</thead>
<tbody>
<tr>
<td>CAT-MFRec</td>
<td>0.8143</td>
<td>77.2623</td>
<td>1.152</td>
<td>1.135</td>
<td>289 s</td>
</tr>
<tr>
<td>G-C</td>
<td>0.8052</td>
<td>76.442</td>
<td>1.137</td>
<td>1.125</td>
<td>412 s</td>
</tr>
<tr>
<td>T-FC</td>
<td>0.7851</td>
<td>75.453</td>
<td>0.928</td>
<td>0.964</td>
<td>355 s</td>
</tr>
<tr>
<td>G-FC</td>
<td>0.7725</td>
<td>74.2623</td>
<td>1.088</td>
<td>1.083</td>
<td>435 s</td>
</tr>
<tr>
<td>RD</td>
<td>0.4676</td>
<td>43.228</td>
<td>&#x2013;</td>
<td>0.686</td>
<td>281 s</td>
</tr>
<tr>
<td>RC</td>
<td>0.4651</td>
<td>42.768</td>
<td>0.712</td>
<td>&#x2013;</td>
<td>271 s</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The results indicate that the cross-attention mechanism and the Transformer module are crucial to the CAT-MFRec model, as they effectively integrate user behavior data and dynamically update user interest status. The performance differences observed when these components were removed or modified are substantial. For instance, removing the cross-attention mechanism resulted in a marked decrease in both recommendation relevance and overall ranking quality, underscoring its essential role in capturing complex user behavior interactions. Likewise, the absence of the Transformer module significantly affected the model&#x2019;s ability to track changes in user interest over time, leading to decreased prediction accuracy.</p>
<p>Among the architectural variants, CAT-MFRec exhibits the highest training efficiency (289 s per epoch). Even when the cross-attention mechanism is disabled, T-FC still outperforms G-FC in both NDCG@10 (0.7851 vs. 0.7725) and epoch time (355 s vs. 435 s), primarily because the Transformer processes the two types of feedback in parallel. Additionally, the experiment confirms the significance of mixed feedback in enhancing the model&#x2019;s understanding of user preferences: removing either the viewing duration (RD) or the explicit feedback (RC) reduces NDCG@10 to below 0.47. These performance differences underscore the critical contributions of each component to the overall model effectiveness.</p>
<p>In conclusion, these experiments not only validate the rationale behind the original model design but also highlight the specific contributions of individual components, providing valuable insights for future optimization and evolution. Future research could focus on refining the cross-attention mechanism to more effectively capture user behavior patterns or exploring alternative architectures.</p>
</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Hyperparameter Experiments</title>
<p>Hyperparameter tuning is essential in deep learning, as selecting the right hyperparameters improves prediction accuracy and generalization while preventing overfitting. In this study, we performed a systematic hyperparameter optimization experiment to investigate the specific impact of various hyperparameter configurations on model performance and to identify the optimal parameter combination.</p>
<p>To achieve this goal, we evaluated several parameters: the dimension of the embedding vector <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:mi>d</mml:mi></mml:math></inline-formula>, the learning rate <inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula>, the number of heads in the multi-head attention mechanism of the Transformer <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mi>h</mml:mi></mml:math></inline-formula>, and the size of the hidden layer <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The default values for the parameters in the experiment are <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:mi>d</mml:mi><mml:mo>=</mml:mo><mml:mn>128</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:mi>&#x03B7;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>, <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mi>h</mml:mi><mml:mo>=</mml:mo><mml:mn>8</mml:mn></mml:math></inline-formula>, and <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mi>d</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>128</mml:mn></mml:math></inline-formula>. In this study, we focused solely on evaluating the final NDCG metric. The hyperparameter settings and results are presented in <xref ref-type="table" rid="table-4">Table 4</xref>.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Parameter settings and results</title>
</caption>
<table frame="hsides">
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
<col align="left"/>
</colgroup>
<thead>
<tr>
<th>Parameter</th>
<th align="center" colspan="5">Settings (upper row) and NDCG (lower row)</th>
</tr>
</thead>
<tbody>
<tr>
<td><inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:mi>d</mml:mi></mml:math></inline-formula></td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
<td>512</td>
</tr>
<tr>
<td></td>
<td>0.8025</td>
<td>0.8089</td>
<td>0.8131</td>
<td>0.8032</td>
<td>0.7994</td>
</tr>
<tr>
<td><inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:mi>&#x03B7;</mml:mi></mml:math></inline-formula></td>
<td>0.1</td>
<td>0.2</td>
<td>0.3</td>
<td>0.4</td>
<td>0.5</td>
</tr>
<tr>
<td></td>
<td>0.8122</td>
<td>0.8101</td>
<td>0.8004</td>
<td>0.7974</td>
<td>0.7841</td>
</tr>
<tr>
<td><inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:mi>h</mml:mi></mml:math></inline-formula></td>
<td>8</td>
<td>12</td>
<td>16</td>
<td>32</td>
<td>64</td>
</tr>
<tr>
<td></td>
<td>0.8133</td>
<td>0.8024</td>
<td>0.8011</td>
<td>0.8007</td>
<td>0.8008</td>
</tr>
<tr>
<td><inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mrow><mml:mrow><mml:mtext>hidden</mml:mtext></mml:mrow></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula></td>
<td>32</td>
<td>64</td>
<td>128</td>
<td>256</td>
<td>512</td>
</tr>
<tr>
<td></td>
<td>0.8022</td>
<td>0.8076</td>
<td>0.8129</td>
<td>0.8066</td>
<td>0.8011</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The model performs best when the embedding vector dimension is 128; increasing the embedding dimension further results in decreased performance. This suggests that a larger embedding vector can increase model complexity and degrade performance. The model performance shows little difference between learning rates of 0.1 and 0.4; however, there is an overall downward trend, particularly at a learning rate of 0.5. This indicates that a higher learning rate may cause the model to update too rapidly, hindering convergence. The model achieves optimal performance with 8 heads, and performance declines slightly when the number of heads increases to 12 or more. This suggests that an excessive number of heads increases model complexity without yielding performance gains. The model performs optimally with a hidden layer size of 128; further increases in hidden layer size lead to a slight decline in performance. This may be attributed to the reduced generalization ability resulting from excessively large hidden layers.</p>
<p>The results of the hyperparameter experiments indicate that the optimal hyperparameter combination for the CAT-MFRec model includes an embedding vector dimension of 128, a learning rate of 0.1, 8 Transformer heads, and a hidden layer size of 128. These hyperparameters provide an optimal balance between model complexity and performance, enhancing the model&#x2019;s generalization ability. As the experimental parameters increase, the model&#x2019;s performance does not significantly improve and may even decline in some cases. This may result from the increased complexity and computational demands associated with larger parameter values, and it shows that the model benefits from balanced settings rather than from larger ones.</p>
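<p>The selection of the best setting per hyperparameter can be reproduced from the reported NDCG values with a short sketch. The values are copied from <xref ref-type="table" rid="table-4">Table 4</xref>; the dictionary layout and key names are illustrative.</p>
<code language="python"><![CDATA[
```python
# NDCG per setting, copied from Table 4 (one dictionary per hyperparameter).
TABLE_4 = {
    "d": {32: 0.8025, 64: 0.8089, 128: 0.8131, 256: 0.8032, 512: 0.7994},
    "eta": {0.1: 0.8122, 0.2: 0.8101, 0.3: 0.8004, 0.4: 0.7974, 0.5: 0.7841},
    "h": {8: 0.8133, 12: 0.8024, 16: 0.8011, 32: 0.8007, 64: 0.8008},
    "d_hidden": {32: 0.8022, 64: 0.8076, 128: 0.8129, 256: 0.8066, 512: 0.8011},
}

# For each hyperparameter, keep the setting with the highest NDCG.
best = {name: max(scores, key=scores.get) for name, scores in TABLE_4.items()}
```
]]></code>
<p>Because each hyperparameter is varied while the others stay at their defaults, this picks the best value per axis and does not test interactions between hyperparameters.</p>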
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusions</title>
<p>Addressing the limitations of existing recommender systems in handling user feedback, this paper proposes a hybrid model that combines both explicit and implicit user feedback. The CAT-MFRec (Cross Attention Transformer-Mixed Feedback Recommendation) model aims to enhance user experience on short video platforms. Short video platforms like TikTok and Kuaishou have large user bases, making an effective recommendation system essential for enhancing user retention, improving engagement, and increasing watch time. The model integrates explicit and implicit user feedback to capture user interest drift and dynamic changes more accurately. Using the cross-attention mechanism and the Transformer architecture, CAT-MFRec addresses the long-distance dependence problem in sequence data and improves the accuracy of feature extraction and interest prediction. Experimental results indicate that, compared with traditional recommendation methods and other deep learning models, the proposed model achieves better results on the key performance indicators of user engagement, retention, and viewing time in short video recommendation scenarios.</p>
</sec>
</body>
<back>
<ack><p>I would like to express my heartfelt gratitude to all those who contributed to this paper. Their dedication and insights were crucial in shaping the outcomes of this work.</p>
</ack>
<sec><title>Funding Statement</title>
<p>This research was partially supported by the National Natural Science Foundation of China (62072416), the Key Research and Development Special Project of Henan Province (221111210500), and the Key Technologies R&#x0026;D Program of Henan Province (232102211053, 242102211071).</p>
</sec>
<sec><title>Author Contributions</title>
<p>The authors confirm their contributions to the paper as follows: study conception and design: Jianwei Zhang, Zhishang Zhao; draft manuscript preparation: Zhishang Zhao, Zengyu Cai; data collection: Yahui Sun; analysis and interpretation of results: Yuan Feng, Liang Zhu. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability"><title>Availability of Data and Materials</title>
<p>The data and materials used in this study originate from publicly available databases and previously published studies, which are cited throughout the text and listed in the references.</p>
</sec>
<sec><title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement"><title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>Y. H.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>T. J.</given-names> <surname>Gu</surname></string-name>, and <string-name><given-names>S. Y.</given-names> <surname>Wang</surname></string-name></person-group>, &#x201C;<article-title>Causes and characteristics of short video platform internet community taking the TikTok short video application as an example</article-title>,&#x201D; in <conf-name>2019 IEEE Int. Conf. Consum. Elect.-Taiwan (ICCE-TW)</conf-name>, <publisher-loc>Yilan, Taiwan</publisher-loc>, <year>2020</year>. doi: <pub-id pub-id-type="doi">10.1109/ICCE-TW46550.2019.8992021</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>H.</given-names> <surname>Ko</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Lee</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Park</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Choi</surname></string-name></person-group>, &#x201C;<article-title>A survey of recommendation systems: Recommendation models, techniques, and application fields</article-title>,&#x201D; <source>Electronics</source>, vol. <volume>11</volume>, no. <issue>1</issue>, <year>2022</year>, Art. no. <comment>141</comment>. doi: <pub-id pub-id-type="doi">10.3390/electronics11010141</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>E.</given-names> <surname>Raad</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Chbeir</surname></string-name>, and <string-name><given-names>A.</given-names> <surname>Dipanda</surname></string-name></person-group>, &#x201C;<article-title>User profile matching in social networks</article-title>,&#x201D; in <conf-name>2010 13th Int. Conf. Netw.-Based Inf. Syst.</conf-name>, <year>2010</year>, pp. <fpage>297</fpage>&#x2013;<lpage>304</lpage>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Yuan</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Xia</surname></string-name>, and <string-name><given-names>Q.</given-names> <surname>Ye</surname></string-name></person-group>, &#x201C;<article-title>The effect of advertising strategies on a short video platform: Evidence from TikTok</article-title>,&#x201D; <source>Ind. Manag. Data Syst.</source>, vol. <volume>122</volume>, no. <issue>8</issue>, pp. <fpage>1956</fpage>&#x2013;<lpage>1974</lpage>, <year>2022</year>. doi: <pub-id pub-id-type="doi">10.1108/IMDS-12-2021-0754</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>X.</given-names> <surname>Su</surname></string-name> and <string-name><given-names>T. M.</given-names> <surname>Khoshgoftaar</surname></string-name></person-group>, &#x201C;<article-title>A survey of collaborative filtering techniques</article-title>,&#x201D; <source>Adv. Artif. Intell.</source>, vol. <volume>2009</volume>, no. <issue>1</issue>, <year>2009</year>, Art. no. <comment>421425</comment>. doi: <pub-id pub-id-type="doi">10.1155/2009/421425</pub-id>.</mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Zhang</surname></string-name></person-group>, &#x201C;<article-title>A matrix decomposition and its applications</article-title>,&#x201D; <source>Linear Multilinear A.</source>, vol. <volume>63</volume>, no. <issue>10</issue>, pp. <fpage>2033</fpage>&#x2013;<lpage>2042</lpage>, <year>2015</year>. doi: <pub-id pub-id-type="doi">10.1080/03081087.2014.933219</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>M.</given-names> <surname>Naumov</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Deep learning recommendation model for personalization and recommendation systems</article-title>,&#x201D; <year>2019</year>, <comment>arXiv:1906.00091</comment>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Xiao</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Ye</surname></string-name>, <string-name><given-names>X.</given-names> <surname>He</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Wu</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Chua</surname></string-name></person-group>, &#x201C;<article-title>Attentional factorization machines: Learning the weight of feature interactions via attention networks</article-title>,&#x201D; <year>2017</year>, <comment>arXiv:1708.04617</comment>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W. -C.</given-names> <surname>Kang</surname></string-name> and <string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name></person-group>, &#x201C;<article-title>Self-attentive sequential recommendation</article-title>,&#x201D; in <conf-name>2018 IEEE Int. Conf. Data Min. (ICDM)</conf-name>, <publisher-loc>Singapore</publisher-loc>, <year>2018</year>, pp. <fpage>197</fpage>&#x2013;<lpage>206</lpage>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Sun</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer</article-title>,&#x201D; in <conf-name>Proc. 28th ACM Int. Conf. Inform. Know. Manag., CIKM &#x2019;19</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <publisher-name>Association for Computing Machinery</publisher-name>, <year>2019</year>, pp. <fpage>1441</fpage>&#x2013;<lpage>1450</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3357384.3357895</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Zhou</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Deep interest network for click-through rate prediction</article-title>,&#x201D; in <conf-name>Proc. 24th ACM SIGKDD Int. Conf. Know. Disc. &#x0026; Data Min., KDD &#x2019;18</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <publisher-name>Association for Computing Machinery</publisher-name>, <year>2018</year>, pp. <fpage>1059</fpage>&#x2013;<lpage>1068</lpage>. doi: <pub-id pub-id-type="doi">10.1145/3219819.3219823</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Zhou</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Deep interest evolution network for click-through rate prediction</article-title>,&#x201D; <source>Proc. AAAI Conf. Artif. Intell.</source>, vol. <volume>33</volume>, no. <issue>1</issue>, pp. <fpage>5941</fpage>&#x2013;<lpage>5948</lpage>, <year>Jul. 2019</year>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33015941</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>W.</given-names> <surname>Xu</surname></string-name>, <string-name><given-names>H.</given-names> <surname>He</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Tan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Lang</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Guo</surname></string-name></person-group>, &#x201C;<article-title>Deep interest with hierarchical attention network for click-through rate prediction</article-title>,&#x201D; in <conf-name>Proc. 43rd Int. ACM SIGIR</conf-name>, <year>2020</year>, pp. <fpage>1905</fpage>&#x2013;<lpage>1908</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Feng</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Deep session interest network for click-through rate prediction</article-title>,&#x201D; <year>2019</year>, <comment>arXiv:1905.06482</comment>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>ShuTing</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Deep time-stream framework for click-through rate prediction by tracking interest evolution</article-title>,&#x201D; <source>Proc. AAAI Conf. Artif. Intell.</source>, vol. <volume>34</volume>, no. <issue>4</issue>, pp. <fpage>5726</fpage>&#x2013;<lpage>5733</lpage>, <year>Apr. 2020</year>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v34i04.6028</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>G.</given-names> <surname>Jawaheer</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Weller</surname></string-name>, and <string-name><given-names>P.</given-names> <surname>Kostkova</surname></string-name></person-group>, &#x201C;<article-title>Modeling user preferences in recommender systems: A classification framework for explicit and implicit user feedback</article-title>,&#x201D; <source>ACM TIIS</source>, vol. <volume>4</volume>, no. <issue>2</issue>, pp. <fpage>1</fpage>&#x2013;<lpage>26</lpage>, <year>Jun. 2014</year>. doi: <pub-id pub-id-type="doi">10.1145/2512208</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>R. G.</given-names> <surname>Sakthivelan</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Rajendran</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Thangavel</surname></string-name></person-group>, &#x201C;<article-title>Retraction note: A video analysis on user feedback based recommendation using A-FP hybrid algorithm</article-title>,&#x201D; <source>Multimed. Tools Appl.</source>, vol. <volume>82</volume>, no. <issue>10</issue>, <year>2022</year>, Art. no. <comment>15923</comment>. doi: <pub-id pub-id-type="doi">10.1007/s11042-022-13866-0</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>L.</given-names> <surname>Tian</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Understanding user behavior at scale in a mobile video chat application</article-title>,&#x201D; in <conf-name>Proc. 2013 ACM Int. Joint Conf. Pervasive Ubiq. Comput., UbiComp &#x2019;13</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <publisher-name>Association for Computing Machinery</publisher-name>, <year>2013</year>, pp. <fpage>647</fpage>&#x2013;<lpage>656</lpage>. doi: <pub-id pub-id-type="doi">10.1145/2493432.2493488</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>T.</given-names> <surname>Jia</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Wang</surname></string-name>, and <string-name><given-names>B. K.</given-names> <surname>Szymanski</surname></string-name></person-group>, &#x201C;<article-title>Quantifying patterns of research-interest evolution</article-title>,&#x201D; <source>Nat. Hum. Behav.</source>, vol. <volume>1</volume>, no. <issue>4</issue>, <year>2017</year>, Art. no. <comment>0078</comment>. doi: <pub-id pub-id-type="doi">10.1038/s41562-017-0078</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>C. -M.</given-names> <surname>Au Yeung</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Gibbins</surname></string-name>, and <string-name><given-names>N.</given-names> <surname>Shadbolt</surname></string-name></person-group>, &#x201C;<article-title>Multiple interests of users in collaborative tagging systems</article-title>,&#x201D; in <conf-name>Weav. Serv. People World Wide Web</conf-name>, <year>2009</year>, pp. <fpage>255</fpage>&#x2013;<lpage>274</lpage>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>CauseRec: Counterfactual user sequence synthesis for sequential recommendation</article-title>,&#x201D; in <conf-name>Proc. 44th Int. ACM SIGIR Conf. Res. Develop. Inform. Retr., SIGIR &#x2019;21</conf-name>, <publisher-loc>New York, NY, USA</publisher-loc>, <publisher-name>Association for Computing Machinery</publisher-name>, <year>2021</year>, pp. <fpage>367</fpage>&#x2013;<lpage>377</lpage>.</mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Li</surname></string-name>, and <string-name><given-names>C.</given-names> <surname>Zeng</surname></string-name></person-group>, &#x201C;<article-title>Integrating user short-term intentions and long-term preferences in heterogeneous hypergraph networks for sequential recommendation</article-title>,&#x201D; <source>Inform. Process. Manag.</source>, vol. <volume>61</volume>, no. <issue>3</issue>, <year>2024</year>, Art. no. <comment>103680</comment>. doi: <pub-id pub-id-type="doi">10.1016/j.ipm.2024.103680</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>K.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Qian</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Chen</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>Q. V. H.</given-names> <surname>Nguyen</surname></string-name>, and <string-name><given-names>H.</given-names> <surname>Yin</surname></string-name></person-group>, &#x201C;<article-title>Where to go next: Modeling long- and short-term user preferences for point-of-interest recommendation</article-title>,&#x201D; <source>Proc. AAAI Conf. Artif. Intell.</source>, vol. <volume>34</volume>, no. <issue>1</issue>, pp. <fpage>214</fpage>&#x2013;<lpage>221</lpage>, <year>Apr. 2020</year>. doi: <pub-id pub-id-type="doi">10.1609/aaai.v34i01.5353</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>J.</given-names> <surname>Shao</surname></string-name></person-group>, &#x201C;<article-title>Effectiveness of explicit and implicit corrective feedback in a video-based SCMC environment</article-title>,&#x201D; <source>Int. J. Linguist. Trans. Stud.</source>, vol. <volume>3</volume>, no. <issue>3</issue>, pp. <fpage>15</fpage>&#x2013;<lpage>28</lpage>, <year>Aug. 2022</year>. doi: <pub-id pub-id-type="doi">10.36892/ijlts.v3i3.249</pub-id>.</mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>D.</given-names> <surname>Li</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>CARM: Confidence-aware recommender model via review representation learning and historical rating behavior in the online platforms</article-title>,&#x201D; <source>Neurocomputing</source>, vol. <volume>455</volume>, pp. <fpage>283</fpage>&#x2013;<lpage>296</lpage>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.1016/j.neucom.2021.03.122</pub-id>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>F.</given-names> <surname>Daneshfar</surname></string-name></person-group>, &#x201C;<article-title>Enhancing low-resource sentiment analysis: A transfer learning approach</article-title>,&#x201D; <source>Passer J. Basic and Appl. Sci.</source>, vol. <volume>6</volume>, no. <issue>2</issue>, pp. <fpage>265</fpage>&#x2013;<lpage>274</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.24271/psr.2024.440793.1484</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>N.</given-names> <surname>Ranjbar</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Momtazi</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Homayoonpour</surname></string-name></person-group>, &#x201C;<article-title>Explaining recommendation system using counterfactual textual explanations</article-title>,&#x201D; <source>Mach. Learn.</source>, vol. <volume>113</volume>, no. <issue>4</issue>, pp. <fpage>1989</fpage>&#x2013;<lpage>2012</lpage>, <year>2024</year>. doi: <pub-id pub-id-type="doi">10.1007/s10994-023-06390-1</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name> <etal>et al.</etal></person-group>, &#x201C;<article-title>Attention is all you need</article-title>,&#x201D; <year>2017</year>, <comment>arXiv:1706.03762</comment>.</mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Sun</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Qiu</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Zhang</surname></string-name>, and <string-name><given-names>L. V.</given-names> <surname>Gool</surname></string-name></person-group>, &#x201C;<article-title>Transformer in convolutional neural networks</article-title>,&#x201D; <source>Comput. Sci.-Comput. Vis. Pattern Recognit.</source>, vol. <volume>3</volume>, <year>2021</year>. doi: <pub-id pub-id-type="doi">10.48550/arXiv.2106.03180</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><given-names>A.</given-names> <surname>Rashed</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Elsayed</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Schmidt-Thieme</surname></string-name></person-group>, &#x201C;<article-title>Context and attribute-aware sequential recommendation via cross-attention</article-title>,&#x201D; in <conf-name>Proc. 16th ACM CRS, RecSys &#x2019;22</conf-name>, <year>2022</year>, pp. <fpage>71</fpage>&#x2013;<lpage>80</lpage>.</mixed-citation></ref>
</ref-list>
</back></article>