<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.1 20151215//EN" "http://jats.nlm.nih.gov/publishing/1.1/JATS-journalpublishing1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xml:lang="en" article-type="research-article" dtd-version="1.1">
<front>
<journal-meta>
<journal-id journal-id-type="pmc">CMC</journal-id>
<journal-id journal-id-type="nlm-ta">CMC</journal-id>
<journal-id journal-id-type="publisher-id">CMC</journal-id>
<journal-title-group>
<journal-title>Computers, Materials &#x0026; Continua</journal-title>
</journal-title-group>
<issn pub-type="epub">1546-2226</issn>
<issn pub-type="ppub">1546-2218</issn>
<publisher>
<publisher-name>Tech Science Press</publisher-name>
<publisher-loc>USA</publisher-loc>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">73850</article-id>
<article-id pub-id-type="doi">10.32604/cmc.2025.073850</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>Enhancing Anomaly Detection with Causal Reasoning and Semantic Guidance</article-title>
<alt-title alt-title-type="left-running-head">Enhancing Anomaly Detection with Causal Reasoning and Semantic Guidance</alt-title>
<alt-title alt-title-type="right-running-head">Enhancing Anomaly Detection with Causal Reasoning and Semantic Guidance</alt-title>
</title-group>
<contrib-group>
<contrib id="author-1" contrib-type="author">
<name name-style="western"><surname>Gao</surname><given-names>Weishan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-2" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Ye</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-3" contrib-type="author">
<name name-style="western"><surname>Wang</surname><given-names>Xiaoyin</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref></contrib>
<contrib id="author-4" contrib-type="author" corresp="yes">
<name name-style="western"><surname>Jing</surname><given-names>Xiaochuan</given-names></name><xref ref-type="aff" rid="aff-1">1</xref><xref ref-type="aff" rid="aff-2">2</xref><email>w654672829@163.com</email></contrib>
<aff id="aff-1"><label>1</label><institution>China Aerospace Academy of Systems Science and Engineering</institution>, <addr-line>Beijing, 100048</addr-line>, <country>China</country></aff>
<aff id="aff-2"><label>2</label><institution>Aerospace Hongka Intelligent Technology (Beijing) Co., Ltd.</institution>, <addr-line>Beijing, 100048</addr-line>, <country>China</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>&#x002A;</label>Corresponding Author: Xiaochuan Jing. Email: <email>w654672829@163.com</email></corresp>
</author-notes>
<pub-date date-type="collection" publication-format="electronic">
<year>2026</year>
</pub-date>
<pub-date date-type="pub" publication-format="electronic">
<day>12</day><month>1</month><year>2026</year>
</pub-date>
<volume>86</volume>
<issue>3</issue>
<elocation-id>84</elocation-id>
<history>
<date date-type="received">
<day>27</day>
<month>09</month>
<year>2025</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>11</month>
<year>2025</year>
</date>
</history>
<permissions>
<copyright-statement>&#x00A9; 2025 The Authors.</copyright-statement>
<copyright-year>2025</copyright-year>
<copyright-holder>Published by Tech Science Press.</copyright-holder>
<license xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>This work is licensed under a <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</ext-link>, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
</license>
</permissions>
<self-uri content-type="pdf" xlink:href="TSP_CMC_73850.pdf"></self-uri>
<abstract>
<p>In the field of intelligent surveillance, weakly supervised video anomaly detection (WSVAD) has garnered widespread attention as a key technology that identifies anomalous events using only video-level labels. Although multiple instance learning (MIL) has long dominated WSVAD, its reliance solely on video-level labels without semantic grounding hinders a fine-grained understanding of visually similar yet semantically distinct events. In addition, insufficient temporal modeling obscures causal relationships between events, making anomaly decisions reactive rather than reasoning-based. To overcome the limitations above, this paper proposes an adaptive knowledge-based guidance method that integrates external structured knowledge. The approach combines hierarchical category information with learnable prompt vectors. It then constructs continuously updated contextual references within the feature space, enabling fine-grained meaning-based guidance over video content. Building on this, the work introduces an event relation analysis module. This module explicitly models temporal dependencies and causal correlations between video snippets. It constructs an evolving logic chain of anomalous events, revealing the process by which isolated anomalous snippets develop into a complete event. Experiments on multiple benchmark datasets show that the proposed method achieves highly competitive performance, achieving an AUC of 88.19% on UCF-Crime and an AP of 86.49% on XD-Violence. More importantly, the method provides temporal and causal explanations derived from event relationships alongside its detection results. This capability significantly advances WSVAD from a simple binary classification to a new level of interpretable behavior analysis.</p>
</abstract>
<kwd-group kwd-group-type="author">
<kwd>Video anomaly detection (VAD)</kwd>
<kwd>computer vision</kwd>
<kwd>deep learning</kwd>
<kwd>explainable AI (XAI)</kwd>
<kwd>video understanding</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<label>1</label>
<title>Introduction</title>
<p>Video anomaly detection (VAD) is a critical technology in domains like public safety and industrial automation, providing a scalable alternative to the inherent limitations of human-led surveillance [<xref ref-type="bibr" rid="ref-1">1</xref>]. Within this field, weakly supervised video anomaly detection (WSVAD) has become particularly influential. By training on video-level labels, WSVAD alleviates the intensive labor of temporal annotation and promotes better generalization across varied scenarios. However, despite these benefits, WSVAD frameworks struggle with a significant limitation: a lack of semantic specificity and predictive insight regarding anomalous events [<xref ref-type="bibr" rid="ref-2">2</xref>]. Contemporary models can flag the occurrence of an anomaly but often fail to classify its specific nature or offer actionable diagnostic information. In an industrial context, for example, a system might misinterpret harmless steam as smoke from equipment failure, or mistake routine welding sparks for a fire hazard, leading to costly operational disruptions.</p>
<p>This conceptual confusion stems from foundational constraints in WSVAD methodologies. Firstly, coarse, video-level binary labels provide a weak supervisory signal. This signal only confirms an anomaly&#x2019;s presence and is devoid of information about its specific nature or cause. The resulting lack of meaning-based grounding is the primary source of ambiguity. Secondly, dominant multiple instance learning (MIL) paradigms [<xref ref-type="bibr" rid="ref-3">3</xref>] are effective for leveraging weak labels. However, they typically aggregate snippet-level predictions through oversimplified temporal pooling strategies [<xref ref-type="bibr" rid="ref-4">4</xref>]. These strategies yield weak temporal modeling. They fail to capture the evolving progression and causal logic of events, treating videos as unordered sets of snippets rather than coherent narratives.</p>
<p>To overcome these challenges, research has progressed along two primary avenues. The first seeks to refine the MIL framework with more sophisticated temporal modeling [<xref ref-type="bibr" rid="ref-5">5</xref>&#x2013;<xref ref-type="bibr" rid="ref-7">7</xref>] or saliency-based feature extraction [<xref ref-type="bibr" rid="ref-8">8</xref>]. While these techniques improve anomaly localization, they remain constrained by the semantically weak video-level labels and do not address the fundamental need for semantic discrimination. The second direction leverages large-scale, pre-trained multimodal models, such as VadCLIP [<xref ref-type="bibr" rid="ref-9">9</xref>], to align visual data with language representations. Although promising, this approach often creates superficial cross-modal links without deep, structured knowledge, and its decision-making process remains opaque. More critically, neither direction adequately addresses the weak temporal modeling inherent in standard MIL at a causal level, leading to a persistent inability to differentiate fine-grained anomalies, a lack of interpretability, and a purely reactive framework with no predictive capability for proactive intervention.</p>
<p>To directly address the intertwined problems of semantic ambiguity and weak temporal modelling, recent state-of-the-art approaches present specific characteristics that leave room for advancement. In the domain of semantic integration, vision-language models such as AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>] leverage valuable semantic priors, yet their dependence on static, pre-defined textual prompts can limit the capture of fine-grained, dynamic semantics required to distinguish complex anomaly categories. Concurrently, in temporal modelling, methods like PE-MIL [<xref ref-type="bibr" rid="ref-11">11</xref>] enhance multi-scale feature extraction but typically do not model the causal logic and evolutionary pathways of events, maintaining a focus on post-hoc detection.</p>
<p>This work proposes a novel framework to advance beyond these characteristics. Its strength lies in the synergistic interaction of its core contributions. The framework first introduces an adaptive knowledge-based guidance mechanism. This component constructs hierarchical &#x201C;concept clouds&#x201D; from Wikidata [<xref ref-type="bibr" rid="ref-12">12</xref>] and encodes them via BLIP [<xref ref-type="bibr" rid="ref-13">13</xref>]. The goal is to achieve fine-grained conceptual discrimination. This mechanism provides the precise meaning-based concepts necessary for the event relation analysis (ERA) module. The ERA module can then construct meaningful, causal-temporal chains of events. In turn, the ERA module provides predictive capability and causal interpretability. It does so by explicitly modeling temporal dependencies and causal correlations. Critically, the contextual and causal understanding from the ERA guides the conceptual alignment. This focuses attention on the most spatiotemporally relevant visual evidence, creating a closed-loop reasoning system. A temporal context fusion (TCF) module supports this synergy by enriching long-range dependency modeling. Ultimately, this enables a transformative shift from merely detecting anomalies to interpreting and forecasting event evolution.</p>
<p>The remainder of this paper proceeds as follows: <xref ref-type="sec" rid="s2">Section 2</xref> provides an overview of related work, analyzing the current state of weakly supervised video anomaly detection and prompt learning. <xref ref-type="sec" rid="s3">Section 3</xref> introduces the overall architecture of the proposed framework, detailing the temporal context fusion, prompt learning using external knowledge, and the event relation analysis modules. <xref ref-type="sec" rid="s4">Section 4</xref> presents the extensive experimental results and corresponding analyses to validate the effectiveness of the method. Finally, <xref ref-type="sec" rid="s5">Section 5</xref> contains the conclusion and discussion, outlining the key findings of this research and discussing potential directions for future studies.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Related Works</title>
<sec id="s2_1">
<label>2.1</label>
<title>Weakly Supervised Video Anomaly Detection</title>
<p>VAD research has evolved along several distinct paradigms. A significant body of work focuses on unsupervised learning, which typically operates by learning a model of normal behavior from training data containing only normal events. Anomalies are then identified as deviations from this learned normality. Within this paradigm, one prominent approach is reconstruction-based modeling. For instance, the Inter-fused Autoencoder proposed by Aslam and Kolekar [<xref ref-type="bibr" rid="ref-14">14</xref>] utilizes a combination of CNN and LSTM layers to learn spatio-temporal patterns, flagging anomalies based on high reconstruction errors. An alternative and equally effective direction is prediction-based modeling. TransGANomaly [<xref ref-type="bibr" rid="ref-15">15</xref>] exemplifies this by using a Transformer-based generative adversarial network (GAN) to predict future video frames; a failure to accurately predict a frame indicates an anomalous event. While these methods are powerful, their effectiveness is centered on the assumption that anomalous events are patterns that deviate significantly from a well-defined normality. This assumption can present challenges in complex environments where the variety of normal events is vast and ever-changing.</p>
<p>In contrast, the key challenge of WSVAD is to achieve precise temporal localization of anomalous segments and effective prediction of risk evolution, using only video-level labels. Sultani et al. [<xref ref-type="bibr" rid="ref-16">16</xref>] pioneered a ranking loss&#x2013;based framework that built the groundwork for later developments. Based on this foundation, Zhong et al. [<xref ref-type="bibr" rid="ref-17">17</xref>] introduced an in-packet consistency loss to enhance training efficiency. However, these early methods typically relied on selecting the highest-scoring segments within a video and treated them as independent instances. Such an approach disregarded the temporal continuity of behavior, resulting in the loss of critical contextual information.</p>
<p>To address this shortcoming, the focus of subsequent research shifted towards recovering temporal dependencies through contextual modeling. For example, Zhu and Newsam [<xref ref-type="bibr" rid="ref-18">18</xref>] applied an attention-based mechanism to capture inter-segment relations, partially restoring temporal structure. However, this approach lacked scene-level structural priors and demonstrated limited performance when handling complex behavioral patterns. To further improve temporal understanding, Cho et al. [<xref ref-type="bibr" rid="ref-3">3</xref>] employed Graph Convolutional Networks (GCNs) to explicitly model hierarchical dependencies within visual representations. These strategies progressively enhanced the modeling of evolving behaviors by optimizing contextual associations. Nevertheless, they all share a critical limitation: they are fundamentally reactive.</p>
<p>This &#x201C;reactive&#x201D; nature constitutes a significant gap in current WSVAD methods. Even as temporal modeling becomes increasingly sophisticated, existing frameworks still only detect an event as it occurs or after it has occurred. They cannot anticipate the evolution of a situation. This reactive nature persists even in recent methods with advanced temporal modeling. For instance, PE-MIL [<xref ref-type="bibr" rid="ref-11">11</xref>] employs pyramidal encoding to capture multi-scale temporal contexts; this enhances anomaly localization. However, its temporal reasoning remains implicit and correlational, confined within the MIL framework. The method lacks an explicit mechanism to model the causal logic and evolutionary pathways of events. Such a mechanism is essential for predictive risk assessment. This status quo necessitates a shift in research focus. To transition from passive detection to active risk assessment, the proposed ERA module is designed precisely to fill this gap. By learning these dynamic evolutionary patterns of events, the model transitions from simply identifying anomalies to proactively predicting their potential progression. Furthermore, to provide the rich contextual representation needed for such analysis, the TCF module is designed to capture both short-term and long-range dependencies more effectively than prior temporal modeling approaches.</p>
<p>Beyond the core algorithmic challenges of semantic grounding and causal reasoning, recent research has also explored other practical dimensions of VAD. One critical direction is improving system-level efficiency for real-time deployment. For example, some work proposes an edge-assisted framework. This framework utilizes a lightweight network on edge devices for initial anomaly screening. It transmits only suspicious frames to the cloud for in-depth recognition, saving bandwidth and computational resources [<xref ref-type="bibr" rid="ref-19">19</xref>]. Another important challenge is enabling systems to adapt to new, previously unseen anomaly types without requiring complete retraining. To this end, class-incremental learning networks have been developed, allowing models to learn new anomaly classes on the fly while retaining knowledge of existing ones [<xref ref-type="bibr" rid="ref-20">20</xref>]. While these directions address important system-level and lifelong learning challenges, this work remains focused on the fundamental problem of enhancing the core detection and reasoning capabilities of WSVAD models for known anomaly categories.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Applications of Prompt Learning in Video Understanding</title>
<p>Prompt learning has gained attention for enhancing conceptual awareness in video understanding. By using knowledge-based references in cross-modal learning, prompts guide model attention toward meaningful regions during inference. This improves alignment and recognition across modalities. In the context of anomaly detection, several studies have begun exploring the integration of prompt learning into WSVAD. Wang et al. [<xref ref-type="bibr" rid="ref-21">21</xref>] proposed using fixed natural language templates to guide action recognition, enabling the model to focus on action-relevant features. However, this manually defined prompting strategy lacked flexibility and struggled to generalize to the diverse and complex anomaly types encountered in WSVAD. To resolve this issue, Ju et al. [<xref ref-type="bibr" rid="ref-22">22</xref>] introduced learnable prompt vectors that can adaptively align with video features. This approach improves the detection of specific anomaly categories. However, these prompt-based methods remained limited to predefined class labels, lacking the capacity to represent visual phenomena.</p>
<p>Most existing prompt-based approaches rely on coarse-grained alignment strategies that are ill-suited for the fine-grained demands of anomaly detection. Moreover, studies that attempt to incorporate external knowledge [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-11">11</xref>] tend to align concepts at the video level and still fail to localize prompts to the relevant anomalous regions accurately. Yang et al. [<xref ref-type="bibr" rid="ref-23">23</xref>] applied implicit graph-based alignment through joint prediction, modestly improving feature-prompt correlation. However, this approach often underperformed when anomaly signals were weak, misaligning prompts with irrelevant background regions and leading to a decline in detection accuracy. The limitations of implicit and coarse-grained alignment are also evident in state-of-the-art vision-language models adapted for VAD. Methods like AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>] leverage the powerful pre-trained knowledge of models like CLIP. However, they typically rely on static, hand-crafted textual prompts. These prompts are insufficient for capturing the fine-grained, hierarchical meaning of complex anomalous events. As a result, this often leads to superficial cross-modal alignment and limited conceptual discriminability.</p>
<p>To tackle these problems fundamentally, this study introduces a fine-grained visual-text alignment method for noise-sensitive WSVAD. A conceptual prompting module facilitates the precise integration of external knowledge. Specifically, the knowledge-based anchors module moves beyond static prompts. It constructs adaptive &#x201C;concept clouds&#x201D; from structured knowledge bases and leverages learnable prompt vectors. This integration achieves fine-grained, context-aware visual-text alignment. It provides the precise conceptual grounding needed to guide model attention. In parallel, an event relation analysis module models the temporal evolution of events. Together, these components significantly enhance the model&#x2019;s capacity for subtle anomaly detection and proactive risk assessment.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Proposed Methodology</title>
<sec id="s3_1">
<label>3.1</label>
<title>Overview</title>
<p>The proposed method restructures the conventional end-to-end anomaly detection process into three distinct yet interconnected stages, each designed to address a specific aspect of weakly supervised video anomaly detection. As illustrated in <xref ref-type="fig" rid="fig-1">Fig. 1</xref>, these stages operate collaboratively and are jointly optimized through backpropagation, ensuring seamless information flow and unified performance enhancement.</p>
<fig id="fig-1">
<label>Figure 1</label>
<caption>
<title>The overall architecture of the proposed framework. The model processes inputs through two parallel streams: a video stream for extracting visual features and a prompt stream for generating semantic anchors. These streams are aligned and fused, with the resulting features fed into the final event relation analysis module to model event evolution for classification and prediction</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-1.tif"/>
</fig>
<p>The first stage is the basic detection module, which determines whether an anomaly exists and identifies its temporal location. This stage uses two submodules: the temporal context fusion component (<xref ref-type="sec" rid="s3_2">Section 3.2</xref>), which captures short- and long-range dependencies, and the causal temporal predictor (<xref ref-type="sec" rid="s3_3">Section 3.3</xref>), which outputs segment-level anomaly scores. These scores serve as attention cues for subsequent stages.</p>
<p>The second stage is the semantic enhancement module. It performs a critical transformation of the visual features by integrating them with external conceptual knowledge. It leverages the anomaly scores from the first stage to distinguish between foreground (potentially anomalous) and background regions. Concept-based anchors, derived from knowledge graphs, are aligned with the most relevant foreground segments using an auxiliary alignment loss (<xref ref-type="sec" rid="s3_4">Section 3.4</xref>). This process refines the raw visual representations into contextually enriched features that are more suitable for downstream tasks.</p>
<p>The third stage, the event relation analysis module, elevates the framework&#x2019;s capability from passive analysis to active pre-warning. This module models the evolving and logical sequence of events, transcending judgments of isolated moments. Building on the semantically enriched features from the second stage, it utilizes a gated recurrent unit (GRU) network to encode the temporal progression (<xref ref-type="sec" rid="s3_5">Section 3.5</xref>). Finally, it employs two parallel prediction heads to forecast the current and subsequent event probabilities. This enables the system to anticipate future risks based on learned causal patterns.</p>
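The GRU-based encoding with two parallel prediction heads can be sketched as follows. This is a minimal illustrative sketch only: the single-layer hand-rolled GRU, all dimensions, and the names `MinimalERA`, `W_cur`, and `W_next` are assumptions for demonstration, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MinimalERA:
    """Hypothetical sketch of the event relation analysis stage: a GRU
    encodes the enriched segment sequence, and two parallel linear heads
    score the current event and the anticipated next event."""

    def __init__(self, d_in, d_hid, n_classes, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(d_hid)
        # GRU parameters for the update (z), reset (r), and candidate gates.
        self.Wz = rng.uniform(-s, s, (d_in + d_hid, d_hid))
        self.Wr = rng.uniform(-s, s, (d_in + d_hid, d_hid))
        self.Wh = rng.uniform(-s, s, (d_in + d_hid, d_hid))
        # Two parallel prediction heads (hypothetical naming).
        self.W_cur = rng.uniform(-s, s, (d_hid, n_classes))
        self.W_next = rng.uniform(-s, s, (d_hid, n_classes))

    def forward(self, X):
        """X: (T, d_in) sequence of semantically enriched segment features."""
        h = np.zeros(self.Wz.shape[1])
        states = []
        for x in X:  # standard GRU recurrence over segments
            xh = np.concatenate([x, h])
            z = sigmoid(xh @ self.Wz)
            r = sigmoid(xh @ self.Wr)
            h_cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)
            h = (1 - z) * h + z * h_cand
            states.append(h)
        H = np.stack(states)  # (T, d_hid) encoded temporal progression

        def softmax(a):
            e = np.exp(a - a.max(axis=-1, keepdims=True))
            return e / e.sum(axis=-1, keepdims=True)

        # Per-segment probabilities for the current and the next event.
        return softmax(H @ self.W_cur), softmax(H @ self.W_next)
```

The key design point captured here is that both heads share one recurrent state, so the forecast of the next event is conditioned on the same learned evolution pattern that explains the current one.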
<p>The entire method is trained end-to-end with a multi-task objective. The primary MIL-based loss from the first stage drives accurate anomaly detection, while the conceptual alignment loss from the second stage encourages feature representations to become more interpretable and discriminative. Joint training enables effective gradient sharing across all components. This allows the model to achieve high detection performance and meaningful semantic understanding simultaneously.</p>
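A minimal sketch of such a multi-task objective follows. The top-k aggregation, the binary cross-entropy form, and the weighting hyperparameter `lam` are all assumptions for illustration; the text does not specify the exact loss formulations.

```python
import numpy as np

def mil_video_loss(scores, label, k=3):
    """Hypothetical MIL-style video loss: average the top-k segment
    anomaly scores as the video-level prediction, then apply binary
    cross-entropy against the video-level label (k is an assumption)."""
    topk = np.sort(scores)[-k:]
    p = np.clip(topk.mean(), 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def joint_objective(scores, label, align_loss, lam=0.1):
    """Joint multi-task objective: the detection loss plus the
    conceptual alignment loss as a weighted auxiliary term."""
    return mil_video_loss(scores, label) + lam * align_loss
```

Because both terms contribute to one scalar objective, backpropagation shares gradients across the detection and semantic-enhancement stages, matching the joint optimization described above.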
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Temporal Context Fusion</title>
<p>For the fundamental anomaly detection phase, the TCF module is designed to adeptly capture intricate temporal connections. Current methodologies either fail to effectively model long-range interdependence by emphasizing local interactions [<xref ref-type="bibr" rid="ref-8">8</xref>] or inadequately represent local correlations by depending on global self-attention [<xref ref-type="bibr" rid="ref-24">24</xref>]. The overall architecture of the TCF module, which addresses these challenges, is illustrated in <xref ref-type="fig" rid="fig-2">Fig. 2</xref>. The TCF module concurrently gathers information from long-range dependencies and neighbourhood connections, integrating them via an adaptive gating mechanism. The module focuses on creating an affinity matrix that represents the inherent relationships among segments. An input feature sequence <inline-formula id="ieqn-1"><mml:math id="mml-ieqn-1"><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is initially transformed into queries (Q), keys (K), and values (V) by three distinct linear projection functions: <inline-formula id="ieqn-2"><mml:math id="mml-ieqn-2"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, <inline-formula id="ieqn-3"><mml:math id="mml-ieqn-3"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> and <inline-formula id="ieqn-4"><mml:math id="mml-ieqn-4"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The initial inter-segment similarity <inline-formula id="ieqn-5"><mml:math id="mml-ieqn-5"><mml:msub><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is then calculated as the dot product of Q and the transpose of K, as defined in <xref ref-type="disp-formula" rid="eqn-1">Eq. (1)</xref>.
<disp-formula id="eqn-1"><label>(1)</label><mml:math id="mml-eqn-1" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>q</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
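As a numerical illustration, the raw affinity of Eq. (1) can be sketched in a few lines. The bias-free linear projections and random weights below are assumptions for demonstration only, not the paper's implementation.

```python
import numpy as np

def raw_affinity(Z, Wq, Wk):
    """Eq. (1) sketch: S_raw = g_q(Z) · g_k(Z)^T, with the projections
    taken as plain bias-free linear maps (an assumption for brevity)."""
    Q = Z @ Wq      # queries, shape (N, d)
    K = Z @ Wk      # keys, shape (N, d)
    return Q @ K.T  # (N, N) similarity between all segment pairs

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 8))     # N = 6 segments, D_in = 8 (illustrative)
Wq = rng.normal(size=(8, 4))
Wk = rng.normal(size=(8, 4))
S_raw = raw_affinity(Z, Wq, Wk)
```

Each entry `S_raw[i, j]` scores how strongly segment `i` attends to segment `j` before any positional information is injected.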
<fig id="fig-2">
<label>Figure 2</label>
<caption>
<title>Architecture of the temporal context fusion module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-2.tif"/>
</fig>
<p>The explicit integration of positional information into the similarity computation is achieved through a temporal relationality embedding (TRE), <inline-formula id="ieqn-9"><mml:math id="mml-ieqn-9"><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The components of this embedding matrix (<inline-formula id="ieqn-10"><mml:math id="mml-ieqn-10"><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>) are generated dynamically according to the absolute position indices of the segments (<inline-formula id="ieqn-11"><mml:math id="mml-ieqn-11"><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>) and are modulated by a learnable scale factor (<inline-formula id="ieqn-12"><mml:math id="mml-ieqn-12"><mml:mi>&#x03BA;</mml:mi></mml:math></inline-formula>) and a bias (<inline-formula id="ieqn-13"><mml:math id="mml-ieqn-13"><mml:mi>&#x03B4;</mml:mi></mml:math></inline-formula>), as given in <xref ref-type="disp-formula" rid="eqn-2">Eq. (2)</xref>.
<disp-formula id="eqn-2"><label>(2)</label><mml:math id="mml-eqn-2" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mi>&#x03BA;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x2212;</mml:mo><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:msub><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>+</mml:mo><mml:mi>&#x03B4;</mml:mi><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
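As a concrete illustration, the Gaussian-kernel embedding of Eq. (2) can be sketched in a few lines of NumPy. Here `kappa` and `delta` are fixed scalars standing in for the learnable parameters, and the function name is ours:

```python
import numpy as np

def temporal_relationality_embedding(n_segments, kappa=0.1, delta=0.0):
    """Eq. (2): E(idx1, idx2) = exp(-|kappa * (idx1 - idx2)^2 + delta|)."""
    idx = np.arange(n_segments, dtype=float)
    diff = idx[:, None] - idx[None, :]          # pairwise position differences
    return np.exp(-np.abs(kappa * diff ** 2 + delta))

# Nearby segments receive weights close to 1; distant ones decay toward 0.
E = temporal_relationality_embedding(5, kappa=0.1, delta=0.0)
# The location-aware affinity is then S_pos = S_raw + E.
```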
<p>The temporal relationality embedding is incorporated into the original similarity to create the location-aware affinity matrix, <inline-formula id="ieqn-16"><mml:math id="mml-ieqn-16"><mml:msub><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mtext>E</mml:mtext></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. The Gaussian kernel embedding inherently accentuates relationships among adjacent segments, allowing the model to adjust to varying spatial representations and disparate video durations. The wide-area dependency flow aims to capture contextual relationships on a global scale. 
The location-aware affinity matrix, <inline-formula id="ieqn-17"><mml:math id="mml-ieqn-17"><mml:msub><mml:mrow><mml:mtext>S</mml:mtext></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, undergoes a conventional scaled dot-product attention calculation to derive a wide-area dependence map, <inline-formula id="ieqn-18"><mml:math id="mml-ieqn-18"><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>softmax</mml:mtext></mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:msub><mml:mrow><mml:mover><mml:mrow><mml:mtext mathvariant="bold">S</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msqrt><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mi>e</mml:mi><mml:mi>y</mml:mi></mml:mrow></mml:msub></mml:msqrt></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula>. 
This map is subsequently employed to weight and aggregate the value representation V, producing the wide-area contextual features <inline-formula id="ieqn-19"><mml:math id="mml-ieqn-19"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>W</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. A learnable adaptive balancing gate, <inline-formula id="ieqn-20"><mml:math id="mml-ieqn-20"><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:math></inline-formula>, is utilized to combine the contextual information derived from the two streams. This adaptive balancing gate enables the model to adjust the contribution weights of global and local information in response to the inputs, as shown in <xref ref-type="disp-formula" rid="eqn-3">Eq. (3)</xref>.
<disp-formula id="eqn-3"><label>(3)</label><mml:math id="mml-eqn-3" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>g</mml:mi><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>b</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2299;</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
<p>Here, <inline-formula id="ieqn-21"><mml:math id="mml-ieqn-21"><mml:mo>&#x2299;</mml:mo></mml:math></inline-formula> denotes element-wise multiplication, and <inline-formula id="ieqn-22"><mml:math id="mml-ieqn-22"><mml:msub><mml:mrow><mml:mtext>g</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x03B1;</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is inferred from the input features by a compact network. The merged feature, <inline-formula id="ieqn-23"><mml:math id="mml-ieqn-23"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, undergoes normalization and a linear transformation, <inline-formula id="ieqn-24"><mml:math id="mml-ieqn-24"><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, before being combined with the original input, <inline-formula id="ieqn-25"><mml:math id="mml-ieqn-25"><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow></mml:math></inline-formula>, through a residual connection. Finally, layer normalization (LN) produces the context-augmented output, <inline-formula id="ieqn-26"><mml:math id="mml-ieqn-26"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which can be expressed as:
<disp-formula id="eqn-4"><label>(4)</label><mml:math id="mml-eqn-4" display="block"><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>LN</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Z</mml:mtext></mml:mrow><mml:mo>+</mml:mo><mml:msub><mml:mi>g</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D4A9;</mml:mi></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mn>2</mml:mn><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></disp-formula></p>
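The gated fusion of Eq. (3) and the residual update of Eq. (4) can be sketched as follows. This is an illustrative assumption-laden sketch: `layer_norm` stands in for both LN and the intermediate normalization (whose exact form the text does not spell out), and `w_h` is a stand-in weight matrix for the linear map g_h:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def fuse_contexts(c_global, c_local, gate, z, w_h):
    """Eq. (3): gated merge of the global/local context streams;
    Eq. (4): linear map, residual connection to Z, layer normalization.
    layer_norm also stands in for the intermediate normalization (assumption)."""
    c_merged = gate * c_global + (1.0 - gate) * c_local   # Eq. (3)
    h = layer_norm(c_merged) @ w_h                        # g_h(norm(C_merged))
    return layer_norm(z + h)                              # Eq. (4)

rng = np.random.default_rng(0)
N, d = 4, 8
z = rng.normal(size=(N, d))
out = fuse_contexts(rng.normal(size=(N, d)), rng.normal(size=(N, d)),
                    gate=0.5, z=z, w_h=np.eye(d))
```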
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Causal Temporal Predictor and Classifier</title>
<p>After enhancement by the TCF module, a causal temporal predictor and classifier is introduced to further refine the discrimination-oriented features and produce predictions. The module consists of two sequential one-dimensional convolutional layers, each followed by a GELU activation function and a dropout layer, enhancing the model&#x2019;s nonlinear representation capability while mitigating overfitting. The procedure is as follows:
<disp-formula id="eqn-5"><label>(5)</label><mml:math id="mml-eqn-5" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>inter</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Dropout</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>GELU</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mn>1</mml:mn><mml:msub><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>C</mml:mtext></mml:mrow><mml:mrow><mml:mi>o</mml:mi><mml:mi>u</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="ueqn-6"><mml:math id="mml-ueqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>Dropout</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>GELU</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>Conv</mml:mtext></mml:mrow><mml:mn>1</mml:mn><mml:msub><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mrow><mml:mtext>inter</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
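A minimal NumPy sketch of the two Conv1D &#x2192; GELU &#x2192; Dropout stages described above. The kernel size, weight shapes, and &#x201C;same&#x201D; padding are illustrative assumptions, and dropout is modeled as an explicit mask that is simply omitted (identity) at inference:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def conv1d(x, w):
    """'Same'-padded 1-D convolution; x: (T, C_in), w: (k, C_in, C_out)."""
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k // 2), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def predictor_block(c_out, w1, w2, mask1=None, mask2=None):
    """Two Conv1D -> GELU -> Dropout stages; the masks model dropout during
    training and are left out (identity) at inference time."""
    h = gelu(conv1d(c_out, w1))
    if mask1 is not None:
        h = h * mask1
    h = gelu(conv1d(h, w2))
    if mask2 is not None:
        h = h * mask2
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))            # 6 clips, 4-dim features (toy sizes)
w1 = rng.normal(size=(3, 4, 4)) * 0.1
w2 = rng.normal(size=(3, 4, 4)) * 0.1
h_final = predictor_block(x, w1, w2)
```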
<p>Utilizing the finely tuned discriminative features <inline-formula id="ieqn-27"><mml:math id="mml-ieqn-27"><mml:msub><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, a temporal causal convolution head <inline-formula id="ieqn-28"><mml:math id="mml-ieqn-28"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> is employed to forecast the likelihood of anomalies in each video segment. This causal convolution ensures that only present and past data are utilized when forecasting the abnormal condition of the current clip, which is crucial for online or real-time detection contexts. The ultimate sequence of clip-level anomaly probabilities, <inline-formula id="ieqn-29"><mml:math id="mml-ieqn-29"><mml:msub><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, is derived by the sigmoid activation function, <inline-formula id="ieqn-30"><mml:math id="mml-ieqn-30"><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>:<inline-formula id="ieqn-31"><mml:math id="mml-ieqn-31"><mml:msub><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo 
stretchy="false">(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mrow><mml:mtext>H</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, where <inline-formula id="ieqn-32"><mml:math id="mml-ieqn-32"><mml:msub><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> represents the anomaly probability of the <italic>t</italic>-th clip in the video.</p>
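The causality property can be illustrated by a 1-D convolution that pads only on the left, so the logit at clip t never depends on future clips. The single-output kernel `w` and its size are assumptions for the sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def causal_score_head(h_final, w):
    """Sketch of a causal scoring head h_score: left-only padding means the
    logit at clip t uses only clips <= t; a sigmoid maps logits to [0, 1].
    w: (k, C_in), a single-output kernel (an assumption)."""
    k = w.shape[0]
    xp = np.pad(h_final, ((k - 1, 0), (0, 0)))      # pad the past only
    logits = np.array([np.sum(xp[t:t + k] * w) for t in range(h_final.shape[0])])
    return sigmoid(logits)

rng = np.random.default_rng(1)
h = rng.normal(size=(8, 4))
w = rng.normal(size=(3, 4)) * 0.5
p_seg = causal_score_head(h, w)
# Perturbing a future clip leaves all earlier scores unchanged.
```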
<p>The MIL paradigm is employed to formulate the principal anomaly detection loss function, <inline-formula id="ieqn-33"><mml:math id="mml-ieqn-33"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> [<xref ref-type="bibr" rid="ref-24">24</xref>]. A video-level anomaly prediction score <inline-formula id="ieqn-34"><mml:math id="mml-ieqn-34"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is produced by aggregating the segment-level anomaly probability sequence <inline-formula id="ieqn-35"><mml:math id="mml-ieqn-35"><mml:msub><mml:mrow><mml:mtext>P</mml:mtext></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. For a positive bag, the anomaly prediction score for the video is computed as the average of the <inline-formula id="ieqn-36"><mml:math id="mml-ieqn-36"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> highest segment-level anomaly probabilities, where <inline-formula id="ieqn-37"><mml:math id="mml-ieqn-37"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is defined as <inline-formula id="ieqn-38"><mml:math id="mml-ieqn-38"><mml:mo fence="false" stretchy="false">&#x230A;</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>16</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">&#x230B;</mml:mo></mml:math></inline-formula>. 
For a negative bag, the highest clip-level anomaly probability is chosen as the anomaly prediction score for the video, i.e., <inline-formula id="ieqn-39"><mml:math id="mml-ieqn-39"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, as demonstrated in <xref ref-type="fig" rid="fig-3">Fig. 3</xref>. For a mini-batch comprising <inline-formula id="ieqn-40"><mml:math id="mml-ieqn-40"><mml:mi>B</mml:mi></mml:math></inline-formula> video samples, each with a ground-truth video-level label <inline-formula id="ieqn-41"><mml:math id="mml-ieqn-41"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and an associated video-level prediction score <inline-formula id="ieqn-42"><mml:math id="mml-ieqn-42"><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, the MIL-based binary cross-entropy loss <inline-formula id="ieqn-43"><mml:math id="mml-ieqn-43"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is defined as follows:
<disp-formula id="eqn-6"><label>(6)</label><mml:math id="mml-eqn-6" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>B</mml:mi></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>B</mml:mi></mml:mrow></mml:munderover><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">[</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>+</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>v</mml:mi><mml:mi>i</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
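The top-k aggregation and the loss of Eq. (6) can be sketched directly; the function names are ours, and the clipping constant `eps` is a standard numerical-stability assumption:

```python
import numpy as np

def video_score(p_seg, is_anomalous):
    """Bag-level aggregation: mean of the top k_pos = floor(N/16 + 1) clip
    probabilities for a positive bag; top-1 (k_neg = 1) for a negative bag."""
    n = len(p_seg)
    k = int(np.floor(n / 16 + 1)) if is_anomalous else 1
    return np.sort(np.asarray(p_seg, dtype=float))[-k:].mean()

def mil_bce_loss(p_vid, y_vid, eps=1e-7):
    """Eq. (6): video-level binary cross-entropy averaged over the batch."""
    p = np.clip(np.asarray(p_vid, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y_vid, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```

For a 32-clip video, k_pos = &#x230A;32/16 + 1&#x230B; = 3, so the video score of a positive bag averages its three highest clip probabilities.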
<fig id="fig-3">
<label>Figure 3</label>
<caption>
<title>Top-k score selection module</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-3.tif"/>
</fig>
</sec>
<sec id="s3_4">
<label>3.4</label>
<title>Knowledge-Guided Semantic Integration</title>
<p>This module is the core of this method. It aims to integrate structured external knowledge into visual features and fundamentally improve the model&#x2019;s fine-grained conceptual recognition capability. Implementation involves three key steps: constructing semantic anchors, context separation, and visual-semantic alignment.</p>
<sec id="s3_4_1">
<label>3.4.1</label>
<title>External Knowledge Prompt Construction</title>
<p>To generate more representative semantic guidance signals than mere category labels, semantic anchors are constructed. Considering the versatility of semantic guidance, this work first selects 12 common relation types from Wikidata [<xref ref-type="bibr" rid="ref-12">12</xref>] as the pre-retrieval semantics, followed by identifying the relation types with the highest occurrence frequency across all anomaly categories as the core retrieval relations <inline-formula id="ieqn-44"><mml:math id="mml-ieqn-44"><mml:mi>R</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>. The concept dictionary is then constructed by retrieving all statements of the relation <inline-formula id="ieqn-45"><mml:math id="mml-ieqn-45"><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:mrow><mml:mtext>R</mml:mtext></mml:mrow></mml:math></inline-formula> established with a given class <inline-formula id="ieqn-46"><mml:math id="mml-ieqn-46"><mml:mi>c</mml:mi></mml:math></inline-formula> as the subject or object entity. The non-category entity in each statement is used as the key, and the number of references associated with the statement is taken as the value. To explicitly handle potential noise, ambiguity, and inconsistent entries from Wikidata that may interfere with the meaning of anomalies, a two-stage filtering process is applied. The method first eliminates all Wikidata statements with a ranking of &#x201C;deprecated&#x201D; to remove low-quality or disputed assertions. 
Next, for the remaining concepts, the number of references associated with each Wikidata statement is used as a relevance indicator, and the entries are filtered using the average reference count as the threshold. In general, a higher reference count indicates greater reliability and broader consensus for a concept.</p>
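The two-stage filter can be sketched as follows. The statement fields (`concept`, `rank`, `refs`) are hypothetical names standing in for the corresponding Wikidata statement data, and retaining entries at or above the mean reference count is our reading of "using the average as the threshold":

```python
def filter_concepts(statements):
    """Two-stage filter from the text: (1) drop statements ranked
    'deprecated'; (2) keep only concepts whose reference count reaches the
    mean count of the remaining entries. Field names are hypothetical."""
    kept = [s for s in statements if s["rank"] != "deprecated"]
    if not kept:
        return {}
    mean_refs = sum(s["refs"] for s in kept) / len(kept)
    return {s["concept"]: s["refs"] for s in kept if s["refs"] >= mean_refs}
```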
<p>Subsequently, the text encoder <inline-formula id="ieqn-47"><mml:math id="mml-ieqn-47"><mml:msub><mml:mrow><mml:mi>&#x2130;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>text</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> of the BLIP model [<xref ref-type="bibr" rid="ref-13">13</xref>] is employed to encode the filtered concepts into stable and representative semantic anchor vectors. The choice of BLIP over CLIP [<xref ref-type="bibr" rid="ref-25">25</xref>] is motivated by its superior ability to capture fine-grained semantic relationships [<xref ref-type="bibr" rid="ref-26">26</xref>] and its cleaner training data, which is more suitable for this knowledge-driven framework. For a given class <inline-formula id="ieqn-48"><mml:math id="mml-ieqn-48"><mml:mi>c</mml:mi></mml:math></inline-formula>, a set of keys <inline-formula id="ieqn-49"><mml:math id="mml-ieqn-49"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>p</mml:mi><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msubsup></mml:math></inline-formula> from the concept dictionary is first extracted as the context concepts, which are then separately fed into the text encoder to extract <inline-formula id="ieqn-50"><mml:math id="mml-ieqn-50"><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>b</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>-dimensional feature vectors <inline-formula id="ieqn-51"><mml:math 
id="mml-ieqn-51"><mml:msubsup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>b</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>: <inline-formula id="ieqn-52"><mml:math id="mml-ieqn-52"><mml:msubsup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x2130;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>text</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">(</mml:mo><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi><mml:mi>e</mml:mi><mml:mi>p</mml:mi><mml:msubsup><mml:mi>t</mml:mi><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The semantic director <inline-formula id="ieqn-53"><mml:math id="mml-ieqn-53"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>e</mml:mi><mml:mi>m</mml:mi><mml:mi>b</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> of a category <inline-formula id="ieqn-54"><mml:math id="mml-ieqn-54"><mml:mi>c</mml:mi></mml:math></inline-formula> is then derived by averaging all feature vectors of its core concepts, as presented in <xref ref-type="disp-formula" rid="eqn-7">Eq. (7)</xref>.
<disp-formula id="eqn-7"><label>(7)</label><mml:math id="mml-eqn-7" display="block"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mfrac><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munderover><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mrow><mml:mtext>e</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></disp-formula>where <inline-formula id="ieqn-55"><mml:math id="mml-ieqn-55"><mml:msub><mml:mi>N</mml:mi><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the number of filtered concepts. The <inline-formula id="ieqn-56"><mml:math id="mml-ieqn-56"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> vector can serve as a reliable and representative anchor point to direct the ensuing visual feature learning process.</p>
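Eq. (7) is a simple mean over the concept embeddings; in this sketch, plain vectors stand in for the BLIP text-encoder outputs:

```python
import numpy as np

def semantic_director(concept_embeddings):
    """Eq. (7): the class anchor D^c is the mean of the concept embeddings
    e_i^c. Plain vectors stand in for BLIP text-encoder outputs here."""
    return np.asarray(concept_embeddings, dtype=float).mean(axis=0)

anchor = semantic_director([[1.0, 0.0], [0.0, 1.0]])
```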
</sec>
<sec id="s3_4_2">
<label>3.4.2</label>
<title>Context-Sensitive Separation</title>
<p>To achieve accurate alignment, it is essential to distinguish the foreground (anomalous) information from the background (normal) information in each video clip. We employ the anomalous saliency score <inline-formula id="ieqn-57"><mml:math id="mml-ieqn-57"><mml:mi>s</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mo stretchy="false">[</mml:mo><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:msup><mml:mo stretchy="false">]</mml:mo><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> generated during the base detection phase (<xref ref-type="sec" rid="s3_3">Section 3.3</xref>) to derive foreground features <inline-formula id="ieqn-58"><mml:math id="mml-ieqn-58"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> and background features <inline-formula id="ieqn-59"><mml:math id="mml-ieqn-59"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> using a straightforward attentional method. This approach enables the model to adaptively concentrate on the visual regions most likely to contain anomalous information. 
The foreground attention weights, <inline-formula id="ieqn-60"><mml:math id="mml-ieqn-60"><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, are calculated as follows:
<disp-formula id="eqn-8"><label>(8)</label><mml:math id="mml-eqn-8" display="block"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mi>&#x03BA;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03BA;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-61"><mml:math id="mml-ieqn-61"><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is the anomalous significance score for the <italic>t</italic>-th fragment, while <inline-formula id="ieqn-62"><mml:math id="mml-ieqn-62"><mml:mi>&#x03BA;</mml:mi></mml:math></inline-formula> represents a predetermined scaling factor employed to amplify the importance of high significance scores. The expression <inline-formula id="ieqn-63"><mml:math id="mml-ieqn-63"><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mo>&#x22C5;</mml:mo><mml:mo>)</mml:mo></mml:mrow><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> guarantees non-negativity and diminishes the weighting for lower scores. Similarly, the background attention weight, <inline-formula id="ieqn-64"><mml:math id="mml-ieqn-64"><mml:msup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, is calculated utilizing the normal confidence of the clip, <inline-formula id="ieqn-65"><mml:math id="mml-ieqn-65"><mml:msub><mml:mover><mml:mi>s</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x2212;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, as shown in <xref ref-type="disp-formula" rid="eqn-9">Eq. (9)</xref>.
<disp-formula id="eqn-9"><label>(9)</label><mml:math id="mml-eqn-9" display="block"><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03BA;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover><mml:mi>s</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mo stretchy="false">(</mml:mo><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03BA;</mml:mi><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mover><mml:mi>s</mml:mi><mml:mo accent="false">&#x00AF;</mml:mo></mml:mover><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p>Utilizing these attention weights, we can derive the weighted video-level foreground feature, <inline-formula id="ieqn-66"><mml:math id="mml-ieqn-66"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, and the background feature, <inline-formula id="ieqn-67"><mml:math id="mml-ieqn-67"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula>, according to <xref ref-type="disp-formula" rid="eqn-10">Eq. (10)</xref>.
<disp-formula id="ueqn-11"><mml:math id="mml-ueqn-11" display="block"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>
<disp-formula id="eqn-10"><label>(10)</label><mml:math id="mml-eqn-10" display="block"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:mspace width="thinmathspace" /><mml:msubsup><mml:mi>&#x03B1;</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></disp-formula>where <inline-formula id="ieqn-68"><mml:math id="mml-ieqn-68"><mml:msub><mml:mrow><mml:mtext>X</mml:mtext></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>f</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> denotes the <inline-formula id="ieqn-69"><mml:math id="mml-ieqn-69"><mml:mi>t</mml:mi></mml:math></inline-formula>-th feature segment in the sequence. Thus, <inline-formula id="ieqn-70"><mml:math id="mml-ieqn-70"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> emphasizes the characteristics of segments deemed abnormal by the model, whereas <inline-formula id="ieqn-71"><mml:math id="mml-ieqn-71"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> concentrates on the attributes of normal segments.</p>
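<p>To make the pooling in Eqs. (8)&#x2013;(10) concrete, the following is a minimal NumPy sketch of the foreground/background attention weighting; the function name, the value of &#x03BA;, and the small epsilon guarding against all-zero scores are illustrative assumptions, not part of the original formulation.</p>

```python
import numpy as np

def attention_pool(x_refined, s, kappa=5.0, eps=1e-8):
    """Sketch of foreground/background attention pooling (Eqs. 8-10).

    x_refined: (N, d_v) refined snippet features
    s:         (N,) anomaly significance scores in [0, 1]
    kappa:     scaling factor amplifying high scores (assumed value)
    """
    w_fg = np.exp(kappa * s) - 1.0          # exp(.) - 1 keeps weights non-negative
    alpha_fg = w_fg / (w_fg.sum() + eps)    # normalize over the N snippets (Eq. 8)
    s_bar = 1.0 - s                         # normal confidence of each snippet
    w_bg = np.exp(kappa * s_bar) - 1.0
    alpha_bg = w_bg / (w_bg.sum() + eps)    # Eq. (9)
    v_fg = alpha_fg @ x_refined             # weighted video-level foreground feature
    v_bg = alpha_bg @ x_refined             # weighted video-level background feature (Eq. 10)
    return v_fg, v_bg
```

With high significance on a snippet, v<sup>fg</sup> is dominated by that snippet's feature, while v<sup>bg</sup> emphasizes the low-score snippets, matching the intended roles of the two pooled features.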
</sec>
<sec id="s3_4_3">
<label>3.4.3</label>
<title>Visual and Semantic Alignment</title>
<p>During the alignment step, the objective is to minimize the distance in the feature space between the foreground visual feature <inline-formula id="ieqn-72"><mml:math id="mml-ieqn-72"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and the conceptual anchor <inline-formula id="ieqn-73"><mml:math id="mml-ieqn-73"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> of its anomalous category. The matching probability is derived from the temperature-scaled cosine similarity <inline-formula id="ieqn-74"><mml:math id="mml-ieqn-74"><mml:mrow><mml:mtext>sim</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>&#x22C5;</mml:mo><mml:msup><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo><mml:mrow><mml:mtext>a</mml:mtext></mml:mrow><mml:mo>&#x2225;&#x2225;</mml:mo><mml:mrow><mml:mtext>b</mml:mtext></mml:mrow><mml:mo stretchy="false">&#x2225;</mml:mo></mml:mrow></mml:mfrac></mml:math></inline-formula>, and a knowledge-distilled alignment loss <inline-formula id="ieqn-75"><mml:math id="mml-ieqn-75"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is formulated to accomplish this objective. 
The primary anomaly detection loss, <inline-formula id="ieqn-76"><mml:math id="mml-ieqn-76"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, and the alignment loss, <inline-formula id="ieqn-77"><mml:math id="mml-ieqn-77"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, which are used for comprehensive joint optimization, determine the overall loss of the model.</p>
<p>For a particular video, if it is anomalous, its foreground feature <inline-formula id="ieqn-78"><mml:math id="mml-ieqn-78"><mml:msup><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>g</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> must coincide with the associated semantic direction for the anomaly category, <inline-formula id="ieqn-79"><mml:math id="mml-ieqn-79"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>b</mml:mi><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. If the video is normal, its overall characteristics should correspond with the conventional &#x2018;normal&#x2019; semantic framework <inline-formula id="ieqn-80"><mml:math id="mml-ieqn-80"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mi>o</mml:mi><mml:mi>r</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. A collection <inline-formula id="ieqn-81"><mml:math id="mml-ieqn-81"><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula> is created, comprising <inline-formula id="ieqn-82"><mml:math id="mml-ieqn-82"><mml:mi>C</mml:mi></mml:math></inline-formula> anomalous category semantic guides and one standard semantic guide. 
The likelihood that a visual characteristic <inline-formula id="ieqn-83"><mml:math id="mml-ieqn-83"><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow></mml:math></inline-formula> corresponds to the <inline-formula id="ieqn-84"><mml:math id="mml-ieqn-84"><mml:mi>k</mml:mi></mml:math></inline-formula>-th semantic guide <inline-formula id="ieqn-85"><mml:math id="mml-ieqn-85"><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, <inline-formula id="ieqn-86"><mml:math id="mml-ieqn-86"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>, is determined using temperature-scaled cosine similarity followed by softmax normalization, shown in <xref ref-type="disp-formula" rid="eqn-11">Eq. (11)</xref>.
<disp-formula id="eqn-11"><label>(11)</label><mml:math id="mml-eqn-11" display="block"><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>sim</mml:mtext></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:msubsup><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>m</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mspace width="thinmathspace" /><mml:mi>exp</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>sim</mml:mtext></mml:mrow><mml:mo>&#x2061;</mml:mo><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac></mml:math></disp-formula></p>
<p><inline-formula id="ieqn-87"><mml:math id="mml-ieqn-87"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a learnable temperature parameter. Alignment aims to maximize the matching probability for positive sample pairs while minimizing it for negative pairs. This is accomplished through a knowledge-distilled alignment loss, calculated as follows:
<disp-formula id="eqn-12"><label>(12)</label><mml:math id="mml-eqn-12" display="block"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>E</mml:mi><mml:mrow><mml:mrow><mml:mtext mathvariant="bold">v</mml:mtext></mml:mrow><mml:mo>&#x223C;</mml:mo><mml:mi>p</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext mathvariant="bold">v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:munderover><mml:mo>&#x2211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>C</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mspace width="thinmathspace" /><mml:mi>Q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo><mml:mi>log</mml:mi><mml:mo>&#x2061;</mml:mo><mml:mfrac><mml:mrow><mml:mi>Q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mfrac><mml:mo>]</mml:mo></mml:mrow></mml:math></disp-formula>where <inline-formula id="ieqn-88"><mml:math id="mml-ieqn-88"><mml:mi>Q</mml:mi><mml:mo 
stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> denotes the target distribution. <inline-formula id="ieqn-89"><mml:math id="mml-ieqn-89"><mml:mi>Q</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msup><mml:mrow><mml:mtext>D</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow><mml:mrow><mml:mtext>v</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> approaches 1 for positive sample pairings and approaches 0 for negative sample pairs.</p>
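<p>As a minimal PyTorch sketch of Eqs. (11) and (12): when the target distribution Q is one-hot (approaching 1 for the positive guide, 0 otherwise), the KL-style objective reduces to cross-entropy on the positive pair. The function name, the fixed &#x03C4;<sub>s</sub> value, and the batch layout are assumptions for illustration; in the paper &#x03C4;<sub>s</sub> is learnable.</p>

```python
import torch
import torch.nn.functional as F

def alignment_loss(v, D, target_idx, tau_s=0.07):
    """Sketch of the knowledge-distilled alignment loss (Eqs. 11-12).

    v:          (B, d) video-level visual features
    D:          (C+1, d) semantic guides: C anomaly classes plus one 'normal'
    target_idx: (B,) index of the positive semantic guide for each video
    """
    # Cosine similarity between each feature and every semantic guide.
    sim = F.normalize(v, dim=-1) @ F.normalize(D, dim=-1).T
    # Temperature-scaled softmax gives log P(D^k | v), Eq. (11).
    log_p = F.log_softmax(sim / tau_s, dim=-1)
    # With a one-hot target Q, the KL divergence of Eq. (12) collapses to
    # negative log-likelihood of the positive guide.
    return F.nll_loss(log_p, target_idx)
```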
</sec>
</sec>
<sec id="s3_5">
<label>3.5</label>
<title>Event Relation Analysis for Active Risk Assessment</title>
<p>The ERA module elevates the model&#x2019;s capability from passive detection to active pre-warning. Its primary objective is to model the temporal evolution and logical progression of events, enabling the prediction of future events rather than judgments of isolated moments. The module&#x2019;s workflow begins by receiving the concept-enriched feature sequence from the TCF and KGSI modules. It then encodes the temporal information using a Recurrent Neural Network (RNN). Finally, two parallel prediction heads output probabilities for both current and subsequent events.</p>
<p>Specifically, the input to the ERA module is the semantic-enhanced feature sequence <inline-formula id="ieqn-90"><mml:math id="mml-ieqn-90"><mml:msup><mml:mi>X</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo fence="false" stretchy="false">}</mml:mo><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>D</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula id="ieqn-91"><mml:math id="mml-ieqn-91"><mml:mi>T</mml:mi></mml:math></inline-formula> is the sequence length and <inline-formula id="ieqn-92"><mml:math id="mml-ieqn-92"><mml:mi>D</mml:mi></mml:math></inline-formula> is the feature dimension. The core of the module&#x2019;s internal processing is an event state encoding stage, which utilizes a Gated Recurrent Unit (GRU) network. The GRU is selected for its ability to effectively capture long-term dependencies while maintaining fewer parameters and higher computational efficiency compared to other recurrent structures, such as LSTM. 
The GRU network processes the input sequence <inline-formula id="ieqn-93"><mml:math id="mml-ieqn-93"><mml:msup><mml:mi>x</mml:mi><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> sequentially at each time step <inline-formula id="ieqn-94"><mml:math id="mml-ieqn-94"><mml:mrow><mml:mtext>t</mml:mtext></mml:mrow></mml:math></inline-formula>, taking the current feature <inline-formula id="ieqn-95"><mml:math id="mml-ieqn-95"><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the previous hidden state <inline-formula id="ieqn-96"><mml:math id="mml-ieqn-96"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> to compute the current hidden state:
<disp-formula id="eqn-13"><label>(13)</label><mml:math id="mml-eqn-13" display="block"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>G</mml:mi><mml:mi>R</mml:mi><mml:mi>U</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:msubsup><mml:mi>x</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x2212;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:math></disp-formula></p>
<p>This hidden state <inline-formula id="ieqn-97"><mml:math id="mml-ieqn-97"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mrow><mml:msub><mml:mi>D</mml:mi><mml:mrow><mml:mi>h</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mrow></mml:msup></mml:math></inline-formula> is crucial, as it serves as a dynamic representation of the event&#x2019;s history up to time <italic>t</italic>, encoding not only the semantic information of the current snippet but also key features from the historical sequence.</p>
<p>Based on the encoded event state sequence <inline-formula id="ieqn-98"><mml:math id="mml-ieqn-98"><mml:mi>H</mml:mi><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">{</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x2026;</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo fence="false" stretchy="false">}</mml:mo></mml:math></inline-formula>, two parallel predictive outputs are designed that function as prediction heads. To determine the current event probability (<inline-formula id="ieqn-99"><mml:math id="mml-ieqn-99"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mi>e</mml:mi><mml:mi>v</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) at time step <inline-formula id="ieqn-100"><mml:math id="mml-ieqn-100"><mml:mi>t</mml:mi></mml:math></inline-formula>, the hidden state <inline-formula id="ieqn-101"><mml:math id="mml-ieqn-101"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is passed through a linear transformation layer, <inline-formula id="ieqn-102"><mml:math id="mml-ieqn-102"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>c</mml:mi><mml:mi>u</mml:mi><mml:mi>r</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>n</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and a Sigmoid activation function, <inline-formula id="ieqn-103"><mml:math id="mml-ieqn-103"><mml:mi>&#x03C3;</mml:mi></mml:math></inline-formula>, yielding the prediction <inline-formula id="ieqn-104"><mml:math 
id="mml-ieqn-104"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>current</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:mtext>current</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. This results in a final prediction sequence <inline-formula id="ieqn-105"><mml:math id="mml-ieqn-105"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>current</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>T</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. 
Similarly, to predict the probability of the next event (<inline-formula id="ieqn-106"><mml:math id="mml-ieqn-106"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>) at time step <inline-formula id="ieqn-107"><mml:math id="mml-ieqn-107"><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, the same hidden state <inline-formula id="ieqn-108"><mml:math id="mml-ieqn-108"><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is passed through an independent linear layer, <inline-formula id="ieqn-109"><mml:math id="mml-ieqn-109"><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and a Sigmoid function to produce <inline-formula id="ieqn-110"><mml:math id="mml-ieqn-110"><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>&#x03C3;</mml:mi><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>W</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. 
The operation of the ERA module is further guided by an external knowledge base, the core of which is an event transition matrix <inline-formula id="ieqn-111"><mml:math id="mml-ieqn-111"><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>trans</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x2208;</mml:mo><mml:msup><mml:mi>R</mml:mi><mml:mrow><mml:mi>K</mml:mi><mml:mo>&#x00D7;</mml:mo><mml:mi>K</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. This matrix encodes prior knowledge about event causality and provides supervision signals during the training phase, as detailed in the following section.</p>
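<p>The GRU encoding (Eq. (13)) and the two parallel prediction heads described above can be sketched compactly in PyTorch; the class name and dimensions are illustrative assumptions rather than the authors&#x2019; released implementation.</p>

```python
import torch
import torch.nn as nn

class ERAModule(nn.Module):
    """Sketch of the ERA module: GRU event-state encoding (Eq. 13)
    followed by two parallel linear + Sigmoid prediction heads."""

    def __init__(self, d_in, d_hidden, num_classes):
        super().__init__()
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.w_current = nn.Linear(d_hidden, num_classes)  # current-event head
        self.w_next = nn.Linear(d_hidden, num_classes)     # next-event head

    def forward(self, x_e):
        # x_e: (B, T, D) semantic-enhanced feature sequence X^e
        h, _ = self.gru(x_e)                          # hidden states h_1..h_T
        p_current = torch.sigmoid(self.w_current(h))  # (B, T, K) current-event probs
        p_next = torch.sigmoid(self.w_next(h))        # (B, T, K) next-event probs
        return p_current, p_next
```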
</sec>
<sec id="s3_6">
<label>3.6</label>
<title>Model Training and Optimization Objective</title>
<p>This paper employs a multi-task learning strategy to jointly optimize all modules of the framework end-to-end. The overall objective function, <inline-formula id="ieqn-112"><mml:math id="mml-ieqn-112"><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow></mml:math></inline-formula>, is a composite loss consisting of three main components, corresponding to the key tasks of anomaly detection, knowledge enhancement, and event relation prediction. In addition to the MIL-based anomaly detection loss <inline-formula id="ieqn-113"><mml:math id="mml-ieqn-113"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> and the prompt-based semantic alignment loss <inline-formula id="ieqn-114"><mml:math id="mml-ieqn-114"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> from the base framework, a new event relation modeling objective (<inline-formula id="ieqn-115"><mml:math id="mml-ieqn-115"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>) is introduced. This new loss function is responsible for supervising the learning of the internal evolutionary rules of events and is itself composed of two constituent parts.</p>
<p>The first part is the current event classification loss <bold>(</bold><inline-formula id="ieqn-116"><mml:math id="mml-ieqn-116"><mml:mspace width="negativethinmathspace" /><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mspace width="negativethinmathspace" /></mml:math></inline-formula><bold>)</bold>, which supervises the ERA module&#x2019;s ability to recognize the class of the current video snippet. It is formulated as a Binary Cross-Entropy (BCE) loss: <inline-formula id="ieqn-117"><mml:math id="mml-ieqn-117"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>BCE</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>current</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The target label <inline-formula id="ieqn-118"><mml:math id="mml-ieqn-118"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is a multi-hot vector generated from the video&#x2019;s ground-truth annotations; the accuracy and granularity of these annotations are critical for the model&#x2019;s performance. 
The second part is the event transition prediction loss <bold>(</bold><inline-formula id="ieqn-119"><mml:math id="mml-ieqn-119"><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>relation</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mspace width="negativethinmathspace" /><mml:mspace width="negativethinmathspace" /></mml:math></inline-formula><bold>)</bold>, which trains the model&#x2019;s ability to predict future events. Its supervisory signal, <inline-formula id="ieqn-120"><mml:math id="mml-ieqn-120"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, is dynamically generated by multiplying the true current event label <inline-formula id="ieqn-121"><mml:math id="mml-ieqn-121"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> with the knowledge base&#x2019;s event transition matrix <inline-formula id="ieqn-122"><mml:math id="mml-ieqn-122"><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>trans</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>, yielding <inline-formula id="ieqn-123"><mml:math id="mml-ieqn-123"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>&#x22C5;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mrow><mml:mtext>trans</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>. 
The loss is then computed as <inline-formula id="ieqn-124"><mml:math id="mml-ieqn-124"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>relation</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mtext>BCE</mml:mtext></mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>next</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula>. The total loss for the ERA module is the sum of these two components: <inline-formula id="ieqn-125"><mml:math id="mml-ieqn-125"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>era</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>relation</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
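<p>The composition of the ERA objective can be sketched as follows; the function name is hypothetical, and the clamp on the derived next-event target (in case several transitions map to the same class) is an assumption added for numerical safety, not stated in the text.</p>

```python
import torch
import torch.nn.functional as F

def era_loss(p_current, p_next, y_event, M_trans):
    """Sketch of L_era = L_event + L_relation.

    p_current, p_next: (T, K) probabilities from the two prediction heads
    y_event:           (T, K) multi-hot current-event labels
    M_trans:           (K, K) event transition matrix from the knowledge base
    """
    # L_event: BCE supervision of current-event recognition.
    l_event = F.binary_cross_entropy(p_current, y_event)
    # Dynamically derive the next-event target y_next = y_event . M_trans;
    # clamping to [0, 1] is an assumption for overlapping transitions.
    y_next = (y_event @ M_trans).clamp(max=1.0)
    # L_relation: BCE supervision of next-event prediction.
    l_relation = F.binary_cross_entropy(p_next, y_next)
    return l_event + l_relation
```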
<p>Ultimately, the final joint loss function for the entire framework is a weighted sum of these three objectives:
<disp-formula id="eqn-14"><label>(14)</label><mml:math id="mml-eqn-14" display="block"><mml:mtable columnalign="right left right left right left right left right left right left" rowspacing="3pt" columnspacing="0em 2em 0em 2em 0em 2em 0em 2em 0em 2em 0em" displaystyle="true"><mml:mtr><mml:mtd /><mml:mtd><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03BB;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>era</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where <inline-formula id="ieqn-126"><mml:math id="mml-ieqn-126"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-127"><mml:math id="mml-ieqn-127"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> are hyperparameters that balance the importance of the different tasks. By minimizing this joint loss function, the model synergistically learns to detect anomalies, understand semantics, and predict trends, constructing a comprehensive and robust video anomaly analysis system that transcends traditional detectors.</p>
</sec>
</sec>
<sec id="s4">
<label>4</label>
<title>Experiments</title>
<sec id="s4_1">
<label>4.1</label>
<title>Datasets and Evaluation Metrics</title>
<p>The XD-Violence dataset [<xref ref-type="bibr" rid="ref-27">27</xref>] is an extensive collection of 4754 unedited videos, covering six categories of violent events sourced from movies, surveillance cameras, and web content. For weakly supervised settings, this dataset is divided into 3954 training videos and 800 testing videos. In contrast, the UCF-Crime dataset [<xref ref-type="bibr" rid="ref-16">16</xref>] comprises 13 categories of abnormal events captured in a wide range of environments, including streets, residential areas, and commercial spaces. This dataset provides 1610 training videos and 290 testing videos.</p>
<p>Following established protocols in previous studies [<xref ref-type="bibr" rid="ref-8">8</xref>,<xref ref-type="bibr" rid="ref-16">16</xref>], the area under the curve (AUC) of the frame-level receiver operating characteristic (ROC) curve is adopted as the evaluation metric for the UCF-Crime dataset. Meanwhile, for the XD-Violence dataset, the area under the precision-recall curve (AP) at the frame level is used as the evaluation metric [<xref ref-type="bibr" rid="ref-28">28</xref>].</p>
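For reference, the frame-level ROC AUC used for UCF-Crime can be computed with the rank-statistic formulation (the probability that a randomly chosen anomalous frame scores above a randomly chosen normal frame), which equals the area under the ROC curve. This is a self-contained numpy sketch, not the paper's evaluation code, and assumes binary frame labels:

```python
import numpy as np

def frame_level_auc(scores, labels):
    """Frame-level ROC AUC via the Mann-Whitney rank statistic.

    scores: per-frame anomaly scores; labels: 1 = anomalous, 0 = normal.
    Equivalent to the area under the ROC curve; ties count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]            # anomalous-frame scores
    neg = scores[labels == 0]            # normal-frame scores
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return float(wins / (len(pos) * len(neg)))
```

A perfect detector that ranks every anomalous frame above every normal frame yields an AUC of 1.0; chance-level scoring yields 0.5.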
</sec>
<sec id="s4_2">
<label>4.2</label>
<title>Experimental Settings</title>
<p>Following established methodologies [<xref ref-type="bibr" rid="ref-5">5</xref>,<xref ref-type="bibr" rid="ref-24">24</xref>], the framework uses an I3D model [<xref ref-type="bibr" rid="ref-29">29</xref>] pre-trained on the Kinetics dataset [<xref ref-type="bibr" rid="ref-30">30</xref>] to extract 1024-dimensional features from RGB video streams. The features are extracted from non-overlapping 16-frame segments, which is the standard temporal unit for I3D features and ensures fair comparison with existing methods. For prompt learning with external knowledge, conceptual representations for the 13 UCF-Crime anomaly categories and the generic XD-Violence categories are expanded using Wikidata. These expanded concepts are encoded into 768-dimensional semantic prototypes using the BLIP ViT-B/16 text encoder. Visual features are aligned with semantic prototypes via a temperature-guided contrastive learning mechanism, where the temperature parameter <inline-formula id="ieqn-128"><mml:math id="mml-ieqn-128"><mml:msub><mml:mi>&#x03C4;</mml:mi><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> is a learnable parameter initialized to 0.07, a standard value in contrastive learning frameworks. 
For the Top-K strategy in the MIL loss, the adaptive approach <inline-formula id="ieqn-129"><mml:math id="mml-ieqn-129"><mml:msub><mml:mi>k</mml:mi><mml:mrow><mml:mi>p</mml:mi><mml:mi>o</mml:mi><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo fence="false" stretchy="false">&#x230A;</mml:mo><mml:mrow><mml:mtext>N</mml:mtext></mml:mrow><mml:mi mathvariant="normal">&#x005F;</mml:mi><mml:mrow><mml:mtext>seg</mml:mtext></mml:mrow><mml:mrow><mml:mo>/</mml:mo></mml:mrow><mml:mn>16</mml:mn><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo fence="false" stretchy="false">&#x230B;</mml:mo></mml:math></inline-formula> was adopted, following the established practice in prior works [<xref ref-type="bibr" rid="ref-11">11</xref>,<xref ref-type="bibr" rid="ref-28">28</xref>] which has proven robust for videos of varying lengths.</p>
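The adaptive Top-K aggregation above can be sketched as follows. The function name is hypothetical, and the mean-pooling of the selected scores is an assumption (the text specifies only the formula for k_pos); only the rule k_pos = &#x230A;N_seg/16 + 1&#x230B; is taken from the paper:

```python
import numpy as np

def adaptive_topk_score(snippet_scores):
    """Video-level anomaly score under the adaptive Top-K MIL strategy.

    k_pos = floor(N_seg / 16 + 1), so K grows with video length; the K
    highest snippet scores are averaged into a single video-level score.
    """
    n_seg = len(snippet_scores)
    k = n_seg // 16 + 1                              # adaptive K
    topk = np.sort(np.asarray(snippet_scores, dtype=float))[-k:]
    return float(topk.mean())
```

For a 32-snippet video K = 3, while a short 10-snippet video falls back to K = 1 (the single most anomalous snippet), which is what makes the rule robust to varying video lengths.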
<p>Model training is conducted in an end-to-end multi-task learning framework. The Adam optimizer is adopted with an initial learning rate of 1 &#x00D7; 10<sup>&#x2212;4</sup>, weight decay of 5 &#x00D7; 10<sup>&#x2212;4</sup>, batch size of 128, and a total of 50 training epochs. A cosine annealing schedule is applied to adjust the learning rate over time.</p>
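The cosine annealing schedule can be written out explicitly. This sketch assumes the standard form that decays from the initial rate to zero over the 50 epochs; the exact variant used (e.g., warm restarts or a nonzero floor) is not specified in the text:

```python
import math

def cosine_annealed_lr(epoch, total_epochs=50, base_lr=1e-4):
    """Standard cosine annealing: lr decays from base_lr at epoch 0
    to 0 at epoch total_epochs, following 0.5 * (1 + cos(pi * t / T))."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```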
<p>For generating snippet-level pseudo-labels for the ERA module under weak supervision, the pseudo-label <inline-formula id="ieqn-130"><mml:math id="mml-ieqn-130"><mml:msub><mml:mi>y</mml:mi><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is derived from the video-level ground-truth annotations. For instance, in a video labeled as anomalous due to &#x201C;Arson&#x201D;, this category label is propagated to every snippet within that video, serving as the initial supervisory signal for the current event classification loss (<inline-formula id="ieqn-131"><mml:math id="mml-ieqn-131"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>event</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>). Although this labeling strategy introduces noise, the MIL framework helps the model implicitly identify salient snippets. The ERA module then leverages these snippets to model temporal event evolution.</p>
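The label-propagation step described above amounts to broadcasting the video-level category to all snippets. A trivial sketch (function name hypothetical; the "Normal" convention for non-anomalous videos is an assumption):

```python
def propagate_video_label(video_category, n_snippets):
    """Weak-supervision pseudo-labeling for the ERA module: the video-level
    category (e.g., "Arson") is copied to every snippet as the noisy initial
    y_event label; normal videos would receive "Normal" throughout."""
    return [video_category] * n_snippets
```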
</sec>
<sec id="s4_3">
<label>4.3</label>
<title>Comparison to Current Methods</title>
<p><bold>Results on XD-Violence.</bold> <xref ref-type="table" rid="table-1">Table 1</xref> reports the evaluation results on the XD-Violence dataset. Using the same I3D feature representations as competing methods, the proposed method achieves an AP of 86.49%, outperforming most existing semi-supervised and weakly supervised approaches. This result highlights the effectiveness of the multi-task learning design in detecting anomalies under complex scene conditions.</p>
<table-wrap id="table-1">
<label>Table 1</label>
<caption>
<title>Comparison with other methods on XD-Violence</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Supervision</th>
<th>Year</th>
<th>Method</th>
<th>Feature</th>
<th>AP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2">Semi-supervised</td>
<td>&#x2013;</td>
<td>SVM baseline</td>
<td>&#x2013;</td>
<td>50.78</td>
</tr>
<tr>
<td>2016</td>
<td>Conv-AE [<xref ref-type="bibr" rid="ref-31">31</xref>]</td>
<td>&#x2013;</td>
<td>30.77</td>
</tr>
<tr>
<td rowspan="11">Weakly-supervised</td>
<td>2021</td>
<td>RTFM [<xref ref-type="bibr" rid="ref-4">4</xref>]</td>
<td>I3D-RGB</td>
<td>77.81</td>
</tr>
<tr>
<td>2022</td>
<td>MSL [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>I3D-RGB</td>
<td>78.28</td>
</tr>
<tr>
<td>2022</td>
<td>MSL [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>VSwin-RGB</td>
<td>78.59</td>
</tr>
<tr>
<td>2023</td>
<td>Cho et al. [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>I3D-RGB</td>
<td>81.30</td>
</tr>
<tr>
<td>2023</td>
<td>NG-MIL [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>I3D-RGB</td>
<td>78.51</td>
</tr>
<tr>
<td>2024</td>
<td>AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>ViT-B/16</td>
<td>78.51</td>
</tr>
<tr>
<td>2024</td>
<td>PE-MIL [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>I3D-RGB</td>
<td>88.05</td>
</tr>
<tr>
<td>2024</td>
<td>Ghadiya et al. [<xref ref-type="bibr" rid="ref-34">34</xref>]</td>
<td>I3D-RGB</td>
<td>86.34</td>
</tr>
<tr>
<td>2025</td>
<td>MMVAD [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>I3D-RGB</td>
<td>81.98</td>
</tr>
<tr>
<td>2025</td>
<td>DAKD [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>I3D-RGB</td>
<td>85.12</td>
</tr>
<tr>
<td>&#x2013;</td>
<td>Ours</td>
<td>I3D-RGB</td>
<td>86.49</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-1fn1" fn-type="other">
<p>Note: SVM, Support Vector Machine; Conv-AE, Conv-Auto-Encoder; RTFM, Robust Temporal Feature Magnitude learning; MSL, Multi-Sequence Learning; NG-MIL, Normality Guided Multiple Instance Learning; PE-MIL, Prompt-Enhanced Multiple Instance Learning; MMVAD, Multimodal VAD; DAKD, Distilling Aggregated Knowledge with Disentangled Attention.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p><bold>Results on UCF-Crime.</bold> <xref ref-type="table" rid="table-2">Table 2</xref> presents the comparative performance on the UCF-Crime dataset. The proposed framework achieves an AUC of 88.19%, demonstrating competitive results against existing methods. This improvement is primarily attributed to the Knowledge-Guided Semantic Integration (KGSI) module, which integrates structured conceptual information via &#x201C;semantic anchors.&#x201D; Unlike traditional approaches, this mechanism enhances the model&#x2019;s discriminative capacity, particularly on UCF-Crime, whose diverse anomaly categories and high conceptual complexity place greater demands on semantic understanding. These findings indicate the method&#x2019;s effectiveness in addressing the long-standing challenge of knowledge insufficiency in WSVAD.</p>
<table-wrap id="table-2">
<label>Table 2</label>
<caption>
<title>Comparison with other methods on UCF-Crime</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Supervision</th>
<th>Year</th>
<th>Method</th>
<th>Feature</th>
<th>AUC (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Semi-supervised</td>
<td>2019</td>
<td>GODS [<xref ref-type="bibr" rid="ref-37">37</xref>]</td>
<td>BOW&#x002B;TCN</td>
<td>70.46</td>
</tr>
<tr>
<td rowspan="12">Weakly-supervised</td>
<td>2022</td>
<td>MSL [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>C3D-RGB</td>
<td>82.85</td>
</tr>
<tr>
<td>2022</td>
<td>MSL [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>I3D-RGB</td>
<td>85.30</td>
</tr>
<tr>
<td>2022</td>
<td>MSL [<xref ref-type="bibr" rid="ref-32">32</xref>]</td>
<td>VideoSwin-RGB</td>
<td>85.62</td>
</tr>
<tr>
<td>2023</td>
<td>NG-MIL [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>C3D-RGB</td>
<td>83.43</td>
</tr>
<tr>
<td>2023</td>
<td>NG-MIL [<xref ref-type="bibr" rid="ref-33">33</xref>]</td>
<td>I3D-RGB</td>
<td>85.63</td>
</tr>
<tr>
<td>2023</td>
<td>CoMo [<xref ref-type="bibr" rid="ref-3">3</xref>]</td>
<td>I3D-RGB</td>
<td>86.10</td>
</tr>
<tr>
<td>2024</td>
<td>AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>]</td>
<td>ViT-B/16</td>
<td>86.36</td>
</tr>
<tr>
<td>2023</td>
<td>UR-DMU [<xref ref-type="bibr" rid="ref-8">8</xref>]</td>
<td>I3D-RGB</td>
<td>86.97</td>
</tr>
<tr>
<td>2024</td>
<td>PE-MIL [<xref ref-type="bibr" rid="ref-11">11</xref>]</td>
<td>I3D-RGB</td>
<td>86.83</td>
</tr>
<tr>
<td>2025</td>
<td>DAKD [<xref ref-type="bibr" rid="ref-36">36</xref>]</td>
<td>I3D-RGB</td>
<td>87.71</td>
</tr>
<tr>
<td>2025</td>
<td>MMVAD [<xref ref-type="bibr" rid="ref-35">35</xref>]</td>
<td>I3D-RGB</td>
<td>87.78</td>
</tr>
<tr>
<td>2025</td>
<td>ViCap-AD [<xref ref-type="bibr" rid="ref-38">38</xref>]</td>
<td>I3D-RGB</td>
<td>87.20</td>
</tr>
<tr>
<td>Weakly-supervised</td>
<td>&#x2013;</td>
<td>Ours</td>
<td>I3D-RGB</td>
<td>88.19</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-2fn1" fn-type="other">
<p>Note: GODS, Generalized One-class Discriminative Subspaces; CoMo, Context&#x2013;Motion Interrelation Module; UR-DMU, Uncertainty Regulated Dual Memory Units; ViCap-AD, Video Caption Anomaly Detector.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p><bold>Fine-Grained Recognition Performance.</bold> To further validate its efficacy, this study evaluates the proposed approach on the more demanding task of fine-grained anomaly recognition. This capability is demonstrated on both datasets through mean Average Precision at different Intersection over Union thresholds (mAP@IoU) for UCF-Crime (<xref ref-type="table" rid="table-3">Table 3</xref>) and detailed per-class AP results for XD-Violence (<xref ref-type="fig" rid="fig-4">Fig. 4</xref>). As outlined in <xref ref-type="table" rid="table-3">Table 3</xref>, the model consistently outperforms contemporary methods, including the fine-grained specialist VADCLIP, across the entire range of IoU thresholds, achieving a state-of-the-art average mAP (AVG) of 11.97%. This result empirically corroborates the central hypothesis: a meticulous alignment between visual evidence and precise semantics is pivotal for enhancing recognition accuracy. The performance margin is particularly accentuated at higher IoU thresholds, underscoring the robustness of the proposed saliency-guided selective alignment strategy.</p>
<table-wrap id="table-3">
<label>Table 3</label>
<caption>
<title>Fine-grained comparisons on UCF-Crime</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th align="center" rowspan="2">Method</th>
<th colspan="6">mAP@IoU (%)</th>
</tr>
<tr>
<th>0.1</th>
<th>0.2</th>
<th>0.3</th>
<th>0.4</th>
<th>0.5</th>
<th>AVG</th>
</tr>
</thead>
<tbody>
<tr>
<td>AVVD [<xref ref-type="bibr" rid="ref-39">39</xref>]</td>
<td>10.27</td>
<td>7.01</td>
<td>6.25</td>
<td>3.42</td>
<td>3.29</td>
<td>6.05</td>
</tr>
<tr>
<td>VADCLIP [<xref ref-type="bibr" rid="ref-9">9</xref>]</td>
<td>11.72</td>
<td>7.83</td>
<td>6.40</td>
<td>4.53</td>
<td>2.93</td>
<td>6.68</td>
</tr>
<tr>
<td>ITC [<xref ref-type="bibr" rid="ref-40">40</xref>]</td>
<td>13.54</td>
<td>9.24</td>
<td>7.45</td>
<td>5.46</td>
<td>3.79</td>
<td>7.90</td>
</tr>
<tr>
<td>Ours</td>
<td>15.92</td>
<td>14.13</td>
<td>12.01</td>
<td>9.72</td>
<td>8.05</td>
<td>11.97</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-3fn1" fn-type="other">
<p>Note: AVVD, Audio-Visual Violence Detection; ITC, Injecting Text Clues.</p>
</fn>
</table-wrap-foot>
</table-wrap><fig id="fig-4">
<label>Figure 4</label>
<caption>
<title>AP results on individual anomaly classes of XD-Violence</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-4.tif"/>
</fig>
<p>In contrast to prior works that often establish coarse-grained, localized associations between category labels and event proposals, this framework introduces a dual-level granularity refinement. Firstly, it enriches semantic representation by substituting rudimentary category labels with comprehensive semantic anchors derived from external knowledge. Secondly, it institutes a dynamic attention mechanism that leverages anomaly salience scores to selectively focus on the most discriminative frames within a candidate region. This direct and refined correlation between semantically rich concepts and visually salient evidence empowers the model to delineate the full temporal extent of anomalous events with superior precision, especially for those that are subtle or intricate in nature.</p>
<p>To provide a more granular analysis of the model&#x2019;s fine-grained recognition capabilities, <xref ref-type="fig" rid="fig-4">Fig. 4</xref> presents a per-class performance comparison on the XD-Violence dataset against AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>]. The results show that this approach demonstrates a significant advantage in most categories. For instance, it achieves a nearly tenfold increase in AP for &#x201C;Abuse&#x201D; (60.5% vs. 6.1%) and a clear improvement for &#x201C;Shooting&#x201D; (59.3% vs. 26.1%). While AnomalyCLIP [<xref ref-type="bibr" rid="ref-10">10</xref>] holds a slight edge in the well-represented &#x201C;Riot&#x201D; category (92.7% vs. 91.0%), this method remains highly competitive and surpasses the baseline in most other tested scenarios. This detailed breakdown highlights the effectiveness of the proposed framework in enhancing the discriminative power for a diverse range of complex events, particularly those that require a deeper semantic understanding.</p>

</sec>
<sec id="s4_4">
<label>4.4</label>
<title>Ablation Study</title>
<p>A comprehensive ablation study is conducted to dissect the contributions of individual components and their synergistic effects within the proposed method. For this analysis, a baseline model is established, equipped solely with the temporal context fusion module. This baseline configuration omits the TRE mechanism, utilizing standard self-attention for contextual encoding. Furthermore, the anomaly classification head processes the fused visual features directly to generate an anomaly score. The training of this baseline is supervised exclusively by the anomaly detection loss (<inline-formula id="ieqn-132"><mml:math id="mml-ieqn-132"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>an</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>), without the contrastive loss from the KGSI module. <xref ref-type="table" rid="table-4">Table 4</xref> details the results of this study. To assess the practical viability of the framework, its operational efficiency is also evaluated using two key metrics: Test Time (s), which represents the average inference latency for a single video clip, and Model Size (MB), which denotes the storage footprint of the final model weights. A detailed analysis of these efficiency metrics reveals that the incremental computational cost of the proposed modules is minimal. The complete model incurs only a 57% relative increase in inference time (from 0.042 s to 0.066 s) and a 34% increase in model size (from 94 MB to 126 MB) over the baseline, while delivering a substantial 3.93% absolute improvement in AUC. This favorable trade-off underscores the architectural efficiency of the design: the complete model strikes an effective balance between predictive accuracy and computational overhead, making it well suited to deployment in practical applications.</p>
<table-wrap id="table-4">
<label>Table 4</label>
<caption>
<title>Ablation studies of proposed modules</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Baseline</th>
<th>TRE</th>
<th>KGSI</th>
<th>ERA</th>
<th>UCF AUC (%)</th>
<th>XD AP (%)</th>
<th>Test times (s)</th>
<th>Model size (MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>&#x2713;</td>
<td></td>
<td></td>
<td></td>
<td>84.26</td>
<td>83.29</td>
<td>0.042</td>
<td>94</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td></td>
<td>85.47</td>
<td>84.39</td>
<td>0.049</td>
<td>96</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td>87.25</td>
<td>85.98</td>
<td>0.064</td>
<td>112</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td></td>
<td>&#x2713;</td>
<td>86.38</td>
<td>85.03</td>
<td>0.054</td>
<td>101</td>
</tr>
<tr>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>&#x2713;</td>
<td>88.19</td>
<td>86.49</td>
<td>0.066</td>
<td>126</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn id="table-4fn1" fn-type="other">
<p>Note: The table follows a cumulative design where each row adds a new component to the previous configuration. A checkmark (&#x2713;) indicates the module is included in that configuration, while an empty cell indicates it is excluded.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>An analysis of the ablation results in <xref ref-type="table" rid="table-4">Table 4</xref> elucidates the contribution of each key component. Initially, incorporating the TRE mechanism into the baseline (Row 1 vs. Row 2) yields a 1.21% increase in AUC, with a negligible increase in latency (0.007 s) and model size (2 MB), confirming its efficiency in enhancing temporal modeling. Subsequently, the integration of the KGSI module (Row 2 vs. Row 3) delivers a further AUC enhancement of 1.78%. This significant performance gain comes with a moderate computational cost (an additional 0.015 s and 16 MB), which is justified by the critical role of conceptual disambiguation. Finally, the full model configuration, which incorporates the ERA module within a multi-task framework (Row 3 vs. Row 5), achieves peak performance with an AUC of 88.19%. The incorporation of the ERA module for causal reasoning adds only 0.002 s of inference time and 14 MB of parameters, demonstrating that the advanced capability for active risk assessment can be achieved with minimal overhead. This result illustrates that a well-formulated auxiliary task, such as fine-grained anomaly categorization, can provide informative gradient signals that regularize the main task, reinforcing the advantages of a co-training paradigm. Crucially, this runtime analysis confirms the model&#x2019;s practical viability. The complete model&#x2019;s inference speed of 0.066 s per clip translates to a processing throughput of approximately 15 clips per second, which comfortably meets the stringent requirements for real-time analysis and underscores the model&#x2019;s suitability for deployment in practical surveillance applications.</p>
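The overhead and throughput figures quoted above follow directly from the latency and size entries in Table 4; the arithmetic is spelled out below as a quick check:

```python
# Efficiency figures from Table 4 (baseline vs. complete model).
base_latency, full_latency = 0.042, 0.066   # seconds per clip
base_size, full_size = 94, 126              # model size in MB

latency_increase = (full_latency - base_latency) / base_latency  # ~57% relative
size_increase = (full_size - base_size) / base_size              # ~34% relative
throughput = 1.0 / full_latency                                  # ~15 clips/s
```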

<p>This study further validates the utility of semantic anchors through a comparison of different prompt templates (see <xref ref-type="table" rid="table-5">Table 5</xref>). The results demonstrate that employing Wikidata-augmented semantic prototypes yields notable performance gains over a baseline that uses only class names. Specifically, this approach boosts the AUC on UCF-Crime by 1.30% and the AP on XD-Violence by 1.23%. This outcome suggests a strong link between the semantic richness of prompts and the model&#x2019;s discriminative power. It confirms that a more descriptive knowledge-based foundation is crucial for enhancing anomaly detection accuracy.</p>
<table-wrap id="table-5">
<label>Table 5</label>
<caption>
<title>Performance comparison of different prompt templates in the KGSI module</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>Prompt template</th>
<th>UCF-Crime AUC (%)</th>
<th>XD-Violence AP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>KGSI module using only class names as prototypes</td>
<td>86.89</td>
<td>85.26</td>
</tr>
<tr>
<td>KGSI module with Wikidata-extended prototypes</td>
<td>88.19</td>
<td>86.49</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_5">
<label>4.5</label>
<title>Hyperparameter Sensitivity Analysis</title>
<p>The final performance of the proposed model relies on the effective balancing of its multi-task learning objectives, controlled by hyperparameters <inline-formula id="ieqn-133"><mml:math id="mml-ieqn-133"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-134"><mml:math id="mml-ieqn-134"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula>. A comprehensive sensitivity analysis was conducted by varying one hyperparameter across a wide range of values while keeping the other fixed, with performance reported on both the UCF-Crime and XD-Violence datasets, as shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref>.</p>
<fig id="fig-5">
<label>Figure 5</label>
<caption>
<title>Impact of hyperparameters <inline-formula id="ieqn-135"><mml:math id="mml-ieqn-135"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> and <inline-formula id="ieqn-136"><mml:math id="mml-ieqn-136"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> on model performance</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-5.tif"/>
</fig>
<p>The hyperparameter <inline-formula id="ieqn-137"><mml:math id="mml-ieqn-137"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> governs the contribution of the semantic alignment loss <inline-formula id="ieqn-138"><mml:math id="mml-ieqn-138"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. As illustrated in <xref ref-type="fig" rid="fig-5">Fig. 5</xref> (left), the model&#x2019;s performance on both datasets consistently peaks when <inline-formula id="ieqn-139"><mml:math id="mml-ieqn-139"><mml:mi>&#x03BB;</mml:mi><mml:mo>=</mml:mo><mml:mn>1.0</mml:mn></mml:math></inline-formula>. This optimal weight gives the semantic alignment loss the same importance as the primary anomaly detection loss (<inline-formula id="ieqn-140"><mml:math id="mml-ieqn-140"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>), underscoring that the KGSI module is not merely a regularizer but a core component. When <inline-formula id="ieqn-141"><mml:math id="mml-ieqn-141"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> is too small (e.g., 0.01), the model cannot fully leverage the provided semantic guidance, limiting its fine-grained recognition capability. Conversely, when <inline-formula id="ieqn-142"><mml:math id="mml-ieqn-142"><mml:mi>&#x03BB;</mml:mi></mml:math></inline-formula> becomes excessively large (e.g., 10.0), the training objective overemphasizes semantic matching at the expense of the primary detection task, leading to performance degradation.</p>

<p>Similarly, the weight <inline-formula id="ieqn-143"><mml:math id="mml-ieqn-143"><mml:mi>&#x03B3;</mml:mi></mml:math></inline-formula> for the event relation modeling loss (<inline-formula id="ieqn-144"><mml:math id="mml-ieqn-144"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>era</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula>) acts as a powerful, yet subtle, regularizer. The performance trend, shown in <xref ref-type="fig" rid="fig-5">Fig. 5</xref> (right), confirms that completely removing this objective <inline-formula id="ieqn-145"><mml:math id="mml-ieqn-145"><mml:mo stretchy="false">(</mml:mo><mml:mi>&#x03B3;</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo stretchy="false">)</mml:mo></mml:math></inline-formula> results in a significant performance drop. The model&#x2019;s performance reaches its optimum on both datasets when <inline-formula id="ieqn-146"><mml:math id="mml-ieqn-146"><mml:mi>&#x03B3;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula>. However, as <italic>&#x03B3;</italic> increases beyond this point (e.g., to 1.0), a consistent decline is observed. This finding suggests that while the predictive guidance from <inline-formula id="ieqn-147"><mml:math id="mml-ieqn-147"><mml:msub><mml:mrow><mml:mi>&#x02112;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mtext>era</mml:mtext></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> is highly beneficial, an excessively high weight can lead to over-regularization, where the more complex future-prediction task begins to interfere with the stable learning signals from the primary detection and alignment tasks. Therefore, a weight of <inline-formula id="ieqn-148"><mml:math id="mml-ieqn-148"><mml:mi>&#x03B3;</mml:mi><mml:mo>=</mml:mo><mml:mn>0.1</mml:mn></mml:math></inline-formula> provides the ideal balance.</p>

<p>Impact of attention scaling factor <inline-formula id="ieqn-149"><mml:math id="mml-ieqn-149"><mml:mi>&#x03BA;</mml:mi></mml:math></inline-formula>. Beyond the loss weighting factors, the sensitivity of the attention scaling factor <inline-formula id="ieqn-150"><mml:math id="mml-ieqn-150"><mml:mi>&#x03BA;</mml:mi></mml:math></inline-formula>, used in the context-sensitive separation (<xref ref-type="disp-formula" rid="eqn-8">Eqs. (8)</xref> and <xref ref-type="disp-formula" rid="eqn-9">(9)</xref>), was investigated. This parameter controls the sharpness of the distribution separating foreground and background features. As summarized in <xref ref-type="table" rid="table-6">Table 6</xref>, performance peaks at <inline-formula id="ieqn-151"><mml:math id="mml-ieqn-151"><mml:mi>&#x03BA;</mml:mi><mml:mo>=</mml:mo><mml:mn>10</mml:mn></mml:math></inline-formula>, achieving an AUC of 88.19% on UCF-Crime and an AP of 86.49% on XD-Violence. Lower values (e.g., <inline-formula id="ieqn-152"><mml:math id="mml-ieqn-152"><mml:mi>&#x03BA;</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula> or 5) result in a more uniform attention distribution, diluting the focus on the most salient anomalous regions. Conversely, a higher value (<inline-formula id="ieqn-153"><mml:math id="mml-ieqn-153"><mml:mi>&#x03BA;</mml:mi><mml:mo>=</mml:mo><mml:mn>20</mml:mn></mml:math></inline-formula>) over-concentrates the attention, potentially excluding relevant contextual information and leading to a slight performance decline. Thus, <inline-formula id="ieqn-154"><mml:math id="mml-ieqn-154"><mml:mi>&#x03BA;</mml:mi><mml:mo>=</mml:mo><mml:mn>10</mml:mn></mml:math></inline-formula> is identified as the optimal setting, effectively balancing focused attention with contextual awareness.</p>
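The role of &#x03BA; can be illustrated with a scaled softmax over salience scores. This is a sketch of the general mechanism, not a reproduction of Eqs. (8) and (9), whose exact form is defined earlier in the paper; the function name and two-snippet example are assumptions:

```python
import numpy as np

def scaled_attention(salience, kappa=10.0):
    """Softmax over anomaly salience scores, sharpened by scaling factor kappa.

    Larger kappa concentrates the attention weights on the most salient
    snippets; smaller kappa yields a more uniform distribution.
    """
    z = kappa * np.asarray(salience, dtype=float)
    z -= z.max()                 # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()
```

With salience scores [0.1, 0.9], &#x03BA; = 1 spreads the weight fairly evenly, while &#x03BA; = 10 pushes nearly all of it onto the more salient snippet, mirroring the trade-off between focus and contextual coverage discussed above.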
<table-wrap id="table-6">
<label>Table 6</label>
<caption>
<title>The impact of attention scaling factor <inline-formula id="ieqn-155"><mml:math id="mml-ieqn-155"><mml:mi>&#x03BA;</mml:mi></mml:math></inline-formula> on model performance</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th><inline-formula id="ieqn-156"><mml:math id="mml-ieqn-156"><mml:mi mathvariant="bold-italic">&#x03BA;</mml:mi></mml:math></inline-formula> Value</th>
<th>UCF-Crime AUC (%)</th>
<th>XD-Violence AP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>86.12</td>
<td>84.55</td>
</tr>
<tr>
<td>5</td>
<td>87.78</td>
<td>86.01</td>
</tr>
<tr>
<td>10</td>
<td>88.19</td>
<td>86.49</td>
</tr>
<tr>
<td>20</td>
<td>88.03</td>
<td>86.17</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Impact of GRU hidden layer size. The dimensionality of the hidden state in the GRU network directly influences the capacity of the ERA module to encode temporal event evolution. The impact of varying this hidden size is presented in <xref ref-type="table" rid="table-7">Table 7</xref>. A hidden size of 256 yields the best performance, indicating it provides sufficient representational capacity for capturing complex event dynamics without overfitting. A smaller hidden size of 128 appears to limit the model&#x2019;s temporal modeling capability, while a larger size of 512 may introduce overfitting, as evidenced by the slight degradation in performance metrics.</p>
<table-wrap id="table-7">
<label>Table 7</label>
<caption>
<title>The impact of GRU hidden layer size on model performance</title>
</caption>
<table>
<colgroup>
<col align="center"/>
<col align="center"/>
<col align="center"/> </colgroup>
<thead>
<tr>
<th>GRU Hidden Size</th>
<th>UCF-Crime AUC (%)</th>
<th>XD-Violence AP (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>128</td>
<td>87.72</td>
<td>85.91</td>
</tr>
<tr>
<td>256</td>
<td>88.19</td>
<td>86.49</td>
</tr>
<tr>
<td>512</td>
<td>88.02</td>
<td>86.04</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s4_6">
<label>4.6</label>
<title>Qualitative Results</title>
<p>To intuitively demonstrate the effectiveness and interpretive power of the proposed ERA module, the relational knowledge it learned after being trained on the UCF-Crime dataset is visualized. This qualitative analysis offers insight into the model&#x2019;s capacity to comprehend the logical progression of events. <xref ref-type="fig" rid="fig-6">Figs. 6</xref> and <xref ref-type="fig" rid="fig-7">7</xref> present the results.</p>
<fig id="fig-6">
<label>Figure 6</label>
<caption>
<title>Heatmap of the learned event transition strengths on the UCF-Crime dataset</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-6.tif"/>
</fig>
<fig id="fig-7">
<label>Figure 7</label>
<caption>
<title>Learned evolutionary pathways of high-risk anomalies</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-7.tif"/>
</fig>
<p><xref ref-type="fig" rid="fig-6">Fig. 6</xref> presents a heatmap of the global event transition matrix captured by the ERA module after training on UCF-Crime. In this matrix, each cell (<italic>i</italic>, <italic>j</italic>) represents the learned transition strength from a source event <italic>i</italic> (on the <italic>y</italic>-axis) to a subsequent event <italic>j</italic> (on the <italic>x</italic>-axis), providing a comprehensive map of all pairwise event correlations. This heatmap serves as a visual &#x201C;map&#x201D; of the model&#x2019;s learned common sense, revealing which events are likely to follow others based on the model&#x2019;s understanding. Brighter cells indicate stronger learned connections.</p>
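<p>The reading of the matrix can be illustrated with a small counting sketch. This is a simplification under stated assumptions: the actual ERA module learns transition strengths end-to-end, whereas the toy function below merely normalizes co-occurrence counts of consecutive event labels, and the event list and sequences are invented for illustration.</p>

```python
import numpy as np

EVENTS = ["Normal", "Stealing", "Assault", "Explosion", "Robbery", "Burglary"]
IDX = {e: i for i, e in enumerate(EVENTS)}

def transition_matrix(sequences):
    """Cell (i, j) holds the normalized strength of the transition
    from source event i (row) to subsequent event j (column)."""
    M = np.zeros((len(EVENTS), len(EVENTS)))
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            M[IDX[src], IDX[dst]] += 1
    total = M.sum()
    return M / total if total else M

# Toy per-video event-label sequences standing in for model predictions.
seqs = [
    ["Explosion", "Assault", "Normal"],
    ["Stealing", "Assault"],
    ["Robbery", "Burglary", "Normal"],
]
M = transition_matrix(seqs)
# M[IDX["Explosion"], IDX["Assault"]] is the Explosion-to-Assault cell.
```

<p>Rendering <monospace>M</monospace> with a standard heatmap routine (e.g., <monospace>matplotlib.pyplot.imshow</monospace>) reproduces the layout of Fig. 6, with brighter cells marking stronger learned transitions.</p>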

<p>The heatmap demonstrates that the model has successfully learned logical, high-frequency event progressions that align with real-world intuition. Several strong correlations that security professionals would expect are observable: a particularly bright cell at the intersection of the &#x201C;Explosion&#x201D; row and the &#x201C;Assault&#x201D; column, with the highest value in the matrix (0.062), indicates that the model has robustly learned that explosions often lead to violent confrontations or assaults in the aftermath. Similarly, a strong link is visible from &#x201C;Stealing&#x201D; to &#x201C;Assault&#x201D; (0.052), capturing a logical escalation from theft to violent confrontation. The model also identifies a notable connection from &#x201C;Robbery&#x201D; to &#x201C;Burglary&#x201D; (0.046), reflecting a recognized association between different types of property crime. This visual evidence is crucial because it demonstrates that the ERA module is not a &#x201C;black box.&#x201D; The ability to inspect such a matrix gives practitioners an interpretable tool for understanding the model&#x2019;s reasoning, confirming that its predictions are grounded in logical, learnable event sequences rather than opaque statistical correlations. This degree of transparency represents a key step toward building trust in automated surveillance systems.</p>
<p>Building upon this global relationship map, <xref ref-type="fig" rid="fig-7">Fig. 7</xref> zooms in on the most critical findings by visualizing the top-3 ranked high-risk event trajectories. Each trajectory illustrates a multi-step sequence of events that is highly likely to escalate into a severe anomaly. Crucially, the ability to identify these longer chains showcases the ERA module&#x2019;s primary strength in modeling long-range temporal dependencies. For instance, the top-ranked trajectory reveals a complex pattern of &#x201C;Arrest&#x201D; &#x2192; &#x201C;Assault&#x201D; &#x2192; &#x201C;Burglary&#x201D; &#x2192; &#x201C;Normal&#x201D; &#x2192; &#x201C;Stealing&#x201D; &#x2192; &#x201C;Arson&#x201D;, with a cumulative score of 0.207. The presence of a &#x201C;Normal&#x201D; segment within a high-risk chain is particularly insightful, demonstrating that the model can capture subtle, real-world criminal behaviors, such as a perpetrator feigning normalcy between illicit acts.</p>
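<p>Given such a transition matrix, ranking candidate chains by cumulative transition strength can be sketched in a few lines. This is an assumed reading of how the trajectories in Fig. 7 are scored (the paper reports a cumulative score per chain); the matrix values, event names, exhaustive enumeration, and no-repeat constraint below are illustrative choices, not the authors&#x2019; algorithm.</p>

```python
import heapq
import itertools

EVENTS = ["Normal", "Stealing", "Assault", "Arson"]
# Toy transition strengths (rows: source event, columns: next event).
M = [
    [0.00, 0.02, 0.01, 0.00],
    [0.03, 0.00, 0.05, 0.01],
    [0.02, 0.01, 0.00, 0.04],
    [0.01, 0.00, 0.02, 0.00],
]

def top_risk_trajectories(matrix, events, length=3, k=3):
    """Rank event chains by the sum of transition strengths along the chain."""
    scored = []
    for path in itertools.permutations(range(len(events)), length):
        score = sum(matrix[a][b] for a, b in zip(path, path[1:]))
        scored.append((score, [events[i] for i in path]))
    return heapq.nlargest(k, scored)

best = top_risk_trajectories(M, EVENTS)
# best[0] is the highest-scoring chain: Stealing -> Assault -> Arson (0.09).
```

<p>Exhaustive enumeration is exponential in chain length, so a practical system would prune with beam search; the additive scoring above is what makes longer multi-step escalations, such as the six-event trajectory in Fig. 7, comparable against shorter ones.</p>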
<p>This result vividly demonstrates the model&#x2019;s capacity for active risk assessment. By moving beyond the detection of isolated events to understanding their evolutionary patterns, the ERA module provides a deeper, more actionable insight into how dangerous situations develop over time.</p>
<p>To visually substantiate the quantitative results and demonstrate the method&#x2019;s practical efficacy, a qualitative analysis of the generated anomaly scores on representative videos from the UCF-Crime and XD-Violence datasets is presented. The anomaly score curves, depicted in <xref ref-type="fig" rid="fig-8">Fig. 8</xref>, offer a clear visualization of the model&#x2019;s proficiency in temporal localization. The figure provides compelling evidence of the model&#x2019;s adaptability in handling anomalies of varying durations and characteristics. For instance, in the &#x2018;Explosion&#x2019; example shown in <xref ref-type="fig" rid="fig-8">Fig. 8a</xref>, the model accurately captures the entire duration of a long, developing event, with the anomaly score gradually increasing as the situation escalates. Conversely, for an abrupt, short-duration event like &#x2018;Fighting&#x2019; shown in <xref ref-type="fig" rid="fig-8">Fig. 8b</xref>, the model responds with a sharp and distinct spike, precisely pinpointing the critical moment. Furthermore, the method performs equally well when handling entirely normal scenarios, as <xref ref-type="fig" rid="fig-8">Fig. 8c</xref> demonstrates, where the anomaly score consistently remains low, robustly confirming the model&#x2019;s reliability in distinguishing between normal and abnormal behavior. Similarly, for an anomaly event like &#x2018;Shooting&#x2019; shown in <xref ref-type="fig" rid="fig-8">Fig. 8d</xref>, which occurs in the first half of the video, the model quickly raises the anomaly score close to 1.0 and maintains a high score throughout the event&#x2019;s duration. This robust capability to handle diverse temporal patterns validates the effectiveness of the TCF module. It confirms that the module, enhanced by the TRE, successfully captures both long-range contextual dependencies and short-term, sudden changes in the video sequence. 
The strong alignment between the predicted scores and the ground-truth abnormal periods (indicated by the shaded regions) demonstrates that this approach can learn discriminative spatio-temporal representations for accurate anomaly localization, even under the challenging conditions of weak supervision without frame-level annotations.</p>
<fig id="fig-8">
<label>Figure 8</label>
<caption>
<title>Visual results of the proposed method on the UCF-Crime and XD-Violence datasets. The blue curve shows the predicted anomaly scores, the light-pink shaded areas mark ground-truth anomalous frames, and red/green boxes highlight representative abnormal and normal events</title>
</caption>
<graphic mimetype="image" mime-subtype="tif" xlink:href="CMC_73850-fig-8.tif"/>
</fig>
</sec>
</sec>
<sec id="s5">
<label>5</label>
<title>Conclusion and Discussion</title>
<p>This study addresses two central limitations in weakly supervised video anomaly detection: the coarse-grained supervision of the MIL framework, which hinders fine-grained recognition, and the absence of temporal-causal reasoning, leading to opacity in the decision-making process. To overcome these challenges, a novel method is introduced that synergistically integrates dynamic semantic guidance with event relation analysis. This approach departs from traditional binary-supervised methods by first incorporating a dynamic semantic guidance module. This component leverages semantic prototypes derived from external knowledge (Wikidata) to provide robust, fine-grained supervision, enabling the model to not only detect anomalies but also distinguish between specific types of anomalies. On the XD-Violence dataset, the model achieves an average AP of 65.6% in fine-grained anomaly classification, demonstrating its capability to perform detailed and accurate semantic discrimination. Furthermore, an ERA module is introduced to model the explicit evolutionary patterns of events. By learning the logical sequences of how situations escalate, the ERA module provides the crucial temporal and causal context for its predictions. By revealing an event&#x2019;s development over time, the framework demonstrates the logical reasoning behind why a situation is deemed anomalous, thereby addressing a core interpretability limitation of conventional weakly supervised models. The effectiveness of the integrated framework is confirmed by its strong detection performance, achieving a frame-level AUC of 88.19% on the UCF-Crime dataset and an AP of 86.49% on the XD-Violence dataset.</p>
<p>Notably, the presented framework achieves this performance utilizing only RGB inputs. This design choice is motivated by the need to ensure model generality and deployment feasibility. In numerous real-world surveillance scenarios, such as public live streams or legacy security systems, audio data is often unavailable, unreliable, or sensitive to privacy concerns. The establishment of a state-of-the-art baseline with the most universally available visual modality reinforces the framework&#x2019;s broad applicability. It is acknowledged that audio can provide valuable complementary cues, and its exclusion here constitutes a limitation that points to a clear future direction.</p>
<p>Ablation studies further confirm the complementary effects of the two core contributions: the semantic guidance that enhances what the model sees, and the event relation analysis that provides a logical structure for how events unfold. Together, these facilitate a qualitative transition from simple anomaly detection to a more comprehensive understanding of anomalies. This progression reflects a broader shift from data-driven pattern recognition to knowledge-guided video comprehension, establishing a new paradigm for weakly supervised visual analysis.</p>
<p>Building upon this foundation, future research will focus on advancing the framework&#x2019;s predictive and reasoning capabilities. A clear path forward involves extending the ERA module&#x2019;s predictive horizon to multi-step risk forecasting and creating a more holistic system by incorporating complementary data streams. This direction directly addresses the current limitation by integrating critical audio cues available in datasets like XD-Violence. Furthermore, to build upon the introduced causal-temporal reasoning, subsequent work will incorporate more formal causal frameworks, such as structural causal models (SCMs), to achieve a more theoretically grounded understanding of event causality.</p>
</sec>
</body>
<back>
<ack>
<p>Not applicable.</p>
</ack>
<sec>
<title>Funding Statement</title>
<p>The authors received no specific funding for this study.</p>
</sec>
<sec>
<title>Author Contributions</title>
<p>The authors confirm contribution to the paper as follows: study conception and design, Weishan Gao and Ye Wang; writing&#x2014;original draft preparation, Weishan Gao and Xiaoyin Wang; writing&#x2014;review and editing, Xiaoyin Wang and Xiaochuan Jing; data curation, Ye Wang; supervision, Xiaochuan Jing. All authors reviewed the results and approved the final version of the manuscript.</p>
</sec>
<sec sec-type="data-availability">
<title>Availability of Data and Materials</title>
<p>The data that support the findings of this study are openly available on the website: <ext-link ext-link-type="uri" xlink:href="https://www.crcv.ucf.edu/projects/real-world/">https://www.crcv.ucf.edu/projects/real-world/</ext-link> (accessed on 01 September 2025) and <ext-link ext-link-type="uri" xlink:href="https://roc-ng.github.io/XD-Violence/">https://roc-ng.github.io/XD-Violence/</ext-link> (accessed on 01 September 2025).</p>
</sec>
<sec>
<title>Ethics Approval</title>
<p>Not applicable.</p>
</sec>
<sec sec-type="COI-statement">
<title>Conflicts of Interest</title>
<p>The authors declare no conflicts of interest to report regarding the present study.</p>
</sec>
<ref-list content-type="authoryear">
<title>References</title>
<ref id="ref-1"><label>[1]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Abdalla</surname> <given-names>M</given-names></string-name>, <string-name><surname>Javed</surname> <given-names>S</given-names></string-name>, <string-name><surname>Al Radi</surname> <given-names>M</given-names></string-name>, <string-name><surname>Ulhaq</surname> <given-names>A</given-names></string-name>, <string-name><surname>Werghi</surname> <given-names>N</given-names></string-name></person-group>. <article-title>Video anomaly detection in 10 years: a survey and outlook</article-title>. <source>Neural Comput Appl</source>. <year>2025</year>;<volume>37</volume>(<issue>32</issue>):<fpage>26321</fpage>&#x2013;<lpage>64</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s00521-025-11659-8</pub-id>.</mixed-citation></ref>
<ref id="ref-2"><label>[2]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Caetano</surname> <given-names>F</given-names></string-name>, <string-name><surname>Carvalho</surname> <given-names>P</given-names></string-name>, <string-name><surname>Mastralexi</surname> <given-names>C</given-names></string-name>, <string-name><surname>Cardoso</surname> <given-names>JS</given-names></string-name></person-group>. <article-title>Enhancing weakly-supervised video anomaly detection with temporal constraints</article-title>. <source>IEEE Access</source>. <year>2025</year>;<volume>13</volume>:<fpage>70882</fpage>&#x2013;<lpage>94</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ACCESS.2025.3560767</pub-id>.</mixed-citation></ref>
<ref id="ref-3"><label>[3]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Cho</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>M</given-names></string-name>, <string-name><surname>Hwang</surname> <given-names>S</given-names></string-name>, <string-name><surname>Park</surname> <given-names>C</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>K</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Look around for anomalies: weakly-supervised anomaly detection via context-motion relational learning</article-title>. In: <conf-name>Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17&#x2013;24</conf-name>; <publisher-loc>Vancouver, BC, Canada</publisher-loc>. p. <fpage>12137</fpage>&#x2013;<lpage>46</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52729.2023.01168</pub-id>.</mixed-citation></ref>
<ref id="ref-4"><label>[4]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Tian</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Singh</surname> <given-names>R</given-names></string-name>, <string-name><surname>Verjans</surname> <given-names>JW</given-names></string-name>, <string-name><surname>Carneiro</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Weakly-supervised video anomaly detection with robust temporal feature magnitude learning</article-title>. In: <conf-name>Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 10&#x2013;17</conf-name>; <publisher-loc>Montreal, QC, Canada</publisher-loc>. p. <fpage>4955</fpage>&#x2013;<lpage>66</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00493</pub-id>.</mixed-citation></ref>
<ref id="ref-5"><label>[5]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Pu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>L</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Learning prompt-enhanced context features for weakly-supervised video anomaly detection</article-title>. <source>IEEE Trans Image Process</source>. <year>2024</year>;<volume>33</volume>(<issue>11</issue>):<fpage>4923</fpage>&#x2013;<lpage>36</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2024.3451935</pub-id>; <pub-id pub-id-type="pmid">39236124</pub-id></mixed-citation></ref>
<ref id="ref-6"><label>[6]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Dynamic local aggregation network with adaptive clusterer for anomaly detection</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2022; 2022 Oct 23&#x2013;27</conf-name>; <publisher-loc>Tel Aviv, Israel</publisher-loc>. p. <fpage>404</fpage>&#x2013;<lpage>21</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-19772-7_24</pub-id>.</mixed-citation></ref>
<ref id="ref-7"><label>[7]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>G</given-names></string-name>, <string-name><surname>Cai</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zeng</surname> <given-names>X</given-names></string-name>, <string-name><surname>Zhao</surname> <given-names>R</given-names></string-name></person-group>. <article-title>Scale-aware spatio-temporal relation learning for video anomaly detection</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2022; 2022 Oct 23&#x2013;27</conf-name>; <publisher-loc>Tel Aviv, Israel</publisher-loc>. p. <fpage>333</fpage>&#x2013;<lpage>50</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-19772-7_20</pub-id>.</mixed-citation></ref>
<ref id="ref-8"><label>[8]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zhou</surname> <given-names>H</given-names></string-name>, <string-name><surname>Yu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Yang</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Dual memory units with uncertainty regulation for weakly supervised video anomaly detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2023</year>;<volume>37</volume>(<issue>3</issue>):<fpage>3769</fpage>&#x2013;<lpage>77</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v37i3.25489</pub-id>.</mixed-citation></ref>
<ref id="ref-9"><label>[9]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>X</given-names></string-name>, <string-name><surname>Pang</surname> <given-names>G</given-names></string-name>, <string-name><surname>Zhou</surname> <given-names>L</given-names></string-name>, <string-name><surname>Yan</surname> <given-names>Q</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>P</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>VadCLIP: adapting vision-language models for weakly supervised video anomaly detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2024</year>;<volume>38</volume>(<issue>6</issue>):<fpage>6074</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v38i6.28423</pub-id>.</mixed-citation></ref>
<ref id="ref-10"><label>[10]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Zanella</surname> <given-names>L</given-names></string-name>, <string-name><surname>Liberatori</surname> <given-names>B</given-names></string-name>, <string-name><surname>Menapace</surname> <given-names>W</given-names></string-name>, <string-name><surname>Poiesi</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Ricci</surname> <given-names>E</given-names></string-name></person-group>. <article-title>Delving into CLIP latent space for video anomaly recognition</article-title>. <source>Comput Vis Image Underst</source>. <year>2024</year>;<volume>249</volume>:<fpage>104163</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.cviu.2024.104163</pub-id>.</mixed-citation></ref>
<ref id="ref-11"><label>[11]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Chen</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>L</given-names></string-name>, <string-name><surname>Su</surname> <given-names>L</given-names></string-name>, <string-name><surname>Zha</surname> <given-names>ZJ</given-names></string-name>, <string-name><surname>Huang</surname> <given-names>Q</given-names></string-name></person-group>. <article-title>Prompt-enhanced multiple instance learning for weakly supervised video anomaly detection</article-title>. In: <conf-name>Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024 Jun 16&#x2013;22</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>18319</fpage>&#x2013;<lpage>29</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52733.2024.01734</pub-id>.</mixed-citation></ref>
<ref id="ref-12"><label>[12]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Vrande&#x010D;i&#x0107;</surname> <given-names>D</given-names></string-name>, <string-name><surname>Kr&#x00F6;tzsch</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Wikidata: a free collaborative knowledgebase</article-title>. <source>Commun ACM</source>. <year>2014</year>;<volume>57</volume>(<issue>10</issue>):<fpage>78</fpage>&#x2013;<lpage>85</lpage>. doi:<pub-id pub-id-type="doi">10.1145/2629489</pub-id>.</mixed-citation></ref>
<ref id="ref-13"><label>[13]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>J</given-names></string-name>, <string-name><surname>Li</surname> <given-names>D</given-names></string-name>, <string-name><surname>Xiong</surname> <given-names>C</given-names></string-name>, <string-name><surname>Hoi</surname> <given-names>S</given-names></string-name></person-group>. <article-title>BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>. In: <conf-name>Proceedings of the 2022 International Conference on Machine Learning (ICML); 2022 Jul 17&#x2013;23</conf-name>; <publisher-loc>Baltimore, MD, USA</publisher-loc>. p. <fpage>12888</fpage>&#x2013;<lpage>900</lpage>.</mixed-citation></ref>
<ref id="ref-14"><label>[14]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Aslam</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kolekar</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>Unsupervised anomalous event detection in videos using spatio-temporal inter-fused autoencoder</article-title>. <source>Multimed Tools Appl</source>. <year>2022</year>;<volume>81</volume>(<issue>29</issue>):<fpage>42457</fpage>&#x2013;<lpage>82</lpage>. doi:<pub-id pub-id-type="doi">10.1007/s11042-022-13496-6</pub-id>.</mixed-citation></ref>
<ref id="ref-15"><label>[15]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Aslam</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kolekar</surname> <given-names>MH</given-names></string-name></person-group>. <article-title>TransGANomaly: transformer based generative adversarial network for video anomaly detection</article-title>. <source>J Vis Commun Image Represent</source>. <year>2024</year>;<volume>100</volume>(<issue>7</issue>):<fpage>104108</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.jvcir.2024.104108</pub-id>.</mixed-citation></ref>
<ref id="ref-16"><label>[16]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Sultani</surname> <given-names>W</given-names></string-name>, <string-name><surname>Chen</surname> <given-names>C</given-names></string-name>, <string-name><surname>Shah</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Real-world anomaly detection in surveillance videos</article-title>. In: <conf-name>Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18&#x2013;22</conf-name>; <publisher-loc>Salt Lake City, UT, USA</publisher-loc>. p. <fpage>6479</fpage>&#x2013;<lpage>88</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2018.00678</pub-id>.</mixed-citation></ref>
<ref id="ref-17"><label>[17]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Zhong</surname> <given-names>JX</given-names></string-name>, <string-name><surname>Li</surname> <given-names>N</given-names></string-name>, <string-name><surname>Kong</surname> <given-names>W</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>S</given-names></string-name>, <string-name><surname>Li</surname> <given-names>TH</given-names></string-name>, <string-name><surname>Li</surname> <given-names>G</given-names></string-name></person-group>. <article-title>Graph convolutional label noise cleaner: train a plug-and-play action classifier for anomaly detection</article-title>. In: <conf-name>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 16&#x2013;20</conf-name>; <publisher-loc>Long Beach, CA, USA</publisher-loc>. p. <fpage>1237</fpage>&#x2013;<lpage>46</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2019.00133</pub-id>.</mixed-citation></ref>
<ref id="ref-18"><label>[18]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Zhu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Newsam</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Motion-aware feature for improved video anomaly detection</article-title>. <comment>arXiv:1907.10211. 2019</comment>. doi:<pub-id pub-id-type="doi">10.48550/arxiv.1907.10211</pub-id>.</mixed-citation></ref>
<ref id="ref-19"><label>[19]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hussain</surname> <given-names>A</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>N</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>ZA</given-names></string-name>, <string-name><surname>Yar</surname> <given-names>H</given-names></string-name>, <string-name><surname>Baik</surname> <given-names>SW</given-names></string-name></person-group>. <article-title>Edge-assisted framework for instant anomaly detection and cloud-based anomaly recognition in smart surveillance</article-title>. <source>Eng Appl Artif Intell</source>. <year>2025</year>;<volume>160</volume>(<issue>1</issue>):<fpage>111936</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.engappai.2025.111936</pub-id>.</mixed-citation></ref>
<ref id="ref-20"><label>[20]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Hussain</surname> <given-names>A</given-names></string-name>, <string-name><surname>Ullah</surname> <given-names>W</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>N</given-names></string-name>, <string-name><surname>Khan</surname> <given-names>ZA</given-names></string-name>, <string-name><surname>Yar</surname> <given-names>H</given-names></string-name>, <string-name><surname>Baik</surname> <given-names>SW</given-names></string-name></person-group>. <article-title>Class-incremental learning network for real-time anomaly recognition in surveillance environments</article-title>. <source>Pattern Recognit</source>. <year>2026</year>;<volume>170</volume>(<issue>5</issue>):<fpage>112064</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.patcog.2025.112064</pub-id>.</mixed-citation></ref>
<ref id="ref-21"><label>[21]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>M</given-names></string-name>, <string-name><surname>Xing</surname> <given-names>J</given-names></string-name>, <string-name><surname>Mei</surname> <given-names>J</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jiang</surname> <given-names>Y</given-names></string-name></person-group>. <article-title>ActionCLIP: adapting language-image pretrained models for video action recognition</article-title>. <source>IEEE Trans Neural Netw Learn Syst</source>. <year>2025</year>;<volume>36</volume>(<issue>1</issue>):<fpage>625</fpage>&#x2013;<lpage>37</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TNNLS.2023.3331841</pub-id>; <pub-id pub-id-type="pmid">37988204</pub-id></mixed-citation></ref>
<ref id="ref-22"><label>[22]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ju</surname> <given-names>C</given-names></string-name>, <string-name><surname>Han</surname> <given-names>T</given-names></string-name>, <string-name><surname>Zheng</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Xie</surname> <given-names>W</given-names></string-name></person-group>. <article-title>Prompting visual-language models for efficient video understanding</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2022; 2022 Oct 23&#x2013;27</conf-name>; <publisher-loc>Tel Aviv, Israel</publisher-loc>. p. <fpage>105</fpage>&#x2013;<lpage>24</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-031-19833-5_7</pub-id>.</mixed-citation></ref>
<ref id="ref-23"><label>[23]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Yang</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Text prompt with normality guidance for weakly supervised video anomaly detection</article-title>. In: <conf-name>Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024 Jun 17&#x2013;21</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>18899</fpage>&#x2013;<lpage>908</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR52733.2024.01788</pub-id>.</mixed-citation></ref>
<ref id="ref-24"><label>[24]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Learning causal temporal relation and feature discrimination for anomaly detection</article-title>. <source>IEEE Trans Image Process</source>. <year>2021</year>;<volume>30</volume>:<fpage>3513</fpage>&#x2013;<lpage>27</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2021.3062192</pub-id>; <pub-id pub-id-type="pmid">33656993</pub-id></mixed-citation></ref>
<ref id="ref-25"><label>[25]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Radford</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>JW</given-names></string-name>, <string-name><surname>Hallacy</surname> <given-names>C</given-names></string-name>, <string-name><surname>Ramesh</surname> <given-names>A</given-names></string-name>, <string-name><surname>Goh</surname> <given-names>G</given-names></string-name>, <string-name><surname>Agarwal</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Learning transferable visual models from natural language supervision</article-title>. <comment>arXiv:2103.00020. 2021</comment>.</mixed-citation></ref>
<ref id="ref-26"><label>[26]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Manzoor</surname> <given-names>MA</given-names></string-name>, <string-name><surname>Albarri</surname> <given-names>S</given-names></string-name>, <string-name><surname>Xian</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Meng</surname> <given-names>Z</given-names></string-name>, <string-name><surname>Nakov</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liang</surname> <given-names>S</given-names></string-name></person-group>. <article-title>Multimodality representation learning: a survey on evolution, pretraining and its applications</article-title>. <source>ACM Trans Multimed Comput Commun Appl</source>. <year>2024</year>;<volume>20</volume>(<issue>3</issue>):<fpage>1</fpage>&#x2013;<lpage>34</lpage>. doi:<pub-id pub-id-type="doi">10.1145/3617833</pub-id>.</mixed-citation></ref>
<ref id="ref-27"><label>[27]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Sun</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Shao</surname> <given-names>F</given-names></string-name>, <string-name><surname>Wu</surname> <given-names>Z</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>Not only look, but also listen: learning multimodal violence detection under weak supervision</article-title>. In: <conf-name>Proceedings of the Computer Vision&#x2014;ECCV 2020; 2020 Aug 23&#x2013;28</conf-name>; <publisher-loc>Glasgow, UK</publisher-loc>. p. <fpage>322</fpage>&#x2013;<lpage>39</lpage>. doi:<pub-id pub-id-type="doi">10.1007/978-3-030-58577-8_20</pub-id>.</mixed-citation></ref>
<ref id="ref-28"><label>[28]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Gao</surname> <given-names>W</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>X</given-names></string-name>, <string-name><surname>Wang</surname> <given-names>Y</given-names></string-name>, <string-name><surname>Jing</surname> <given-names>X</given-names></string-name></person-group>. <article-title>Dual-stream attention-enhanced memory networks for video anomaly detection</article-title>. <source>Sensors</source>. <year>2025</year>;<volume>25</volume>(<issue>17</issue>):<fpage>5496</fpage>. doi:<pub-id pub-id-type="doi">10.3390/s25175496</pub-id>; <pub-id pub-id-type="pmid">40942925</pub-id></mixed-citation></ref>
<ref id="ref-29"><label>[29]</label><mixed-citation publication-type="other"><person-group person-group-type="author"><string-name><surname>Kay</surname> <given-names>W</given-names></string-name>, <string-name><surname>Carreira</surname> <given-names>J</given-names></string-name>, <string-name><surname>Simonyan</surname> <given-names>K</given-names></string-name>, <string-name><surname>Zhang</surname> <given-names>B</given-names></string-name>, <string-name><surname>Hillier</surname> <given-names>C</given-names></string-name>, <string-name><surname>Vijayanarasimhan</surname> <given-names>S</given-names></string-name>, <etal>et al</etal></person-group>. <article-title>The kinetics human action video dataset</article-title>. <comment>arXiv:1705.06950. 2017</comment>. doi:<pub-id pub-id-type="doi">10.48550/arXiv.1705.06950</pub-id>.</mixed-citation></ref>
<ref id="ref-30"><label>[30]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Carreira</surname> <given-names>J</given-names></string-name>, <string-name><surname>Zisserman</surname> <given-names>A</given-names></string-name></person-group>. <article-title>Quo vadis, action recognition? A new model and the kinetics dataset</article-title>. In: <conf-name>Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21&#x2013;26</conf-name>; <publisher-loc>Honolulu, HI, USA</publisher-loc>. p. <fpage>4724</fpage>&#x2013;<lpage>33</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2017.502</pub-id>.</mixed-citation></ref>
<ref id="ref-31"><label>[31]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Hasan</surname> <given-names>M</given-names></string-name>, <string-name><surname>Choi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Neumann</surname> <given-names>J</given-names></string-name>, <string-name><surname>Roy-Chowdhury</surname> <given-names>AK</given-names></string-name>, <string-name><surname>Davis</surname> <given-names>LS</given-names></string-name></person-group>. <article-title>Learning temporal regularity in video sequences</article-title>. In: <conf-name>Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016 Jun 27&#x2013;30</conf-name>; <publisher-loc>Las Vegas, NV, USA</publisher-loc>. p. <fpage>733</fpage>&#x2013;<lpage>42</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPR.2016.86</pub-id>.</mixed-citation></ref>
<ref id="ref-32"><label>[32]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Li</surname> <given-names>S</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>F</given-names></string-name>, <string-name><surname>Jiao</surname> <given-names>L</given-names></string-name></person-group>. <article-title>Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection</article-title>. <source>Proc AAAI Conf Artif Intell</source>. <year>2022</year>;<volume>36</volume>(<issue>2</issue>):<fpage>1395</fpage>&#x2013;<lpage>403</lpage>. doi:<pub-id pub-id-type="doi">10.1609/aaai.v36i2.20028</pub-id>.</mixed-citation></ref>
<ref id="ref-33"><label>[33]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Park</surname> <given-names>S</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>M</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>D</given-names></string-name>, <string-name><surname>Sohn</surname> <given-names>K</given-names></string-name></person-group>. <article-title>Normality guided multiple instance learning for weakly supervised video anomaly detection</article-title>. In: <conf-name>Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2023 Jan 2&#x2013;7</conf-name>; <publisher-loc>Waikoloa, HI, USA</publisher-loc>. p. <fpage>2664</fpage>&#x2013;<lpage>73</lpage>. doi:<pub-id pub-id-type="doi">10.1109/WACV56688.2023.00269</pub-id>.</mixed-citation></ref>
<ref id="ref-34"><label>[34]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Ghadiya</surname> <given-names>A</given-names></string-name>, <string-name><surname>Kar</surname> <given-names>P</given-names></string-name>, <string-name><surname>Chudasama</surname> <given-names>V</given-names></string-name>, <string-name><surname>Wasnik</surname> <given-names>P</given-names></string-name></person-group>. <article-title>Cross-modal fusion and attention mechanism for weakly supervised video anomaly detection</article-title>. In: <conf-name>Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2024 Jun 17&#x2013;18</conf-name>; <publisher-loc>Seattle, WA, USA</publisher-loc>. p. <fpage>1965</fpage>&#x2013;<lpage>74</lpage>. doi:<pub-id pub-id-type="doi">10.1109/CVPRW63382.2024.00202</pub-id>.</mixed-citation></ref>
<ref id="ref-35"><label>[35]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Biswas</surname> <given-names>D</given-names></string-name>, <string-name><surname>Tesic</surname> <given-names>J</given-names></string-name></person-group>. <article-title>MMVAD: a vision-language model for cross-domain video anomaly detection with contrastive learning and scale-adaptive frame segmentation</article-title>. <source>Expert Syst Appl</source>. <year>2025</year>;<volume>285</volume>(<issue>1</issue>):<fpage>127857</fpage>. doi:<pub-id pub-id-type="doi">10.1016/j.eswa.2025.127857</pub-id>.</mixed-citation></ref>
<ref id="ref-36"><label>[36]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Dalvi</surname> <given-names>J</given-names></string-name>, <string-name><surname>Dabouei</surname> <given-names>A</given-names></string-name>, <string-name><surname>Dhanuka</surname> <given-names>G</given-names></string-name>, <string-name><surname>Xu</surname> <given-names>M</given-names></string-name></person-group>. <article-title>Distilling aggregated knowledge for weakly-supervised video anomaly detection</article-title>. In: <conf-name>Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2025 Feb 26&#x2013;Mar 6</conf-name>; <publisher-loc>Tucson, AZ, USA</publisher-loc>. p. <fpage>5439</fpage>&#x2013;<lpage>48</lpage>. doi:<pub-id pub-id-type="doi">10.1109/WACV61041.2025.00531</pub-id>.</mixed-citation></ref>
<ref id="ref-37"><label>[37]</label><mixed-citation publication-type="conf-proc"><person-group person-group-type="author"><string-name><surname>Wang</surname> <given-names>J</given-names></string-name>, <string-name><surname>Cherian</surname> <given-names>A</given-names></string-name></person-group>. <article-title>GODS: generalized one-class discriminative subspaces for anomaly detection</article-title>. In: <conf-name>Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27&#x2013;Nov 2</conf-name>; <publisher-loc>Seoul, Republic of Korea</publisher-loc>. p. <fpage>8200</fpage>&#x2013;<lpage>10</lpage>. doi:<pub-id pub-id-type="doi">10.1109/ICCV.2019.00829</pub-id>.</mixed-citation></ref>
<ref id="ref-38"><label>[38]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Lim</surname> <given-names>J</given-names></string-name>, <string-name><surname>Lee</surname> <given-names>J</given-names></string-name>, <string-name><surname>Kim</surname> <given-names>H</given-names></string-name>, <string-name><surname>Park</surname> <given-names>E</given-names></string-name></person-group>. <article-title>ViCap-AD: video caption-based weakly supervised video anomaly detection</article-title>. <source>Mach Vis Appl</source>. <year>2025</year>;<volume>36</volume>(<issue>3</issue>):<fpage>61</fpage>. doi:<pub-id pub-id-type="doi">10.1007/s00138-025-01676-x</pub-id>.</mixed-citation></ref>
<ref id="ref-39"><label>[39]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Wu</surname> <given-names>P</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>X</given-names></string-name>, <string-name><surname>Liu</surname> <given-names>J</given-names></string-name></person-group>. <article-title>Weakly supervised audio-visual violence detection</article-title>. <source>IEEE Trans Multimed</source>. <year>2023</year>;<volume>25</volume>:<fpage>1674</fpage>&#x2013;<lpage>85</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TMM.2022.3147369</pub-id>.</mixed-citation></ref>
<ref id="ref-40"><label>[40]</label><mixed-citation publication-type="journal"><person-group person-group-type="author"><string-name><surname>Liu</surname> <given-names>T</given-names></string-name>, <string-name><surname>Lam</surname> <given-names>KM</given-names></string-name>, <string-name><surname>Bao</surname> <given-names>BK</given-names></string-name></person-group>. <article-title>Injecting text clues for improving anomalous event detection from weakly labeled videos</article-title>. <source>IEEE Trans Image Process</source>. <year>2024</year>;<volume>33</volume>(<issue>11</issue>):<fpage>5907</fpage>&#x2013;<lpage>20</lpage>. doi:<pub-id pub-id-type="doi">10.1109/TIP.2024.3477351</pub-id>; <pub-id pub-id-type="pmid">39405144</pub-id></mixed-citation></ref>
</ref-list>
</back></article>