Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.Details the Q-Former architecture: a 12-layer BERT-based model using 32 learnable query embeddings. These queries use cross-attention to extract visual information for MLLM input.

Visual Prompt Generation: Cross-Attention in Q-Former

Abstract and 1 Introduction

  1. Related Work

    2.1. Multimodal Learning

    2.2. Multiple Instance Learning

  2. Methodology

    3.1. Preliminaries and Notations

    3.2. Relations between Attention-based VPG and MIL

    3.3. MIVPG for Multiple Visual Inputs

    3.4. Unveiling Instance Correlation in MIVPG for Enhanced Multi-instance Scenarios

  3. Experiments and 4.1. General Setup

    4.2. Scenario 1: Samples with Single Image

    4.3. Scenario 2: Samples with Multiple Images, with Each Image as a General Embedding

    4.4. Scenario 3: Samples with Multiple Images, with Each Image Having Multiple Patches to be Considered and 4.5. Case Study

  4. Conclusion and References

\ Supplementary Material

A. Detailed Architecture of QFormer

B. Proof of Proposition

C. More Experiments

\ Figure 7. Overview of QFormer

A. Detailed Architecture of QFormer

The architecture overview is depicted in Figure 7. Specifically, QFormer is initialized as a BERT-based model[8] comprising a total of L = 12 layers. In contrast to typical BERT models that process textual inputs, QFormer takes R = 32 learnable query embeddings as inputs. These embeddings are utilized to extract visual information from the input visual data during Stage-1 pretraining in BLIP2[22]. Subsequently, they serve as visual prompt embeddings for the LLM inputs after projection.

\ Inside the QFormer, each layer includes a self-attention module composed of a Multi-Head Attention component and a Forward module (consisting of Linear, LayerNorm, and Residual Connection). The cross-attention module, initialized with random values, is inserted every G layers, where learnable query embeddings interact with visual embeddings. In the main paper, for the sake of conciseness, we condensed the representation of the multi-head attention and forward modules into self(cross) attention modules. Furthermore, we exclusively illustrated the modifications made to the cross-attention module in MIVPG, as the self-attention modules remain unchanged. The final QFormer output is represented by the last layer’s query embeddings.

\ For a more comprehensive understanding, readers are encouraged to refer to [22].

\

:::info Authors:

(1) Wenliang Zhong, The University of Texas at Arlington (wxz9204@mavs.uta.edu);

(2) Wenyi Wu, Amazon (wenyiwu@amazon.com);

(3) Qi Li, Amazon (qlimz@amazon.com);

(4) Rob Barton, Amazon (rab@amazon.com);

(5) Boxin Du, Amazon (boxin@amazon.com);

(6) Shioulin Sam, Amazon (shioulin@amazon.com);

(7) Karim Bouyarmane, Amazon (bouykari@amazon.com);

(8) Ismail Tutar, Amazon (ismailt@amazon.com);

(9) Junzhou Huang, The University of Texas at Arlington (jzhuang@uta.edu).

:::


:::info This paper is available on arxiv under CC by 4.0 Deed (Attribution 4.0 International) license.

:::

\

Market Opportunity
Prompt Logo
Prompt Price(PROMPT)
$0.06574
$0.06574$0.06574
+3.00%
USD
Prompt (PROMPT) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

EUR/CHF slides as Euro struggles post-inflation data

EUR/CHF slides as Euro struggles post-inflation data

The post EUR/CHF slides as Euro struggles post-inflation data appeared on BitcoinEthereumNews.com. EUR/CHF weakens for a second straight session as the euro struggles to recover post-Eurozone inflation data. Eurozone core inflation steady at 2.3%, headline CPI eases to 2.0% in August. SNB maintains a flexible policy outlook ahead of its September 25 decision, with no immediate need for easing. The Euro (EUR) trades under pressure against the Swiss Franc (CHF) on Wednesday, with EUR/CHF extending losses for the second straight session as the common currency struggles to gain traction following Eurozone inflation data. At the time of writing, the cross is trading around 0.9320 during the American session. The latest inflation data from Eurostat showed that Eurozone price growth remained broadly stable in August, reinforcing the European Central Bank’s (ECB) cautious stance on monetary policy. The Core Harmonized Index of Consumer Prices (HICP), which excludes volatile items such as food and energy, rose 2.3% YoY, in line with both forecasts and the previous month’s reading. On a monthly basis, core inflation increased by 0.3%, unchanged from July, highlighting persistent underlying price pressures in the bloc. Meanwhile, headline inflation eased to 2.0% YoY in August, down from 2.1% in July and slightly below expectations. On a monthly basis, prices rose just 0.1%, missing forecasts for a 0.2% increase and decelerating from July’s 0.2% rise. The inflation release follows last week’s ECB policy decision, where the central bank kept all three key interest rates unchanged and signaled that policy is likely at its terminal level. While officials acknowledged progress in bringing inflation down, they reiterated a cautious, data-dependent approach going forward, emphasizing the need to maintain restrictive conditions for an extended period to ensure price stability. On the Swiss side, disinflation appears to be deepening. The Producer and Import Price Index dropped 0.6% in August, marking a sharp 1.8% annual decline. Broader inflation remains…
Share
BitcoinEthereumNews2025/09/18 03:08
Zero Knowledge Proof (ZKP) vs DOGE, SHIB, and PEPE: Good Crypto to Buy Now for Structure-Driven Gains

Zero Knowledge Proof (ZKP) vs DOGE, SHIB, and PEPE: Good Crypto to Buy Now for Structure-Driven Gains

In crypto, most gains don’t come when a chart is trending; they come before it. Real returns are usually locked in through smart entry, not loud exit points. That
Share
LiveBitcoinNews2026/01/16 08:00
XRP Could Explode as XRPL Targets Weak Links and Long-Trapped Liquidity

XRP Could Explode as XRPL Targets Weak Links and Long-Trapped Liquidity

The post XRP Could Explode as XRPL Targets Weak Links and Long-Trapped Liquidity appeared on BitcoinEthereumNews.com. XRP optimism is rebounding as long-term builders
Share
BitcoinEthereumNews2026/01/16 08:37