Cross-game Slay the Spire 1 and 2 multimodal embeddings with Qwen3-VL: text-as-primary, art-as-secondary

Community Article Published May 20, 2026

Two of the six datasets in slaythespire-codex are joint text+image embeddings produced by Qwen/Qwen3-VL-Embedding-2B. One row per card across both Slay the Spire games, ~940 cards total, every vector unit-normalized and sitting in the same 1024-D space. Mechanically similar cards land near each other regardless of which game they came from, and visually similar but mechanically distant cards do not dominate the neighborhoods.

That last clause is the design problem this post is about. Out of the box, a vision-language encoder will happily decide that two cards with red painterly art are "similar." For a deckbuilder dataset, that ranking is worse than useless: it pulls the retriever away from the mechanics that actually determine whether two cards play alike. The recipe below is the set of design decisions that make a VLM treat text mechanics as primary and portrait art as a supporting cue, and the changes are smaller and more verbal than you would guess.

This is post 3 in a series. Post 1 introduced the dataset bundle and the three Gradio Spaces that sit on top of it. Post 2 walked through the prompt-to-deck algorithm that uses the text-only embeddings. This one is about the multimodal repos.

Why naive multimodal retrieval fails

The default move with two-tower VLMs (CLIP-family, BLIP, SigLIP) is to encode images on one side and text on the other and compare across them. That works for "find me images that match this caption." It does not work for "find me cards that play like this card," because:

  • Art style dominates similarity. Two cards by the same illustrator, in the same palette, will cluster regardless of mechanics. Brand-coherent art is the rule in card games, not the exception.
  • Mechanics live almost entirely in text. "Deal 9 damage. Apply 2 Weak." is the load-bearing signal. The portrait is genre flavor.
  • Caption-style instructions don't transfer. A model trained to align "a man holding a sword" with a sword image has no native concept of "Attack card with 1 cost and 9 damage that applies Weak."

So you have to pick a unified encoder that takes both modalities, then weight them yourself through how you encode and what you ask the model to optimize for. Qwen3-VL-Embedding-2B is a good fit: it ingests {text, image} payloads, supports instruction-prefixing, and emits vectors for every payload type in a single shared space, so they can be compared directly.

The recipe

Four design decisions, all small. The hardest part is the instruction string.

1. Build a structured text document, not a sentence. STS card data is more like a typed record than prose. The encoder is fed prettified JSON with intentional indentation:

{
  "name": "Heavy Blade",
  "type": "Attack",
  "rarity": "Common",
  "color": "ironclad",
  "cost": "2",
  "description": "Deal 14 damage. Strength affects ~ 3 times.",
  "description_upgraded": "Deal 14 damage. Strength affects ~ 5 times.",
  "keywords": ["strike"]
}

The card's own name is replaced with ~ inside its description text. That stops the model from anchoring on the name token; what matters is the mechanic pattern, not the name as a string. The indented-JSON recipe is borrowed from minimaxir's MTG embeddings writeup and the gain is measurable enough to be worth the extra bytes - mechanical structure becomes a feature, not noise.
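A minimal sketch of that document builder, assuming the raw card record still contains its own name inside the description text. The field list and the function name build_card_document are illustrative assumptions; the pipeline's actual MECHANICS_FIELDS may differ.

```python
import json

# Hypothetical field list; the real pipeline's MECHANICS_FIELDS may differ.
MECHANICS_FIELDS = ["name", "type", "rarity", "color", "cost",
                    "description", "description_upgraded", "keywords"]
DESCRIPTION_FIELDS = ("description", "description_upgraded")


def build_card_document(card: dict) -> str:
    """Return indented JSON with the card's own name masked as ~ in descriptions."""
    doc = {field: card[field] for field in MECHANICS_FIELDS if field in card}
    name = card.get("name", "")
    for field in DESCRIPTION_FIELDS:
        if name and isinstance(doc.get(field), str):
            doc[field] = doc[field].replace(name, "~")
    return json.dumps(doc, indent=2, ensure_ascii=False)


# Assumed raw record, before masking.
card = {
    "name": "Heavy Blade",
    "type": "Attack",
    "rarity": "Common",
    "color": "ironclad",
    "cost": "2",
    "description": "Deal 14 damage. Strength affects Heavy Blade 3 times.",
    "description_upgraded": "Deal 14 damage. Strength affects Heavy Blade 5 times.",
    "keywords": ["strike"],
}
document = build_card_document(card)
print(document)
```

The name field itself is left intact; only the description fields are masked, so the mechanic pattern is what repeats across cards rather than the name string.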

2. Pad images, don't center-crop. Card art carries identifying detail at the edges - weapons, character silhouettes, status icons. A center-crop loses that. The pipeline resizes the longest side to 512 pixels and pads the shorter side onto a neutral grey background:

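import io

from PIL import Image

PAD_SIZE = 512
PAD_COLOR = (128, 128, 128)  # neutral grey

# png_bytes: raw PNG bytes of one card portrait, loaded elsewhere in the pipeline
img = Image.open(io.BytesIO(png_bytes)).convert("RGB")
w, h = img.size
scale = PAD_SIZE / max(w, h)  # longest side becomes exactly PAD_SIZE
new_w, new_h = int(round(w * scale)), int(round(h * scale))
resized = img.resize((new_w, new_h), Image.LANCZOS)
# Center the resized portrait on a square grey canvas
canvas = Image.new("RGB", (PAD_SIZE, PAD_SIZE), PAD_COLOR)
canvas.paste(resized, ((PAD_SIZE - new_w) // 2, (PAD_SIZE - new_h) // 2))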

Grey is chosen over black or white because both extremes produce a high-contrast frame the visual tower will pick up as a feature. Mid-grey reads as "absence" instead.
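To make the resize-and-pad arithmetic concrete, here is a small worked example that reproduces only the geometry, with no image library involved. The input sizes are hypothetical and the helper name pad_geometry is an assumption for illustration.

```python
PAD_SIZE = 512


def pad_geometry(w: int, h: int, pad_size: int = PAD_SIZE):
    """Return the resized (width, height) and the paste offset on the square canvas."""
    scale = pad_size / max(w, h)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    offset = ((pad_size - new_w) // 2, (pad_size - new_h) // 2)
    return (new_w, new_h), offset


# Hypothetical landscape portrait: grey bars appear above and below.
landscape_size, landscape_offset = pad_geometry(1000, 760)
# Hypothetical tall portrait: grey bars appear left and right.
tall_size, tall_offset = pad_geometry(380, 500)

print(landscape_size, landscape_offset)
print(tall_size, tall_offset)
```

In both cases the full portrait survives, including its edges; only the amount of grey padding changes.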

3. Tell the model what counts as similar. This is the load-bearing line. The exact instruction string:

Represent this Slay the Spire card so that mechanically similar cards (same archetype, comparable damage/block patterns, related keywords) are close in embedding space, using the card's text mechanics as the primary signal and the portrait art as a secondary cue for character/class and visual archetype.

Two things matter here:

  • The instruction names what similar means in this domain. The default for any retrieval model is "things that look or read alike." For a card-game corpus, similar means "plays alike," and the model only knows that if you say so.
  • The instruction explicitly demotes the secondary modality. "Text as primary, art as secondary" is not a hint - it is a verbal contract. Drop that clause and watercolor-palette neighborhoods reappear.

Qwen3-VL-Embedding accepts instructions via the standard Instruct: ...\nText: ... prefix:

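# TASK_INSTRUCTION is the instruction string quoted above;
# payloads is the list of per-card {text, image} inputs.
prompt = f"Instruct: {TASK_INSTRUCTION}\nText: "
embeddings = model.encode(payloads, prompt=prompt, normalize_embeddings=True)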

4. Cards without art go through text-only. One card in STS1 (IMPULSE) ships without a portrait in the source files. Rather than skip it or send it through a separate encoder, the pipeline encodes the same {text} payload through Qwen3-VL with no image attached. The model handles single-modality inputs natively. This preserves the joint coordinate system: every card in the corpus lives in the same space, joinable on id. Provenance records n_with_image and n_without_image so consumers can audit.
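A rough sketch of how the payload list and the two audit counters could be assembled. The payload keys "text" and "image", the helper names, and the placeholder image value are assumptions for illustration, not the pipeline's actual code.

```python
from typing import Any, Optional


def build_payload(text: str, image: Optional[Any] = None) -> dict:
    """Attach the image only when the card has one; otherwise send text alone."""
    payload = {"text": text}
    if image is not None:
        payload["image"] = image
    return payload


def build_payloads(cards: list) -> tuple:
    """Return the payload list plus the counters the provenance record keeps."""
    payloads = [build_payload(c["text"], c.get("image")) for c in cards]
    n_with = sum("image" in p for p in payloads)
    counts = {"n_with_image": n_with, "n_without_image": len(payloads) - n_with}
    return payloads, counts


# Placeholder records: a real run would pass the JSON document and a padded image.
cards = [
    {"id": "HEAVY_BLADE", "text": "{...card JSON...}", "image": "<padded 512x512 image>"},
    {"id": "IMPULSE", "text": "{...card JSON...}"},  # no portrait in the source files
]
payloads, counts = build_payloads(cards)
print(counts)
```

Keeping both kinds of card in one list means they go through one encode call with one instruction, which is what keeps them in the same space.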

Matryoshka truncation: 2048 -> 1024

Qwen3-VL-Embedding-2B emits 2048-D vectors natively. The published dataset truncates to the first 1024 dimensions and re-normalizes:

import numpy as np
matryoshka_dim = 1024
embeddings = embeddings[:, :matryoshka_dim]                          # 2048 -> 1024
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

Why truncate at all? Two reasons:

  • Parquet on disk roughly halves. 940 cards × 1024 floats versus 940 × 2048 is a small absolute saving here, but the same recipe scales to much larger corpora where the difference stops being negligible.
  • Dim alignment with the text-only repo. The companion text-embedding dataset uses Qwen3-Embedding-0.6B at 1024-D. Truncating the multimodal repo to the same dim lets downstream code handle both repos with the same array shapes and no conditional reshapes. The alignment is one of shape only: the two encoders are different models, so a vector from one repo should not be scored directly against a vector from the other. Matryoshka-trained encoders preserve most retrieval quality under prefix truncation, so the trade is favorable. (If you're wiring this up yourself: always re-normalize after truncating, otherwise unit-norm guarantees break and emb @ query stops being cosine.)
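The effect of skipping the re-normalization step is easy to see on stand-in data. The sketch below uses random unit vectors in place of real embeddings and assumes float32 storage; the helper name truncate_and_renormalize is illustrative.

```python
import numpy as np


def truncate_and_renormalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` columns, then rescale each row back to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms


# Random stand-ins for 940 unit-normalized 2048-D vectors (not real card embeddings).
rng = np.random.default_rng(0)
full = rng.normal(size=(940, 2048)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

prefix_norms = np.linalg.norm(full[:, :1024], axis=1)  # truncated only: no longer unit norm
small = truncate_and_renormalize(full, 1024)
final_norms = np.linalg.norm(small, axis=1)            # truncated and re-normalized

bytes_full = full.nbytes    # raw array bytes before truncation
bytes_small = small.nbytes  # raw array bytes after truncation

print(prefix_norms.max(), final_norms.min(), final_norms.max())
print(bytes_full, bytes_small)
```

The raw array size halves exactly; the Parquet file on disk will differ somewhat because of compression and metadata.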

Cross-game coordinate system

Both games go through the same model and the same instruction string. That single discipline is what makes STS1 and STS2 vectors comparable. Drop the instruction for one game's run, or swap the model version mid-pipeline, and you have two separate coordinate systems wearing the same column name. The Archetype Map Space takes advantage of this: it plots both games on a single UMAP and lets you ask which STS2 card is closest to Heavy Blade from STS1, which only means something because the vectors share a space.
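Because every vector is unit-normalized and shares one space, a cross-game lookup reduces to a matrix-vector product. The following is a toy sketch with made-up 3-D vectors and hypothetical card ids, not the Archetype Map Space's actual code.

```python
import numpy as np


def nearest_in_other_game(query_vec, other_matrix, other_ids, k=2):
    """Rank the other game's cards by cosine similarity (dot product of unit vectors)."""
    scores = other_matrix @ query_vec
    top = np.argsort(-scores)[:k]
    return [(other_ids[i], float(scores[i])) for i in top]


# Toy unit vectors standing in for STS2 card embeddings; ids are hypothetical.
sts2_ids = ["CARD_A", "CARD_B", "CARD_C"]
sts2_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy unit vector standing in for one STS1 card (for example, Heavy Blade).
sts1_query = np.array([0.8, 0.6, 0.0])

neighbors = nearest_in_other_game(sts1_query, sts2_matrix, sts2_ids)
print(neighbors)
```

The ranking is only meaningful when both matrices came from the same model, instruction string, and truncation step.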

Provenance pins this down. Every multimodal embedding write records model_id, task_instruction, embedding_dim, matryoshka_dim, and image_preprocessing ("rgb-resize-pad-512x512-grey"). Re-running with any of these changed produces a new repo commit, not a silent rotation.
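One way a consumer could detect a silent rotation is to fingerprint the provenance fields and compare across runs. This is a sketch under assumptions: the field names come from the text above, but the numeric values, the abbreviated instruction, the alternative preprocessing label, and the hashing scheme are illustrative and not part of the published pipeline.

```python
import hashlib
import json


def provenance_fingerprint(record: dict) -> str:
    """Hash the provenance fields in a key-order-independent way."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


provenance = {
    "model_id": "Qwen/Qwen3-VL-Embedding-2B",
    "task_instruction": "Represent this Slay the Spire card ...",  # abbreviated here
    "embedding_dim": 2048,     # assumed value: native width
    "matryoshka_dim": 1024,    # assumed value: published width
    "image_preprocessing": "rgb-resize-pad-512x512-grey",
}

baseline = provenance_fingerprint(provenance)

# Hypothetical alternative preprocessing label, used only to show the hash moving.
changed = dict(provenance, image_preprocessing="rgb-center-crop-512x512")
print(baseline == provenance_fingerprint(changed))
```

If two embedding files carry different fingerprints, their vectors should be treated as living in different coordinate systems.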

What this isn't

It isn't a fine-tune. The model is frozen Qwen3-VL-Embedding-2B; no weights are updated, so the only learned parameters are the ones the encoder shipped with. All of the domain shaping is verbal (instruction string) and structural (JSON shape, pad recipe).

It isn't a CLIP-style cross-modal retriever. Image-to-image and text-to-card cross-retrieval do work, but the system isn't tuned to optimize for them; it's tuned for card-to-card mechanical similarity with art as a tiebreaker.

It also isn't a free pass on subjectivity. "Mechanically similar" is itself a judgment - is Bash (Attack + Vulnerable) more similar to Pummel (multi-hit Attack) or to Inflame (a Power that makes Attacks bigger)? The instruction string nudges the model toward one answer, but reasonable people would draw the boundary elsewhere. The dataset is opinionated; the dataset card surfaces the opinion explicitly.

Generalizing

The recipe transfers to any structured corpus where a primary modality should dominate similarity:

  • Product catalogs. Specs + reviews are the primary signal, packaging photo is the secondary cue. Instruction: "compare on technical capabilities first, visual presentation second."
  • Real estate listings. Listing text + room counts are primary; staged photos are secondary. Same pattern.
  • Trading-card games beyond STS. MTG, Hearthstone, Lorcana all fit. Replace the JSON schema and the instruction string; everything else carries.

The two pieces you have to provide per domain are (a) the JSON shape (which fields the encoder gets to see), and (b) the instruction string that names what "similar" should mean. Everything else, including the image pad recipe and the Matryoshka step, is portable.
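Those two per-domain pieces can be bundled into a small config object. The sketch below is a hypothetical illustration: the DomainRecipe class, the product-catalog field names, and the sample record are all invented for the example, and only the "Instruct: ...\nText: " prefix shape comes from the recipe above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRecipe:
    fields: tuple      # (a) which record fields the encoder gets to see
    instruction: str   # (b) what "similar" should mean in this domain

    def document(self, record: dict) -> str:
        """Indented JSON restricted to the whitelisted fields."""
        visible = {f: record[f] for f in self.fields if f in record}
        return json.dumps(visible, indent=2, ensure_ascii=False)

    def prompt(self) -> str:
        """Instruction prefix in the same shape used for the card embeddings."""
        return f"Instruct: {self.instruction}\nText: "


# Hypothetical product-catalog recipe.
catalog = DomainRecipe(
    fields=("title", "specs", "review_summary"),
    instruction="Compare on technical capabilities first, visual presentation second.",
)
record = {"title": "USB-C hub", "specs": {"ports": 7},
          "review_summary": "Runs warm.", "sku": "X1"}
doc = catalog.document(record)
prefix = catalog.prompt()
print(prefix + doc)
```

Fields left out of the whitelist, such as the sku here, never reach the encoder and so cannot influence similarity.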

Try it / fork it

The two repos are t22000t/slay-the-spire-1-card-multimodal-embeddings and t22000t/slay-the-spire-2-card-multimodal-embeddings. Pipeline source for the encoder is at src/sts_cards/multimodal_embed.py; the text-document builder is at src/sts_cards/normalize.py. Clone the repo, point the loader at your own corpus, rewrite MECHANICS_FIELDS and the task instruction for your domain, and the pipeline ports over.

That closes the three-post series. Post 1 was the index, post 2 was the applied-retrieval story most readers can transfer to their own domains, and post 3 was the narrowest but deepest. If you want a single entry point, the collection holds all six datasets plus the three Spaces.
