Attention of a Kiss

Adam Cole
Video, series of three, stereo sound, ESP32 board, wire
2026

Selected Publications
ACM Creativity and Cognition XAIxArts Workshop, 2025
Springer XAIxArts Book Chapter, 2026 (in review)

Our work is a dialogue between the tool and the image, so we would not preconceive an image... We would rather make a tool and dialogue with it.
— Woody Vasulka, 1971

Attention of a Kiss is a series of experimental video studies that repurpose the internal mechanics of a generative AI video model as raw artistic material. The project uses a novel technique to capture the attention maps of a video diffusion transformer, exploring the iconic image of a kiss in single-channel, dual-channel, and multi-channel formats. This approach not only exposes the underlying technical reality of the system, but also harnesses that behavior for original creative purposes. By sequencing these maps from their most diffuse stages to their most coherent forms, the work turns the AI's internal diffusion process into both an explanatory and a narrative device.

In most commercial AI tools, the internal mechanics of the model are kept invisible, restricting artists to text prompts and pixel outputs. By intervening in the black-box system of diffusion models, this custom tool opens a path to aesthetic possibilities beyond the model's intended domain. Specifically, the pipeline intercepts the generation process of an open-source video model to extract cross-attention maps: the internal weights that indicate how strongly each region of the visual output relates to a specific word in the prompt. These values are usually calculated in a fraction of a second and immediately discarded; here, they are stored at each diffusion step and visualized as dynamic heatmaps, making the model's internal process tangible.
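As a rough illustration of what a cross-attention map is, the sketch below computes one for a toy example in plain Python. It is not the project's pipeline: the function names (`cross_attention_map`, `to_grayscale`), the 2 x 2 grid, and the query and key vectors are all hypothetical, and a real video model would work on far larger tensors, many attention heads, and many layers.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(queries, keys, token_index, height, width):
    """Return a height x width grid of attention weights for one prompt token.

    queries: one vector per spatial position (height * width vectors)
    keys:    one vector per prompt token
    """
    assert len(queries) == height * width
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # Scaled dot product between this position and every prompt token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        # Keep only the weight assigned to the token of interest.
        weights.append(softmax(scores)[token_index])
    return [weights[r * width:(r + 1) * width] for r in range(height)]


def to_grayscale(grid):
    """Stretch a grid of weights to 0-255 so it can be shown as a heatmap."""
    lo = min(min(row) for row in grid)
    hi = max(max(row) for row in grid)
    span = (hi - lo) or 1.0
    return [[round(255 * (v - lo) / span) for v in row] for row in grid]


# Toy data (hypothetical): four spatial positions, two prompt tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
keys = [[2.0, 0.0], [0.0, 2.0]]  # token 0 stands in for the word of interest

heatmap = cross_attention_map(queries, keys, token_index=0, height=2, width=2)
frame = to_grayscale(heatmap)
```

Repeating this at every diffusion step and keeping each `frame` would give a sequence of heatmaps over the generation process. In a framework such as PyTorch, the queries and keys are commonly captured with forward hooks on the attention layers, since optimized attention kernels often never materialize the weights themselves.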

This project sits at the intersection of Explainable AI for the Arts (XAIxArts) and the larger tradition of experimental video art. Just as early video artists, such as Nam June Paik and the Vasulkas, built their own tools to understand and subvert the signal-based logic of analog video, this work interrogates the opaque architecture of AI. By treating the neural network itself as a malleable medium rather than just an image generator, it cultivates a material intimacy with the system. Ultimately, these artworks emerge not only from what is seen, but from exposing how the network sees.

Attention of a Kiss (video)

The video study Attention of a Kiss visualizes the evolving attention map of the token "kiss" across the generation timeline. The video begins in total abstraction within a wash of latent noise, gradually gains structure, and then dissolves back into noise. This arc parallels both the diffusion process and the formation (and inevitable dissolution) of emotional intimacy. The alignment between the model's construction of meaning and human interpretive processes suggests new narrative forms grounded specifically in the mechanics of AI diffusion.

Multi-Head Attention

Eight kiss scenes crystallize into focus within a strip of latent noise, only to disintegrate back into abstraction. This crescendo, in which structure gradually materializes from noise, peaks, and then rapidly dissolves, again parallels the diffusion process and the fragile formation of intimacy.

While the final outputs echo familiar cinematic conventions of romance, the mid-process visualizations carry far greater ambiguity, especially in their disintegration. Familiar identity and gender markers are nearly invisible in this abstracted view, particularly at earlier diffusion steps. What remains is the iconic image of the kiss pared down to a more mythical ideal and choreography, both physical and emotional. Here, the inherent noise of the diffusion model offers a novel way to interpret intimacy beyond photorealistic representation, where the model's construction of form aligns with human perceptual experience.

Diffusion /\/\ Collision

In this diptych, we pair two scenes of vastly different scales: on the left, the iconic kiss, and on the right, a cosmic collision between planets (with the attention colors inverted). To generate each pair, we use nearly identical generation settings (shared seed, resolution, and guidance), varying only the keyword tokens, so the two videos share a compositional skeleton even as their semantic content diverges. This correspondence produces overlaps in how the scenes develop, in both composition and action. Early in the process, the ambiguous forms emerging in each scene, and the rhythm of their impact, look alike. Near the end, the distinction between the two scales and subjects becomes clearer. Unfolding across the diffusion process, the pair becomes a metaphor for the explosive collision of bodies, spanning from the personal to the galactic.
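The shared-seed idea can be sketched in a few lines. The example below only illustrates why a common seed yields a common starting point; it is not the generation code used for the diptych. The helper names (`initial_latent`, `make_job`), the prompts, and every numeric value are hypothetical, and Python's `random` module stands in for the model's own noise sampler.

```python
import random


def initial_latent(seed, frames, height, width):
    """Draw Gaussian starting noise; the same seed always gives the same noise."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(frames)
    ]


def make_job(prompt, seed, resolution, guidance):
    """Bundle one generation request with the noise it would start from."""
    frames, height, width = resolution
    return {
        "prompt": prompt,
        "guidance": guidance,
        "latent": initial_latent(seed, frames, height, width),
    }


# Settings shared by both halves of the pair (all values are placeholders).
shared = dict(seed=1234, resolution=(2, 4, 4), guidance=6.0)

kiss = make_job("two people kiss", **shared)
collision = make_job("two planets collide", **shared)

# Both jobs start from identical noise; only the prompt differs.
same_start = kiss["latent"] == collision["latent"]
```

Because both jobs begin from the same noise and differ only in their text conditioning, the large-scale layout that forms in the first denoising steps tends to be similar, which is one plausible reading of the shared compositional skeleton described above.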

Crash Me, Gently

Vertigo Vertigo