
Attention of a Kiss

Adam Cole
Video, series of three, stereo sound, ESP32 board, wire
2026

Selected Publications
ACM Creativity and Cognition XAIxArts Workshop, 2025
Springer XAIxArts Book Chapter, 2026 (in review)

Our work is a dialogue between the tool and the image, so we would not preconceive an image... We would rather make a tool and dialogue with it.

— Woody Vasulka, 1971

Attention of a Kiss is a series of experimental video studies that repurpose the internal mechanics of a generative AI video model as raw artistic material. The project uses a novel technique to capture the attention maps of a video diffusion transformer, exploring the iconic image of a kiss across single-channel, dual-channel, and multi-channel orientations. This approach not only exposes the underlying technical reality of the system but also harnesses that behavior for original creative purposes. By compiling these maps, from their most diffuse stages to their most coherent forms, the work turns the AI's internal diffusion process into both an explanatory and a narrative device.

In most commercial AI tools, the internal mechanics of the model are kept invisible, restricting artists to text prompts and pixel outputs. By intervening in the black-box system of diffusion models, this custom tool opens the path to new aesthetic possibilities beyond the model's intended domain. Specifically, this pipeline intercepts the generation process of an open-source video model to extract cross-attention maps: the internal calculations that determine how a specific word corresponds to localized regions within a visual output. These values are usually calculated in a fraction of a second and immediately flushed from memory; here, they are visualized as dynamic heatmaps and stored across the diffusion steps, making the model's internal process tangible.
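As a rough illustration of what a cross-attention map is, the following minimal sketch computes one for a single token and stores it at each diffusion step. It is not the project's actual pipeline: the function name cross_attention_map, the array shapes, and the random stand-in data are all assumptions made for illustration, and it handles one frame and one attention head rather than a full video model.

```python
import numpy as np


def cross_attention_map(queries, keys, token_index, height, width):
    """Return a [0, 1] heatmap of how strongly each spatial position attends to one token.

    queries: (height * width, dim) array, one row per spatial position (hypothetical shape).
    keys:    (num_tokens, dim) array, one row per text token (hypothetical shape).
    """
    dim = queries.shape[-1]
    # Scaled dot-product scores between every spatial position and every token.
    scores = queries @ keys.T / np.sqrt(dim)
    # Softmax over the token axis, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Keep the column for the chosen token and fold it back into a 2D grid.
    token_map = weights[:, token_index].reshape(height, width)
    # Min-max normalize so the map can be rendered as a heatmap.
    span = token_map.max() - token_map.min()
    return (token_map - token_map.min()) / (span + 1e-8)


# Stand-in data: random arrays take the place of real model activations.
rng = np.random.default_rng(0)
height, width, dim, num_tokens = 4, 6, 8, 5
keys = rng.normal(size=(num_tokens, dim))

heatmaps = []
for step in range(3):  # one map per (simulated) diffusion step
    queries = rng.normal(size=(height * width, dim))
    heatmaps.append(cross_attention_map(queries, keys, 2, height, width))

heatmaps = np.stack(heatmaps)  # shape: (steps, height, width)
```

In a real model, the queries and keys would be captured from inside an attention layer during generation instead of being sampled at random, and the stacked maps would be colorized and assembled into video frames.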

This project sits at the intersection of Explainable AI for the Arts (XAIxArts) and the larger tradition of experimental video art. Just as early video artists, such as Nam June Paik and the Vasulkas, built their own tools to understand and subvert the signal-based logic of analog video, this work interrogates the opaque architecture of AI. By treating the neural network itself as a malleable medium rather than just an image generator, it cultivates a material intimacy with the system. Ultimately, these artworks emerge not only from what is seen, but from exposing how the network sees.

Attention of a Kiss (video)

The video study Attention of a Kiss visualizes the evolving attention map of the specific token "kiss" across the generation timeline. The video begins in total abstraction, a wash of latent noise, and gradually gains structure. This progression parallels both the diffusion process and the development (and inevitable dissolution) of emotional intimacy. The alignment between the model's construction of meaning and human perceptual experience suggests new narrative forms grounded specifically in AI diffusion mechanics.

Multi-Head Attention

In this study, we extract the attention maps for the token "cinematic" to visualize eight distinct kiss scenes as they develop over diffusion steps. While the output videos often echo familiar cinematic conventions of romance, the attention maps contain much more ambiguity. Conventional identity and gender markers are all but invisible in this abstracted view, especially at lower diffusion steps. Instead, we see the iconic image of the kiss reduced to a more mythical ideal. Repeated across scenes, the maps reveal the flexibility of the pose and the underlying consistency of its choreography, physical and emotional alike. Here, the ambiguity of the attention maps (as opposed to the specificity of the pixel-space outputs) offers a novel way to interpret the kiss beyond photorealistic representations.

Diffusion /\/\ Collision

In this diptych, we pair two scenes of vastly different scales: on the left, the iconic kiss, and on the right, a cosmic collision between planets (with the attention colors inverted). To generate these pairs, we use identical generation settings (seed, resolution, guidance, etc.) and nearly identical text prompts that differ only in their keywords. This correspondence leads to overlaps in the development of the scenes, both in composition and action. At low diffusion steps, both videos share the same ambiguous emerging forms and the same rhythm of impact. At higher diffusion steps, the distinction between the two scales and subject matters becomes clearer. Developing across the diffusion process, the two scenes form a metaphor for the explosive collision of bodies, spanning from the personal to the galactic.
