Compared with the widely studied Human-Object Interaction DETection (HOI-DET), no effort has been devoted to its inverse problem, i.e. to generate an HOI scene image according to the given relationship triplet , to our best knowledge. We term this new task “Human-Object Interaction Image Generation” (HOI-IG). HOI-IG is a research-worthy task with great application prospects, such as online shopping, film production and interactive entertainment. In this work, we introduce an InteractGAN to solve this challenging task. Our method is composed of two stages: (1) manipulating the posture of a given human image conditioned on a predicate. (2) merging the transformed human image and object image to one realistic scene image while satisfying their expected relative position and ratio. Besides, to address the large spatial misalignment issue caused by fusing two images content with reasonable spatial layout, we propose a Relation-based Spatial Transformer Network (RSTN) to adaptively process the images conditioned on their interaction. Extensive experiments on two challenging datasets demonstrate the effectiveness and superiority of our approach. We advocate for the image generation community to draw more attention to the new Human-Object Interaction Image Generation problem.


Our dataset is available at Baidu Drive (pwd: od1a) and


Our paper: InteractGAN


Still in construction


Still in construction