As sentient beings, our consciousness is supported by all five of our senses working together, creating a holistic impression of the world in which the whole is greater than the sum of its parts. Hence it is not surprising that the emotional impact of an image on our mind is heightened when it is combined with music, or that the impact of the written word is amplified when it is overlaid on an image.
The question is whether we can quantify this seemingly subjective change in perception caused by mixing different media. There has already been considerable research in perceptual psychology, advertising, and information technology on quantifying this interaction between the visual and auditory sensory modalities. It has been demonstrated that these interactions can be quantified using features of the medium, such as RGB values, HSI values, and traverse lines for images, and volume, pitch, and timbre for sound (1, 2), and that by conducting simple experiments a model can be built which predicts the effect of music on the emotional impression of an image. The experiments cited in that research were conducted by first showing subjects only the images, then showing the same images with different types of background music added, and measuring the change in the impression each image made. However, given the limited number of subjects, the resulting model does have its limitations.
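To make this concrete, here is a minimal sketch of the kind of low-level features such a model might consume. It assumes Pillow, NumPy, and librosa are available; the function names and the particular summary statistics are my own illustrative choices, not the ones used in the cited studies.

```python
import numpy as np
from PIL import Image
import librosa

def image_features(path):
    """Mean RGB and HSI-style summary statistics for an image."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float64) / 255.0
    intensity = img.mean(axis=2)                        # I in the HSI model
    saturation = 1 - img.min(axis=2) / np.maximum(intensity, 1e-8)
    return {
        "mean_rgb": img.reshape(-1, 3).mean(axis=0),
        "mean_intensity": intensity.mean(),
        "mean_saturation": saturation.mean(),
    }

def audio_features(path):
    """Rough volume, pitch, and timbre descriptors for an audio track."""
    y, sr = librosa.load(path, mono=True)
    rms = librosa.feature.rms(y=y).mean()                            # volume proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()  # pitch/brightness proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # timbre proxy
    return {"rms": rms, "spectral_centroid": centroid, "mfcc": mfcc}
```

Feature vectors like these could then be paired with the before-and-after impression ratings from the experiments to train a simple regression model.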
Today, much of the audio and visual information we consume is fed to us via social media. Our daily lives are inseparably tied to sources such as Facebook, Twitter, Pinterest, Spotify, SoundCloud, Pandora, Instagram, and Flickr, which we rely on for both information and entertainment. By leveraging the existing research, a machine learning (ML) model can be built that, given these different sources of content, creates combinations which achieve consonance: a presentation in which the audio, visual, and text elements complement each other and enhance the overall impact. Collective intelligence and a feedback loop would improve the model over time, addressing a limitation of the earlier research. Moreover, such a blended stream of curated content would create a unique entertainment experience, adjusting to the user's mood, the time of day, and the weather, and it could be explicitly seeded with a theme such as 'nostalgia', 'relaxation', or 'party'.
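As an illustration of what "consonance" could mean computationally, here is a hypothetical sketch that scores image/track pairs by the cosine similarity of their emotion vectors and nudges scoring weights with user feedback. The emotion axes, the similarity measure, and the feedback rule are all assumptions of mine for the sake of the example, not part of the cited work.

```python
import numpy as np

# Hypothetical emotion axes; a real system might use a valence/arousal model.
EMOTION_DIMS = ["calm", "joyful", "nostalgic", "energetic"]

def consonance(image_emotion: np.ndarray, music_emotion: np.ndarray) -> float:
    """Cosine similarity between the emotion vectors of two media items."""
    den = np.linalg.norm(image_emotion) * np.linalg.norm(music_emotion)
    return float(np.dot(image_emotion, music_emotion)) / den if den else 0.0

def best_track(image_emotion, tracks):
    """Pick the track whose emotion vector best complements the image."""
    return max(tracks, key=lambda t: consonance(image_emotion, t["emotion"]))

def feedback_update(weights, features, rating, lr=0.01):
    """Toy feedback loop: nudge scoring weights toward liked pairings.
    rating is +1 (liked) or -1 (disliked)."""
    return weights + lr * rating * features

# Usage sketch with made-up emotion vectors.
image = np.array([0.8, 0.1, 0.7, 0.1])  # a calm, nostalgic photo
tracks = [
    {"name": "lofi_piano", "emotion": np.array([0.9, 0.2, 0.6, 0.1])},
    {"name": "club_anthem", "emotion": np.array([0.1, 0.7, 0.1, 0.9])},
]
print(best_track(image, tracks)["name"])  # -> "lofi_piano"
```

Aggregating such feedback across many users is where the collective intelligence mentioned above would come in.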
Perhaps an ML-based system can be built on the research that already exists in this field. The system would provide a service that delivers content in a way that minimizes clutter and produces an ambient symphony of photos, music, and all types of social media content.
Here is my attempt at creating such a model: