Through the COSMOROE Search Engine, one can navigate through examples of image-language associations in TV travel series. In particular, one can find examples from two series, both of which have been translated into both Greek and English for the purposes of this service:


All annotation/analysis related to these examples belongs to CSRI and is licensed under a Creative Commons Non-Commercial ShareAlike license. The annotation files and a re-packaging of their content for different research tasks, along with related software, are available at: CMR Downloads.

Ownership of the copyright of each video segment or static frame presented in each example remains with the original owners; this material is included here for illustration purposes only. See full details here. The analyses presented in these examples are the researchers' own and do not reflect the views of the video copyright owners.

Why TV Travel Series

TV travel series usually involve one or more presenters visiting different places, interacting with locals, interviewing people, and explaining the habits, traditions and way of life of specific locations. They are highly interactive and adventurous. Language is used to refer to a wide range of concrete and abstract concepts, from tangible things directly depicted in the programme to more abstract notions, as is the case in everyday interaction. Specific terms mix with everyday colloquial language, and there are no strict restrictions on the vocabulary used, the length of the descriptions or the visual modalities.

Therefore, these audiovisual files include a variety of language modalities (speech, and text: subtitles, scene text, graphic text, etc.) and visual modalities (natural image sequences/filming, and graphics such as maps) that depict not only objects/entities but also gestures (e.g. deictic, emblems, iconic, metaphoric) and other body movements. In many cases, the files contain section titles, i.e. captioned frames, in which one may observe modality interaction between an image and its caption, as with a static photograph and its accompanying caption. We have therefore selected these TV travel series for cross-media semantic annotation due to the richness of the interacting modalities available in this genre.

Some Statistics

Relations and Arguments

The total number of multi-modal relations annotated in the two travel series and the number of textual and visual arguments participating in them are presented graphically in the following links:

Relations Arguments

The tables that follow present the same information as before, in a more compact form, focusing mainly on the total numbers per relation type and argument type respectively.

Relation Type Count

Argument Type Count

Language - Visual Element Pairs

The COSMOROE annotated data corpus comprises language - visual element pairs directly associated through specific relation types (the COSMOROE relations). The following table presents:

  1. Unique counts of these pairs versus instances of these pairs: in the former case, the unique tags/categories* denoted by the elements of the pair are counted; in the latter, every occurrence of the pair is counted. For example, ["play basketball"-"make your own fun"] is a pair counted only once as a category, no matter how many times it appears in the files; the total number of appearances is captured by the pair-instance count.
  2. Counts of these pairs that depend on the relation type through which they are associated versus counts that are independent of it. For example, ["play basketball"-"make your own fun"] is counted once when associated with an action-event relation, but twice if the same elements were also linked through another type of relation.
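The two counting schemes above can be sketched with sets over annotation triples. This is a minimal illustration with hypothetical data, not the actual annotation format; labels are assumed to be already lemmatised, since unique categories are counted over canonical forms.

```python
# Hypothetical annotation instances: (language element, visual element, relation type).
instances = [
    ("play basketball", "make your own fun", "action-event"),
    ("play basketball", "make your own fun", "action-event"),
    ("play basketball", "make your own fun", "metonymy"),
]

# Relation-independent categories: a pair counts once regardless of relation type.
independent_categories = {(lang, vis) for lang, vis, _ in instances}

# Relation-dependent categories: the same pair counts once per relation type.
dependent_categories = {(lang, vis, rel) for lang, vis, rel in instances}

print(len(independent_categories))  # 1
print(len(dependent_categories))    # 2
print(len(instances))               # 3
```

This is why the relation-dependent category count (936) can exceed the relation-independent one (916) while the instance count (1540) stays the same.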
Direct Relations
Language - Visual Element Pair Categories Independent of the Relation
Categories Instances
916 1540
Language - Visual Element Pair Categories Dependent on the Relation
Categories Instances
936 1540

The COSMOROE corpus also comprises a number of indirectly associated language-visual element pairs. In such cases, the close-in-time elements of the pair are associated through a number of inferred relations, i.e. a semantic association path that passes through inferred concepts which are not present in the multimodal discourse.

An indirect association comprises a chain of inferred relations. The first and last inferred relations in the chain link the verbal or visual argument (respectively) to an inferred conceptual argument (i.e. a concept that is not present in the multimedia discourse). Any in-between inferred relations involve only conceptual arguments. The chain is built up recursively: each inferred relation shares an argument with the next one, and it is the conceptual arguments that drive this process.

Thus, in the case of indirect associations, two physically present and close-in-time elements (a language and a visual element) are associated, through inferred relations, with concepts (which could have been realised physically through any modality). An example of such an indirect association, along with its inferred relations, is shown below:

Indirectly Associated Language - Visual Element Pair:
"Pizza" (language element) - "dough" (visual element)
Inferred Relation 1:
"dough" (visual element) object of action "dough mixing" (conceptual element)
Inferred Relation 2:
"dough mixing" (conceptual element) step for event "pizza making" (conceptual element)
Inferred Relation 3:
"Pizza" (language element) result for event "pizza making" (conceptual element)
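The chain structure of this example can be sketched as a list of (argument, relation, argument) triples, where consecutive relations are glued together by a shared conceptual argument. The representation below is an illustrative sketch, not the actual annotation schema:

```python
# Each argument is (label, modality); "concept" marks an inferred argument
# that is not physically present in the multimodal discourse.
chain = [
    (("dough", "visual"), "object of action", ("dough mixing", "concept")),
    (("dough mixing", "concept"), "step for event", ("pizza making", "concept")),
    (("Pizza", "language"), "result for event", ("pizza making", "concept")),
]

def shares_argument(rel_a, rel_b):
    """True if two inferred relations have an argument in common."""
    return bool({rel_a[0], rel_a[2]} & {rel_b[0], rel_b[2]})

# Consecutive relations share an argument; this is what makes the
# inferred relations form a chain linking "Pizza" to "dough".
assert all(shares_argument(a, b) for a, b in zip(chain, chain[1:]))
```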

The next two tables present the counts of indirectly associated pairs and those of the inferred pairs respectively.

Indirect Associations
Language - Visual Element Pair Categories Independent of the Relation
Categories Instances
65 68
Language - Visual Element Pair Categories Dependent on the Relation
Categories Instances
65 68

Inferred Relations
Language - Concept Element Pair Categories Independent of the Relation
Categories Instances
32 38
Language - Concept Element Pair Categories Dependent on the Relation
Categories Instances
32 38
Visual - Concept Element Pair Categories Independent of the Relation
Categories Instances
46 102
Visual - Concept Element Pair Categories Dependent on the Relation
Categories Instances
46 102
Concept - Concept Element Pair Categories Independent of the Relation
Categories Instances
4 4
Concept - Concept Element Pair Categories Dependent on the Relation
Categories Instances
4 4

Human Activities

Human activities in the COSMOROE corpus are divided into body movements and gestures. In the corpus one can find body movements that share the same goal (e.g. "holds", "points", "kneads") but are not identical, since their tool or affected object differs (e.g. "holds handset", "holds beer glass", "points", "points with umbrella"), as well as more than one visual instance of the same body movement (e.g. "kneads dough"). Whenever no tool is mentioned for a body movement, the tool is a body-part effector. Our treatment of body movements and gestures follows the "Minimalist Grammar of Action" theory (cf. Related Publications).

The total number of body movement categories* found in the COSMOROE corpus is given in the next table. The "Goal specific" column groups together all body movements that share the same annotated goal label, while the "Complement specific" column differentiates between body movements with different tool and/or affected-object complements. The second, lower part of the table gives the numbers for the gestures, according to their type (deictic, emblem, iconic) and, for the iconic type, its sub-types (feature pantomime and metaphoric pantomime).

Human Activities
Body Movements
Goal specific Complement specific
68 88
Gestures
Deictic Emblem Iconic (Feature Pantomime) Iconic (Metaphoric Pantomime)
23 27 1 4
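The two counting granularities for body movements can be sketched with hypothetical (goal, complement) annotations; the labels below are illustrative, not taken from the corpus:

```python
# Hypothetical body-movement annotations as (goal, complement) pairs;
# complement is the tool or affected object, or None when the tool is
# a body-part effector.
movements = [
    ("holds", "handset"),
    ("holds", "beer glass"),
    ("points", None),
    ("points", "umbrella"),
    ("kneads", "dough"),
    ("kneads", "dough"),  # a second visual instance of the same movement
]

# Goal specific: group all movements sharing the same goal label.
goal_specific = {goal for goal, _ in movements}

# Complement specific: movements with different complements count separately.
complement_specific = set(movements)

print(len(goal_specific))        # 3
print(len(complement_specific))  # 5
```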


Objects

Visual objects are depicted either in the whole foreground and/or background (mixed case) of a video segment or in a specific region of a frame. In the COSMOROE corpus, the former have been annotated by indicating the time offsets of the corresponding frame sequence, while the latter have been annotated by drawing the object contour on a keyframe region. The object categories* found in the corpus, along with their respective instances, are presented in the next table (cumulative numbers and per annotation type).

Objects (Cumulative)
Categories Instances
211 1313
Objects (in Frame sequences)
Categories Instances
126 912
Objects (in Keyframes)
Categories Instances
116 401
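The two object-annotation types can be sketched as two record shapes that the cumulative counts fold together. Field and class names below are illustrative assumptions, not the actual annotation schema:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class FrameSequenceAnnotation:
    """Object annotated via the time offsets of a frame sequence."""
    label: str
    start_ms: int  # offset of the first frame of the sequence
    end_ms: int    # offset of the last frame of the sequence

@dataclass
class KeyframeAnnotation:
    """Object annotated by drawing its contour on a keyframe region."""
    label: str
    keyframe_ms: int
    contour: List[Tuple[int, int]]  # polygon points outlining the object

ObjectAnnotation = Union[FrameSequenceAnnotation, KeyframeAnnotation]

annotations: List[ObjectAnnotation] = [
    FrameSequenceAnnotation("beach", 0, 4000),
    KeyframeAnnotation("umbrella", 2000, [(10, 10), (60, 10), (35, 70)]),
]

# Cumulative figures count categories and instances over both types.
labels = {a.label for a in annotations}
print(len(labels), len(annotations))  # 2 2
```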

Related Publications


All research related to these annotations has been carried out in the framework of the FP7-ICT Project POETICON++ (Grant No: 288382) and its predecessor, POETICON (Grant No: 215843), by:

We also thank Maria Lada and Maria Koutsobogera for preliminary annotation of the files.


* Unique categories are counted using the lemmatised (canonical, non-inflected) forms of words/labels.