Introduction

The COSMOROE Search Engine is a text-based search engine for multimedia documents. It has been designed as a support tool for COSMOROE (CrOSs-Media inteRactiOn rElations), a framework for modelling multimedia dialectics, i.e. the semantic interplay between images, language and body movements. COSMOROE defines a number of semantic relations between different modalities for formulating multimedia messages (see Figure 1; more details can be found in the papers listed in the Documentation section).

Figure 1: The COSMOROE relations

Currently, the COSMOROE Search Engine indexes and retrieves audiovisual information from files that have been annotated manually. Their annotation comprises: speech transcription, optical character recognition (e.g. for subtitles, scene text, graphic text), identification of visual objects of interest, identification of body movements and gestures of interest, labelling/tagging of visual elements, and labelled association of language and visual elements. However, the ultimate objective is to use this search engine with data that will be augmented with such metadata automatically.
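
For a concrete picture of this metadata, the following minimal Java sketch models one annotated cross-media relation as a flat value object. All names and fields are illustrative assumptions, not the actual COSMOROE annotation schema.

```java
// Illustrative value class mirroring the annotation layers listed above;
// names and fields are assumptions, not the actual COSMOROE schema.
public record AnnotatedRelation(
        String relationType,     // e.g. "Token-Type", "Metonymy"
        String textualArgument,  // transcribed word or phrase
        String textualType,      // "Utterance", "Subtitles", "Scene Text", ...
        String visualArgument,   // label of the visual element
        String visualType,       // "Body Movement", "Gesture", "Keyframe Region", ...
        String videoFile,        // source multimedia file
        double startSeconds,     // start of the relevant video segment
        double endSeconds) {}    // end of the relevant video segment
```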

The engine allows one to:

- perform a simple keyword search over the indexed multimedia files (Simple Search), or
- perform an advanced search, filtering the query by argument content, by modality type and/or by multimodal relation type (Advanced Search).

The former functionality targets general users, while the latter is addressed to people with a special interest in multimedia semantics and/or multimedia system development.

In both cases, the engine provides functionality for sorting results, presenting them to the user, and exploring the contents of the underlying database in the form of quantified profiles of the data.

The following sections provide more details on all these aspects of the system.


Simple Search

From the "Simple Search" page the user can type the query of interest, in a straightforward manner, like in any text-based search engine. The keyword or keywords entered are searched among the multimedia files stored in a database. Figure 2 presents a simple example of keyword search. (For more information about keyword typing methods see Section Typing Conventions).

Figure 2: Simple search page

The meaning of the query is, as expected: search for multimedia files that contain the word "play", either in the transcribed text or in the labelled visual content.
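
Since the engine is built on Apache Lucene (see the Technical Specification section), a query with this meaning could plausibly be expressed as a disjunction over two index fields. The following sketch uses hypothetical field names ("transcribed_text" and "visual_label"); the actual index schema is not documented here.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SimpleSearchQuery {
    // Match the keyword either in the transcribed text or in the labelled
    // visual content; SHOULD clauses make the two alternatives disjunctive.
    public static Query keywordQuery(String keyword) {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("transcribed_text", keyword)),
                     BooleanClause.Occur.SHOULD)
                .add(new TermQuery(new Term("visual_label", keyword)),
                     BooleanClause.Occur.SHOULD)
                .build();
    }
}
```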

In addition, as the user starts typing a search word, a list of the matching terms currently found in the database is shown below the input field, as shown in Figure 3. Multiple search terms can be selected by clicking on each of them.

Figure 3: Simple search page suggestions
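
A plausible way to implement such type-ahead suggestions with Lucene is to enumerate the indexed terms that start with the prefix typed so far. This is a sketch only, assuming a recent Lucene release and a hypothetical field name; the engine's actual suggestion mechanism is not documented here.

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TermSuggestions {
    // Enumerate indexed terms of a field that start with the prefix the
    // user has typed so far, up to a fixed maximum of suggestions.
    public static List<String> suggest(IndexReader reader, String field,
                                       String prefix, int max) throws IOException {
        List<String> suggestions = new ArrayList<>();
        Terms terms = MultiTerms.getTerms(reader, field); // all terms of the field
        if (terms == null) return suggestions;
        TermsEnum it = terms.iterator();
        // Position the enumeration at the first term >= the typed prefix.
        if (it.seekCeil(new BytesRef(prefix)) == TermsEnum.SeekStatus.END) {
            return suggestions;
        }
        for (BytesRef term = it.term(); term != null && suggestions.size() < max;
             term = it.next()) {
            String text = term.utf8ToString();
            if (!text.startsWith(prefix)) break; // left the prefix range
            suggestions.add(text);
        }
        return suggestions;
    }
}
```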


Typing Conventions

The user should pay special attention to certain conventions used for keyword typing, in order to avoid error messages and achieve better retrieval results.


Advanced Search

In the "Advanced Search" page the user is presented with three different ways through which her query can be filtered. The page mainly consists of three sections, see Figure 4, each one devoted to the alternative ways with which the user may query the content of the multi-modal relations.

Figure 4: Advanced search page - Sections division

Each section can be used either separately or in conjunction with the others. This means that the user is free to define suitable search criteria by filling in any one of the three sections, or all of them. In case multiple sections are filled in, their criteria are combined conjunctively ("AND" links).
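
In Lucene terms, this conjunctive combination could look like the following sketch, where each filled-in section has already been compiled into its own sub-query. This is an illustration of the "AND" linking described above, not the engine's actual code.

```java
import java.util.List;

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class AdvancedSearchQuery {
    // AND-link the sub-queries of whichever sections the user filled in:
    // every filled section must match for a document to be returned.
    public static Query combineSections(List<Query> filledSections) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (Query section : filledSections) {
            builder.add(section, BooleanClause.Occur.MUST);
        }
        return builder.build();
    }
}
```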


Search by Argument Combination

The top section of the Advanced Search page gives the user the possibility to search for either of the arguments participating in a multimodal relation (see Figure 5). Specifically, (s)he can search for the Textual Argument and/or the Visual Argument, i.e. for something said and/or something seen in a multimedia video.

Figure 5: Advanced search page - Argument Combination

Figure 5 shows an example of searching for a combination of a textual and a visual argument. The meaning of this query is: search for a multimedia file where someone says the word "pizza" while some dough is shown in the video.

As in the Simple Search, a list of suggested terms appears as soon as the user starts typing. In this case the suggestions differ for each argument, since the terms appearing as textual arguments of a relation are not the same as the visual argument terms.


Operators

While the "Textual" and "Visual" arguments can be used separately, with the user filling either of the two forms, there is also the possibility of searching for a combination of them, by filling both forms and using the operators "OR", "AND", "NOT", for logically combining them.

The operator in the middle defines the way the two arguments are logically connected in order to formulate the final query.
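
A plausible mapping of these operators onto Lucene's Boolean clauses is sketched below. The reading of "NOT" as "textual AND NOT visual" is an assumption made for illustration; the engine's actual interpretation may differ.

```java
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class ArgumentOperators {
    // Combine the textual and the visual argument sub-queries according
    // to the operator chosen in the middle of the form.
    public static Query combine(Query textual, Query visual, String operator) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        switch (operator) {
            case "AND" -> { builder.add(textual, BooleanClause.Occur.MUST);
                            builder.add(visual, BooleanClause.Occur.MUST); }
            case "OR"  -> { builder.add(textual, BooleanClause.Occur.SHOULD);
                            builder.add(visual, BooleanClause.Occur.SHOULD); }
            case "NOT" -> { builder.add(textual, BooleanClause.Occur.MUST);
                            builder.add(visual, BooleanClause.Occur.MUST_NOT); }
            default    -> throw new IllegalArgumentException(operator);
        }
        return builder.build();
    }
}
```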


Search by Modality Combination

The middle section of the Advanced Search page gives the user the possibility to search for either of the arguments participating in a multimodal relation, not by defining their content but rather by defining their type (see Figure 6).

Figure 6: Advanced search page - Modality Combination


Textual Argument

Through the "Textual Argument Specification" part of the search interface, the user can search for any word or phrase that has been said ("Utterance" or "Overlapping Utterance" or "Subtitles"), or any text that can be seen on the video ("Scene Text" or "Graphic Text"). All these choices are possible through selecting one or more of the provided choices, as seen in Figure 6. In this example, the user selected to search for any word or phrase that has been said, by selecting the types "Utterance" and "Overlapping Utterance".

The options for specifying the type of the textual argument are hierarchically structured. There is one major category, Transcribed Text, which is further divided into five subcategories: Utterance, Overlapping Utterance, Subtitles, Scene Text, and Graphic Text.

Multiple subcategories can be selected by simply clicking on the desired options; clicking again on an option deselects it. Clicking on the category automatically selects all of its options. Multiple selected options are matched disjunctively.
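
In Lucene terms, such a disjunctive match over the selected subcategories could be sketched as follows, assuming a hypothetical keyword field (here "text_type") that stores the type of each textual argument.

```java
import java.util.List;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class TypeFilter {
    // Match documents whose type field carries ANY of the selected
    // subcategories, e.g. "utterance" and "overlapping_utterance".
    public static Query anyOf(String field, List<String> selectedTypes) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        for (String type : selectedTypes) {
            builder.add(new TermQuery(new Term(field, type)),
                        BooleanClause.Occur.SHOULD);
        }
        return builder.build();
    }
}
```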


Visual Argument

Similarly to the "Textual Argument Specification", through the "Visual Argument Specification" part of the search interface, the user can search for something shown on the multimedia file that has been annotated and labelled. Specifically, the user can search for a body movement, a gesture, or an object. All these choices are possible through clicking any of the choices seen in Figure 6. In this case, the user selected to see any of the metaphoric or iconic gestures that can be seen in the video.

As in the case of the "Textual Argument", the options for the specification of the type of the visual argument are hierarchically structured. There are three major categories, the Body Movement, the Gesture, and the Object. The second choice has 4 subcategories, defining the type of gesture, while the third choice has 2 subcategories, namely, Frame Sequence and Keyframe Region.

Multiple hierarchical selection is also possible, with multiple selected options matched disjunctively.


Operators

Like in the "Argument combination section" the user can use the operators to combine the different types of arguments. For example, in Figure 6 the user will actually search for any text said ("Utterance" or "Overlapping Utterance"), while at the same time a "Metaphoric" or "Iconic" Gesture is shown in the video.


Search by Multimodal Relation Type

An alternative way of searching through the multimedia files is by filtering the query by a specific type of relation. This is implemented in the "Search by Multimodal Relation Type" section, where the user can select the type or types of relation that combine the textual and visual arguments found in a multimedia file.

An example is given in Figure 7, where the user is searching for all the Token-Type (Literal Equivalence) relations, all the Metonymy (Figurative Equivalence) relations and all the Non Essential (Complementarity) relations.

Figure 7: Search by specific relations

As in the cases of the "Textual" and "Visual" argument specification, multiple hierarchical selection of the types of relations is also possible. Again, selecting an upper category automatically selects all its subcategories, and multiple selected options are matched disjunctively. Note that categories and subcategories preceded by an arrow are expandable.


Search by Argument, Modality and Relation Type

By combining all sections of the Advanced Search page, the user can search in the most specific way, for a textual argument and a visual argument that participate in a specific relation. For example, Figure 8 shows the following query: search for multimedia files that contain the word "park" uttered by someone and the label "plays guitar" denoting a body movement, with those two arguments found in a Metonymy (Figurative Equivalence) relation.

Figure 8: Search for "park" uttered and "plays guitar" as a body movement, found in a "metonymy" figurative equivalence


Results

In the "Results Page" the user can see the multimedia files that matched the given query, along with the arguments accompanying each relation and a hyperlink to the section of the video file that contains the relation. Figure 9 presents such an example page.

On the left, the type of the relation found is shown. In the middle column, the arguments of the relation are presented, so the user can quickly get an idea of the content of the relation, and in the right column there is a link that opens a new window showing the relevant part of the video itself. Similarly, the left column opens a new window containing more information for each result found.

In addition to the relation types available in the advanced search mode, the "Results Page" shows one more type, the "Indirect Relation", which denotes a set of inferred relations that indirectly associate the language and visual arguments (cf. the Indirect Relations section).

One may notice that among the results there are elements that do not belong to any relation. These are labelled "Element not in a relation" and are usually elements of the video, either textual or visual, that have been annotated for multimodal retrieval purposes.

An indication of the number of results found is given at the top. Results are sorted according to a modified version of a classical tf/idf score, based on a multifaceted search field scheme.

Figure 9: Results page
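
To illustrate the general idea behind such scoring (not the engine's actual, modified scheme), the sketch below activates Lucene's classic tf/idf similarity and weights two hypothetical search fields differently, so that matches in one field contribute more to the score than matches in the other. Field names and boost values are assumptions.

```java
import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.ClassicSimilarity;

public class ScoredSearch {
    // Rank hits with classic tf/idf, boosting transcribed speech over
    // visual labels (illustrative weights only).
    public static TopDocs search(IndexSearcher searcher, String keyword,
                                 int maxHits) throws IOException {
        searcher.setSimilarity(new ClassicSimilarity()); // classic tf/idf scoring
        Query query = new BooleanQuery.Builder()
                .add(new BoostQuery(new TermQuery(
                         new Term("transcribed_text", keyword)), 2.0f),
                     BooleanClause.Occur.SHOULD)
                .add(new BoostQuery(new TermQuery(
                         new Term("visual_label", keyword)), 1.0f),
                     BooleanClause.Occur.SHOULD)
                .build();
        return searcher.search(query, maxHits);
    }
}
```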


Relation View

By accessing an individual result, the "Relation View Page" opens, where all the components of the relation are presented, as seen in Figure 10.

Figure 10: Relation detailed view

The textual and visual arguments of the relation are depicted here, along with information about the type and direction of the relation connecting the two arguments. The user can also watch the video clip "containing" the whole relation, or the video clips of the modalities that participate in it, and see static images of specific objects with the objects' contours highlighted. Hovering the mouse over a keyframe region image magnifies it for a clearer view.


Body movements and Gestures

When the visual argument is a body movement or a gesture, it is not just the argument itself that is depicted (its annotation label and its corresponding video clip), but also its complements, namely the agent, the tool, the affected object and the location of the action, following the principles of the minimalist grammar of action (see the Documentation section). For example, Figure 11 shows the body movement labelled "catches ball", with three of its complements annotated. Initially the complements are not expanded (only their general category is shown). Figure 12 shows the body movement with all the complements unfolded. The user can get a more detailed view, either by seeing an annotated keyframe of the complement or by watching a video clip.

Figure 11: Detailed view of a body movement argument

Figure 12: Complements of a body movement


Indirect Relations

In some cases, the two arguments of the relation are not directly connected, and an additional argument, not explicitly said or shown, is needed. This inferred argument creates an inferred relation, which is necessary for indirectly associating the visual and textual arguments. An example is shown in Figure 13, which is the result of a query for "beer glass" or "pint". Here, the visual argument is "beer glass" and the textual argument is "pint", for which no direct relation exists. By adding the inferred argument "beer", a chain of two inferred relations is created, and the visual and textual arguments are now related through the given relational sequence.

Figure 13: An example of an Indirect Relation
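
The shape of such a chain can be sketched with a toy data structure. The "Container for Content" link type is taken from the example discussed next; the type of the other inferred relation is not given in the text, so it is left unspecified, and all other names are illustrative.

```java
import java.util.List;

public class IndirectRelationChain {
    // Toy representation of one link in a chain of inferred relations.
    public record Link(String relationType, String fromArgument, String toArgument) {}

    public static void main(String[] args) {
        // "beer glass" (visual) and "pint" (textual) are bridged by the
        // inferred argument "beer", yielding two inferred relations.
        List<Link> chain = List.of(
                new Link("Container for Content", "beer glass", "beer"),
                new Link("(type unspecified in text)", "beer", "pint"));
        chain.forEach(link -> System.out.printf("%s --[%s]--> %s%n",
                link.fromArgument(), link.relationType(), link.toArgument()));
    }
}
```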

Of course, each inferred relation can also be retrieved by its own type. In the previous example, a query for all the "Container for Content" relations will yield a result similar to the one shown before, as can be seen in Figure 14. Only this time, the "Relation View" focuses on the inferred relation that matched the specific query, leaving the rest of the inferred relations partially visible.

Figure 14: An example of an Inferred Relation


Technical Specification

In developing the COSMOROE search interface, specific application needs had to be taken into consideration. The main goal was to develop a text-based search engine module capable of handling files in XML format and of being accessed by local and remote users. The core implementation is a web application based mainly on the Apache Lucene search engine library.

This choice is supported by Lucene's intrinsic characteristics, such as high-performance indexing and searching, scalability and customization options, and an open-source, cross-platform implementation, which render it one of the most suitable solutions for text-based search.
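
As an indication of how annotated XML files might be fed into a Lucene index, the following sketch parses a hypothetical annotation file and indexes one Lucene document per relation. Element names, attribute names and field names are all assumptions; the real annotation schema is not documented here.

```java
import java.io.File;
import java.nio.file.Paths;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class AnnotationIndexer {
    // Index one (hypothetical) annotation file: one Lucene document per
    // <relation type="..." textual="..." visual="..."> element.
    public static void indexFile(File xmlFile) throws Exception {
        org.w3c.dom.Document xml = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(xmlFile);
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("cosmoroe-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            NodeList relations = xml.getElementsByTagName("relation");
            for (int i = 0; i < relations.getLength(); i++) {
                Element rel = (Element) relations.item(i);
                Document doc = new Document();
                doc.add(new StringField("relation_type",
                        rel.getAttribute("type"), Field.Store.YES));
                doc.add(new TextField("transcribed_text",
                        rel.getAttribute("textual"), Field.Store.YES));
                doc.add(new TextField("visual_label",
                        rel.getAttribute("visual"), Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }
}
```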


Documentation
