• Nov. 4, 2015: The file encoding was specified:

Note that the file encoding must be "UTF-8".

• Nov. 2, 2015: The way to generate a trailtext was modified:

Let $${\mathbf f} = (\ldots, u_{j-1}, l_{k}, u_{j}, \ldots)$$ where $$l_{k}$$ is a link of intent $$i$$.

• Nov. 1, 2015: A sentence was added:

Also note that only the first appearance of the same iUnit is relevant, while the other appearances are regarded as non-relevant.

• Oct. 5, 2015: Several updates on the task definition (they are highlighted)

## Overview

Current web search engines usually return a ranked list of URLs in response to a query. After inputting a query and clicking on the search button, the user often has to visit several web pages and locate relevant parts within those pages. While these actions require significant effort and attention, especially for mobile users, they could be avoided if a system returned a concise summary of information relevant to the query [1].

The NTCIR-12 MobileClick task (and its predecessors, the 1CLICK tasks in NTCIR-9 [2] and NTCIR-10 [3], as well as the MobileClick task in NTCIR-11 [4]) aims to directly return a summary of relevant information and immediately satisfy the user without requiring heavy interaction with the device. Unlike the 1CLICK tasks, we expect the output to be two-layered text where the first layer contains the most important information and an outline of additional relevant information, while the second layer contains detailed information that can be accessed by clicking on an associated anchor text in the first layer. In the example below, for query "NTCIR-11", a MobileClick system presents general information about NTCIR-11 and a list of core tasks in the first layer. When the "MobileClick" link is clicked by the user, the system shows text in the second layer that is associated with that link.

Textual output of the MobileClick task is evaluated based on information units (iUnits) rather than document relevance. The performance of a submitted system is scored higher if it generates summaries including more important iUnits. In addition, we require systems to minimize the amount of text the user has to read or, equivalently, the time she has to spend in order to obtain relevant information. Although these evaluation principles were also taken into account in the 1CLICK tasks, here they are extended to two-layered summaries where users can read a summary in multiple ways. We assume a user model that reads different parts of the summary according to their interest, and compute an evaluation metric based on the importance of iUnits read as well as the time spent to obtain them.

The goal of MobileClick is to return a structured textual output in response to a given query, which can be achieved through two subtasks: iUnit ranking subtask and iUnit summarization subtask.

The iUnit ranking subtask is a task where systems are expected to rank a set of pieces of information (iUnits) based on their importance for a given query. This subtask was devised to enable componentized evaluation, where we can separately evaluate the performance of estimating important information pieces and summarizing them into two-layers.

We provide a set of queries, a set of iUnits, and documents from which the iUnits were extracted. Participants should submit, for each query, a list of iUnits that are ordered by their estimated importance. More concretely, we accept a tab-separated values (TSV) file as an iUnit ranking run, where the first line must be a simple system description, and each of the other lines must represent a single iUnit. Therefore, a run file should look like the one shown below:

This is an example run file
[qid]	[uid]	[score]
[qid]	[uid]	[score]
...


where [qid] is a query ID, [uid] is an iUnit ID, and [score] is the estimated importance of the iUnit. Note that the file encoding must be "UTF-8". In many ways, the iUnit ranking runs are similar to TREC ad-hoc runs in that they are essentially a ranked list of the objects retrieved. The iUnits were assessed for relevance by human annotators and the runs are evaluated using ranking measures. Note that we do not use [score] values for evaluation, and use only the order of iUnits in run files.
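The run format above can be produced with a few lines of Python. The following is only a sketch: the function name `write_ranking_run`, the score values, and the choice to group rows by query ID are assumptions made for illustration, and the iUnit IDs are borrowed from the XML example in this document. When writing to a real file, open it with `encoding="utf-8"`.

```python
import csv
import io


def write_ranking_run(description, scored_iunits, out):
    """Write a description line, then one qid<TAB>uid<TAB>score row per iUnit."""
    out.write(description + "\n")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Rows are grouped by query ID and sorted by descending score, because
    # only the order of the rows (not the score value) is used in evaluation.
    for qid, uid, score in sorted(scored_iunits, key=lambda row: (row[0], -row[2])):
        writer.writerow([qid, uid, score])


# Hypothetical scores for two iUnits of one query.
rows = [
    ("MC-E-0001", "MC-E-0001-U003", 0.2),
    ("MC-E-0001", "MC-E-0001-U001", 0.9),
]
buffer = io.StringIO()
write_ranking_run("This is an example run file", rows, buffer)
run_text = buffer.getvalue()
print(run_text)
```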

The iUnit summarization subtask is defined as follows: Given a query, a set of iUnits, and a set of intents, generate a structured textual output. In MobileClick, more precisely, the output must consist of two layers. The first layer is a list of iUnits and links to the second layer, while the second layer consists of lists of iUnits. Each link must be one of the provided intents and be associated with one of the iUnit lists in the second layer. Each list of iUnits in the first and second layers can include at most $$X$$ characters so that it fits ordinary mobile screen size. The length of links is counted, while symbols and whitespaces are excluded. In MobileClick-2, $$X$$ is set to 420 for English and 280 for Japanese. Note that your runs can include more than $$X$$ characters in each iUnit list but will be truncated in evaluation.

Each run must be an XML file that satisfies the DTD shown below:

<!ELEMENT results (sysdesc, result*)>
<!ELEMENT sysdesc (#PCDATA)>
<!ELEMENT result (first, second*)>
<!ELEMENT first (iunit | link)*>
<!ELEMENT second (iunit)*>
<!ELEMENT iunit EMPTY>
<!ELEMENT link EMPTY>
<!ATTLIST result qid NMTOKEN #REQUIRED>
<!ATTLIST iunit uid NMTOKEN #REQUIRED>
<!ATTLIST link iid NMTOKEN #REQUIRED>
<!ATTLIST second iid NMTOKEN #REQUIRED>


where

• The XML file includes a [results] element as the root element;
• The [results] element contains exactly one [sysdesc] element;
• The [results] element also contains [result] elements, each of which corresponds to a two-layered summary and has a [qid] attribute;
• A [result] element contains a [first] element and [second] elements;
• The [first] element contains [iunit] and [link] elements;
• A [second] element has an attribute [iid], and contains [iunit] elements;
• An [iunit] element has an attribute [uid] (iUnit ID); and
• A [link] element has an attribute [iid] (intent ID), which identifies a [second] element to be linked.
Note that the same [iunit] element may appear multiple times, e.g. an iUnit may appear in the [first] element and two [second] elements.

An XML file example that satisfies the DTD is shown below:

<?xml version="1.0" encoding="UTF-8" ?>

Organizers' Baseline

<iunit uid="MC-E-0001-U001" />
<iunit uid="MC-E-0001-U003" />
<iunit uid="MC-E-0001-U004" />

<iunit uid="MC-E-0001-U011" />
<iunit uid="MC-E-0001-U019" />

<iunit uid="MC-E-0001-U029" />
<iunit uid="MC-E-0001-U021" />



One can check whether an XML file is valid at the W3C Markup Validation Service by using "Validate by direct input". Note that the file encoding must be "UTF-8".
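A run file with the element structure described in the list above can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `build_run`, the input layout, the query ID `MC-E-0001` (inferred from the iUnit ID prefix), and the intent IDs `INTENT-A` and `INTENT-B` are all hypothetical placeholders, not identifiers from the test collection.

```python
import xml.etree.ElementTree as ET


def build_run(sysdesc, summaries):
    """Build a <results> tree from {qid: (first_layer, second_layers)}.

    first_layer is a list of ("iunit", uid) or ("link", iid) pairs;
    second_layers maps an intent ID to a list of iUnit IDs.
    """
    results = ET.Element("results")
    ET.SubElement(results, "sysdesc").text = sysdesc
    for qid, (first_layer, second_layers) in summaries.items():
        result = ET.SubElement(results, "result", qid=qid)
        first = ET.SubElement(result, "first")
        for kind, identifier in first_layer:
            # iunit elements carry a uid attribute, link elements an iid attribute.
            attr = "uid" if kind == "iunit" else "iid"
            ET.SubElement(first, kind, {attr: identifier})
        for iid, uids in second_layers.items():
            second = ET.SubElement(result, "second", iid=iid)
            for uid in uids:
                ET.SubElement(second, "iunit", uid=uid)
    return results


# Hypothetical summary: INTENT-A and INTENT-B are placeholder intent IDs.
summaries = {
    "MC-E-0001": (
        [("iunit", "MC-E-0001-U001"), ("link", "INTENT-A"), ("link", "INTENT-B")],
        {"INTENT-A": ["MC-E-0001-U011"], "INTENT-B": ["MC-E-0001-U029"]},
    )
}
root = build_run("Organizers' Baseline", summaries)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

To store the run on disk in the required encoding, `ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)` writes the tree together with an XML declaration.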

## Test Collection

The NTCIR-12 MobileClick test collection includes queries, iUnits, intents, and a document collection.

### Queries

The NTCIR-12 MobileClick test collection includes English and Japanese queries. Unlike the MobileClick-1 task, we selected more ambiguous, underspecified, or short queries, like those of the 1CLICK tasks held at past NTCIR rounds. This is because we chose to focus on queries that are often issued on mobile devices, and to tackle the problem of diverse searcher intents.

We used a Korean toolbar log from April to July 2014 for obtaining real-users' queries, and translated them into English and Japanese. We used Google AdWords Keyword Planner for deciding queries to be included in our test collection. We first randomly generated queries, and then filtered out too frequent or infrequent queries based on the report from Google's tool.

### Documents

To provide participants with a set of iUnits for each query, we downloaded 500 top-ranked documents that were returned by the Bing search engine in response to each query, from which we extracted iUnits as explained in the next subsection. As we failed to access some of the documents, the number of downloaded documents per query is fewer than 500. NTCIR participants can obtain this document collection after their registration, and utilize it to estimate the importance of each iUnit, the intent probability, etc. Please contact the organizers if you want to download this document collection.

### iUnits

Like the 1CLICK tasks held in the past NTCIR, we used iUnits as a unit of information in the MobileClick task. iUnits are defined as relevant, atomic, and dependent pieces of information, where

• Relevant means that an iUnit provides useful factual information to the user;
• Atomic means that an iUnit cannot be broken down into multiple iUnits without loss of the original semantics; and
• Dependent means that an iUnit can depend on other iUnits to be relevant.

Please refer to the 1CLICK-2 overview paper for the details of the definition [5].

Although iUnits can depend on other iUnits to be relevant according to our definition, we excluded dependent iUnits in this round for simplicity.

As this work requires careful, long-lasting assessment and consideration of the three requirements of iUnits, we decided not to use crowd-sourcing, mainly due to its low controllability and high education cost. We hired assessors for extracting iUnits by hand and maintained the quality of extracted iUnits by giving timely feedback on their results. For English queries, we hired three assessors for extracting iUnits by hand, who were well trained through assessment work on the TREC Temporal Summarization Track [6]. For Japanese queries, the Japanese organizers of this task extracted iUnits themselves. Examples of iUnits for English queries are shown below.

Examples of iUnits for NTCIR-11 MobileClick English queries. Query MC-E-0020 is "stevia safety".

| Query ID | iUnit | Source |
| --- | --- | --- |
| MC-E-0020 | There are some dangers and side effects when using stevia. | MC-E-0020-008.html |
| MC-E-0020 | refined stevia preparations allowed in food and drinks | MC-E-0020-001.html |
| MC-E-0020 | Stevia does interact with some other drugs. | MC-E-0020-009.html |
| MC-E-0020 | Stevia may have an anti-inflammatory effect. | MC-E-0020-011.html |
| MC-E-0020 | Stevia may help diarrhea. | MC-E-0020-011.html |

### Intents

We introduce the notion of intents to the MobileClick-2 task, which have been utilized in the NTCIR INTENT and IMine tasks. An intent can be defined as either a specific interpretation of an ambiguous query ("Mac OS" and "car brand" for "jaguar"), or an aspect of a faceted query ("windows 8" and "windows 10" for "windows"). In this round, intents were taken into account in evaluating the importance of iUnits, and were used as candidates of anchor text of links to the second layers in the iUnit summarization subtask.

We constructed intents by

1. Clustering iUnits,
2. Giving each cluster a label representing iUnits included in the cluster, and
3. Letting each label of a cluster represent an intent.

The organizers of this task worked on the manual iUnit clustering, in which two iUnits were grouped together if

• They are information about the same interpretation of an ambiguous query or the same aspect of a faceted query, and
• They are likely to be interesting for the same user.

The criteria used in the label selection are listed below:

• The label of a cluster should be descriptive enough for users to grasp the iUnits included in the cluster, and
• The label of a cluster should be often used as a query or anchor text for the included iUnits.

Subsequently, we let assessors vote on whether each intent is important or not. This voting was carried out to estimate the intent probability, which is the probability of intents of users who input a particular query, as was done in the NTCIR INTENT and IMine tasks. The assessors were asked to vote for an intent if they believed that they would be interested in the intent when searching with the query. All intents that did not receive any vote were excluded from this task, since this indicates that few users are interested in such intents. We normalized the number of votes for each intent by the total number of votes for a query, and let $$P(i|q)$$ denote the normalized value for $$i$$, which we call the intent probability of intent $$i$$ of query $$q$$.
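As a small illustration of the normalization described above, the sketch below turns vote counts into intent probabilities. The function name, the vote counts, and the "animal" intent are made up for this example; only the "car brand" and "Mac OS" intents for "jaguar" come from the text.

```python
def intent_probabilities(votes):
    """Normalize vote counts into P(i|q), dropping intents with no votes."""
    kept = {intent: count for intent, count in votes.items() if count > 0}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {intent: count / total for intent, count in kept.items()}


# Hypothetical vote counts for query "jaguar".
votes = {"car brand": 6, "Mac OS": 2, "animal": 0}
probs = intent_probabilities(votes)
print(probs)  # {'car brand': 0.75, 'Mac OS': 0.25}
```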

### iUnit Importance

The importance of each iUnit was evaluated in terms of each intent, and global importance was derived from the per-intent importance and intent probability.

We asked assessors to assess each iUnit in terms of each intent, and evaluate the importance on a five-point scale: 0 (unimportant), 1, 2 (somewhat important), 3, and 4 (highly important). The assessors were instructed to evaluate the importance by assuming that they were interested in a given intent. We defined the importance of an iUnit in terms of an intent as follows: an iUnit is more important if it is more necessary for more users who are interested in the intent. For example, given intent "Mac OS" in response to query "jaguar", iUnit "car company in UK" is unimportant, while it is highly important given intent "car brand". We used the average of the per-intent importance scores given by multiple assessors in our evaluation.

In the iUnit ranking subtask, we used the global importance of each iUnit for evaluation. Letting $$P(i|q)$$ be the intent probability of query $$q$$, the global importance of iUnit $$u$$ is defined as follows: $$GG(u) = \sum_{i \in I_q} P(i|q)g_i(u),$$ where $$I_q$$ is a set of intents for query $$q$$, and $$g_i(u)$$ denotes the per-intent importance of iUnit $$u$$ in terms of intent $$i$$.
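The global importance is a probability-weighted sum of per-intent scores, which can be sketched as follows. The function name and the numeric values are assumptions chosen for illustration (the "jaguar" intents and the iUnit "car company in UK" are the example used earlier in this section).

```python
def global_importance(intent_probs, per_intent_gain):
    """GG(u) = sum over intents i of P(i|q) * g_i(u)."""
    return sum(p * per_intent_gain.get(intent, 0.0)
               for intent, p in intent_probs.items())


# Hypothetical intent probabilities and per-intent importance of one iUnit.
intent_probs = {"car brand": 0.75, "Mac OS": 0.25}
gain_car_company_in_uk = {"car brand": 4.0, "Mac OS": 0.0}
gg = global_importance(intent_probs, gain_car_company_in_uk)
print(gg)  # 3.0
```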

## Evaluation Measures

Runs submitted by participants include a ranked list of iUnit IDs for each query, which can be handled in the same way as ad-hoc retrieval runs. Therefore, we employed standard evaluation metrics for ad-hoc retrieval in this subtask.

One of the evaluation metrics used in the iUnit ranking subtask was normalized discounted cumulative gain (nDCG). Discounted cumulative gain (DCG) is defined as follows: $${\rm DCG}@K = \sum^{K}_{r=1} \frac{GG(u_r)}{\log_2(r+1)},$$ where $$K$$ is a cutoff parameter, and $$u_r$$ is the $$r$$-th iUnit in a submitted ranked list. The normalized version of DCG (nDCG) is therefore defined as follows: $${\rm nDCG}@K = \frac{{\rm DCG}@K}{{\rm iDCG}@K},$$ where $${\rm iDCG}$$ is the DCG of the ideal ranked list of iUnits, which can be constructed by sorting all the iUnits for a query by their global importance.
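A minimal sketch of the DCG and nDCG computation described above is given below. It is not the official evaluation script: the function names and the toy importance values are assumptions, and iUnits missing from the importance table are treated as having zero gain.

```python
import math


def dcg_at_k(gains, k):
    """Sum of gain / log2(rank + 1) over the top-k ranks (ranks start at 1)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_uids, gg, k):
    """nDCG@K of a ranked list, given a dict uid -> global importance GG(u)."""
    run_gains = [gg.get(uid, 0.0) for uid in ranked_uids]
    # The ideal list sorts all iUnits of the query by global importance.
    ideal_gains = sorted(gg.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(run_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical global importance values for three iUnits.
gg = {"u1": 3.0, "u2": 1.0, "u3": 0.0}
print(ndcg_at_k(["u1", "u2", "u3"], gg, 3))  # ideal order
print(ndcg_at_k(["u3", "u2", "u1"], gg, 3))  # reversed order scores lower
```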

Another evaluation metric is Q-measure proposed by Sakai [7]: $$Q = \frac{1}{R}\sum^{M}_{r=1} {\rm IsRel}(u_r)\frac{\sum^{r}_{r'=1}(\beta GG(u_{r'})+{\rm IsRel}(u_{r'}))}{\beta\sum^{r}_{r'=1}GG(u^{*}_{r'})+r},$$ where $${\rm IsRel}(u)$$ is an indicator function that returns 1 if $$GG(u) > 0$$ and 0 otherwise, $$R$$ is the number of iUnits with non-zero global importance (i.e. $$\sum_{u}{\rm IsRel}(u)$$), $$M$$ is the length of a ranked list, $$u^{*}_r$$ is the $$r$$-th iUnit in the ideal ranked list of iUnits, and $$\beta$$ is a patience parameter which we set to 1 following established standards [8].
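The Q-measure formula can be computed in a single pass over the ranked list by keeping running sums. The sketch below is an illustration under stated assumptions rather than the official scorer: the function name and toy values are made up, each iUnit is assumed to appear at most once in the ranked list, and iUnits without an importance entry are treated as non-relevant.

```python
import math


def q_measure(ranked_uids, gg, beta=1.0):
    """Q-measure of a ranked list, given a dict uid -> global importance GG(u)."""
    ideal_gains = sorted(gg.values(), reverse=True)
    num_relevant = sum(1 for g in gg.values() if g > 0)  # R
    if num_relevant == 0:
        return 0.0
    total = 0.0
    cum_gain = 0.0    # running sum of GG(u_r') in the submitted list
    cum_ideal = 0.0   # running sum of GG(u*_r') in the ideal list
    rel_count = 0     # running sum of IsRel(u_r')
    for r, uid in enumerate(ranked_uids, start=1):
        gain = gg.get(uid, 0.0)
        cum_gain += gain
        if r <= len(ideal_gains):
            cum_ideal += ideal_gains[r - 1]
        if gain > 0:
            rel_count += 1
            total += (beta * cum_gain + rel_count) / (beta * cum_ideal + r)
    return total / num_relevant


# Hypothetical global importance values for three iUnits.
gg = {"u1": 3.0, "u2": 1.0, "u3": 0.0}
print(q_measure(["u1", "u2", "u3"], gg))  # ideal order
print(q_measure(["u3", "u2", "u1"], gg))  # reversed order scores lower
```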

Q-measure is a recall-based graded-relevance metric, while nDCG is a rank-based graded-relevance metric. Thus, we expect that using both metrics will enable us to measure the performance from different perspectives. Moreover, both of them were shown to be reliable [9]. Q-measure is used for ranking submitted runs since it can take into account the quality of the whole ranking.

Runs submitted to the iUnit summarization subtask consist of the first layer $${\mathbf f}$$ and second layers $$S = \{{\mathbf s}_1, {\mathbf s}_2, \ldots, {\mathbf s}_n\}$$. The first layer $${\mathbf f}$$ consists of iUnits and links (e.g. $${\mathbf f}=(u_1, u_2, l_1, u_3)$$ where $$u_j$$ is an iUnit and $$l_j$$ is a link). Each link $$l_j$$ links to a second layer $${\mathbf s}_j$$. A second layer $${\mathbf s}_j$$ is composed of iUnits (e.g. $${\mathbf s}_1=(u_{1, 1}, u_{1, 2}, u_{1, 3})$$).

The principles of the iUnit summarization evaluation metric are summarized as follows:

1. The evaluation metric is the expected utility of users who probabilistically read a summary.
2. Users are interested in one of the intents by following the intent probability $$P(i|q)$$.
3. Users read a summary following the rules below:
    1. They read the summary from the beginning of the first layer in order and stop after reading $$L$$ characters except symbols and white spaces.
    2. When they reach the end of a link $$l_j$$, they click on the link and start to read its second layer if they are interested in the intent of $$l_j$$.
    3. When they reach the end of a second layer $${\mathbf s}_j$$, they continue to read the first layer from the end of the link $$l_j$$.
4. The utility is measured by U-measure proposed by Sakai and Dou [10], which consists of a position-based gain and a position-based decay function.

We then generate the user trails (or trailtexts) according to the user model explained above, compute a U-measure score for each trailtext, and finally estimate the expected U-measure by combining all the U-measure scores of different trailtexts. M-measure, the iUnit summarization evaluation metric, is defined as follows: $$M = \sum_{{\mathbf t} \in T} P({\mathbf t})U({\mathbf t}),$$ where $$T$$ is a set of all possible trailtexts, $$P({\mathbf t})$$ is the probability of going through a trail $${\mathbf t}$$, and $$U({\mathbf t})$$ is the U-measure score of the trail.

A trailtext is a concatenation of all the texts read by a user, and can be defined as a list of iUnits and links in our case. According to our user model, a trailtext of a user who is interested in intent $$i$$ can be obtained by inserting the list of iUnits in its second layer after the link of $$i$$. More specifically, a trailtext of intent $$i$$ is obtained as follows:

1. Let $${\mathbf f} = (\ldots, u_{j-1}, l_{k}, u_{j}, \ldots)$$ where $$l_{k}$$ is a link of intent $$i$$.
2. Generate trailtext $${\mathbf t} = (\ldots, u_{j-1}, l_{k}, u_{k,1}, \ldots, u_{k, |{\mathbf s}_k|}, u_{j}, \ldots)$$ for second layer $${\mathbf s}_{k} = (u_{k,1}, \ldots, u_{k, |{\mathbf s}_k|})$$.

Note that a link in the trailtext is regarded as a non-relevant iUnit for the sake of convenience. Also note that only the first appearance of the same iUnit is relevant, while the other appearances are regarded as non-relevant.
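The trailtext construction can be sketched in Python as follows. This is an illustration only: the function names, the tuple representation, and the toy IDs are assumptions. It also assumes, in line with the Nov. 2, 2015 update and the user model, that links of other intents stay in the trailtext and simply count as non-relevant items.

```python
def build_trailtext(first_layer, second_layers, intent):
    """Insert the second layer of `intent` right after its link in the first layer.

    first_layer: list of ("iunit", uid) or ("link", iid) pairs in reading order;
    second_layers: dict iid -> list of iUnit IDs.
    """
    trail = []
    for kind, identifier in first_layer:
        trail.append((kind, identifier))
        if kind == "link" and identifier == intent:
            trail.extend(("iunit", uid) for uid in second_layers.get(intent, []))
    return trail


def mark_relevant_positions(trail):
    """Flag positions that may yield gain: iUnits at their first appearance only."""
    seen = set()
    marked = []
    for kind, identifier in trail:
        may_gain = kind == "iunit" and identifier not in seen
        if kind == "iunit":
            seen.add(identifier)
        marked.append((kind, identifier, may_gain))
    return marked


# Toy summary with hypothetical IDs: u1 is repeated in the second layer of i1.
first_layer = [("iunit", "u1"), ("link", "i1"), ("iunit", "u2"), ("link", "i2")]
second_layers = {"i1": ["u3", "u1"], "i2": ["u4"]}
trail = build_trailtext(first_layer, second_layers, "i1")
marked = mark_relevant_positions(trail)
print(marked)
```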

As mentioned above, we can generate a trailtext for each intent, and do not need to consider the other trailtexts, as the way to read a summary depends only on the intent of users. In addition, the probability of a trailtext is equivalent to that of the intent for which the trailtext is generated. Thus, M-measure can be simply re-defined as follows: $$M = \sum_{i \in I_q} P(i|q)U_i({\mathbf t}_i).$$ $$U$$ is now measured in terms of intent $$i$$ in the equation above, since we assume that users going through $${\mathbf t}_i$$ are interested in $$i$$.

The utility is measured by U-measure proposed by Sakai and Dou [11], and is computed from the importance and offset of iUnits in a trailtext. The offset of iUnit $$u$$ in a trailtext is defined as the number of characters between the beginning of the trailtext and the end of $$u$$. More precisely, the offset of the $$j$$-th iUnit $$u_j$$ in trailtext $${\mathbf t}$$ is defined as $${\rm pos}_{\mathbf t}(u_j) = \sum^{j}_{j'=1}{\rm chars}(u_{j'})$$ where $${\rm chars}(u)$$ is the number of characters of iUnit $$u$$ except symbols and white spaces. Recall that a link in the trailtext contributes to the offset as a non-relevant iUnit. According to Sakai and Dou's work, U-measure is defined as follows: $$U_i({\mathbf t}) = \frac{1}{\mathcal N}\sum^{|{\mathbf t}|}_{j = 1} g_i(u_j)d(u_j),$$ where $$d$$ is a position-based decay function, and $${\mathcal N}$$ is a normalization factor (which we simply set to 1). The position-based decay function is defined as follows: $$d(u)=\max\left(0, 1-\frac{{\rm pos}_{\mathbf t}(u)}{L}\right),$$ where $$L$$ is a patience parameter of users. Note that no gain can be obtained after $$L$$ characters have been read, i.e. $$d(u) = 0$$ for any iUnit whose offset is at least $$L$$. This is consistent with our user model in which users stop after reading $$L$$ characters. In MobileClick-2, $$L$$ is set to twice $$X$$: 840 for English and 560 for Japanese, since $$L = 500$$ (or 250) for Japanese was recommended by a study on S-measure [12].
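Putting the pieces together, U-measure and M-measure can be sketched as below. This is a toy illustration, not the official scorer. The function names, item IDs, texts, and the patience value of 10 are assumptions chosen to keep the arithmetic small, and `str.isalnum` is used only as a rough stand-in for "excluding symbols and white spaces", since the exact character classes are not specified here. iUnit IDs and link IDs are assumed to be distinct.

```python
import math


def chars(text):
    """Count characters, skipping symbols and white space (approximated by isalnum)."""
    return sum(1 for c in text if c.isalnum())


def u_measure(trail, texts, gains, patience):
    """U-measure of one trailtext, with the normalization factor set to 1.

    trail: item IDs (iUnits and links) in reading order; texts: ID -> text;
    gains: iUnit ID -> per-intent importance g_i(u). Links have no entry in
    `gains`, so they add to the offset but never to the score.
    """
    offset = 0
    seen = set()
    score = 0.0
    for item in trail:
        offset += chars(texts[item])
        decay = max(0.0, 1.0 - offset / patience)
        if item not in seen:  # only the first appearance of an iUnit is relevant
            score += gains.get(item, 0.0) * decay
        seen.add(item)
    return score


def m_measure(trails, intent_probs, texts, gains_by_intent, patience):
    """M = sum over intents i of P(i|q) * U_i(t_i)."""
    return sum(p * u_measure(trails[i], texts, gains_by_intent[i], patience)
               for i, p in intent_probs.items())


# Toy data (all hypothetical): l1 is the link of intent i1 with second layer (u2).
texts = {"u1": "ab cd", "l1": "x!", "u2": "efg"}
trails = {"i1": ["u1", "l1", "u2"], "i2": ["u1", "l1"]}
gains_by_intent = {"i1": {"u1": 2.0, "u2": 4.0}, "i2": {"u1": 0.0}}
intent_probs = {"i1": 0.75, "i2": 0.25}
u_i1 = u_measure(trails["i1"], texts, gains_by_intent["i1"], patience=10)
m = m_measure(trails, intent_probs, texts, gains_by_intent, patience=10)
print(u_i1, m)
```

In this toy case the offsets along the trail of intent i1 are 4, 5, and 8 characters, so the decay values are 0.6, 0.5, and 0.2, and the link contributes to the offset without adding gain.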

1. T. Sakai, M. P. Kato, and Y.-I. Song. Click the search button and be happy: Evaluating direct and immediate information access. In Proc. of CIKM 2011, pages 621–630, 2011.

2. T. Sakai, M. P. Kato, and Y.-I. Song. Overview of NTCIR-9 1CLICK. In Proceedings of NTCIR-9, pages 180–201, 2011.

3. M. P. Kato, M. Ekstrand-Abueg, V. Pavlu, T. Sakai, T. Yamamoto, and M. Iwata. Overview of the NTCIR-10 1CLICK-2 Task. In NTCIR-10 Conference, pages 243–249, 2013.

4. M. P. Kato, M. Ekstrand-Abueg, V. Pavlu, T. Sakai, T. Yamamoto, and M. Iwata. Overview of the NTCIR-11 MobileClick Task. In NTCIR-11 Conference, pages 195–207, 2014.

5. M. P. Kato, M. Ekstrand-Abueg, V. Pavlu, T. Sakai, T. Yamamoto, and M. Iwata. Overview of the NTCIR-10 1CLICK-2 Task. In NTCIR-10 Conference, pages 243–249, 2013.

6. J. Aslam, M. Ekstrand-Abueg, V. Pavlu, F. Diaz, and T. Sakai. TREC 2013 temporal summarization. In Proceedings of the 22nd Text Retrieval Conference (TREC), 2013.

7. T. Sakai. On the reliability of information retrieval metrics based on graded relevance. Information processing & management, 43(2):531–548, 2007.

8. T. Sakai. On penalising late arrival of relevant documents in information retrieval evaluation with graded relevance. In Proceedings of the First Workshop on Evaluating Information Access (EVIA 2007), pages 32–43, 2007.

9. T. Sakai. On the reliability of information retrieval metrics based on graded relevance. Information processing & management, 43(2):531–548, 2007.

10. T. Sakai and Z. Dou. Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In Proc. of SIGIR 2013, pages 473–482, 2013.

11. T. Sakai and Z. Dou. Summaries, ranked retrieval and sessions: a unified framework for information access evaluation. In Proc. of SIGIR 2013, pages 473–482, 2013.

12. T. Sakai and M. P. Kato. One Click One Revisited: Enhancing Evaluation based on Information Units. In Proc. of the 8th Asia Information Retrieval Societies Conference (AIRS 2012), pages 39–51, 2012.