Characteristics of Dataset Retrieval Sessions: Experiences from a Real-life Digital Library

Abstract

Secondary analysis, or the reuse of existing survey data, is a common practice among social scientists. Searching for relevant datasets in Digital Libraries is a somewhat unfamiliar behaviour for this community. Dataset retrieval, especially in the social sciences, incorporates additional material such as codebooks, questionnaires, raw data files and more. Our assumption is that, due to the diverse nature of datasets, document retrieval models often do not work as efficiently for retrieving datasets. One way of enhancing these types of searches is to incorporate the users' interaction context in order to personalise dataset retrieval sessions. As a first step towards this long-term goal, we study characteristics of dataset retrieval sessions from a real-life Digital Library for the social sciences that incorporates both research data and publications. Previous studies reported a way of discerning queries between document search and dataset search by query length. In this paper, we dispute this claim and report our findings on the indistinguishability of queries, whether aiming for a dataset or a document. Amongst others, we report our findings on dataset retrieval sessions with respect to query characteristics, interaction sequences and topical drift within 65,000 unique sessions.



Zeljko Carevic, Dwaipayan Roy, Philipp Mayr

GESIS - Leibniz Institute for the Social Sciences, Cologne, Germany


With the vast availability of research data on the Web within the Open Data initiatives, searching for it becomes an increasingly important and timely topic. The Web hosts a whole range of new data species, published in structured, unstructured and semi-structured formats, from web tables to open government data portals, knowledge bases such as Wikidata and scientific data repositories. This data fuels many novel applications, for example, fact checkers and question answering systems, and enables advances in machine learning, artificial intelligence and information retrieval.

Dataset retrieval has emerged as an independent field of study from the text retrieval domain. The latter is well known in information retrieval (IR), with research leading to significant improvements. Dataset retrieval, on the other hand, represents a challenging sub-discipline of information retrieval with substantial differences in comparison to traditional document retrieval [4,12]. Datasets, especially in disciplines such as the social sciences, often encompass complex additional material such as codebooks (incl. variable descriptions), questionnaires, raw data files and more. Due to the higher complexity of datasets, the applicability of IR models built mainly for document retrieval is questionable. In addition, the motivations and information needs of researchers seeking datasets are too manifold to be supported by out-of-the-box retrieval technologies. Disciplines that encourage the re-use of datasets or secondary analysis, such as the social sciences, might thus not be supported sufficiently during dataset retrieval. One way of supporting users during dataset retrieval is the development of an integrated dataset retrieval system that employs advances from established document retrieval systems and adapts these techniques to the field of dataset retrieval.
Our long-term goal in the project ConData (http://bit.ly/Condata) is to develop an effective dataset retrieval system that incorporates personalised searching by employing contextualised ranking features which aim at tailoring search results towards the users' information needs. In order to develop a contextualised dataset retrieval approach, it is necessary to first gain a better understanding of different characteristics during dataset retrieval. Obtaining these kinds of behavioural data is usually hard. We address this shortcoming by analysing real-life user behaviour within a Digital Library for research data and related information for the social sciences [7]. As an initial outcome of this study, we report our findings on comparing dataset retrieval with document retrieval sessions with respect to query characteristics, interaction sequences and topical drift within 65,000 unique search sessions.

Although it started as a fundamental database task, the diverse nature of searched entities (which can be images, graphs, tables etc.) establishes dataset retrieval as a research domain in itself. The distinctive aspects of dataset retrieval regarding complex information needs (and in turn, query formulations) make it a difficult process in comparison to document search [3,11,12]. However, traditional keyword-based retrieval approaches are still in use in the domain of dataset retrieval although they are observed to be less effective for the task [4]. Research is ongoing [2,5] to exploit the additional information available for datasets and achieve further improvement.

An important sub-task during a retrieval session is to characterise the query to understand whether the search intent targets a document or a dataset. Considering the difference in nature between dataset retrieval and document retrieval, an integrated search system (holding both datasets and documents in its repository) would benefit in selecting an appropriate search mechanism if the query intent is recognized.
However, in [10], Kacprzak et al. reported the difficulty in understanding the users' intent when performing dataset search. They drew a subtle correlation between query length and the type of query, and concluded with a suggestion to use longer queries for dataset retrieval. Since their experiment was conducted in an artificial setting without a naturalistic information need, however, they concluded that their observation could be considered an approximation of the user behaviour for comparing dataset and document search.

A few studies of user behaviour in dataset search have examined queries submitted to open data portals and online communities [5,10]. However, in [9], Jansen and Spink concluded that it is not possible to directly compare the results of a transaction log analysis across different search engines. In this work, we focus on characterising the users' intent when performing publication (document) search and research data (dataset) search. The words (document, publication) and (dataset, research data) are used interchangeably in the rest of the paper to imply the same concept.

We conduct our experiment in a real-life Digital Library for the social sciences (accessible via https://search.gesis.org). This integrated search system (ISS) allows users to search across different data collections: research datasets, publications, survey variables, questions from questionnaires, survey instruments and tools for creating surveys (see details in [7]). The focus of the following study is on datasets and publications. The collection covering research datasets comprises 6,267 studies that are collected within our institution and 107,595 studies coming from other institutes. The collection covering publications comprises 48,234 records, mainly openly accessible articles from the social sciences. Information items are interlinked whenever possible to allow better findability and reuse of the data. The ISS uses category facets which enable a user to switch between data types. Furthermore, category facets ensure that result lists contain exactly one data type at a time. The ISS is mainly used by social scientists. A thorough report about the technical system, the content and its users can be found in [7].

The user interactions within the ISS are anonymously logged, which makes it possible to study user behaviour on a larger scale. Amongst others, the log covers user actions such as queries submitted, record views and browse/filter operations. For this study, we considered all search sessions from January 2018 to December 2019. Sessions and their corresponding identifiers are not bound to a timeout. Instead, a session expires in the ISS on termination of the Web browser. In order to determine a realistic session timeout, we decided to treat any activity that follows more than 30 minutes of inactivity as the start of a new session. After this operation, we identified 30,695 dataset retrieval sessions and 34,550 sessions that were focused on publications.

Given a query Q, the ISS returns a list of distinct categories such as "research data", "publications" along with "variables & questions", "instruments & tools" from which a user can choose to retrieve a corresponding result set. For this study, we are interested in those sessions containing queries that led to record views either in the category research data or in publication. We discriminate research data search from publication search in the log based on the type of the succeeding record viewed by the user. We categorise a query as a publication search (or dataset search) if the user has viewed a record of type publication (or research data) immediately after submitting the query to the ISS. Finally, we extract only those sessions that are either of type research data or publication. In total, our preprocessed log file consists of 142,028 rows.
The rows in the log represent queries submitted by users (identified by a session fingerprint) and corresponding record views which are either of type publication or research data. The former type accounts for 79,931 records and the latter for 62,097 records. Certain preprocessing steps are necessary before analysing the transaction log: we remove sessions having queries that are either empty or contain unrecognisable characters (which might result from erroneous encoding).
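
The two preprocessing rules described above (a 30-minute inactivity timeout and labelling a query by the record viewed immediately after it) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the tuple layout `(timestamp, action, value)` and the action names `"search"` and `"view_record"` are hypothetical stand-ins for the real log schema.

```python
from datetime import datetime, timedelta

# Inactivity threshold stated in the text; everything else here is assumed.
TIMEOUT = timedelta(minutes=30)


def split_sessions(events):
    """Split one browser session into sub-sessions on gaps longer than TIMEOUT.

    `events` is a time-sorted list of (timestamp, action, value) tuples,
    a hypothetical layout chosen for this sketch.
    """
    sessions, current, last_ts = [], [], None
    for event in events:
        ts = event[0]
        # A gap exceeding 30 minutes closes the running sub-session.
        if last_ts is not None and ts - last_ts > TIMEOUT:
            sessions.append(current)
            current = []
        current.append(event)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions


def label_queries(session):
    """Pair each query with the type of the record viewed right after it.

    For a "search" event `value` is the query string; for a "view_record"
    event it is the record type ("publication" or "research_data").
    Queries not directly followed by a record view are skipped.
    """
    labelled = []
    for (_, action, value), (_, next_action, next_value) in zip(session, session[1:]):
        if action == "search" and next_action == "view_record":
            labelled.append((value, next_value))
    return labelled
```

Under these assumptions, a query followed two minutes later by a dataset view and another query issued fifty minutes after the first would end up in two separate sessions, each with one labelled query.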

In this section, we present the results of our transaction log analysis. First, we summarize the results of our query characterisation in Section 4.1. We then compare and contrast dataset search and publication search on the basis of session-level information in Section 4.2 and sequential interaction information in Section 4.3.

In this study, we try to differentiate queries on the basis of their search intent (publication or research data search). In Table 1, we present the basic statistics of queries with respect to publications and datasets.

                      Datasets           Publications
Total query count     62,097             79,931
Unique query count    18,706 (30.[...])  [...]
[...]

Table 1: Average statistics comparing queries for dataset and publication search.

The following observations can be drawn from Table 1.

- Publication search is more common than dataset search, with almost 28% more submitted queries, in the ISS. This is in line with the observations already made in [7].
- Dataset search queries are much more repetitive than publication search queries, with 69.88% of queries getting re-issued to the search system; in contrast, queries are repeated less often (58.43%) for publication searches. We can interpret this observation by the variety of forms in representing the information need for publication searches (as compared to dataset search).
- On average, a dataset search query is shorter than a publication search query, measured by the number of characters as well as the number of terms in the query. (Character count is used considering the linguistics of the German language; the queries submitted to the ISS are mixed, some in German and others in English.) This observation is in conflict with the notion presented in [10], where the authors suggested issuing longer queries for dataset search. The reason can be a difference in the experimental settings of our study and [10], where the authors acknowledge the artificial, crowd-sourced nature of their study.
- Queries for dataset search contain numerical digits significantly more often than queries for publication search. Research data includes a significant number of periodic records whose titles mention the period (e.g. allbus 2014, allbus 2016 etc., which refer to a biennial survey conducted since 1980).
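
The query-level statistics discussed above can be computed with a few lines of code. The sketch below is illustrative only: it assumes that the repetition share is one minus the ratio of unique to total queries (which is consistent with Table 1, since 1 - 18,706/62,097 ≈ 0.6988), and the lower-casing and whitespace stripping are an assumed normalisation that the paper does not specify. The function name `query_statistics` is hypothetical.

```python
from collections import Counter


def query_statistics(queries):
    """Summarise the query strings of one search intent.

    Queries are lower-cased and stripped before counting; this
    normalisation is an assumption made for the sketch.
    """
    if not queries:
        raise ValueError("at least one query is required")
    counts = Counter(q.strip().lower() for q in queries)
    total = sum(counts.values())
    unique = len(counts)
    return {
        "total": total,
        "unique": unique,
        # Share of submissions that repeat an already seen query string.
        "repeated_share": 1 - unique / total,
        # Averages are weighted by how often each query was submitted.
        "avg_chars": sum(len(q) * n for q, n in counts.items()) / total,
        "avg_terms": sum(len(q.split()) * n for q, n in counts.items()) / total,
        "share_with_digits": sum(
            n for q, n in counts.items() if any(c.isdigit() for c in q)
        ) / total,
    }
```

Running the function once on the dataset queries and once on the publication queries would give the two columns needed for a comparison of this kind.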

In Table 2, we report the average number of record views for dataset search and publication search in a session. From the table, we can see that the number of record views per session is higher for publication search than for dataset search. This implies that users with a publication search intent are expected to view more items than users with a dataset search intent. In other words, we assume that the information need for a dataset search can be addressed by a comparatively smaller number of record views than that of a publication search.

                                          Datasets   Publications
Avg. record views per session             2.02       2.31
Avg. record views per session (unique)    1.61       2.06

Table 2: Number of record views per session with different search intent.
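
For illustration, the two averages in Table 2 could be derived as sketched below. The input format (a mapping from a session identifier to the list of viewed record identifiers) is hypothetical, and reading "unique" as "distinct records within one session" is an assumption about how the figures were obtained.

```python
def average_record_views(sessions):
    """Return (average views, average distinct views) per session.

    `sessions` maps a session id to the list of record ids viewed in it;
    this structure is assumed for the sketch.
    """
    if not sessions:
        raise ValueError("at least one session is required")
    n = len(sessions)
    avg_all = sum(len(views) for views in sessions.values()) / n
    # Repeated views of the same record count once per session.
    avg_unique = sum(len(set(views)) for views in sessions.values()) / n
    return avg_all, avg_unique
```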

Session Diversity

In a single session, a user could have multiple information needs and might have issued multiple queries to the ISS. In order to identify the diversity of the information need, an elementary way would be to observe the similarity of the issued queries. However, since these are keyword queries, similarity measures based on term overlap, like an IR-based TF-IDF model or a set-based Jaccard similarity model, would perform poorly when computing similarities among queries.

To have a better understanding of the diversity in information needs, an appropriate approach is to inspect the similarity of viewed records: intra-record similarity is inversely proportional to the underlying diversity of a session [1]. We hypothesise that a heterogeneous set of viewed records indicates high diversity.

In a single session S, let a user have viewed a set of records {r_f, ..., r_l} (r_i ∈ {publication, research data}). To determine whether a session can be considered homogeneous, we measure the similarity between the first (r_f) and last (r_l) encountered record. Doing this directly, however, would require a similarity threshold value to be fixed with annotated training data. Instead, we apply the More Like This (MLT) module that is readily available in Elasticsearch (https://bit.ly/MLT-elastic). In the MLT module, similarity is computed using BM25 similarity between a given document (seed document) and all the documents in the collection; it returns a list of documents which are similar in content to the seed document. This approach, in comparison to query similarity, enables us to utilise the set of descriptive metadata to determine the similarity between documents while at the same time being more robust to query modifications.

For a session S, we define a tuple (r_f, r_l) consisting of the first and last viewed record. We consider r_f to be the seed document for the MLT module. For both publications and datasets, we retrieve the top k similar items for the seed (r_f) using the MLT module. If the last viewed record r_l is present within the top k more-like-this records, we consider the session topically homogeneous. However, choosing an appropriate k is crucial in understanding the diversity of the session. For this study, we experiment with setting k to 100 for a lenient understanding, and to 5 for a more rigorous and restricted understanding of diversity.

The result of this analysis is presented graphically in Figure 1. In the figure, the light grey shade corresponds to sessions for which the last record r_l is not found within the top 100 more-like-this records. The blue and dark blue shades indicate the number of sessions for which the last record r_l has been located within the top 100 and the top 5 records, respectively, as returned by the MLT module. Note that this analysis is not applicable to those sessions having only one record view.

Fig. 1: Session diversity at the top 100 and the top 5 similar records.

For dataset search (presented at the bottom of Figure 1), we note that approximately 11% of sessions (specifically, 964) are very focused on a particular topic (dark blue), with the last viewed record (r_l) found within the top 5 more-like-this items.
The last record is found within the top 100 more-like-this items for 2993 sessions (blue), which accounts for 35.[...] MLT entries. The topical diversity and homogeneity of a session for publication and dataset search is even more evident when we consider the similarity scores provided by the MLT module (sim-score(r_f, r_l) > [...], Top5 = 475.[...] for publications was only Top100 = 70.[...]9. From this analysis, we can conclude that dataset retrieval sessions are much more focused than publication search sessions, and the searched datasets in a single session are more densely coupled than the searched publications.
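
The top-k membership test used for session diversity can be sketched as below. `mlt_query` builds a search body for the Elasticsearch `more_like_this` query; the index name and field list passed to it are placeholders, since the paper does not state which metadata fields were used. `classify_session` takes a hypothetical callable `retrieve_similar(seed_id, k)` that returns a ranked list of record identifiers, so the logic can be tried without a running cluster.

```python
def mlt_query(index, seed_id, fields, k):
    """Build a search body asking for the k records most similar to a seed.

    `index` and `fields` are placeholders; the fields actually used in
    the study are not given in the text.
    """
    return {
        "size": k,
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "_id": seed_id}],
            }
        },
    }


def classify_session(first_id, last_id, retrieve_similar, lenient_k=100, strict_k=5):
    """Classify a session by where its last record ranks for the first one.

    `retrieve_similar(seed_id, k)` is a hypothetical callable returning a
    ranked list of record ids (most similar first).
    """
    ranked = retrieve_similar(first_id, lenient_k)
    if last_id in ranked[:strict_k]:
        return "top5"      # strict reading: very focused session
    if last_id in ranked:
        return "top100"    # lenient reading: topically homogeneous
    return "diverse"       # last record not among the similar records
```

Sessions with a single record view have no distinct first and last record and would need to be filtered out before calling `classify_session`, in line with the note above.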

In this section, we study differences between dataset and document search on the basis of interaction sequences. We present this using Sankey diagrams in Figure 2. The diagrams represent the transitions of the first eight interactions of users when searching for publications (Figure 2a) and datasets (Figure 2b); a high-resolution figure is available at https://arxiv.org/abs/2006.02770. In the ISS, it is possible to switch between object types (e.g. from searching for research data to publication search). Hence, we extracted only those sessions from the log having a focus either on publications or on datasets without switching the type in between. Each logged interaction is associated with an action label which describes the type of action a user has performed ("view record", "search" etc.). An in-depth explanation of this analysis technique can be found in [8].

Fig. 2: First eight interaction transitions for (a) publication and (b) dataset search. The interactions are color-coded: green accounts for searching, blue for record view, orange for download (i.e. an implicit relevance signal) and grey for other interactions. Implicit relevance signals indicate a higher degree of relevance suggested by an interaction such as "export citation" immediately after a search. A high-resolution diagram is presented in the Appendix.

The analysis of the interaction sequence (see Figure 2) shows no substantial differences between datasets and publications in terms of interaction paths. For both types, the most frequent interactions after an initial search (green) were either a view record (blue) or another search. Differences, however, can be found in two aspects: a) the frequency of subsequent searches (green) is higher for publications; b) the number of implicit relevance signals is notably higher for dataset search. One can observe that a large fraction of dataset searches contain interactions related to the download of a record, which is especially visible in the third interaction for datasets. Further query reformulations are less frequent for dataset searches (flow into green from any other). A possible explanation for this can be that a major portion of dataset searches appear to be known-item based. This observation is in line with our earlier observations on session diversity analysis (see Section 4.2).
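
The flows of a Sankey diagram such as Figure 2 are transition counts between consecutive interactions. The sketch below shows one way these counts could be collected for the first eight interactions of each session; it is not the procedure of [8], and the action labels and the function name `transition_counts` are hypothetical.

```python
from collections import Counter


def transition_counts(sessions, max_steps=8):
    """Count transitions among the first `max_steps` interactions.

    `sessions` is an iterable of action-label lists, one per session.
    The result maps (step, source action, target action) to a frequency,
    where step 1 is the transition from the first to the second interaction.
    """
    counts = Counter()
    for actions in sessions:
        head = actions[:max_steps]
        # Pair every interaction with its successor inside the window.
        for step, (src, dst) in enumerate(zip(head, head[1:]), start=1):
            counts[(step, src, dst)] += 1
    return counts
```

Eight interactions yield at most seven transitions per session, so longer sessions are truncated rather than dropped. Computing the counts separately for publication and dataset sessions gives the two panels to compare.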

In this study, we presented an analysis of search logs from an integrated search system containing both documents and datasets as repositories. In contrast to a similar study [10], we experimented with real-life queries issued by social scientists with a defined information need. Further, we argue that the reported analysis is more factual, in accordance with the observations made in [9]. From our study, we observe that queries addressing a publication are more frequent and less repetitive in comparison to dataset searches. Also, the average number of record views during dataset search is lower compared to publication searches. In terms of segregating search intents between a dataset and a publication search, we note that there are barely any distinctive features to characterize a query. As part of future work, we would like to utilise the session information to personalise retrieval sessions, which can further be used to construct a specialised recommender system for dataset retrieval.

Acknowledgement

This work was funded by DFG under grant MA 3964/10-1, the "Establishing Contextual Dataset Retrieval - transferring concepts from document to dataset retrieval" (ConDATA) project, http://bit.ly/Condata.

References

1. Angel, A., Koudas, N.: Efficient diversity-aware search. In: Proc. of ACM SIGMOD. pp. 781–792 (2011)
2. Brickley, D., Burgess, M., Noy, N.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference. pp. 1365–1375. WWW '19 (2019)
3. Cafarella, M.J., Halevy, A., Madhavan, J.: Structured data on the web. Commun. ACM (2), 72–79 (Feb 2011)
4. Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G., Ibáñez, L.D., Kacprzak, E., Groth, P.: Dataset search: a survey. VLDB J. (1), 251–272 (2020)
5. Chen, J., Wang, X., Cheng, G., Kharlamov, E., Qu, Y.: Towards More Usable Dataset Search: From Query Characterization to Snippet Generation. In: Proceedings of the 28th CIKM '19. pp. 2445–2448 (2019)
6. Codd, E.: Relational completeness of data base sublanguages. Computer Sciences (1972)
7. Hienert, D., Kern, D., Boland, K., Zapilko, B., Mutschke, P.: A Digital Library for Research Data and Related Information in the Social Sciences. In: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). pp. 148–157 (2019)
8. Hienert, D., Mutschke, P.: A usefulness-based approach for measuring the local and global effect of IIR services. In: Proc. of 2016 ACM CHIIR. pp. 153–162 (2016)
9. Jansen, B.J., Spink, A.: How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manage. (1), 248–263 (2006)
10. Kacprzak, E., Koesten, L., Tennison, J., Simperl, E.: Characterising Dataset Search Queries. In: Companion of WWW '18. pp. 1485–1488. ACM Press (2018)
11. Kern, D., Mathiak, B.: Are there any differences in data set retrieval compared to well-known literature retrieval? In: Research and Advanced Technology for Digital Libraries. pp. 197–208 (2015)
12. Koesten, L., Mayr, P., Groth, P., Simperl, E., de Rijke, M.: Report on the DATA:SEARCH'18 workshop - Searching Data on the Web. SIGIR Forum (2), 117–124 (2018)

Appendix

[Figure: high-resolution version of the interaction diagrams in Fig. 2; image not recoverable]