Publication


Featured research published by Craig Willis.


Proceedings of the American Society for Information Science and Technology | 2014

Scholar-built collections: A study of user requirements for research in large-scale digital libraries

Katrina Fenlon; Megan Senseney; Harriett E. Green; Sayan Bhattacharyya; Craig Willis; J. S. Downie

To realize the great potential value of large-scale digital libraries, we need a fuller understanding of the range of ways in which scholarly communities conduct research, or want to conduct research within them. Scholars build collections in the course of their work. How can we anticipate and support various kinds of collection-building and -use, in order to support the diversity of researchers who work in libraries of digital books? This paper reports selected results of a study of how potential user groups of the HathiTrust Digital Library create and use collections in their research. This study aims to contribute to our broader understanding of scholarly practice, particularly of humanities scholars’ collecting activities. The results of the study inform ongoing work to develop a workset-creation tool for the HathiTrust Research Center.


Journal of the Association for Information Science and Technology | 2013

A random walk on an ontology: Using thesaurus structure for automatic subject indexing

Craig Willis; Robert M. Losee

Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. The primary goal of this research is the analysis of the contribution of thesaurus structure to the indexing process. The resulting models are evaluated in the context of automatic subject indexing using four collections of documents pre‐indexed with 4 different thesauri (AGROVOC [UN Food and Agriculture Organization], high‐energy physics taxonomy [HEP], National Agricultural Library Thesaurus [NALT], and medical subject headings [MeSH]). We also introduce a thesaurus‐centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the 4 thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each of the thesauri and the manual indexing associated with it is characterized using the methods developed here.
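The abstract above does not spell out the walk itself, so the following is only a minimal sketch of one common form of weighted random walk (a random walk with restart over a weighted term graph), not the authors' algorithm. The function name `weighted_random_walk`, the toy thesaurus, the edge weights, and the damping value are all illustrative assumptions.

```python
def weighted_random_walk(graph, seeds, damping=0.85, iterations=50):
    """Score thesaurus concepts with a random walk with restart.

    graph: {concept: {related_concept: relationship_weight}}
    seeds: {concept: weight} for concepts matched in the document.
    Generic sketch; the weighting scheme is an assumption.
    """
    nodes = set(graph) | {n for nbrs in graph.values() for n in nbrs} | set(seeds)
    seed_total = sum(seeds.values())
    if seed_total <= 0:
        raise ValueError("at least one seed concept with positive weight is required")
    # The walk restarts at the matched concepts, in proportion to their weight.
    restart = {n: seeds.get(n, 0.0) / seed_total for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for node in nodes:
            out = graph.get(node, {})
            total = sum(out.values())
            if total == 0:
                # Concept with no outgoing relationships: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * scores[node] * restart[n]
                continue
            for nbr, weight in out.items():
                # Follow each relationship in proportion to its weight.
                nxt[nbr] += damping * scores[node] * weight / total
        scores = nxt
    return scores


# Hypothetical thesaurus fragment; broader/narrower links weighted 1.0, related links 0.5.
toy_thesaurus = {
    "maize": {"cereals": 1.0},
    "wheat": {"cereals": 1.0},
    "cereals": {"maize": 1.0, "wheat": 1.0, "crops": 0.5},
    "crops": {"cereals": 0.5, "livestock": 0.5},
    "livestock": {"crops": 0.5},
}
toy_scores = weighted_random_walk(toy_thesaurus, {"maize": 1.0, "wheat": 1.0})
```

In this toy graph, "cereals" never appears in the document but is linked to both matched terms, so it ends up ranked above "crops" and "livestock". That is the sense in which vocabulary structure can surface candidate concepts beyond direct matching.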


Journal of Documentation | 2014

HIVEing: the effect of a semantic web technology on inter-indexer consistency

Hollie White; Craig Willis; Jane Greenberg

The purpose of this paper is to examine the effect of the Helping Interdisciplinary Vocabulary Engineering (HIVE) system on the inter-indexer consistency of information professionals when assigning keywords to a scientific abstract. This study examined first, the inter-indexer consistency of potential HIVE users; second, the impact HIVE had on consistency; and third, challenges associated with using HIVE. A within-subjects quasi-experimental research design was used for this study. Data were collected using a task-scenario based questionnaire. Analysis was performed on consistency results using Hooper's and Rolling's inter-indexer consistency measures. A series of t-tests was used to judge the significance between consistency measure results. Results suggest that HIVE improves inter-indexing consistency. Working with HIVE increased consistency rates by 22 percent (Rolling's) and 25 percent (Hooper's) when selecting relevant terms from all vocabularies. A statistically significant difference exists between the assignment of free-text keywords and machine-aided keywords. Issues with homographs, disambiguation, vocabulary choice, and document structure were all identified as potential challenges. Research limitations for this study can be found in the small number of vocabularies used for the study. Future research will include implementing HIVE into the Dryad Repository and studying its application in a repository system. This paper showcases several features used in the HIVE system. By using traditional consistency measures to evaluate a semantic web technology, this paper emphasizes the link between traditional indexing and next generation machine-aided indexing (MAI) tools.
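For readers unfamiliar with the two consistency measures named in this abstract, here is a small sketch using their standard textbook definitions: Hooper's measure is the number of terms two indexers share divided by the number of distinct terms either assigned, and Rolling's measure is twice the shared terms divided by the total terms both assigned. The function names and the sample keyword sets are illustrative assumptions and are not taken from the study.

```python
def hooper_consistency(terms_a, terms_b):
    """Hooper's measure: shared terms / distinct terms used by either indexer."""
    a, b = set(terms_a), set(terms_b)
    union = len(a | b)
    # Returning 0.0 when neither indexer assigned a term is a convention chosen here.
    return len(a & b) / union if union else 0.0


def rolling_consistency(terms_a, terms_b):
    """Rolling's measure: 2 * shared terms / (terms by A + terms by B)."""
    a, b = set(terms_a), set(terms_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical keyword assignments for one abstract by two indexers.
indexer_1 = {"ecology", "climate", "birds", "migration"}
indexer_2 = {"climate", "birds", "habitat"}

hooper_score = hooper_consistency(indexer_1, indexer_2)    # 2 shared / 5 distinct
rolling_score = rolling_consistency(indexer_1, indexer_2)  # 2 * 2 shared / 7 assigned
```

Rolling's measure is never lower than Hooper's for the same pair of indexers, so percentage gains reported under the two measures are not directly comparable.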


ASIST '13 Proceedings of the 76th ASIS&T Annual Meeting: Beyond the Cloud: Rethinking Information Boundaries | 2013

Finding information in books: characteristics of full-text searches in a collection of 10 million books

Craig Willis; Miles Efron

Searching large collections of digitized books is a relatively new area in information-seeking and retrieval research, made possible by initiatives such as Google Books and the HathiTrust Digital Library. The availability of large full-text book collections is transforming how users search and interact with information in books, but the characteristics of these changes are unknown. This paper aims to provide insight into the characteristics of full-text searches in a large collection of digitized books and is the first step in a broader research agenda intended to improve book retrieval. To better understand the types of queries that users are issuing to full-text-book collections, we analyzed a full year of anonymized query logs from the HathiTrust Digital Library full-text search engine. We also manually classified a random sample of 600 queries to develop a taxonomy of book search query types. We found that users are beginning to search for information in books instead of searching for books. Searches still largely follow bibliographic models, but, as expected, new types of searches are beginning to take advantage of full-text capabilities. Additionally, comparing the results of our query log analysis to searches in other domains, we found similar search patterns including short queries, sessions with only a few queries, and users viewing only a few pages of results per query. We discuss how these findings can be used to characterize users of large full-text book collections.


Proceedings of the Practice and Experience on Advanced Research Computing | 2018

TERRA-REF Data Processing Infrastructure

Maxwell Burnette; Rob Kooper; J. D. Maloney; Gareth S. Rohde; Jeffrey A. Terstriep; Craig Willis; Noah Fahlgren; Todd C. Mockler; Maria Newcomb; Vasit Sagan; Pedro Andrade-Sanchez; Nadia Shakoor; Paheding Sidike; Rick Ward; David LeBauer

The Transportation Energy Resources from Renewable Agriculture Phenotyping Reference Platform (TERRA-REF) provides a data and computation pipeline responsible for collecting, transferring, processing and distributing large volumes of crop sensing and genomic data from genetically informative germplasm sets. The primary source of these data is a field scanner system built over an experimental field at the University of Arizona Maricopa Agricultural Center. The scanner uses several different sensors to observe the field at a dense collection frequency with high resolution. These sensors include RGB stereo, thermal, pulse-amplitude modulated chlorophyll fluorescence, imaging spectrometer cameras, a 3D laser scanner, and environmental monitors. In addition, data from sensors mounted on tractors, UAVs, an indoor controlled-environment facility, and manually collected measurements are integrated into the pipeline. Up to two TB of data per day are collected and transferred to the National Center for Supercomputing Applications at the University of Illinois (NCSA) where they are processed. In this paper we describe the technical architecture for the TERRA-REF data and computing pipeline. This modular and scalable pipeline provides a suite of components to convert raw imagery to standard formats, geospatially subset data, and identify biophysical and physiological plant features related to crop productivity, resource use, and stress tolerance. Derived data products are uploaded to the Clowder content management system and the BETYdb traits and yields database for querying, supporting research at an experimental plot level. All software is open source under a BSD 3-clause or similar license and the data products are open access (currently for evaluation with a full release in fall 2019). In addition, we provide computing environments in which users can explore data and develop new tools. The goal of this system is to enable scientists to evaluate and use data, create new algorithms, and advance the science of digital agriculture and crop improvement.


Proceedings of the 2012 iConference | 2012

The HIVE impact: contributing to consistency via automatic indexing

Hollie White; Craig Willis; Jane Greenberg

Research has shown that automatic subject indexing is more efficient and consistent than manual indexing; yet many organizations continue to use manual indexing because of the unacceptable quality of automatically produced results. This poster presents the results of an exploratory experiment examining consistency stemming from a machine-aided indexing approach. The HIVE vocabulary server was used to present concepts to 31 workshop participants. The presentation of terms via an automatic sequence reduced the indexer burden and contributed to increased consistency. This poster reports initial results and provides a framework for further exploration of automatic indexing in manual workflows.


Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact | 2017

Container-based Analysis Environments for Low-Barrier Access to Research Data

Craig Willis; Mike Lambert; Kenton McHenry; Christine Kirkpatrick

The growing size of high-value sensor-born or computationally derived scientific datasets is pushing the boundaries of traditional models of data access and discovery. Due to their size, these datasets are often accessible only through the systems on which they were created. Access for scientific exploration and reproducibility is limited to file transfer or by applying for access to the systems used to store or generate the original data, which is often infeasible. There is a growing trend toward providing access to large-scale research datasets in-place via container-based analysis environments. This paper describes the National Data Service (NDS) Labs Workbench platform and DataDNS initiative. The Labs Workbench platform is designed to provide scalable and low-barrier access to research data via container-based services. The DataDNS effort is a new initiative designed to enable discovery, access, and in-place analysis for large-scale data, providing a suite of interoperable services to enable researchers, as well as the tools they are most familiar with, to access and analyze these datasets where they reside.


Association for Information Science and Technology | 2016

What makes a query temporally sensitive?

Craig Willis; Garrick Sherman; Miles Efron

This work examines factors that affect manual classifications of “temporally sensitive” information needs. We introduce the concepts of temporal relevance and temporal topicality to differentiate between different aspects of temporal retrieval research. We use qualitative and quantitative techniques to analyze 660 topics from the Text Retrieval Conference (TREC) previously used in the experimental evaluation of temporal retrieval models. We use regression analysis to model previous manual classifications. We identify factors and potential problems with previous classifications, proposing principles and guidelines for future work on the evaluation of temporal retrieval models.


Archive | 2014

Using Collections and Worksets in Large-Scale Corpora: Preliminary Findings from the Workset Creation for Scholarly Analysis Project

Harriett E. Green; Katrina Fenlon; Megan Senseney; Sayan Bhattacharyya; Craig Willis; Peter Organisciak; J. Stephen Downie; Timothy W. Cole; Beth Plale

Scholars from numerous disciplines rely on collections of texts to support research activities. On this diverse and interdisciplinary frontier of digital scholarship, libraries and information institutions must 1) prepare to support research using large collections of digitized texts, and 2) understand the different methods of analysis being applied to the collections of digitized text across disciplines. The HathiTrust Research Center’s Workset Creation for Scholarly Analysis (WCSA) project conducted a series of focus groups and interviews to analyze and understand the scholarly practices of researchers who use large-scale digital text corpora. This poster presents preliminary findings from that study, which offers early insights into user requirements for scholarly research with textual corpora.


Journal of the Association for Information Science and Technology | 2012

Analysis and synthesis of metadata goals for scientific data

Craig Willis; Jane Greenberg; Hollie White

Collaboration


Dive into Craig Willis's collaborations.

Top Co-Authors

Beth Plale

Indiana University Bloomington

Robert M. Losee

University of North Carolina at Chapel Hill

Jaime Arguello

University of North Carolina at Chapel Hill

Jin Ha Lee

University of Washington
