Jake Ryland Williams
University of Vermont
Publication
Featured research published by Jake Ryland Williams.
Proceedings of the National Academy of Sciences of the United States of America | 2015
Peter Sheridan Dodds; Eric M. Clark; Suma Desu; Morgan R. Frank; Andrew J. Reagan; Jake Ryland Williams; Lewis Mitchell; Kameron Decker Harris; Isabel M. Kloumann; James P. Bagrow; Karine Megerdoomian; Matthew T. McMahon; Brian F. Tivnan; Christopher M. Danforth
Significance: The most commonly used words of 24 corpora across 10 diverse human languages exhibit a clear positive bias, a big-data confirmation of the Pollyanna hypothesis. The study's findings are based on 5 million individual human scores and pave the way for the development of powerful language-based tools for measuring emotion.

Using human evaluation of 100,000 words spread across 24 corpora in 10 languages diverse in origin and culture, we present evidence of a deep imprint of human sociality in language, observing that (i) the words of natural human language possess a universal positivity bias, (ii) the estimated emotional content of words is consistent between languages under translation, and (iii) this positivity bias is strongly independent of frequency of word use. Alongside these general regularities, we describe interlanguage variations in the emotional spectrum of languages that allow us to rank corpora. We also show how our word evaluations can be used to construct physical-like instruments for both real-time and offline measurement of the emotional content of large-scale texts.
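The measurement idea the abstract describes — scoring a text by the frequency-weighted average happiness of its words — can be sketched in a few lines. The toy lexicon and scores below are invented for illustration; the study's actual instrument uses roughly 10,000 human-rated words per language:

```python
from collections import Counter

# Toy happiness lexicon on the study's 1-9 scale (values invented here).
toy_scores = {"love": 8.4, "happy": 8.3, "the": 5.0, "war": 2.0, "sad": 2.4}

def average_happiness(text, scores):
    """Frequency-weighted mean happiness of the scored words in a text.

    Returns None when no word in the text has a score.
    """
    counts = Counter(w for w in text.lower().split() if w in scores)
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(scores[w] * n for w, n in counts.items()) / total

print(average_happiness("the happy happy war", toy_scores))  # → 5.9
```

Restricting the sum to scored words is what makes the instrument usable on arbitrary large-scale text: unscored words simply contribute nothing.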
PLOS ONE | 2016
Eric M. Clark; Christopher A. Jones; Jake Ryland Williams; Allison N. Kurti; Mitchell C. Norotsky; Christopher M. Danforth; Peter Sheridan Dodds
Background: Twitter has become the "wild west" of marketing and promotional strategies for advertisement agencies. Electronic cigarettes have been heavily marketed across Twitter feeds, offering discounts, "kid-friendly" flavors, algorithmically generated false testimonials, and free samples.

Methods: All electronic-cigarette-keyword-related tweets from a 10% sample of Twitter spanning January 2012 through December 2014 (approximately 850,000 total tweets) were identified and categorized as Automated or Organic by combining a keyword classification and a machine-trained Human Detection algorithm. A sentiment analysis using Hedonometrics was performed on Organic tweets to quantify the change in consumer sentiments over time. Commercialized tweets were topically categorized with key phrasal pattern matching.

Results: The overwhelming majority (80%) of tweets were classified as automated or promotional in nature. The majority of these tweets were coded as commercialized (83.65% in 2013), up to 33% of which offered discounts or free samples and appeared on over a billion Twitter feeds as impressions. The positivity of Organic (human) classified tweets decreased over time (5.84 in 2013 to 5.77 in 2014) due to a relative increase in the negative words 'ban', 'tobacco', 'doesn't', 'drug', 'against', 'poison', and 'tax', and a relative decrease in positive words like 'haha', 'good', and 'cool'. Automated tweets are more positive than organic (6.17 versus 5.84) due to a relative increase in marketing words like 'best', 'win', 'buy', 'sale', 'health', and 'discount', and a relative decrease in negative words like 'bad', 'hate', 'stupid', and 'don't'.

Conclusions: Given the youth presence on Twitter and the clinical uncertainty surrounding the long-term health complications of electronic cigarette consumption, the protection of public health warrants scrutiny and potential regulation of social media marketing.
Journal of Computational Science | 2016
Eric M. Clark; Jake Ryland Williams; Christopher A. Jones; Richard A. Galbraith; Christopher M. Danforth; Peter Sheridan Dodds
Twitter, a popular social media outlet, has evolved into a vast source of linguistic data, rich with opinion, sentiment, and discussion. Due to the increasing popularity of Twitter, its perceived potential for exerting social influence has led to the rise of a diverse community of automatons, commonly referred to as bots. These inorganic and semi-organic Twitter entities can range from the benevolent (e.g., weather-update bots, help-wanted-alert bots) to the malevolent (e.g., spamming messages, advertisements, or radical opinions). Existing detection algorithms typically leverage meta-data (time between tweets, number of followers, etc.) to identify robotic accounts. Here, we present a powerful classification scheme that exclusively uses the natural language text from organic users to provide a criterion for identifying accounts posting automated messages. Since the classifier operates on text alone, it is flexible and may be applied to any textual data beyond the Twitter-sphere.
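A text-only criterion like the one the abstract describes can be illustrated with any standard bag-of-words classifier. The tiny training corpus and the use of naive Bayes below are both invented for illustration and are not the paper's actual model or data:

```python
import math
from collections import Counter

# Invented toy corpus: (text, label) pairs standing in for tweets.
train = [
    ("buy now discount sale free", "bot"),
    ("best deal buy discount", "bot"),
    ("haha that was so good", "human"),
    ("i love this song haha", "human"),
]

def fit(corpus):
    """Per-class word counts for a naive Bayes model."""
    counts = {"bot": Counter(), "human": Counter()}
    for text, label in corpus:
        counts[label].update(text.split())
    return counts

def classify(text, counts):
    """Most likely class under add-one smoothing.

    Class priors are equal on this toy corpus and are omitted.
    """
    vocab = set().union(*counts.values())
    best, best_lp = None, -math.inf
    for label, c in counts.items():
        denom = sum(c.values()) + len(vocab)
        lp = sum(math.log((c[w] + 1) / denom) for w in text.split())
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(classify("discount sale buy", model))  # → bot
```

Because the features are words alone, the same pipeline transfers to any textual data, which is the flexibility the abstract emphasizes.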
Scientific Reports | 2015
Jake Ryland Williams; Paul R. Lessard; Suma Desu; Eric M. Clark; James P. Bagrow; Christopher M. Danforth; Peter Sheridan Dodds
With Zipf’s law being originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over no more than three to four orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that phrases of one or more words comprise the most coherent units of meaning in language, we show empirically that Zipf’s law for phrases extends over as many as nine orders of rank magnitude. In doing so, we develop a principled and scalable statistical mechanical method of random text partitioning, which opens up a rich frontier of rigorous text analysis via a rank ordering of mixed length phrases.
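The random-partitioning idea can be sketched directly: cut or keep each word boundary independently, then count the resulting mixed-length phrases. The boundary probability `q` and the repeated pangram below are arbitrary illustration choices, not the paper's calibrated method:

```python
import random
from collections import Counter

def random_partition(words, q=0.5, rng=random):
    """Split a word sequence into phrases by cutting each internal
    boundary independently with probability q (word-conserving)."""
    phrases, current = [], [words[0]]
    for w in words[1:]:
        if rng.random() < q:
            phrases.append(" ".join(current))
            current = [w]
        else:
            current.append(w)
    phrases.append(" ".join(current))
    return phrases

rng = random.Random(0)
text = ("the quick brown fox jumps over the lazy dog " * 200).split()
phrase_counts = Counter(random_partition(text, q=0.5, rng=rng))

# Rank-ordered frequencies over mixed-length phrases, ready for a Zipf plot.
for phrase, n in phrase_counts.most_common(3):
    print(n, repr(phrase))
```

Every word lands in exactly one phrase, so the partition conserves the text; ranking `phrase_counts` by frequency gives the mixed-length rank ordering the abstract refers to.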
PLOS ONE | 2017
Sharon E. Alajajian; Jake Ryland Williams; Andrew J. Reagan; Stephen C. Alajajian; Morgan R. Frank; Lewis Mitchell; Jacob Lahne; Christopher M. Danforth; Peter Sheridan Dodds
We propose and develop a Lexicocalorimeter: an online, interactive instrument for measuring the “caloric content” of social media and other large-scale texts. We do so by constructing extensive yet improvable tables of food and activity related phrases, and respectively assigning them with sourced estimates of caloric intake and expenditure. We show that for Twitter, our naive measures of “caloric input”, “caloric output”, and the ratio of these measures are all strong correlates with health and well-being measures for the contiguous United States. Our caloric balance measure in many cases outperforms both its constituent quantities; is tunable to specific health and well-being measures such as diabetes rates; has the capability of providing a real-time signal reflecting a population’s health; and has the potential to be used alongside traditional survey data in the development of public policy and collective self-awareness. Because our Lexicocalorimeter is a linear superposition of principled phrase scores, we also show we can move beyond correlations to explore what people talk about in collective detail, and assist in the understanding and explanation of how population-scale conditions vary, a capacity unavailable to black-box type methods.
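The instrument's core arithmetic is a pair of phrase-matched sums and their ratio. A minimal sketch with invented single-word calorie tables (the paper's actual tables are extensive, sourced, and phrase-based):

```python
# Invented toy tables: phrase -> estimated calories consumed or burned.
intake = {"pizza": 285, "donut": 240, "salad": 100}   # kcal per serving
expenditure = {"running": 600, "walking": 250}        # kcal per hour

def caloric_balance(text):
    """Return (caloric input, caloric output, input/output ratio)."""
    words = text.lower().split()
    cin = sum(intake.get(w, 0) for w in words)
    cout = sum(expenditure.get(w, 0) for w in words)
    ratio = cin / cout if cout else float("inf")
    return cin, cout, ratio

print(caloric_balance("pizza donut then running"))  # → (525, 600, 0.875)
```

Because the score is a linear superposition of per-phrase contributions, any region's reading can be decomposed back into the food and activity phrases that drove it, which is the interpretability advantage over black-box methods the abstract notes.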
Physical Review E | 2015
Jake Ryland Williams; Eric M. Clark; James P. Bagrow; Christopher M. Danforth; Peter Sheridan Dodds
In an effort to better understand meaning from natural language texts, we explore methods aimed at organizing lexical objects into contexts. A number of these methods for organization fall into a family defined by word ordering. Unlike demographic or spatial partitions of data, these collocation models are of special importance for their universal applicability. While we are interested here in text and have framed our treatment appropriately, our work is potentially applicable to other areas of research (e.g., speech, genomics, and mobility patterns) where one has ordered categorical data (e.g., sounds, genes, and locations). Our approach focuses on the phrase (whether word or larger) as the primary meaning-bearing lexical unit and object of study. To do so, we employ our previously developed framework for generating word-conserving phrase-frequency data. Upon training our model with the Wiktionary, an extensive, online, collaborative, and open-source dictionary that contains over 100000 phrasal definitions, we develop highly effective filters for the identification of meaningful, missing phrase entries. With our predictions we then engage the editorial community of the Wiktionary and propose short lists of potential missing entries for definition, developing a breakthrough, lexical extraction technique and expanding our knowledge of the defined English lexicon of phrases.
Physical Review E | 2017
Peter Sheridan Dodds; David Rushing Dewhurst; Fletcher F. Hazlehurst; Colin M. Van Oort; Lewis Mitchell; Andrew J. Reagan; Jake Ryland Williams; Christopher M. Danforth
Herbert Simon's classic rich-get-richer model is one of the simplest empirically supported mechanisms capable of generating heavy-tail size distributions for complex systems. Simon argued analytically that a population of flavored elements growing by either adding a novel element or randomly replicating an existing one would afford a distribution of group sizes with a power-law tail. Here, we show that, in fact, Simon's model does not produce a simple power-law size distribution, as the initial element has a dominant first-mover advantage and will be overrepresented by a factor proportional to the inverse of the innovation probability. The first group's size discrepancy cannot be explained away as a transient of the model, and may therefore be many orders of magnitude greater than expected. We demonstrate how Simon's analysis was correct but incomplete, and expand our alternate analysis to quantify the variability of long-term rankings for all groups. We find that the expected time for a first replication is infinite, and show how an incipient group must break the mechanism to improve its odds of success. We present an example of citation counts for a specific field that demonstrates a first-mover advantage consistent with our revised view of the rich-get-richer mechanism. Our findings call for a reexamination of preceding work invoking Simon's model and provide an expanded understanding going forward.
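Simon's mechanism is straightforward to simulate: at each step, with innovation probability rho start a new group of size one, otherwise replicate a uniformly chosen existing element. The step count and rho below are arbitrary illustration parameters:

```python
import random

def simon_model(steps, rho, rng):
    """Simulate Simon's rich-get-richer growth; return final group sizes."""
    groups = [1]   # group sizes; the first group starts with one element
    owners = [0]   # owners[i] = group index of element i
    for _ in range(steps):
        if rng.random() < rho:
            groups.append(1)                      # innovation: new group
            owners.append(len(groups) - 1)
        else:
            g = owners[rng.randrange(len(owners))]  # replicate random element
            groups[g] += 1
            owners.append(g)
    return groups

rng = random.Random(42)
sizes = simon_model(steps=50_000, rho=0.1, rng=rng)
print(sorted(sizes, reverse=True)[:3], "of", len(sizes), "groups")
```

Runs like this typically show the first group far larger than a pure power-law fit to the remaining groups would predict, consistent with the first-mover advantage the abstract describes; averaging many runs makes the overrepresentation factor visible.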
Proceedings of the National Academy of Sciences of the United States of America | 2015
Peter Sheridan Dodds; Eric M. Clark; Suma Desu; Morgan R. Frank; Andrew J. Reagan; Jake Ryland Williams; Lewis Mitchell; Kameron Decker Harris; Isabel M. Kloumann; James P. Bagrow; Karine Megerdoomian; Matthew T. McMahon; Brian F. Tivnan; Christopher M. Danforth
The concerns expressed by Garcia et al. (1) are misplaced due to a range of misconceptions about word usage frequency, word rank, and expert-constructed word lists such as LIWC (Linguistic Inquiry and Word Count) (2). We provide a complete response in our paper's online appendices (3). Garcia et al. (1) suggest that the set of function words in the LIWC dataset (2) shows a wide spectrum of average happiness with positive skew (figure 1A in ref. 1) when, according to their interpretation, these words should exhibit a Dirac δ function located at neutral (havg = 5 on a 1–9 scale). However, many words tagged as function words in the LIWC dataset readily elicit an emotional response in raters, as exemplified by "greatest" (havg = 7.26), "best" (havg = 7.26), "negative" (havg = 2.42), and "worst" (havg = 2.10). In our study (3), basic function words that are expected to be neutral, such as "the" (havg = 4.98) and "to" (havg = 4.98), were appropriately scored as such. Moreover, no meaningful statement about biases can be made for sets of words chosen without frequency of use properly incorporated.
PLOS ONE | 2017
Erjia Yan; Jake Ryland Williams; Zheng Chen
Publication metadata help deliver rich analyses of scholarly communication. However, research concepts and ideas are more effectively expressed through unstructured fields such as full texts. Thus, the goals of this paper are to employ a full-text enabled method to extract terms relevant to disciplinary vocabularies, and through them, to understand the relationships between disciplines. This paper uses an efficient, domain-independent term extraction method to extract disciplinary vocabularies from a large multidisciplinary corpus of PLoS ONE publications. It finds a power-law pattern in the frequency distributions of terms present in each discipline, indicating a semantic richness potentially sufficient for further study and advanced analysis. The salient relationships amongst these vocabularies become apparent in application of a principal component analysis. For example, Mathematics and Computer and Information Sciences were found to have similar vocabulary use patterns, along with Engineering and Physics; while Chemistry and the Social Sciences were found to exhibit contrasting vocabulary use patterns, along with the Earth Sciences and Chemistry. These results have implications for studies of scholarly communication as scholars attempt to identify the epistemological cultures of disciplines, and as a full text-based methodology could lead to machine learning applications in the automated classification of scholarly work according to disciplinary vocabularies.
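A power-law pattern in a discipline's term frequencies can be checked with a simple rank-frequency regression: fit log(frequency) against log(rank) by least squares. The synthetic frequencies below are generated to follow f ∝ 1/r for the sake of illustration and are not drawn from the paper's corpus:

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) vs log(rank) for a
    rank-ordered frequency list; a slope near -1 is Zipf-like."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Synthetic Zipfian frequencies f(r) = 1000 / r for ranks 1..500.
freqs = [1000 / r for r in range(1, 501)]
print(round(loglog_slope(freqs), 3))  # → -1.0 for pure 1/r data
```

Running this per-discipline over extracted term counts gives one tail exponent per vocabulary, a compact summary that can feed downstream comparisons such as the principal component analysis the abstract describes.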
Physical Review E | 2015
Jake Ryland Williams; James P. Bagrow; Christopher M. Danforth; Peter Sheridan Dodds