Papers

For additional work by our group on a wide range of topics, see the publications listed on Prof. Dredze's website as well as the Center for Language and Speech Processing.
      Glen Coppersmith, Mark Dredze, Craig Harman. Quantifying Mental Health Signals in Twitter. ACL Workshop on Computational Linguistics and Clinical Psychology, 2014.
      Glen Coppersmith, Craig Harman, Mark Dredze. Measuring Post Traumatic Stress Disorder in Twitter. International Conference on Weblogs and Social Media (ICWSM), 2014.
      John W. Ayers, Benjamin M. Althouse, Mark Dredze. Could Behavioral Medicine Lead the Web Data Revolution?. Journal of the American Medical Association (JAMA), 2014. [PDF]
      John W. Ayers, Benjamin M. Althouse, Morgan Johnson, Mark Dredze, Joanna E. Cohen. What's the Healthiest Day? Circaseptan (Weekly) Rhythms in Healthy Considerations. American Journal of Preventive Medicine,, 2014.
      Ben Althouse, Jon-Patrick Allem, Matt Childers, Mark Dredze, John W Ayers. Population Health Concerns During the United States' Great Recession. American Journal of Preventive Medicine, 2014;46(2):166-170.
Expand Me    David Broniatowski, Michael J. Paul, Mark Dredze. National and Local Influenza Surveillance through Twitter: An Analysis of the 2012-2013 Influenza Epidemic. PLOS ONE, 2013. [PDF]
Social media have been proposed as a data source for influenza surveillance because they have the potential to offer real-time access to millions of short, geographically localized messages containing information regarding personal well-being. However, accuracy of social media surveillance systems declines with media attention because media attention increases ``chatter'' -- messages that are about influenza but that do not pertain to an actual infection -- masking signs of true influenza prevalence. This paper summarizes our recently developed influenza infection detection algorithm that automatically distinguishes relevant tweets from other chatter, and we describe our current influenza surveillance system which was actively deployed during the full 2012-2013 influenza season. Our objective was to analyze the performance of this system during the most recent 2012--2013 influenza season and to analyze the performance at multiple levels of geographic granularity, unlike past studies that focused on national or regional surveillance. Our system's influenza prevalence estimates were strongly correlated with surveillance data from the Centers for Disease Control and Prevention for the United States (r = 0.93, p < 0.001) as well as surveillance data from the Department of Health and Mental Hygiene of New York City (r = 0.88, p < 0.001). Our system detected the weekly change in direction (increasing or decreasing) of influenza prevalence with 85% accuracy, a nearly twofold increase over a simpler model, demonstrating the utility of explicitly distinguishing infection tweets from other chatter.
Expand Me    Mark Dredze, Michael J Paul, Shane Bergsma, Hieu Tran. Carmen: A Twitter Geolocation System with Applications to Public Health. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF] [Code]
Public health applications using social media often require accurate, broad-coverage location information. However, the standard information provided by social media APIs, such as Twitter, cover a limited number of messages. This paper presents Carmen, a geolocation system that can determine structured location information for messages provided by the Twitter API. Our system utilizes geocoding tools and a combination of automatic and manual alias resolution methods to infer location structures from GPS positions and user-provided profile data. We show that our system is accurate and covers many locations, and we demonstrate its utility for improving influenza surveillance.
Expand Me    Michael J Paul, Byron Wallace, Mark Dredze. What Affects Patient (Dis)satisfaction? Analyzing Online Doctor Ratings with a Joint Topic-Sentiment Model. AAAI Workshop on Expanding the Boundaries of Health Informatics Using AI (HIAI), 2013. [PDF]
We analyze patient reviews of doctors using a novel probabilistic joint model of aspect and sentiment based on factorial LDA. We leverage this model to exploit a small set of previously annotated reviews to automatically analyze the topics and sentiment latent in over 50,000 online reviews of physicians (and we make this dataset publicly available). The proposed model outperforms baseline models for this task with respect to model perplexity and sentiment classification. We report the most representative words with respect to positive and negative sentiment along three clinical aspects, thus complementing existing qualitative work exploring patient reviews of physicians.
Expand Me    Alex Lamb, Michael J. Paul, Mark Dredze. Separating Fact from Fear: Tracking Flu Infections on Twitter. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF]
Twitter has been shown to be a fast and reliable method for disease surveillance of common illnesses like influenza. However, previous work has relied on simple content analysis, which conflates flu tweets that report infection with those that express concerned awareness of the flu. By discriminating these categories, as well as tweets about the authors versus about others, we demonstrate significant improvements on influenza surveillance using Twitter.
Expand Me    Michael Paul, Mark Dredze. Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models. North American Chapter of the Association for Computational Linguistics (NAACL), 2013. [PDF]
Multi-dimensional latent text models, such as factorial LDA (f-LDA), capture multiple factors of corpora, creating structured output for researchers to better understand the contents of a corpus. We consider such models for clinical research of new recreational drugs and trends, an important application for mining current information for healthcare workers. We use a "three-dimensional" f-LDA variant to jointly model combinations of drug (marijuana, salvia, etc.), aspect (effects, chemistry, etc.) and route of administration (smoking, oral, etc.) Since a purely unsupervised topic model is unlikely to discover these specific factors of interest, we develop a novel method of incorporating prior knowledge by leveraging user generated tags as priors in our model. We demonstrate that this model can be used as an exploratory tool for learning about these drugs from the Web by applying it to the task of extractive summarization. In addition to providing useful output for this important public health task, our prior-enriched model provides a framework for the application of f-LDA to other tasks
      Mark Dredze. How Social Media Will Change Public Health. IEEE Intelligent Systems, 2012. [Link]
Expand Me    Alex Lamb, Michael J. Paul, Mark Dredze. Investigating Twitter as a Source for Studying Behavioral Responses to Epidemics. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
We present preliminary results for mining concerned awareness of influenza tweets. We describe our data set construction and experiments with binary classification of data into influenza versus general messages and classification into concerned awareness and existing infection.
Expand Me    Atul Nakhasi, Ralph J Passarella, Sarah G Bell, Michael J Paul, Mark Dredze, Peter J Pronovost. Malpractice and Malcontent: Analyzing Medical Complaints in Twitter. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
In this paper we report preliminary results from a study of Twitter to identify patient safety reports, which offer an immediate, untainted, and expansive patient perspective un- like any other mechanism to date for this topic. We identify patient safety related tweets and characterize them by which medical populations caused errors, who reported these er- rors, what types of errors occurred, and what emotional states were expressed in response. Our long term goal is to improve the handling and reduction of errors by incorpo- rating this patient input into the patient safety process.
Expand Me    Michael J. Paul, Mark Dredze. Experimenting with Drugs (and Topic Models): Multi-Dimensional Exploration of Recreational Drug Discussions. AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012. [PDF]
Clinical research of new recreational drugs and trends requires mining current information from non-traditional text sources. In this work we support such research through the use of a multi-dimensional latent text model -- factorial LDA -- that captures orthogonal factors of corpora, creating structured output for researchers to better understand the contents of a corpus. Since a purely unsupervised model is unlikely to discover specific factors of interest to clinical researchers, we modify the structure of factorial LDA to incorporate prior knowledge, including the use of of observed variables, informative priors and background components. The resulting model learns factors that correspond to drug type, delivery method (smoking, injection, etc.), and aspect (chemistry, culture, effects, health, usage). We demonstrate that the improved model yields better quantitative and more interpretable results.
      Ralph Passarella, Atul Nakhasi, Sarah Bell, Michael J. Paul, Peter Pronovost, Mark Dredze. Twitter as a Source for Learning about Patient Safety Events. Annual Symposium of the American Medical Informatics Association (AMIA), 2012.
Expand Me    Michael J. Paul, Mark Dredze. A Model for Mining Public Health Topics from Twitter. Technical Report -, Johns Hopkins University, 2011. [PDF]
We present the Ailment Topic Aspect Model (ATAM), a new topic model for Twitter that associates symptoms, treatments and general words with diseases (ailments). We train ATAM on a new collection of 1.6 million tweets discussing numerous health related topics. ATAM isolates more coherent ailments, such as influenza, infections, obesity, as compared to standard topic models. Furthermore, ATAM matches influenza tracking results produced by Google Flu Trends and previous influenza specialized Twitter models compared with government public health data.
Expand Me    Michael J. Paul, Mark Dredze. You Are What You Tweet: Analyzing Twitter for Public Health. International Conference on Weblogs and Social Media (ICWSM), 2011. [PDF]
Analyzing user messages in social media can measure different population haracteristics, including public health measures. For example, recent work has correlated Twitter messages with influenza rates in the United States; but this has largely been the extent of mining Twitter for public health. In this work, we consider a broader range of public health applications for Twitter. We apply the recently introduced Ailment Topic Aspect Model to over one and a half million health related tweets and discover mentions of over a dozen ailments, including allergies, obesity and in- somnia. We introduce extensions to incorporate prior knowledge into this model and apply it to several tasks: tracking illnesses over times (syndromic surveillance), measuring behavioral risk factors, localizing illnesses by geographic region, and analyzing symptoms and medication usage. We show quantitative correlations with public health data and qualitative evaluations of model output. Our results suggest that Twitter has broad applicability for public health research.