Using Digital Data to Predict CHD
1 other identifier
observational
781
1 country
1
Brief Summary
This project seeks to identify and characterize features derived from digital data (e.g., social media, online search, mobile media) that are associated with coronary heart disease (CHD) and related risk factors, and to develop models that use digital data alongside conventional predictive models to predict CHD risk and health care utilization.
Study Timeline
Key milestones and dates
Study Start
First participant enrolled
September 25, 2020
First Submitted
Initial submission to the registry
September 28, 2020
First Posted
Study publicly available on registry
October 5, 2020
Primary Completion
Last participant's last visit for primary outcome
May 30, 2025
Study Completion
Last participant's last visit for all outcomes
June 1, 2025
Results Posted
Study results publicly available
November 4, 2025
Conditions
Keywords
Outcome Measures
Primary Outcomes (1)
Latent Dirichlet Allocation (LDA) Topics - Topics / Themes Discussed Between Patients With and Without Heart Disease
The primary outcome is the set of topics and language features (derived using the LDA method for clustering language data). For each participant, we included all available Facebook wall posts from the start of their account history through data collection, regardless of whether they occurred before or after a CHD diagnosis. We examined associations between linguistic features (unigrams, Linguistic Inquiry and Word Count [LIWC] categories, LDA topics) and cardiovascular case status (CHD presence vs. absence) using Pearson correlation and logistic regression. LDA, a systematic method for identifying text-based themes, was applied to generate 200 clusters of co-occurring words ("topics"). For each feature type (unigram, LIWC category, LDA topic), we fit separate logistic regression models and calculated Pearson correlation coefficients to assess predictive value for case status. Each language-derived feature was encoded as a normalized frequency count per user to enable consistent comparison across participants.
Through study completion, an average of 3 years
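As a rough illustration of the per-user normalized frequency encoding and the Pearson correlation step described for the primary outcome, the following minimal Python sketch may help. It is not the study's analysis code: the function names, the whitespace tokenizer, and the toy posts are all assumptions made for illustration, and the logistic regression and LDA steps are left out.

```python
from collections import Counter
from math import sqrt


def normalized_unigram_freqs(posts):
    """Return each word's share of all words a user posted (relative frequency)."""
    # Assumption: simple lowercase whitespace tokenization, for illustration only.
    counts = Counter(word for post in posts for word in post.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant sequence; treated as 0 in this sketch
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: posts per user and case status (1 = CHD, 0 = no CHD).
users = {
    "u1": (["tired after the hospital visit", "hospital again today"], 1),
    "u2": (["great run this morning", "run club tonight"], 0),
    "u3": (["hospital food is bad"], 1),
    "u4": (["movie night with friends"], 0),
}

# One normalized feature vector per user, so users with many posts and users
# with few posts are compared on the same scale.
features = {uid: normalized_unigram_freqs(posts) for uid, (posts, _) in users.items()}
status = [label for _, label in users.values()]

# Correlate one unigram feature with case status across users.
hospital_freq = [features[uid].get("hospital", 0.0) for uid in users]
r_hospital = pearson(hospital_freq, status)
```

With a binary case status, the Pearson coefficient computed this way is the point-biserial correlation. The same pattern would apply to LIWC category or LDA topic features once each is expressed as a per-user normalized value.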
Other Outcomes (2)
CHD Event
Through study completion, an average of 3 years
Health Care Utilization
Through study completion, an average of 3 years
Study Arms (2)
Case
Patients aged 30-74 with and without CHD (ICD 10: I63, I20-I25) within the last 5 years.
Control
Patients aged 30-74 with a non-cardiovascular-related medical history.
Interventions
Interested participants may complete the informed consent online. After providing informed consent, participants will be asked to share data from the digital platforms they use (Facebook, Instagram, Twitter, Google search, step data) and will then complete a cross-sectional survey.
Eligibility Criteria
We will identify patients aged 30-74 with and without CHD (ICD 9: 414.0; ICD 10: I63, I20-I25).
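The diagnosis codes listed above could be matched against a patient's coded history along the following lines. This is only a sketch under stated assumptions: the function name `is_chd_code` is hypothetical, and treating subcodes (for example `I25.10` or `414.01`) as matches is an assumption, not something the registry entry specifies.

```python
import re

# ICD-10 categories named in the registry entry: I20-I25 and I63.
# Assumption: any subcode after the decimal point also counts as a match.
ICD10_CHD = re.compile(r"^(I2[0-5]|I63)(\.\w+)?$")

# ICD-9 code named in the registry entry: 414.0.
# Assumption: five-digit subcodes such as 414.01 also count as a match.
ICD9_CHD_PREFIX = "414.0"


def is_chd_code(code: str, version: int) -> bool:
    """Return True if the code falls within the CHD code set listed above."""
    code = code.strip().upper()
    if version == 10:
        return bool(ICD10_CHD.match(code))
    if version == 9:
        return code.startswith(ICD9_CHD_PREFIX)
    raise ValueError(f"unsupported ICD version: {version}")


def has_chd_history(coded_history) -> bool:
    """coded_history: iterable of (code, icd_version) pairs for one patient."""
    return any(is_chd_code(code, version) for code, version in coded_history)
```

For example, `has_chd_history([("E11.9", 10), ("I25.10", 10)])` returns `True`, while a history containing only `("I10", 10)` returns `False`. The 5-year lookback and the 30-74 age range would need to be applied separately.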
You may qualify if:
- 30-74 years of age
- Willing to sign informed consent
- Primarily English speaking (for language analysis)
- Has an account on any of the following digital data platforms (Facebook, Instagram, Twitter, Reddit, Google (Gmail), or a smartphone or wearable device such as Apple Health, Fitbit, Samsung Health, MapMyFitness, or Garmin) and willing to share data
- If has social media account (Instagram or Facebook), willing to share historical and prospective data (60 days)
- If has Google (Gmail) account, willing to download and share Google Takeout zip file
- If has smartphone or wearable device, willing to share step data
- Willing to share access to medical health records
- Willing to share healthcare insurance information
You may not qualify if:
- Does not use and post on digital data sources we are studying or unwilling to donate data
- Patient is in severe distress, e.g. respiratory, physical, or emotional distress
- Patient is intoxicated, unconscious, or unable to appropriately respond to questions
Contact the study team to confirm eligibility.
Sponsors & Collaborators
Study Sites (1)
University of Pennsylvania Health System
Philadelphia, Pennsylvania, 19101, United States
MeSH Terms
Conditions
Interventions
Intervention Hierarchy (Ancestors)
Results Point of Contact
- Title: Director of Research
- Organization: University of Pennsylvania
Publication Agreements
- PI is Sponsor Employee: No
- Restrictive Agreement: No
Study Design
- Study Type: Observational
- Observational Model: Case-control
- Time Perspective: Cross-sectional
- Sponsor Type: Other
- Responsible Party: Sponsor
Study Record Dates
First Submitted
September 28, 2020
First Posted
October 5, 2020
Study Start
September 25, 2020
Primary Completion
May 30, 2025
Study Completion
June 1, 2025
Last Updated
November 4, 2025
Results First Posted
November 4, 2025
Record last verified: 2025-10