Clinical research tasks such as modeling disease risk and characterizing the genetic underpinnings of disease require data from large and diverse cohorts. Prospective studies face several challenges, including recruitment, the long time frames needed for follow-up, and uncertainty about whether findings generalize. Using electronic health record (EHR) data for retrospective studies can alleviate many of these challenges, especially when the data are collected by a hospital system that serves a large and diverse patient population. Such data sets are potentially orders of magnitude larger than most prospective studies. However, a common hindrance to using EHR data is that a substantial portion of valuable information is not available in structured form but is instead embedded in semi-structured reports or narrative notes. The structured data that are available, such as diagnostic billing codes, do not accurately capture biological phenotypes for research purposes. In this talk, we propose a general framework for extracting accurate information from semi-structured and unstructured clinical text using modern natural language processing (NLP) techniques. Our group has developed the Community Care Cohort Project (C3PO) with the goal of collecting multimodal EHR data for patients who received longitudinal primary care through the Mass General Brigham healthcare system. We will demonstrate the power and convenience of pretrained transformer-based models for extracting valuable pieces of information from clinical text, as well as for identifying outcome diagnoses from discharge summaries. We will highlight how our data-efficient NLP framework enables us to unlock the full power of EHR data, and we will present our resulting clinical discoveries in C3PO.