Data Science for Biomedical Informatics (BMIN503/EPID600)

As part of UPenn’s efforts to grow its biomedical informatics training programs, a new course covering basic principles of data science applied to biomedical informatics was offered in Fall 2015. As of Fall 2017, the course is a requirement for the Certificate in Biomedical Informatics and for the Epidemiology PhD. UPenn students can find out more on the main course website.

 

Course Summary

Data science refers broadly to using statistics and informatics techniques to gain insights from large datasets. Biomedical informatics refers to a range of disciplines that use computational approaches to analyze biomedical data, both to answer pre-specified questions and to discover novel hypotheses. In this course, we will use R and other freely available software to learn the fundamentals of data science as applied to a range of biomedical informatics topics, including those that make use of health and omics data. After completing this course, students will:

  • Be able to retrieve and clean data, perform exploratory analyses, build models to answer scientific questions, and present visually appealing results to accompany data analyses.
  • Be familiar with various biomedical data types and resources related to them.
  • Know how to create reproducible and easily shareable results with R and GitHub.

Course Director: Blanca E Himes, PhD

Guest Lecturers:
Elizabeth Grice, PhD
John Holmes, PhD
Rebecca Hubbard, PhD
William La Cava, PhD
Jason Moore, PhD
Ryan Urbanowicz, PhD

TA: Ryan Urbanowicz, PhD

Expectations: You are expected to attend all sessions of the course, read assigned chapters and articles prior to class (if/when assigned), participate during class sessions, and complete required exercises and the class project. This course requires use of a laptop computer, which you must bring to fully participate in lectures and scheduled lab activities. You must be familiar with this laptop and able to install free programs onto it.

Grading: The course is graded on a letter grade basis, according to the following proportions:
40% assignments (6 total)
40% biomedical data science project
20% participation in class and lab sessions
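
As a rough illustration of how these proportions combine, the sketch below computes a weighted overall score from the three components. It assumes each component is scored on a 0-100 scale, which the course description does not state; the function name and example scores are hypothetical, and the mapping from overall score to letter grade is not specified here, so it is left out.

```python
# Weights taken from the grading proportions listed above.
WEIGHTS = {"assignments": 0.40, "project": 0.40, "participation": 0.20}


def weighted_score(scores):
    """Combine component scores (assumed to be on a 0-100 scale) into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {"assignments": 90.0, "project": 85.0, "participation": 100.0}
print(weighted_score(example))
```

Because assignments and the project carry equal weight, a point gained on either moves the overall score twice as much as a point of participation.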

Format: This course meets twice weekly. The first half of each session is lecture-based, and the second half is spent working through computational exercises. Six assignments will be due throughout the semester. A final project requiring a substantial amount of work and creativity will be due at the end of the semester in lieu of a final exam. Students will be encouraged to work independently and seek help as needed.

Biomedical Data Science Project: For the final project, each student will answer a question of their own choosing using publicly available biomedical data and some of the tools presented during the course. After choosing a topic, each student will identify three faculty members, staff scientists, or postdocs from different departments or fields, who will provide feedback and help define a specific, novel, interdisciplinary question. Grading will be based on three project components: (1) a proposal that includes a novel interdisciplinary question and the feedback provided by the three experts, (2) an R Markdown document that describes the question, data sources, analysis, and results, and (3) a 10-minute oral presentation describing the work to classmates.

Textbooks:

There are many free online resources to learn R. The following two textbooks are suggested but not required for those students who prefer to have a printed reference:

  • Lander JP. “R for Everyone: Advanced Analytics and Graphics.” Addison-Wesley Professional (2014).
  • Wickham H and Grolemund G. “R for Data Science.” O’Reilly (2016).

Prerequisites: Familiarity with basic statistical concepts (e.g., from EPID 526/7 or another first-year graduate-level statistics course) is expected, as this course will not cover them in depth. A background in biology and computing would be helpful, but no formal requirements will be enforced.