BMMB - 597D: Practical Data Analysis (Fall 2011)
The purpose of this course is to introduce students to the
various applications of high-throughput sequencing, including ChIP-Seq,
RNA-Seq, SNP calling, metagenomics, de novo assembly and others.
The course material will concentrate on presenting complete data analysis scenarios
for each of these domains of applications and will introduce students to a wide
variety of existing tools and techniques. We expect that by the end of the
course work students will:
- understand common bioinformatics data formats and standards
- become familiar with the practice of analyzing short-read sequencing data from various instruments:
  - Illumina HiSeq sequencer
  - ABI SOLiD sequencer
  - Roche 454 platforms
- develop the computationally oriented thinking that is necessary to take on large-scale data analysis projects
- understand data analysis principles of methodologies such as:
  - short read and long read alignments
  - ChIP-Seq analysis and peak calling
  - interval query and manipulation
  - SNP calling and genomic variation detection
  - genome assembly with open source tools
  - metagenomics analysis
- filter, extract and combine data with scripting languages
- automate tasks with shell scripts to create reusable data pipelines
- plot and visualize results with R and other packages
A laptop with enough battery power
for 25 minutes of work may be required to perform data analysis
tasks in class. We will be able to provide support for
Mac OS X (Tiger/Leopard), Windows (XP/Vista) and Linux operating systems.
Practical data analysis for life scientists
BMMB 597D - Bio Data Analysis (2 cr.)
Schedule #398704
Tuesday/Thursday 2:30-3:20 in 120 Thomas Building
Limit of 25 students.
Office hours: MW 2-3pm 502B Wartik
Lecture Notes
Lectures will appear below as they are presented. Each week we will cover one topic over
two lectures. Homework is included in the handouts.
- Lecture 1 - slides, handouts
course information, homework and project information, introduction to computing,
introduction to the UNIX operating system, homework 1.
- Lecture 2 - slides, handouts, download:
lecture-2.zip
the GFF format, sequence ontologies, UNIX input and output streams,
piping commands, processing a tabular file with UNIX tools, homework 2.
- Lecture 3 - slides, handouts, download:
lecture-3.zip
quality control, sequencing read file formats, FASTA, color space FASTA, FASTQ,
using the FastQC package, homework 3.
- Lecture 4 - slides, handouts, install:
fastx.zip
quality filtering, writing shell scripts, elements of bash programming, using the Fastx toolkit, homework 4.
- Lecture 5 - slides, handouts, install:
lecture-5.zip
sequence alignment concepts, general features and characteristics of short read aligners, using the BWA aligner, homework 5.
- Lecture 6 - slides, handouts, data set:
lecture-6.zip
the SAM (Sequence Alignment/Map) format, understanding the SAM specification, generating
a SAM file with the BWA aligner, homework 6.
- Lecture 7 - slides, handouts
SAM file filtering, using the Samtools software suite, generating BAM files (binary SAM), sorting and indexing BAM files,
filtering alignment files, depth of coverage tools, querying SAM files, homework 7.
- Lecture 8 - slides, handouts
sequence coverage concepts, paired-end and mate-pair sequencing, aligning and filtering
paired-end data, homework 8.
- Lecture 9 - slides, handouts
genome visualization tools, using the Integrative Genomics Viewer,
creating custom genomes for IGV, visualizing paired end alignments, homework 9.
- Lecture 10 - slides, handouts,
data set: lecture-10.zip;
text parsing and processing, introduction to the AWK programming language, homework 10.
- Lecture 11 - slides, handouts,
genomic coordinate systems, BED, GFF and WIG formats, converting between formats, homework 11.
- Lecture 12 - slides, handouts,
lecture-12.zip;
interval datatypes, coordinate systems, BED and GFF formats, interval operations, intersect, genomic coverage computation with the BEDTools package, homework 12.
- Lecture 13 - slides, handouts;
more interval operations, flanking, extending, merging intervals with BEDTools package, homework 13.
- Lecture 14 - slides, handouts;
compressed files and archives, how to install tools, the tabix software tool, the Penn State High Performance Computing Systems, homework 14.
- Lecture 15 - slides, handouts;
human genomic variation, the Variant Call Format, introduction to SNP calling, homework 15.
- Lecture 16 - slides, handouts;
dealing with data duplication, continuing the overview of SNP calling tools,
the Genome Analysis Toolkit, the inGAP software, homework 16.
- Lecture 17 - slides, handouts,
data 17.tar.gz;
midterm project instructions, introduction to ChIP-Seq analysis, DNA fragment sequencing,
comparing bound locations of short and long footprints, homeworks 17 and 18 (midterm project).
- Midterm Project list: project-ideas.pdf;
the list of proposed midterm project ideas, see Lecture 17 for instructions
- Lecture 18 - slides, handouts,
more strategies for ChIP-Seq analysis, samtools pileup output and computing coverage measures, creating
an indexed, queryable coverage file
- Lecture 19 - slides, handouts,
code repositories, the bioawk and chipexo repositories, peak calling concepts, peak calling with GeneTrack,
homework 19
- Lecture 20 - slides, handouts, peak-callers.tar.gz
ChIP-Seq fragment size estimation (bioawk), the ChIP-Seq Challenge,
running peak callers, evaluating and comparing the output of MACS, SISSRs, SWEMBL and GeneTrack (chipexo), homework 20
- Lecture 21 - slides, handouts, data 21.tar.gz;
peak prediction with GeneTrack (chipexo), the Cis-regulatory Element Annotation System, generating
your custom profiles with bioawk, homework 21
- Lecture 22 - slides, handouts,
p-values and statistical significance, p-value interpretation: problems and pitfalls,
simple strategies for statistical estimation, the filo package: groupBy and stats programs,
homework 22
- Lecture 23 - slides, handouts,
dataset 23
an introduction to genome assembly, the AMOS toolset, the Velvet assembler, the MUMmer aligner,
homework 23
- Lecture 24 - slides, handouts
the ReadSeq utility, the NCBI short read archive, read mapping quality evaluation with wgsim, NGS mapper ROC curves,
comparing BWA, bowtie and bowtie2 mappers,
homework 24
- Lecture 25 - slides, handouts
data analysis with R and RStudio, data types, vectors, factors, data frames, indexing and filtering R objects
homework 25
- Lecture 26 - slides, handouts dataset 26
visualizing high-dimensional datasets, using the ggplot2 software, example plots, generating histograms of
distances around 5’ feature start sites
homework 26
- Lecture 27 - slides, handouts
introduction to metagenomics, methods of metagenomics, phylotyping and OTU based approaches, resources, running BLAST
- Lecture 28 - slides, handouts, dataset 28
final project information, analysis of metagenomics data, read classification via the MetaGenome Analyzer (MEGAN) and
the RDP Multiclassifier
- Dataset for the final project: see project description at the beginning of lecture 28. project.tar.gz
- Lecture 29 - slides, handouts, dataset 29
metagenomics data analysis, the QIIME and mothur packages, the NAST algorithm, example of data analysis with mothur
- Lecture 30 - slides, handouts,
quality filtering for metagenomics data analysis, trimming flows and sequences with mothur, basic workflow, rarefaction curves
Grading and Homework
The final grade will be an average of the grades obtained on homework and two projects.
Please refer to the information in Lecture 1 for more details
on the projects.
Homework will be handed out in most lectures in the form of exercises that need
to be turned in at the beginning of each week. Note that many of these may be solved
in class during the exercise session.
We want to emphasize that the primary goal of this course work is to improve
students' ability to handle and interpret data sets. Therefore the evaluation process
is relative to each student's initial aptitude. We aim to focus on developing
permanent skills and talents that are not just immediately useful but
also provide the foundation for a further, more in-depth understanding of
informatics in general.