theme: ./theme
colorSchema: bright
layout: intro
highlighter: shiki
lineNumbers: false
title: Analyzing genomics data
info: Omics Data Analysis: Genomics. How to obtain biological information from the measurements.
drawings: persist
css: unocss
themeConfig: primary: '#5d8392' logoHeader: '/assets/scilifelab.png' eventLogo: '/assets/ngilogo.png' eventUrl:

Analyzing genomic data

From data to insight (hopefully)

https://github.com/MatthiasZepper/Lecture-OmicsDataAnlysis

layout: presenter presenterImage: '/assets/people/matthias.jpg'

Matthias Zepper

  • Life and Medical Sciences in Bonn 🇩🇪
  • PhD in leukemia epigenetics in Münster 🇩🇪
  • Founder (& liquidator) of start-up Nucleotidy
  • Bioinformatician at the NGI, Stockholm 🇸🇪

layout: center

An exciting day in the lab...

layout: text-image media: '/assets/fun/phd021315s.gif' caption: 'https://phdcomics.com/comics/archive.php?comicid=1780'

Bioinformatician

The toolbox

Genomic data formats

Example analyses

Workflows


layout: intro

Toolbox peek


layout: text-image reverse: true media: '/assets/tools/excel/nameerrors.png' caption: 'https://doi.org/10.1186/s13059-016-1044-7'

What I don't use

Gene renaming by HGNC.

In 2020, the HUGO Gene Nomenclature Committee (HGNC) renamed genes whose symbols Excel auto-converted to dates, e.g. SEPT1 became SEPTIN1 and MARCH1 became MARCHF1.
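
A quick way to check a gene list for symbols at risk of date conversion is a regular expression. The sketch below is only an illustration: the list of month-like prefixes is an assumption and does not reproduce Excel's actual parsing rules.

```python
import re

# Month-like prefixes followed by a day number (assumed, simplified pattern).
DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)

def at_risk(symbols):
    """Return the gene symbols that a spreadsheet may read as dates."""
    return [s for s in symbols if DATE_LIKE.match(s)]

genes = ["SEPT1", "MARCH1", "DEC1", "TP53", "SEPTIN1", "MARCHF1", "BRCA2"]
print(at_risk(genes))  # the old symbols are flagged, the renamed ones are not
```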


layout: text-image media: '/assets/tools/excel/conversionoptional.png' caption: 'https://gizmodo.com/microsoft-fixes-excel-feature-that-forced-scientists-to-1850949443'

What I don't use


layout: text-window

Use suitable tools

  • They might not have a GUI.
  • They might not run on your machine.
  • For remote compute, mind data privacy!

Use a suitable OS

  • GNU / Linux
  • macOS
  • Windows Subsystem for Linux

::window::

_   _ ____  ____  __  __    _    __  __
| | | |  _ \|  _ \|  \/  |  / \   \ \/ /   | System:    rackham3
| | | | |_) | |_) | |\/| | / _ \   \  /    | User:      bioinfomagician
| |_| |  __/|  __/| |  | |/ ___ \  /  \    |
 \___/|_|   |_|   |_|  |_/_/   \_\/_/\_\   |

########################################################################

  User Guides: http://www.uppmax.uu.se/support/user-guides
  FAQ: http://www.uppmax.uu.se/support/faq

  Write to [email protected], if you have questions or comments.


(base) [bioinfomagician@rackham3 ~]$





layout: new-section

Programming languages for data exploration

Python

R

Julia


layout: new-section

Programming languages for tools

C++

Go

Rust


layout: new-section

Bioinformatic ecosystem

git (version control)

Jupyter, Quarto (Notebooks)

Snakemake, Nextflow (Workflows)

Docker, Apptainer (containers)

Bioconda (package manager)

Bioconductor, Tidyverse (R packages)

BioNumPy, Pandas, Polars, Apache Arrow, DuckDB (Analytics)


layout: new-section

The most important tools

Biological understanding

Statistical knowledge

(free fulltext)

layout: new-section

Syntax errors are easy to debug

but it frequently happens that

tools output something arbitrary.


layout: text-image media: '/assets/tools/issues/PIIS1934590923002886.png' caption: 'https://doi.org/10.1016/j.stem.2023.08.005'

Understand the methods you apply

Example from 10.1016/j.stem.2023.08.005:

  • Findings backed by wet lab results.
  • Distances in 2D projections of UMAP / t-SNE are not directly interpretable.
  • Their loss functions are invariant with respect to rotations.
  • More details at Understanding UMAP
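
The rotation argument can be checked numerically. Both methods score an embedding through the distances between the embedded points, so any rotation of the plot is an equally good solution. A minimal sketch with made-up coordinates (the four points and the angle are hypothetical):

```python
import math

def rotate(points, angle):
    """Rotate 2D points around the origin by `angle` (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

embedding = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]  # hypothetical 2D embedding
rotated = rotate(embedding, math.radians(73))

before = pairwise_distances(embedding)
after = pairwise_distances(rotated)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(before, after))
print(same)  # coordinates changed, pairwise distances did not
```

The axes of such a plot therefore carry no meaning of their own; "up" or "left" in one run may point anywhere in the next.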

layout: text-image media: '/assets/tools/issues/s41586-020-2095-1.png'

Understand the methods you apply

Example from 10.1038/s41586-020-2095-1:

  • In this case, both the analysis strategy and the understanding of the methods used were inadequate.
  • Overconfident, broad generalization of the findings.
  • More details at 10.1128/mbio.01607-23

layout: intro

Common genomic data formats


layout: new-section

FastQ: Format for sequencing reads

  • genome-sequencer-8 icon by DBCLS (https://togotv.dbcls.jp/en/pics.html), licensed under CC-BY 4.0 Unported (https://creativecommons.org/licenses/by/4.0/)
  • DNA_sequencer icon by DBCLS (https://togotv.dbcls.jp/en/pics.html), licensed under CC-BY 4.0 Unported (https://creativecommons.org/licenses/by/4.0/)
  • img_sequencers01 icon by PacBio (https://pacb.com), licensed under CC0 (https://creativecommons.org/publicdomain/zero/1.0/)
  • Document icon by OpenClipart (https://openclipart.org/), licensed under CC0 (https://creativecommons.org/publicdomain/zero/1.0/)

FastQ


layout: new-section

FastQ: Format for sequencing reads

  • Plain text format
  • Each read is represented by four consecutive lines:
    1. Sequence identifier and an optional description
    2. The sequence
    3. A + separator, optionally followed by the identifier again
    4. The base call qualities, one character per base
@SCILIFELAB:500:NGISTLM:1:1101:32832:1016 1:N:0:GCTTCAGGGT+AAGGTAGCGT
TCCCCCAACTTGATATTAATAACACTATAGACCACCGCCCCGAAGGGGACGAAAAATGGTTTTTAGAGAACGAGAAGACGGTTACGCAG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFF
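
The four-line layout is simple enough to parse by hand. The following is a minimal sketch, not a replacement for an established parser; it assumes unwrapped records and the Phred+33 quality encoding used by current Illumina instruments, and the example read is hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (identifier, sequence, qualities) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Phred+33 (assumed): quality score = ASCII code of the character minus 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]

example = io.StringIO("@read1 hypothetical description\nACGTN\n+\nFF:F#\n")
for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

Under this encoding `F` corresponds to Q37, `:` to Q25 and `#` to Q2, which is why a single `#` or `:` stands out in an otherwise uniform quality line.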

layout: two-cols-header

Quality control: Good sequencing quality

::left::

Base call quality is high along the full read

::right::

The base composition is balanced


layout: two-cols-header

Quality control: Poor sequencing quality

::left::

Base call quality drops off dramatically

::right::

The base composition is heavily skewed
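
Both quality-control views boil down to per-position summaries across all reads. A minimal sketch, assuming equal-length reads and Phred+33 qualities; the four toy reads are hypothetical and chosen so that the last position has poor quality and the third position is skewed towards G:

```python
from collections import Counter

def per_position_stats(reads):
    """reads: list of (sequence, quality_string) pairs of equal length."""
    length = len(reads[0][0])
    stats = []
    for i in range(length):
        quals = [ord(quality[i]) - 33 for _, quality in reads]  # Phred+33 assumed
        bases = Counter(sequence[i] for sequence, _ in reads)
        stats.append({
            "mean_quality": sum(quals) / len(quals),
            "composition": {b: bases[b] / len(reads) for b in "ACGT"},
        })
    return stats

reads = [("ACGT", "FFF#"), ("CAGT", "FFF#"), ("GTGA", "FF:#"), ("TGGC", "FF:#")]
stats = per_position_stats(reads)
for position, entry in enumerate(stats, start=1):
    print(position, entry["mean_quality"], entry["composition"])
```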


layout: text-window

Common tasks

(pairwise) Alignment

Find the exact origin of a short fragment in a long reference

Quasi-mapping

Which reference is the most likely origin?

De-novo assembly

Create a long reference from short fragments

::window::

>NC_001422.1 Escherichia phage phiX174
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT
GATAAAGCAGGAATTACTACTGCTTGTTTACGAATTAAATCGAAGTGGACTGCTGGCGGAAAATGAGAAA
ATTCGACCTATCCTTGCGCAGCTCGAGAAGCTCTTACTTTGCGACCTTTCGCCATCAACTAACGATTCTG
TCAAAAACTGACGCGTTGGATGAGGAGAAGTGGCTTAATATGCTTGGCACGTTCGTCAAGGACTGGTTTA
GATATGAGTCACATTTTGTTCATGGTAGAGATTCTCTTGTTGACATTTTAAAAGAGCGTGGATTACTATC
TGAGTCCGATGCTGTTCAACCACTAATAGGTAAGAAATCATGAGTCAAGTTACTGAACAATCCGTACGTT
TCCAGACCGCTTTGGCCTCTATTAAGCTCATTCAGGCTTCTGCCGTTTTGGATTTAACCGAAGATGATTT
CGATTTTCTGACGAGTAACAAAGTTTGGATTGCTACTGACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCT
TGCGTTTATGGTACGCTGGACTTTGTGGGATACCCTCGCTTTCCTGCTCCTGTTGAGTTTATTGCTGCCG
TCATTGCTTATTATGTTCATCCCGTCAACATTCAAACGGCCTGTCTCATCATGGAAGGCGCTGAATTTAC
GGAAAACATTATTAATGGCGTCGAGCGTCCGGTTAAAGCCGCTGAATTGTTCGCGTTTACCTTGCGTGTA
CGCGCAGGAAACACTGACGTTCTTACTGACGCAGAAGAAAACGTGCGTCAAAAATTACGTGCGGAAGGAG
TGATGTAATGTCTAAAGGTAAAAAACGTTCTGGCGCTCGCCCTGGTCGTCCGCAGCCGTTGCGAGGTACT
AAAGGCAAGCGTAAAGGCGCTCGTCTTTGGTATGTAGGTGGTCAACAATTTTAATTGCAGGGGCTTCGGC
CCCTTACTTGAGGATAAATTATGTCTAATATTCAAACTGGCGCCGAGCGTATGCCGCATGACCTTTCCCA
TCTTGGCTTCCTTGCTGGTCAGATTGGTCGTCTTATTACCATTTCAACTACTCCGGTTATCGCTGGCGAC
TCCTTCGAGATGGACGCCGTTGGCGCTCTCCGTCTTTCTCCATTGCGTCGTGGCCTTGCTATTGACTCTA
CTGTAGACATTTTTACTTTTTATGTCCCTCATCGTCACGTTTATGGTGAACAGTGGATTAAGTTCATGAA
GGATGGTGTTAATGCCACTCCTCTCCCGACTGTTAACACTACTGGTTATATTGACCATGCCGCTTTTCTT
GGCACGATTAACCCTGATACCAATAAAATCCCTAAGCATTTGTTTCAGGGTTATTTGAATATCTATAACA
ACTATTTTAAAGCGCCGTGGATGCCTGACCGTACCGAGGCTAACCCTAATGAGCTTAATCAAGATGATGC
TCGTTATGGTTTCCGTTGCTGCCATCTCAAAAACATTTGGACTGCTCCGCTTCCTCCTGAGACTGAGCTT
[...]

layout: text-window

🎄🎶❄️ mode on...

Pairwise alignment

  • Unique: ng spirits brig  ghing all the w
  • Multi-mapper: Jingle bel
  • Base error: what pun it is
  • Indels: Jingggge bls

Quasi-mapping

  • Within scaffold: Bells on bob tail open sleigh. Hey!
  • Within reference: Sankta Lucia

::window::


Dashing through the snow
In a one-horse open sleigh
O'er the fields we go
Laughing all the way
Bells on bob tail ring
Making spirits bright
What fun it is to ride and sing
A sleighing song tonight! Oh!

Jingle bells, jingle bells,
Jingle all the way.
Oh! what fun it is to ride
In a one-horse open sleigh. Hey!

Jingle bells, jingle bells,
Jingle all the way;
Oh! what fun it is to ride
In a one-horse open sleigh.
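
The lyric analogy translates directly into string search. The sketch below is a toy, assuming a lower-cased excerpt of the song as the "reference"; real aligners use indexes rather than scanning every position, and indels additionally require dynamic programming, which is left out here.

```python
# Hypothetical reference: a lower-cased excerpt of the lyrics.
REFERENCE = "jingle bells jingle bells jingle all the way oh what fun it is to ride"

def exact_hits(read, reference):
    """Start positions of all exact occurrences of `read`."""
    hits, start = [], reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits

def hits_with_mismatches(read, reference, max_mismatches=1):
    """Start positions where `read` fits with at most `max_mismatches` substitutions."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
            hits.append(start)
    return hits

print(exact_hits("all the w", REFERENCE))                # unique: one position
print(exact_hits("jingle bel", REFERENCE))               # multi-mapper: several positions
print(exact_hits("what pun it is", REFERENCE))           # base error: no exact hit
print(hits_with_mismatches("what pun it is", REFERENCE)) # found when one mismatch is allowed
```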


layout: new-section

SAM/BAM/CRAM: Format for pairwise alignments

  • Plain text format (SAM)
  • Binary & compressed format (BAM/CRAM)
  • Contains a header with metadata about reference and aligner
  • Stores one alignment per line
  • May contain secondary alignments
@HD	VN:1.0	SO:coordinate
@SQ	SN:chr1	LN:197195432
[...]
@PG	ID:Bowtie	VN:1.1.2	CL:"bowtie --wrapper basic-0 --threads 4 -v 2 -m 10 -a /ifs/mirror/genomes/bowtie/mm9 /dev/fd/63 --sam"
[...]
SRR2057595.665063_CGCCG	16	chr19	3486359	255	63M	*	0	0	*	*	XA:i:0	MD:Z:63	NM:i:0	UG:i:0	BX:Z:CGCCG
SRR2057595.1043355_CGCCG	16	chr19	3486359	255	63M	*	0	0	*	*	XA:i:0	MD:Z:63	NM:i:0	UG:i:0	BX:Z:CGCCG
SRR2057595.2024535_CGCCG	16	chr19	3486359	255	63M	*	0	0	*	*	XA:i:0	MD:Z:63	NM:i:0	UG:i:0	BX:Z:CGCCG
SRR2057595.3828487_CGCCG	16	chr19	3486359	255	63M	*	0	0	*	*	XA:i:0	MD:Z:63	NM:i:0	UG:i:0	BX:Z:CGCCG
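
Each alignment line has eleven mandatory tab-separated columns followed by optional tags. As a minimal sketch of how to read one (in practice a library such as pysam or the samtools CLI would be used; the dictionary keys below are just the column names from the SAM specification):

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Split one SAM alignment line into named fields and decode a few of them."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # The flag is a bit field; these three bits are defined in the SAM specification.
    record["unmapped"] = bool(record["flag"] & 0x4)
    record["reverse_strand"] = bool(record["flag"] & 0x10)
    record["secondary"] = bool(record["flag"] & 0x100)
    # CIGAR: pairs of (length, operation), e.g. 63M = 63 aligned bases.
    record["cigar_ops"] = [(int(n), op) for n, op in
                           re.findall(r"(\d+)([MIDNSHP=X])", record["cigar"])]
    # Optional tags have the form TAG:TYPE:VALUE; values are kept as strings here.
    record["tags"] = {}
    for tag in columns[11:]:
        name, _type, value = tag.split(":", 2)
        record["tags"][name] = value
    return record

line = ("SRR2057595.665063_CGCCG\t16\tchr19\t3486359\t255\t63M\t*\t0\t0\t*\t*"
        "\tXA:i:0\tMD:Z:63\tNM:i:0\tUG:i:0\tBX:Z:CGCCG")
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["reverse_strand"], rec["cigar_ops"])
```

For the record shown above, flag 16 decodes to a read aligned to the reverse strand, and `63M` to 63 aligned bases without insertions or deletions.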

layout: new-section

Genome browsers for viewing

Mismatches that pile up at the same position may be genuine variants of the individual rather than errors (mind the ploidy)

layout: intro

Example analysis


layout: new-section

ChIP-seq: Location of DNA-binding proteins

Annotations (e.g. gene positions) aid the interpretation

layout: intro

?


layout: text-image media: '/assets/bio/igvpeaks.png' reverse: true

ChIP-seq analysis

  1. FastQ generation

    • Basecalling
    • De-multiplexing of samples
  2. Quality control

  3. Pairwise alignment

  4. Peak-calling

    Discriminate true signal from false positives
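
To make the idea of peak-calling concrete, here is a deliberately naive sketch: it flags consecutive bins whose read count is both high and enriched over a control sample. This is not how dedicated peak callers work internally (they use statistical models); the thresholds, the pseudocount and the toy counts are all assumptions for illustration.

```python
def call_peaks(chip, control, min_fold=4.0, min_reads=10, pseudocount=1.0):
    """Return (start, end) bin ranges, end exclusive, that are enriched over control."""
    peaks, start = [], None
    for i, (c, k) in enumerate(zip(chip, control)):
        enriched = c >= min_reads and (c + pseudocount) / (k + pseudocount) >= min_fold
        if enriched and start is None:
            start = i                      # a peak opens
        elif not enriched and start is not None:
            peaks.append((start, i))       # the peak closes
            start = None
    if start is not None:
        peaks.append((start, len(chip)))
    return peaks

# Hypothetical read counts per genomic bin.
chip    = [2, 3, 40, 55, 38, 4, 30, 2]
control = [2, 2,  3,  4,  2, 3, 28, 1]
print(call_peaks(chip, control))
```

Bin 6 has many ChIP reads but just as many control reads, so it is treated as background rather than signal. That is the kind of false positive a control sample helps to remove.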

layout: text-image media: '/assets/bio/motifs/helixloophelix.png' reverse: true

ChIP-seq analysis

  5. Motif analysis

Motifs for ETS1 and ET3 transcription factors
(Somewhat outdated by now)
  6. Create context from annotations

    Find nearby genes or regulatory elements
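
Finding the nearest gene for a peak is a lookup in a sorted list of positions. A minimal sketch for a single chromosome; the gene names and coordinates are hypothetical, and real analyses would also consider strand and gene bodies:

```python
import bisect

# Hypothetical annotation: (transcription start site, gene name), sorted by position.
GENES = [(1_000, "geneA"), (25_000, "geneB"), (60_000, "geneC")]
STARTS = [start for start, _ in GENES]

def nearest_gene(peak_position):
    """Return (gene name, distance) of the gene start closest to the peak."""
    i = bisect.bisect_left(STARTS, peak_position)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = GENES[max(i - 1, 0):i + 1]
    start, name = min(candidates, key=lambda gene: abs(gene[0] - peak_position))
    return name, abs(start - peak_position)

for peak in (500, 24_000, 100_000):
    print(peak, nearest_gene(peak))
```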

layout: new-section

Different methods result in different signals

Annotations (e.g. gene positions) aid the interpretation

layout: text-image media: '/assets/bio/cgv/humangenomeproject.png' caption: 'Covers from the 2001 draft sequence release' reverse: true

Reference genomes

Linear reference genomes

  • Are versioned in major (GRCh38, hg38) and minor releases (GRCh38.p14)
  • Come in different flavors
  • Are used for most applications

T2T assemblies (Human: 2022)

Pangenomes (Human: 2023)


layout: new-section

T2T vs. "regular" reference genome

Long reads filled gaps and revealed inversions

layout: intro

Workflow management systems


layout: text-window

Workflow management systems

  • Scale analyses to a large number of samples
  • Allow for parallel processing
  • Are agnostic to the compute infrastructure

A workflow (pipeline)

  • A sequence of interdependent processes
  • Outputs are consumed by other steps

::window::

 graph TD 
            A(STAR) -->|*.Aligned.out.bam| B
            A -->|*.Aligned.toTranscriptome.out.bam| B
            B(samtools sort - coordinate)
            B -->|*.sorted.bam| C
            B -->|*.transcriptome.sorted.bam| C
            C(umi-tools dedup)
            C -->|*.umi_dedup.sorted.bam| E
            C -. *.umi_dedup.sorted.bam .-> D
            C -->|*.umi_dedup.transcriptome.bam| D
            C -->|*.umi_dedup.transcriptome.bam| P
            D(samtools sort - name)
            D -->|*.umi_dedup.transcriptome.sorted.bam| S
            D -. *.umi_dedup.namesorted.bam .-> E
            E(picard MarkDuplicates)
            E -->|*.markdup.sorted.bam| F
            F(featureCounts)
            P(prepare-for-rsem.py)
            P -->|*.umi_dedup.transcriptome.filtered.bam| R
            R(RSEM)
            S(salmon)   
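
The diagram is a directed acyclic graph, and a workflow manager essentially derives the execution order from it. As a rough sketch of that idea (not of how Snakemake or Nextflow are implemented), the edges above can be fed to Python's `graphlib` (Python 3.9+), treating the dotted edges like regular dependencies, to see which steps could run in parallel:

```python
from graphlib import TopologicalSorter

# Each step maps to the steps whose outputs it consumes (edges of the diagram).
dependencies = {
    "samtools sort (coordinate)": {"STAR"},
    "umi-tools dedup": {"samtools sort (coordinate)"},
    "samtools sort (name)": {"umi-tools dedup"},
    "picard MarkDuplicates": {"umi-tools dedup", "samtools sort (name)"},
    "featureCounts": {"picard MarkDuplicates"},
    "prepare-for-rsem.py": {"umi-tools dedup"},
    "RSEM": {"prepare-for-rsem.py"},
    "salmon": {"samtools sort (name)"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose inputs are all available
    batches.append(ready)
    sorter.done(*ready)                 # mark them finished, unlocking successors

for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Steps that end up in the same batch do not depend on each other and could be scheduled at the same time.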

layout: text-window

Example

  • One process
  • Input is a list of three greetings
  • The process is run for each input

::window::

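#!/usr/bin/env nextflow
nextflow.enable.dsl=2

// A process wraps one task; it runs once per value received on its input.
process sayHello {
  input:
    val x       // one greeting from the channel
  output:
    stdout      // capture whatever the script prints
  script:
    """
    echo '$x world!'
    """
}

workflow {
  // Create a channel with three greetings, pipe each one through sayHello,
  // and print the results.
  Channel.of('Bonjour',
             'Hej',
             'Hello') | sayHello | view
}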
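
For readers more familiar with Python, the same pattern (one function applied independently to every input, potentially in parallel) might look roughly like the sketch below. It is only an analogy: `say_hello` is a hypothetical stand-in for the process, and nothing here uses Nextflow itself.

```python
from concurrent.futures import ThreadPoolExecutor

def say_hello(greeting):
    """Stand-in for the sayHello process: one independent task per input."""
    return f"{greeting} world!"

greetings = ["Bonjour", "Hej", "Hello"]

# Each greeting is handled as a separate task; the pool may run them concurrently.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(say_hello, greetings))

for line in results:
    print(line)
```

One difference worth knowing: `pool.map` returns results in input order, whereas Nextflow emits outputs as tasks finish, so the order of the three lines is not guaranteed there.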

layout: new-section

Workflow systems

Several hundred workflow systems exist, but in bioinformatics it boils down to these two:

Snakemake: domain-specific language (Pythonic)

Nextflow: domain-specific language (Groovy, Java)

Honorable mentions: Reflow, Workflow Description Language


layout: new-section

Workflow "philosophies"

Batch-processing

  • Optimised for finite batches of data (one sequencing run)
  • Snakemake, Nextflow ...

Stream-processing

  • Optimised for a constant stream of data (sensors, stock prices)
  • Kafka, Flink, Redpanda, RisingWave ...
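
The difference between the two philosophies can be illustrated with a toy computation: a batch job needs the complete dataset before it produces one answer, while a streaming job updates its answer with every incoming value. The function names and numbers below are hypothetical.

```python
def batch_mean(values):
    """Batch: the full, finite dataset is available up front."""
    return sum(values) / len(values)

def streaming_mean(stream):
    """Stream: emit an updated result after every incoming value."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

measurements = [4, 8, 6]
print(batch_mean(measurements))                  # one result at the end
print(list(streaming_mean(iter(measurements))))  # one result per arriving value
```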

layout: new-section

Workflow "philosophies"

Dataflow model (inspired)

  • Isolated processes linked by dependencies (Directed acyclic graphs)
  • Conceptually no dimension for time
  • Snakemake, Nextflow ...

Imperative

  • Specify a sequence of steps explicitly
  • Airflow, ...

Declarative data assets


layout: iframe-right url: https://nf-co.re/pipelines class: text-sm

nf-core

  • Bioinformatic workflow community
  • Nextflow pipelines
  • 93 public pipelines
  • More than 1000 modules
  • A friendly Slack space for questions
  • Watch Beginner's guide to nf-core

layout: iframe-right url: https://anvio.org class: text-sm

Anvi’o


layout: new-section

UAG (stop)

https://github.com/MatthiasZepper/Lecture-OmicsDataAnlysis