Working with ONT data

There are a few tutorials online that teach how to process ONT data. However, most of them are automated workflows that don’t detail the individual steps involved. Therefore I have decided to go through the steps myself.

Data

The first challenge is to find a dataset that is small enough for the disk space and computing power of my laptop. I decided to use the sample dataset from DeepMod2. It provides “A small example of HG002 (R10.4.1 5kHz) POD5 and FAST5 files for chr11:2650000-2750000 region, extracted from ONT Open datasets is included in the release. This region contains an imprinted control region for KCNQ1OT1.” I am working with the POD5 files.

Basecalling

  • For basecalling I’ve opted for dorado, which is developed by ONT itself.
  • Alternatively, there is the older basecaller called guppy.

Installation

Got the newest version from their GitHub releases.

Unzip the archive and add its bin directory to your PATH (substitute the version you actually downloaded):

tar -xzf dorado-<version>-linux-x64.tar.gz
export PATH="$PWD/dorado-<version>-linux-x64/bin:$PATH"

For the actual basecalling, in theory a computing GPU is needed, which I don’t have. So I have to specify that I want to use whatever I do have, which is the CPU:

gamma@Gamma dorado basecaller -x 'auto' hac,5mCG_5hmCG /mnt/c/Users/negin/Projects/ont/sample/sample.pod5 > calls.bam

My laptop has the following specs:

Model name: Intel(R) Core(TM) Ultra 7 155U
CPU family: 6
Model: 170
Thread(s) per core: 2
Core(s) per socket: 7

100% in 38 minutes - better than expected.

Samtools

Installation

Samtools has a lot of dependencies, so pay attention during installation or it’s gonna backfire. Here is the link.

Samtools makes it possible to check whether a file is corrupted:

gamma@Gamma samtools quickcheck calls.bam && echo "BAM looks okay"

And to have a quick look at the basecalls themselves:

$ samtools view -h yourfile.bam

Two tags in the output carry the methylation information:

  • ML B,C: base modification probabilities
  • MM Z: base modifications / methylation
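To get a feel for these two tags, here is a minimal Python sketch of how they encode modification calls (the tag values below are made up, and real MM sub-tags can carry extra flags that this ignores):

```python
def ml_to_probability(v: int) -> float:
    """An ML byte v (0-255) encodes the probability interval
    [v/256, (v+1)/256); the midpoint is a common point estimate."""
    return (v + 0.5) / 256

def parse_mm(mm: str):
    """Parse one simple MM:Z sub-tag like 'C+m,1,0,3;' into canonical base,
    strand, modification code, and skip counts (number of unmodified
    canonical bases to skip before each call)."""
    body = mm.rstrip(";")
    head, *skips = body.split(",")
    base, strand, mod = head[0], head[1], head[2:]
    return base, strand, mod, [int(s) for s in skips]

# Hypothetical example: three 5mC calls on the forward strand.
base, strand, mod, skips = parse_mm("C+m,1,0,3;")
probs = [round(ml_to_probability(v), 3) for v in (230, 12, 250)]
print(base, strand, mod, skips)  # C + m [1, 0, 3]
print(probs)                     # [0.9, 0.049, 0.979]
```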

Demultiplexing

  • If you want to sequence multiple samples at once, you first barcode your samples by ligating adapters onto your DNA.
  • Multiplexed data = sequencing multiple barcoded samples together in one run.
  • Demultiplexing = sorting the reads back into separate samples after sequencing.
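The sorting step can be sketched in a few lines of Python (read IDs and barcode names are made up; real demultiplexers first classify each read by its barcode sequence):

```python
from collections import defaultdict

# Minimal sketch of the demultiplexing idea: group reads by the barcode
# they were classified as. All IDs and barcode names are hypothetical.
reads = [
    ("read_001", "barcode01"),
    ("read_002", "barcode02"),
    ("read_003", "barcode01"),
]

samples = defaultdict(list)
for read_id, barcode in reads:
    samples[barcode].append(read_id)

print(dict(samples))
# {'barcode01': ['read_001', 'read_003'], 'barcode02': ['read_002']}
```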

Quality Control (QC)

Nanoplot

Installation

$ pip install NanoPlot

WSL Ubuntu forbids this (externally managed environment), therefore I needed to install it into its own venv via pipx:

pipx install NanoPlot
pipx ensurepath
source ~/.bashrc

QC on basecalled reads

  • Doesn’t support QC on POD5 files, therefore either use FASTQ files or do the basecalling first and then use the BAM files.
  • In this case I used the unaligned bam files.
  • Apparently this is the point where I have to give up my resistance concerning installing Chrome! Argh…

NanoPlot --ubam /path/to/bam/file/calls.bam -o /qc/nanoplot

Then you get information in the form of graphs.

Cramino

Installation

Ouch.

  • Is very fast and can print some basic information about a BAM or FASTQ file in the terminal.
  • Successor to NanoStat.

Type:

cramino --ubam calls.bam

The output looks like this:

File name       example.cram
Number of reads 14108020
% from total reads  83.45
Yield [Gb]      139.91
N50     17447
Median length   6743.00
Mean length     9917
Median identity 94.27
Mean identity   92.53
Path    alignment/example.cram
Creation time   09/09/2022 10:53:36

To create a histogram of the lengths of the reads:

cramino --ubam calls.bam --histo output.txt
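For reference, the N50 that cramino reports can be computed like this (my own sketch, not cramino’s actual code):

```python
def n50(lengths):
    """N50: the read length at which reads of that length or longer
    together contain at least half of the total bases."""
    total = sum(lengths)
    acc = 0
    for length in sorted(lengths, reverse=True):
        acc += length
        if acc >= total / 2:
            return length

# Hypothetical read lengths (bases): total 24000, half-point reached
# after the 10000 and 5000 base reads.
print(n50([2000, 2000, 2000, 3000, 5000, 10000]))  # 5000
```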

Phred quality score

= a measure of the quality of the identification of the nucleobases generated by automated sequencing

  • the FASTQ format encodes Phred scores alongside the read sequences
  • Phred quality scores \( Q \) are logarithmically related to the base-calling error probabilities \( P \) and are defined as:

$$ Q = -10 \log_{10}(P) $$

  • because the scale is logarithmic, each +10 in \( Q \) means a 10× lower error probability
  • so if Phred assigns a quality score of 40 to a base, the probability that this base is called incorrectly is 1 in 10,000, which means the base-calling accuracy is 99.99%

For example, if Phred assigns a quality score of 30 to a base, the chances that this base is called incorrectly are 1 in 1,000. In this case, 58 of the 59 total reads (98.3%) were >Q10.
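The conversion can be sketched in a few lines of Python (function names are my own):

```python
import math

def phred_to_error(q: float) -> float:
    """Base-calling error probability P from Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Phred score Q from error probability P: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

print(phred_to_error(40))   # 0.0001 -> 1 in 10,000, i.e. 99.99% accuracy
print(phred_to_error(30))   # 0.001  -> 1 in 1,000
print(error_to_phred(0.0001))  # Q for a 1-in-10,000 error, i.e. 40
```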

Adapter removal

  • E.g. with Porechop (abandonware…). I don’t know whether I actually need that.

Trimming with dorado while basecalling

  • trimming can be performed while basecalling

  • depending on the sequencing kit used (normally embedded as metadata within pod5 file) any adapter or primer sequence will be found and removed

  • this functionality can be controlled using either the --trim or --no-trim option of dorado basecaller.

  • there are multiple options for trimming

    • all --> trims adapters, primers and barcodes
    • adapters --> only trims adapters
    • none --> equivalent to --no-trim
  • Trimming of existing datasets can be performed as follows:

dorado trim calls.bam --sequencing-kit <kit> > trimmed.bam

I don’t know whether my dataset needs trimming, but I couldn’t do it anyway because the sequencing kit used is not provided.

Read trimming and filtering

E.g. with NanoFilt. Technically one could

  • remove all sequences shorter than 500 nucleotides
  • trim the first 10 nucleotides off all reads

but in this case this is not necessary/beneficial.
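The two steps can be sketched on plain sequence strings (hypothetical reads; real tools like NanoFilt operate on FASTQ records and trim the quality strings too):

```python
# Minimal sketch of length filtering plus head trimming.
MIN_LENGTH = 500   # drop reads shorter than this
TRIM_HEAD = 10     # cut this many nucleotides off the start

def filter_and_trim(reads):
    return [r[TRIM_HEAD:] for r in reads if len(r) >= MIN_LENGTH]

reads = ["A" * 600, "C" * 100, "G" * 500]  # lengths 600, 100, 500
kept = filter_and_trim(reads)
print([len(r) for r in kept])  # [590, 490] - the 100-base read was dropped
```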

Alignment

My goal was at first to do the alignment using minimap2 directly, because I had read about calling minimap2 through dorado that “Not all arguments from minimap2 are currently available and parameter names are not finalized and may change.”
This however didn’t work out, because minimap2 requires FASTQ input. This means that even if I were to convert my BAM file into a FASTQ file using samtools, I would be able to salvage the basecalls but not the modifications (methylation). This problem was also discussed in this forum.
So the way to go was simultaneous basecalling, modification calling and alignment using dorado.

To accomplish this you need a reference genome. I chose T2T, a whole-genome assembly published in 2022.

Then I first tried to do the two steps (basecalling (with modifications) + alignment) simultaneously:

gamma@Gamma:~/sample$ dorado basecaller -x 'auto' hac,5mCG_5hmCG /mnt/c/Users/negin/Projects/ont/sample/sample.pod5 --reference /mnt/c/Users/negin/Projects/ont/sample/T2T/ncbi_dataset/data/GCF_009914755.1/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna > calls_metyhlation_alignment.bam

This didn’t work, so I did the two steps separately. The basecalling was unproblematic, as already documented above. The problem occurred when doing the alignment:

gamma@Gamma:~/sample$ dorado aligner /mnt/c/Users/negin/Projects/ont/sample/T2T/ncbi_dataset/data/GCF_009914755.1/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna /mnt/c/Users/negin/Projects/ont/sample/modified_calls.bam > calls_metyhlation_alignment.bam

After a couple of minutes the process terminated, and the only output was: Killed.

So I had a look at the dmesg to find out why exactly it got killed and it returned the following:

[97249.745953] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-tmux.slice/tmux-spawn-1fe34f6e-3b1d-4553-99c2-a52528bbc98d.scope,task=dorado,pid=169042,uid=1000
[97249.746142] Out of memory: Killed process 169042 (dorado) total-vm:23649088kB, anon-rss:7223424kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:32932kB oom_score_adj:0

This confirms that the process was killed because it ran out of memory. The requested memory is total-vm:23649088kB ≈ 23 GB. The resident memory on my WSL Ubuntu machine however is only anon-rss:7223424kB ≈ 7 GB.
So logically I thought about allocating more memory to my WSL or running the command from Windows, however that wouldn’t be sufficient, as the machine only has 16 GB RAM.
So I searched for solutions and only found a GitHub issue that described this very problem and offered the --resume option to overcome the urge to kill. I couldn’t really find this in the documentation (only --resume-from, which I think does something else). So the only thing left to do would be memory management by splitting up the data into smaller, more manageable parts.
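As a sanity check on those numbers (kernel memory counters are in KiB):

```python
# Convert the dmesg kB (KiB) figures to GiB.
KIB_PER_GIB = 1024 * 1024

total_vm_kb = 23649088   # requested virtual memory (total-vm)
anon_rss_kb = 7223424    # resident memory (anon-rss)

print(round(total_vm_kb / KIB_PER_GIB, 1))  # 22.6 -> the ~23 GB requested
print(round(anon_rss_kb / KIB_PER_GIB, 1))  # 6.9  -> the ~7 GB resident
```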

My file sizes are currently:
Raw pod5 data ~ 23000 KB
Modified basecalls ~ 1900 KB

One dinner later and the solution has been found! The file size that actually matters is that of the reference I am giving dorado, which was the entire genome. Because I know that my data is from chromosome 11, I went ahead and downloaded only chromosome 11 of T2T, and then the alignment ran with absolutely no problems (finished in 606 ms, lol).

Visualize BAM files

For visualisation of a BAM file a viewer like IGV can be used. But first the BAM file needs to be prepared:

Sorting

samtools sort -o calls_metyhlation_alignment_sorted.bam calls_metyhlation_alignment.bam                      

Index

samtools index calls_metyhlation_alignment_sorted.bam

This creates a .bai file.

IGV

These two files can then be uploaded under “Tracks” in the web application of the Integrative Genomics Viewer from the Broad Institute. The reference genome can also be uploaded or, if already available, simply selected under “Genome”. For more information have a look at the documentation.

Modkit

For installation see here. The documentation can be found here.

I want to extract the entire methylation profile and create a bedMethyl table. So I only need to use the most basic command:

modkit pileup calls_metyhlation_alignment_sorted.bam output_calls_metyhlation_alignment_sorted.bed --log-filepath pileup.log 

Please make sure that you use a sorted BAM file as input and have already created an index file (see the last chapter).

The bedMethyl table can be read like this:

 1. chrom (str): name of the reference sequence from the BAM header
 2. start position (int): 0-based start position
 3. end position (int): 0-based exclusive end position
 4. modified base code (str): single-letter code for the modified base: m = 5-methylcytosine (5mC), h = 5-hydroxymethylcytosine (5hmC)
 5. score (int): equal to Nvalid_cov
 6. strand (str): ‘+’ for positive strand, ‘-’ for negative strand, ‘.’ when strands are combined
 7. start position (int): included for compatibility
 8. end position (int): included for compatibility
 9. color (str): included for compatibility, always 255,0,0
10. Nvalid_cov (int): valid coverage, Nvalid_cov = Nmod + Nother_mod + Ncanonical; also used as the score in bedMethyl
11. fraction modified (float): fraction of modified reads, Nmod / Nvalid_cov
12. Nmod (int): number of calls passing filters that were classified as a residue with the specified base modification
13. Ncanonical (int): number of calls passing filters classified as the canonical base rather than modified
14. Nother_mod (int): number of calls passing filters classified as modified, but where the modification differs from the listed base (the canonical base is the same)
15. Ndelete (int): number of reads with a deletion at this reference position
16. Nfail (int): number of calls where the probability of the call was below the threshold
17. Ndiff (int): number of reads with a base other than the canonical base for this modification
18. Nnocall (int): number of reads aligned to this reference position with the correct canonical base but without a base modification call
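A single bedMethyl line can be picked apart accordingly (the field names and the example line below are my own; modkit itself only writes the tab-separated values):

```python
# Labels for the 18 bedMethyl columns, in order (names are my own shorthand).
FIELDS = [
    "chrom", "start", "end", "mod_code", "score", "strand",
    "start2", "end2", "color", "n_valid_cov", "frac_modified",
    "n_mod", "n_canonical", "n_other_mod", "n_delete", "n_fail",
    "n_diff", "n_nocall",
]

# Hypothetical line: a 5mC site on chr11 with 20x valid coverage.
line = "chr11\t2650100\t2650101\tm\t20\t+\t2650100\t2650101\t255,0,0\t20\t85.00\t17\t3\t0\t0\t1\t0\t0"
record = dict(zip(FIELDS, line.split("\t")))

# fraction modified = Nmod / Nvalid_cov
frac = int(record["n_mod"]) / int(record["n_valid_cov"])
print(record["chrom"], record["mod_code"], frac)  # chr11 m 0.85
```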

Useful websites

This is good: https://scienceparkstudygroup.github.io/rna-seq-lesson/04-bioinformatic-workflow/index.html

Leukemia Dataset

Working with Google Colab

To use a GPU on Google Colab you need to connect to a GPU runtime. In my case only the Tesla T4 was free, so I connected to it. Basecalling ran smoothly and only took several minutes.
Be aware that you can be disconnected from the GPU runtime and not allowed to reconnect for a certain unknown amount of time, in my case 1 day.

Here is the link to the Google Colab notebook with all the commands I used.

Alignment

Alignment wasn’t possible on Google Colab because of insufficient memory. Therefore I used my own laptop. It was necessary to allow WSL to use my entire RAM and also to increase the swap memory, which is now at about 51 GB for WSL.

However, before doing this I tried to reduce the amount of memory needed by adjusting the parameters for alignment. I read in some forums that in minimap2 you can reduce the index batch size with -I, an option that can be passed through to dorado aligner via --mm2-opts. However I am not sure about this.

dorado aligner -t 1 --mm2-opts "-I 0.1k" /mnt/c/Users/negin/Projects/ont/sample/T2T/ncbi_dataset/data/GCF_009914755.1/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna /mnt/c/Users/negin/Projects/ont/leukemia/methylated_calls.bam > calls_metyhlation_alignment.bam
[2025-10-13 20:57:36.018] [info] Running: "aligner" "-t" "1" "--mm2-opts" "-I 0.1k" "/mnt/c/Users/negin/Projects/ont/sample/T2T/ncbi_dataset/data/GCF_009914755.1/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna" "/mnt/c/Users/negin/Projects/ont/leukemia/methylated_calls.bam"
[2025-10-13 20:57:36.030] [info] num input files: 1
[2025-10-13 20:57:36.030] [info] > loading index /mnt/c/Users/negin/Projects/ont/sample/T2T/ncbi_dataset/data/GCF_009914755.1/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
[2025-10-13 21:01:30.582] [info] processing /mnt/c/Users/negin/Projects/ont/leukemia/methylated_calls.bam -> -
[2025-10-13 21:01:30.653] [info] > starting alignment
[2025-10-13 23:29:40.537] [info] > finished alignment
[2025-10-13 23:29:40.691] [info] > Finished in (ms): 8889836
[2025-10-13 23:29:40.691] [info] > Reads written: 43561
[2025-10-13 23:29:40.691] [info] > total/primary/unmapped 2016386/43256/305
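Those summary numbers can be cross-checked quickly (assuming, as the numbers suggest, that each read is written once as either a primary alignment or unmapped, and the remaining records are secondary/supplementary alignments):

```python
# Figures from the dorado aligner log above:
# "Reads written: 43561" and "total/primary/unmapped 2016386/43256/305".
reads_written = 43561
total, primary, unmapped = 2016386, 43256, 305

print(primary + unmapped == reads_written)       # True: every read accounted for once
print(total - (primary + unmapped))              # extra (secondary/supplementary) records
print(round(100 * unmapped / reads_written, 2))  # percent of reads left unmapped
```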

Sort and Index

Next the aligned bam file has to be sorted and indexed before it can be processed by MARLIN.

samtools sort -o calls_metyhlation_alignment_sorted.bam calls_metyhlation_alignment.bam                      
samtools index calls_metyhlation_alignment_sorted.bam

Calling MARLIN

Before calling MARLIN, read the README and follow the instructions.

Then go into the directory with the bam file you want to classify and run:

bash /mnt/c/Users/negin/Projects/MARLIN/MARLIN_realtime/0_real_time_prediction_main.sh

FYI my file naming:

  • methylated_calls
  • methylated_calls_11
  • methylated_calls_22