PacBio Error Plots: Base + Homopolymer length
Plots showing the number of correct-identity bases versus the number of
incorrect-identity bases for every base and homopolymer length.
This is based on sequencing a single HIV genome. I look at alignment
regions of the form x+y^n+z, where x!=y and y!=z, i.e. a run of n
copies of base y flanked on both sides by a different base.
I then count the number of correct bases (y) versus the number of
incorrect bases (!y). For all regions, I sum the counts and plot in a
matrix that has 14 rows and 17 columns. The rows represent 0:13
correct bases and columns represent 0:16 incorrect bases.
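The tally described above could be sketched roughly as follows. This is
only an illustration, not the code used to make the plots: the function
names, the '-' gap character, and the clamping of out-of-range counts
into the last row or column are all assumptions. It takes the read
bases aligned to each run of one base and length, and fills one
14 x 17 matrix.

```python
from itertools import groupby

N_ROWS, N_COLS = 14, 17  # rows: 0..13 correct bases, columns: 0..16 incorrect bases


def homopolymer_runs(ref):
    """Yield (start, base, length) for each maximal run of one base in ref."""
    pos = 0
    for base, group in groupby(ref):
        length = sum(1 for _ in group)
        yield pos, base, length
        pos += length


def tally(observations):
    """Build one count matrix for a single (base, run length) class.

    observations: iterable of (base, segment) pairs, where segment is the
    string of read bases aligned to one reference run of `base`.
    Assumption: '-' marks an alignment gap and is counted as neither
    correct nor incorrect.
    """
    matrix = [[0] * N_COLS for _ in range(N_ROWS)]
    for base, segment in observations:
        correct = sum(1 for b in segment if b == base)
        incorrect = sum(1 for b in segment if b != base and b != "-")
        # Assumption: counts beyond the matrix bounds go in the last row/column.
        row = min(correct, N_ROWS - 1)
        col = min(incorrect, N_COLS - 1)
        matrix[row][col] += 1
    return matrix


# Hypothetical observations for T runs of length 1.
example = tally([("T", "T"), ("T", "TT"), ("T", "-"), ("T", "TA")])
```

Here example[1][0] counts the reads that showed exactly one correct T
and nothing else, which is the cell expected to dominate for runs of
length 1.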
For example, the first plot shows that for T-bases of length 1
surrounded by bases on left and right that are not T, you are most
likely to observe a single correct T base and no other bases. There is
some chance you miss the base or see an extra correct base (the cells
just above and below the highest one), but you still don't observe many
incorrect bases (partly due to the way I tally).
Note G's have more variance in general.
Colored plots stand in for missing data (this sample has no runs of C
of length 5 or greater).
These error matrices can be used to detect minor variants and mitigate
noisy haplotypes using conditional random fields (CRFs) and linear
system theory.
[Figure: grid of error-matrix plots. Columns: bases T, G, C, A. Rows: homopolymer lengths 1 to 7.]