goal
goal: summarize results of BCR-ABL variant finding on a larger number
of barcoded patient samples.
summary
- METHODS: Sequence 862 base region of kinase domain from BCR-ABL
using nested PCR followed by barcoded PacBio RS sequencing for 54
clinical samples from 36 CML patients. Look for amino acid variants to
the reference sequence. Run two chips for technical repeats, although
all results only examine one chip at a time. The PacBio RS sequences
continuous single molecules spanning the entire region.
Here is workflow overview bcrablWorkflow.pdf
- 106 known SEQUENCING variants. PacBio finds all of them. A threshold
of 6e-08 would give all of the sequencing mutations with no false
negatives. knownmutsSEQ.merge.tsv
- 131 known SEQUENOM variants. PacBio finds all of them. A p-value
threshold of 1.03e-03 would give all known sequenom mutations with no
false negatives. knownmutsSEQUENOM.merge.tsv
- Note that the stricter p-value of 6e-08 would miss 8 SEQUENOM
variants (that have p-values >6e-08 but <1.03e-03).
- Taking the more strict pvalue threshold of 6e-08, I see 485 novel
variants (327 non-silent) that occur in BOTH technical repeat
chips occurring with abundances 0.2% to 86.15%. novelPacBioVariants.tsv.xls
- We can tell when mutations occur together in compound mutations and
sort by frequency, bcrabl.compoundvariants.runs20-21c24-37.tsv.xls
- We can identify compound mutations at early time points at low
levels that eventually dominate, suggesting early clinical
intervention strategies.
samples, runs, and barcodes
We ran 54 samples in duplicate. Patient SampleIDs were barcoded and
run. The run table is: map2.tsv
Barcode yields are summarized:
limsid thresh ccsnum barcodenum barcodeYield barcodeHitNames barcodeHitCounts
2450177-0020 25 36722 29990 0.816676651598497 F0 F1 F2 F3 F4 F6 3172 3756 2787 9906 3185 7184
2450177-0021 25 37465 30593 0.816575470439077 F0 F1 F2 F3 F4 F6 3159 3809 2855 10238 3287 7245
2450177-0024 25 22911 17812 0.777443149578805 F0 F1 F2 F3 F4 F5 F6 F7 1187 1374 1898 5502 2669 1640 1723 1819
2450177-0025 25 22382 17575 0.785229202037351 F0 F1 F2 F3 F4 F5 F6 F7 1160 1463 1824 5301 2670 1613 1763 1781
2450177-0026 25 21233 16909 0.796354730843498 F0 F1 F2 F3 F4 F5 F7 5215 1167 1528 3010 1182 3818 989
2450177-0027 25 22824 18270 0.800473186119874 F0 F1 F2 F3 F4 F5 F7 5381 1362 1621 3241 1355 4191 1119
2450177-0028 25 17554 14646 0.834339751623562 F0 F1 F2 F3 F4 F5 F6 F7 F8 1588 1917 1601 1110 2301 2395 1887 1846 1
2450177-0029 25 19417 16104 0.829376319719833 F0 F1 F2 F3 F4 F5 F6 F7 1813 2162 1709 1241 2421 2586 2041 2131
2450177-0030 25 17045 12845 0.753593429158111 F0 F1 F2 F3 F4 F5 F6 F7 2437 2576 2175 2900 299 356 902 1200
2450177-0031 25 17290 13003 0.752053209947947 F0 F1 F2 F3 F4 F5 F6 F7 2449 2698 2221 2911 290 330 880 1224
2450177-0032 25 19790 15977 0.807326932794341 F1 F2 F3 F4 F5 F6 2436 2523 2047 1461 4905 2605
2450177-0033 25 19487 15870 0.81438907989942 F1 F2 F3 F4 F5 F6 2390 2524 2042 1426 4826 2662
2450177-0034 25 17284 14822 0.857556121268225 F0 F1 F2 F4 F5 F6 F7 1412 3990 1525 1847 1972 2389 1687
2450177-0035 25 20820 17708 0.850528338136407 F0 F1 F2 F4 F5 F6 F7 1677 4775 1844 2351 2290 2771 2000
2450177-0036 25 12859 10474 0.814526790574695 F0 F1 F2 F3 F4 3925 1879 1263 2391 1016
2450177-0037 25 12341 9997 0.810064014261405 F0 F1 F2 F3 F4 3890 1693 1134 2216 1064
Looks decent with ~80% yields, ~15k barcode reads per chip, sample
coverage 300 to 10,240 aligned reads. -0028,29 contains one false F8
barcode call. -0030,31 contains errant F4.
known variants
We were given a list of known variants by other technology (sequencing
and sequenom): Mutations by seq and MA 130612_Rev.xlsx
I cleaned this into two normalized tables:
knownmutsSEQ.norm.csv
knownmutsSEQUENOM.norm.csv
The headers are:
"key": run.barcode variant key for joining tables
"SampleID": your patient SampleID
"TestGroup": a pacbio run group
"PtInitials": your patient initials
"CollDate": your collection date
"Muts": your variant
"Frac": your estimate of fractional abundance
"runbc": pacbio run.barcode
"one": analysis flag
"poi": amino-acid position of interest.
"coi": codon of interest at poi in the reference
"Qcodon": query codon. the variant
"fracNull": the fraction of the variant seen in the null control run
"countNull": the number of times the variant is seen in the null control run
"SNull": the sum total of observations in the null control run at that position
"fracQAtP": the fraction of the variant seen in the sample
"countQAtP": the number of times the variant is seen in the sample
"SQAtP": the sum total of observations in the sample at that position
"pval": fisher exact test of sample counts to null counts: ((countQAtP, SQAtP),(countNull,SNull))
"mutation": the variant
"limsID": pacbio run id
"barcode": pacbio barcode
"Sample.ID": your patient SampleID
theshold estimate
In order to set an appropriate threshold to separate noise from true
variant results, I joined the known variants by the PacBio variant
estimates. A more stringent threshold of 6.0e-08 was chosen.
Here is the joined known variants with PacBio data:
knownmutsSEQ.merge.tsv
knownmutsSEQUENOM.merge.tsv
This 6.0e-08 threshold yields all SEQUENCING variants and all SEQUENOM
variants except 8.
Here are the missed SEQUENOM variants at the 6.0e-08 threshold:
----
key SampleID TestGroup PtInitials CollDate Muts Frac
42 2450177-0021.F1 g250e 2 1 laj 1/6/05 g250e x%
44 2450177-0021.F1 l248v 2 1 laj 1/6/05 l248v x%
85 2450177-0024.F6 f359c 15 1 map 6/6/05 f359c x%
104 2450177-0025.F1 m244v 10 1 ksb 30/3/05 m244v x%
161 2450177-0027.F0 l248v 17 1 sp 23/6/05 l248v x%
206 2450177-0029.F4 t315i 66 4 mz 26/7/06 t315i x%
279 2450177-0034.F2 m351t 89 4 lys 27/5/05 m351t x%
289 2450177-0034.F6 v299l 93 4 dvd 11/1/07 v299l x%
runbc one poi coi Qcodon fracNull countNull SNull fracQAtP
42 2450177-0021.F1 1 250 ggg gag 0.00028 8 28889 0.00469
44 2450177-0021.F1 1 248 ctg gtg 0.00010 1 10000 0.00240
85 2450177-0024.F6 1 359 ttc tgc 0.00022 4 18461 0.00317
104 2450177-0025.F1 1 244 atg gtg 0.00010 1 10000 0.00687
161 2450177-0027.F0 1 248 ctg gtg 0.00010 1 10000 0.00172
206 2450177-0029.F4 1 315 act att 0.00392 164 41835 0.00999
279 2450177-0034.F2 1 351 atg acg 0.00026 11 41782 0.00850
289 2450177-0034.F6 1 299 gtg ttg 0.00028 12 42241 0.00450
countQAtP SQAtP pval mutation limsID barcode Sample.ID key2
42 10 2132 6.28e-08 g250e 2450177-0021 F1 2 2 g250e
44 7 2918 1.94e-04 l248v 2450177-0021 F1 2 2 l248v
85 4 1263 9.61e-04 f359c 2450177-0024 F6 15 15 f359c
104 8 1164 1.17e-07 m244v 2450177-0025 F1 10 10 m244v
161 7 4074 1.02e-03 l248v 2450177-0027 F0 17 17 l248v
206 19 1901 7.25e-04 t315i 2450177-0029 F4 66 66 t315i
279 5 588 2.03e-06 m351t 2450177-0034 F2 89 89 m351t
289 9 2002 1.45e-07 v299l 2450177-0034 F6 93 93 v299l
----
There is the possibility that a 6.0e-08 threshold is too stringent and
that a larger, more-lenient threshold would have good
performance. However, we would have to validate a larger number of
predictions to see if they are true or false positives. Also, we can
possibly boost statistics by combining technical repeat runs together.
novel variants
Taking the more strict pvalue threshold of 6e-08, I see 485 novel
variants (327 non-silent) that occur in BOTH technical repeat chips
occuring with abundances 0.2% to
86.15%. novelPacBioVariants.tsv.xls
Here is a large abundance novel variant in both technical repeats:
key one poi coi Qcodon fracNull countNull SNull
98846 2450177-0034.F2 t240a 1 240 acg gcg 0.00015 6 41248
104291 2450177-0035.F2 t240a 1 240 acg gcg 0.00015 6 41248
fracQAtP countQAtP SQAtP pval mutation limsID barcode Sample.ID
98846 0.86150 734 852 0 t240a 2450177-0034 F2 89
104291 0.86089 854 992 0 t240a 2450177-0035 F2 89
key2
98846 89 t240a
104291 89 t240a
A next step is to validate these novel variants to see if they are
real.
One can also allow for significant variants that don't occur in both
technical repeat chips. This might identify some very low abundance
variants.
----
Note that in order to simplify results, I took the most significant
technical repeat and multiple codon encoding.
Here is a case where a f317l is spread across multiple codons:
key SampleID TestGroup PtInitials CollDate Muts Frac
250 2450177-0036.F2 f317l 97 4 eec 27/12/05 f317l 30%
251 2450177-0036.F2 f317l 97 4 eec 27/12/05 f317l 30%
252 2450177-0036.F2 f317l 97 4 eec 27/12/05 f317l 30%
253 2450177-0036.F2 f317l 97 4 eec 27/12/05 f317l 30%
runbc one poi coi Qcodon fracNull countNull SNull fracQAtP
250 2450177-0036.F2 1 317 ttc tta 1e-04 2 19818 0.38975
251 2450177-0036.F2 1 317 ttc ctc 1e-04 1 10000 0.20143
252 2450177-0036.F2 1 317 ttc cta 1e-04 1 10000 0.00358
253 2450177-0036.F2 1 317 ttc ttg 1e-04 1 10000 0.00596
countQAtP SQAtP pval mutation limsID barcode Sample.ID
250 327 839 0.00e+00 f317l 2450177-0036 F2 97
251 169 839 6.30e-180 f317l 2450177-0036 F2 97
252 3 839 1.76e-03 f317l 2450177-0036 F2 97
253 5 839 1.58e-05 f317l 2450177-0036 F2 97
In this case, a single codon = 39% abundance but two codons=59%
abundance.
compound mutations
We can look at how mutations occur compounded together because we
sequence a continuous 862-base region from single molecules.
Here is a summary of all compound mutations: bcrabl.compoundvariants.runs20-21c24-37.tsv.xls
For this summary, any mutation seen 10 or less times is not trusted
and discarded from compound mutation analysis.
Among 47 samples where >1 mutation was detectable by direct sequencing
or MassARRAY, SMRT sequencing revealed that 40 (85%) had compound
mutations detectable at a frequency of >1%. In total, we detected 73
different compound mutations at a frequency of ≥1%. In all cases
where compound mutations were detected and more than one treatment
timepoint was available, at least one compound mutation clearly
evolved from a mutation detectable at a prior timepoint. In the most
complex case, 4 separate mutations yielded 8 different mutant alleles.
time evolution
Patient CSY has 3 time points: F3=22/1/08, F2=28/3/06, F1=21/3/05 in
run 2450177-0032.
Show all compound mutations greater than 1% abundance:
---- F1:
26 0.010874 p230p.cca,f359c.tgc
606 0.253450
1325 0.554161 f359c.tgc
---- F2:
58 0.023529 t315i.att,f359c.tgc
334 0.135497
1006 0.408114 f359c.tgc
---- F3:
21 0.010479 t315i.att,a350a.gct,e352d.gat
21 0.010479 t315i.att,a350a.gct
36 0.017964 t315i.att,e352d.gat,f359c.tgc
65 0.032435
78 0.038922 t315i.att,a350a.gct,f359c.tgc
118 0.058882 t315i.att,a350a.gct,e352d.gat,f359c.tgc
140 0.069860 f359c.tgc
244 0.121756 t315i.att
903 0.450599 t315i.att,f359c.tgc
In 2005 there is a f359c mutation (55%). In 2006, there is also a 2%
t315i.att,f359c.tgc. In 2008, t315i.att,f359c.tgc domainates (45%),
followed by t315i (12%), and f359c (7%).
Perhaps this could have been caught in 2006 rather than 2008.
t315i compound mutations
Here are all mutations that compound with t315i at greater than 1%,
where the integer indicates the number of chips runs where the
compound mutation was observed.
1 d276e.gag,t315i.att
1 t315i.att,a350a.gct
1 t315i.att,d482d.gat
2 e255v.gtg,t315i.att
2 g250e.gag,t315i.att
2 t315i.att,a350a.gct,e352d.gat
2 t315i.att,a350a.gct,e352d.gat,f359c.tgc
2 t315i.att,a350a.gct,f359c.tgc
2 t315i.att,e352d.gat,f359c.tgc
2 t315i.att,f359v.gtc
2 t315i.att,h396r.cgt
2 t315i.att,m351t.acg
2 t315i.att,m458i.att
2 t315i.att,s417y.tac
2 t315i.att,w478l.ttg
2 y253h.cac,t315i.att
4 e255k.aag,t315i.att
4 e279k.aag,t315i.att
4 t315i.att,f359c.tgc
29 t315i.att
We can also highlight the most commonly independently observed
mutations in the 2HYY structure:
You can also look at how samples and variants co-occur:
future: mutual information, structural variants, CLIA test
To determine the dependence on one variant to another, we can compute
mutual information between all pairs. This quantifies whether one
variant always occurs with another but might be hindered by smaller
sample size.
This analysis examined amino-acid variants to a known reference. We
have also observed more structural events like ~150-base deletions in
the reads. A characterization of these structural events to clinical
status might be very interesting.
The PacBio assay appears to have sensitivity comparable to SEQUENOM
and SEQUENCING and additionally gives the entire 862-base
domain. Would this be a valuable clinical CLIA-type assay?