goal: collect together results for bcr-abl paper
================================
Summary:
I give below some results for the bcr-abl paper.
- reduced and clinical merged data table, reduced data table, combined data table
- reproducibility of variant positions that are significant in both
replicates.
- number of mutations versus the number of silent mutations
- look at number of variant positions observed versus number of
compounds > 1%
- time evolution plots of variant positions
- interesting compound variants for CSY and EEC
================================
Combined, reduced, merged variant data table:
I have three sets of data tables (.txt files containing tab-separated
data). All give the same information but have been reduced and
merged in different ways.
1) reduced and merged with clinical: reduced pacbio data merged with UCSF clinical data (hopefully up-to-date)
2) reduced: this is pacbio data but all results reduced and grouped by patient
3) combined: this is all pacbio data in full form
For each set I give the list of variants, the list of variants > 1%, and
the list of compound variants > 1%.
---- 1) reduced and merged with clinical:
mutationCollapseMergeClinical.txt for each sampleID the list of mutations
significant in both runs with clinical
mutationCollapseMerge1perClinical.txt for each sampleID the list of mutations
significant in both runs and > 1% with clinical
compoundmutationCollapseMerge1perClinical.txt for each sampleID the
list of compound mutations observed at > 1% with clinical
---- 2) reduced:
mutationCollapseMerge.txt for each sampleID the list of mutations
significant in both runs
mutationCollapseMerge1per.txt for each sampleID the list of mutations
significant in both runs and > 1%
compoundmutationCollapseMerge1per.txt for each sampleID the
list of compound mutations observed at > 1%
---- 3) combined:
NEW.bcrabl.variants.tsv.xls
NEW.bcrabl.variants.significantInBoth.tsv.xls
NEW.bcrabl.compoundvariants.tsv.xls
Here is the data table from the paper
bcrabl.tsv
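
A minimal sketch of loading one of these tab-separated tables in R and
keeping the rows above 1%. The column names (sampleID, variant, fraction)
and the toy rows are assumptions for illustration only; the real files may
use different headers.

```r
# Toy stand-in for one of the tab-separated tables.
# Column names and values are assumed, not taken from the real files.
toy <- data.frame(
  sampleID = c("S1", "S1", "S2"),
  variant  = c("t315i.att", "f359c.tgc", "g250e.gag"),
  fraction = c(0.004, 0.25, 0.02),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".txt")
write.table(toy, path, sep = "\t", quote = FALSE, row.names = FALSE)

# read.delim defaults to tab-separated input with a header row.
dat <- read.delim(path, stringsAsFactors = FALSE)

# Keep variants observed at more than 1% abundance.
above1 <- dat[dat$fraction > 0.01, ]
```

To use it on a real table, replace `path` with the file name, e.g.
mutationCollapseMerge.txt, and adjust the column names to match its header.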
================================
There are so many possibilities for what to include: structural
variation, quality, silent vs total, num vs numCompound, time
evolution, where the variants occur (single and compound). I have many
results in all the READMEs for this project. TODO: go through all
READMEs for information.
================================
abundance agreement between technical replicates.
mylm = lm(resultp$frac1 ~ resultp$frac2)
summary(mylm)
Call:
lm(formula = resultp$frac1 ~ resultp$frac2)
Residuals:
      Min        1Q    Median        3Q       Max
-0.043617 -0.001181 -0.000119  0.001133  0.063501
Coefficients:
               Estimate Std. Error  t value Pr(>|t|)
(Intercept)   0.0001075  0.0001719    0.625    0.532
resultp$frac2 1.0014472  0.0008202 1221.027   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.005883 on 1309 degrees of freedom
Multiple R-squared: 0.9991, Adjusted R-squared: 0.9991
F-statistic: 1.491e+06 on 1 and 1309 DF, p-value: < 2.2e-16
RESULT: R-squared is 0.9991, the slope is 1.001, and the intercept is
indistinguishable from zero (p = 0.53), so reproducibility is very high.
This is impressive as these were run on different chips at different
times and went through barcoding!
resultp$relerr = 2*abs(resultp$frac1 - resultp$frac2)/(resultp$frac1 + resultp$frac2)
summary(resultp$relerr[2:nrow(resultp)])
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.04924 0.12170 0.16310 0.23520 1.27700
RESULT: Median relative difference between replicates is 12% (mean 16%) across the entire range.
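
The relative error used above is the absolute difference between the two
replicate fractions divided by their mean. A small sketch of the same
formula on made-up numbers; the function name rel_diff and the toy vectors
are hypothetical.

```r
# Symmetric relative difference: |f1 - f2| divided by the mean of f1 and f2.
# It lies between 0 (identical) and 2 (one replicate is zero).
rel_diff <- function(f1, f2) {
  2 * abs(f1 - f2) / (f1 + f2)
}

# Made-up replicate abundances, for illustration only.
frac1 <- c(0.10, 0.020, 0.50)
frac2 <- c(0.12, 0.020, 0.45)

relerr <- rel_diff(frac1, frac2)
medianErr <- median(relerr)
```

Note that the measure is undefined (NaN) when both fractions are zero, so
such rows would need to be dropped before summarizing.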
================================
look at the number of mutations versus the number of silent mutations
mylm = lm(silentDat$numSilent ~ silentDat$total)
summary(mylm)
Call:
lm(formula = silentDat$numSilent ~ silentDat$total)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5176 -1.0578  0.0335  0.8984  4.8474
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       0.4155     0.2484   1.672   0.0979 .
silentDat$total   0.1825     0.0136  13.418   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.486 on 90 degrees of freedom
Multiple R-squared: 0.6667, Adjusted R-squared: 0.663
F-statistic: 180 on 1 and 90 DF, p-value: < 2.2e-16
RESULT: The slope is 0.18 and highly significant (p < 2e-16), so about
18% of observed minor variants are silent.
================================
look at number of variant positions observed versus number of
compounds > 1%
Look at whether the number of compounds greater than 1% is related to
the number of variant positions with abundance greater than 1%:
summary(mylm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.64463    0.36252   7.295 1.13e-10 ***
mm[, 2]      0.27142    0.04557   5.956 4.92e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.115 on 90 degrees of freedom
Multiple R-squared: 0.2828, Adjusted R-squared: 0.2748
F-statistic: 35.48 on 1 and 90 DF, p-value: 4.917e-08
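RESULT: Reading the fit as compounds > 1% regressed on variant positions
> 1% (the lm call is not shown above), the slope is 0.27 (p = 4.9e-08):
each additional variant position above 1% goes with about 0.27 more
compounds above 1%, on top of an intercept of 2.6. R-squared is only
0.28, so the relationship is significant but loose.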
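
The call that produced this fit is not recorded here. A sketch of how such
a fit could be set up and how the slope, intercept, and R-squared can be
pulled out separately. The data frame counts and its columns nPos and nComp
are hypothetical, and the numbers are made up.

```r
# Hypothetical per-sample counts (made-up values):
#   nPos  = number of variant positions above 1%
#   nComp = number of compound variants above 1%
counts <- data.frame(
  nPos  = c(2, 4, 6, 8, 10),
  nComp = c(3, 4, 4, 5, 6)
)

# Using data= keeps the coefficient names readable (nPos, not mm[, 2]).
fit <- lm(nComp ~ nPos, data = counts)

slope     <- unname(coef(fit)["nPos"])
intercept <- unname(coef(fit)["(Intercept)"])
r2        <- summary(fit)$r.squared
```

Reporting the intercept alongside the slope matters when the intercept is
far from zero, because the slope alone then no longer gives the ratio of
the two counts.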
================================
time evolution
For the patients with multiple time points, show the time evolution.
Here are the patients with more than 1 timepoint (columns: row index,
patient initials, number of timepoints):
30 HL 4
4 AHP 3
14 CSY 3
22 DVD 3
52 MDL 3
56 MZ 3
7 BRM 2
10 CO 2
23 DWB 2
26 EEC 2
37 JLR 2
55 MT 2
57 NEF 2
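
A sketch of how a list like this can be derived from a sample sheet with
one row per sequenced sample. The data frame samples is hypothetical; only
the column name ptInit follows the naming used elsewhere in these notes,
and patient XX is invented to show a single-timepoint case being dropped.

```r
# Hypothetical sample sheet: one row per (patient, timepoint).
samples <- data.frame(
  ptInit = c("HL", "HL", "HL", "HL",
             "AHP", "AHP", "AHP",
             "BRM", "BRM",
             "XX"),
  stringsAsFactors = FALSE
)

# Count samples per patient, keep patients with more than one timepoint,
# and list the patients with the most timepoints first.
nTime <- table(samples$ptInit)
multi <- sort(nTime[nTime > 1], decreasing = TRUE)
```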
Here are the time series plots for all variants where the mean
abundance is greater than 0.0075. The patient ID is in each filename.
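
A sketch of the filtering step behind the plots, assuming a long-format
table with one row per patient, variant, and timepoint. The data frame
tsDat, its column names, and all values are hypothetical.

```r
# Hypothetical long-format abundances (made-up values).
tsDat <- data.frame(
  ptInit   = "P1",
  variant  = c("varA", "varA", "varB", "varB", "varC", "varC"),
  time     = c(1, 2, 1, 2, 1, 2),
  fraction = c(0.000, 0.020, 0.550, 0.410, 0.004, 0.006),
  stringsAsFactors = FALSE
)

# Mean abundance of each variant across a patient's timepoints.
meanFrac <- aggregate(fraction ~ ptInit + variant, data = tsDat, FUN = mean)

# Keep variants whose mean abundance exceeds the 0.0075 cutoff.
keep <- meanFrac[meanFrac$fraction > 0.0075, ]

# Restrict the time series to the kept variants, ready for plotting.
plotDat <- merge(tsDat, keep[, c("ptInit", "variant")])
```

Filtering on the mean rather than on each timepoint keeps variants that are
absent at one timepoint but abundant at another, such as varA here.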
================================
Compound Variants
I looked for interesting patterns in compound variants for those
patients with time series.
-- EEC has two codons for the same variant f317l (tta, ctc) at the second time point:
                  key count fraction             variant       limsID barcode.x
10167 2450177-0036.F2    27 0.021669 g250e.gag,f317l.ctc 2450177-0036        F2
10168 2450177-0036.F2    62 0.049759 g250e.gag,f317l.tta 2450177-0036        F2
10169 2450177-0036.F2   131 0.105136           f317l.ctc 2450177-0036        F2
10170 2450177-0036.F2   204 0.163724           g250e.gag 2450177-0036        F2
10171 2450177-0036.F2   251 0.201445           f317l.tta 2450177-0036        F2
10172 2450177-0036.F2   507 0.406902                     2450177-0036        F2
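
A sketch of how amino acid changes encoded by more than one codon could be
found automatically from variant strings of the form aa.codon joined by
commas. The variant strings are the five non-empty ones from the EEC table
above; the variable names are made up.

```r
# Variant strings copied from the EEC table above.
variants <- c("g250e.gag,f317l.ctc",
              "g250e.gag,f317l.tta",
              "f317l.ctc",
              "g250e.gag",
              "f317l.tta")

# Split compounds into single variants and drop duplicates.
parts <- unique(unlist(strsplit(variants, ",", fixed = TRUE)))

# Each part looks like "f317l.ctc": amino acid change, dot, codon.
aa    <- sub("\\..*$", "", parts)
codon <- sub("^.*\\.", "", parts)

# Distinct codons seen for each amino acid change.
codonsPerAA <- lapply(split(codon, aa), unique)

# Amino acid changes that appear with more than one codon.
multiCodon <- names(codonsPerAA)[sapply(codonsPerAA, length) > 1]
```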
-- CSY is heavily compounded
21/3/05: f359c
28/3/06: f359c and low level t315i+f359c
22/1/08: t315i+f359c and a 4-variant compound at >5%
tmp = compvars[compvars$ptInit == "CSY" & compvars$fraction > 0.01, ]
split(tmp, tmp$key, drop = TRUE)
$`2450177-0032.F1`
                 key count fraction             variant       limsID barcode.x
6391 2450177-0032.F1    26 0.010874 p230p.cca,f359c.tgc 2450177-0032        F1
6392 2450177-0032.F1   606 0.253450                     2450177-0032        F1
6393 2450177-0032.F1  1325 0.554161           f359c.tgc 2450177-0032        F1
$`2450177-0032.F2`
                 key count fraction             variant       limsID barcode.x
6818 2450177-0032.F2    58 0.023529 t315i.att,f359c.tgc 2450177-0032        F2
6819 2450177-0032.F2   334 0.135497                     2450177-0032        F2
6820 2450177-0032.F2  1006 0.408114           f359c.tgc 2450177-0032        F2
$`2450177-0032.F3`
                 key count fraction                                 variant
6985 2450177-0032.F3    21 0.010479           t315i.att,a350a.gct,e352d.gat
6986 2450177-0032.F3    21 0.010479                     t315i.att,a350a.gct
6987 2450177-0032.F3    36 0.017964           t315i.att,e352d.gat,f359c.tgc
6988 2450177-0032.F3    65 0.032435
6989 2450177-0032.F3    78 0.038922           t315i.att,a350a.gct,f359c.tgc
6990 2450177-0032.F3   118 0.058882 t315i.att,a350a.gct,e352d.gat,f359c.tgc
6991 2450177-0032.F3   140 0.069860                               f359c.tgc
6992 2450177-0032.F3   244 0.121756                               t315i.att
6993 2450177-0032.F3   903 0.450599                     t315i.att,f359c.tgc
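
A sketch of counting how many variants each compound carries, which is one
way to pick out the heavily compounded reads at the third CSY time point.
The fractions and variant strings are copied from the F3 table above,
leaving out the row with an empty variant field; the data frame name f3 and
the column nVar are made up.

```r
# Rows from the 2450177-0032.F3 table above (empty-variant row omitted).
f3 <- data.frame(
  fraction = c(0.010479, 0.010479, 0.017964, 0.038922,
               0.058882, 0.069860, 0.121756, 0.450599),
  variant  = c("t315i.att,a350a.gct,e352d.gat",
               "t315i.att,a350a.gct",
               "t315i.att,e352d.gat,f359c.tgc",
               "t315i.att,a350a.gct,f359c.tgc",
               "t315i.att,a350a.gct,e352d.gat,f359c.tgc",
               "f359c.tgc",
               "t315i.att",
               "t315i.att,f359c.tgc"),
  stringsAsFactors = FALSE
)

# Number of variants in each compound = number of comma-separated parts.
f3$nVar <- sapply(strsplit(f3$variant, ",", fixed = TRUE), length)

# Compounds with at least 4 variants observed above 5%.
big <- f3[f3$nVar >= 4 & f3$fraction > 0.05, ]
```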
================================