================================

NIAID Close HIV Genomes

- GOAL: Sequence ~20x ~9kb unlabeled arbitrary HIV genomes that can differ by a single base or 1000's of bases on a single PacBio chip. - Samples to develop and test newly developed methods:

S#1: NL4.3 wild type (WT) S#2: NL4.3 + K65R (AAA to AGA) S#3: NL4.3 + M41L (ATG to TTG) S#4: NL4.3 + T215Y (ACC to TAC) S#5: NL4.3 + Y181C (TAT to TGT) + G190A (GGA to GCA) M1 S#1-20%+S2-20%+S3-20%+S4-20%+S5-20% "easy" uniform mix 20% M3 S#1-80%+S2-05%+S3-05%+S4-05%+S5-05% "harder" mix 5% M5 S#5-50%+S#1-50% uniform 2 SNP phasing

- RESULTS: - All pure samples agree except for two base differences. - M5 correctly phased - M1 correctly phased to five genomes except for half of one variant missed - M3 not correctly identified. - FUTURE WORK: - Confirm sequencing "errors" - Higher coverage of new chemistry should resolve harder cases. - Many algorithmic techniques to be explored - Tackle 20-mix after in-silico experiments. ================================

Previously

- Previous techniques: + CluCon: Good for genomes differing by >30 positions. Fully phased, base accurate results given about 100 reads. Universal but weaker "model-free" technique. Good results on 3-clone mixtures down to 5%. + LAA: Designed for HLA, barcoded samples, diploid variants. More complex variants allowed but not designed for. Good results on mixtures of large internal deletion HIV. - GAP: <30 variant positions and >>2 genomes ================================

CluCon Basis

- Stronger modeling of sequencing error using homopolymer basis functions (A,C,G,T runs of 1,2,3,4,5,...) allowing for fine-scale identification and phasing of arbitrary mixtures of variants. - Models P( A.6 -> AAGAATTAA ) and P( A.4+T.2+A.2 -> AAGAATTAA ) - Learn models by sequencing exactly known templates and counting in regions with back-off models for unseen contexts. Visualization - Used to identify variant regions by excess of observations. - Used to phase positions using linear system matrix: M(i,j) = P(H_j -> H_i) Obs = M * Truth -> Truth = M^{-1} * Obs

================================

Sequencing Results

- Good number of P4 full length genomes:

Sample #FL ConsensusLength S1 2525 8911 S2 3036 8911 S3 2445 8911 S4 3526 8910 S5 2715 8910 M1 3853 8911 M3 2664 8911 M5 983 8911

- S1, M1, and M3 had the same consensus. M5 is equal to S5 rather than S1. ================================

Sequencing Results

- Sequencing is correct except two unaccounted errors:

msapos quiver HPbegin HPend context HPrange quiverbase truthbases type ERROR! 6010 1201 6005 6025 ACA 6005 6025 C.A.3.T A ['A', 'A', 'A', '-', 'A'] delete s#3 9850 1969 9835 9855 AAA 9835 9855 G.A.3.T A ['A', 'A', 'T', 'A', 'A'] mismatch s#2 10215 2042 10205 10240 GAA 10205 10240 G.A.6.G A ['A', 'G', 'A', 'A', 'A'] mismatch s#5 11955 2390 11950 11960 CTA 11950 11960 T.A.1.T A ['A', 'A', 'A', 'A', 'G'] mismatch s#5 12090 2417 12080 12095 AGG 12080 12095 A.G.2.A G ['G', 'G', 'G', 'G', 'C'] mismatch s#4 12441 2487 12435 12445 GGA 12435 12445 G.A.1.T . ['.', '.', '.', 't', '.'] insert s#4 12441 2487 12440 12460 GGA 12440 12460 A.T.3.A . ['.', '.', '.', 't', '.'] insert s#4 12465 2492 12460 12475 TAC 12460 12475 A.C.2.A C ['C', 'C', 'C', '-', 'C'] delete ERROR! 28455 5690 28450 28460 GAT 28450 28460 A.T.1.G T ['T', 'T', 'T', 'T', '-'] delete ERROR#1 in S#4: S#3 C....A....A....A....T G....A....A....A....T....G C....-....A....A....T G....A....A....T....T....G S#2 T....A....A....A....G....A....A....A....A....A....A....G T....A....A....A....G....A....G....A....A....A....A....G S#5 C....T....A....T....C + A....G....G....A....T C....T....G....T....C A....G....C....A....T S#4 T....G....G....G....G....A....T....T....T....A....C....C....A T....G....G....G....G....At...T....T....T....A....-....C....A Corrected: T....G....G....G....G....A....T....T....T....A....C....C....A T....G....G....G....G....A....T....T....T....T....A....C....A ERROR#2 in S#5: G....A....T....G....C G....A....-....G....C

The two errors are upheld in other independent sequencing runs. The observations suggest the consensus is correct with little suggestion of the wild-type for the two "errors" (74/2309, 18/2023). ================================

Variant Positions

- All pure samples had no identified variant positions. - M5 identified 2 variant positions correctly - M1 identified 5 variant positions but should have 6 - M3 identified 0 variant positions but should have 6 ================================

M5 Identified and Correctly Phased

[['G', 'A', 'A', '', 'G', '', 'AA', 'GG', 'G', 'G', 'A', 'TA', 'G', 'A', 'GT', 'AT', 'A', '', 'G'], ['C', 'G', '', 'G', 'CC', 'C', 'G', 'C', '', 'GC', 'GG', 'G', 'CA','A', 'C', 'G', 'C', '', 'TC']] observed [359, 287, 51, 23, 21, 15, 12, 11, 11, 11, 10, 10, 9, 9, 7, 7, 6, 6, 6] ================================ top 1 'A+G+' [[0, 871], [287, 871]] 4.80251396237e-78 should add! 1 'A+G+' ================================ top 2 'A++' [[16, 871], [51, 871]] 2.783535288e-05 # 7% discounted minor = 51/(359+287+51) '+G+' [[15, 871], [23, 871]] 0.250924327934 'G+CC+' [[12, 871], [21, 871]] 0.159007424968 '+C+' [[20, 871], [15, 871]] 0.495312158218 'AA+G+' [[11, 871], [12, 871]] 1.0 'GG+C+' [[12, 871], [11, 871]] 1.0 'G++' [[18, 871], [11, 871]] 0.261083327465 'G+GC+' [[4, 871], [11, 871]] 0.116946397511 'A+GG+' [[9, 871], [10, 871]] 1.0 'TA+G+' [[3, 871], [10, 871]] 0.0911245759195 'G+CA+' [[4, 871], [9, 871]] 0.265075360206 'A+A+' [[1, 871], [9, 871]] 0.0211469672287 'GT+C+' [[4, 871], [7, 871]] 0.547542762848 'AT+G+' [[3, 871], [7, 871]] 0.342354228124 'A+C+' [[3, 871], [6, 871]] 0.506690410552 '++' [[2, 871], [6, 871]] 0.287945566371 'G+TC+' [[4, 871], [6, 871]] 0.753202724374 *** converged! the following haplotypes and fractions explain all the data 0 'G+C+' 0.55572755418 1 'A+G+' 0.44427244582

================================

M1 Mostly Identified and Correctly Phased

- Should find six variant positions but only find five

Top p-values: 1660 11950 11960 T.A.1.T 3.682654e-131 # s#5 1682 12080 12095 A.G.2.A 1.148554e-119 # s#5 1371 9835 9855 G.A.3.T 5.278450e-108 # s#3 1411 10205 10240 G.A.6.G 9.658415e-92 # s#2 1734 12435 12445 G.A.1.T 5.180785e-91 # s#4 ? t-insert, where is the c-delete ? ... 1737 12460 12475 A.C.2.A 4.036372e-32 # S#4 c-delete

- The S#4 variant misses the c-delete

S#4 T....G....G....G....G....A....T....T....T....A....C....C....A T....G....G....G....G....At...T....T....T....A....-....C....A Corrected: T....G....G....G....G....A....T....T....T....A....C....C....A T....G....G....G....G....A....T....T....T....T....A....C....A

- The C-deletion is there but the p-value doesn't ring it out.

id observed expected expectedProb over 3537.0 CC 2238.0 2434.0 0.688117608683 C 724.0 434.0 0.122587252492 CCC 165.0 133.0 0.0376396169989 ....

- TODO: If I take 12435:12475, it is probably more significant because it takes the two "bled" variant positions and puts them together. - The phasing is correct ... just lacking the corresponding c-delete

================================ top 1 'AAT+AAAAAA+A+GG+A+' [[1, 1481], [119, 1481]] 1.12602313786e-33 should add! 1 'AAT+AAAAAA+A+GG+A+' ================================ top 2 'AAA+AGAAAA+A+GG+A+' [[0, 1481], [116, 1481]] 1.10841149061e-34 should add! 2 'AAA+AGAAAA+A+GG+A+' ================================ top 3 'AAA+AAAAAA+G+GC+A+' [[0, 1481], [114, 1481]] 4.13977487445e-34 should add! 3 'AAA+AAAAAA+G+GC+A+' ================================ top 4 'AAA+AAAAA+A+GG+A+' [[32, 1481], [65, 1481]] 0.00126598134044 'AA+AAAAAA+A+GG+A+' [[16, 1481], [59, 1481]] 5.37653495232e-07 'AAA+AAAAAA+A+GG+AT+' [[0, 1481], [55, 1481]] 5.22373769102e-17 should add! 6 'AAA+AAAAAA+A+GG+AT+' ... *** converged! the following haplotypes and fractions explain all the data 0 'AAA+AAAAAA+A+GG+A+' 0.331125827815 1 'AAT+AAAAAA+A+GG+A+' 0.19701986755 2 'AAA+AGAAAA+A+GG+A+' 0.192052980132 3 'AAA+AAAAAA+G+GC+A+' 0.188741721854 6 'AAA+AAAAAA+A+GG+AT+' 0.091059602649

- If I stratify to the 'AAA+AAAAAA+A+GG+AT+' variant then the correct single C-deletion is observed 27 times, so stratification might resolve exactly. ================================

M3 Is Missed!

- P-values of the correct variant positions

V1 V2 V3 V4 target 1371 9835 9855 G.A.3.T 2.799849e-14 9840 1411 10205 10240 G.A.6.G 1.138599e-19 10210 1660 11950 11960 T.A.1.T 5.627766e-05 11955 1682 12080 12095 A.G.2.A 1.138794e-06 12085 1734 12435 12445 G.A.1.T 5.464434e-30 12440 1737 12460 12475 A.C.2.A 2.831993e-04 12465

- If I threshold at these values I get a minimum of 80 false positions with no hope of phasing. - The data do support the variants but is underpowered to detect it:

S#5 A->G 11950 11960 T.A.1.T { 'T....A....T': 1888, 'T....-....T': 103, '-....A....T': 99, 'T....A....-': 86, 'Ta...A....T': 84, 'T....G....T': 44, ** 'T....At...T': 39, 'Tt...A....T': 31, 'T....Aa...T': 28, 'Tg...A....T': 23, 'T....Ag...T': 17, 'A....A....T': 15, '-....A....-': 15, 'T....-....-': 15, 'T....Ac...T': 15, 'Tc...A....T': 13, 'T....T....T': 13,

================================

Future Work

- Confirm sequencing "errors" - Higher coverage of new chemistry should resolve harder cases. - Many algorithmic techniques to be explored - Tackle 20-mix after in-silico experiments.