Thursday, 22 December 2011

Merry Christmas from Core genomics

Merry Christmas and a happy New Year to everyone that read my blog this year. I only started in May and have had some great feedback. I'd seriously encourage others to start blogging too. It's great fun and the discipline of trying to write something every week is tough but I hope my writing skills are improving.

Good luck with your research next year, I look forward to reading about it.

Predictions for 2012:
5M reads and 400bp on PGM.
20M reads and 500bp on MiSeq.
A PacBio eukaryotic genome.
Goodbye GAIIx.
$1500 genome.
$250 exome.
NGS in the clinic using amplicon methods.
Fingers crossed we are going to hear a lot more from Oxford Nanopore in 2012 as well, it may well be their year.

See you at AGBT.

Wednesday, 21 December 2011

Reading this will make you itch, the 1000 head louse genome project.

It’s that time of year when kids come home from school and start complaining of “the itches”; the nits are back.
I don’t know about the rest of the world but here in the UK we used to have “nit nurses” who would go from school to school checking for head lice. As a kid I remember lining up with the other kids in class and having her fingers run through my hair looking for eggs and lice. It may be strange but for me it is a lovely memory!

Nit genomics: This year when the school sent home the inevitable letter that head lice had been confirmed in my daughter's class my thoughts turned to the louse genome (I am aware this is a bit nerdy). Has it been sequenced and what might we learn about the spread of nits and individual susceptibility through genomic analysis? After all the Belly Button Biodiversity project turned out to be pretty interesting didn’t it?

The nit's close relative, the body louse (Pediculus humanus humanus), was sequenced using Sanger shotgun methods in 2010. 1.3M. It is a very AT rich genome. It has the smallest insect genome yet sequenced and apparently is the first animal genome to be shown to have fragmented mitochondrial mini-chromosomes.

The head louse genome is only 108Mb. As these parasites are generally quite prolific it should be possible for me to collect a reasonable number from each of my kids' heads and mine and my wife's over the few weeks I am combing them out with conditioner (wet combing is as effective as any insecticide based treatment). I got four or five this morning from my daughter!

Ideally one would collect only larvae that have not yet started sucking blood to avoid having to sequence some of the Human genome as well (although I am not certain how much Human DNA would contaminate each louse).

With this sample it might be possible to get some idea of the population structure within a school, possibly through some molecular barcoding once we have good genes to target. Perhaps we can learn something about the spread of this organism through a community. As it is pretty harmless it should be easy to collect samples from schools all over the world. Are the head lice in Wymondham different from the lice in Waikiki? Do they have lice in Waikiki?

If we could look deeper into the host could we find susceptibility loci and would screening of more susceptible individuals reduce the outbreaks we see each year? What else might we learn about this host:parasite interaction? Are different blood groups more or less affected by lice? There are so many questions we might answer.

I am not certain I will get the time to pursue this project but if there is an enterprising grad student that wants to take this on do get in touch.

Nit biology and evolution: I am writing this just because I am almost certainly never going to write about it again but I want to make sure I can explain it to my kids! Most of this comes from two papers I'd encourage you to read so see the references at the end.

Nits and lice are hemimetabolous rather than holometabolous insects. That is, they develop from nymphs to adults rather than going through a larva–pupa–adult transformation. The holometabolous strategy allows larvae and adults to occupy different ecological niches and as such has proven highly successful. However the niche occupied by nits is the same regardless of life-cycle stage. Nits are a strict Human obligate-ectoparasite and are provided with a homogenous diet (our blood) and few xenobiotic challenges. As such it appears that lice were able to reduce their genome size by losing genes associated with odorant and chemo-sensing; they have 10 fold fewer odorant genes than other insects sequenced and relatively few detoxification-enzyme encoding genes. Basically they don't need to find food or deal with harmful plant toxins.

Lice have been in contact with us for a long time and Human and Chimpanzee lice diverged at the same time as we did from our common ancestor about 25M years ago. We have been living and evolving together ever since with the body louse evolving relatively recently as we began to wear clothing. A paper by Kittler et al used a molecular clock analysis with 2 mtDNA and 2 nuclear loci across diverse human and chimpanzee lice. They saw higher diversity in African lice similar to Human diversity and estimated that body lice evolved around 70,000 years ago. They said this correlated with the time when Humans left Africa, I guess we had to wear something when we moved into Europe as it’s a whole lot colder than Africa.

Kirkness et al. Genome sequences of the human body louse and its primary endosymbiont provide insights into the permanent parasitic lifestyle. PNAS 107; 27: 12168–12173 (2010)
Kittler et al. Molecular evolution of Pediculus humanus and the origin of clothing. Curr Biol 13:1414–1417. (2003)

Friday, 9 December 2011

Getting on the map

A recent advertising flyer from Life Tech seems to borrow from the GoogleMap of next-gen sequencers by suggesting to their community to "get on the map". The backdrop to the ad is a map of the world with highlighted areas where the reader might assume an Ion PGM is located (although the ad does not specifically claim this).

Imitation is the sincerest form of flattery:
The Ion Torrent backdrop locations match reasonably closely to the data on the GoogleMap (added by PGM owners) with respect to numbers of PGM machines by continental location.
Ion Torrent PGMs on the GoogleMap

Here is a comparison of the two sources:
North America 36 (ad) vs 39 (map)
South America 2 vs 0
Europe 31 vs 26
Africa NA (not visible) vs 2
Asia/India 21 vs 6
Australia 7 vs 17

I am all for encouraging users to register on the map and we have tended to get feedback that coverage is quite representative, even if it captures only 60-70% of actual machine installations.

We will be updating the site in the next few months and I'd encourage you to add your facility or update it with your new toys.

We'd also be happy to get feedback from users about what they want to see on the map in the future.

RNA-seq and the problem with short transcripts

There are over 9000 Human transcripts <200bp in length, which is about 5% of the total number of transcripts. When analysing some recent RNA-seq data here at CRI we noticed that only 17 of almost 30,000 detected transcripts are shorter than 200bp, about 100 times lower than might be expected.
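As a rough check on that shortfall, the arithmetic can be sketched in a few lines of Python. The 5% share and the transcript counts come from the paragraph above; the idea that detection should simply be proportional to annotation share is my own simplifying assumption and ignores expression level entirely.

```python
# Rough sanity check: how many short transcripts would we expect to detect
# if detection were simply proportional to their share of the annotation?
# (Assumption: expression level plays no part, which is certainly too simple.)

def expected_short_detected(total_detected: int, short_fraction: float) -> float:
    """Expected number of detected transcripts that are shorter than 200bp."""
    return total_detected * short_fraction


def fold_shortfall(expected: float, observed: int) -> float:
    """How many times lower the observed count is than the expected count."""
    return expected / observed


if __name__ == "__main__":
    expected = expected_short_detected(total_detected=30_000, short_fraction=0.05)
    shortfall = fold_shortfall(expected, observed=17)
    print(f"Expected about {expected:.0f} short transcripts, observed 17")
    print(f"Shortfall: roughly {shortfall:.0f}-fold")
```

On these numbers the shortfall comes out a little under 100-fold, which is the same order of magnitude as the estimate above.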

We have been asking why this might be the case. Firstly this is control data from the MAQC samples, UHRR and Brain. It may be that short transcripts are more often expressed at low levels but it may be that we are not picking them up because they are too short.

Spike-in experiments can help:
A recent paper presented data from a complex RNA spike-in experiment. Jiang et al individually synthesised and pooled 96 RNAs by in vitro transcription which were either novel sequences or from the B. subtilis and M. jannaschii genomes. The RNA was stored in a buffer from Ambion to maintain stability. The synthesised RNAs were 273-2022bp in length and distributed over a concentration range spanning six orders of magnitude. They observed a high correlation between RNA concentration and read number in RNA-seq data and were able to investigate biases due to GC content and transcript length in different protocols.

This type of resource is useful for many experiments but difficult to prepare.

I have thought about using Agilents eArrays to manufacture RNAs of up to 300bp in length and to use spot number to vary concentration (15,000 unique RNA molecules spotted 1, 10, 100, 1000 or 10,000 times to vary concentration) this would create a very complex mix which should be reproducibly manufactured at almost any time for any group. This would also be very flexible in varying the actual sequences used to look at particular bias of the ends of RNA molecules in RNA-seq protocols.

But the RNAs need to be small enough:
The current TruSeq protocol uses metal hydrolysis to degrade RNA and no gel size selection is required later on. This is making it much easier to make RNA-seq libraries in a high throughput manner, however this technique possibly excludes shorter transcripts or at least makes the observed expression as measured by read counts lower than reality.

The TruSeq fragmentation protocol produces libraries with inserts of 120‐200 bp. Illumina do offer some advice on varying insert size in their protocol but there is not a lot of room to manipulate inserts with this kind of fragmentation and size selection. They also offer an alternative protocol based on cDNA fragmentation but this method has been sidelined by RNA fragmentation due to increased coverage of the latter method across the transcript, see Wang et al.

The libraries prepared using the TruSeq protocol do not contain many observable fragments below 200bp (see image below) and 110bp or so of this is the adaptor sequence. This suggests the majority of starting RNA molecules were fragmented to around 150-250bp in length, so shorter RNAs could be fragmented too low to be sequenced in the final library.

From the TruSeq RNA library prep manual
I'd like to hear from people who are working on short transcripts to get their feedback on this. Does it matter, as long as all samples are similarly affected? Many times we are interested in the differential expression of a transcript between two samples rather than between two transcripts in a single sample.

PS: What about smallRNA-seq?
There have been lots of reports on the bias of RNA ligase in small and micro-RNA RNA-seq protocols. At a recent meeting Karim Sorefan from UEA presented some very nice experiments. They produced adapter oligos with four degenerate bases at the 5' end and compared performance of these to standard oligos. The comparison made use of two reagents, a completely degenerate 21mer and a complex pool of 250,000 21mer RNAs. The initial experiment with the degenerate RNA should have resulted in only one read per 400M for any single RNA molecule. They clearly showed that there are very strong biases and were able to say something about the sequences preferred by RNA ligase. The second experiment used adapters with four degenerate bases and gave significantly improved results, showing little if any bias.

This raised the question in my mind that the tissue specific or Cancer specific miRNAs published may not be quite so definitive. Many of the RNAs found using the degenerate oligos in the tissues they tested had never been seen in that tissue previously.

Jiang et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res. 2011.

Thursday, 8 December 2011

Reference based compression: why do we need to do it?

We are producing a lot of sequence data (see the GoogleMap and a previous post) and will continue to produce a lot more. Illumina have just made noises about 1TB runs on HiSeq using PE100bp runs (whether they will actually release this given recent stock market downgrades is unclear). Computers don't keep up fast enough; we are running out of space to store it and bandwidth to move it around (Moore's law - compute power grows 60% annually, Kryder's law - data storage grows 100% annually, Nielsen’s law - internet bandwidth grows 50% annually).

So we need to delete some data but what can we afford to throw away?

Reference based compression:
Ewan Birney’s lab at EBI published a very nice paper earlier this year presenting a reference based compression method. New sequences are aligned to a reference and the differences are encoded rather than storing the raw data. At the 3rd NGS congress I had lunch with Guy Cochrane and two other sequencing users and we discussed some of the issues in compressing data. I had to apologise that I’d missed Guy’s talk but it turned out his was at the same time as mine so at least I had a good excuse!

Efficient storage of high throughput sequencing data using reference-based compression. Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, et al. Genome Res 2011.
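To make the idea concrete, here is a toy sketch of storing a read as its differences from a reference. It is only an illustration of the general principle: the function names and the tuple layout are my own invention, it handles mismatches but not indels, and it is not the format described in the paper.

```python
# Toy reference-based encoding for a read that aligns without indels.
# A sketch of the general idea only, not the format described in the paper.

def encode_read(reference: str, read: str, position: int):
    """Store the read as (position, length, mismatches) against the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    mismatches = [(i, base)
                  for i, (ref_base, base) in enumerate(zip(window, read))
                  if ref_base != base]
    return position, len(read), mismatches


def decode_read(reference: str, encoded) -> str:
    """Rebuild the original read from the reference and the stored differences."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTTAACG"
    read = "TAGCCGTTAGG"  # aligns at position 8 with a single mismatch
    encoded = encode_read(reference, read, position=8)
    print(encoded)
    print(decode_read(reference, encoded) == read)
```

A read that matches the reference perfectly collapses to a position and a length, which is where the saving comes from; reads that do not map at all gain nothing from this approach.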

One of the discussions in compression of sequence data is the issue of what to use as a reference and how far to go in compressing data. Quite simply the current Human genome is not quite good enough, there are large amounts of sequence in a new genome which don’t match the current Hg19 reference but the reference is improving as more genome projects release data that can be incorporated.

One interesting comment in the paper for me is the exponential increase in efficiency of storage with longer read lengths. Longer reads are good for several other reasons, there is a significant drop in cost per bp, an increase in mappability and greater likelihood of resolving structural variants or splice isoforms with a single read. It looks like long read sequencing is going to be cheaper and better for us to use for many different reasons.

Today we still use Phred encoding and yet nearly all users remove data lower than Q30 (I am not sure anyone really cares about the difference of Q30 to Q40 when doing analysis). As such we may well be able to compress read quality information by reducing the range encoded to ten values Q10,Q20...Q50 or even lower to just two <Q30, >Q30.
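A minimal sketch of what this kind of binning might look like is below. The Phred relationship Q = -10 * log10(p) is standard; the particular bin boundaries and function names are my own choices for illustration.

```python
# Sketch of lossy quality-score binning. Phred scores relate to error
# probability as Q = -10 * log10(p); the bin boundaries here are illustrative.

def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def bin_to_decade(q: int) -> int:
    """Round a score down to the nearest multiple of ten (Q37 becomes Q30)."""
    return (q // 10) * 10


def passes_q30(q: int) -> bool:
    """Two-level encoding: one bit per base, set when the score is Q30 or better."""
    return q >= 30


if __name__ == "__main__":
    scores = [2, 17, 29, 30, 37, 40]
    print([bin_to_decade(q) for q in scores])
    print([passes_q30(q) for q in scores])
    print(f"Q30 error rate: {error_probability(30):.4f}")
```

Either scheme throws information away, so the original scores cannot be recovered afterwards.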

At a recent meeting Illumina presented some of their work on compressing data and their experiences in reducing the range of Qscores encoded when doing this. They saw:
  • 8 bits/base in the current Illumina bcl files
  • 7.3 bits/base are actually used in bcl for up to Q40 scores
  • 5-6 bits/base can be achieved by compressing bcl files (lossless compression)
  • 4 bits/base can be achieved with the same methods as above if only 8 Q scores are used.
  • 2 bits/base can be achieved if no quality scores are kept (lossy)
  • <1 bits/base if a BWT (Burrows-Wheeler transform: which apparently allows this to be done on the instrument and allows someone to uncompress and realign the data later on, sounds good to me) and zero quality compression are used (lossy)
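To put those densities into disk-space terms, here is a small sketch that converts bits per base into gigabytes. The one-terabase run size is a hypothetical round number chosen for illustration, and I have taken the midpoint of the 5-6 bits/base range.

```python
# Storage needed for a run at the bits-per-base figures listed above.
# The one-terabase run size is an illustrative assumption, not a quoted number.

BITS_PER_BASE = {
    "raw bcl": 8.0,
    "lossless compressed bcl": 5.5,  # midpoint of the 5-6 range
    "8 quality bins": 4.0,
    "no qualities": 2.0,
}


def storage_gigabytes(bases: float, bits_per_base: float) -> float:
    """Convert a base count and an encoding density into gigabytes (10^9 bytes)."""
    return bases * bits_per_base / 8 / 1e9


if __name__ == "__main__":
    run_bases = 1e12
    for label, bits in BITS_PER_BASE.items():
        print(f"{label:>24}: {storage_gigabytes(run_bases, bits):6.0f} GB")
```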

Compression of cancer genomes:
The discussion over lunch got me thinking about what we can do with Cancer genomes. Nearly all cancer genome projects like ICGC, are sequencing matched tumour:normal pairs. Many of these pairs turn out to be very very similar for most of the genome so using the same basic idea as presented in the Birney group paper would allow us to sequence two genomes, the tumour:normal pair but only store the data for one and the differences. We should be able to implement this quite rapidly and impact storage on these projects. Later on as Human genome reference based compression gets better we might be able to store even less. Birney's group also report that 10-40% of reads in a new genome sequence do not map to the reference and these are obviously harder to compress. Again with T:N pairs the rate of unmapped data should be significantly lower.

Compression of other projects:
What about sequencing other than whole genomes? It may well be that we can use the increasing numbers of replicates in RNA-seq and ChIP-seq experiments to reduce the storage burden of the final data. Replicates could be compressed against each other. A triplicate experiment could be stored as a single sequence file with the variability at specific loci and a few different reads across the genome.

Friday, 25 November 2011

Making box plots in Excel

A box plot conveys a lot of information and can be a very powerful tool. Excel does not generate these as part of its basic functions and I have never found time to learn how to do this in R or

Gather your data together in columns, with labels on the top.

Calculate the following:
Quartile 1: =QUARTILE(K4:K13,1) this returns the 25th percentile of the data in K4 to K13 of your table.
Min: =MIN(K4:K13), this returns the smallest of the numbers in K4 to K13.
Median: =MEDIAN(K4:K13), returns the median of the numbers in K4 to K13. The median is the number in the middle of a set of numbers; that is, half the numbers have values that are greater than the median, and half have values that are less.
Max: =MAX(K4:K13), this returns the largest of the numbers in K4 to K13.
Quartile 3: =QUARTILE(K4:K13,3) this returns the 75th percentile of the data in K4 to K13 of your table.
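For anyone who would rather script this, the same five numbers can be computed with the Python standard library. This is only a sketch: the function name and example data are made up, and I am relying on method="inclusive" in statistics.quantiles interpolating between data points in the same way as Excel's QUARTILE.

```python
# The same five numbers computed with the Python standard library.
# method="inclusive" interpolates between data points in the same way as
# Excel's QUARTILE function; the example data are made up.
from statistics import median, quantiles


def box_plot_summary(values):
    """Return the summary values in the order used by the Excel table above."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "q1": q1,
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "q3": q3,
    }


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    for name, value in box_plot_summary(data).items():
        print(f"{name:>6}: {value}")
```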

Creating the box plot chart:
Highlight the calculation table and its headers (the data in the image above) and create a "Marked line" chart. You will then need to highlight the chart, right-click and "select data" then click the "Switch Row/Column" button. Now you are ready to format the chart to create box plots as your data are in the correct format with q1, min, median, max and q3 plotted for each column.

Right-click each data series in turn and format them to have no lines and no markers.

Format the "Up bars" so they have a black line.

There you have it a lovely box plot with not too much effort, that hopefully proves your point. I'm off to make mine with a reagent provider now!

Friday, 18 November 2011

MiSeq: possible growth potential part 2

This post (and others like it) is pure speculation from me. I have no insider knowledge and am trying to make some educated guesses as to where technologies like this might go. This is part of my job description in many ways as I need to know where we might invest in new technologies in my lab and also when to drop old ones (like GA).

A while ago I posted about MiSeq potential and suggested we might get to 25Gb per flowcell. This was my first post on this new blog and I am sorry to say I forgot to divide the output on HiSeq by two (two flowcells) so my 25Gb should really have been closer to 12Gb. Consider it revised (until the end of this post).

In October 2007 the first GAI was installed in our lab. It was called the GAI because the aim was to deliver 1Gb of sequence data. It was a pain to run, fiddly, and the early quality was dire compared to where we are today. I remember thinking 2% error at 36bp was a good thing!

Now MiSeq is giving me twice the GAI yield "out-of-the-box".

Here is our instrument:
MiSeq at CRI

And here is the screenshot for run performance: this was taken at about 13:00 today after starting at 15:30 yesterday. If you look closely you can see we are already at cycle 206!
MiSeq installation run

MiSeq in MyLab: So now I can update you on our first run that has processed the in run yield prediction and other metrics. This is a PE151bp PhiX installation run.

MiSeq installation run metrics:
    Cluster density: 905K/mm2
    Clusters passing filter: 89.9%
    Estimated yield: 1913.5MB (I think this means about 6.4M reads)
    Q30: 90.9%  

What is the potential:
So our first run is double the quoted values from Illumina on release.

Broad have also performed a 300bp single end run and if some extra reagents could be squeezed into the cartridge (reconfigured tubes that are a bit fatter perhaps) then PE300 is possible if you wanted to run for 2 days. This would yield 4Gb based on my current run.
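The projection can be sketched as a back-of-envelope calculation from the installation run metrics. It assumes yield scales linearly with the number of cycles and that the cluster count stays the same, neither of which is guaranteed; the function names are mine.

```python
# Back-of-envelope projection from the installation run metrics above.
# Assumes yield scales linearly with cycle number and cluster count is unchanged.

def clusters_from_yield(yield_bases: float, read_length: int, paired: bool = True) -> float:
    """Estimate clusters passing filter from total yield and read length."""
    bases_per_cluster = read_length * (2 if paired else 1)
    return yield_bases / bases_per_cluster


def projected_yield(clusters: float, read_length: int, paired: bool = True) -> float:
    """Yield in bases for the same number of clusters at a new read length."""
    return clusters * read_length * (2 if paired else 1)


if __name__ == "__main__":
    clusters = clusters_from_yield(1913.5e6, read_length=151)
    pe300 = projected_yield(clusters, read_length=300)
    print(f"Clusters: {clusters / 1e6:.1f}M")
    print(f"PE300 yield: {pe300 / 1e9:.1f} Gb")
```

On the run above this lands a little under 4Gb, so the estimate is in the right ballpark.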

We only need an increase of 3x in yield to hit my revised 12Gb estimate, read on...

At the recent Illumina UK UGM we had a discussion in one of the open floor sessions on what we wanted from an instrument like MiSeq. The Illumina team discussed options such as reducing read quality to allow faster runs. This would be achieved by making chemistry cycles even shorter. Currently chemistry takes 4 minutes and imaging takes 1, for a combined 5 minute cycle time.
Reducing chemistry cycle times would speed up the combined cycle time and allow longer runs to be performed; this would impact quality (by how much is not known and Illumina would not say). If you do the same with imaging then you increase yield but make run times longer.

If you play with chemistry and imaging cycle times you can generate a graph like this one.
In this I have kept cycle time constant but varied chemistry and imaging times. The results are pretty dramatic. The peak in the middle of the table represents a 1min Chemistry / 1 min Imaging run, giving the same number of clusters as today (nearly 7M in my case) on a staggering 720bp run. This may be achievable using the standard reagent cartridge if less chemistry is actually used in the cycling (I just don't know about this). If you are happy to increase run times to two days then a low quality (maybe Q20) 1400bp (PE700) run would be pretty cool.
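The read lengths quoted here fall out of simple division, sketched below. This is a simplification on my part: it ignores cluster generation, paired-end turnaround and any other fixed overhead in a run.

```python
# Cycles that fit into a run of fixed duration for a given chemistry and
# imaging time per cycle. A simplification: it ignores cluster generation,
# paired-end turnaround and any other fixed overhead.

def cycles_per_run(run_hours: float, chemistry_min: float, imaging_min: float) -> int:
    """Number of whole sequencing cycles completed in run_hours."""
    cycle_min = chemistry_min + imaging_min
    return int(run_hours * 60 // cycle_min)


if __name__ == "__main__":
    print(cycles_per_run(24, chemistry_min=4, imaging_min=1))  # 5 minute cycle, one day
    print(cycles_per_run(24, chemistry_min=1, imaging_min=1))  # 2 minute cycle, one day
    print(cycles_per_run(48, chemistry_min=1, imaging_min=1))  # 2 minute cycle, two days
```

A one day run at 2 minutes per cycle gives 720 cycles, and two days gives 1440, which is where the 720bp and roughly 1400bp figures come from.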

Even if this is a step too far then dialling in quality and playing with imaging could allow some really cool methods to be developed. What about a strobe sequencing application that gave high quality data at the start, middle and end of a 1000bp cluster for haplotyping but did not collect images in the middle? The prospects are interesting.

As I said at the start this is speculation by me and the reality may never get quite as far as 1400bp on SBS chemistry. We can keep our fingers crossed and I hope that exactly this kind of speculation drives people to invent the technologies that will deliver this. After all if Solexa had not tried to build a better sequencer we would not be where we are today.

I thought I might trade in my remaining GA's for a HiSeq but perhaps I'd be better off asking for two more MiSeq's instead?

Who knows; HiSeq2000 at 600GB (2 flowcells),  HiSeq1000 at 300GB (1flowcell), MiSeq at 35GB (equivalent to 1 lane)?

Competitive by nature: Helen (our FAS) would not let me try to max out loading of the flowcell, I do feel a little competitive in getting the highest run yield so far. Did you know Ion offer a $5000 prize for a record breaking run each month? Their community is actually quite a good forum, and I hope they don't kick me off!

Mis-quantification of Illumina sequencing libraries is costing us 10000 Human genomes a year (or how to quantitate Illumina sequencing libraries)

I was at the 3rd NGS congress in London on Monday and Tuesday this week and one of the topics we discussed in questions was quantitation of Illumina sequencing libraries. It is still a challenge for many labs and results in varying yields. The people speaking thought that between 5-25% of possible yield was being missed through poor quantification.

Illumina recommend a final concentration of 10–13 pM to get optimum cluster density from v3 cluster kits. There is a huge sample prep step in NGS technologies where a sample is adapter ligated and massively amplified so a robust quantification can allow the correct amount of library to be added to the flowcell or picotitre-plate. If a sensitive enough system is used then no-PCR libraries can be used. Most people are still using PCR amplification and lots of the biases have been removed with protocol improvements.
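Getting to a pM loading concentration means converting a mass concentration into molarity first. A minimal sketch is below, using the usual approximation of 660 g/mol per base pair of double-stranded DNA. The example concentration and fragment length are invented, and a real protocol involves denaturation and intermediate dilutions that are not modelled here.

```python
# Converting a library concentration to molarity and working out a dilution.
# Uses the usual approximation of 660 g/mol per base pair of double-stranded
# DNA; the example concentration and fragment length are invented.

AVERAGE_BP_MASS = 660.0  # g/mol per base pair of dsDNA


def library_nanomolar(ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Molar concentration (nM) of a dsDNA library."""
    return ng_per_ul * 1e6 / (AVERAGE_BP_MASS * mean_fragment_bp)


def dilution_factor(stock_nm: float, target_pm: float) -> float:
    """Fold dilution needed to take a stock in nM down to a target in pM."""
    return stock_nm * 1000 / target_pm


if __name__ == "__main__":
    stock = library_nanomolar(ng_per_ul=2.0, mean_fragment_bp=300)
    print(f"Stock: {stock:.2f} nM")
    print(f"Dilute {dilution_factor(stock, target_pm=12):.0f}-fold for 12 pM")
```

The mean fragment length matters as much as the mass concentration here, which is one reason a sizing trace is useful alongside any quantitative assay.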

The method of DNA quantitation is important (no-one wants to run titration flow cells). There are many methods that can be used and I thought I'd give a run down of the pros and cons for each of these (see below). The LOQ values are taken from Digital PCR provides sensitive and absolute calibration for high throughput sequencing, and I have ordered systems by sensitivity (lowest to highest).

Which one to use: Most labs choose the method that suits them best and this is dependent on skills and experience and also what equipment is available for them to use. However even in the best labs getting cluster density spot on has not been perfected and methods could still be improved (I'm currently working on a solution).

In my lab we find that careful use of the Bioanalyser gives us quantitative and qualitative information from just 1 ul of sample. I think we may move to qPCR now we are making all libraries using TruSeq.

Why is this important? If you agree that 5-20% of achievable yield is being missed then we can work out how many Human genomes we could be sequencing with that unused capacity. To work this out I made some assumptions about the kind of runs people are performing and use PE100 as the standard. On GAIIx I used 50Gb as the yield and 12 days as the run time, for HiSeq I used 250Gb and 10 days. There are currently 529 GAIIx and 425 HiSeq instruments worldwide according to the map. I assumed that these could be used 80% of the time (allowing for maintenance and instrument failures), even though many are used nowhere near that capacity.

Total achievable yield for the world in PE100 sequencing is a staggering 7.5Pb.

Missing just 5% of that through poor quantification loses us 375Tb or about 3500 Human genomes at 100x coverage.

Missing 20% loses us 1500Tb or about 15000 Human genomes at 100x coverage.
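The shortfall arithmetic can be sketched as below. The 7.5Pb total is taken from the text as given. The figure of 100 Gb of sequence per genome is my own assumption for this sketch, chosen because it is roughly what the 20% line implies.

```python
# How many genomes' worth of data a given shortfall represents.
# The 7.5Pb total is taken from the text; the 100 Gb of sequence per genome
# is an assumption chosen for illustration.

def lost_terabases(total_petabases: float, fraction_missed: float) -> float:
    """Sequence lost to poor quantification, in terabases."""
    return total_petabases * 1000 * fraction_missed


def genomes_lost(lost_tb: float, gigabases_per_genome: float = 100.0) -> float:
    """Number of genomes that could have been sequenced with the lost yield."""
    return lost_tb * 1000 / gigabases_per_genome


if __name__ == "__main__":
    for fraction in (0.05, 0.20):
        lost = lost_terabases(7.5, fraction)
        print(f"{fraction:.0%} missed: {lost:.0f} Tb, "
              f"about {genomes_lost(lost):.0f} genomes")
```

Changing gigabases_per_genome shows how sensitive the genome count is to the coverage you assume.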

We need to do better!

The quantitation technology review:
Agilent Bioanalyser (and others) (LOQ 25ng): The bioanalyser uses a capillary electrophoresis chip to run a virtual gel. Whilst the sensitivity is not as good as qPCR or other methods a significant advantage is the collection of both quantitative and qualitative data from a single run using 1ul of library. The Bioanalyser has been used for over a decade to check RNA quality before microarray experiments. The qualitative analysis allows poor libraries to be discarded before any sequence data are generated and this has saved thousands of lanes of sequencing from being performed unnecessarily.

Bioanalyser quantitation is affected by over or under loading and the kits commonly used (DNA1000 and High-Sensitivity) have upper and lower ranges for quantitation. If samples are above the marker peaks then quantitation may not be correct. Done well this system provides usable and robust quantification.

Many labs will run Bioanalyser even if they prefer a different quantitative assay for determining loading concentrations. New systems are also available from Caliper, Qiagen, Shimadzu and I recently saw a very interesting instrument from Advanced Analytical which we are looking at.
Examples of Bioanalyser libraries (good and bad) from CRI

UV spectrophotometry (LOQ 2ng): Probably the worst kind of tool to use for sequencing library quantification. Spectrophotometry is affected by contaminants and will report a quantity based on absorbance by anything in the tube. For the purpose of library quantification we are only interested in adapter ligated PCR products, yet primers and other contaminants will skew the results. As a result quantification is almost always inaccurate.

This is the only platform I would recommend you do not use.

Fluorescent detection (LOQ 1ng): Qubit and other plate based fluorometers use dyes that bind specifically to DNA, ssDNA or RNA and a known standard (take care when making this up) to determine a quantitative estimate of the test sample's actual concentration. The Qubit uses Molecular Probes fluorescent dyes which emit signals ONLY when bound to specific target molecules, even at low concentrations. There are some useful resources on the Invitrogen website and a comparison of Qubit to nanodrop. I don't think it's nice to bash another technology but the Qubit is simply better for this task.
Qubit from Invitrogen website

You can use any plate reading fluorometer and may already have one lurking in your lab or institute.

qPCR (LOQ 0.3-0.003fg): Quantitative PCR (qPCR) is a method of quantifying DNA based on PCR. During a qPCR run intensity data are collected after each PCR cycle from either probes (think TaqMan) or intercalating chemistry (think SYBR). The intensity is directly related to the number of molecules present in the reaction at that cycle and this is a function of the starting amount of DNA. Typically a standard curve (take care when making this up) is run and unknown test samples are compared to the curve to determine a quantitative estimate of the sample's actual concentration. qPCR is incredibly sensitive and quite specific and is the method most people recommend.
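Reading an unknown off a standard curve amounts to a straight-line fit of Ct against log10 concentration. The sketch below does this with plain least squares; all of the Ct values and the dilution series are invented, and the function names are mine rather than anything from a vendor's software.

```python
# Fitting a qPCR standard curve (Ct against log10 concentration) by least
# squares and reading an unknown off it. All Ct values below are invented.
import math


def fit_standard_curve(concentrations, ct_values):
    """Return (slope, intercept) of Ct = slope * log10(concentration) + intercept."""
    xs = [math.log10(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the concentration of an unknown."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """PCR efficiency implied by the slope (1.0 means perfect doubling each cycle)."""
    return 10 ** (-1 / slope) - 1


if __name__ == "__main__":
    standards_pm = [100, 10, 1, 0.1]         # hypothetical dilution series
    standards_ct = [10.1, 13.4, 16.8, 20.0]  # hypothetical Ct values
    slope, intercept = fit_standard_curve(standards_pm, standards_ct)
    print(f"slope {slope:.2f}, efficiency {amplification_efficiency(slope):.0%}")
    print(f"unknown at Ct 15.0: {concentration_from_ct(15.0, slope, intercept):.2f} pM")
```

A slope near -3.3 corresponds to the template doubling every cycle, so the slope doubles as a quick check on how well the standards were made up.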

You can use any qPCR machine and either design your own assay or use a commercial one. You don't need to buy an Illumina qPCR machine or their kit, just use the one available in your lab or one next door and spend the money saved on another genome or three!

In the Illumina qPCR quantification protocol they use a polymerase, dNTPs, and two primers designed to the adapter sequences. The primer and adapter sequences are available from Illumina TechSupport but you do have to ask for them and they should not be generally shared (I don't know why they don't just put them on the web, everyone who wants to know does). The design of the assay means that only adapter ligated PCR products should amplify and you will get a very good estimate of concentration for cluster density. Adapter dimers and other concatemers may also amplify so you need to make sure your sample is not contaminated with too much of these. Illumina also demonstrated that you can use a dissociation curve to determine GC content of your library. You can use this protocol as a starting point for your own if you like.

Illumina qPCR workflow
GC estimation by dissociation curve

Digital PCR (LOQ 0.03fg): Fluidigm's digital PCR platform has been released for library quantification as the SlingShot kit, available for both Illumina and 454. This kit does not require a calibrator sample and uses positive well counts to determine a quantitative estimate of the sample's actual concentration. A single qPCR reaction is set up and loaded onto the Fluidigm chip. This reaction gets partitioned into 765 9nl chambers for PCR. The DNA is loaded at a concentration that results in many wells having no template present. The count of positive wells after PCR is directly related to starting input and quantitation is very sensitive.
SlingShot image from Fluidigm brochure
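The step from positive chamber counts to a concentration is usually done with a Poisson correction, because a positive chamber may have held more than one template molecule. The sketch below shows that standard calculation; whether the SlingShot software does exactly this is an assumption on my part, and the example count of 300 positives is invented.

```python
# Poisson correction for digital PCR counts. This is the standard statistics
# for partitioned reactions; whether the SlingShot software does exactly this
# is an assumption on my part.
import math


def mean_copies_per_chamber(positive: int, total: int) -> float:
    """Average template copies per chamber from the fraction of positive chambers."""
    if positive >= total:
        raise ValueError("every chamber is positive: the sample is too concentrated to count")
    return -math.log(1 - positive / total)


def copies_per_microlitre(positive: int, total: int, chamber_nl: float) -> float:
    """Template concentration in the loaded reaction, in copies per microlitre."""
    return mean_copies_per_chamber(positive, total) / chamber_nl * 1000


if __name__ == "__main__":
    # Hypothetical result: 300 of the 765 chambers (9nl each) are positive.
    print(f"{mean_copies_per_chamber(300, 765):.3f} copies per chamber")
    print(f"{copies_per_microlitre(300, 765, chamber_nl=9.0):.1f} copies per ul")
```

This is also why the DNA has to be loaded dilute enough to leave many chambers empty: once every chamber is positive the count carries no information.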

The biggest drawback is the need to buy a very expensive piece of hardware and this technology has only been adopted by labs using Fluidigm for other applications or in some large facilities.
Those are pretty big numbers and many more genomes than are being sequenced in all but the largest of consortia led projects.

Thursday, 17 November 2011

Cufflinks (the ones you wear not the RNA-seq tool)

Thanks very much to the people that sent me some more chips for an improved set of cuff links. I was not sure whether you required anonymity or not so erred on the side of caution!

It would take quite a nerd to spot the difference between the chips but I shall probably stick to wearing the 318's until the 5 series comes out next year.

314's (left), 316's (right), 318's (in the box)

I now have quite a nice collection of NGS and array consumables for my mini-museum, but if you have something you think would be good to incorporate then do let me know by adding a comment.

And here's my Christmas flowcell from 2007.

(Anyone got an old ABI array?)

Wednesday, 9 November 2011

Something for Mr Rothberg's Christmas stocking?

I have been collecting genomics technologies for a while to create a little display of old and current consumables. As part of this collecting I had more Ion chips than I needed and decided to try and get creative.

The result is (I think) quite a nice pair of cufflinks that I shall wear when I put on a proper shirt for presentations. Expect to see these at the 3rd NGS congress and at AGBT (if I get in, only The Stone Roses sold out faster this year). 

If anyone wants a pair let me know and I am sure we can come to some arrangement including a donation to CRUK. If you have any Ion chips lying around the lab please do send them to me.

Making these made me think of what other things could be done with microarray and sequencing consumables. We spend a fortune on what are disposable items and surely we can come up with interesting ways to reuse these. I will try to get some old Affy chips to do the same with, but they could be a little large. Four years ago we had some flowcells hanging on the Institute Christmas tree.
What else can you come up with?

I am still collecting for my ‘museum’. If anyone has the following please get in touch if you are willing to donate.

My wish list:
373 or 377 gel plates
Affymetrix U95A&B set
ABI gene expression array
Helicos flowcell
Ion torrent 316 and 318 chips
Sequenom chip

Wednesday, 2 November 2011

Illumina generates 300bp reads on MiSeq at Broad

Danielle Perrin, from the genome centre at Broad, presents an interesting webinar demonstrating what the Broad intends to do with MiSeq and lastly a 300bp single read dataset. Watch the seminar here. I thought I'd summarise the webinar as it neatly follows on from a previous post, MiSeq growth potential, where I speculated that MiSeq might be adapted for dual surface, larger tiles to get up to 25Gb. There is still a long way to go to get near this but if the SE300 data presented by Broad holds up for PE runs then we jump to 3.2Gb per run.
Apparently this follows up from an ASHG presentation but as I was not there I missed it.

Apparently Broad has six MiSeqs. Now I understand why mine has yet to turn up! I must add these to the Google map of NGS.

MiSeq intro: The webinar starts with an intro to MiSeq if you have not seen one and goes through cycle time, chemistry and interface. They have run 50 flowcells on 2 instruments since August. They are now up to 6 boxes running well. No chemistry or hardware problems yet and software is being developed.

What will Broad do with it? At the Broad they intend to run many applications on MiSeq: Bacterial assembly, library QC, TruSeq Custom Amplicon, Nextera Sample Prep and Metagenomics. So far they have run 1x8 to 2x151 runs and one 1x300 run.
2x150 metrics look good at 89% Q30, 1.7Gb, 5.5M reads, 0.24% error rate.
Getting cluster density right is still hard even at Broad (something for another blog?). They use Illumina's Eco qPCR system for this.

Bacterial Assembly: The Broad has a standard method for bacterial genomics which uses a mix of libraries; 100x coverage of 3-5KB libs, 100x coverage of 180bp libs, 5ug DNA input and AllPaths assembly. They saw very good concordance from MiSeq to HiSeq. And the MiSeq assembly was actually higher but Danielle did not say why (read quality perhaps).

Library QC: 8bp index in all samples by ligation (they are not using TruSeq library prep at Broad), 96-well library prep, pool all libraries and run them on the number of lanes required based on estimated coverage. QC of these libraries and evenness of pools is important. They run the index first and if the pool is too uneven they will kill the flow cell and start again. They use a positive control in every plate and run as a 2x25bp run to check the quality of the plate. Thinking of moving this QC to MiSeq to improve QC turnaround. They aim to run the same denatured pool onto HiSeq after MiSeq QC. This will avoid a time delay requiring the denaturation to be repeated. Very important in ultra low input libraries where you can run out if flow cells need to be repeated. All QC metrics seem to correlate well between MiSeq and HiSeq.

Amplicons: They presented Nextera validation of 600bp amplicons: 8 amplicons, pool, Nextera, MiSeq workflow (very similar to the workflow I discussed in a recent Illumina interview). And TruSeq Custom Amplicon (see here), Illumina's GoldenGate extension:ligation and PCR system. After PCR samples are normalised using a bead based method, pooled, run on MiSeq (without quantification) and analysed. Danielle showed a slide (#34) with the variation seen in read numbers per sample after bead based normalisation and a CV of only 15%.  I wonder if the bead normalisation method will be adopted for other library types?

SE300bp run: The Broad took a standard kit and ran it as a 300bp single end run. They have done this once and first time round achieved 1.6Gb, 5.29M reads, 65% Q30, 0.4% error. Pretty good to start with and hopefully demonstrating the future possibilities.
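
The yield numbers here are easy to sanity check. A small Python sketch, assuming yield is simply reads x read length x number of ends (it ignores filtering and index reads; the function name is just for illustration):

```python
def run_yield_gb(reads_millions: float, read_length: int, paired: bool = False) -> float:
    """Approximate run yield in gigabases from read count and read length."""
    ends = 2 if paired else 1
    return reads_millions * 1e6 * read_length * ends / 1e9


# The single end 300bp run: 5.29M reads x 300bp is about 1.59Gb,
# in line with the 1.6Gb reported.
se300 = run_yield_gb(5.29, 300)

# If the same read count held up for a paired end run, the yield doubles
# to about 3.17Gb, which is where the 3.2Gb per run figure comes from.
pe300 = run_yield_gb(5.29, 300, paired=True)
```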

How long can you go, 550bp amplicons (PE300) anyone? Another goodbye 454 perhaps?

Monday, 31 October 2011

My genome analysis part III - the results are in

Lots of excitement today on checking to see if my sample has been processed: the results are ready.

You can see I checked the "please load my health data", I guess 23andMe want to make sure you really really do want to find things out about yourself before letting you dive into the results.
The next step is to enter my data; year of birth (40 years ago last Tuesday), sex (with an "I'm not sure" option!), height, weight and smoking status. I answered no to all the medical questions (lucky me), except I am a Psoriatic so to finish off I added that I am using Dovonex cream for my Psoriasis.

Health and disease status: My first result is my blood group status, 23andMe correctly identify me as having an O blood group. I also find out which SNPs are used to determine this and the reference for the paper. Interesting stuff for a scientist like me.

Immediately available is my health status report. This shows 114 disease, 52 trait, 27 carrier status and 20 drug response reports.

3 of these are locked. The locked reports describe the trait or disease being reported and explain the genetic susceptibility BEFORE you choose to reveal results. It certainly felt that the website is designed to guide an informed choice, even if you are missing a face-to-face discussion with a genetic counselor. I looked at my ApoE status and see I have twice the risk of the general population. Certainly not a 'nice' result but not likely to make me lose any sleep.

The first 23andMe discovery I read about was curly hair. According to the analysis I should have slightly curlier hair than the average European. Mine is dead straight and was so even when I was a head-banging teen rocker. My brother has slightly curly hair and my dad was curly (all shaved off now).

I am not a carrier for 24 of the diseases reported. I am a haemochromatosis carrier and not ready to look at BRCA status yet.

The most useful result for me personally is an increased risk of Glaucoma. This had been mentioned at my last opticians visit and I had brushed it off a bit. Seeing a genetic risk as well makes me think I will speak to the optician a bit more at my next appointment and monitor this closely. I'll also start to look at what I can do and what treatments are available for this condition.

Traits and inheritance: As for traits I was happy to see I am a likely sprinter (CRUK half marathon in March next year). I was a little disappointed to find out I am likely to have a typical IQ. It shows what hard work I must have put in to get where I am today and also says to me my kids won't be able to get away with not doing their homework.

There are no closely related individuals on 23andMe, yet. I am 74.24% identical to Neil Hadfield (also on 23andMe); however, I am 71.19% similar to a Chinese person and 68.49% similar to an African. Neil is probably not my long lost brother.

Impressions so far: I will spend a few days looking through this but so far it has persuaded me my £160 birthday present was worth it. It certainly satisfies my curiosity. For now I will leave BRCA status as there are some family things that need to be discussed before diving into that one.

Friday, 28 October 2011

Top ten things to write in a leaving card

I always struggle when it comes to writing in someone's leaving card. I'd like to be witty and at the same time show the person that I remembered them and enjoyed working with them. However most of the time something short gets quickly scribbled.

We get a lot of leaving cards in scientific centres. I guess this is due to the contractual nature of much of the work we do, PhD students and Post-docs are all on three to five year contracts. Many grants only carry money for three to five years or even less. This means people do move on quite regularly.

How much to donate to the gift?
Every time a card comes round I also think about the collection. I try hard for my staff to go round the people they have most likely interacted with and worked for and make a shameless effort to get enough in the way of contributions to buy something decent. In my view if everyone gave £1 or £2 then we should get about £100 which buys a nice gift to remember the lab/institute by.

The top ten things written in leaving cards: Here is a summary of three leaving cards, not perhaps a large enough sample to say anything with statistical confidence (but that's a whole other post about stats).
1: Best wishes 25%.
2: Good luck 15%.
3: Best wishes and good luck combination, 10%.
4: Congratulations 5%.
5: All the best <5%.
6: I will miss you <5%.
7: Thanks for your help <5%.
8: Goodbye <5%.
9: Enjoy the new job <5%.
10: Sorry you are leaving <5%.

I think 50% of 'best wishes' and/or 'good luck' shows just how unimaginative we are so I'd like to encourage everyone reading this to write something far more interesting in the next card that lands on the bench in front of you.

Why doesn't anyone write "I've always loved you" and sign it from a secret admirer?

And don't forget to add a couple of quid, dollars, yen, etc.

Monday, 24 October 2011

My genome analysis part II

My kit arrived, I spat and it is now being processed. It is a pretty neat package with easy to follow instructions. Inside the package was a box containing the kit, which is an Oragene product. Has anyone tested their RNA kits yet?

Before sending it back you have to register the kit on the 23andMe website and link it to an account. The only trouble I had was not being able to enter a UK postcode.

I am now connected to the only other Hadfield on their database and it would be nice if I turned out to be more related to him than anyone else. Let's wait and see.

Consent agreement:
There is a rather lengthy consent document you need to read and sign. This gives 23andMe access to your personal genotype and other non-identifying data for research use in the 23andWe program. As a genomic scientist I am more than happy to do this, large sample sizes are clearly needed for this kind of study. It is a shame that 23andMe don't share the IP with the users. This would be a great way to connect individuals with scientific research. Let's face it, the proceeds would be minimal but if they offered a charity donation option then someone other than just 23andMe might benefit.

The lengthy consent agreement is primarily aimed at making sure I can give informed consent to use my data. Surprisingly to me this is the first time I have ever given informed consent.

23andWe projects:
23andWe is running research projects to "understand the basic causes of disease, develop drugs or other treatments and/or preventive measures, or predict a person's risk of disease". The listed projects cover a wide range from hair colour & freckles to migraine to Parkinson's. They specifically say that they will not investigate "sensitive" topics such as sexual orientation (although they would need to be analysing methylation for Epigaynomics) or drug use. But if they do decide to do so in the future they would contact me and ask for a separate consent agreement. They will also collaborate with external groups but won't release any identifying information. There is of course the worry that 1M genotypes pretty well identifies me to some people.

There is a great website over at the University of Delaware from John H. McDonald, "Myths of Human Genetics". I am sure there are lots of other interesting questions that the public might engage with. Some of these may not be considered high-brow science but if it gets people involved and they consent for other studies surely that has to be a good thing?

Personally I hope 23andMe do some Psoriasis research. I am a Psoriatic and would like to think about some analysis that might be made using 23andMe data, maybe they will even let users start studies one day?

I am now waiting for the data to turn up and then I can take a look at what is lurking in my genome. Fingers crossed it has some good news stories to post about!

Tuesday, 18 October 2011

Life Technologies turn up the heat on Illumina: do we have some real competition?

I was presenting today at an EBI workshop and was followed by Matt Dyer, senior product manager bioinformatics at Ion Torrent. He gave a good introduction to the platform and recent developments and then went on to a hands-on demo of the Torrent suite of applications.

He updated his talk 5 minutes before giving it.
571.43Mbp from a single 316 chip.
551 bp perfect read is current record for length.

Goodbye 454?

Life Tech made a real splash by the sound of things coming out of ASHG.

5500 updates:
Wildfire is the new isothermal amplification removing the painful emulsion PCR, and will be performed on the instrument using the 5500 "flowchips". Sounds a bit like a cBot in a HiSeq!

Certainly does according to Mark Gardner, VP at Life Tech, over on GenomeWeb who says "you coat the flow cell with primers ... and then add template and isothermally amplify." resulting in isolated fragments. Sounds an awful lot like clusters on a flowcell.

Gardner also suggested the much hyped feature of using just one lane on a flowcell. Lastly Gardner says the 5500 will move from 400k cpmm (400 thousand clusters per square millimetre) to 1M cpmm and on both sides of the flowchip.

If all this pans out then there is a real competitor to Illumina for whole genome sequencing and any high-throughput applications.

Incidentally, I found it really hard to find a picture of a flowchip, can someone post some HiRes images? Or post me a flowchip and I'll put it alongside Illumina, PacBio, Ion et al.

5500 Flowchip from

Ion updates:

A new library prep kit, the Ion Xpress™ Plus Fragment Library Kit, uses enzymatic shearing and reduces library prep to 2 hours, making DNA to sequence possible in eight hours. 200bp reads from the Ion Sequencing 200 Kit and talk of reads over 500bp. Custom enrichment for more than 100kb of genome targeting, Ion TargetSeq™ Custom Enrichment Kit. 384 barcodes are coming as well.

Look out Roche and Illumina, Ion is hot on your heels.

PS: I still think of Life Tech 5500 as SOLiD and use this in conversations with others. I'll miss the term when Life Tech marketing finally kill it.

Amplicon NGS battles begin in earnest

A short while ago I posted about the recent exome sequencing comparisons in Genome Research. In that post I did ask whether we really need to target the exome at all and if targeted amplicon sequencing might be a better fit for some projects.

In the last few days both Life Technologies and Illumina have released amplicon resequencing products. You can read another good review of Illumina's offering over at Keith Robison's blog.

I really hope that amplicon NGS is the tool that gets translated into the clinic quickly. Microarrays took over a decade, and only CGH has made it. I am not aware of any gene expression array based clinical tests, other than MammaPrint and the upcoming ColoPrint from Agendia. Amplicon NGS is similar to the current standard Sanger tests in many ways. Labs will still perform PCR and sequencing, they'll just be doing a different PCR and it will be NGS. This should make adoption seem like less of a hurdle.

The other amplicon competition:

Fluidigm's Access Array, RainDance's new ThunderStorm, Halo Genomics, MIPs and traditional multiplex PCR assays are all competition for the in-house kits of Illumina and Life Technologies. The major differences between the platforms are the way in which multiple loci are captured and amplified. Microfluidics, emulsion PCR and oligo-probes are the different 'capture' mechanisms. All rely on PCR for the amplification and to add the sequencing platform adapters and barcodes. The cost of the RainDance instrument is very high, AccessArray is medium and the probe based systems can require almost nothing additional over what is already in your wet lab. AccessArray is the only system where the user has complete flexibility over what goes in the panel, if you want to change something just order a new pair of primers. RainDance, Halo and other platforms, as well as Life and Illumina's offerings, all require you to design a panel and order quite a lot to become cost effective.

Ultimately the cost per sample is going to be what makes one of the systems here, or one as yet to be released, the dominant technology. $10 rather than $100 is what we need to get these tests to every cancer patient!

So what have Life Tech and Illumina got to offer?

Life Technologies "AmpliSeq" amplicon sequencing cancer panel for Ion Torrent:

The Ion cancer panel interrogates >700 mutations using 190 amplicons in 46 genes. Using the 314 chip should get 500 fold coverage and allow detection of variants as low as 5%. The AmpliSeq kit can target 480 amplicons (but is scalable from there) in a single tube reaction with just 10ng DNA input from FFZN or FFPE tissue. PCR and sequencing can be completed in a single day, assuming of course you have the one touch system. They have chosen "the most relevant cancer genes" for the initial panel, most probably from COSMIC.
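
To put the "500 fold coverage, variants as low as 5%" claim into numbers, here is a rough Python sketch. It treats each read as an independent draw (a simple binomial model) and ignores sequencing error and uneven amplicon coverage, so it is optimistic; the threshold of 10 supporting reads is a hypothetical calling rule of my own, not Life Tech's.

```python
from math import comb


def prob_at_least(min_variant_reads: int, depth: int, allele_fraction: float) -> float:
    """Probability of seeing at least min_variant_reads variant reads at a site.

    Binomial model: each of `depth` reads carries the variant with
    probability `allele_fraction`, independently of the others.
    """
    p_fewer = sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1.0 - p_fewer


depth = 500
allele_fraction = 0.05

# On average a 5% variant at 500 fold coverage is supported by 25 reads.
expected_variant_reads = depth * allele_fraction

# Chance of at least 10 supporting reads (hypothetical calling threshold).
p_detect = prob_at_least(10, depth, allele_fraction)
```

The same function shows how quickly things fall apart if an amplicon only gets a fraction of the average coverage, which is why evenness across the 190 amplicons matters as much as the headline depth.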

Life Tech are also involved in the CRUK/TSB funded Stratified Medicines Initiative, on which I worked early on. However I am not sure if they are going to get the Ion test out before a full set of Sanger based assays. It will be interesting to see what comes first on this project and could be a good proxy for seeing how much Life Tech still believes in Sanger as a long-term product. Life Tech are aiming to get this into the clinic and are going to seek FDA approval.

There is no pricing on the press release from Life Tech.

I'd agree with the early access Life Tech customers, Christopher Corless, Marjolijn Ligtenberg and Pierre Laurent-Puig at Oregon, Nijmegen and Paris respectively on the likely benefits of amplicon NGS. The simplicity of these methods will hopefully mean clinical genetics labs adopt them quickly.

Illumina's TruSeq custom amplicon (TCSA) sequencing for MiSeq et al:

Illumina provide a nice tool in the DesignStudio and also recently released a cloud based analysis system called BaseSpace. Both of these are likely to help novices get results quickly. TCSA allows you to target up to 384 amplicons with 96 indices and requires 250ng of input DNA. Illumina use an integrated normalisation solution so you do not have to quantitate each amplicon set before running on a sequencer. This is going to make some people's lives much easier as many do still struggle getting this right every time.

TCSA uses the GoldenGate chemistry as I mentioned at the bottom of a previous post. This makes use of an extension:ligation (see here for one of the original E:L methods) reaction followed by universal PCR to provide better specificity in highly multiplex PCR based reactions. In SNP genotyping GG goes much higher than the 384 plex Illumina are offering on TCSA today. Hopefully this shows the scope for increasing the level of multiplexing.

The benefits of running TCSA on MiSeq are going to be turnaround time and the inbuilt analysis workflows. Of course many users will want to be amplifying 100s-10,000's of samples from FFPE collections and for this purpose Illumina might want to consider modifying their dual-indexing to allow the maximal number of samples to be run in a single HiSeq lane. Right now the limitation of 96 is a pain.

There are no early access customer comments on the TCSA data sheet or Illumina's website. I can't imagine it is going to take too long for the first reports to come out on how well it works though.

Illumina have a pricing calculator on their website so you get to see how much your project is going to cost. Once you have designed an amplicon pool it will let you specify a number of samples and return a project cost inclusive of MiSeq sequencing. I'm not sure who they talked to about the price point for this but it looks like Illumina are aiming at capillary users. The target price is $0.43 per amplicon or nearly $200 per sample! Personally I was hoping we would get to under $50 per sample and as low as $10 or $20. I'd also like to see enough indices such that a large project could be run in one lane on HiSeq making the whole project very cost effective and fast.

Imagine 1500 DNA samples from FFPE blocks for lung cancer being screened for the top 50 Cancer genes with just 15 plates of PCR and one PE100 lane on HiSeq. The whole sample prep could be done by one person in a couple of weeks, and the sequencing completed ten days later.

Watch out for more competition:
It feels like everyone sees amplicon sequencing (Amp-seq anyone?) as the most likely step into the clinic. As such there is going to be stiff competition in this format and that can only be good for all of us wanting to use the technology.

Hopefully it won't be too long before someone compares the results on all of these as well.

Saturday, 15 October 2011

Learning to live with staff turnover: my Top 10 tips for recruitment

There are six people working in the Genomics Core lab I run, including myself. It is exactly five years since I started and in that time five members of my team have come and gone. The next person to go will mean I hit a milestone I had not even thought about before, where there are as many people working for me as there are people who have worked for me in the past!

Staff leaving is never easy. There is a lot of work to be done in recruiting someone new and getting them up to speed in the lab. Also there is an inevitable impact on the rest of the lab as a vacuum is created and someone new has to come in and fill another person's shoes.

However it is not all bad. Leavers offer an opportunity to change how things are done and can mean promotion of others in the lab. Even if this does not happen inevitably the ripple effects mean people get to do some new things and take on new responsibilities.

I recently had my number two person in the lab leave. She has been great and has worked for me for over four years, and will be sorely missed. However it was time for her to move on, a great opportunity arose and she went for it and I wish her the very best. I had to recruit and thought I'd post on my experiences for others to consider.

My Top 10 tips for recruitment

1 Write a good job description: It might sound like an obvious one but get it wrong and you'll never get the right person. This is the time to really consider what you need this new person to do in the lab. It is an opportunity to change responsibilities as someone new can take on something the other person never did without even knowing it.

2 Write a good advert: I always struggle with this. How to get a good ad that attracts people to apply is tough. I always get our HR team to help with this and usually aim for online advertising now. The cost of ads in Science and Nature is very high in the print editions. Online is no cheap option either, but the ad needs to be seen.

3 Read covering letters and CVs carefully: For my last job opening I got over 50 applicants. There is not enough time to read every one in detail and fortunately our HR team use an online system that allows me to screen and reject poor candidates quite easily. I usually start with the covering letter and if this does not grab me put the candidate straight into the reject pile. It might seem tough but the covering letter is the opportunity for the applicant to shine and to shout out why they are the best person for this job. The CV should be clear and allow me to see what skills they have and what their job experience is. A list of 40 publications is a bad idea and off putting. Personally I like to see no more than three.

4 Use a scoring matrix for possible candidates: I start by deciding which criteria are most important for the job, perhaps specific skills. I then make a table in Excel to record how each candidate measures up on a three point scoring system. I use the results of this to decide which candidates to invite in for interview and also use it to decide on the order of interviews. I like to get the best candidate in first and then see the others in order of preference if possible. It can get tiring doing interviews so I want to be as fresh as possible for the best candidates. This matrix also helps if someone comes back later to ask why they did not get the job as there is evidence they might not have measured up against other candidates.

5 Generate a list of questions for the interview: These do not have to be kept to rigidly but they offer an opportunity to keep interviews as similar as possible so you can make an unbiased decision. They also allow you to think of something to ask if a candidate turns out to be very poor. I would not recommend you stick to an hour's interview if it really is not going anywhere, get the candidate out of the door and move on.

6 Get candidates to present: I have found a ten minute presentation a great way to start off an interview. I use a rather vague title for talks like "Cancer genomics in a core facility" and allow candidates freedom to interpret this as they see best. This certainly sorts out people who are really thinking how a core might run from post-docs that would really just like another research post. A talk also gives you an idea of how the person will communicate with others in the job. And it shows you how much homework they have done for this job.

7 Show candidates the lab: I ask people in my team or collaborating labs to take people around and then get their feedback on the candidates as well. Sometimes people relax in this scenario and their true personality comes out. If someone seems interested in the Institute and the work we are doing then great. If all they care about is the holiday package and any perks they are unlikely to make this clear in the formal interview.

8 Talk to the interview panel: I get mine to rate the top three candidates in order of preference. Having each person do this independently can help when there is a difficult choice. If the three don't agree then you can have an informed discussion as to why. Of course hopefully it is clear and the same candidate comes out on top.

9 Make a good offer: I like to personally call someone when offering a job in my lab. It is one of the nicest things about being a boss and I hope makes a better impression on the individual than having HR ring them up. Personally though I leave pay and conditions to HR, I just stick to questions about the job. Leaving HR to the complex discussions on pay is helpful. They won't get carried away with packages and can answer all the questions individuals might have.

10 Help them settle in: When someone new turns up make sure you give them the time to settle in, explain the job again, introduce them to everyone again, take them on another tour. I like to sit down on the first afternoon and have an informal chat about what I want them to do and where I think the lab is going. Give people some of your time.

Hopefully everything goes well and the new person settles in fine. I am excited about my latest recruit and hope your next recruitment goes smoothly.

Illumina and other life science stock slipping fast

Illumina's stock has dropped dramatically in the last few months.

I have watched Illumina over the last six or seven years and their price has seemed to follow a continual upward trajectory, with the exception of a couple of hiccups, one of which was caused by the Paired End reagent problems in 2009. This time it looks like the global recession is finally starting to hit Genomics expenditure.

And just as we were all having so much fun!

The stock had been at $60-80 for most of the last year. But in July it dipped below $60 and by mid September was under $50.

Today it stands at $27.

Several of the investment companies have recently downgraded life science companies including Illumina as sales forecasts are not looking as stellar as they have in recent years. I am sure Illumina and Life are placing big bets on MiSeq and Ion. Illumina are only just starting to ship and if they can't deliver in the volumes expected I think the banks won't take that as a good sign. At least the PE issues were against a background of incredibly strong demand from users.

Life Technologies has had a similar drop from $70 and is currently at $36.82.

Thursday, 13 October 2011

The 'embarrassing' science of Olympic drug testing in the UK

GlaxoSmithKline have won the contract to help perform drug testing at next summer's 2012 Olympic games with King's College London's Drug Control Centre. There is a report on the BBC news from the 10th of October which is still being repeated. It is, quite frankly, embarrassing.

It shows a lab similar to the one that will be used for the actual testing. The news team focused on the robots and show an automated pipetting robot that could be used to make sure athletes don't cheat. Watch the video carefully and at 1 min 39 sec the robot does its thing, 96 pipette tips come down into the 96 well plate and liquid comes flooding out all over the deck of the robot.


I am sure Professor David Cowan, King's College London's Drug Control Centre Director would prefer Gold rather than a wooden spoon!

Perhaps he should employ Gabriel See, the 11 year old who was one of the team that built a liquid handling robot out of lego. His worked about as well as the GSK one.

Sorry to those of you outside of the UK, you might not be able to watch this.

Friday, 7 October 2011

Illumina Custom Capture: Design Studio review

Illumina are currently offering a demo kit for custom TruSeq capture. I thought I would try it out on some genes from COSMIC and see how easy it was to design a capture set using their DesignStudio tool. There is also a pricing calculator I was very interested in so we can see how much the final product is likely to cost.

The TruSeq Custom capture kit is an in-solution method that allows users to target 0.7-15 Mb of sequence. The design studio site will produce 2,500-67,000 custom oligos. After ordering you simply make lots of libraries with TruSeq DNA Sample Preparation Kits and perform the capture reactions in pools of up to 12-plex. This makes the process pretty efficient if you have lots of samples to screen. As the kits come in 24 or 96 reaction sizes a total of nearly 300-1200 samples can be processed together. With 24 indexes currently possible in TruSeq DNA kits this is just 2-4 lanes of sequencing. As Illumina move to 96 indexes for TruSeq kits the sequencing cost will continue to drop.

You need to register with Illumina for an iCom account, then you can just log-in to the "Design Studio" and get started.

Illumina DesignStudio: Start a project, choose the genome, upload loci and the tool does its job.

There are multiple ways to get your genes of interest into their database. I chose to upload a csv file with a list of the 100 most mutated genes in COSMIC. The template file Illumina provide is very minimal. There are columns for gene name, offset bases, target type (exon or whole gene), density (standard or dense) and a user definable label. The upload was simple enough and in about thirty seconds all the genomic coordinates for the list of genes were available. The processing for bait design took a little longer at about ten minutes.

The design tool predicts coverage and gives a quality score (this is a cumulative score for the entire region targeted) for the targeting. Each probe set is shown in a browser and coloured green for OK or yellow for problematic. For my 100 genes 12 were under a 90% score and two were not designed at all because I entered names with additional characters making them incomprehensible by the tool.

Here is a screenshot of TP53 exon probes:

There is a pricing calculator available as well which I'll talk about in a minute.

My "100 most mutated COSMIC genes" custom capture kit:
It took about twenty minutes to pull this together.
Regions of Interest Targeted: 2662
Final Attempted Probes Selected: 3693
Number of Gaps: 80
Total Gap Distance: 3,158
Non-redundant Design Footprint: 736,759
Design Redundancy: 3%
Percent Coverage: 100%
Estimated Success: ≥ 95%

How much will this kit cost?
The pricing calculator has some variable fields you need to fill in. It uses the data from my region list and asks for library size (default 400), how many samples you want to run (288 minimum), what level of multiplexing (12 plex for this example), and then platform and read type. Lastly it asks you to select the % of bases covered in the regions and at what fold coverage, e.g. 95% of bases at 10 fold (in this example).

Unfortunately the calculator did not work!

Fortunately Illumina provide another one here and this did. This also recommends how many flowcells and what other items are needed to run your custom capture project.

It turns out that I will need:
My TruSeq Custom Enrichment Kit at $33,845.76 in this example.
6x TruSeq DNA Sample Prep Kit v2-Set A (48rxn with PCR) at $12k.
3x TruSeq SBS Kit v3 - HS (200-cycles) at $17k.
3x TruSeq PE Cluster Kit v3 - cBot - HS at $13k.
A total of about $76k or $263 per sample.
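The arithmetic behind that total is easy to check. This is a minimal sketch, assuming the dollar figures quoted above are totals for each line (not unit prices) and that the run is the 288 sample minimum entered into the calculator.

```python
# Line totals in USD as quoted above (assumed to be totals, not unit prices).
line_items = {
    "TruSeq Custom Enrichment Kit": 33845.76,
    "6x TruSeq DNA Sample Prep Kit v2-Set A": 12000.00,
    "3x TruSeq SBS Kit v3 - HS (200-cycles)": 17000.00,
    "3x TruSeq PE Cluster Kit v3 - cBot - HS": 13000.00,
}
SAMPLES = 288  # minimum sample number entered into the calculator

total = sum(line_items.values())
per_sample = total / SAMPLES

print(f"Total: ${total:,.2f}")
print(f"Per sample: ${per_sample:,.2f}")
```

With the rounded kit prices this comes out at a little under $76k in total and roughly $263 per sample.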

Trying it out:
If you are interested in trialling this then Illumina are offering a 50% off promo on a 5000 oligo capture kit for up to 288 samples (24 pull downs at 12 plex). You can also order a $2860 TruSeq custom capture demo kit that targets ~400 cancer genes (again from COSMIC I expect; there are also autoimmune and ES cell gene kits). The kit includes both TruSeq DNA library prep and the custom capture reagents for processing 48 samples. This works out at $60 per sample for pulldown. If this were run on two lanes of HiSeq v3 the cost including sequencing would still be under $100 per sample.

I am not sure how much sequence is going to be needed to get good coverage on this particular kit, but my kit is a little smaller than the Cancer trial kit so the results from the pricing calculator should be indicative.

When will custom amplicon be released?
For me the most interesting thing Illumina mentioned in the MiSeq release at AGBT was the GoldenGate based custom amplicon product. Hopefully it will appear soon on DesignStudio as well. This will allow us to remove library prep almost entirely and process samples from much smaller amounts of DNA.

Illumina are also releasing dual barcoding in the next release of HCS. This will allow four reads from each library molecule so with just 24 barcodes you may be able to multiplex 576 samples into one MiSeq run with just six 96 well plate reactions to get almost ten fold coverage of all TP53 exons.
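The 576 figure is just 24 x 24. The sketch below assumes the same 24 barcodes can be used at both index positions and that every pairing is usable; the barcode names are made up for illustration.

```python
from itertools import product

# 24 hypothetical barcode names; real index sequences are not shown here.
barcodes = [f"BC{i:02d}" for i in range(1, 25)]

# Every pairing of a first-index barcode with a second-index barcode.
pairs = list(product(barcodes, repeat=2))

plates = len(pairs) / 96  # 96-well plates needed to hold one sample per well

print(len(pairs), "index combinations")
print(plates, "plates of 96 wells")
```

That gives 576 combinations, which fills exactly six 96 well plates.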

Thursday, 6 October 2011

Exome capture comparison publication splurge

In the last few days there has been a mini splurge in papers reviewing capture technologies. I thought it would be useful to write an overview of these. I have been involved in several comparison papers and so am aware of the limitations of comparison experiments. Many comparison publications fail to give readers the one answer they are looking for, a clear "which is better" statement. In fact the Sulonen paper discussed below says "the question often asked from a sequencing core laboratory thus is: “Which exome capture method should I use?”" They do appear to skirt the issue though in their final conclusions.

GenomeWeb has reviewed Michael Snyder’s Nature Biotechnology paper, with an interview, and pointed out many of its highlights.

Exome-Seq via genome capture protocols, either on arrays or in-solution, has been making quite a splash with many hundreds or even thousands of samples being published. In-solution methods seem to have won out, which is not surprising given the throughput requirements of researchers. Exome-Seq has become pretty common in many labs, allowing an interesting portion of the genome to be interrogated more cost effectively than whole genome sequencing (WGS) would allow. Even a $1000 genome might not supplant Exome-Seq, as the number of samples that can be run is significantly higher and this is likely to give many projects the power to discover significant variants, albeit only in the targeted regions of course.

Exome-Seq kits vary significantly in the regions and amount of genome captured. Unfortunately the latest kits from each provider are not easily available on UCSC for comparison as annotation tracks. If you have a specific set of genes you are interested in you need to go out and find this information yourself. Both Agilent and Illumina require you to register on their websites to download tracks. Nimblegen's are available from their website.

Table 1: Sulonen et al produce a comprehensive table comparing Agilent to Nimblegen and this has helped with some of the detail in my table below. I have chosen not to include details on what is captured as this is still changing quite rapidly. I have instead focussed on factors likely to impact projects such as input DNA requirements, pooling and time.

Table 1

Parla et al: A comparative analysis of exome capture. Genome Biol. 2011 Sep 29;12(9):R97.
Dick McCombie's lab at Cold Spring Harbor compared two exome capture kits and found that both kits performed well. These focussed on CCDS capture and as such did not capture everything researchers may be interested in. Both were sequenced on the Illumina platform.

Asan et al: Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011 Sep 28;12(9):R95.
This group compared one array based kit with two in-solution versions: Nimblegen's array and in-solution kits and Agilent's SureSelect kit. They used the DNA sample from the first Asian genome, published in Nature in 2008. They also reported the differences in regions captured by these kits. This is mainly a design decision and there is no reason I am aware of that would not allow each company to target exactly the same regions. Asan et al found that Nimblegen produced better uniformity at 30-100 fold coverage. All platforms called SNPs equally well. They compared SNP calls from sequencing to array based genotyping on Illumina 1M chips and reported >99% concordance between sequencing and arrays. They also discuss the advantages of in-solution methods over array based ones. HiSeq PE90bp, data submitted to SRA.

Sulonen et al: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011 Sep 28;12(9):R94.
Sulonen et al used a single control DNA across two kits each from Agilent and Nimblegen. They found that Nimblegen generated lower amounts of off-target sequence and showed more specific targeting and enrichment than Agilent. The Nimblegen kit was most efficient and captured the exome targeted with just 20 fold coverage. Agilent produced fewer duplicate reads. The 2010 Genome Biology paper by Bainbridge et al discussed duplicate reads, their suggestion being that these come from low complexity libraries. They also stated that these can be difficult to screen out. We have been looking at library indexing approaches that could incorporate a random sequence in the index read. This would allow PCR duplicates to be removed quite easily. Sulonen et al again reported the negative impact of GC content on capture and said the long baits on the Agilent platform appeared to be slightly more impacted by this. Interestingly they reported that where a SNP was heterozygous more reference alleles were called than would have been expected, and explained this as a result of the capture probes being designed to the reference allele. However the genotype concordance of sequencing to arrays, this time on Illumina 660W Quad chips, was >99% from a coverage of just 11 fold. The authors don't do a great job of saying what they did in the lab. They report sequencing reads of 60-100bp but in the sequencing methods don't say whether this is single or paired end nor what instrument or chemistry was used. They did submit their data to SRA though.
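To show why a random sequence in the index read would make duplicate removal easy, here is a minimal sketch. It assumes each aligned read carries its mapping position plus the random tag, and it treats reads sharing position, strand and tag as PCR copies of one original molecule. The `Read` record and its field names are hypothetical, not from any real tool, and a real pipeline would also need to allow for sequencing errors in the tag.

```python
from collections import namedtuple

# Hypothetical minimal record for an aligned read with a random index tag.
Read = namedtuple("Read", ["name", "chrom", "pos", "strand", "random_tag"])


def remove_pcr_duplicates(reads):
    """Keep the first read seen for each (chrom, pos, strand, random_tag)."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand, read.random_tag)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    Read("r1", "chr17", 7577000, "+", "ACGTAC"),
    Read("r2", "chr17", 7577000, "+", "ACGTAC"),  # same position and tag: duplicate
    Read("r3", "chr17", 7577000, "+", "TTGCAA"),  # same position, new tag: kept
]

print([r.name for r in remove_pcr_duplicates(reads)])
```

Without the tag, r3 would look like just another duplicate of r1 and a genuinely independent molecule would be thrown away.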

Clark et al: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011 Sep 25. doi: 10.1038/nbt.1975
Michael Snyder's lab at Stanford compared exome capture kits from Agilent, Illumina and Nimblegen using the same human sample. They found that Nimblegen covered the fewest regions but required the lowest amount of sequencing, whilst Agilent and Illumina covered more regions but needed higher sequence coverage. Additionally Illumina captured non-coding regions and regulatory sequence not targeted by the other platforms; this is going to be a key development for some researchers. Lastly this group compared the exome data to whole genome sequencing of the same samples and interestingly found that Exome-Seq discovered additional small variants missed by WGS.

Some interesting stats from the sequencing data include: off target reads of one third for Illumina compared to 13% and 9% for Agilent and Nimblegen respectively. Illumina did respond to this in a GenomeWeb article stating that their new TruSeq kits reduced duplication rates to generate far better results. Genomic loci high in GC were less well targeted and Agilent performed best in the regions where data could be compared. Illumina captured most SNPs but targeted the most sequence so no real surprise there. Where the three platforms overlapped Nimblegen was most efficient. HiSeq

Natsoulis, G. et al., 2011. A Flexible Approach for Highly Multiplexed Candidate Gene Targeted Resequencing. PloS one, 6(6).

I wanted to include this PloS One paper as the group took quite a different approach, which may well be a very useful one for other groups. Instead of purchasing a whole exome capture kit Natsoulis et al designed 100bp oligos as baits to the human exome and published these as an open resource. Now anyone can order the baits they are interested in and perform custom capture in their own lab.

How much does Exome-Seq cost?

Snyder’s paper included some comments on the costs of capture, which they said were "highly negotiable". The biggest change coming is in the pooling strategies, with all platforms moving to six or eight plex pooling before capture and Illumina's custom capture kits now supporting a 12 plex reaction. This makes the workflow much easier for large numbers of samples. I have been following this for a while and costs are dropping so rapidly as to make comparison or projection a bit of a waste of time. The number of sequencing lanes required is also changing as the density on Illumina continues to rise. Illumina handily provide a table estimating the number of exomes that can be run per lane on HiSeq and other platforms: HiSeq 600Gb v3 chemistry allows 7 exomes per lane or 56 per flowcell at 50x coverage. And an exome might be achievable on a MiSeq next year. I have estimated our internal exome costs to be about £300-450 depending on coverage and read type, inclusive of library prep, capture and sequencing. We are only just starting to run these in my lab though so I'll soon find out how close my estimates really are.
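For planning purposes the per-lane figure turns into a lane count very simply. This is a rough sketch using only the numbers quoted above (7 exomes per lane and 56 per flowcell, which implies 8 lanes); the 96 sample project is a made-up example.

```python
import math

EXOMES_PER_LANE = 7     # quoted above for HiSeq v3 chemistry at 50x coverage
LANES_PER_FLOWCELL = 8  # implied by 56 exomes per flowcell / 7 per lane


def lanes_needed(n_exomes, per_lane=EXOMES_PER_LANE):
    """Whole lanes required, rounding up any partly filled lane."""
    return math.ceil(n_exomes / per_lane)


print(EXOMES_PER_LANE * LANES_PER_FLOWCELL, "exomes per flowcell")
print(lanes_needed(96), "lanes for a hypothetical 96 exome project")
```

As densities rise the per-lane constant is the only thing that needs changing.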

Do you really need to target the exome?

A lot of people I have talked to are now looking at screening pipelines which use Exome-Seq ahead of WGS to reduce the number of whole human genomes to be sequenced. The idea is that the exome run will find mutations that can be followed up in many cases, and only those samples with no hits need be selected for WGS.

Running a Cancer Genomics Core Facility, I am also wondering how the smaller targeted panels like Illumina's demo TruSeq Custom Capture Kit: Cancer Panel will fit into this screening regime. These can be multiplexed to a higher level and target many fewer regions, but cost a lot less to sequence and analyse. Perhaps the start of the process should be this, or even next-gen PCR based 'capture' from the likes of Fluidigm?


1. Parla et al: A comparative analysis of exome capture. Genome Biol. 2011 Sep 29;12(9):R97.

2. Asan et al: Comprehensive comparison of three commercial human whole-exome capture platforms. Genome Biol. 2011 Sep 28;12(9):R95.

3. Sulonen et al: Comparison of solution-based exome capture methods for next generation sequencing. Genome Biol. 2011 Sep 28;12(9):R94.

4. Clark et al: Performance comparison of exome DNA sequencing technologies. Nat Biotechnol. 2011 Sep 25. doi: 10.1038/nbt.1975

5. Natsoulis, G. et al., 2011. A Flexible Approach for Highly Multiplexed Candidate Gene Targeted Resequencing. PloS one, 6(6).

6. Bainbridge, M.N. et al., 2010. Whole exome capture in solution with 3Gbp of data. Genome Biology.

7. Kingsmore, S.F. & Saunders, C.J., 2011. Deep Sequencing of Patient Genomes for Disease Diagnosis: When Will It Become Routine? Science Translational Medicine, 3(87), p.1-4. Review of Bainbridge et al and discussion of WGS and targeted or Exome-Seq. They also suggest that an exome costs 5-15 fold less than a WGS.

8. Maxmen, A., 2011. Exome Sequencing Deciphers Rare Diseases. Cell, 144, p.635-637. A review of the Undiagnosed Diseases Program at NIH. Exome-Seq and high-resolution microarrays for genotyping. They mention the team’s first reported discovery of a new disease, which was published in The New England Journal of Medicine.