
Some personal comments on “The genetic and genomic map of multiple sclerosis”, now available on bioRxiv

A few hours ago, with great pride, we submitted the Genetic and Genomic Map of Multiple Sclerosis for consideration at a weekly journal. In parallel, we made the paper available on bioRxiv to get our results out as soon as possible.

This work is the outcome of a collaboration of hundreds of scientists (the International Multiple Sclerosis Genetics Consortium; IMSGC), the selflessness of hundreds of thousands of individuals who donated their DNA for research, and many agencies around the planet that funded local, national, or international efforts. Of course, I am perhaps the most biased person regarding this study, but it is undeniably a landmark study for the genetics of multiple sclerosis (MS) and other common diseases. Here is a small (and not exhaustive) list:

  1. It is one of the few genetic studies of such a large sample size, if not the only one, in which the analysis team had access to the data for all ~110K samples.
  2. We identified 233 genome-wide significant (p-value < 5×10⁻⁸) genetic variants, more than doubling the number known in MS. This also makes MS one of the most well-studied diseases in terms of genetics.
  3. We performed an extensive genome-wide analysis in the discovery phase (~41K samples) to find statistically independent effects, setting our threshold to a p-value of 0.05. Thus, we analyzed any part of the genome that had even the slightest evidence of association.
  4. We designed a disease-specific genotyping array, the MS Chip, to replicate the variants found in the discovery phase in ~40K samples.
  5. We used two replication sets that jointly had roughly twice the sample size of the discovery set.
  6. We could explain ~48% of the heritability of MS!
  7. We applied an ensemble of approaches to prioritize putatively causal genes and provided detailed lists to the community. We hope this will allow many scientists who are less computationally savvy to better inform and expedite their studies.
  8. (… many more in other posts)

In papers of this scale, many small details are lost and many more are never reported. This is a small effort to offer some tidbits and anecdotes, especially from the discovery phase, which took place many years ago. Perhaps other scientists have already encountered, or will encounter, something similar. If you do, feel free to reach out.

The part of the paper where we identified the associated genetic variants had two phases: discovery and replication. The discovery phase included ~41.5K samples. The replication phase had two data collections of ~40K (MS Chip) and ~33K (ImmunoChip) samples. This post is about the discovery phase, which took place in 2011. Perhaps in other posts I will comment on the rest, along with other parts of the paper.

Some numbers and tidbits for the discovery phase analysis:

  • During the discovery phase we had to merge cases with controls of similar ancestry, especially for samples originating from the WTCCC2 data. I created more than 100 permutations to find the best combination of cases and controls. Each step took several days, given that I was running a watered-down quality control (QC) pipeline and principal component analysis (PCA). That was back in 2011 (!) with PLINK 1.7 and EIGENSOFT v2 or v3. Amazing, groundbreaking tools, but back then I hit their limits so many times. This is where my interest in efficient and optimized software originated (see Research for a hint of what we are working on). A minimal sketch of one such merge-QC-PCA iteration is shown after this list.
  • When the case-control data sets for the discovery phase were set in stone, the real thing started: ~42K samples with genome-wide data (we are still in 2011). Bash arrays and directory-structure strategies, coupled with high-performance computing clusters, saved the day (see the directory-layout sketch after this list). A bird’s-eye view of the directory structure highlights the beauty of symmetry. From then onwards, all similar projects, e.g. the MS Chip replication data sets, have followed the same structural pattern. A sad realization is that computational “wealth” enables better science. Even today, a lot of scientists cannot fathom the importance of, and need for, computational resources.
  • Then the imputation day came… The first version of the 1000 Genomes reference panel had just come out, but everybody realized that there was no way to impute data with the same strategies we had used for HapMap. Even attempts to impute the smallest chromosome, i.e. chromosome 22, ended in spectacular crashes of hundreds of processes on the cluster. Back to the drawing board and the search for new tools. Enter BEAGLE; we had been working with MACH until then. Fewer crashes, but still very far from being able to impute a full chromosome. So, divide and conquer! I wrote some code to split the chromosomes, impute the pieces, and put them back together (see the chunking sketch after this list). Lots of testing, re-writing, and more of the same, and after a few days we had a working imputation pipeline. And then the moment of truth: how do you impute 42K samples with the new 1000 Genomes panel that includes ~34M genetic positions? No one (well, maybe a handful of groups) had achieved anything like that before. Three clusters, 73,297 processes, ~1.2M CPU hours, and 4 months later, we were ready! If I had only one (fast, really fast) computer, I would have needed ~130 years to finish this step. 130 years! You gotta love computer science! By the way, this is where we passed the 5TB-of-data mark. Then ~2 days (!) of analyzing these TBs of data with PLINK v1.7, in parallel on a cluster, and we had our first genome-wide results! A genomic inflation factor of 1.13 (!), far smaller than the ~1.4 that other collaborators had faced previously with half the data.
  • On December 12, 2011 the “Meta-analysis v3.0 First-Pass and Per Stratum Results v1.0” was sent to the consortium list. 75 pages! From that report: “Overall 35,272 SNPs reached a p-value less than 1.0e-05 and 26,352 genome-wide significance under fixed effects model (<5e-08, uncorrected p-values). These 35,272 are located 173 2Mbs regions were identified with the best tagging SNP having an uncorrected p-value (fixed effects) less than 1.0e-05”. Translation: we might find more than 150 genome-wide hits in MS by the time we finished the project (we found many more)! Back then (2011) we only knew of ~50!!
  • Next step: the independent effects. Long story short: when we find one genetic variant with a low p-value, hence a stronger association, there are several hundred others around it that also have low p-values. Most of them capture the same effect, i.e. they point to the same underlying cause. Others, however, are independent: they would carry some effect even if the original genetic variant were not there. One way to test this is to put them in the same model. In practice this process is done in steps (see the conditional-analysis sketch after this list), and we had to do it in each of the ~2K regions into which we had split the genome… A couple of weeks of writing code, testing, crashing, designing, and re-writing later, I had the final version, which took ~1 full week to run on ~1,000 nodes using PLINK v1.7 (nowadays PLINK v1.9 could do this in ~1 day on ~1,000 nodes; our new in-house software in ~1 day on <200 nodes). Oh, and this step generated more than 10M result files…!
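For readers curious what a (heavily simplified) merge-QC-PCA iteration from the first bullet might look like, here is a minimal bash sketch. It assumes PLINK-format filesets named cases and controls and an EIGENSOFT smartpca installation; the file names and thresholds are illustrative, not the actual values used in the study, and depending on the EIGENSOFT version a convertf step to EIGENSTRAT format may be needed before PCA.

```bash
#!/bin/bash
# Minimal sketch of one merge + QC + PCA iteration (illustrative names/thresholds).

# 1. Merge a candidate case fileset with an ancestry-matched control fileset.
plink --bfile cases --bmerge controls.bed controls.bim controls.fam \
      --make-bed --out merged

# 2. Watered-down QC: drop low call-rate SNPs/samples, rare SNPs, HWE failures.
plink --bfile merged --geno 0.02 --mind 0.02 --maf 0.01 --hwe 0.000001 \
      --make-bed --out merged_qc

# 3. LD-prune before PCA so the principal components reflect ancestry, not LD blocks.
plink --bfile merged_qc --indep-pairwise 50 5 0.2 --out pruned
plink --bfile merged_qc --extract pruned.prune.in --make-bed --out merged_pca

# 4. Run EIGENSOFT smartpca on the pruned data via a parameter file.
cat > smartpca.par <<EOF
genotypename: merged_pca.bed
snpname:      merged_pca.bim
indivname:    merged_pca.fam
evecoutname:  merged_pca.evec
evaloutname:  merged_pca.eval
numoutevec:   10
EOF
smartpca -p smartpca.par > smartpca.log
```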
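The “bash arrays plus symmetric directory structure” idea from the second bullet can be as simple as the sketch below; the stratum labels, stage names, and the run_qc.sh wrapper are hypothetical, just to illustrate the pattern of one identical sub-tree per case-control stratum.

```bash
#!/bin/bash
# Hypothetical symmetric layout: every stratum gets the same sub-tree, so
# downstream scripts can be written once and looped over strata.

strata=(UK US SWE NOR FIN GER ITA)              # illustrative stratum labels
stages=(genotypes qc pca imputation assoc logs) # illustrative pipeline stages

for s in "${strata[@]}"; do
  for stage in "${stages[@]}"; do
    mkdir -p "discovery/${s}/${stage}"
  done
done

# Any per-stratum job can then be submitted with the same one-liner, e.g.:
for s in "${strata[@]}"; do
  echo "qsub -v STRATUM=${s} run_qc.sh"   # run_qc.sh is a hypothetical wrapper
done
```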
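The divide-and-conquer imputation in the third bullet boils down to: cut each chromosome into overlapping windows, impute each window as an independent cluster job, and stitch the pieces back together. The sketch below shows only the splitting and submission logic; impute_window.sh stands in for whatever the actual BEAGLE invocation was and is purely hypothetical, as are the window and overlap sizes.

```bash
#!/bin/bash
# Hypothetical divide-and-conquer driver: split a chromosome into overlapping
# windows and submit one imputation job per window (tool-specific flags omitted).

CHR=22
CHR_END=51000000          # approximate length of chr22 in bp (illustrative)
WINDOW=5000000            # 5 Mb core window per job
OVERLAP=250000            # 250 kb overlap so adjacent windows can be stitched

start=1
job=0
while [ "$start" -le "$CHR_END" ]; do
  end=$(( start + WINDOW - 1 ))
  lo=$(( start - OVERLAP )); [ "$lo" -lt 1 ] && lo=1
  hi=$(( end + OVERLAP ))

  # impute_window.sh is a hypothetical wrapper around the imputation tool
  # (BEAGLE in our case); it reads study and reference-panel data for
  # chr${CHR}:${lo}-${hi} and writes chunk_${CHR}_${job} output files.
  qsub -v CHR=${CHR},FROM=${lo},TO=${hi},OUT=chunk_${CHR}_${job} impute_window.sh

  start=$(( end + 1 ))
  job=$(( job + 1 ))
done

# After all jobs finish, trim the overlaps and concatenate chunk_${CHR}_* back
# into a single per-chromosome file (stitching script not shown).
```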
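Finally, the stepwise search for independent effects in the last bullet is, at its core, a loop of conditional regressions: test all variants in a region, take the strongest one, add it to the model as a covariate, and repeat until nothing remains below the threshold. A minimal PLINK-flavoured sketch for a single region follows; the fileset name and region are hypothetical, and the real analysis of course ran per stratum with a meta-analysis on top.

```bash
#!/bin/bash
# Hypothetical stepwise conditional analysis for one region (illustrative files).

REGION_BFILE=region_chr6_32e6    # PLINK fileset restricted to one 2 Mb region
THRESH=0.05                      # threshold used when searching for secondary effects
> conditioned.snps               # start with an empty list of conditioning SNPs

round=1
while : ; do
  if [ -s conditioned.snps ]; then
    plink --bfile ${REGION_BFILE} --logistic --condition-list conditioned.snps \
          --out round_${round}
  else
    plink --bfile ${REGION_BFILE} --logistic --out round_${round}
  fi

  # Pick the most significant remaining SNP (ADD rows; P is column 9 in .assoc.logistic).
  top=$(awk 'NR>1 && $5=="ADD" && $9!="NA"' round_${round}.assoc.logistic \
        | sort -g -k9,9 | head -1)
  top_snp=$(echo "$top" | awk '{print $2}')
  top_p=$(echo "$top"  | awk '{print $9}')

  [ -z "$top_snp" ] && break                 # nothing testable left
  keep=$(awk -v p="$top_p" -v t="$THRESH" 'BEGIN{print (p+0 < t+0) ? 1 : 0}')
  [ "$keep" -eq 0 ] && break                 # best remaining SNP not below threshold

  echo "$top_snp" >> conditioned.snps        # condition on it and repeat
  round=$(( round + 1 ))
done

echo "Independent signals in this region:"
cat conditioned.snps
```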

 

PS: Eight years before the moment I pressed the button to submit the paper to the journal (on behalf of the corresponding author Phil De Jager and the IMSGC), I was waiting in the Zurich airport during a layover. It was my “2-suitcases” flight from Greece to Boston, starting a new chapter in my professional life. Reflecting back on that day, I could never have imagined what I would be doing eight years later. This is one of the proudest moments of my life. The date coincidence is the cherry on top. A sign to mark the start of the next big things that are coming, faster this time(!). Check our “Research” page for a small taste.

Note: This post reflects personal opinions only. None of these comments represents the opinions of any of the other authors or of the consortium.

 

Extra: A snapshot of the Facebook post from the moment before I opened the file with the results of the discovery phase. I still smile when I see the “4 likes”. Only other “nerds” could understand this. And of course, a much-needed reference to the Answer to the Ultimate Question of Life, the Universe and Everything; after all, I was also using supercomputers.
