
C2D3 Computational Biology


We are living in a very exciting time for biology: whole-genome sequencing has opened up the field of genome-scale biology, and with it a trend towards larger-scale experiments, whether based on DNA sequencing or on other technologies such as microscopy. However, it is also a time of great opportunity for small-scale biology, as there is a new wealth of data to build from: one can turn to a computer to ask questions that previously might have taken months to answer in the laboratory. One of the great challenges for the field is analysing the large amounts of complex data generated and synthesising them into useful systems-wide models of biological processes. Whether operating on a large or small scale, the use of mathematical and computational methods is becoming an integral part of biological research.

There remains a worldwide shortage of skilled computational biologists. An important part of C2D3 Computational Biology is an MPhil course based at the Centre for Mathematical Sciences. The 11-month course introduces students to bioinformatics and other quantitative aspects of modern biology and medicine. It is intended especially for those whose first degree is in mathematics and computer science, and for others wishing to learn about the subject in preparation for a PhD course or a career in industry. Complementing the MPhil course is the Wellcome Trust PhD programme in Mathematical Genomics and Medicine. Run jointly with the Wellcome Trust Sanger Institute, this programme provides opportunities for collaborative research across the Cambridge region at the exciting interfaces between mathematics, genomics and medicine.

History and financial support 

C2D3 Computational Biology came about through the merger of the Cambridge Computational Biology Institute (CCBI) into C2D3 in 2021. The CCBI was established in 2003 to promote computational biology, interpreted broadly, within the University and in the region. It established the MPhil in Computational Biology programme (2004), founded the Wellcome Trust Mathematical Genomics and Medicine 4-year PhD programme (2011), and, among other activities, started a popular annual computational biology symposium. The CCBI was involved in setting up and helping to run the Cambridge Big Data (CBD) Strategic Research Initiative, out of which the C2D3 Interdisciplinary Research Centre was formed. Similarly, the CCBI was part of the group that helped set up the Alan Turing Institute.

The CCBI received financial support equally from the four science schools of the University: 

  • The School of the Biological Sciences      
  • The School of Clinical Medicine      
  • The School of the Physical Sciences (via DAMTP, Physics, Chemistry)      
  • The School of Technology (via Engineering, Computer Science) 

Space was kindly provided by the Department of Applied Mathematics and Theoretical Physics, within the Centre for Mathematical Sciences. 

MPhil in Computational Biology  

The Cambridge-MIT Institute provided funds to establish the MPhil in Computational Biology and subsequently studentships have been provided by: 

  • Biotechnology and Biological Sciences Research Council      
  • Cancer Research UK      
  • Engineering and Physical Sciences Research Council      
  • Medical Research Council      
  • Microsoft Research 

MGM PhD Programme 

The PhD programme in Mathematical Genomics and Medicine is funded by the Wellcome Trust.

Mailing list

To sign up to the mailing list, with the option to join the C2D3 main mailing list, please complete the appropriate form here.


An introduction to counts-of-counts data

Wednesday, 12 October 2022, 3.00pm to 4.00pm
Speaker: Simon Tavaré, PhD, Herbert and Florence Irving Director, Irving Institute for Cancer Dynamics, and Professor, Departments of Statistics and Biological Sciences, Columbia University
Venue: CMS, Meeting Room 15

Counts-of-counts data arise in many areas of biology and medicine, and have been studied by statisticians since the 1940s. One of the first examples, discussed by R. A. Fisher and collaborators in 1943 [1], concerns estimation of the number of unobserved species based on summary counts of the number of species observed once, twice, … in a sample of specimens. The data are summarized by the numbers C1, C2, … of species represented once, twice, … in a sample of size N = C1 + 2 C2 + 3 C3 + … containing S = C1 + C2 + … species; the vector C = (C1, C2, …) gives the counts-of-counts. Other examples include the frequencies of the distinct alleles in a human genetics sample, the counts of distinct variants of the SARS-CoV-2 S protein obtained from consensus sequencing experiments, counts of sizes of components in certain combinatorial structures [2], and counts of the numbers of SNVs arising in one cell, two cells, … in a cancer sequencing experiment.
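As a small illustration of the definitions above, the following Python sketch builds the counts-of-counts from a list of specimen labels and recovers N and S from them. The function name `counts_of_counts` and the toy sample are hypothetical and are not taken from the talk.

```python
from collections import Counter


def counts_of_counts(sample):
    """Return a dict mapping j to C_j, the number of species seen exactly j times."""
    abundances = Counter(sample)               # species label -> number of specimens
    return dict(Counter(abundances.values()))  # j -> C_j


# Toy sample (an assumption for illustration): one label per specimen.
sample = ["a", "b", "a", "c", "d", "d", "d", "e"]
c = counts_of_counts(sample)

# N = C1 + 2 C2 + 3 C3 + ... is the sample size.
n = sum(j * cj for j, cj in c.items())
# S = C1 + C2 + ... is the number of distinct species observed.
s = sum(c.values())

print(c, n, s)
```

Here species b, c and e each appear once, a appears twice and d three times, so C1 = 3, C2 = 1, C3 = 1, giving N = 8 and S = 5.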

In this talk I will outline some of the stochastic models used to model the distribution of C, and some of the inferential issues that come from estimating the parameters of these models. I will touch on the celebrated Ewens Sampling Formula [3] and Fisher’s multiple sampling problem concerning the variance expected between values of S in samples taken from the same population [4]. Variants of birth-death-immigration processes can be used, for example when different variants grow at different rates. The classical Yule process with immigration can be used to derive some of the combinatorial results in a simple way, through a probabilistic trick known as embedding.
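For readers unfamiliar with the Ewens Sampling Formula, the sketch below evaluates it in its standard textbook form, which is not spelled out in the abstract: for a sample of size n and a parameter θ > 0, the probability of C = (C1, C2, …) is n! / (θ(θ+1)…(θ+n−1)) multiplied by the product over j of (θ/j)^Cj / Cj!. The function names and the parameter name `theta` are assumptions made for illustration.

```python
from math import factorial, prod


def rising_factorial(theta, n):
    """theta * (theta + 1) * ... * (theta + n - 1); equals 1 when n == 0."""
    return prod(theta + i for i in range(n))


def esf_probability(c, theta):
    """Ewens Sampling Formula probability of counts-of-counts c (dict j -> C_j)."""
    n = sum(j * cj for j, cj in c.items())  # sample size implied by c
    weight = prod((theta / j) ** cj / factorial(cj) for j, cj in c.items())
    return factorial(n) / rising_factorial(theta, n) * weight


# All possible counts-of-counts vectors for a sample of size 3.
partitions_of_3 = [{1: 3}, {1: 1, 2: 1}, {3: 1}]
total = sum(esf_probability(c, theta=2.0) for c in partitions_of_3)
print(total)
```

Summing over every possible C for a fixed sample size should give a total probability of 1 up to floating-point rounding, which is a quick sanity check on the formula.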


[1] Fisher RA, Corbet AS & Williams CB. J Animal Ecology, 12, 1943
[2] Arratia R, Barbour AD & Tavaré S. Logarithmic Combinatorial Structures, EMS, 2002
[3] Ewens WJ. Theoret Popul Biol, 3, 1972
[4] Da Silva P, Jamshidpey A, McCullagh P & Tavaré S. Bernoulli, in press, 2022

A semantics knowledge commons for climate change

Wednesday, 19 October 2022, 3.00pm to 4.00pm
Speaker: Peter Murray-Rust, Reader Emeritus in Molecular Informatics, Yusuf Hamied Department of Chemistry
Venue: CMS, Meeting Room 15

Abstract not available

Title to be confirmed

Monday, 24 October 2022, 3.00pm to 4.00pm
Speaker: Francesca Buffa
Venue: CRUK CI Lecture Theatre

Abstract not available

Buffering genetic variation in populations

Wednesday, 26 October 2022, 3.00pm to 4.00pm
Speaker: Ritwick Sawarkar, MRC Toxicology Unit
Venue: CMS, Meeting Room 15

Natural populations harbour enormous genetic variation: differences in the coding and non-coding parts of the genome that give rise to differential susceptibility to infections and diseases, and differential responsiveness to treatment. Think of the side effects of COVID-19 vaccines: some people have strong effects after vaccination, while others have none. It is likely that genetic differences among individuals in the population underlie this variation in response to vaccines. Cells have evolved multiple mechanisms by which the effects of genetic variation are minimised, or ‘buffered’. Our work focusses on genetic variation in the non-coding parts of the genome and a cellular strategy dependent on a molecular chaperone. The talk will specifically focus on variation in repetitive parts of the genome and on single-nucleotide polymorphisms, combining computational tools for data analysis with hypothesis testing using biochemical approaches.

Principles of Protein Structural Ensembles Determination

Wednesday, 2 November 2022, 2.00pm to 3.00pm
Speaker: Michele Vendruscolo, Co-Director, Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry
Venue: CMS, Meeting Room 15

Achieving a comprehensive understanding of the behaviour of proteins is greatly facilitated by the knowledge of their structures, thermodynamics and dynamics. This information can be provided in an effective manner in terms of structural ensembles. A structural ensemble can be obtained by determining the structures, populations and interconversion rates for all the main states that a protein can occupy. I will describe how the well-established principles of protein structure determination should be extended to the case of protein structural ensembles determination. These principles concern primarily how to deal with conformationally heterogeneous states, and with experimental measurements that are averaged over such states and affected by a variety of errors. I will address some conceptual problems in the determination of structural ensembles and define future goals towards the establishment of objective criteria for the comparison, validation, visualization, and dissemination of such ensembles.

About us

The Cambridge Centre for Data-Driven Discovery (C2D3) brings together researchers and expertise from across the academic departments and industry to drive research into the analysis, understanding and use of data science and AI. C2D3 is an Interdisciplinary Research Centre at the University of Cambridge.

  • Supports and connects the growing data science and AI research community 
  • Builds research capacity in data science and AI to tackle complex issues 
  • Drives new research challenges through collaborative research projects 
  • Promotes and provides opportunities for knowledge transfer 
  • Identifies and provides training courses for students, academics, industry and the third sector 
  • Acts as a gateway for external organisations 
