Defining a new Cosmic Playground: Navigating the Protein Space

Have you ever considered exploring the protein space? Probably not. I am trained in molecular biology and have a nerdy interest in areas of theoretical biology that most people would rather not know about. But bear with me, because it is stunning. Let me take you by the hand and explore the idea of constellations of proteins circling the genomes of all the species that have ever lived.
According to astronomers, there are approximately 400 billion stars in our Milky Way galaxy alone. And beyond, in the observable universe, there are about 2 trillion galaxies, each with its own stellar ensemble.
So, if we multiply the number of stars in a typical galaxy (around 100 billion) by the estimated number of galaxies, we arrive at an astounding cosmic total: approximately 200 billion trillion stars (or 200 sextillion).
So what do I mean by protein space? Keep in mind that all life on Earth uses the same building blocks to make proteins: amino acids, which are encoded in DNA by triplets of nucleotides. Proteins are the tools of life and are used for almost every task: building a body, repairing tissues, multiplying cells and, in general, staying alive. The cool thing is that the language of proteins is relatively simple. Proteins are strings of amino acids, built from a set of 20 different amino acids. A small protein can, for example, consist of a string of a hundred amino acids. Like beads on a string, there is a dazzling number of possible combinations, while only a few are present in the human genome.
An exact estimate of the size of the entire protein sequence space is impossible, because proteins come in so many sizes and with so many modifications that complicate the calculation. For the sake of simplicity, let’s do the calculation for a protein of 400 amino acids in which any of the 20 normally occurring amino acids can be found at each position. The number of combinations is then 20^400, or approximately 2.6 × 10^520. Human genome research revealed that the human genome encodes roughly 30,000 proteins (3 × 10^4). Evolution left us humans with 30,000 proteins, while there is a universe of about 2.6 × 10^520 possibilities, a number with 521 digits. This is what I call the proteome space.
It is a vast domain with plenty of opportunities to roam and get lost, so efficient and targeted exploration is crucial for useful and meaningful applied biology. So how do we explore this space? First of all, we have some constraints, so we start by looking close by: the human genome already sets the boundaries of the human protein space. Using the human genome, we have already identified a small but very relevant part of it.
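For readers who want to check the back-of-the-envelope numbers, here is a minimal Python sketch. It assumes the same simplifications as the text (a fixed length of 400, 20 amino acids, no modifications) and reuses the rough figures of 30,000 human proteins and 200 sextillion stars. The function and constant names are my own, chosen only for illustration.

```python
import math

AMINO_ACIDS = 20           # size of the standard amino acid alphabet
PROTEIN_LENGTH = 400       # example length used in the text
HUMAN_PROTEINS = 30_000    # rough figure used in the text
STARS = 2 * 10**23         # ~200 sextillion stars, the estimate quoted earlier


def sequence_space(length, alphabet=AMINO_ACIDS):
    """Exact number of distinct sequences of the given length."""
    return alphabet ** length


def log10_sequence_space(length, alphabet=AMINO_ACIDS):
    """Base-10 logarithm of the space size; avoids float overflow."""
    return length * math.log10(alphabet)


space = sequence_space(PROTEIN_LENGTH)          # Python integers have no size limit
exponent = log10_sequence_space(PROTEIN_LENGTH)

print(f"20^400 has {len(str(space))} digits")
print(f"log10(20^400) = {exponent:.2f}")
# Differences of logarithms stand in for ratios that are too extreme for floats.
print(f"log10(fraction covered by human proteins) = {math.log10(HUMAN_PROTEINS) - exponent:.2f}")
print(f"log10(possible sequences per star) = {exponent - math.log10(STARS):.2f}")
```

The comparisons are done in logarithms because the ratios themselves are far too small or too large for ordinary floating-point numbers. Even after dividing the sequence space by every star in the observable universe, the exponent barely moves.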

Here’s where it gets exciting. While we’ve identified and verified approximately 30,000 proteins, they are encoded by only about 2% of our genome, and the remaining 98% remains poorly understood. Let’s again use a space metaphor: it’s like gazing at the night sky and knowing there are countless more solar systems (constellations of amino acids) waiting to be discovered. And just as we see more stars on a dark, clear night out in the field, we can see more and more of the details encoded in the human genome as we learn to understand the language of life better. However, we do not have a clue what many of the sequences in the human genome mean in terms of proteins. And if we extrapolate that lack of understanding to all life on Earth, we have countless universes of protein systems ahead of us. First we explore the human system, just as space exploration began by observing the planets (proteins) circling our sun (the human genome) in our own solar system. We start with the proteins in our own bodies, the human proteome. This proteome, written in the language of life, can be explored with large language models, a form of artificial intelligence (AI). One example is the recent exploration of CRISPR-Cas proteins in an article entitled “Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences”. The authors used language models to explore the CRISPR-Cas protein space and find new, useful variants of the known proteins, including an AI-designed editor named OpenCRISPR-1.
Now imagine using this type of bioinformatics to explore the complete protein space. The size of this space is gigantic and unfathomable for humans. The human genome is estimated to contain approximately 3 billion base pairs, which include about 30,000 protein-coding genes. Each protein-coding gene contains a sequence of nucleotides that is translated into a sequence of amino acids, which make up the protein.
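To make the step from nucleotides to amino acids concrete, here is a small Python sketch of translation using the standard genetic code. It is a simplification: it assumes the input is an already spliced coding sequence read from its first base, and it ignores introns, alternative start sites and variant genetic codes. The names `CODON_TABLE` and `translate` are hypothetical, picked for this illustration.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reads triplets from the first base and stops at the first stop codon.
    Trailing bases that do not form a full triplet are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGGTAA"))
```

Each triplet picks one of 20 amino acids (or a stop signal), which is exactly why a protein of length n has 20^n possible sequences.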
We are lucky to live in an unprecedented time of discovery, in which bioinformatics, for example, has made it possible to explore the protein space in new ways.

