The largest atlas of genetic diversity: 11 million letters make each person unique

“Today we unveil the language with which God created life,” the President of the United States said 25 years ago from the White House. Democrat Bill Clinton was announcing the completion of the first draft of the human genome, a historic milestone that would “revolutionize” the diagnosis and treatment of virtually all known diseases. A quarter of a century later, scientists are still trying to understand that language, which is far more complex than previously thought.
This Wednesday, the largest map of human variability at the genomic level will be presented. The study reveals a new layer of diversity in the genetic code that makes each person three times more different from another than previously thought.
The breakthrough is based on the study of the entire genomes of 1,019 people from 26 populations on five continents, a record of breadth and diversity, given that the first genomes were assembled only from the DNA of white Westerners. A second study has read the entire genomes of 65 people at an unprecedented level of depth. Both studies are published today in Nature, a benchmark for the world's best science.
These studies provide "an unprecedented reference to the genetic variation that each person inherits from their parents," which determines our physical appearance, intellect, and personality, summarizes computational biologist Bernardo Rodríguez-Martín, co-author of one of the studies. The data represent "a giant step" in the diagnosis and understanding of rare diseases, but also of cancer and other much more common ailments, emphasizes the researcher from the Center for Genomic Regulation (CRG) in Barcelona.
Until a few years ago, technology only allowed for the reading of short fragments of the genome, about 50 chemical letters of DNA (the entire genome is composed of 3 billion). These new studies apply long-read sequencing technologies to analyze "structural variants," each with tens of thousands of letters. The work has identified 167,000 structural variants, half of which were completely unknown until now. Three out of five of these are rare, meaning they occur in small groups of people, yet they may be key to those people's health.
One person differs from another by approximately 25,000 structural variants, equivalent to approximately 7.5 million chemical letters of DNA. This triples the variation known until now, which included only short changes (one A swapped for one C, for example). In total, there are at least 11 million chemical letters that make each person genetically unique.
A single-letter change in the complete genome sequence is a common cause of some of the thousands of known rare diseases, which affect millions of people worldwide. Rodríguez-Martín believes that studying structural variations between individuals can reveal hidden causes of other rare diseases. His team has designed a new tool to screen tens of thousands of patient-specific variants and narrow the number of possible causes to around 200, facilitating diagnosis. This type of analysis is already used at the Sant Joan de Déu Hospital in Barcelona.
Our genome is riddled with endlessly repeated sequences that account for up to 60% of the entire genetic code. The culprits are the so-called "jumping genes," which copy themselves over and over again, sometimes thousands of times, like viruses that never leave the genome. The new study reveals that some of these fragments, known as L1, are capable of regulating gene function to their own benefit. "Thousands of copies of these elements have been found in colon, lung, and esophageal tumors," explains Rodríguez-Martín. Some of these jumps can deactivate genes that are essential for suppressing cancer in the body. Until just a few years ago, these repetitive sections were considered junk DNA of little interest.
The second study, led by researchers at the European Molecular Biology Laboratory in Germany, has read the virtually complete genomes of 65 individuals at an unprecedented level of detail. This means that 99% of each participant's genetic code has been read, but also that 92% of all the "gaps" that remained to be read have been covered, highlights Jan Korbel, co-author of the study. "Most human genomes obtained to date had blank spaces, because their genetic sequence could not be read owing to many repetitive sequences, structural complexity, and technological limitations," he adds. One of the milestones of this study has been reading the centromeres, the neck that joins the two arms of each of the 23 human chromosomes, where errors can trigger autoimmune diseases and cancer.
The studies published this Wednesday, signed by around 100 scientists from seven countries, are also key to the human pangenome project promoted by the National Institutes of Health in the United States to summarize all human genetic variability. The project currently consists of the genomes of 200 people from different backgrounds.
Biologist Bárbara Hernando, from the National Cancer Research Center, praises both studies, specifically for their potential applications to understanding cancer. These types of inherited structural variations "contribute to 6% of some childhood solid tumors, so they are of crucial importance," she emphasizes. "Furthermore, we have many cases of hereditary cancer with no known cause. It's possible that many of them are due to these types of variants, which the short-read sequencing technologies still prevalent in clinical practice are unable to detect," she notes. The same technology can be used to explore spontaneous mutations that arise during life, such as genomic aberrations that drive the most aggressive tumors.
The researcher also highlights the inclusion of more diverse populations. “Ethnic origin is key to cancer susceptibility and prognosis. For example, people of African descent tend to suffer from much more aggressive prostate tumors due to inherited variants. Including the analysis of these structural variants in clinical practice can significantly improve cancer treatments, especially in non-European populations who have so far received little attention,” she adds.
Álvaro Rada, from the Institute of Biomedicine and Biotechnology of Cantabria, highlights the overall importance of this new reinterpretation of the human genome. "Until now, we didn't have a complete catalog of structural variations with which to know how diverse the human genome is. The great advance of this work is providing a repertoire of these variations in different populations that will help us understand whether a patient has a pathological variant or not," he emphasizes. Rada details the dramatic impact of this technology. "This new map will be essential to better understand the genetic risk of suffering from any of the many non-infectious diseases, such as cardiovascular diseases, diabetes, obesity, or Alzheimer's."
Rodríguez-Martín was 12 years old when Clinton announced the first draft of the human genome, and today he is much more aware of the growing complexity he faces. "I agree that the genome is the language of life, although as a scientist, I believe there's no evidence to support that it was written by a creator God," he explains. "On the other hand, it's a very anthropocentric assertion to think that by decoding the human genome we will understand the language of life. Each species has its own genome, the result of Darwinian evolution."
EL PAÍS