Right now, my research focuses on generative machine learning and how to efficiently detect it in the wild - specifically when it is being used to fabricate consensus and create a cyber-attack vector. You can read more about that over on EPFL news.
More generally, I am interested in what we can learn from seeing biological organisms as computing systems. In the context of rogue generative ML, I am leveraging the Fisher-Gillespie-Orr theory of evolution, as well as theories of co-evolution. Prior to that, I was interested in analysing biological systems as universal approximators and in formalizing the body of biological knowledge as a support for inferring optimal molecular mechanisms.
Before that, I did my PhD in the Rong Li lab, exploring the molecular mechanisms by which aneuploidy confers stress resistance in cancer and pathogens. This led to developing tools to prioritize potential drug targets for treating breast cancer metastases in the Joel Bader and Andrew Ewald labs, as part of the CDT2 consortium.
You can also find me over on these websites:
Why: Deep machine learning (ML) has achieved tremendous results in recent years. Many of them are due to the unreasonable effectiveness of stochastic gradient descent (SGD) on problems that can be reduced to differentiable manifolds, such as word embeddings for text generation. However, a number of problems, such as self-driving or strategy games, require discrete actions that cannot be differentiated. Until now, reinforcement learning has proven quite successful on them, but it is still challenging to scale and does not deal well with some classes of problems.
What: I proposed a new way of trivially distributing the learning in an efficient and Byzantine-resilient manner, applicable when a large number of nodes are able to sample possible solutions efficiently.
Why: Many of the major achievements of deep ML were made possible by the unreasonable effectiveness of SGD. Because of that, a lot of research has gone into understanding when and why SGD converges and finds good results rapidly. There are approaches that make it possible to learn effectively on non-differentiable manifolds, such as reinforcement learning, but it is unclear whether they have properties as nice as those of SGD.
What: I put forward a proof showing an equivalence between a class of evolutionary algorithms (GO-EA) and SGD. While less computationally efficient, the proposed EA is much easier to parallelize and does not require an explicitly differentiable manifold - just one that is smooth enough.
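To give a feel for what gradient-free search on a smooth landscape looks like, here is a minimal sketch of a generic (1+λ) evolutionary algorithm in Python. It is not the GO-EA from the proof, and it says nothing about the equivalence with SGD. The objective `sphere`, the function `evolve` and all parameter values are assumptions picked purely for illustration.

```python
import random


def sphere(x):
    """Smooth toy objective (an assumption) with its minimum at the origin."""
    return sum(v * v for v in x)


def evolve(objective, start, sigma=0.1, offspring=20, generations=200, seed=0):
    """Greedy (1+lambda) evolutionary search that never computes a gradient."""
    rng = random.Random(seed)
    parent = list(start)
    parent_score = objective(parent)
    for _ in range(generations):
        # Sample mutated offspring around the current parent. Each child is
        # evaluated independently, so this step could be spread across workers.
        children = [
            [v + rng.gauss(0.0, sigma) for v in parent]
            for _ in range(offspring)
        ]
        best = min(children, key=objective)
        best_score = objective(best)
        # Elitism: replace the parent only if the best child improves on it.
        if best_score < parent_score:
            parent, parent_score = best, best_score
    return parent, parent_score


solution, score = evolve(sphere, start=[2.0, -1.5, 0.5])
print(f"final objective value: {score:.4f}")
```

The only information the search uses is the objective value of each sampled candidate, which is why the landscape needs to be smooth enough for nearby candidates to have similar scores, but does not need an explicit derivative.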
Why: Detecting generative machine learning models in the wild is a double-edged sword. On one hand, we need it to distinguish real human beings from generative models. On the other hand, a good detector can in some cases enable the attacker training the generative model to improve it further. This is in fact the very principle behind Generative Adversarial Networks (GANs). While the best-performing generative language models are based on the Transformer architecture, a number of them are still based on the previous generation of latent space representation - LSTM (long short-term memory).
What: Along with an undergrad project student, we have shown empirically that the transition to the Transformer is not trivial and that, as of now, re-using efficient detectors to improve the source generative models is not immediately feasible.
Why: Since their introduction by Goodfellow in 2014, GANs have achieved some impressive results in image manipulation. Several alternative architectures aiming to improve on the original have been proposed and published. Unfortunately, recent results suggest that the reported improvements resulted from selecting the best samples and from numerous restarts of the training process.
What: We managed to show that some practices inspired by the theory of evolution and co-evolution make it possible to stabilize the training of GANs and achieve better results at a lower computational cost.
More: Preprint on arXiv
Why: Despite a number of ad hoc suggestions on how network theory could be used to elucidate biological mechanisms, there is still no understanding of why they should work at all, or of which versions of network biology algorithms work well.
What: I have shown that one version of the network biology algorithm is equivalent to formulating the most likely and most parsimonious molecular mechanism underlying a phenotype of interest, given the existing biological knowledge.
More: a GitHub repository providing an implementation of the algorithm in the context of biology
Why: My colleagues in the Rong Li lab observed that misfolded proteins in heat-shocked baker's yeast tend to accumulate on the mitochondria. Why?
What: We eliminated the hypothesis that this was due to electrostatic or thermodynamic reasons and linked it to the RNA-rich stress granules. Upon further investigation, my colleagues discovered that they were formed due to the import of misfolded proteins by mitochondria, where they were degraded. This mechanism also exists in mammalian cells, potentially explaining why mitochondrial defects tend to correlate with neurodegenerative disorders caused by misfolded protein accumulation, such as Alzheimer's disease.
Why: The vast majority of drugs in development and on the market today do not bind a single target, but rather have an affinity for a number of targets. While this is not great news for the "single pathway - single disease - single target - single molecule" paradigm, and in some cases results in side effects, it also means that a lot of drugs can be repurposed for neglected diseases. However, we need to figure out which ones can potentially be repurposed and prioritize them.
What: I built a proof of concept that uses a systems network biology representation of existing biological knowledge, combined with the results of in-silico binding assays, to infer potential off-target activity.
Why: Despite extensive investigations into how biological organisms operate, there has been little research into why they operate this way - or, for that matter, why they emerged at all from the chaos of primordial chemical reactions.
What: We approach biological organisms as universal approximators - a term coined in the effort to better understand deep neural networks. We show that this approach not only makes sense, but also explains a number of oddities in biological organisms and allows us to predict fairly accurately the number of essential genes in yeast, abstracting away their exact biological function and treating them as just "knobs" that can be turned to best adapt to the environment.
More: Preprint on bioRxiv
Why: One of the main problems in treating cancers is the small population of cancer cells that survive treatments that kill most of the tumor and restart tumor growth afterwards - sometimes years down the line. A common explanation is the so-called "cancer stem cells" - cells that reproduce slowly and inhabit well-protected niches.
What: Re-analysing a growth assay of breast cancer cell lines under stress conditions (i.e. in the presence of drugs), we noticed that the pattern was actually different. Some cell lines performed great in some conditions and terribly in others - as if they were covering different regions of a latent space in which the drugs were acting. The few cell lines that seemed to perform well across the board didn't have the drug resistance features that would be expected in stem or stem-like cells. Combined with prior findings in yeast, this allowed me to formalize a model of adaptive evolution related to Fisher's geometric model, leading to a tool that is able to estimate the dimension of the latent space, suggest the combinatorial therapies we would expect to be most efficient, and point towards the generalists - the strains most likely to survive combinatorial therapies.
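For readers unfamiliar with Fisher's geometric model, the following Python sketch shows its textbook form only, not the model or the tool described above. A phenotype sits at some distance from an optimum in an n-dimensional trait space, a mutation is a fixed-size step in a random direction, and a mutation is beneficial if it ends up closer to the optimum. The function name `beneficial_fraction` and all parameter values are assumptions made for the example.

```python
import math
import random


def beneficial_fraction(dimensions, distance=1.0, step=0.5, trials=20000, seed=0):
    """Monte Carlo estimate of the share of random mutations that move a
    phenotype closer to the optimum (placed at the origin)."""
    rng = random.Random(seed)
    # Start on the first axis, at the given distance from the optimum.
    start = [distance] + [0.0] * (dimensions - 1)
    improved = 0
    for _ in range(trials):
        # Normalizing a Gaussian vector gives a uniformly random direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dimensions)]
        norm = math.sqrt(sum(d * d for d in direction))
        mutant = [s + step * d / norm for s, d in zip(start, direction)]
        if math.sqrt(sum(m * m for m in mutant)) < distance:
            improved += 1
    return improved / trials


for n in (2, 10, 50):
    print(n, beneficial_fraction(n))
```

With the step size held fixed, the share of beneficial mutations shrinks as the number of dimensions grows. This dependence on dimension is the kind of signal that makes the dimensionality of the trait space something one can reason about from fitness data.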
Why: Today, life is organized into three main branches - Eukaryotes, Bacteria and Archaea. However, the similarity of the genetic code across the three branches strongly points to a time when they were one - probably a single organism - commonly referred to as the Last Universal Common Ancestor (LUCA). What can we say about the genome of this organism?
What: Building on past research into the minimal likely genome common to all living organisms on Earth, we discovered that LUCA likely did not have two of the amino acids that all living organisms share today. This poses the question: did they emerge as a result of convergent evolution? Or did they emerge shortly before the three main branches split off? My role in this project was to write the software that performed the automated homology analysis.
More: Paper presenting the findings
Why: Despite the numerous tools available for numerical analysis of microscopy images, they tend to rely on the programming competence of end users and on their understanding of the underlying numerical methods. This is an obstacle to their widespread adoption and correct application by the experimental biology community.
What: I have distilled a natural language of queries from the questions my biologist colleagues asked when analysing microscopy images for their projects throughout my PhD. The resulting language is built on top of Python and operates using natural primitives, while providing straitjackets to make sure that no accidental p-hacking can occur. It is modular, and modules can be added by collaborators specializing in numerical methods.
More: a GitHub repository of the source code with examples of several applications.
Why: Aneuploidy is a common feature of cancer cells and - in many cases - correlates with adaptive stress response in other organisms. However, there is a debate about whether aneuploidy is adaptive or detrimental for the cells. In the latter case, it could be exploited to kill aneuploid cancer cells more efficiently and to create cancer drugs that promote aneuploidy. However, that would require understanding the mechanisms by which aneuploidy stresses the cells of the organism.
What: My colleagues were able to demonstrate that aneuploidy-induced stress correlates with the osmotic shock response in baker's yeast, and further that aneuploid cells were in fact in a hyper-osmotic state. However, it was not clear why. My contribution was to use foundational thermodynamics to demonstrate that such an over-pressure is expected for any aneuploidy - as a consequence of transcriptional imbalance and of proteins functioning mainly through complexes - and to provide simulations that quantitatively reproduced the experimental results.
Why: My colleagues observed experimentally that the contraction speed of the actomyosin ring required for budding yeast division scaled with the size of that ring - itself correlated with the size of the mother cell. This was not easy to explain with the then-dominant hypothesis of subunit assembly. What was even less clear was how the ring could contract upon removal of the contractile motors - myosin.
What: My contribution was to propose that actomyosin strands were assembled randomly and that contraction was driven not only by the myosin motors, but also by the depolymerization of antisense actin strands in the presence of cross-linkers able to diffuse uniformly along the lengths of the strands.
More: Paper presenting the findings
Outside research, I sometimes play with data to answer common questions over on Quora, such as the PISA rankings and what they mean for France. I also do scientific outreach, for instance about Ebola (also here, here and here), COVID, origins of COVID, vaccines (also here), HCQ for COVID (also here), chemical thermodynamics and Simpson's paradox, and put questions in historical perspective, such as the surrender of Japan in WWII, the importance of the Maginot line, the reality of human wave attacks, or the concept of the Deep Battle doctrine.
In my free time, I also swim and bike in the summer and ski in winter, or alternatively try to teach my cats tricks (so far they are doing a better job at teaching me to do tricks for them instead).