Machine learning guided identification of 2-oxoglutarate dependent halogenases

This work is a project on enzyme annotation, focused on Fe- and α-ketoglutarate-dependent dioxygenases (aKGs) and halogenases. These enzymes are very similar: halogenases are a subclass of aKGs, differing only in one amino acid in the binding site. Proteins are stored in huge public databases holding more than 250 million entries. These databases use multiple algorithms based on manually entered information and properties. As the number of proteins grows ever faster, automatic annotation algorithms are needed. Enzymes are characterised by their amino acid sequence. This sequence encodes all relevant information about the protein, but extracting that information is not trivial. Nevertheless, automatic algorithms must work from the sequence alone, as it is the only thing known about a protein taken in isolation. Because a protein sequence is a string of letters, very powerful transformer models have recently been developed to convert these sequences into embeddings, imitating NLP embedding models. This project uses ProtBert and ProtT5, which are based on BERT and T5. The embeddings can be used for multiple downstream tasks. The one of interest here is classification: determining whether a protein is an aKG or not, or a halogenase or not. Testing these algorithms requires a high-quality dataset. The dataset is the central part of the project, as it needs to be sufficiently big for ML tasks.

Supervisor: Palarea‐Albaladejo, Javier
Sánchez, Mateo
Other contributions: Universitat de Girona. Escola Politècnica Superior
Author: Garçon, Albert
Date: June 2024
Format: application/pdf
Citation: 26592
Document access: http://hdl.handle.net/10256/27574
Language: eng
Rights: Attribution-NonCommercial-NoDerivatives 4.0 International
Rights URI: http://creativecommons.org/licenses/by-nc-nd/4.0/
Subject: Enzymes
Proteins -- Structure
Machine learning
Bioinformatics
Title: Machine learning guided identification of 2-oxoglutarate dependent halogenases
Type: info:eu-repo/semantics/masterThesis
Repository: DUGiDocs
