Applied Machine Learning for Biological Data
Location: Online
Organisers
Trainers
- Sabry Razick (University of Oslo)
- Pubudu Saneth Samarakoon (University of Oslo)
- Burcin Buket Ogul (University of Oslo)
- Milan De Cauwer (SINTEF Norway)
- Katarzyna Michalowska (SINTEF Norway)
- Elias Myklebust (Simula Research Laboratory)
BioNT - BIO Network for Training - is an international consortium of academic entities and small and medium-sized enterprises (SMEs). BioNT is dedicated to providing a comprehensive training program and fostering a community for digital skills relevant to the biotechnology industry and biomedical sector. With a curriculum tailored for both beginners and advanced professionals, BioNT aims to equip individuals with the necessary expertise in handling, processing, and visualising biological data, as well as utilising computational biology tools. Leveraging the consortium's strong background in digital literacy training and extensive network of collaborations, BioNT is poised to professionalise life sciences data management, processing, and analysis skills.
This intensive workshop focuses on applying machine learning techniques to biological and genomic data, combining theoretical foundations with hands-on coding experience. Participants will work through real-world scenarios using Python-based tools and frameworks that are critical for modern bioinformatics.
Module 1 (optional) provides a solid foundation in scientific computing with Python. Across two half-day sessions, participants will explore essential data handling techniques using NumPy and Pandas—tools widely adopted for manipulating and analyzing biological data.
Module 2 (mandatory) spans five full days and begins by introducing core concepts in machine learning. On the first day, the participants will be introduced to unsupervised learning, and they will implement clustering algorithms and dimensionality reduction techniques using real-world genomics data. The workshop then dives into supervised learning with a focus on classification and regression, including logistic regression and tree-based methods. Participants will construct and evaluate ML models, perform cross-validation, and tune hyperparameters in hands-on sessions tailored to cancer genomics datasets. Later sessions introduce deep learning concepts and the PyTorch framework. Participants will learn to build and train simple neural networks and explore a deep learning-based bioinformatics tool used in genomic variant calling. The final day introduces accelerated genomics through GPU-powered workflows. Participants will learn about GPU technology and how to use containerized bioinformatics tools. They will also implement high-performance, GPU-accelerated pipelines using Parabricks.
Please note: Registration is only required for participation in Module 2. We can accommodate 40 participants for Module 2 due to logistical constraints, including access to virtual machines and GPUs. Module 1 is optional and does not require registration.
This workshop offers a comprehensive, practical journey through the machine learning landscape in bioinformatics, from data wrangling to deep learning and scalable genomic workflows.
Join this workshop if you are:
- A life scientist, bioinformatician, or data analyst working with biological or genomic data
- Curious about how machine learning can be applied to biological research questions
- Looking to strengthen your Python skills for data handling and analysis
- Interested in implementing classification, regression, or clustering models on real-world datasets
- Exploring the use of deep learning techniques in bioinformatics
- Involved in next-generation sequencing (NGS) workflows and want to optimize them with GPU acceleration
- Committed to building reproducible and scalable analysis pipelines using container technology
- Eager to understand and apply best practices in model evaluation, tuning, and validation
- New to machine learning and seeking a hands-on, structured introduction
Learning Outcomes:
By the end of this workshop, you will be able to:
- Apply data manipulation techniques using NumPy and Pandas.
- Define essential machine learning terminology and differentiate between supervised and unsupervised learning approaches.
- Implement and evaluate regression and classification models on biological datasets through hands-on coding exercises.
- Apply regularization techniques and hyperparameter tuning to optimize model performance while preventing overfitting.
- Analyze biological questions to determine the most appropriate machine learning approach (regression, classification, clustering).
- Interpret and evaluate machine learning models using appropriate metrics and cross-validation techniques to ensure reliability.
- Develop scripts using PyTorch to build and train simple neural networks, and apply deep-learning-based bioinformatics tools to genomics datasets.
- Design end-to-end machine learning workflows for biological applications, from data preprocessing to model deployment.
- Implement containerization using Docker to enhance reproducibility and scalability in bioinformatics workflows.
- Compare CPU-native versus GPU-accelerated approaches for genomic data processing and identify computational bottlenecks.
Optional: Contribute a Use Case
To help us better align the workshop content with participants’ professional backgrounds, you may optionally submit a brief use case from your work or industry. These use cases can provide valuable context and may be used during the workshop for discussion or practical exercises.
Please note that use cases will only be included if they align with the workshop objectives and can be integrated without significant additional effort.
Use cases can be submitted in the pre-workshop survey. Submission is optional and does not guarantee that your example will be included.
Programme
Module 1

| Date | Start | End | Session Title |
|---|---|---|---|
| 27th of May | 09:00 | 12:00 | Module 1 – NumPy: Fundamentals, Indexing, Masking, Vectorized Operations |
| 28th of May | 09:00 | 12:00 | Module 1 – Pandas: Data Structures, Cleaning, Transformation, Integration |

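
As a taste of the NumPy topics listed for Module 1 (indexing, masking, vectorized operations), here is a minimal sketch. It is not workshop material: the small expression matrix, the variable names, and the filtering threshold are invented for illustration.

```python
import numpy as np

# Hypothetical expression matrix: 4 genes (rows) x 3 samples (columns).
expression = np.array([
    [5.0, 7.0, 6.0],
    [0.0, 1.0, 0.0],
    [12.0, 15.0, 9.0],
    [3.0, 2.0, 4.0],
])

# Vectorized operation: log2(x + 1) is applied to every element at once.
log_expression = np.log2(expression + 1.0)

# Boolean mask: keep genes whose mean raw expression exceeds 2 (arbitrary cutoff).
gene_means = expression.mean(axis=1)
keep = gene_means > 2.0
filtered = log_expression[keep]

# Indexing: values of the first sample (column 0) for the retained genes.
first_sample = filtered[:, 0]

print(filtered.shape)
print(first_sample)
```

The mask `keep` is an array of booleans with one entry per gene, so indexing with it selects whole rows without an explicit loop.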
Module 2 - day 1 - 2nd of June

| Start | End | Duration | Title |
|---|---|---|---|
| 09:00 | 09:20 | 20 min | Welcome and Introduction |
| 09:20 | 10:30 | 70 min | Introduction to “Module 2”; ML terminology and ML in Bioinformatics |
| 10:30 | 10:40 | 10 min | Break |
| 10:40 | 12:00 | 80 min | Unsupervised Learning: Clustering (K-Means clustering, Hierarchical clustering, Clustering evaluation metrics) |
| 12:00 | 13:00 | 60 min | Lunch |
| 13:00 | 14:00 | 60 min | Unsupervised Learning: Dimensionality reduction (Principal component analysis - PCA, and t-SNE) |
| 14:00 | 14:10 | 10 min | Break |
| 14:10 | 15:50 | 100 min | Hands-on session demonstrating PCA and clustering in cancer genomics |
| 15:50 | 16:00 | 10 min | Feedback and summary |

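
To give a feel for the K-Means clustering topic on day 1, the following is a minimal pure-Python sketch of the assign-then-update loop. It is an illustration only, not the workshop's code: the function name `kmeans`, the toy 2D points, and the choice of initial centroids are all assumptions, and real analyses would normally use an established library implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on a list of (x, y) tuples, starting from given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Toy data: two visually separate groups of points (invented values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

print(labels)
print(centroids)
```

The result depends on the initial centroids, which is one reason clustering evaluation metrics and repeated initialisations matter in practice.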
Module 2 - day 2 - 3rd of June

| Start | End | Duration | Title |
|---|---|---|---|
| 09:00 | 09:10 | 10 min | Welcome back |
| 09:10 | 10:50 | 100 min | Classification: Logistic regression; Tree-based methods; Metrics for classification evaluation |
| 10:50 | 11:00 | 10 min | Break |
| 11:00 | 12:00 | 60 min | Hands-on session demonstrating logistic regression in cancer genomics |
| 12:00 | 13:00 | 60 min | Lunch |
| 13:00 | 14:30 | 90 min | Regression: Regression mechanics, Loss function, Regularized regression, Metrics for regression evaluation |
| 14:30 | 14:40 | 10 min | Break |
| 14:40 | 15:50 | 70 min | Regression: Regression mechanics, Loss function, Regularized regression, Metrics for regression evaluation |
| 15:50 | 16:00 | 10 min | Feedback and summary |

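
As an illustration of what evaluating a classifier involves, the sketch below computes confusion-matrix counts and three common metrics for a binary classifier. The labels and predictions are invented, and the helper name `confusion_counts` is hypothetical; this is not taken from the workshop material.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly predicted positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn

# Invented example: 1 = tumour sample, 0 = normal sample.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # fraction of predicted positives that are correct
recall = tp / (tp + fn)     # fraction of actual positives that are found

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

Precision and recall are undefined when their denominators are zero; a more careful version would handle that case explicitly.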
Module 2 - day 3 - 4th of June

| Start | End | Duration | Title |
|---|---|---|---|
| 09:00 | 09:10 | 10 min | Welcome back |
| 09:10 | 10:50 | 100 min | Model validation and optimization (Overfitting and underfitting, Standardizing data, Handling missing data) |
| 10:50 | 11:00 | 10 min | Break |
| 11:00 | 12:00 | 60 min | Model validation and optimization (Overfitting and underfitting, Standardizing data, Handling missing data) |
| 12:00 | 13:00 | 60 min | Lunch |
| 13:00 | 14:00 | 60 min | Model validation and optimization (K-fold cross-validation) |
| 14:00 | 14:10 | 10 min | Break |
| 14:10 | 15:50 | 100 min | Hands-on session: ML workflow with biological data |
| 15:50 | 16:00 | 10 min | Feedback and summary |

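
The idea behind K-fold cross-validation can be sketched in a few lines: split the sample indices into k folds, and let each fold serve once as the test set while the rest are used for training. The generator name `k_fold_indices` is hypothetical and the sketch assumes contiguous, unshuffled folds; it is an illustration rather than the workshop's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample appears in exactly one test fold. In practice the indices are usually shuffled first, and for classification the folds are often stratified by class label.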
Module 2 - day 4 - 5th of June

| Start | End | Duration | Title |
|---|---|---|---|
| 09:00 | 09:10 | 10 min | Welcome back |
| 09:10 | 10:50 | 100 min | Introduction to deep learning (Basic concepts of Neural Networks - NN; Simple NN with PyTorch) |
| 10:50 | 11:00 | 10 min | Break |
| 11:00 | 12:00 | 60 min | Introduction to deep learning (Basic concepts of Neural Networks - NN; Simple NN with PyTorch) |
| 12:00 | 13:00 | 60 min | Lunch |
| 13:00 | 14:30 | 90 min | Building simple NN with PyTorch, Deep learning applications in genomics |
| 14:30 | 14:40 | 10 min | Break |
| 14:40 | 15:50 | 70 min | Hands-on session demonstrating deep-learning-based variant calling via DeepVariant |
| 15:50 | 16:00 | 10 min | Feedback and summary |

Module 2 - day 5 - 6th of June

| Start | End | Duration | Title |
|---|---|---|---|
| 09:00 | 09:10 | 10 min | Welcome back |
| 09:10 | 10:50 | 100 min | Introduction to Accelerated Genomics (NGS data analysis, GPU introduction) |
| 10:50 | 11:00 | 10 min | Break |
| 11:00 | 12:00 | 60 min | Introduction to Accelerated Genomics (NGS data analysis, GPU introduction) |
| 12:00 | 13:00 | 60 min | Lunch |
| 13:00 | 14:00 | 60 min | Introduction and implementation of containers |
| 14:00 | 14:10 | 10 min | Break |
| 14:10 | 15:50 | 100 min | Accelerated Genomics workflows with Parabricks |
| 15:50 | 16:00 | 10 min | Feedback and summary |

Recommendations:
- To follow the workshop more efficiently, we recommend having a two-screen setup
- To actively communicate during the workshop, please familiarise yourself with Markdown formatting by reviewing the HedgeDoc features document
Interaction between participants, trainers and helpers
The workshop will be delivered in a Zoom webinar format, with participants’ visibility disabled to preserve their privacy. As a participant, you will be able to see and learn from the trainers, but direct interaction (e.g. chat or voice) will not be possible during the sessions. Instead, a collaborative document, set up in advance by the trainers, will be shared with you before the session. You will be expected to engage and interact anonymously with other participants, as well as with the workshop helpers and trainers, directly in this document.
Training Hubs
All BioNT workshops are offered at no cost, but there are a limited number of seats available. To make workshops more accessible for members of the same company we highly recommend organising what we refer to as "Training Hubs." In this arrangement, one person is formally registered for the workshop, but the knowledge sharing can be expanded to numerous colleagues within their company or SME through live-streaming the session.
How to register for Module 2
The workshop is free of charge. Registration is required only for Module 2. Module 1 is open and optional, and no registration is needed to join its sessions.
To participate, please follow these steps:
1. Click “Participate” at the top of this page.
2. You will be redirected to the members.cecam.org page. If you already have an account on our platform, please proceed to step 5.
3. In the top-right corner, click "Register" and complete the form provided. As indicated, completing this form does not register you for the workshop. Within 72 hours you will receive an email confirming that your account has been activated. Because of this processing time, we advise you to create your account a few days before the registration deadline.
4. After receiving the account activation confirmation, visit the workshop page again and follow the instructions starting from step 1.
5. You should now have an active account. After logging in, you should be redirected to the workshop registration page.
6. To start your registration, follow the instructions in the linked pre-workshop survey until you receive your unique identifier.
7. To finalise your registration, enter the unique identifier in the corresponding section of the CECAM platform and press “Send mail”.
8. Your application is now submitted for evaluation. If selected, you will be contacted later to confirm your attendance and to receive instructions for installing the required software and participating in the online workshop.
References
Silvia Di Giorgio (ZB MED – Information Centre for Life Sciences) - Organiser
Norway
Sabry Razick (University of Oslo) - Organiser
Pubudu Samarakoon (University of Oslo) - Organiser