Genetic Analysis Project 2020

Project Concluded:   August 15, 2022

Analysis Finished:   Expected 2024

DB Reference:   genetic_2020

[Image: part of the project scripts]

Introduction

The genetic analysis project brings together people from different backgrounds: physicians, pharmacists, researchers, and computer scientists. The project aims to do gene-level research (which is beyond my level of knowledge). As a computer "scientist", I am responsible for analysis script development, architecture design, hardware purchasing, system deployment, network configuration, and so on. In short, I am in charge of everything technical that is not medical.

This is one of the largest projects I have ever undertaken. It has ten components: overview, preliminary, installation, pipeline, data download, data process, data analysis, storage, script development, and others.

[Image: sample image of DNA]

Components

  • Overview:

    1. The purpose
  • Preliminary:

    1. Feasibility
    2. Performance Estimation
    3. Resource Estimation
    4. Structure
    5. Hardware and the future
  • Installation:

    1. Hardware Preparation
    2. Hardware Installation
    3. Software & OS Configuration
    4. The Datacenter
  • Pipeline:

    1. Overview & Software
    2. Dealing with Errors
    3. File Structure
  • Data Download:

    1. Overview
    2. Obtain and Preprocess the Links
    3. Software & Structure
    4. Testing & Firewall Circumvention
    5. Error Processing
  • Data Process:

    1. Metadata Storage & Database
    2. Assigning Tasks to Queue
  • Data Analysis:

    1. Overview
    2. Software & Structure
    3. Optimization & Performance Improvement
    4. Dealing with Errors
    5. Result Collection
    6. Automation
  • Storage:

    1. System Overview
    2. Logical Volume
    3. Identifier and Database
  • Script Development:

    1. Design Principle
    2. Major Scripts
    3. Recovery from Missing Files
    4. Human Interaction Improvement and Auxiliary Scripts
  • Others:

    1. Management and Security
    2. The Next Steps

The Purpose

For security reasons, information about the affiliated institution(s) and people is not discussed here; it is not related to the technical details of this project.

Beginning in 2020, I took part in this project as a part-time hobby. I was excited to get involved, not only because I would work with a great team of doctors and researchers, but also because it was a chance to get hands-on experience with leading technologies and a new generation of equipment. In this project, I was responsible for everything technical, including but not limited to software and script development, hardware, network, database, and virtualization.

The genetic analysis project is a medical bioinformatics project. Its purpose is to analyze the gene sequences of Homo sapiens, Mus musculus, and Rattus rattus, compare the gene expression with that of the control group, and try to find potential treatment(s) that "reverse" the abnormal expression pattern. To begin, we have to obtain a large amount of gene sequence data, analyze it with our own pipeline, collect the results, and store the source data when needed. The size of the data (roughly 2000 TB or more) places high demands on the performance and efficiency of the entire computing and storage system.

The entire process involves open-source software, self-developed Python scripts, a cluster of bare-metal high-performance servers, a high-speed 40G fiber network, an Active Directory domain for management, two hypervisors hosting VMs for different needs, a hybrid storage system (tape and storage servers), and a set of remote servers for various purposes. Although I'm interested in the biology (and plan to learn a bit), its concepts are currently beyond my knowledge level, so I will focus on the computer technologies for now.

It might seem unbelievable at first, but I did finish all of this on my own over two years of part-time work, with help from others on some of the physical labor. Everything involved might be basic within its own field, but back then it was all totally new to me, and I had to learn and apply it from the very beginning on my own (without causing trouble to the existing environment). These experiences, the hands-on practice, and the process of solving problems are precious to me, and they all contribute to my broad interest in computer technologies.

I'm especially grateful to all those people who offered their unwavering trust and generous support to me during this entire endeavor. Without them, it would not be possible for me to have the chance to achieve all these milestones.

For detailed information about this project, please refer to the navigation menu at the bottom of each page, which includes other project components and their corresponding subsections.
