The Genetic Analysis Project

Data Download - Overview

Introduction

Downloading hundreds of TBs of data over the public internet is not easy, especially from within China. This section covers the whole process, from obtaining the links to handling errors, in five parts: this overview, obtaining and preprocessing the links, software and structure, testing and firewall circumvention, and error processing.

Obtain & Preprocess the Links

SRA (the Sequence Read Archive) is a public database, and each accession can contain gene sequences of different kinds, collected for different purposes with different technologies. I need to get the links for all required accessions, and then remove the entries we don't want so that only the useful ones remain.
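As a rough illustration of this filtering step, here is a minimal Python sketch. It assumes the accession metadata is available as a CSV table with the columns Run, LibraryStrategy, Platform and download_path; those column names, and the function name select_links, are assumptions made for this example and may not match the files actually used in the project.

```python
import csv
import io


def select_links(runinfo_csv, strategies, platforms):
    """Return (run, link) pairs for rows that match the wanted criteria.

    runinfo_csv is the text of a CSV table. The column names used here
    (Run, LibraryStrategy, Platform, download_path) are assumptions.
    """
    selected = []
    seen = set()
    for row in csv.DictReader(io.StringIO(runinfo_csv)):
        run = (row.get("Run") or "").strip()
        link = (row.get("download_path") or "").strip()
        # Skip rows without a usable link, and runs we have already kept.
        if not run or not link or run in seen:
            continue
        if row.get("LibraryStrategy") in strategies and row.get("Platform") in platforms:
            seen.add(run)
            selected.append((run, link))
    return selected
```

Calling select_links(text, {"WGS"}, {"ILLUMINA"}) would keep only the rows with that strategy and platform. The criteria that matter for the project are described in the Obtain & Preprocess the Links section, not here.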

Software & Structure

Because of the firewall policies of the remote servers, hundreds of TBs of data cannot be downloaded through a single server or, more precisely, a single IP address. I have to design a download structure that spreads the work across several server nodes and sustains the desired speed continuously.
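One simple way to spread work across nodes is to hand each file to the node that currently has the least data assigned. The Python sketch below shows that idea only; the function name assign_to_nodes, the node names and the assumption that file sizes are known in advance are all hypothetical, and the structure actually used is described in the Software & Structure section.

```python
import heapq


def assign_to_nodes(files, nodes):
    """Greedily balance files across nodes by total size.

    files: list of (name, size_in_bytes) pairs.
    nodes: list of node names (hypothetical labels).
    Returns a dict mapping each node to the list of file names it should fetch.
    """
    # Min-heap of (bytes_assigned_so_far, node_name).
    heap = [(0, node) for node in nodes]
    heapq.heapify(heap)
    plan = {node: [] for node in nodes}
    # Place the largest files first so the final loads end up more even.
    for name, size in sorted(files, key=lambda item: item[1], reverse=True):
        load, node = heapq.heappop(heap)
        plan[node].append(name)
        heapq.heappush(heap, (load + size, node))
    return plan
```

Each node then works through its own list from its own IP address, so no single address carries the whole transfer.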

Testing & Firewall Circumvention

Because of extensive network censorship in China, data transmitted across the border can be interfered with by the Great Firewall. The files we download are legal scientific data, but we still need a way to rule out the firewall's effect on the transfer.

Error Processing

Errors cannot be avoided during large-scale transmission, so I need solutions for the different kinds of errors. Unlike the previous chapter (pipeline - dealing with errors), this section discusses download-specific errors in depth.
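A common pattern for transfer errors is to retry with a growing delay and to verify each file against a known checksum. The Python sketch below illustrates that pattern under stated assumptions: an MD5 value is available for each file, and fetch is a caller-supplied function that writes the file to a given path. Neither assumption comes from this project, and the names md5_of and download_with_retry are made up for the example.

```python
import hashlib
import time


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_with_retry(fetch, dest, expected_md5,
                        max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(dest) until the file at dest matches expected_md5.

    fetch is a hypothetical callable that downloads one file to dest and
    raises OSError on network failure. Returns the number of attempts used.
    """
    error = "no attempt made"
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(dest)
            if md5_of(dest) == expected_md5:
                return attempt
            error = "checksum mismatch"
        except OSError as exc:
            error = str(exc)
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"giving up on {dest} after {max_attempts} attempts: {error}")
```

Treating a checksum mismatch the same way as a network failure means a truncated or altered file is fetched again rather than passed on to the analysis pipeline.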
