Data Download - Overview

Rough pipeline of data processing.

Introduction

Downloading hundreds of TBs of data over the public internet is not easy, especially from within China. This section covers the whole process, from obtaining the links to handling errors, in five parts: this overview, obtaining & preprocessing the links, software & structure, testing & firewall circumvention, and error processing.

Obtain & Preprocess the Links

SRA (Sequence Read Archive) is a public database, and a single accession can contain sequence data of different kinds, generated for different purposes with different sequencing technologies. I need not only to get the links for all required accessions, but also to remove the entries we don't want and keep only the useful ones.
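The filtering step described above can be sketched roughly as follows. This is only an illustration, not the project's actual script: the table contents, the accession IDs, the example.org URLs, and the helper name select_links are all made up, and the column names (Run, LibraryStrategy, Platform, download_path) are assumed to match whatever metadata export is actually used.

```python
import csv
import io

# Hypothetical metadata export. Column names, accessions, and URLs are
# assumptions for illustration only.
SAMPLE_RUN_TABLE = """Run,LibraryStrategy,Platform,download_path
SRR0000001,WGS,ILLUMINA,https://example.org/SRR0000001
SRR0000002,RNA-Seq,ILLUMINA,https://example.org/SRR0000002
SRR0000003,WGS,PACBIO_SMRT,https://example.org/SRR0000003
SRR0000004,WGS,ILLUMINA,
"""


def select_links(table_text, strategy, platform):
    """Return (run, link) pairs for rows matching the wanted strategy and platform."""
    reader = csv.DictReader(io.StringIO(table_text))
    selected = []
    for row in reader:
        if row["LibraryStrategy"] != strategy:
            continue  # wrong kind of experiment
        if row["Platform"] != platform:
            continue  # wrong sequencing technology
        link = row["download_path"].strip()
        if link:  # skip rows that have no usable link
            selected.append((row["Run"], link))
    return selected


links = select_links(SAMPLE_RUN_TABLE, "WGS", "ILLUMINA")
for run, link in links:
    print(run, link)
```

Keeping the filter criteria as function arguments makes it easy to rerun the same step when the list of wanted entries changes.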

Software & Structure

Because of the firewall policies of the remote servers, hundreds of TBs of data cannot be downloaded from a single server or, more specifically, from a single IP address. I have to design a download structure made up of multiple server nodes that can sustain the desired speed over the whole download.
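One simple way to spread the work over several nodes is a round-robin split of the link list, so that each node (and therefore each IP) receives a roughly equal share. The sketch below is a minimal illustration under that assumption; the node names and the function assign_to_nodes are hypothetical, and the real structure may distribute work differently.

```python
def assign_to_nodes(links, nodes):
    """Split links across nodes in round-robin order.

    Returns a dict mapping each node name to its list of links.
    """
    if not nodes:
        raise ValueError("at least one download node is required")
    plan = {node: [] for node in nodes}
    for index, link in enumerate(links):
        node = nodes[index % len(nodes)]  # cycle through the nodes
        plan[node].append(link)
    return plan


# Hypothetical node names and links, for illustration only.
example_nodes = ["node-a", "node-b"]
example_links = ["link-%d" % i for i in range(5)]
example_plan = assign_to_nodes(example_links, example_nodes)
for node, assigned in example_plan.items():
    print(node, assigned)
```

A round-robin split ignores file size, so in practice a size-aware assignment may balance the nodes better.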

Testing & Firewall Circumvention

Because of extensive network censorship in China, data transmitted across the border can be interfered with by the Great Firewall. The files we download are legal scientific data, but we still need a way to rule out the effect of the firewall.

Error Processing

Errors cannot be avoided during large-scale transmission, and I need to find solutions for the different kinds of errors. Unlike the previous chapter (pipeline - dealing with errors), this section discusses download-specific errors in depth.
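As a rough illustration of treating different kinds of errors differently, the sketch below retries errors that are likely to be temporary with exponential backoff and gives up immediately on errors that will not go away. The two exception classes, the fetch callable, and the default retry parameters are assumptions made for this example, not the error categories used later in this section.

```python
import time


class TransientError(Exception):
    """Hypothetical error that may succeed on retry (e.g. a dropped connection)."""


class PermanentError(Exception):
    """Hypothetical error that retrying cannot fix (e.g. a missing file)."""


def download_with_retry(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying transient failures with exponential backoff.

    `fetch` is any callable supplied by the caller; `sleep` is injectable
    so the waiting behaviour can be replaced, for example in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except PermanentError:
            raise  # retrying would not help
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1x, 2x, 4x, ... base_delay
```

Passing the download function in as an argument keeps the retry logic independent of the tool that actually transfers the file.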
