Before taking any action, I need to consider the feasibility of the project. This covers the entire pipeline, including data availability and analysis software. At this stage I do not need to go deep into each step, but I do need a rough idea of what is required and whether it is available and likely to remain reliable for long enough.
This section contains three parts: data availability, downloading software, and analysis software.
Since I am not an expert in bioinformatics, I rely on the specialists to point me to the initial data source. I then use my own knowledge to download the data efficiently.
First, the researchers search for the data they want on the NCBI SRA search website, obtain the accession files, and use the links within those files to download the data. However, since December 1, 2019, the FTP site (ftp://ftp-trace.ncbi.nlm.nih.gov/sra/) that hosted all the SRA files has been decommissioned, and a tool called sra-toolkit has replaced it. With FTP, many download utilities such as aria2c can be used; the sra-toolkit is less flexible and is not practical for large-scale downloading. Rather than spend too much time on the sra-toolkit, I decided to look for alternatives.
After searching online, I soon realized that the same data are usually also available through ENA-EBI. Unlike SRA, EBI offers two ways to download: IBM Aspera Connect and FTP. It is therefore much better to identify the data through SRA and then get the actual files from EBI.
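The lookup from an SRA accession to EBI download links can be sketched as follows. This is a minimal sketch, not the script actually used in the project: it assumes the ENA Portal API "filereport" endpoint and its `fastq_ftp` field, and the function names are my own. The endpoint and field names should be checked against the current ENA documentation before relying on them.

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: ENA Portal API "filereport" endpoint (verify in the ENA docs).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def build_filereport_url(accession):
    """Build the query URL that asks ENA for the FASTQ links of an accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Return {run_accession: [ftp_url, ...]} from a filereport TSV body."""
    links = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Paired-end runs list several files separated by ';'.
        paths = [p for p in (row.get("fastq_ftp") or "").split(";") if p]
        # Assumed here: paths may come without a scheme, so add "ftp://" if missing.
        links[row["run_accession"]] = [
            p if p.startswith("ftp://") else "ftp://" + p for p in paths
        ]
    return links


def fetch_links(accession):
    """Query ENA over HTTPS and return the parsed links (needs network access)."""
    with urlopen(build_filereport_url(accession), timeout=30) as response:
        return parse_filereport(response.read().decode("utf-8"))
```

Keeping the parsing separate from the network call means the parser can be checked against a saved response without contacting EBI.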
IBM Aspera Connect
According to the official guide, Aspera Connect is available on all major platforms, including Linux and Windows. The download links we obtained start with "ftp://", but they can easily be converted automatically to an Aspera format starting with "ascp://" by a Python script.
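The conversion itself is a small string rewrite. The exact target format is not spelled out here, so the sketch below makes its own assumption: instead of an "ascp://" link, it produces the `user@host:path` source form that the `ascp` command-line client accepts. It also assumes EBI's `ftp.sra.ebi.ac.uk` FTP host and the `era-fasp@fasp.sra.ebi.ac.uk` Aspera account. The function name is hypothetical.

```python
from urllib.parse import urlparse

# Assumptions: ENA's FTP host and the matching Aspera account and host.
FTP_HOST = "ftp.sra.ebi.ac.uk"
ASPERA_PREFIX = "era-fasp@fasp.sra.ebi.ac.uk:"


def ftp_to_aspera(ftp_url):
    """Rewrite an ENA FTP link into an ascp-style 'user@host:path' source."""
    parsed = urlparse(ftp_url)
    if parsed.scheme != "ftp" or parsed.hostname != FTP_HOST:
        raise ValueError(f"not an ENA FTP link: {ftp_url}")
    # Drop the leading slash so the path is relative to the Aspera root.
    return ASPERA_PREFIX + parsed.path.lstrip("/")
```

Rejecting links from other hosts keeps a mistyped or non-EBI URL from being silently turned into an Aspera source that cannot work.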
FTP is a commonly used transfer protocol, and nearly every download utility supports it. Based on past experience, I chose to try aria2c first, a command-line download utility.
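As an illustration of how aria2c could be driven from a script, here is a minimal sketch. The function names, the default connection counts, and the idea of passing a text file of URLs are my own assumptions. The command-line options are standard aria2c options, but the values would need tuning for the actual server and network.

```python
import shutil
import subprocess


def build_aria2c_command(url_file, out_dir, connections=4, parallel=2):
    """Build an aria2c command that downloads every URL listed in url_file."""
    return [
        "aria2c",
        f"--input-file={url_file}",                    # one URL per line
        f"--dir={out_dir}",                            # where files are saved
        f"--max-connection-per-server={connections}",  # connections per host
        f"--split={connections}",                      # pieces per file
        f"--max-concurrent-downloads={parallel}",      # files in flight at once
        "--continue=true",                             # resume partial files
    ]


def download(url_file, out_dir):
    """Run aria2c; raises if the tool is missing or exits with an error."""
    if shutil.which("aria2c") is None:
        raise RuntimeError("aria2c is not installed or not on PATH")
    subprocess.run(build_aria2c_command(url_file, out_dir), check=True)
```

Resuming with `--continue=true` matters for large-scale downloading, since an interrupted transfer of a multi-gigabyte file does not have to start over.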
According to feedback from the researchers, they found an open-source platform called "Galaxy" for genetic analysis. It is an all-in-one solution for building a complete workflow. Although Galaxy is designed around a GUI, it might be possible to run it from the command line. For now, we only need to know whether we can deploy the Galaxy platform on our own machine, and the answer is yes.
As an alternative, the tools used in Galaxy are open source, which means I can write a dedicated analysis pipeline in Python without the whole platform. In the end I took this route because of the performance issues and the manual labor the Galaxy platform requires, which will be discussed later.
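To give a rough idea of what such a dedicated pipeline looks like, the sketch below chains command-line steps with Python's `subprocess` module and stops at the first failure. It is only an illustration of the approach: the `Step` class, the `run_pipeline` function, and the placeholder commands are hypothetical, and the real analysis tools are not named here.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One pipeline stage: a label plus the command line to execute."""
    name: str
    command: List[str]


def run_pipeline(steps):
    """Run steps in order; stop and raise as soon as one of them fails."""
    completed = []
    for step in steps:
        result = subprocess.run(step.command, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(
                f"step '{step.name}' failed: {result.stderr.strip()}"
            )
        completed.append(step.name)
    return completed


if __name__ == "__main__":
    # Placeholder commands; real steps would call the analysis tools instead.
    demo = [
        Step("first", [sys.executable, "-c", "print('step one')"]),
        Step("second", [sys.executable, "-c", "print('step two')"]),
    ]
    print(run_pipeline(demo))
```

Because every step is an ordinary command line, the same runner can be scripted and repeated over many samples without anyone clicking through a GUI.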