Commit 59ff867

edits and readme
1 parent fb0e675 commit 59ff867

4 files changed: +43 -4 lines changed

README.md (+30 -1)

@@ -4,4 +4,33 @@ NCBI now provides [a clustered nr database](https://ncbiinsights.ncbi.nlm.nih.go
 We were interested in using this database to reduce search times and to increase the taxonomic diversity of returned sequences when doing BLAST searches.
 However, as of March 2023, the database is not available for download.
 Therefore, we re-made this database ourselves.
-The [README.sh](./README.sh) file in this repository documents how we performed the clustering and created a taxonomy sheet that annotates the lowest common ancestor for each protein cluster.
+The [Snakefile](./Snakefile) in this repository documents how we performed the clustering and created a taxonomy sheet that annotates the lowest common ancestor for each protein cluster.
+
+## Getting started with this repository
+
+This repository uses snakemake to run the pipeline and conda to manage software environments and installations.
+You can find operating system-specific instructions for installing miniconda [here](https://docs.conda.io/en/latest/miniconda.html).
+We executed the pipeline on AWS EC2 with an Ubuntu image (ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20230208).
+
+```
+curl -JLO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh # download the miniconda installation script
+bash Miniconda3-latest-Linux-x86_64.sh # run the miniconda installation script. Accept the license and follow the defaults.
+source ~/.bashrc # source the .bashrc for miniconda to be available in the environment
+# configure miniconda channel order
+conda config --add channels defaults
+conda config --add channels bioconda
+conda config --add channels conda-forge
+conda config --set channel_priority strict # make channel priority strict so snakemake doesn't yell at you
+conda install mamba # install mamba for faster software installation.
+
+conda env create -n nr -f environment.yml
+conda activate nr
+```
+
+After cloning the repository, you can then run the snakefile with:
+
+```
+snakemake -j 1 --use-conda --rerun-incomplete -k -n
+```
+
+where `-j` specifies the number of threads to run with, `--use-conda` uses conda to manage software environments, `--rerun-incomplete` re-runs incomplete files, `-k` tells the pipeline to continue with independent steps when one step fails, and `-n` signifies to run a dry run first.

Snakefile (+1 -3)

@@ -6,9 +6,6 @@ rule all:
 "nr_cluster_taxid_formatted_final.tsv.gz",
 "nr_cluster_uniq_reps_line_count.txt"
 
-# rules to add:
-# 3. add header sequences to nr_cluster.tsv
-
 #############################################################################
 ## cluster NR with mmseqs2
 #############################################################################
@@ -125,6 +122,7 @@ rule get_lca_taxid_for_each_cluster:
 output: "nr_cluster_taxid_lca.tsv"
 params: datadir="inputs/taxdump/"
 #conda: "envs/taxonkit.yml" # needs to be 0.14.2, which isn't on conda yet
+# in the meantime, the executable can be downloaded from this url: https://github.com/shenwei356/taxonkit/files/11073880/taxonkit_linux_amd64.tar.gz
 shell:'''
 ./taxonkit lca --data-dir {params.datadir} -i 2 -s ";" -o {output} {input.tsv} --buffer-size 1G
 '''
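
Reviewer note on the `get_lca_taxid_for_each_cluster` rule: as we read the flags, `taxonkit lca -i 2 -s ";"` takes the `;`-separated taxids in column 2 of each row and reports their lowest common ancestor. The Python sketch below only illustrates that idea on a tiny made-up tree. It is not taxonkit's implementation, and the parent map, taxids, cluster names, and function names are all hypothetical and do not come from the NCBI taxdump.

```python
from typing import Dict, List

# Hypothetical toy taxonomy: child taxid -> parent taxid (the root points to itself).
# These ids are invented for illustration and are NOT real NCBI taxids.
TOY_PARENT: Dict[int, int] = {
    1: 1,      # root
    10: 1,     # toy group A
    20: 1,     # toy group B
    100: 10,   # toy genus under A
    101: 100,  # toy species
    102: 100,  # toy species
    200: 20,   # toy species under B
}


def lineage(taxid: int, parent: Dict[int, int]) -> List[int]:
    """Return the path from the root down to taxid."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path[::-1]


def lca(taxids: List[int], parent: Dict[int, int]) -> int:
    """Return the deepest taxid shared by the lineages of all given taxids."""
    lineages = [lineage(t, parent) for t in taxids]
    common = lineages[0][0]  # every lineage starts at the root
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break  # lineages diverge here
        common = nodes[0]
    return common


def lca_for_row(row: str, parent: Dict[int, int]) -> str:
    """Append the LCA to a 'cluster<TAB>taxid;taxid;...' row."""
    cluster, taxid_field = row.rstrip("\n").split("\t")
    taxids = [int(t) for t in taxid_field.split(";")]
    return f"{cluster}\t{taxid_field}\t{lca(taxids, parent)}"


if __name__ == "__main__":
    for row in ["clusterA\t101;102", "clusterB\t101;200", "clusterC\t101"]:
        print(lca_for_row(row, TOY_PARENT))
```

A cluster whose members come from sibling taxa resolves to their shared parent, while a cluster spanning unrelated branches collapses to the root, which is why the taxonomy sheet can carry very broad annotations for diverse clusters.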

environment.yml (+6)

@@ -0,0 +1,6 @@
+channels:
+- conda-forge
+- bioconda
+- defaults
+dependencies:
+- snakemake=7.25.0

envs/mmseqs2.yml (+6)

@@ -0,0 +1,6 @@
+channels:
+- conda-forge
+- bioconda
+- defaults
+dependencies:
+- mmseqs2=14.7e284
