The concatipede package allows to concatenate sequences from different Multiple Sequence Alignments (MSAs) based on a correspondence table that indicates how sequences from different MSAs are linked to each other.
For this package to work, all the MSAs in fasta format should be placed in the same folder (that can be the working directory). Only the fasta files to be concatenated should be present in the directory. If other fasta files are present and you don´t want to include them, see the
concatipede_prepare(exclude = "") option.
All the fasta files of interest should be put in a target directory. Here, in order for this example to work, we first set up a
concatipede_test/ directory, copy the example fasta files shipped with the package in it and set it as working directory. The concatipede package functions automatically load and save files in the working directory, however options to direct them to different directories are present.
# Save the path to the initial directory for later clean-up old_dir <- getwd() # Create a directory to put the fasta files for this example dir.create("concatipede_test") # Set it as the working directory setwd("concatipede_test") # Copy the example fasta files shipped with the package into that directory example_files = list.files(system.file("extdata", package="concatipede"), full.names = TRUE) file.copy(from = example_files, to = getwd())
Now all the data used in this vignette is copied in the working directory.
The first step is to generate a template correspondence table with all the sequence names. You can do this with the function
But first let’s check what fasta files are in our directory:
##  "/__w/concatipede/concatipede/vignettes/COI_Macrobiotidae.fas" ##  "/__w/concatipede/concatipede/vignettes/ITS2_Macrobiotidae.fas" ##  "/__w/concatipede/concatipede/vignettes/LSU_Macrobiotidae.fas" ##  "/__w/concatipede/concatipede/vignettes/SSU_Macrobiotidae.fas"
Those are alignments for 4 markers of tardigrades from the family Macrobiotidae. With the function
concatipede_prepare() we will generate an excel table with the sequence names in the order they are found in each alignment.
concatipede_prepare(out = "seqnames")
Once it is done, an excel file should be saved in your working directory. The template excel file looks like this:
Each row of the correspondence table will be used to build one concatenated sequence, by extracting and concatenating the corresponding sequences from the input files. You can modify the template excel correspondence table to reflect how you want to concatenate the sequences from the different alignments.
It is important that the first cell of the first row (cell A1, “name”) is not modified; the other column names are the filenames of the fasta alignments. In the name column you must set the name of the concatenated sequences.
You can copy and save different versions of the correspondence table in different sheets of the excel file: they can be selected with an appropriate option in the R session later.
After modifying the template to specify the correct matches for concatenation, the first sheet of our example excel file now looks like this:
Note: If the names of the sequences in each input file are sufficiently consistent, you can also try to use the
auto_match_seqs() function to ask the package to perform some automatic name-matching for you (which you can still review and correct before doing the actual concatenation). See the example at the end of this vignette for more details.
Now we input this excel file to the
concatipede(filename = "Macrobiotidae_seqnames.xlsx", out = "Macrobiotidae_4genes", excel.sheet = 1)
This will create a new subfolder with the concatenated alignment in different formats.
Note that also a .txt file containing the start and ending positions of the different markers in the concatenated alignments is saved in the output directory.
Here is a visualization of the concatenated alignment:
We can also choose a subset of the taxa from a correspondence table, stored as a separate sheet in the excel file. For example, we can ask
concatipede() to use the second sheet of the Excel file below when performing the concatenation. This sheet only contains data for the Macrobiotus taxa.
concatipede(filename = "Macrobiotidae_seqnames.xlsx",out="Macrobiotus_4genes",excel.sheet = 2)
Here is another example in which we select only one marker, stored in the “COI” sheet of the Excel file:
concatipede(filename = "Macrobiotidae_seqnames.xlsx",out="Macrobiotidae_COI",excel.sheet = 3)
This function extract the GenBank accession numbers from the sequence names in a correspondence table and return a GenBank accession table. If the function cannot find a GenBank accession number in the sequence name it will return a NA.
It accepts both a path of the excel corresponcende table or a dataframe.
genbank.table = get_genbank_table(filename = "Macrobiotidae_seqnames.xlsx", excel.sheet = 1)
This function renames the sequences in the provided fasta files based on a correspondence table and a vector of markers names.
The new sequence names will be composed by the GenBank accession number (if present), the name indicated in the correspondennce table and the marker name provided. The function creates a new directory where the renamed fasta files are saved, together with an updated correspondence table.
Concatipede also includes a function that tries to automatch the sequence names from different fasta files (
If the sequence names are not “clean” the matching won’t be perfect, so it is advisable to check the automatched excel file.
In the example below it is also illustrated how to use base concatipede functions to customize the template preparation, sequence names matching and concatenation:
find_fasta() %>% concatipede_prepare() %>% write_xl("template.xlsx") %>% auto_match_seqs() %>% write_xl("template_automatched.xlsx") %>% concatipede() %>% write_fasta("merged-seqs.fasta") %>% image(cex=0.3)
This is the end of this vignette! To go back to your initial directory, you can run:
# Clean-up: go back to initial directory setwd(old_dir)