R/matching.R
auto_match_seqs.Rd
The algorithm used to match sequences across fasta files based on their names is outlined below.
auto_match_seqs(x, method = "lv", xlsx)
A table (data frame or tibble) typically produced by
concatipede_prepare
. It must be of the same format as a
table returned by this function: a first column called "name" followed
by one column per fasta file. Those columns have the name of their
corresponding fasta file, and they contain the names of the sequences in
this file, with one sequence name per cell. The number of rows in the
number of sequences of the fasta file with the most sequences, and the
columns for the other fasta files are filled with NA
for padding.
Method for string distance calculation. See
?stringdist::stringdist-metrics
for details. Default is
"lv"
.
Optional, a path to use to save the output table as an Excel file.
A table (tibble) with the same columns as x
and with sequence
names automatically matched across fasta files. Sequence names which did
not have a best reciprocal match in other fasta files are appended to
the end of the table, so that the output table columns contain all the
unique sequence names present in the corresponding column of the input
table. The first column, "name", contains a suggested name for the row
(not guaranteed to be unique). If a path was provided to the xlsx
argument, an Excel file is saved and the table is returned invisibly.
Let's assume a situation with N fasta files, with each fasta file i having n_i sequence names. The problem of matching the names in the best possible way across the fasta files is similar to that of identifying homologous proteins across species, using e.g. reciprocal blast.
The algorithm steps are:
For each pair of fasta files, identify matching names using a reciprocal match approach: two names match if and only if they are their reciprocal best match.
Those matches across fasta files define a graph.
We identify sub-graphs such that (i) they contain at most one sequence name per fasta file and (ii) all nodes in a given sub-graph are fully connected (i.e., they are all their best reciprocal matches across any pair of fasta files).
xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
#> # A tibble: 405 × 5
#> name al18S_aligned_M… al28S_aligned_M… COI_aligned_Mus… ITS2_aligned_MA…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Macrobio… KT935502_Macrob… KT935501_Macrob… KT951668_Macrob… KT935500_Macrob…
#> 2 Mesobiot… KX129793_Mesobi… KX129794_Mesobi… KX129796_Mesobi… KX129795_Mesobi…
#> 3 Tenuibio… KX810045_Tenuib… KX810049_Tenuib… KX810042_Tenuib… KX810046_Tenuib…
#> 4 Macrobio… KY797265_Macrob… KY797266_Macrob… KY797267_Macrob… KY797268_Macrob…
#> 5 Mesobiot… MF441488_Mesobi… MF441489_Mesobi… MF441491_Mesobi… MF441490_Mesobi…
#> 6 Paramacr… MF568532_Parama… MF568533_Parama… MF568534_Parama… MF568535_Parama…
#> 7 Macrobio… MH063922_Macrob… MH063924_Macrob… MH057764_Macrob… MH063923_Macrob…
#> 8 Macrobio… MH063926_Macrob… MH063935_Macrob… MH057767_Macrob… MH063931_Macrob…
#> 9 Mesobiot… MH197148_Mesobi… MH197265_Mesobi… MH195153_Mesobi… MH197156_Mesobi…
#> 10 Mesobiot… MH197149_Mesobi… MH197266_Mesobi… MH195154_Mesobi… MH197157_Mesobi…
#> # … with 395 more rows
if (FALSE) {
auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")
}