Build a template table with automatically matched sequence names

The algorithm used to match sequences across fasta files based on their names is outlined below.

auto_match_seqs(x, method = "lv", xlsx)

Arguments

x: A table (data frame or tibble) typically produced by concatipede_prepare. It must be of the same format as a table returned by this function: a first column called "name" followed by one column per fasta file. Those columns have the name of their corresponding fasta file, and they contain the names of the sequences in this file, with one sequence name per cell. The number of rows in the number of sequences of the fasta file with the most sequences, and the columns for the other fasta files are filled with NA for padding.
method: Method for string distance calculation. See ?stringdist::stringdist-metrics for details. Default is "lv".
xlsx: Optional, a path to use to save the output table as an Excel file.

Value

A table (tibble) with the same columns as x and with sequence names automatically matched across fasta files. Sequence names which did not have a best reciprocal match in other fasta files are appended to the end of the table, so that the output table columns contain all the unique sequence names present in the corresponding column of the input table. The first column, "name", contains a suggested name for the row (not guaranteed to be unique). If a path was provided to the xlsxargument, an Excel file is saved and the table is returned invisibly.

Details

Let's assume a situation with N fasta files, with each fasta file i having n_i sequence names. The problem of matching the names in the best possible way across the fasta files is similar to that of identifying homologous proteins across species, using e.g. reciprocal blast.

The algorithm steps are:

For each pair of fasta files, identify matching names using a reciprocal match approach: two names match if and only if they are their reciprocal best match.
Those matches across fasta files define a graph.
We identify sub-graphs such that (i) they contain at most one sequence name per fasta file and (ii) all nodes in a given sub-graph are fully connected (i.e., they are all their best reciprocal matches across any pair of fasta files).

Examples

xlsx_file <- concatipede_example("sequences-test-matching.xlsx")
xlsx_template <- readxl::read_xlsx(xlsx_file)
auto_match_seqs(xlsx_template)
#> # A tibble: 405 × 5
#>    name      al18S_aligned_M… al28S_aligned_M… COI_aligned_Mus… ITS2_aligned_MA…
#>    <chr>     <chr>            <chr>            <chr>            <chr>           
#>  1 Macrobio… KT935502_Macrob… KT935501_Macrob… KT951668_Macrob… KT935500_Macrob…
#>  2 Mesobiot… KX129793_Mesobi… KX129794_Mesobi… KX129796_Mesobi… KX129795_Mesobi…
#>  3 Tenuibio… KX810045_Tenuib… KX810049_Tenuib… KX810042_Tenuib… KX810046_Tenuib…
#>  4 Macrobio… KY797265_Macrob… KY797266_Macrob… KY797267_Macrob… KY797268_Macrob…
#>  5 Mesobiot… MF441488_Mesobi… MF441489_Mesobi… MF441491_Mesobi… MF441490_Mesobi…
#>  6 Paramacr… MF568532_Parama… MF568533_Parama… MF568534_Parama… MF568535_Parama…
#>  7 Macrobio… MH063922_Macrob… MH063924_Macrob… MH057764_Macrob… MH063923_Macrob…
#>  8 Macrobio… MH063926_Macrob… MH063935_Macrob… MH057767_Macrob… MH063931_Macrob…
#>  9 Mesobiot… MH197148_Mesobi… MH197265_Mesobi… MH195153_Mesobi… MH197156_Mesobi…
#> 10 Mesobiot… MH197149_Mesobi… MH197266_Mesobi… MH195154_Mesobi… MH197157_Mesobi…
#> # … with 395 more rows
if (FALSE) {
  auto_match_seqs(xlsx_template, xlsx = "my-automatic-output.xlsx")
}