read_tsv(corrected_file) %>%
  select(family_correction, genus_correction, species_correction) %>%
  gather("level") %>%
  na.omit() %>%
  group_by(level) %>%
  summarise(corrections = n()) %>%
  kable(
    format.args = list(big.mark = ","),
    caption = "Number of corrections per level."
  )
Number of corrections per level.

| level              | corrections |
|:-------------------|------------:|
| family_correction  |       1,038 |
| genus_correction   |       2,500 |
| species_correction |       9,889 |
World Flora Online (WFO)
We used WorldFlora and fuzzyjoin for synonymy and misspelling corrections. Beware: the R package is very memory-intensive, so the code processes species names in batches. You might want to adapt the batch size to the capacity of your computer. I also subset the backbone to the families present in the data to increase speed, but this might not be optimal.
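The batching idea can be sketched as follows. This is a minimal illustration, not the code used for the analysis: `match_in_batches()`, `toy_match()` and the batch size are hypothetical names and values, and a toy matcher stands in for the real call so that the sketch runs without downloading the WFO backbone.

```r
# Split a vector of species names into batches, apply a matching function to
# each batch, and bind the per-batch results into a single data frame.
# `match_fun` is a placeholder for the real matcher (hypothetical helper).
match_in_batches <- function(names, match_fun, batch_size = 1000) {
  names <- unique(names)
  if (length(names) == 0) {
    return(NULL)
  }
  # Batch index of each name: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_along(names) / batch_size)
  results <- lapply(split(names, batch_id), match_fun)
  out <- do.call(rbind, results)
  rownames(out) <- NULL
  out
}

# Toy matcher standing in for the real matching call (assumption).
toy_match <- function(x) {
  data.frame(spec.name = x, n_char = nchar(x), stringsAsFactors = FALSE)
}

species <- c("Dicorynia guianensis", "Eperua falcata", "Vouacapoua americana")
res <- match_in_batches(species, toy_match, batch_size = 2)
print(res)

# With WorldFlora, the matcher could look like the following, assuming
# `backbone` holds the WFO backbone and `families` the families in the data:
# backbone_subset <- backbone[backbone$family %in% families, ]
# wfo_match <- function(x) {
#   WorldFlora::WFO.match(
#     spec.data = data.frame(spec.name = x),
#     WFO.data = backbone_subset
#   )
# }
```

Smaller batches lower peak memory at the cost of more calls, so the batch size is the parameter to tune for a given machine.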
Among matched species, most were exact matches (not fuzzy-matched misspellings) with a single match in WFO. Still, 257257 had multiple matches and 444 were fuzzy-matched misspellings, both of which need investigating:
The 38 species that did not match WFO were manually curated by searching WFO and the web. Some were missed because their family name was lacking in the WFO backbone subset, some did not match species names, some used subspecies names, and others were genuinely misspelled.
By default, WorldFlora limits the fuzzy distance to 4, which seems more than acceptable. The limit could be increased, but the remaining cases can also be dealt with manually.
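To give a feel for what a fuzzy distance of 1 to 4 means, here is a small sketch using base R's `adist()`, which computes a Levenshtein edit distance. It assumes that the fuzzy distance reported by the matching behaves like such an edit distance. The example names and the `max_dist` variable are illustrative only.

```r
# Names as written in an inventory (with typos) and the intended names.
submitted <- c("Eperua falcata", "Eperua falcatta", "Dicorynia guyanensis")
matched   <- c("Eperua falcata", "Eperua falcata", "Dicorynia guianensis")

# Pairwise edit distance: number of single-character insertions, deletions
# or substitutions needed to turn one name into the other.
fuzzy_dist <- mapply(
  function(a, b) utils::adist(a, b)[1, 1],
  submitted, matched,
  USE.NAMES = FALSE
)

max_dist <- 4  # threshold discussed in the text
check <- data.frame(
  submitted, matched, fuzzy_dist,
  within_limit = fuzzy_dist <= max_dist,
  stringsAsFactors = FALSE
)
print(check)
```

Here the first name is an exact match (distance 0) and the other two differ from the intended name by a single letter (distance 1).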
wfo %>%
  filter(Matched, Fuzzy) %>%
  group_by(spec.name) %>%
  summarise(fuzzy_dist = unique(Fuzzy.dist)) %>%
  group_by(fuzzy_dist) %>%
  summarise(N = n()) %>%
  kable(caption = "Number of misspelled species by the number of misspelled letters.") # nolint
Number of misspelled species by the number of misspelled letters.

| fuzzy_dist |   N |
|-----------:|----:|
|          1 | 247 |
|          2 | 181 |
|          3 |  75 |
|          4 |  49 |
However, among the misspelled species with multiple matches, 34 have multiple newly accepted names:
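One way such cases could be flagged is sketched below on toy data. The columns `spec.name`, `Matched` and `Fuzzy` follow the code in this section, whereas `accepted_name` and the placeholder species names are hypothetical stand-ins for whichever column holds the newly accepted name in the matching output.

```r
library(dplyr)

# Toy stand-in for the matching output (not real results).
wfo_toy <- tibble(
  spec.name = c("Sp a", "Sp a", "Sp b", "Sp b", "Sp c"),
  Matched = TRUE,
  Fuzzy = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  accepted_name = c(
    "Genus one", "Genus two", "Genus three", "Genus three", "Genus four"
  )
)

# Keep fuzzy-matched names that resolve to more than one accepted name.
ambiguous <- wfo_toy %>%
  filter(Matched, Fuzzy) %>%
  group_by(spec.name) %>%
  summarise(n_accepted = n_distinct(accepted_name), .groups = "drop") %>%
  filter(n_accepted > 1)

print(ambiguous)
```

In this toy example only "Sp a" is flagged: "Sp b" has two matches that agree on a single accepted name, and "Sp c" is not a fuzzy match.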
Number of synonymy, misspelling, and update corrections per taxonomic level

| level   |   N |
|:--------|----:|
| family  | 266 |
| genus   | 463 |
| species | 819 |
Taxonomic data
All taxonomic data are saved in outputs/taxonomy_vX.tsv with the following columns:
site: the site name
scientific_*: all columns with the scientific name
vernacular_*: all columns with the vernacular name
family_*: all columns with the family name
genus_*: all columns with the genus name
species_*: all columns with the species name
*_raw: all columns with the raw information as read in the inventories, see analyses
*_corrected: all columns with the manual corrections of dubious names, see analyses
*_cleaned: all columns with the synonymy and misspelling corrections from WFO and manual curation, see analyses; this is the final taxonomic classification that should be used in further analyses
*_correction: whether a manual correction of a dubious name occurred, see analyses
cleaning_type: whether the name was cleaned automatically with WFO or by manual curation
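As a usage sketch, the final classification can be extracted by keeping only the `*_cleaned` columns. The example below runs on a toy table. The exact column names (`scientific_cleaned`, `family_raw`, and so on) and the `cleaning_type` values are assumptions derived from the naming patterns listed above, not a copy of the real file.

```r
library(dplyr)

# Toy table mimicking the column layout described above (assumed names).
taxonomy_toy <- tibble(
  site = c("site_1", "site_2"),
  scientific_raw = c("Eperua falcatta", "Dicorynia guianensis"),
  scientific_cleaned = c("Eperua falcata", "Dicorynia guianensis"),
  family_raw = c("Fabaceae", "Fabaceae"),
  family_cleaned = c("Fabaceae", "Fabaceae"),
  cleaning_type = c("WFO", "WFO")  # illustrative values only
)

# Keep the site, the final (cleaned) names and how they were obtained.
final_taxonomy <- taxonomy_toy %>%
  select(site, ends_with("_cleaned"), cleaning_type)

print(final_taxonomy)

# On the real file, the table would be read first, for example with
# readr::read_tsv("outputs/taxonomy_vX.tsv"), with X set to the version in use.
```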