Aggregates many of the steps involved in scoring the California Stream Condition Index (CSCI) into a single function. These steps include data quality flagging, conversion of taxonomic names, iterative subsampling (20 iterations), metric calculations, prediction of expected taxa and metric values, scoring, and aggregation into a final index. Input data include sample-wise raw, unprocessed taxonomy in a flat format and station-wise predictor data in a crosstab format. See the example data (bugs_stations) for reference. A complete description of the index is provided in Mazor et al. (in review). The O/E component of this function is adapted from John Van Sickle's RIVPACS model building scripts.

CSCI(bugs, stations, rand = sample.int(10000, 1), distinct = FALSE)

Arguments

bugs

A data frame with benthic macroinvertebrate (BMI) data (see Details)

stations

A data frame with environmental data, one row per station (see details)

rand

An integer that sets the random number generator (RNG) seed for the subsampling. Defaults to sample.int(10000, 1).

distinct

A logical value. If TRUE, the Distinct column in bugs is overwritten with NA values. The default (FALSE) leaves the column as is.

Value

A list of data frames that serve as reports in varying detail:

core

A summary of the CSCI results, and data quality flags, averaged across 20 iterations.

Suppl1_mmi

A detailed breakdown of the results of the MMI component of the CSCI, averaged across 20 iterations.

Suppl1_grps

Probability of biotic group membership in a SampleID by Group format

Suppl1_OE

A detailed breakdown of the results of the O/E component of the CSCI, averaged across 20 iterations. Capture probabilities and mean abundances of each OTU are provided.

Suppl2_mmi

Similar to Suppl1_mmi, except broken down by iteration

Suppl2_OE

Similar to Suppl1_OE, except broken down by iteration. Iteration-wise O/E scores are also provided.

Details

A valid "bugs" data frame consists of the following columns: StationCode, SampleID, FinalID (i.e., taxa names), LifeStageCode ("A", "L", "P", or "X"), BAResult (i.e., taxa counts), and Distinct (a positive integer where the taxonomist has indicated distinctiveness, else left blank or 0). Values for FinalID and LifeStageCode must conform to values from SWAMP lookup tables (http://swamp.mpsl.mlml.calstate.edu/). See CSCI guidance document for details on these fields.

A valid "stations" data frame consists of the following columns: StationCode (must match with same column in the "bugs" data frame), BDH_AVE, ELEV_RANGE, KFCT_AVE, P_MEAN, LogWSA, New_Lat, New_Long, PPT_00_09, SITE_ELEV, SumAve_P, TEMP_00_09. See CSCI guidance document for details on these fields.

The data frames are also subject to the following constraints: no missing values or blank cells in any field in either data frame (except for the Distinct column); all values under StationCode in the "bugs" data frame must be represented under StationCode in the "stations" data frame; every SampleID must be associated with only a single StationCode; no duplicated data in either data frame (e.g., every combination of SampleID, FinalID, LifeStageCode, and Distinct should be unique in the "bugs" data frame).
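The constraints above can be checked before calling CSCI. The following is a minimal sketch in base R, not part of the package: check_csci_inputs, is_blank, BUG_COLS, and STATION_COLS are hypothetical names introduced here, and the toy values exist only to exercise the checks (they are not real monitoring data). The duplicate check on stations assumes one row per StationCode, as stated under Arguments.

```r
# Column names taken from the Details section.
BUG_COLS <- c("StationCode", "SampleID", "FinalID",
              "LifeStageCode", "BAResult", "Distinct")
STATION_COLS <- c("StationCode", "BDH_AVE", "ELEV_RANGE", "KFCT_AVE",
                  "P_MEAN", "LogWSA", "New_Lat", "New_Long", "PPT_00_09",
                  "SITE_ELEV", "SumAve_P", "TEMP_00_09")

# TRUE where a cell is NA or an empty / whitespace-only string.
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

# Hypothetical helper: returns a character vector with one message per
# violated constraint, or character(0) if no problem is found.
check_csci_inputs <- function(bugs, stations) {
  problems <- character(0)
  for (nm in setdiff(BUG_COLS, names(bugs)))
    problems <- c(problems, paste("bugs is missing column", nm))
  for (nm in setdiff(STATION_COLS, names(stations)))
    problems <- c(problems, paste("stations is missing column", nm))
  if (length(problems) > 0) return(problems)  # later checks need the columns

  # No missing or blank cells, except in Distinct.
  has_blank <- function(df)
    any(vapply(df, function(x) any(is_blank(x)), logical(1)))
  if (has_blank(bugs[setdiff(BUG_COLS, "Distinct")]))
    problems <- c(problems, "bugs has missing or blank cells")
  if (has_blank(stations[STATION_COLS]))
    problems <- c(problems, "stations has missing or blank cells")

  # Every StationCode in bugs must appear in stations.
  unmatched <- setdiff(unique(as.character(bugs$StationCode)),
                       as.character(stations$StationCode))
  if (length(unmatched) > 0)
    problems <- c(problems, paste("StationCode not in stations:",
                                  paste(unmatched, collapse = ", ")))

  # Every SampleID must map to a single StationCode.
  per_sample <- tapply(as.character(bugs$StationCode), bugs$SampleID,
                       function(x) length(unique(x)))
  if (any(per_sample > 1, na.rm = TRUE))
    problems <- c(problems, "a SampleID is linked to several StationCodes")

  # No duplicated records.
  if (anyDuplicated(bugs[c("SampleID", "FinalID",
                           "LifeStageCode", "Distinct")]) > 0)
    problems <- c(problems, "bugs has duplicated records")
  if (anyDuplicated(stations$StationCode) > 0)
    problems <- c(problems, "stations has duplicated StationCode values")
  problems
}

# Toy inputs (placeholder values only).
toy_bugs <- data.frame(
  StationCode = c("Site1", "Site1", "Site2"),
  SampleID = c("Sample1", "Sample1", "Sample2"),
  FinalID = c("Baetis", "Simulium", "Baetis"),
  LifeStageCode = "L", BAResult = c(10, 5, 7), Distinct = NA,
  stringsAsFactors = FALSE)
toy_stations <- data.frame(StationCode = c("Site1", "Site2"),
                           stringsAsFactors = FALSE)
for (nm in setdiff(STATION_COLS, "StationCode")) toy_stations[[nm]] <- 1

check_csci_inputs(toy_bugs, toy_stations)  # no problems reported
```

Returning messages rather than stopping at the first failure lets all problems in a submission be reported at once. This sketch does not verify that FinalID and LifeStageCode conform to the SWAMP lookup tables.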

In order to produce replicable results, the RNG seed can be controlled using the rand argument. Any integer may be entered, which will be passed to set.seed.
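To illustrate why a fixed seed makes results replicable, the sketch below uses only base R. The subsample function is a hypothetical stand-in for a single subsampling step and is not the package's own routine; the taxon labels and counts are placeholders.

```r
# Hypothetical stand-in for one subsampling step: draw n individuals
# without replacement after fixing the RNG seed.
subsample <- function(individuals, n, seed) {
  set.seed(seed)
  sample(individuals, n)
}

# Placeholder pool of 600 individuals from three made-up taxa.
pool <- rep(c("TaxonA", "TaxonB", "TaxonC"), times = c(300, 200, 100))

first  <- subsample(pool, 50, seed = 1234)
second <- subsample(pool, 50, seed = 1234)

identical(first, second)  # TRUE: same seed, same draw
```

With the package itself, the equivalent is to supply the same integer on each run, for example CSCI(bugs, stations, rand = 1234).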

References

R.D. Mazor, A. Rehn, P. R. Ode, M. Engeln, K. Schiff. (2013) Development of a bioassessment tool for streams in heterogeneous regions: Accommodating environmental complexity through site specificity in the California Stream Condition Index. In review.

J. Van Sickle. (2010) R code to make predictions of O/E for a new set of sites based on a Random Forest predictive model (Version 4.2)[R script]

Examples

data(bugs_stations) # A list of two data frames: bugs and stations
results <- CSCI(bugs = bugs_stations[[1]], stations = bugs_stations[[2]])
ls(results) # see all the components of the report
#> [1] "core"        "Suppl1_grps" "Suppl1_mmi"  "Suppl1_OE"   "Suppl2_mmi"
#> [6] "Suppl2_OE"
results$core # see the core report
#>   StationCode   SampleID Count Number_of_MMI_Iterations Number_of_OE_Iterations
#> 1       Site3 BadSample1   100                        1                       1
#> 2       Site3 BadSample2   600                       20                       1
#> 3       Site1    Sample1   556                       20                      20
#> 4       Site2    Sample2   826                       20                      20
#> 5       Site3    Sample3   607                       20                      20
#> 6       Site3    Sample4   513                       20                       1
#>   Pcnt_Ambiguous_Individuals Pcnt_Ambiguous_Taxa         E Mean_O     OoverE
#> 1                  0.0000000            0.000000 10.248486   1.00 0.09757538
#> 2                 83.3333333           50.000000 10.248486   1.00 0.09757538
#> 3                  0.5395683            2.631579  7.544418   8.95 1.18630752
#> 4                  0.9685230            1.666667 12.953853  11.15 0.86074778
#> 5                  9.7199341            6.250000 10.248486  13.15 1.28311631
#> 6                 37.6218324           41.025641 10.248486   9.00 0.87817846
#>   OoverE_Percentile       MMI MMI_Percentile      CSCI CSCI_Percentile
#> 1              0.00 0.1638082           0.00 0.1306918            0.00
#> 2              0.00 0.3488195           0.00 0.2231975            0.00
#> 3              0.84 0.8281616           0.17 1.0072346            0.52
#> 4              0.23 0.8295090           0.17 0.8451284            0.17
#> 5              0.93 1.2047019           0.87 1.2439091            0.94
#> 6              0.26 0.9080194           0.30 0.8930989            0.25