Score samples using the CSCI tool

Aggregates many of the steps involved in scoring the California Stream Condition Index (CSCI) into a single function. These steps include data quality flagging, conversion of taxonomic names, iterative subsampling (20 iterations), metric calculation, prediction of expected taxa and metric values, scoring, and aggregation into a final index. Input data consist of sample-wise raw, unprocessed taxonomy in a flat format and station-wise predictor data in a crosstab format. See the example data (bugs_stations) for reference. A complete description of the index is provided in Mazor et al. (in review). The O/E component of this function is adapted from John Van Sickle's RIVPACS model building scripts.

CSCI(bugs, stations, rand = sample.int(10000, 1), distinct = FALSE)

Arguments

bugs	A data frame with BMI data (see details)
stations	A data frame with environmental data, one row per station (see details)
rand	An integer to control the random number generator (RNG) seed for the subsampling. By default set to `sample.int(10000, 1)`
distinct	A logical value. If `TRUE`, the `Distinct` column in `bugs` is overwritten with `NA` values; the default (`FALSE`) leaves the column as is.

Value

A list of data frames that serve as reports in varying detail:

core

A summary of the CSCI results, and data quality flags, averaged across 20 iterations.

Suppl1_mmi

A detailed breakdown of the results of the MMI component of the CSCI, averaged across 20 iterations.

Suppl1_grps

Probability of biotic group membership in a SampleID by Group format

Suppl1_OE

A detailed breakdown of the results of the O/E component of the CSCI, averaged across 20 iterations. Capture probabilities and mean abundances of each OTU are provided.

Suppl2_mmi

Similar to Suppl1_mmi, except broken down by iteration

Suppl2_OE

Similar to Suppl1_OE, except broken down by iteration. Iteration-wise O/E scores are also provided.

Details

A valid "bugs" data frame consists of the following columns: StationCode, SampleID, FinalID (i.e., taxa names), LifeStageCode ("A", "L", "P", or "X"), BAResult (i.e., taxa counts), and Distinct (a positive integer where the taxonomist has indicated distinctiveness, else left blank or 0). Values for FinalID and LifeStageCode must conform to values from SWAMP lookup tables (http://swamp.mpsl.mlml.calstate.edu/). See CSCI guidance document for details on these fields.

A valid "stations" data frame consists of the following columns: StationCode (must match with same column in the "bugs" data frame), BDH_AVE, ELEV_RANGE, KFCT_AVE, P_MEAN, LogWSA, New_Lat, New_Long, PPT_00_09, SITE_ELEV, SumAve_P, TEMP_00_09. See CSCI guidance document for details on these fields.

The data frames are also subject to the following constraints: no missing or blank cells in any field in either data frame (except for the Distinct column); all values under StationCode in the "bugs" data frame must be represented under StationCode in the "stations" data frame; every SampleID must be associated with only a single StationCode; no duplicated data in either data frame (e.g., every combination of SampleID, FinalID, LifeStageCode, and Distinct should be unique in the "bugs" data frame).
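The constraints above can be checked before calling CSCI. The following is a minimal sketch in base R, not part of the package: the helper name `check_csci_inputs` and the toy data are assumptions made for illustration, and the toy "stations" data frame omits the predictor columns for brevity.

```r
# Hypothetical helper (not provided by the CSCI package): returns a character
# vector describing every constraint that the inputs violate.
check_csci_inputs <- function(bugs, stations) {
  problems <- character(0)

  # TRUE if any cell in the data frame is NA or an empty/blank string
  has_gap <- function(df) {
    any(vapply(df, function(col) {
      any(is.na(col) | trimws(as.character(col)) == "")
    }, logical(1)))
  }

  # No missing or blank cells, except in the Distinct column
  if (has_gap(bugs[, setdiff(names(bugs), "Distinct"), drop = FALSE])) {
    problems <- c(problems, "bugs has missing or blank cells")
  }
  if (has_gap(stations)) {
    problems <- c(problems, "stations has missing or blank cells")
  }

  # Every StationCode in bugs must also appear in stations
  if (length(setdiff(bugs$StationCode, stations$StationCode)) > 0) {
    problems <- c(problems, "bugs contains StationCode values absent from stations")
  }

  # Every SampleID must map to exactly one StationCode
  codes_per_sample <- tapply(as.character(bugs$StationCode),
                             as.character(bugs$SampleID),
                             function(x) length(unique(x)))
  if (any(codes_per_sample > 1)) {
    problems <- c(problems, "a SampleID is associated with more than one StationCode")
  }

  # No duplicated rows: one row per station, unique taxon records per sample
  key <- c("SampleID", "FinalID", "LifeStageCode", "Distinct")
  if (anyDuplicated(bugs[, key]) > 0) {
    problems <- c(problems, "bugs has duplicated SampleID/FinalID/LifeStageCode/Distinct rows")
  }
  if (anyDuplicated(stations$StationCode) > 0) {
    problems <- c(problems, "stations has more than one row for a StationCode")
  }

  problems
}

# Toy data, made up for illustration only
bugs <- data.frame(
  StationCode   = c("Site1", "Site1", "Site2"),
  SampleID      = c("Sample1", "Sample1", "Sample2"),
  FinalID       = c("Baetis", "Simulium", "Baetis"),
  LifeStageCode = c("L", "L", "L"),
  BAResult      = c(12, 30, 7),
  Distinct      = c(NA, NA, NA)
)
stations <- data.frame(StationCode = c("Site1", "Site2"))

check_csci_inputs(bugs, stations)      # no problems reported

bad_bugs <- bugs
bad_bugs$StationCode[3] <- "Site9"     # station not present in `stations`
check_csci_inputs(bad_bugs, stations)  # reports the unmatched StationCode
```

Returning a vector of messages rather than stopping at the first failure lets all problems in a data set be fixed in one pass.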

In order to produce replicable results, the RNG seed can be controlled using the rand argument. Any integer may be entered, which will be passed to set.seed.
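The effect of fixing the seed can be illustrated with a small stand-in for the subsampling step. This sketch does not reproduce the package's internal subsampling: the function `subsample_counts`, the taxa, the counts, and the subsample size of 500 are all assumptions chosen for illustration. Only `set.seed` and `sample` are used.

```r
# Stand-in for a subsampling step (illustrative assumption, not the package's
# code): seed the RNG, then draw `size` individuals without replacement.
subsample_counts <- function(taxa, counts, size, rand) {
  set.seed(rand)
  individuals <- rep(taxa, times = counts)  # one element per counted individual
  table(sample(individuals, size = size))   # taxon counts in the subsample
}

# Made-up sample of 700 individuals across three taxa
taxa   <- c("Baetis", "Simulium", "Chironomidae")
counts <- c(300, 250, 150)

run_a <- subsample_counts(taxa, counts, size = 500, rand = 1234)
run_b <- subsample_counts(taxa, counts, size = 500, rand = 1234)

identical(run_a, run_b)  # the same seed yields the same subsample
```

With the real function, the equivalent is to pass the same integer on each run, for example `CSCI(bugs, stations, rand = 1234)`, instead of relying on the randomly drawn default.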

References

R.D. Mazor, A. Rehn, P. R. Ode, M. Engeln, K. Schiff. (2013) Development of a bioassessment tool for streams in heterogeneous regions: Accommodating environmental complexity through site specificity in the California Stream Condition Index. In review.

J. Van Sickle. (2010) R code to make predictions of O/E for a new set of sites based on a Random Forest predictive model (Version 4.2) [R script].

Examples

data(bugs_stations) #A list of two data frames: bugs and stations
results <- CSCI(bugs = bugs_stations[[1]], stations = bugs_stations[[2]])
ls(results) #see all the components of the report
#> [1] "core"        "Suppl1_grps" "Suppl1_mmi"  "Suppl1_OE"   "Suppl2_mmi" 
#> [6] "Suppl2_OE"  
results$core #see the core report
#>   StationCode   SampleID Count Number_of_MMI_Iterations Number_of_OE_Iterations
#> 1       Site3 BadSample1   100                        1                       1
#> 2       Site3 BadSample2   600                       20                       1
#> 3       Site1    Sample1   556                       20                      20
#> 4       Site2    Sample2   826                       20                      20
#> 5       Site3    Sample3   607                       20                      20
#> 6       Site3    Sample4   513                       20                       1
#>   Pcnt_Ambiguous_Individuals Pcnt_Ambiguous_Taxa         E Mean_O     OoverE
#> 1                  0.0000000            0.000000 10.248486   1.00 0.09757538
#> 2                 83.3333333           50.000000 10.248486   1.00 0.09757538
#> 3                  0.5395683            2.631579  7.544418   8.95 1.18630752
#> 4                  0.9685230            1.666667 12.953853  11.15 0.86074778
#> 5                  9.7199341            6.250000 10.248486  13.15 1.28311631
#> 6                 37.6218324           41.025641 10.248486   9.00 0.87817846
#>   OoverE_Percentile       MMI MMI_Percentile      CSCI CSCI_Percentile
#> 1              0.00 0.1638082           0.00 0.1306918            0.00
#> 2              0.00 0.3488195           0.00 0.2231975            0.00
#> 3              0.84 0.8281616           0.17 1.0072346            0.52
#> 4              0.23 0.8295090           0.17 0.8451284            0.17
#> 5              0.93 1.2047019           0.87 1.2439091            0.94
#> 6              0.26 0.9080194           0.30 0.8930989            0.25
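The columns of the printed core report appear to be related in a simple way: OoverE matches Mean_O divided by E, and CSCI matches the mean of OoverE and MMI. The sketch below checks this against values copied by hand from the output above. It is an observation about this example output, not a description of the package's internal code.

```r
# Values copied from the printed core report above (as printed, so rounded)
core <- data.frame(
  SampleID = c("BadSample1", "BadSample2", "Sample1", "Sample2", "Sample3", "Sample4"),
  E        = c(10.248486, 10.248486, 7.544418, 12.953853, 10.248486, 10.248486),
  Mean_O   = c(1.00, 1.00, 8.95, 11.15, 13.15, 9.00),
  OoverE   = c(0.09757538, 0.09757538, 1.18630752, 0.86074778, 1.28311631, 0.87817846),
  MMI      = c(0.1638082, 0.3488195, 0.8281616, 0.8295090, 1.2047019, 0.9080194),
  CSCI     = c(0.1306918, 0.2231975, 1.0072346, 0.8451284, 1.2439091, 0.8930989)
)

# Recompute both columns from their apparent inputs
oe_check   <- core$Mean_O / core$E
csci_check <- (core$OoverE + core$MMI) / 2

# Compare with the reported values, allowing for print rounding
all.equal(oe_check, core$OoverE, tolerance = 1e-4)
all.equal(csci_check, core$CSCI, tolerance = 1e-4)
```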
