dataCompareR vignette

Here we offer end-to-end examples of using dataCompareR and, for those who want to know, to provide details of how the package performs the comparison.

dataCompareR examples

Matching without a key

For the purpose of this vignette we’ll intentionally modify iris to use for our comparison.

library(dataCompareR)

# We'll use iris for our comparison
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

# Make a copy of iris
iris2 <- iris
# And change it, first by subsetting just the first 140 rows
iris2 <- iris2[1:140,]
# then removing the Petal.Width column
iris2$Petal.Width <- NULL
# And then changing some values
iris2[1:10,1] <- iris2[1:10,1] + 1

And then run a comparison using the rCompare function

# run the comparison
compIris <- rCompare(iris, iris2)

## Running rCompare...

## Warning: `select_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `select()` instead.
## ℹ The deprecated feature was likely used in the dataCompareR package.
##   Please report the issue at
##   <https://github.com/capitalone/dataCompareR/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

rCompare returns an S3 object which you can use with summary and print. Summary is a good way to check the results

# Check the results
summary(compIris)

## dataCompareR is generating the summary...

## 
## Data Comparison
## ===============
## 
## Date comparison run: 2025-03-12 04:59:44.426165  
## Comparison run on R version 4.4.3 (2025-02-28)  
## With dataCompareR version 0.1.4  
## 
## 
## Meta Summary
## ============
## 
## 
## |Dataset Name |Number of Rows |Number of Columns |
## |:------------|:--------------|:-----------------|
## |iris         |150            |5                 |
## |iris2        |140            |4                 |
## 
## 
## Variable Summary
## ================
## 
## Number of columns in common: 4  
## Number of columns only in iris: 1  
## Number of columns only in iris2: 0  
## Number of columns with a type mismatch: 0  
## No match key used, comparison is by row
## 
## 
## Columns only in iris: PETAL.WIDTH  
## Columns in both : PETAL.LENGTH, SEPAL.LENGTH, SEPAL.WIDTH, SPECIES  
## 
## Row Summary
## ===========
## 
## Total number of rows read from iris: 150  
## Total number of rows read from iris2: 140    
## Number of rows in common: 140  
## Number of rows dropped from iris: 10  
## Number of rows dropped from  iris2: 0  
## 
## 
## Data Values Comparison Summary
## ==============================
## 
## Number of columns compared with ALL rows equal: 3  
## Number of columns compared with SOME rows unequal: 1  
## Number of columns with missing value differences: 0  
## 
## Columns with all rows equal : PETAL.LENGTH, SEPAL.WIDTH, SPECIES
## 
## Summary of columns with some rows unequal: 
## 
## 
## 
## |Column       |Type (in iris) |Type (in iris2) | # differences|Max difference | # NAs|
## |:------------|:--------------|:---------------|-------------:|:--------------|-----:|
## |SEPAL.LENGTH |double         |double          |            10|1              |     0|
## 
## 
## 
## Unequal column details
## ======================
## 
## 
## 
## #### Column -  SEPAL.LENGTH
## Showing sample of size 5
## 
## 
## 
## |   | SEPAL.LENGTH (iris)| SEPAL.LENGTH (iris2)|Type (iris) |Type (iris2) | Difference|
## |:--|-------------------:|--------------------:|:-----------|:------------|----------:|
## |1  |                 4.7|                  5.7|double      |double       |         -1|
## |2  |                 4.6|                  5.6|double      |double       |         -1|
## |3  |                 4.6|                  5.6|double      |double       |         -1|
## |4  |                 5.1|                  6.1|double      |double       |         -1|
## |5  |                 4.9|                  5.9|double      |double       |         -1|

Or you save a copy of the report using saveReport

# Write the summary to a file
saveReport(compIris, reportName = 'compIris')

Comparison with a key

In the first example, we compared our data based on it’s order. What if want to match our data of a key? We’ll produce another test data set based on the pressure dataset

# We'll use the pressure dataset for comparison
head(pressure)

##   temperature pressure
## 1           0   0.0002
## 2          20   0.0012
## 3          40   0.0060
## 4          60   0.0300
## 5          80   0.0900
## 6         100   0.2700

# Make a copy of pressure
pressure2 <- pressure
# And change it, first by randomising the row order
pressure2 <- pressure2[sample(nrow(pressure2)),]
# then changing just one element, so for temperature of
pressure2[5,1]

## [1] 320

# We modify pressure to be twice as large
pressure2[5,2] <- pressure2[5,2] * 2

Run the comparison with rCompare specifying that we want to match on temperature

# run the comparison
compPressure <- rCompare(pressure, pressure2, keys = 'temperature')

## Running rCompare...

## Warning: `arrange_()` was deprecated in dplyr 0.7.0.
## ℹ Please use `arrange()` instead.
## ℹ See vignette('programming') for more help
## ℹ The deprecated feature was likely used in the dataCompareR package.
##   Please report the issue at
##   <https://github.com/capitalone/dataCompareR/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

And this time, we’ll choose to get a shorter summary using print

# Check the results - use print for a quick summary
print(compPressure)

## All columns were compared, all rows were compared 
## There are  1 mismatched variables:
## First and last 5 observations for the  1 mismatched variables
##   TEMPERATURE valueA valueB variable  typeA  typeB diffAB
## 1         320    376    752 PRESSURE double double   -376

We can also extract the mismatching data to explore further using generateMismatchData which generates a list containing two data frames, each having the missing rows from the comparison.

# use generateMismatchData to pull out the mismatching rows from each table
mismatches <- generateMismatchData(compPressure, pressure, pressure2)

mismatches

## $pressure_mm
##   TEMPERATURE PRESSURE
## 1         320      376
## 
## $pressure2_mm
##   TEMPERATURE PRESSURE
## 1         320      752

Internal functions

It is possible to use the other functions not exposed to the end user through the 3 colons format like dataCompareR:::functionName. Please take care when using them, as some of the checks are done up front, so they may make assumptions on the input.

dataCompareR workflow

The aspects of the dataCompareR::rCompare function that matter to the end user are:-

Coerce the two data objects provided to data frames (if required).
- Coercion uses as.data.frame. If you need more advanced coercion, please do this before calling dataCompareR.
Run data validation and cleaning
- Currently the validation step is fairly minimal, mainly confirming that any match key fields are present in both data sets.
- Cleaning consists of a few steps (these are run on a copy of the data passed to dataCompareR - the original data is unchanged)
  1. Convert factors to strings, to make for a clearer comparison
  2. Tidy up column names (trim white space, convert to upper case)
  3. (Optional) Trim strings to remove leading and trailing white space
  4. (Optional) Round numerics to a specified number of decimal places
Match columns between tables
- Columns present in both tables will be compared
- Columns in only one of the two tables will not be compared, but will be recorded as missing in the output.
Match rows. This step differs depending on whether a match key is specified.
- No match key.
  - Rows are matched based on order. If table A contains 100 rows and table B contains 150 rows, the last 50 rows from B will be dropped, and the comparison will proceed on the matching 100 rows. The fact that 50 rows were dropped will be recorded in the output.
- With match key.
  - Rows are matched based on the key(s) provided. Rows that exist in both tables will be compared. Rows that do not exist in both tables will be dropped, with details of rows dropped recorded in the output.
Data comparison for matching columns/rows
- Find which elements match between the remaining columns and row in the two data frames. The == operator is used to assess equality, with the exception of NA and NaN, which are handled in the following way
  - If both elements are NA, match is TRUE
  - If both elements are NaN, match is TRUE
  - If one element is NA and the other NaN, match is FALSE
  - If one element is NA, and the other is a valid value, match is FALSE
  - If one element is NaN, and the other is a valid value, match is FALSE
- For mismatching elements, store details of the mismatch, and for numeric fields, store the difference between them
Write everything that was learned in to an dataCompareR object