The errorlocate package uses the linear, categorical and conditional rules from a
rule set formulated with the R package validate
to create a Mixed Integer Problem.
For most users the details of the translation are not relevant and
hidden in locate_errors. Often the number of errors found
and the processing time are much more relevant parameters.
In a few cases, you may run into problems with your error localization:

a. The processing time of locate_errors is high.
b. locate_errors missed an obvious error.
c. locate_errors indicates that it did not find a valid solution (for some records).

Problem a. can be addressed by using the parallel argument of
locate_errors (and replace_errors). Problem b.
can occur because errorlocate ignores non-linear rules:
it only takes linear, categorical and conditional rules into account,
and is therefore not able to deduce errors that follow from other rules.
There may also be problems with your rule set. These may be
mitigated by using the validatetools
package, which can detect conflicting and redundant rules and has methods
to simplify your rule set.
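As a minimal sketch of such a check, the snippet below runs a few validatetools functions on two small rule sets. The rule sets are hypothetical and made up for illustration; they do not come from this document.

```r
library(validate)
library(validatetools)

# Hypothetical rule set with a conflict: age cannot be >= 18 and <= 10
conflicting <- validator(r1 = age >= 18, r2 = age <= 10)
is_infeasible(conflicting)

# Hypothetical rule set with a redundant rule: r1 implies r2
redundant <- validator(r1 = age >= 18, r2 = age >= 12)
detect_redundancy(redundant)

# Remove redundant rules and simplify what is left
simplify_rules(redundant)
```

Running checks like these before calling locate_errors can help to rule out the rule set itself as the cause of a missing solution.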
If you want to dive deep into the mixed integer problem that is
created by errorlocate you can use the
inspect_mip function.
In the following sections an example is given of how linear, categorical and conditional rules are written as Mixed Integer Problems. First let’s see how these rules in validator can be formally defined.
Each translatable rule \(r_i(\mathbf{x})\) is a function that operates on (some of) the values
of record \(\mathbf{x} = (x_1, \ldots, x_n)\) and is TRUE (valid) or FALSE
(not valid). It can be written as a
disjunction of atomic clauses \(C_i^j(\mathbf{x})\):
\[ r_i(\mathbf{x}) = \bigvee_j C_i^j(\mathbf{x}) \]
with each atomic clause:
\[ C_i^j(\mathbf{x}) = \left\{ \begin{array}{l} \mathbf{a}^T\mathbf{x} \leq b \\ \mathbf{a}^T\mathbf{x} = b \\ x_j \in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \\ x_j \not\in F_{ij} \textrm{ with } F_{ij} \subseteq D_j \\ \end{array} \right. \]
Each linear, categorical or conditional rule \(r_i\) can be written in this form.
rules <- validator(example_1 = if (income > 0) age >= 16)
rules$exprs()
#> $example_1
#> income <= 0 | (age - 16 >= -1e-08)
#> attr(,"reference")
#> example_1 
#>         1

So the rule if (income > 0) age >= 16 can be
written as (income <= 0 OR age >= 16).
rules <- validator(example_2 = if (has_house == "yes") income >= 1000)
rules$exprs()
#> $example_2
#> has_house != "yes" | (income - 1000 >= -1e-08)
#> attr(,"reference")
#> example_2 
#>         1

So the rule if (has_house == "yes") income >= 1000
can be written as (has_house != "yes" OR
income >= 1000).
The rules form a system \(R(\mathbf{x})\):
\[ R(\mathbf{x}) = \bigwedge_i r_i(\mathbf{x}) \]
which means that all rules \(r_i\) must be valid. If \(R(\mathbf{x})\) is true for record \(\mathbf{x}\), then the record is valid; otherwise one (or more) of the rules is violated.
Each rule set \(R(\mathbf{x})\) can be translated into a mip problem and solved:
\[ \begin{array}{r} \textrm{Minimize } f(\mathbf{x}); \\ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \\ \end{array} \]
- \(f(\mathbf{x})\) is the (weighted) number of changed variables, with \(\delta_i \in \{0,1\}\) indicating whether variable \(x_i\) is changed:
\[ f(\mathbf{x}) = \sum_{i=1}^N w_i \delta_i \]
- \(\mathbf{R}\) contains two sets of rules:
  - \(\mathbf{R}_H(\mathbf{x}) \leq \mathbf{d}_H\): the rules that were specified with validate/validator.
  - \(\mathbf{R}_0(\mathbf{x}, \mathbf{\delta}) \leq \mathbf{d}_0\): soft constraints that try to fix the current record \(\mathbf{x}\) to the observed values.
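As a sketch of what such a soft constraint can look like for a numeric variable, the lp models printed further on in this document suggest a big-M formulation. The notation \(x_i^0\) for the observed value and the symbol \(M\) are assumptions introduced here for illustration; the value \(M = 10^7\) is read from the coefficients in those printed models.

\[ \begin{array}{rcl} x_i - M\delta_i & \leq & x_i^0 \\ -x_i - M\delta_i & \leq & -x_i^0 \\ \end{array} \]

With \(\delta_i = 0\) the two inequalities force \(x_i = x_i^0\). With \(\delta_i = 1\) they are relaxed by \(M\), so \(x_i\) is effectively free, at a cost of \(w_i\) in the objective. For an observed age of 12 this gives \(\textrm{age} - 10^7\,\delta_{age} \leq 12\) and \(-\textrm{age} - 10^7\,\delta_{age} \leq -12\).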
inspect_mip

Most users will use the function locate_errors to find
errors. The function inspect_mip works exactly the same, except
that it operates on just one record instead of a whole
data.frame. The result of inspect_mip is a mip
object that has not yet been executed and can be inspected.
rules <- validator( r1 = age >= 18
                  , r2 = income >= 0
                  )
data <- data.frame(age = c(12, 35), income = c(2000, -1000))
data

| age | income | 
|---|---|
| 12 | 2000 | 
| 35 | -1000 | 
So we detect two errors in the dataset:
| name | items | passes | fails | nNA | error | warning | expression | 
|---|---|---|---|---|---|---|---|
| r1 | 2 | 1 | 1 | 0 | FALSE | FALSE | age - 18 >= -1e-08 | 
| r2 | 2 | 1 | 1 | 0 | FALSE | FALSE | income - 0 >= -1e-08 | 
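A summary table like the one above can be produced with validate, roughly as follows. This is a sketch that recreates the rules and data from the example; the variable name cf is chosen here for illustration.

```r
library(validate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = c(12, 35), income = c(2000, -1000))

# Confront the data with the rules and summarise passes and fails per rule
cf <- confront(data, rules)
summary(cf)
```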
Let's inspect the first record:
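A minimal sketch of that step, recreating the rules and data from the example. The first row is selected explicitly here, on the assumption that passing a single-row data.frame is the safest way to call inspect_mip.

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = c(12, 35), income = c(2000, -1000))

# inspect_mip works on one record, so pass the first row only
mip <- inspect_mip(data[1, , drop = FALSE], rules)

# List the rules that make up the mip problem for this record
mip$mip_rules()
```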
The mip object contains the mip problem before it is
executed. We can inspect the lp problem before it is solved with
lpSolveAPI:
  Model name: errorlocate
                        age          income      .delta_age   .delta_income           
  Minimize                0               0   1.10298728745  1.088278376264           
  r1                     -1               0               0               0  <=    -18
  r2                      0              -1               0               0  <=      0
  age_ub                  1               0          -1e+07               0  <=     12
  income_ub               0               1               0          -1e+07  <=   2000
  age_lb                 -1               0          -1e+07               0  <=    -12
  income_lb               0              -1               0          -1e+07  <=  -2000
  Kind                  Std             Std             Std             Std           
  Type                 Real            Real             Int             Int           
  Upper                 Inf             Inf               1               1           
  Lower                -Inf            -Inf               0               0

Validator rules r1 and r2 are encoded in
two lines of the model. The values of the current record are encoded as
soft constraints in age_ub, age_lb,
income_lb and income_ub. These constraints try
to fix the values of age at 12 and income at
2000, but can be violated by setting .delta_age or
.delta_income to 1.
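The model listing above can be obtained by converting the mip object to an lpSolveAPI model and printing it. The sketch below assumes the to_lp() method of the mip object, which is not named elsewhere in this document, so check it against your installed version of errorlocate.

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = 12, income = 2000)
mip   <- inspect_mip(data, rules)

# Convert to an lpSolveAPI model (assumed method name: to_lp) and print it
lp <- mip$to_lp()
print(lp)
```

The objective coefficients of the delta variables contain a random component, so the printed weights will differ from run to run unless a seed is set.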
For large problems the lp problem can be written to disk for inspection.
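A sketch of writing the problem to a file, assuming the mip object offers a write_lp() method that takes a file name (an assumption, not confirmed by this document). A temporary file is used here so that the example leaves nothing behind.

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = 12, income = 2000)
mip   <- inspect_mip(data, rules)

# Write the lp problem to a text file (assumed method name: write_lp)
lp_file <- tempfile(fileext = ".lp")
mip$write_lp(lp_file)

# Show the first lines of the file
head(readLines(lp_file))
```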
Once we execute the mip problem, the lp solver is run on the problem.
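A sketch of that step, assuming the mip object is executed through an execute() method that returns the result object res discussed in this section:

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = 12, income = 2000)
mip   <- inspect_mip(data, rules)

# Solve the mip problem (assumed method name: execute)
res <- mip$execute()

# See which components the result holds
names(res)
```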
Extra arguments are passed through to lpSolveAPI. The
result object contains several properties:
- res$solution indicates whether a solution was found.
- res$s indicates the lpSolveAPI status, that is, what
kind of solution was found.
- res$errors indicates which fields/values are deemed
erroneous.
- res$values contains the values of the valid solution
that has been found by the lp solver.
Note that the solver has found that setting age from 12
to 18 gives a valid solution. .delta_age = 1 indicates that
age contained an error.
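The properties described above can be looked at one by one. The sketch below recreates the example and assumes that res is the list returned by the execute() method of the mip object.

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = 12, income = 2000)
res   <- inspect_mip(data, rules)$execute()

res$solution   # was a solution found?
res$errors     # which fields are marked as erroneous
res$values     # values of the valid solution found by the solver
```

For this record only age violates a rule, so age is the field that should be flagged, with a corrected value of 18.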
The result object res also contains an lp
object after optimization. This object can be further investigated using
lpSolveAPI functions.
  Model name: errorlocate
                        age          income      .delta_age   .delta_income           
  Minimize                0               0   1.10298728745  1.088278376264           
  age_ub                  1               0          -1e+07               0  <=     12
  income_ub               0               1               0          -1e+07  <=   2000
  income_lb               0              -1               0          -1e+07  <=  -2000
  Kind                  Std             Std             Std             Std           
  Type                 Real            Real             Int             Int           
  Upper                 Inf             Inf               1               1           
  Lower                  18               0               0               0

Note that the lp problem has been simplified. The lp object after solving shows that the
solver has optimized some of the rules: it has moved the single-variable
constraints r1 and r2 into the Lower boundary
conditions. It also removed age_lb because that was
superfluous with respect to the boundary conditions.
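A sketch of inspecting the solved lp object with lpSolveAPI functions. It assumes that the object is stored in the result as res$lp, a component name that this document does not state explicitly.

```r
library(validate)
library(errorlocate)
library(lpSolveAPI)

rules <- validator(r1 = age >= 18, r2 = income >= 0)
data  <- data.frame(age = 12, income = 2000)
res   <- inspect_mip(data, rules)$execute()

# Assumed component name: lp
lp <- res$lp
print(lp)

get.variables(lp)   # values of all decision variables in the solution
get.bounds(lp)      # lower and upper bounds after presolve
```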
In categorical rules, each category is coded in a separate column/mip
variable: e.g. if we have a working variable, with two
categories (“job”, “retired”), the mip problem is encoded as
follows:
| working | 
|---|
| ? | 
  Model name: errorlocate
                  working:?      working:job  working:retired   .delta_working      
  Minimize                0                0                0   1.384103718214      
  r1                      0                1                1                0  =  1
  working                 1                0                0                1  =  1
  Kind                  SOS              SOS              SOS              Std      
  Type                  Int              Int              Int              Int      
  Upper                   1                1                1                1      
  Lower                   0                0                0                0

Row r1 indicates that either working:job or
working:retired must be true. The Kind row
(SOS) indicates that these variables share the same switch:
only one of them can be set.
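A model like the one above might be reproduced as follows. The rule r1 is an assumption: the document shows only the data and the resulting model, and a membership rule with %in% is one plausible way to arrive at it. The to_lp() method is likewise assumed.

```r
library(validate)
library(errorlocate)

# Assumed rule: working must be one of the two known categories
rules <- validator(r1 = working %in% c("job", "retired"))
data  <- data.frame(working = "?", stringsAsFactors = FALSE)

mip <- inspect_mip(data, rules)
print(mip$to_lp())
```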
With categorical variables it is also possible to specify
if-then rules. These are encoded as one mip rule:
rules <- validator( r1 = if (voted == TRUE) adult == TRUE)
data <- data.frame(voted = TRUE, adult = FALSE)

| voted | adult | 
|---|---|
| TRUE | FALSE | 
  Model name: errorlocate
                     adult           voted    .delta_adult    .delta_voted       
  Minimize               0               0  1.495953047416  1.358809254132       
  r1                    -1               1               0               0  <=  0
  voted                  0               1               0               1   =  1
  adult                 -1               0               1               0   =  0
  Kind                 Std             Std             Std             Std       
  Type                 Int             Int             Int             Int       
  Upper                  1               1               1               1       
  Lower                  0               0               0               0

A conditional rule that combines linear clauses, for example if (income > 0) age >= 16, takes more than one mip rule. Consider the following record:

| age | income | 
|---|---|
| 12 | 2000 | 

errorlocate encodes this rule into multiple rules (as
noted in the theoretical section above), so rule r1 is
split into 1 rule + 2 sub rules:

r1: if (income > 0) age >= 16:
- r1._lin1: if (r1._lin1 == FALSE) income <= 0
- r1._lin2: if (r1._lin2 == FALSE) age >= 16
- r1: r1._lin1 == FALSE | r1._lin2 == FALSE
This can be seen with:
mip$mip_rules()
#> [[1]]
#> r1: r1._lin1 + r1._lin2 <= 1
#> [[2]]
#> r1._lin1: income - 1e+07*r1._lin1 <= 0
#> [[3]]
#> r1._lin2: -age - 1e+07*r1._lin2 <= -16
#> [[4]]
#> income_ub: income - 1e+07*.delta_income <= 2000
#> [[5]]
#> age_ub: age - 1e+07*.delta_age <= 12
#> [[6]]
#> income_lb: -income - 1e+07*.delta_income <= -2000
#> [[7]]
#> age_lb: -age - 1e+07*.delta_age <= -12

The resulting lp model is:
  Model name: errorlocate
                        age          income      .delta_age   .delta_income        r1._lin1        r1._lin2           
  Minimize                0               0   1.38872261066  1.190017589718               0               0           
  r1                      0               0               0               0               1               1  <=      1
  r1._lin1                0               1               0               0          -1e+07               0  <=      0
  r1._lin2               -1               0               0               0               0          -1e+07  <=    -16
  income_ub               0               1               0          -1e+07               0               0  <=   2000
  age_ub                  1               0          -1e+07               0               0               0  <=     12
  income_lb               0              -1               0          -1e+07               0               0  <=  -2000
  age_lb                 -1               0          -1e+07               0               0               0  <=    -12
  Kind                  Std             Std             Std             Std             Std             Std           
  Type                 Real            Real             Int             Int             Int             Int           
  Upper                 Inf             Inf               1               1               1               1           
  Lower                -Inf            -Inf               0               0               0               0

This works together with categorical, linear and conditional rules.
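The conditional example can be put together as follows. This is a sketch: the rule and record are taken from the text above, while the to_lp() method used to print the model is an assumption.

```r
library(validate)
library(errorlocate)

rules <- validator(r1 = if (income > 0) age >= 16)
data  <- data.frame(age = 12, income = 2000)

mip <- inspect_mip(data, rules)

# The rule is split into r1, r1._lin1 and r1._lin2, plus the soft constraints
mip$mip_rules()

# Print the corresponding lp model
print(mip$to_lp())
```

In the printed model the binary variables r1._lin1 and r1._lin2 each switch off one linear clause through the 1e+07 coefficient, and row r1 requires that at most one of them is switched off.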
The weights for each variable are normally set to 1.
errorlocate adds a small random remainder to the weights, so that
the solution is unique. The result can be made reproducible by using
set.seed.
set.seed(42)
rules <- validator( r1 = if (voted == TRUE) adult == TRUE)
data <- data.frame(voted = TRUE, adult = FALSE)
mip <- inspect_mip(data, rules, weight = c(voted = 3, adult = 1))

mip$objective contains the generated weights.
These are assigned to the delta variables in the
objective function of the mip.
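A sketch of how the weights influence the outcome, reusing the rule and record from this section. The execute() call is assumed to return the result list with the errors component.

```r
library(validate)
library(errorlocate)

set.seed(42)
rules <- validator(r1 = if (voted == TRUE) adult == TRUE)
data  <- data.frame(voted = TRUE, adult = FALSE)

mip <- inspect_mip(data, rules, weight = c(voted = 3, adult = 1))

# Generated weights: the supplied weight plus a random remainder
mip$objective

# Changing adult is cheaper than changing voted, so adult should be flagged
res <- mip$execute()
res$errors
```

Giving voted a higher weight expresses more trust in that field, which steers the solver towards changing adult instead.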
  Model name: errorlocate
                     adult           voted    .delta_adult    .delta_voted       
  Minimize               0               0  1.468537706648  3.457403021748       
  r1                    -1               1               0               0  <=  0
  voted                  0               1               0               1   =  1
  adult                 -1               0               1               0   =  0
  Kind                 Std             Std             Std             Std       
  Type                 Int             Int             Int             Int       
  Upper                  1               1               1               1       
  Lower                  0               0               0               0