-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathREADME.Rmd
More file actions
141 lines (104 loc) · 5.1 KB
/
README.Rmd
File metadata and controls
141 lines (104 loc) · 5.1 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
---
output: github_document
---
<!-- badges: start -->
[](https://github.com/ncn-foreigners/blocking/actions/workflows/R-CMD-check.yaml)
[](https://github.com/ncn-foreigners/blocking/actions/workflows/test-coverage.yaml)
[](https://app.codecov.io/gh/ncn-foreigners/blocking?branch=main)
[](https://CRAN.R-project.org/package=blocking)
[](https://cran.r-project.org/package=blocking)
[](https://cran.r-project.org/package=blocking)
[](https://github.com/SNStatComp/awesome-official-statistics-software)
<!-- badges: end -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# Overview
## Description
This R package is designed to block records for data deduplication and record linkage (also known as entity resolution) using [approximate nearest neighbor algorithms (ANN)](https://en.wikipedia.org/wiki/Nearest_neighbor_search) and graphs (via the `igraph` package).
It supports the following R packages that bind to specific ANN algorithms:
+ [rnndescent](https://cran.r-project.org/package=rnndescent) (default, very powerful, supports sparse matrices),
+ [RcppHNSW](https://cran.r-project.org/package=RcppHNSW) (powerful but does not support sparse matrices),
+ [RcppAnnoy](https://cran.r-project.org/package=RcppAnnoy),
+ [mlpack](https://cran.r-project.org/package=mlpack) (see `mlpack::lsh` and `mlpack::knn`).
The package can be used with the [reclin2](https://cran.r-project.org/package=reclin2) package via the `blocking::pair_ann` function.
## Installation
Install the stable version from CRAN:
```{r, eval=FALSE}
install.packages("blocking")
```
You can also install the development version from GitHub:
```{r, eval=FALSE}
# install.packages("pak") # uncomment if needed
pak::pkg_install("ncn-foreigners/blocking")
```
## Basic usage
Load packages for the examples:
```{r}
library(blocking)
library(reclin2)
```
Generate simple data with three groups (`df_example`) and reference data (`df_base`):
```{r}
df_example <- data.frame(txt = c(
"jankowalski",
"kowalskijan",
"kowalskimjan",
"kowaljan",
"montypython",
"pythonmonty",
"cyrkmontypython",
"monty"
))
df_base <- data.frame(txt = c("montypython", "kowalskijan", "other"))
df_example
df_base
```
Deduplication using the `blocking` function. Output contains information:
+ the method used (`nnd` refers to the NN descent algorithm),
+ number of blocks created (here 2 blocks),
+ number of columns used for blocking, i.e., how many shingles were created by the `text2vec` package (here 28),
+ reduction ratio, i.e., how large the reduction of comparison pairs is (here 0.5714, which means blocking reduces comparisons by over 57%).
```{r}
blocking_result <- blocking(x = df_example$txt)
blocking_result
```
Table with blocking results contains:
+ row numbers from the original data,
+ block number (integers),
+ distance (from the ANN algorithm).
```{r}
blocking_result$result
```
Deduplication using the `pair_ann` function for integration with the `reclin2` package. Use the pipeline with the `reclin2` package:
```{r}
pair_ann(x = df_example, on = "txt") |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
```
Linking records using the same function where `df_base` is the "register" and `df_example` is the reference data:
```{r}
pair_ann(x = df_base, y = df_example, on = "txt", deduplication = FALSE) |>
compare_pairs(on = "txt", comparators = list(cmp_jarowinkler())) |>
score_simple("score", on = "txt") |>
select_threshold("threshold", score = "score", threshold = 0.55) |>
link(selection = "threshold")
```
## See also
See section `Data Integration (Statistical Matching and Record Linkage)` in [the Official Statistics Task View](https://CRAN.R-project.org/view=OfficialStatistics).
Packages that allow blocking:
+ [klsh](https://CRAN.R-project.org/package=klsh) -- k-means locality sensitive hashing,
+ [reclin2](https://CRAN.R-project.org/package=reclin2) -- `pair_blocking`, `pair_minsim` functions,
+ [fastLink](https://CRAN.R-project.org/package=fastLink) -- `blockData` function.
Other:
+ [clevr](https://CRAN.R-project.org/package=clevr) -- evaluation of clustering, helper functions,
+ [exchanger](https://github.com/cleanzr/exchanger) -- Bayesian Entity Resolution with Exchangeable Random Partition Priors.
## Funding
Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.