rbm25 is a light wrapper around the rust bm25
crate. It provides a simple interface to the Okapi BM25 algorithm for text search.
Note the package does not provide any text preprocessing, this needs to be done before using the package.
Installation
You can install the development version of rbm25 like so:
# Development Version
# devtools::install_github("DavZim/rbm25")
# CRAN release
install.packages("rbm25")
Example
The package exposes an R6 class BM25
that can be used to query a text corpus. For simplicity, there is also a bm25_score()
function that wraps the BM25
class.
library(rbm25)
# create a text corpus, where we want to find the closest matches for a query
corpus_original <- c(
"The rabbit munched the orange carrot.",
"The snake hugged the green lizard.",
"The hedgehog impaled the orange orange.",
"The squirrel buried the brown nut."
)
# text preprocessing: tolower, remove punctuation, remove stopwords
# note this is just an example and not the best way for larger amounts of text
stopwords <- c("the", "a", "an", "and")
corpus <- corpus_original |>
tolower() |>
gsub(pattern = "[[:punct:]]", replacement = "") |>
gsub(pattern = paste0("\\b(", paste(stopwords, collapse = "|"), ") *\\b"),
replacement = "") |>
trimws()
# define some metadata for the text corpus, e.g., the original text and the source
metadata <- data.frame(
text_original = corpus_original,
source = c("book1", "book2", "book3", "book4")
)
Using the BM25 Class
bm <- BM25$new(data = corpus, metadata = metadata)
bm
#> <BM25 (k1: 1.20, b: 0.75)> with 4 documents (language: 'Detect')
#> - Data & Metadata
#> text metadata.text_original
#> 1 rabbit munched orange carrot The rabbit munched the orange carrot.
#> 2 snake hugged green lizard The snake hugged the green lizard.
#> 3 hedgehog impaled orange orange The hedgehog impaled the orange orange.
#> 4 squirrel buried brown nut The squirrel buried the brown nut.
#> metadata.source
#> 1 book1
#> 2 book2
#> 3 book3
#> 4 book4
# note that query returns the values sorted by rank
bm$query(query = "orange", max_n = 2)
#> id score rank text
#> 1 3 0.4904281 1 hedgehog impaled orange orange
#> 2 1 0.3566750 2 rabbit munched orange carrot
#> text_original source
#> 1 The hedgehog impaled the orange orange. book3
#> 2 The rabbit munched the orange carrot. book1
Using the bm25_score()
function
# note that bm25_score returns the score in the order of the input data
scores <- bm25_score(data = corpus, query = "orange")
data.frame(text = corpus, scores_orange = scores)
#> text scores_orange
#> 1 rabbit munched orange carrot 0.3566750
#> 2 snake hugged green lizard 0.0000000
#> 3 hedgehog impaled orange orange 0.4904281
#> 4 squirrel buried brown nut 0.0000000