Title: | A Light Wrapper Around the 'BM25' 'Rust' Crate for Okapi BM25 Text Search |
---|---|
Description: | BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a user's search query. This package provides a light wrapper around the 'BM25' 'Rust' crate for Okapi BM25 text search. For more information, see Robertson et al. (1994) <https://trec.nist.gov/pubs/trec3/t3_proceedings.html>. |
Authors: | David Zimmermann-Kollenda [aut, cre], Michael Barlow [aut] (bm25 Rust library), Authors of the dependency Rust crates [aut] (see AUTHORS file) |
Maintainer: | David Zimmermann-Kollenda <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.3 |
Built: | 2025-02-16 04:30:57 UTC |
Source: | https://github.com/davzim/rbm25 |
Class to construct the BM25 search object
new()
Creates a new instance of a BM25 class
BM25$new(data = NULL, lang = "detect", k1 = 1.2, b = 0.75, metadata = NULL)
data
text data, a vector of strings. Any preprocessing (e.g. tolower, removing stop words) must be applied before the data is passed in.
lang
language of the data, see self$available_languages(); can also be "detect" to detect the language automatically. Default is "detect".
k1
k1 parameter of BM25, default is 1.2
b
b parameter of BM25, default is 0.75
metadata
a data.frame with arbitrary metadata for each document, e.g. a file path or a URL. It must have one row per document, i.e. as many rows as data has elements. Default is NULL.
BM25 object
corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))
bm25
bm25$get_data()
bm25$query("orange", max_n = 2)
bm25$query("orange", max_n = 3)
bm25$query("orange") # return all, same as max_n = Inf or NULL
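To make the roles of the k1 and b constructor parameters more concrete, the following is a minimal base R sketch of the textbook Okapi BM25 formula. It is not the package's implementation: the names tokenize() and bm25_sketch() are hypothetical, the tokenisation is a naive split on non-letters, and the smoothed IDF shown is only one common variant, so the scores need not match what BM25$query() returns.

```r
# Textbook Okapi BM25 in base R. Illustration only, NOT the package's code.
# Assumption: naive tokenisation (lowercase, split on non-letters, no stemming).
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

bm25_sketch <- function(docs, query, k1 = 1.2, b = 0.75) {
  doc_tokens <- tokenize(docs)
  query_terms <- unique(tokenize(query)[[1]])
  n_docs <- length(doc_tokens)
  doc_len <- lengths(doc_tokens)
  avg_len <- mean(doc_len)

  scores <- numeric(n_docs)
  for (term in query_terms) {
    # Term frequency per document and number of documents containing the term
    tf <- vapply(doc_tokens, function(tok) sum(tok == term), numeric(1))
    df <- sum(tf > 0)
    # Assumption: one common smoothed IDF variant
    idf <- log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # k1 limits the gain from repeated terms; b scales the length normalisation
    scores <- scores + idf * tf * (k1 + 1) /
      (tf + k1 * (1 - b + b * doc_len / avg_len))
  }
  scores
}

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
scores <- bm25_sketch(corpus, "orange")
print(data.frame(text = corpus, score = scores))
```

In this sketch, documents without the query term score zero, and the document that repeats "orange" scores higher than the one that mentions it once. Raising k1 widens that gap, while b only matters when document lengths differ.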
available_languages()
Returns the available languages
BM25$available_languages()
a named character vector with language codes and their full names
BM25$new()$available_languages()
get_data()
Returns the data
BM25$get_data(add_metadata = TRUE)
add_metadata
whether to add metadata to the data, default is TRUE
a data.frame with the data and metadata if available and selected
BM25$new(data = letters, metadata = LETTERS)$get_data()
get_lang()
Returns the language used
BM25$get_lang()
a character string with the language code
BM25$new()$get_lang()
BM25$new(lang = "en")$get_lang()
BM25$new(lang = "detect")$get_lang()
print()
Prints a BM25 object
BM25$print(n = 5, nchar = 20)
n
number of documents to print, default is 5
nchar
number of characters to print for each text, default is 20
the object, invisibly
BM25$new(data = letters, metadata = LETTERS)
add_data()
Adds data to the BM25 object
This is useful for adding more data later on. Note that it rebuilds the engine.
BM25$add_data(data, metadata = NULL)
data
a vector of strings
metadata
a data.frame with metadata for each document, default is NULL
NULL
bm25 <- BM25$new()
bm25$add_data(letters, metadata = LETTERS)
bm25
query()
Query the BM25 object for the N best matches
BM25$query(query, max_n = NULL, return_text = TRUE, return_metadata = TRUE)
query
the term to search for. Any preprocessing that was applied to the text corpus (e.g. tolower, removing stop words) must already have been applied to the term as well.
max_n
the maximum number of results to return, default is NULL, which returns all results
return_text
whether to return the text, default is TRUE
return_metadata
whether to return metadata, default is TRUE
a data.frame with the results
corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))
bm25$query("orange", max_n = 2)
bm25$query("orange", max_n = 3)
bm25$query("orange", return_text = FALSE, return_metadata = FALSE)
bm25$query("orange", max_n = 3)
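Because the query term has to go through the same preprocessing as the corpus, one option is to route both through a single helper. The sketch below assumes a hypothetical preprocess() function and an illustrative stop word list, neither of which is part of the package. The call into the package is guarded and assumes it is installed under the name rbm25, as the Source URL suggests.

```r
# Hypothetical helper: one preprocessing function shared by corpus and query.
# The stop word list is illustrative only and is not shipped with the package.
preprocess <- function(x, stopwords = c("the", "a", "an")) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)          # replace punctuation with spaces
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  vapply(
    tokens,
    function(tok) paste(tok[!tok %in% stopwords], collapse = " "),
    character(1)
  )
}

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The hedgehog impaled the orange orange."
)
clean_corpus <- preprocess(corpus)
clean_query <- preprocess("The ORANGE")

# Only runs when the package is available (assumed package name: rbm25).
if (requireNamespace("rbm25", quietly = TRUE)) {
  bm25 <- rbm25::BM25$new(data = clean_corpus, lang = "en")
  print(bm25$query(clean_query, max_n = 2))
}
```

Keeping the transformation in one function means a later change, such as a longer stop word list, applies to indexing and querying alike.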
clone()
The objects of this class are cloneable with this method.
BM25$clone(deep = FALSE)
deep
Whether to make a deep clone.
corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))
bm25$query("orange", max_n = 2)
bm25$query("orange")
A simple wrapper around the BM25 class.
bm25_score(data, query, lang = NULL, k1 = 1.2, b = 0.75)
data | text data, a vector of strings. Any preprocessing (e.g. tolower, removing stop words) must be applied before the data is passed in. |
query | the term to search for. Any preprocessing that was applied to the text corpus (e.g. tolower, removing stop words) must already have been applied to the term as well. |
lang | language of the data, see BM25$new()$available_languages(); can also be "detect" to detect the language automatically, default is "detect" |
k1 | k1 parameter of BM25, default is 1.2 |
b | b parameter of BM25, default is 0.75 |
a numeric vector of BM25 scores, one per document; higher values indicate higher relevance of the text to the query
corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
scores <- bm25_score(data = corpus, query = "orange")
data.frame(text = corpus, scores_orange = scores)
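Since bm25_score() returns scores in the order of the input rather than a ranked result, a small helper can turn them into a top-N table. The following is a sketch in base R: rank_by_score() is a hypothetical name, not a function of the package, and the guarded call assumes the package is installed under the name rbm25.

```r
# Hypothetical helper: order documents by a score vector, best first.
rank_by_score <- function(text, scores, max_n = Inf) {
  stopifnot(length(text) == length(scores))
  ranked <- data.frame(
    id = seq_along(text),
    text = text,
    score = scores,
    stringsAsFactors = FALSE
  )
  ranked <- ranked[order(ranked$score, decreasing = TRUE), , drop = FALSE]
  rownames(ranked) <- NULL
  utils::head(ranked, n = min(max_n, nrow(ranked)))
}

# Only runs when the package is available (assumed package name: rbm25).
if (requireNamespace("rbm25", quietly = TRUE)) {
  corpus <- c(
    "The rabbit munched the orange carrot.",
    "The snake hugged the green lizard.",
    "The hedgehog impaled the orange orange.",
    "The squirrel buried the brown nut."
  )
  scores <- rbm25::bm25_score(data = corpus, query = "orange")
  print(rank_by_score(corpus, scores, max_n = 2))
}
```

The id column keeps the original position of each document, so the ranked rows can still be matched back to the input vector.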