Package 'rbm25' reference manual

Title:	A Light Wrapper Around the 'BM25' 'Rust' Crate for Okapi BM25 Text Search
Description:	BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a user's search query. This package provides a light wrapper around the 'BM25' 'Rust' crate for Okapi BM25 text search. For more information, see Robertson et al. (1994) <https://trec.nist.gov/pubs/trec3/t3_proceedings.html>.
Authors:	David Zimmermann-Kollenda [aut, cre], Michael Barlow [aut] (bm25 Rust library), Authors of the dependency Rust crates [aut] (see AUTHORS file)
Maintainer:	David Zimmermann-Kollenda <[email protected]>
License:	MIT + file LICENSE
Version:	0.0.3
Built:	2025-02-16 04:30:57 UTC
Source:	https://github.com/davzim/rbm25

BM25 Object

Description

Class to construct the BM25 search object

Methods

Method `new()`

Creates a new instance of a BM25 class

Usage

BM25$new(data = NULL, lang = "detect", k1 = 1.2, b = 0.75, metadata = NULL)

Arguments

data: text data, a vector of strings. Any preprocessing steps (tolower, removing stopwords, etc.) must be applied before the data is passed in.
lang: language of the data, see self$available_languages(); can also be "detect" to detect the language automatically. Default is "detect".
k1: k1 parameter of BM25, default is 1.2
b: b parameter of BM25, default is 0.75
metadata: a data.frame with arbitrary metadata for each document, e.g. a file path or a URL. It must have the same number of rows as there are documents in data. Default is NULL.
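
To make the roles of k1 and b concrete, here is a minimal base R sketch of the textbook Okapi BM25 score for a single query term. It is not the implementation used by the Rust crate: the function name bm25_term_sketch, the whitespace tokenisation, and the IDF variant are assumptions chosen for illustration. The crate may tokenise, stem, and weight terms differently, so the numbers need not match those returned by the package.

```r
# Textbook Okapi BM25 score of one query term for every document.
# Assumptions (not taken from the package): lower-casing, punctuation
# stripping, whitespace tokenisation, and the "+ 1" IDF variant.
bm25_term_sketch <- function(docs, term, k1 = 1.2, b = 0.75) {
  tokens <- strsplit(tolower(gsub("[[:punct:]]", "", docs)), "\\s+")
  doc_len <- lengths(tokens)          # number of tokens per document
  avg_len <- mean(doc_len)            # average document length
  # term frequency of `term` in each document
  tf <- vapply(tokens, function(x) sum(x == term), numeric(1))
  n_docs <- length(docs)
  n_match <- sum(tf > 0)              # documents containing the term
  idf <- log((n_docs - n_match + 0.5) / (n_match + 0.5) + 1)
  # k1 caps the gain from repeated terms, b scales length normalisation
  idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
}

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
bm25_term_sketch(corpus, "orange")          # default k1 and b
bm25_term_sketch(corpus, "orange", b = 0)   # no length normalisation
```

In this formula, k1 controls how quickly repeated occurrences of a term saturate, and b controls how strongly scores are normalised by document length; b = 0 disables length normalisation.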

Returns

BM25 object

Examples

corpus <- c(
 "The rabbit munched the orange carrot.",
 "The snake hugged the green lizard.",
 "The hedgehog impaled the orange orange.",
 "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))
bm25
bm25$get_data()

bm25$query("orange", max_n = 2)
bm25$query("orange", max_n = 3)
bm25$query("orange") # return all, same as max_n = Inf or NULL

Method `available_languages()`

Returns the available languages

Usage

BM25$available_languages()

Returns

a named character vector with language codes and their full names

Examples

BM25$new()$available_languages()

Method `get_data()`

Returns the data

Usage

BM25$get_data(add_metadata = TRUE)

Arguments

add_metadata: whether to add metadata to the data, default is TRUE

Returns

a data.frame with the data and metadata if available and selected

Examples

BM25$new(data = letters, metadata = LETTERS)$get_data()

Method `get_lang()`

Returns the language used

Usage

BM25$get_lang()

Returns

a character string with the language code

Examples

BM25$new()$get_lang()
BM25$new(lang = "en")$get_lang()
BM25$new(lang = "detect")$get_lang()

Method `print()`

Prints a BM25 object

Usage

BM25$print(n = 5, nchar = 20)

Arguments

n: number of documents to print, default is 5
nchar: number of characters to print for each text, default is 20

Returns

the object, invisibly

Examples

BM25$new(data = letters, metadata = LETTERS)

Method `add_data()`

Adds data to the BM25 object

This is useful for adding more data later on. Note that this rebuilds the engine.

Usage

BM25$add_data(data, metadata = NULL)

Arguments

data: a vector of strings
metadata: a data.frame with metadata for each document, default is NULL
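
Because add_data() rebuilds the engine, it may be cheaper to collect new documents first and add them in a single call. The following is a sketch under that assumption. The batches list, its text and src fields, and the file names are hypothetical, and the package call is guarded so the snippet also runs where 'rbm25' is not installed.

```r
# Hypothetical batches of new documents arriving over time
batches <- list(
  list(text = c("first new document", "second new document"),
       src = c("file 5", "file 6")),
  list(text = "third new document", src = "file 7")
)

# Combine all batches into one character vector and one metadata data.frame
new_text <- unlist(lapply(batches, `[[`, "text"))
new_meta <- data.frame(src = unlist(lapply(batches, `[[`, "src")),
                       stringsAsFactors = FALSE)

# One metadata row per document, as required for metadata
stopifnot(nrow(new_meta) == length(new_text))

if (requireNamespace("rbm25", quietly = TRUE)) {
  bm25 <- rbm25::BM25$new(data = c("an existing document"), lang = "en",
                          metadata = data.frame(src = "file 1"))
  bm25$add_data(new_text, metadata = new_meta)  # a single rebuild
  print(bm25$get_data())
}
```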

Returns

NULL

Examples

bm25 <- BM25$new()
bm25$add_data(letters, metadata = LETTERS)
bm25

Method `query()`

Query the BM25 object for the N best matches

Usage

BM25$query(query, max_n = NULL, return_text = TRUE, return_metadata = TRUE)

Arguments

query: the term to search for. Any preprocessing that was applied to the text corpus (e.g. tolower, removing stopwords) must already have been applied to the term as well.
max_n: the maximum number of results to return, default is NULL, which returns all results
return_text: whether to return the text, default is TRUE
return_metadata: whether to return metadata, default is TRUE
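
One way to keep corpus and query preprocessing in sync is to route both through the same helper. The sketch below uses base R only. The helper name prep_text and its one-word stopword list are hypothetical and merely illustrate the idea; the package call is guarded so the snippet also runs where 'rbm25' is not installed.

```r
# Hypothetical helper: lower-case, replace punctuation with spaces,
# and drop stopwords. The default stopword list is illustrative only.
prep_text <- function(x, stopwords = c("the")) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", " ", x)
  words <- strsplit(trimws(x), "\\s+")
  vapply(words,
         function(w) paste(w[!w %in% stopwords], collapse = " "),
         character(1))
}

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
clean_corpus <- prep_text(corpus)
clean_query <- prep_text("Orange")   # same steps as for the corpus

if (requireNamespace("rbm25", quietly = TRUE)) {
  bm25 <- rbm25::BM25$new(data = clean_corpus, lang = "en")
  print(bm25$query(clean_query, max_n = 2))
}
```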

Returns

a data.frame with the results

Examples

corpus <- c(
 "The rabbit munched the orange carrot.",
 "The snake hugged the green lizard.",
 "The hedgehog impaled the orange orange.",
 "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))

bm25$query("orange", max_n = 2)
bm25$query("orange", max_n = 3)
bm25$query("orange", return_text = FALSE, return_metadata = FALSE)

Method `clone()`

The objects of this class are cloneable with this method.

Usage

BM25$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)
bm25 <- BM25$new(data = corpus, lang = "en",
                 metadata = data.frame(src = paste("file", 1:4)))
bm25$query("orange", max_n = 2)
bm25$query("orange")

Score a text corpus based on the Okapi BM25 algorithm

Description

A simple wrapper around the BM25 class.

Usage

bm25_score(data, query, lang = NULL, k1 = 1.2, b = 0.75)

Arguments

`data`	text data, a vector of strings. Any preprocessing steps (tolower, removing stopwords, etc.) must be applied before the data is passed in.
`query`	the term to search for. Any preprocessing that was applied to the text corpus (e.g. tolower, removing stopwords) must already have been applied to the term as well.
`lang`	language of the data, see BM25$new()$available_languages(); can also be "detect" to detect the language automatically. The default in the usage signature is NULL.
`k1`	k1 parameter of BM25, default is 1.2
`b`	b parameter of BM25, default is 0.75

Value

a numeric vector of BM25 scores; higher values indicate higher relevance of the text to the query
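
Since the scores come back as a plain numeric vector, ranking is left to the caller. Below is a minimal base R sketch of one way to do it. The helper name rank_by_score and its max_n argument are hypothetical and not part of the package; the call to bm25_score() is guarded so the snippet also runs where 'rbm25' is not installed.

```r
# Hypothetical helper: order texts by descending score and keep at most
# max_n rows (NULL keeps all).
rank_by_score <- function(text, scores, max_n = NULL) {
  stopifnot(length(text) == length(scores))
  ord <- order(scores, decreasing = TRUE)
  if (!is.null(max_n)) ord <- head(ord, max_n)
  data.frame(text = text[ord], score = scores[ord],
             stringsAsFactors = FALSE)
}

corpus <- c(
  "The rabbit munched the orange carrot.",
  "The snake hugged the green lizard.",
  "The hedgehog impaled the orange orange.",
  "The squirrel buried the brown nut."
)

if (requireNamespace("rbm25", quietly = TRUE)) {
  scores <- rbm25::bm25_score(data = corpus, query = "orange")
  print(rank_by_score(corpus, scores, max_n = 2))
}
```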

Examples

corpus <- c(
 "The rabbit munched the orange carrot.",
 "The snake hugged the green lizard.",
 "The hedgehog impaled the orange orange.",
 "The squirrel buried the brown nut."
)
scores <- bm25_score(data = corpus, query = "orange")
data.frame(text = corpus, scores_orange = scores)
