Laserize: An R package for obtaining LASER text embeddings by interfacing the Python `laserembeddings` module

The laserize package provides an R interface to the embed_sentences functionality of the laserembeddings Python module. laserembeddings is a port of Facebook Research's LASER (Language-Agnostic SEntence Representations).

LASER provides models to compute multilingual sentence embeddings that are aligned in a common, language-independent vector space. Sentences with similar semantics from different languages are thus mapped to "close" vectors.

The embed_sentences method of the laserembeddings Python module returns these vector representations using Facebook's pre-trained model. The laserize package provides an interface to this functionality.

Installation

devtools::install_github("haukelicht/laserize")

Usage

Setup

To set up laserize, call setup_laser. This interactive function downloads all required Python modules and the LASER model.

library(laserize)
setup_laser()

If a valid file path is passed to its .py.venv argument, setup_laser uses the Python virtual environment at that location and creates it first if it does not already exist.

library(laserize)
# with path to an _existing_ Python virtual environment
setup_laser(.py.venv = "path/to/venv")
# with path inside an existing directory where a _new_ Python virtual environment should be created
setup_laser(.py.venv = "path/to/existing/dir/venv")
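
Before calling setup_laser with a .py.venv path, it can help to check which of the two cases above applies. The following is a minimal sketch, not part of laserize: the helper name venv_status is hypothetical, and it assumes that an existing virtual environment can be recognised by the pyvenv.cfg file that Python's venv module writes into the environment directory.

```r
# Hypothetical helper (not part of laserize): classify a candidate venv path.
# Assumption: a virtual environment directory contains a 'pyvenv.cfg' file.
venv_status <- function(path) {
  if (file.exists(file.path(path, "pyvenv.cfg"))) {
    "existing venv"
  } else if (file.exists(path)) {
    "exists, not a venv"
  } else if (dir.exists(dirname(path))) {
    # parent directory exists, so a new environment could be created here
    "can be created"
  } else {
    "parent directory missing"
  }
}

# Example: a path inside a fresh temporary directory
root <- tempfile("venv-demo-")
dir.create(root)
venv_status(file.path(root, "venv"))
```

The helper only inspects the file system; whether setup_laser applies the same checks internally is not documented here.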

For details and more options see ?laserize::setup_laser.

Embedding sentences

Sentences/texts can be embedded by passing a data.frame object to laserize::laserize. The data frame must have the columns 'id' (sentence ID), 'text' (sentence text), and 'lang' (sentence language).

For details and more options see ?laserize::laserize.
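
The column requirement above can be checked before any embedding is computed. This is a minimal sketch under stated assumptions: check_laserize_input is a hypothetical helper, not a laserize function, and the check that 'text' is a character column is an assumption rather than documented behaviour.

```r
# Hypothetical helper (not part of laserize): verify the input data frame
# has the columns 'id', 'text', and 'lang' before passing it to laserize().
check_laserize_input <- function(df) {
  if (!is.data.frame(df)) stop("Input must be a data.frame")
  required <- c("id", "text", "lang")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: sentence texts are expected as character strings
  if (!is.character(df$text)) stop("Column 'text' must be of type character")
  invisible(df)
}

example_df <- data.frame(
  id = 1:2,
  text = c("Hallo Welt", "Hello world"),
  lang = c("de", "en"),
  stringsAsFactors = FALSE
)
check_laserize_input(example_df)  # passes silently
```

Because the helper returns its input invisibly, it can be placed in front of the laserize call without changing the rest of a script.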

test_df <- tibble::tribble(
  ~id, ~text, ~lang,
  001, "Hallo Welt", "de",
  002, "Auf Wiedersehen", "de",
  003, "Hello world", "en",
  004, "XXGWRXYYFGEG", "unknown",
)

# obtain LASER embeddings 
res <- laserize(test_df)
# 'res' is a named list with four elements
str(res, 1) 
# each list element is a list with elements 'id', 'text', 'lang', and 'e'
str(res[[1]], 1) 

# simplified output (matrix with IDs as row names)
res <- laserize(test_df, simplify = TRUE)
is.matrix(res) # a matrix
# rows as many as sentences in 'test_df',
# columns as many as embedding dimensions
dim(res)

# check sentence similarities
cosine_sim <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
# representations of greetings in German and English are very similar
cosine_sim(x = res[1, ], y = res[3, ])
# representations of German greeting and goodbye are somewhat dissimilar
cosine_sim(x = res[1, ], y = res[2, ])
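
The pairwise comparisons above extend to all sentences at once. The sketch below computes a full cosine similarity matrix from a matrix with one row per sentence, which is the shape of the simplified laserize output. The matrix emb is a toy stand-in with made-up values, and cosine_sim_matrix is a hypothetical helper, not part of laserize.

```r
# Toy stand-in for the simplified laserize() output (values are made up):
# one row per sentence, row names are sentence IDs.
emb <- rbind(
  "1" = c(1, 0, 1),
  "2" = c(0, 1, 0),
  "3" = c(2, 0, 2)
)

# Hypothetical helper: cosine similarity between all pairs of rows.
cosine_sim_matrix <- function(m) {
  # scale every row to unit length (the norm vector is recycled down columns)
  unit <- m / sqrt(rowSums(m^2))
  # dot products of unit vectors are cosine similarities
  tcrossprod(unit)
}

sims <- cosine_sim_matrix(emb)
sims["1", "3"]  # rows 1 and 3 point in the same direction
sims["1", "2"]  # rows 1 and 2 are orthogonal
```

With real embeddings, emb would be replaced by the matrix returned by laserize(test_df, simplify = TRUE).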
