Posit AI Blog: Introducing the text package


AI-based language analysis has recently gone through a “paradigm shift” (Bommasani et al., 2021, p. 1), thanks in part to a new technique referred to as the transformer language model (Vaswani et al., 2017; Liu et al., 2019). Companies including Google, Meta, and OpenAI have released such models, including BERT, RoBERTa, and GPT, which have achieved unprecedentedly large improvements across most language tasks, such as web search and sentiment analysis. While these language models are accessible in Python, and for typical AI tasks through HuggingFace, the R package text makes HuggingFace and state-of-the-art transformer language models accessible as social scientific pipelines in R.


We developed the text package (Kjell, Giorgi & Schwartz, 2022) with two objectives in mind:
  • To serve as a modular solution for downloading and using transformer language models. This includes, for example, transforming text to word embeddings as well as accessing common language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation.
  • To provide an end-to-end solution designed for human-level analyses, including pipelines for state-of-the-art AI techniques tailored for predicting characteristics of the person who produced the language or eliciting insights about linguistic correlates of psychological attributes.

This blog post shows how to install the text package, transform text to state-of-the-art contextual word embeddings, use language analysis tasks, and visualize words in word embedding space.

Installation and setting up a Python environment

The text package sets up a Python environment to get access to the HuggingFace language models. The first time after installing the text package, you need to run two functions: textrpp_install() and textrpp_initialize().

# Install text from CRAN
install.packages("text")
library(text)

# Install the Python packages required by text in a conda environment (with defaults)
textrpp_install()

# Initialize the installed conda environment
# save_profile = TRUE saves the settings so that you do not have to run textrpp_initialize() again after restarting R
textrpp_initialize(save_profile = TRUE)

See the extended installation guide for more information.

Transform text to word embeddings

The textEmbed() function is used to transform text to word embeddings (numeric representations of text). The model argument lets you set which language model to use from HuggingFace; if you have not used the model before, it will automatically download the model and necessary files.

# Transform the text data to BERT word embeddings
# Note: To run faster, try something smaller: model = 'distilroberta-base'.
word_embeddings <- textEmbed(texts = "Hello, how are you doing?",
                            model = 'bert-base-uncased')

The word embeddings can now be used for downstream tasks such as training models to predict related numeric variables (e.g., see the textTrain() and textPredict() functions).

(To get token-level and individual-layer output, see the textEmbedRawLayers() function.)
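
To make "numeric representations of text" more concrete, here is a minimal base R sketch, not the package's implementation. It assumes hypothetical four-dimensional token vectors (real BERT embeddings have hundreds of dimensions), averages them into one vector per text, and compares the two texts with cosine similarity. The names tokens_a, tokens_b, and cosine_similarity() are illustrative only.

```r
# Toy token-level embeddings (hypothetical values, one row per token).
tokens_a <- rbind(hello = c(0.2, 0.8, 0.1, 0.5),
                  world = c(0.4, 0.6, 0.3, 0.7))
tokens_b <- rbind(good = c(0.1, 0.9, 0.2, 0.4),
                  day  = c(0.5, 0.5, 0.2, 0.6))

# Aggregate the token vectors into one vector per text by averaging each dimension.
text_a <- colMeans(tokens_a)
text_b <- colMeans(tokens_b)

# Cosine similarity: dot product divided by the product of the vector lengths.
cosine_similarity <- function(u, v) {
  sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))
}

similarity <- cosine_similarity(text_a, text_b)
print(similarity)
```

Once texts are vectors of equal length, they can be compared, clustered, or used as predictors in a regression model, which is the idea behind the downstream functions mentioned above.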

There are many transformer language models at HuggingFace that can be used for various language model tasks such as text classification, sentiment analysis, text generation, question answering, and translation. The text package includes user-friendly functions to access these.

classifications <- textClassify("Hello, how are you doing?")
generated_text <- textGeneration("The meaning of life is")

For more examples of available language model tasks, see, for example, textSum(), textQA(), textTranslate(), and textZeroShot() under Language Analysis Tasks.

Visualizing words in the text package is achieved in two steps: first with a function to pre-process the data, and second with a function to plot the words, including adjusting visual characteristics such as color and font size.
To demonstrate these two functions we use example data included in the text package: Language_based_assessment_data_3_100. We show how to create a two-dimensional figure with words that individuals have used to describe their harmony in life, plotted according to two different well-being questionnaires: the harmony in life scale (HILS) and the satisfaction with life scale (SWLS). The x-axis shows words that are related to low versus high harmony in life scale scores, and the y-axis shows words related to low versus high satisfaction with life scale scores.
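
The general idea of placing words along a "low versus high" axis can be sketched in base R. This is a simplified illustration under our own assumptions, not the package's actual algorithm: words are split at the median score, each group's embeddings are averaged into a centroid, and every word is projected onto the line connecting the two centroids. The embeddings and scores below are hypothetical toy values.

```r
# Hypothetical 3-dimensional word embeddings (one row per word).
embeddings <- rbind(
  peace    = c(0.9, 0.1, 0.3),
  balance  = c(0.8, 0.2, 0.4),
  calm     = c(0.7, 0.3, 0.2),
  stress   = c(0.1, 0.9, 0.5),
  chaos    = c(0.2, 0.8, 0.6),
  conflict = c(0.3, 0.7, 0.4)
)
# Hypothetical scale score associated with each word.
scores <- c(peace = 30, balance = 28, calm = 26,
            stress = 10, chaos = 12, conflict = 14)

# Split the words into a high and a low group at the median score.
high <- scores > median(scores)
centroid_high <- colMeans(embeddings[high, , drop = FALSE])
centroid_low  <- colMeans(embeddings[!high, , drop = FALSE])

# Unit-length direction pointing from the low centroid to the high centroid.
direction <- centroid_high - centroid_low
direction <- direction / sqrt(sum(direction^2))

# Center each word at the midpoint between the centroids, then project it
# onto the direction with a dot product.
midpoint <- (centroid_high + centroid_low) / 2
projection <- as.vector(sweep(embeddings, 2, midpoint) %*% direction)
names(projection) <- rownames(embeddings)
print(round(projection, 2))
```

In this toy example, words from the high-scoring group receive positive coordinates and words from the low-scoring group receive negative ones. Repeating the procedure with a second score would give each word a y-coordinate as well.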

word_embeddings_bert <- textEmbed(Language_based_assessment_data_3_100,
                                  aggregation_from_tokens_to_word_types = "mean",
                                  keep_token_embeddings = FALSE)

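# Pre-process the data for plotting
# Note: the last three arguments are assumed from the textProjection() signature and the
# description above (word-type embeddings, HILS scores for x, SWLS scores for y)
df_for_plotting <- textProjection(Language_based_assessment_data_3_100$harmonywords,
                                  word_embeddings_bert$text$harmonywords,
                                  word_embeddings_bert$word_types,
                                  Language_based_assessment_data_3_100$hilstotal,
                                  Language_based_assessment_data_3_100$swlstotal)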

# Plot the data
plot_projection <- textProjectionPlot(
  word_data = df_for_plotting,
  y_axes = TRUE,
  p_alpha = 0.05,
  title_top = "Supervised Bicentroid Projection of Harmony in life words",
  x_axes_label = "Low vs. High HILS score",
  y_axes_label = "Low vs. High SWLS score",
  p_adjust_method = "bonferroni",
  points_without_words_size = 0.4,
  points_without_words_alpha = 0.4
)
plot_projection$final_plot

Figure: Supervised Bicentroid Projection of Harmony in life words
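
The plotting call above sets a significance level (p_alpha = 0.05) and a multiple-comparison correction (p_adjust_method = "bonferroni"). As a rough sketch of what a Bonferroni correction does, the base R function p.adjust() can be applied to a few hypothetical p-values. How the package applies the correction internally is not shown here.

```r
# Hypothetical unadjusted p-values, one per word.
p_values <- c(0.001, 0.01, 0.02, 0.04)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonferroni <- p.adjust(p_values, method = "bonferroni")

# Flag the words that remain significant at the 0.05 level after adjustment.
significant <- p_bonferroni < 0.05

print(data.frame(p_values, p_bonferroni, significant))
```

With four tests, only the two smallest p-values stay below 0.05 after adjustment, which illustrates why fewer words are marked as significant when a correction is applied.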

This post demonstrates how to carry out state-of-the-art text analysis in R using the text package. The package intends to make it easy to access and use transformer language models from HuggingFace to analyze natural language. We look forward to your feedback and contributions toward making such models available for social scientific and other applications more typical of R users.

  • Bommasani et al. (2021). On the opportunities and risks of foundation models.
  • Kjell et al. (2022). The text package: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning.
  • Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
  • Vaswani et al. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.


If you see mistakes or want to suggest changes, please create an issue on the source repository.


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/OscarKjell/ai-blog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.


For attribution, please cite this work as

Kjell, et al. (2022, Oct. 4). Posit AI Blog: Introducing the text package. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/

BibTeX citation

  author = {Kjell, Oscar and Giorgi, Salvatore and Schwartz, H Andrew},
  title = {Posit AI Blog: Introducing the text package},
  url = {https://blogs.rstudio.com/tensorflow/posts/2022-09-29-r-text/},
  year = {2022}


