Recommendation Nº 30032020p1
📘 URLs for personal use: | 🔓 1 Open access – 🛒 2 To buy | (copy & paste into the browser)
- Creative Commons (open access) address: tidytextmining.com
- Paid version address: amzn.to/3bDEYQO
Summary
This book is focused on practical software examples and data explorations. There are few equations, but a great deal of code. We especially focus on generating real insights from the literature, news, and social media that we analyze. We don’t assume any previous knowledge of text mining. Professional linguists and text analysts will likely find our examples elementary, though we are confident they can build on the framework for their own analyses. We do assume that the reader is at least slightly familiar with dplyr, ggplot2, and the %>% “pipe” operator in R, and is interested in applying these tools to text data. For users who don’t have this background, we recommend books such as R for Data Science. We believe that with a basic background and interest in tidy data, even a user early in their R career can understand and apply our examples.
Chapters
- Chapter 1 outlines the tidy text format and the unnest_tokens() function. It also introduces the gutenbergr and janeaustenr packages, which provide useful literary text datasets that we’ll use throughout this book. (See the first sketch after this list.)
- Chapter 2 shows how to perform sentiment analysis on a tidy text dataset, using the sentiments dataset from tidytext and inner_join() from dplyr. (Sketch below.)
- Chapter 3 describes the tf-idf statistic (term frequency times inverse document frequency), a quantity used for identifying terms that are especially important to a particular document. (Sketch below.)
- Chapter 4 introduces n-grams and how to analyze word networks in text using the widyr and ggraph packages.
- Chapter 5 introduces methods for tidying document-term matrices and corpus objects from the tm and quanteda packages, as well as for casting tidy text datasets into those formats.
- Chapter 6 explores the concept of topic modeling, and uses the tidy() method to interpret and visualize the output of the topicmodels package. (Sketch below.)
- Chapter 7 demonstrates an application of a tidy text analysis by analyzing the authors’ own Twitter archives. How do Dave’s and Julia’s tweeting habits compare?
- Chapter 8 explores metadata from over 32,000 NASA datasets (available in JSON) by looking at how keywords from the datasets are connected to title and description fields.
- Chapter 9 analyzes a dataset of Usenet messages from a diverse set of newsgroups (focused on topics like politics, hockey, technology, atheism, and more) to understand patterns across the groups.
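To give a flavour of Chapter 1, here is a minimal sketch of the tidy text format: one token per row, produced with unnest_tokens() on the janeaustenr texts. This is only an illustration of the workflow, not code taken from the book.

```r
# Minimal sketch: tidy text format, one word per row
# (assumes the tidytext, dplyr and janeaustenr packages are installed).
library(dplyr)
library(tidytext)
library(janeaustenr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%   # keep track of the original line
  ungroup() %>%
  unnest_tokens(word, text)               # split the text column into words

tidy_books %>% count(word, sort = TRUE)   # most frequent words overall
```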
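For Chapter 2, a sketch of the sentiment join: here the Bing lexicon obtained with get_sentiments() stands in for the book’s sentiments dataset, and the join itself is plain dplyr::inner_join().

```r
# Sketch: sentiment analysis by joining tokens to a lexicon.
# `tidy_books` is the one-word-per-row table from the previous sketch.
library(dplyr)
library(tidytext)

tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the lexicon
  count(book, sentiment, sort = TRUE)                  # positive/negative counts per book
```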
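For Chapter 3, a sketch of the tf-idf calculation with tidytext’s bind_tf_idf(), again building on the tidy_books table above and treating each book as a “document”.

```r
# Sketch: tf-idf per book.
library(dplyr)
library(tidytext)

book_words <- tidy_books %>%
  count(book, word, sort = TRUE)          # term counts per book

book_words %>%
  bind_tf_idf(word, book, n) %>%          # term, document, count columns
  arrange(desc(tf_idf))                   # most characteristic words first
```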
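Finally, for Chapters 5 and 6, a sketch of casting a tidy table into a document-term matrix, fitting an LDA model with topicmodels, and tidying the result; the choice of k = 2 topics here is arbitrary and only for illustration.

```r
# Sketch: cast to a document-term matrix, fit LDA, tidy the output.
library(dplyr)
library(tidytext)
library(topicmodels)

dtm <- book_words %>%                     # book, word, n counts from above
  cast_dtm(book, word, n)

lda <- LDA(dtm, k = 2, control = list(seed = 1234))

tidy(lda, matrix = "beta") %>%            # per-topic word probabilities
  group_by(topic) %>%
  slice_max(beta, n = 10)                 # top 10 words in each topic
```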
Authors
[Unofficial biography. For informational purposes only]
Julia Silge
Data scientist and software engineer at RStudio, where she works on open source modeling tools. She studied physics and astronomy, finishing her PhD in 2005. She is both an international speaker and a real-world practitioner, focusing on data analysis and machine learning practice. (Source: juliasilge.com)
David Robinson
Data scientist at Heap. His interests include statistics, data analysis, education, and programming in R. He is also the author of the broom and fuzzyjoin packages, and of the e-book Introduction to Empirical Bayes. He previously worked as Chief Data Scientist at DataCamp and as a data scientist at Stack Overflow, and received a PhD in Quantitative and Computational Biology from Princeton University.
Please thank the authors and publisher
Thank you very much for this work to @juliasilge and @drob, via @States_AI_IA #R #datascience #openscience #openaccess #ai #artificialintelligence #ia #thebibleai #ebook #free