-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.Rmd
168 lines (116 loc) · 4.06 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
---
output:
rmarkdown::github_document
editor_options:
chunk_output_type: console
---
```{r pkg-knitr-opts, include=FALSE}
hrbrpkghelpr::global_opts()
```
```{r badges, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::stinking_badges()
```
```{r description, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::yank_title_and_description()
```
## What's Inside The Tin
The following functions are implemented:
### DSL
- `web_client`/`webclient`: Create a new HtmlUnit WebClient instance<br/><br/>
- `wc_go`: Visit a URL<br/>
- `wc_html_nodes`: Select nodes from web client active page html content
- `wc_html_text`: Extract attributes, text and tag name from webclient page html content<br/><br/>
- `wc_html_attr`: Extract attributes, text and tag name from webclient page html content
- `wc_html_name`: Extract attributes, text and tag name from webclient page html content
- `wc_headers`: Return response headers of the last web request for current page
- `wc_browser_info`: Retreive information about the browser used to create the 'webclient'
- `wc_content_length`: Return content length of the last web request for current page
- `wc_content_type`: Return content type of web request for current page<br/><br/>
- `wc_render`: Retrieve current page contents<br/><br/>
- `wc_css`: Enable/Disable CSS support
- `wc_dnt`: Enable/Disable Do-Not-Track
- `wc_geo`: Enable/Disable Geolocation
- `wc_img_dl`: Enable/Disable Image Downloading
- `wc_load_time`: Return load time of the last web request for current page
- `wc_resize`: Resize the virtual browser window
- `wc_status`: Return status code of web request for current page
- `wc_timeout`: Change default request timeout
- `wc_title`: Return page title for current page
- `wc_url`: Return load time of the last web request for current page
- `wc_use_insecure_ssl`: Enable/Disable Ignoring SSL Validation Issues
- `wc_wait`: Block HtlUnit final rendering blocks until all background JavaScript tasks have finished executing
### Just the Content (pls)
- `hu_read_html`: Read HTML from a URL with Browser Emulation & in a JavaScript Context
### Content++
- `wc_inspect`: Perform a "Developer Tools"-like Network Inspection of a URL
## Installation
```{r install-ex, results='asis', echo=FALSE, cache=FALSE}
hrbrpkghelpr::install_block()
```
## Usage
```{r cache=FALSE}
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg
# current verison
packageVersion("htmlunit")
```
Something `xml2::read_html()` cannot do, read the table from <https://hrbrmstr.github.io/htmlunitjars/index.html>:
![](man/figures/test-url-table.png)
```{r ex1}
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"
pg <- xml2::read_html(test_url)
html_table(pg)
```
☹️
But, `hu_read_html()` can!
```{r ex2}
pg <- hu_read_html(test_url)
html_table(pg)
```
All without needing a separate Selenium or Splash server instance.
### Content++
We can also get a HAR-like content + metadata dump:
```{r ex3}
xdf <- wc_inspect("https://rstudio.com")
colnames(xdf)
select(xdf, method, url, status_code, content_length, load_time)
group_by(xdf, content_type) %>%
summarise(
total_size = sum(content_length),
total_load_time = sum(load_time)/1000
)
```
### DSL
```{r ex4}
wc <- web_client(emulate = "chrome")
wc %>% wc_browser_info()
wc <- web_client()
wc %>% wc_go("https://usa.gov/")
# if you want to use purrr::map_ functions the result of wc_html_nodes() needs to be passed to as.list()
wc %>%
wc_html_nodes("a") %>%
sapply(wc_html_text, trim = TRUE) %>%
head(10)
wc %>%
wc_html_nodes(xpath=".//a") %>%
sapply(wc_html_text, trim = TRUE) %>%
head(10)
wc %>%
wc_html_nodes(xpath=".//a") %>%
sapply(wc_html_attr, "href") %>%
head(10)
```
Handy function to get rendered plain text for text mining:
```{r ex5}
wc %>%
wc_render("text") %>%
substr(1, 300) %>%
cat()
```
### htmlunit Metrics
```{r echo=FALSE}
cloc::cloc_pkg_md()
```
## Code of Conduct
Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.