Fix and check links with hyperlink (#271)
* Fix and check links with hyperlink

* Add CI jobs

* Trigger if documentation changed

* Build on fork

* Remove temporary build on fork repo

* Remove comment

* Remove console.log
ggrossetie authored Feb 17, 2021
1 parent f1ccd65 commit 0f58d2c
Showing 12 changed files with 143 additions and 23 deletions.
28 changes: 28 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,28 @@
name: Docs

on:
  push:
    branches:
      - '4.0'
      - 'master'
  pull_request:
    branches:
      - '*'

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Use Node.js 14
        uses: actions/setup-node@v1
        with:
          node-version: '14'
      - run: npm install
        working-directory: 'doc'
      - run: npm run build:docs
        working-directory: 'doc'
      - run: npm run lint:links
        working-directory: 'doc'
20 changes: 20 additions & 0 deletions .github/workflows/notify.yml
@@ -0,0 +1,20 @@
name: Trigger Publish

on:
  push:
    paths:
      - 'doc/docs'
    branches:
      - '4.0'

jobs:
  trigger_publish:
    runs-on: ubuntu-latest

    steps:
      - name: Trigger Developer Event
        uses: peter-evans/repository-dispatch@master
        with:
          token: ${{ secrets.BUILD_ACCESS_TOKEN }}
          repository: neo4j-documentation/docs-refresh
          event-type: spark-connector
17 changes: 14 additions & 3 deletions doc/docs.yml
@@ -1,16 +1,27 @@
site:
  title: Neo4j Connector for Apache Spark User Guide
  url: /neo4j-spark-docs

content:
  sources:
  - url: ../
    branches: HEAD
    start_path: doc/docs

output:
  dir: ./build/site/developer

ui:
  bundle:
    url: https://s3-eu-west-1.amazonaws.com/static-content.neo4j.com/build/ui-bundle.zip
    snapshot: true

urls:
  html_extension_style: indexify

asciidoc:
  attributes:
-    page-theme: docs
-    page-cdn: /_/
+    experimental: ''
+    page-cdn: /static/assets
+    page-theme: developer
+    page-canonical-root: /developer
+    page-disabletracking: true
3 changes: 3 additions & 0 deletions doc/docs/antora.yml
@@ -10,3 +10,6 @@ asciidoc:
    theme: docs
    connector-version: 4.0.0
    copyright: Neo4j Inc.
+    url-neo4j-product-gds-lib: https://neo4j.com/product/graph-data-science-library/
+    url-gh-spark-notebooks: https://github.com/utnaf/neo4j-connector-apache-spark-notebooks
+    url-neo4j-gds-manual: https://neo4j.com/docs/graph-data-science/current/
8 changes: 4 additions & 4 deletions doc/docs/modules/ROOT/pages/architecture.adoc
@@ -168,7 +168,7 @@ MERGE (c)-[:BOUGHT { quantity: event.quantity }]->(p);
```

Notice that in this case the entire job can be done by a single Cypher statement. As data frames get complex,
these Cypher statements too can get quite complex.
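As an illustration, here is a minimal PySpark sketch of the single-statement approach, assuming the connector's write `query` option (each dataframe row is bound to the query as `event`, as in the `MERGE` above); the column names are hypothetical:

```python
# Hedged sketch: push the whole event through one Cypher statement.
# Columns (product_id, customer_id, quantity) are illustrative.
df.write.format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", "bolt://localhost:7687") \
    .option("query", """
        MERGE (p:Product {id: event.product_id})
        MERGE (c:Customer {id: event.customer_id})
        MERGE (c)-[:BOUGHT {quantity: event.quantity}]->(p)
    """) \
    .save()
```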

==== Pros

@@ -248,7 +248,7 @@ available on the server**.
It's impossible to pick a single batch size that works for everyone, because how much memory your transactions
take up depends on the number of properties & relationships, and other factors. A good, reasonably aggressive value
to try is around 20,000 - but you can increase this number if your data is small, or if you have a lot of memory
on the server. Lower the number if it's a small database server, or if the data you're pushing has many large
properties.
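For example, a hedged sketch of a write that raises the batch size (the label and connection details are illustrative; `batch.size` is the option discussed above):

```python
# Hedged sketch: larger batches mean fewer transactions for the same data.
df.write.format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Person") \
    .option("batch.size", "20000") \
    .save()
```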

=== Tune your Neo4j Memory Configuration
@@ -266,7 +266,7 @@ At the Neo4j Cypher level, it's very common to use the Spark connector in a way
In Neo4j, this looks up a node by some "key" and then creates it only if it does not already exist.

[NOTE]
**It is strongly recommended to assert indexes or constraints on any graph property that you use as part of
`node.keys`, `relationship.source.node.keys`, `relationship.target.node.keys` or other similar key options**
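A minimal sketch of a keyed write that benefits from such a constraint, assuming `:Person(id)` is already constrained (e.g. `CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE`); names are illustrative:

```python
# Hedged sketch: Overwrite mode merges on the declared key, which is only
# fast when :Person(id) is backed by an index or uniqueness constraint.
df.write.format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Person") \
    .option("node.keys", "id") \
    .save()
```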

A common source of poor performance is to write Spark code that generates `MERGE` cypher, or otherwise tries
@@ -314,7 +314,7 @@ extreme cases with too much parallelism, Neo4j may reject the writes with lock c
[NOTE]
**You can use as many partitions as there are cores in the Neo4j server, if you have properly partitioned your data to avoid Neo4j locks**

There is an exception to the "1 partition" rule above; if your data writes are partitioned ahead of time to avoid locks, you
can generally do as many write threads to Neo4j as there are cores in the server. Suppose we want to write a long list of `:Person` nodes, and we know they are distinct by the person `id`. We might stream those into Neo4j in 4 different partitions, as there will not be any lock contention.
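A minimal sketch of that pattern (the dataframe and column names are hypothetical):

```python
# Hedged sketch: 4 writer partitions are safe here because rows are
# distinct by `id`, so no two partitions touch the same node.
people_df.repartition(4, "id") \
    .write.format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", "bolt://localhost:7687") \
    .option("labels", ":Person") \
    .save()
```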

== Schema Considerations
8 changes: 4 additions & 4 deletions doc/docs/modules/ROOT/pages/faq.adoc
@@ -3,9 +3,9 @@

== How can I speed up writes to Neo4j?

The Spark connector fundamentally writes data to Neo4j in batches. Neo4j is a transactional
database, and so all modifications are made within a transaction. Those transactions in turn
have overhead.

The two simplest ways of increasing write performance are:

* Increase the batch size (option `batch.size`). The larger the batch, the fewer transactions are executed to write all of your data, and the less transactional overhead is incurred.
@@ -35,7 +35,7 @@ environment operates in terms of DataFrames as it always did, and this connector

== Can this connector be used for pre-processing of data and loading into Neo4j?

-Yes. This connector enables spark to be used as a good method of loading data directly into Neo4j. See link:architecture.adoc[the architecture section] for a detailed discussion of
+Yes. This connector enables spark to be used as a good method of loading data directly into Neo4j. See xref:architecture.adoc[the architecture section] for a detailed discussion of
"Normalized Loading" vs. "Cypher Destructuring" and guidance on different approaches for how to do performant data loads into Neo4j.

== My writes are failing due to Deadlock Exceptions
@@ -46,7 +46,7 @@ link:https://neo4j.com/developer/kb/explanation-of-error-deadlockdetectedexcepti

Typically this is caused by too much parallelism in writing to Neo4j. For example, when you
write a relationship `(:A)-[:REL]->(:B)`, this creates a "lock" in the database on both nodes.
If another thread is simultaneously attempting to write to those nodes, deadlock
exceptions can result and a transaction will fail.

In general, the solution is to repartition the dataframe prior to writing it to Neo4j, to avoid
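A minimal sketch of that idea, assuming the write `query` option (rows bound as `event`) and the `(:A)-[:REL]->(:B)` shape from the example above; collapsing to one partition trades parallelism for zero lock contention:

```python
# Hedged sketch: serialize relationship writes to avoid deadlocks on
# shared nodes. Labels, keys, and columns (a_id, b_id) are illustrative.
rels_df.coalesce(1) \
    .write.format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", "bolt://localhost:7687") \
    .option("query", """
        MATCH (a:A {id: event.a_id}), (b:B {id: event.b_id})
        MERGE (a)-[:REL]->(b)
    """) \
    .save()
```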
10 changes: 5 additions & 5 deletions doc/docs/modules/ROOT/pages/gds.adoc
@@ -5,7 +5,7 @@
This chapter provides information on using the Neo4j Connector for Apache Spark with Neo4j's Graph Data Science Library.
--

-link:https://neo4j.com/graph-data-science-library/[Neo4j's Graph Data Science (GDS) Library] lets data scientists benefit from powerful graph algorithms. It provides unsupervised machine learning methods and heuristics that learn and describe the topology of your graph. The GDS Library includes hardened graph algorithms with enterprise features, like deterministic seeding for consistent results and reproducible machine learning workflows.
+link:{url-neo4j-product-gds-lib}[Neo4j's Graph Data Science (GDS) Library] lets data scientists benefit from powerful graph algorithms. It provides unsupervised machine learning methods and heuristics that learn and describe the topology of your graph. The GDS Library includes hardened graph algorithms with enterprise features, like deterministic seeding for consistent results and reproducible machine learning workflows.

GDS Algorithms are bucketed into 5 "families":

@@ -17,13 +17,13 @@ GDS Algorithms are bucketed into 5 "families":
== GDS Operates via Cypher

-All of the link:https://neo4j.com/docs/graph-data-science/current/[functionality of GDS] is used by issuing cypher queries. As such, it is easily
+All of the link:{url-neo4j-gds-manual}[functionality of GDS] is used by issuing cypher queries. As such, it is easily
accessible via Spark, because the Neo4j Connector for Apache Spark can issue Cypher queries and read their results back. This combination means
that you can use Neo4j & GDS as a graph co-processor in an existing ML workflow that you may implement in Apache Spark.

== Example

-In the link:https://github.com/utnaf/spark-connector-notebooks[sample Zeppelin Notebook repository], there is a GDS example that can be run against
+In the link:{url-gh-spark-notebooks}[sample Zeppelin Notebook repository], there is a GDS example that can be run against
a Neo4j Sandbox, showing how to use the two together.

=== Create a Virtual Graph in GDS Using Spark
@@ -69,7 +69,7 @@ To run an analysis, the result is just another Cypher query, executed as a spark
%pyspark
query = """
CALL gds.pageRank.stream('got-interactions')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
"""
@@ -92,7 +92,7 @@ df.show()
=== Streaming versus Persisting GDS Results

When link:https://neo4j.com/docs/graph-data-science/current/common-usage/running-algos/[running GDS algorithms] the library gives you the choice
of either streaming the results of the algorithm back to the caller, or mutating the underlying graph. Using GDS together with Spark provides an
additional option of transforming or otherwise using a GDS result. Ultimately, either modality will work with the Neo4j Connector for Apache
Spark, and it is left up to you to decide what's best for your use case.
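For the persisting side, a hedged sketch: the write-mode variant of an algorithm can be invoked through the same read mechanism, since the connector only requires that the query return something (the procedure and YIELD column follow the GDS manual; the graph name comes from the example above):

```python
# Hedged sketch: persist PageRank scores into the graph, then confirm.
write_query = """
CALL gds.pageRank.write('got-interactions', {writeProperty: 'pageRank'})
YIELD nodePropertiesWritten
RETURN nodePropertiesWritten
"""

result = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://localhost:7687") \
    .option("query", write_query) \
    .load()

result.show()
```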

2 changes: 1 addition & 1 deletion doc/docs/modules/ROOT/pages/quickstart.adoc
@@ -312,7 +312,7 @@ RETURN count(p) AS count

=== Examples

-You can find examples on how to use the Neo4j Connector for Apache Spark at link:https://github.com/utnaf/spark-connector-notebooks[this repository].
+You can find examples on how to use the Neo4j Connector for Apache Spark at link:{url-gh-spark-notebooks}[this repository].
It's a collection of Zeppelin Notebooks with different usage scenarios, along with a getting started guide.

The repository is under constant development, so feel free to submit your own examples.
4 changes: 2 additions & 2 deletions doc/docs/modules/ROOT/pages/reading.adoc
@@ -32,7 +32,7 @@ spark.read.format("org.neo4j.spark.DataSource")

.List of available read options
|===
|Setting Name |Description |Default Value |Required

|`query`
|Cypher query to read the data
@@ -161,7 +161,7 @@ If your query returns a graph entity please use the `labels` or `relationship` m

The struct of the Dataset returned by the query is influenced by the query itself;
in this particular context it could happen that the connector won't be able to sample the schema from the query,
-in these cases we suggest trying with the option `schema.strategy` set to `string` as described <<bookmark-string-strategy,here>>.
+in these cases we suggest trying with the option `schema.strategy` set to `string` as described xref:quickstart.adoc#bookmark-string-strategy[here].
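A hedged sketch of that fallback (the query and connection details are illustrative):

```python
# Hedged sketch: force sampled columns to strings when inference fails.
df = spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://localhost:7687") \
    .option("schema.strategy", "string") \
    .option("query", "MATCH (p:Person) RETURN p.name AS name, p.age AS age") \
    .load()
```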

[NOTE]
Read query *must always* return some data (read: *must always* have a return statement).
8 changes: 5 additions & 3 deletions doc/package.json
@@ -15,7 +15,8 @@
"server": "forever start server.js",
"start": "npm run server && npm-watch",
"stop": "forever stop server.js",
"build:docs": "antora --fetch --stacktrace docs.yml"
"build:docs": "antora --fetch --stacktrace docs.yml",
"lint:links": "node tasks/lint-links.js"
},
"license": "ISC",
"dependencies": {
@@ -25,7 +26,8 @@
  },
  "devDependencies": {
    "express": "^4.17.1",
-    "npm-watch": "^0.7.0",
-    "forever": "^3.0.2"
+    "forever": "^3.0.2",
+    "hyperlink": "^4.6.0",
+    "npm-watch": "^0.7.0"
  }
}
4 changes: 3 additions & 1 deletion doc/server.js
@@ -3,6 +3,8 @@ const express = require('express')
const app = express()
app.use(express.static('./build/site'))

-app.get('/', (req, res) => res.redirect('/spark'))
+app.use('/static/assets', express.static('./build/site/developer/_'))
+
+app.get('/', (req, res) => res.redirect('/developer/spark'))

app.listen(8000, () => console.log('📘 http://localhost:8000'))
54 changes: 54 additions & 0 deletions doc/tasks/lint-links.js
@@ -0,0 +1,54 @@
const path = require('path')
const hyperlink = require('hyperlink')
const TapRender = require('@munter/tap-render')

const root = path.join(__dirname, '..')

;(async () => {
  const tapRender = new TapRender()
  tapRender.pipe(process.stdout)
  try {
    const skipPatterns = [
      // initial redirect
      'load index.html',
      'load docs/index.html',
      // google fonts
      'load https://fonts.googleapis.com/',
      // static resources
      'load static/assets',
      // external links
      // /
      'load try-neo4j',
      // /developer
      'load developer',
      // /labs
      'load labs',
      // /docs
      'load docs',
      // rate limit on twitter.com (will return 400 code if quota exceeded)
      'external-check https://twitter.com/neo4j',
      // workaround: not sure why the following links are not resolved properly by hyperlink :/
      'load build/site/developer/spark/quickstart/reading',
      'load build/site/developer/spark/quickstart/writing'
    ]
    const skipFilter = (report) => {
      return Object.values(report).some((value) => {
        return skipPatterns.some((pattern) => String(value).includes(pattern))
      })
    }
    await hyperlink(
      {
        root,
        inputUrls: ['build/site/developer/spark/index.html'],
        skipFilter: skipFilter,
        recursive: true
      },
      tapRender
    )
  } catch (err) {
    console.log(err.stack)
    process.exit(1)
  }
  const results = tapRender.close()
  process.exit(results.fail ? 1 : 0)
})()
