Investigate about possible useless Neo4j connections #224

Open
utnaf opened this issue Oct 13, 2020 · 2 comments

utnaf commented Oct 13, 2020

While writing tests I came across this scenario:

  @Test
  def testComplexReturnStatementNoValues(): Unit = {
    val df = ss.read.format(classOf[DataSource].getName)
      .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
      .option("query",
        """MATCH (p:Person)-[b:BOUGHT]->(pr:Product)
          |RETURN id(p) AS personId, id(pr) AS productId, {quantity: b.quantity, when: b.when} AS map, "some string" as someString, {anotherField: "201", and: 1} as map2""".stripMargin)
      .option("schema.strategy", "string")
      .load()

    assertEquals(Seq("personId", "productId", "map", "someString", "map2"), df.columns.toSeq)
  }

Given that I'm 101% sure that the assertEquals is green, executing this causes the following timeout error:

java.lang.AssertionError: Timeout hit (30 seconds) while waiting for condition to match: 
Expected: <true>
     but: was <false>
Expected :<true>
Actual   :<false>

Connection log is:

For test testComplexReturnStatementNoValues => connections before: 2, after: 3

Including an action in the test (like df.count()) makes the whole thing work: no more errors and the test is green.
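Concretely, the workaround is a single extra line anywhere in the test body after the load():

    // Workaround described above: adding any action (df.count() here) after
    // load() makes the timeout error go away and the test passes.
    df.count()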

We should investigate whether we have a useless connection hanging around that causes the problem, or whether it's a test configuration issue.


moxious commented Oct 15, 2020

Within the neo4j driver object it's possible to configure the size of the connection pool that it opens when you initialize it. If you don't configure this, I think you get something like 3-5 connections, since the driver assumes you'll issue multiple queries and so on.

If it is the case that neo4j operations are always single-threaded within a worker node, it might make sense to explicitly configure max connections to be 1 for all driver instances.
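Roughly what I mean, purely as a sketch with the plain Java driver API (URL and credentials are placeholders; the connector would have to pass a Config like this wherever it actually builds its driver instances):

  import org.neo4j.driver.{AuthTokens, Config, GraphDatabase}

  // Sketch only: cap the driver's connection pool at a single connection.
  val config = Config.builder()
    .withMaxConnectionPoolSize(1)
    .build()

  val driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    AuthTokens.basic("neo4j", "password"),
    config)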


moxious commented Oct 16, 2020

Related to this: I can get some very weird driver errors (not connector errors) when playing around with connection schemes.

For example, imagine any simple read query to the database, doesn't matter what.

  • Using the notebook repo, do a simple read (which uses bolt:// by default)
  • Now do the same thing, but switch connection URL in the example to neo4j://
  • Now do the same thing, switching back to bolt://
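As a sketch, that sequence through the connector would look something like this, reusing the `ss` SparkSession and `DataSource` class from the test at the top of this issue (URLs and query are placeholders):

  // Sketch of the scheme switch described above.
  def readWith(url: String) =
    ss.read.format(classOf[DataSource].getName)
      .option("url", url)
      .option("query", "MATCH (n) RETURN count(n) AS c")
      .load()

  readWith("bolt://localhost:7687").show()   // 1. plain bolt://
  readWith("neo4j://localhost:7687").show()  // 2. same read via neo4j://
  readWith("bolt://localhost:7687").show()   // 3. back to bolt://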

The strange errors I'm seeing may be related to connection reuse in the worker node? I'm guessing. I'm not reporting this as a separate issue right now because I can't reliably reproduce it. But related to this ticket, some questions arise for me:

  • What should the strategy be for connection pooling within a worker node?
  • After some action on the worker node is complete, should the driver instance stay open? Keeping it open reduces startup time for the next operation, but what would happen if I wanted to create a new driver with different settings on top of that? How would it be handled?
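One possible shape for an answer, purely as a sketch and not what the connector actually does today: keep one driver per distinct connection settings inside the worker, so a request with different settings gets its own driver instead of reusing a mismatched one, and close everything when the worker is done. All names here are illustrative:

  import java.util.concurrent.ConcurrentHashMap
  import org.neo4j.driver.{AuthTokens, Config, Driver, GraphDatabase}

  // Hypothetical per-worker cache: one driver per (url, pool size) pair.
  object DriverCache {
    private val drivers = new ConcurrentHashMap[(String, Int), Driver]()

    def getOrCreate(url: String, maxPoolSize: Int): Driver =
      drivers.computeIfAbsent((url, maxPoolSize), _ =>
        GraphDatabase.driver(
          url,
          AuthTokens.none(),
          Config.builder().withMaxConnectionPoolSize(maxPoolSize).build()))

    // Close all cached drivers when the worker shuts down, so no connections linger.
    def closeAll(): Unit = {
      drivers.values().forEach(_.close())
      drivers.clear()
    }
  }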
