Investigate about possible useless Neo4j connections #224

Open
utnaf opened this issue Oct 13, 2020 · 2 comments

utnaf commented Oct 13, 2020

While writing tests I came across this scenario:

  @Test
  def testComplexReturnStatementNoValues(): Unit = {
    val df = ss.read.format(classOf[DataSource].getName)
      .option("url", SparkConnectorScalaSuiteIT.server.getBoltUrl)
      .option("query",
        """MATCH (p:Person)-[b:BOUGHT]->(pr:Product)
          |RETURN id(p) AS personId, id(pr) AS productId, {quantity: b.quantity, when: b.when} AS map, "some string" as someString, {anotherField: "201", and: 1} as map2""".stripMargin)
      .option("schema.strategy", "string")
      .load()

    assertEquals(Seq("personId", "productId", "map", "someString", "map2"), df.columns.toSeq)
  }

Given that I'm 101% sure that the assertEquals is green, executing this causes the following timeout error:

java.lang.AssertionError: Timeout hit (30 seconds) while waiting for condition to match: 
Expected: <true>
     but: was <false>
Expected :<true>
Actual   :<false>

Connection log is:

For test testComplexReturnStatementNoValues => connections before: 2, after: 3

Including an action in the test (like df.count()) makes the whole thing work: no more errors and the test is green.
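Concretely, the workaround is a single extra line anywhere in the test body after the load():

    // Workaround described above: adding any action (df.count() here) after
    // load() makes the timeout error go away and the test passes.
    df.count()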

We should investigate whether we have a useless connection hanging around that causes the problem, or whether it's a test configuration issue.


moxious commented Oct 15, 2020

Within the neo4j driver object it's possible to configure the size of the connection pool that it opens when you initialize it. If you don't configure this, I think you get something like 3-5 connections, since the driver assumes you'll issue multiple queries and so on.

If it is the case that neo4j operations are always single-threaded within a worker node, it might make sense to explicitly configure max connections to be 1 for all driver instances.
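Roughly what I mean, purely as a sketch with the plain Java driver API (URL and credentials are placeholders; the connector would have to pass a Config like this wherever it actually builds its driver instances):

  import org.neo4j.driver.{AuthTokens, Config, GraphDatabase}

  // Sketch only: cap the driver's connection pool at a single connection.
  val config = Config.builder()
    .withMaxConnectionPoolSize(1)
    .build()

  val driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    AuthTokens.basic("neo4j", "password"),
    config)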


moxious commented Oct 16, 2020

Related to this: I can get some very weird driver errors (not connector errors) when playing around with connection schemes.

For example, imagine any simple read query to the database, doesn't matter what.

  • Using the notebook repo, do a simple read (which uses bolt:// by default)
  • Now do the same thing, but switch connection URL in the example to neo4j://
  • Now do the same thing, switching back to bolt://
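As a sketch, that sequence through the connector would look something like this, reusing the `ss` SparkSession and `DataSource` class from the test at the top of this issue (URLs and query are placeholders):

  // Sketch of the scheme switch described above.
  def readWith(url: String) =
    ss.read.format(classOf[DataSource].getName)
      .option("url", url)
      .option("query", "MATCH (n) RETURN count(n) AS c")
      .load()

  readWith("bolt://localhost:7687").show()   // 1. plain bolt://
  readWith("neo4j://localhost:7687").show()  // 2. same read via neo4j://
  readWith("bolt://localhost:7687").show()   // 3. back to bolt://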

The strange errors I'm seeing may be related to connection reuse in the worker node? I'm guessing. I'm not reporting this as a separate issue right now because I can't reliably reproduce it. But related to this ticket, some questions arise for me:

  • What should the strategy be for connection pooling within a worker node?
  • After some action on the worker node is complete, should the driver instance stay open? Keeping it open reduces startup time for the next operation, but what would happen if I wanted to create a new driver with different settings on top of that? How would it be handled?
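One possible shape for an answer, purely as a sketch and not what the connector actually does today: keep one driver per distinct connection settings inside the worker, so a request with different settings gets its own driver instead of reusing a mismatched one, and close everything when the worker is done. All names here are illustrative:

  import java.util.concurrent.ConcurrentHashMap
  import org.neo4j.driver.{AuthTokens, Config, Driver, GraphDatabase}

  // Hypothetical per-worker cache: one driver per (url, pool size) pair.
  object DriverCache {
    private val drivers = new ConcurrentHashMap[(String, Int), Driver]()

    def getOrCreate(url: String, maxPoolSize: Int): Driver =
      drivers.computeIfAbsent((url, maxPoolSize), _ =>
        GraphDatabase.driver(
          url,
          AuthTokens.none(),
          Config.builder().withMaxConnectionPoolSize(maxPoolSize).build()))

    // Close all cached drivers when the worker shuts down, so no connections linger.
    def closeAll(): Unit = {
      drivers.values().forEach(_.close())
      drivers.clear()
    }
  }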
