Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getStatus() for job reference of finished light job should throw JobNotFoundException [HZ-2753] #19257

Open
olukas opened this issue Aug 3, 2021 · 9 comments · May be fixed by hazelcast/hazelcast-client-protocol#469
Assignees
Labels
Milestone

Comments

@olukas
Copy link
Contributor

olukas commented Aug 3, 2021

JetInstance.getJob() for a finished light job doesn't throw JobNotFoundException, but instead reports the job as failed with the JNFE as the cause, which is wrong. The JNFE should be thrown from the getJob() method.

Reproducer:

HazelcastInstance inst = createHazelcastInstance();
HazelcastInstance client = createHazelcastClient();
Job job1 = inst.getJet().newLightJob(streamingDag());
assertJobExecuting(job1, inst);
Job job2 = client.getJet().getJob(job1.getId());
job1.cancel();
assertThatThrownBy(job1::join)
        .isInstanceOf(CancellationException.class);

sleepSeconds(1);
// this fails - job2.getStatus() returns FAILED
assertThatThrownBy(job2::getStatus)
        .isInstanceOf(JobNotFoundException.class);
@olukas olukas added Type: Defect Team: Core Source: Internal PR or issue was opened by an employee Module: Jet Issues/PRs for Jet labels Aug 3, 2021
@viliam-durina viliam-durina added this to the 5.0 milestone Aug 4, 2021
@fbarotov
Copy link
Contributor

When I run the following test it passes

  @Test
  public void jobStatusTest() {
      HazelcastInstance inst = createHazelcastInstance();
      HazelcastInstance client = createHazelcastClient();
      
      Job job1 = inst.getJet().newLightJob(streamingDag());
      assertJobExecuting(job1, inst);
      Job job2 = client.getJet().getJob(job1.getId());
      job1.cancel();

      assertThatThrownBy(job1::join)
              .isInstanceOf(CancellationException.class);

      sleepSeconds(1);

      assertEquals(job1.getStatus(), FAILED);
      assertEquals(job2.getStatus(), RUNNING);

      assertNull(inst.getJet().getJob(job1.getId()));
      assertNull(client.getJet().getJob(job1.getId()));
  }

🤔

@viliam-durina viliam-durina modified the milestones: 5.0, 5.1, 5.0.1 Sep 7, 2021
@AyberkSorgun AyberkSorgun modified the milestones: 5.0.1, 5.1 Nov 18, 2021
@viliam-durina viliam-durina modified the milestones: 5.1, 5.2 Jan 25, 2022
@AyberkSorgun AyberkSorgun modified the milestones: 5.2, 5.2 Backlog Aug 16, 2022
@AyberkSorgun AyberkSorgun modified the milestones: 5.2 Backlog, 5.3 Backlog Oct 27, 2022
@k-jamroz
Copy link
Contributor

@burakgok this is very old issue, maybe already fixed.

@AyberkSorgun AyberkSorgun modified the milestones: 5.3 Backlog, 5.4 Backlog May 16, 2023
@burakgok
Copy link
Contributor

The title is incorrect because job proxies can only be obtained for existing jobs and that's why they cannot throw JobNotFoundException. The reproducer is also inconsistent with the description because it checks the result of Job.getStatus(). JetService.getJob(long) throws JobNotFoundException for completed/failed light jobs since it supported light jobs (0d33469).

@k-jamroz
Copy link
Contributor

Job proxy is obtained before light job finishes but method on it is invoked after the job finished.

@k-jamroz k-jamroz reopened this Jul 17, 2023
@burakgok
Copy link
Contributor

Job proxies cannot claim that the job to which they refer doesn't exist. If they cannot perform a function because the job is completed and ultimately removed, they have to report the situation as such. For Job.getStatus() for example, if the job is completed, no matter what state JobCoordinationService is in, the job should be reported as COMPLETED. Another example: if a status listener cannot be added due to job completion, an error message saying that "status listener cannot be added to COMPLETED job" should be thrown.

@k-jamroz
Copy link
Contributor

Slightly modified test from @fbarotov demonstrates one strange behaviour of the API:

    @Test
    public void jobStatusTest() {
        HazelcastInstance inst = createHazelcastInstance();
        HazelcastInstance client = createHazelcastClient();

        Job job1 = inst.getJet().newLightJob(streamingDag());
        sleepSeconds(1);  //TODO: use assert with timeout
        assertJobExecuting(job1, inst);
        Job job2 = client.getJet().getJob(job1.getId());
        job1.cancel();

        assertThatThrownBy(job1::join)
                .isInstanceOf(CancellationException.class);

        sleepSeconds(1);

        assertEquals(job1.getStatus(), FAILED);
        assertEquals(job2.getStatus(), RUNNING);  // succeeds!
        sleepSeconds(2);
        assertEquals(job2.getStatus(), RUNNING);  // fails

        assertNull(inst.getJet().getJob(job1.getId()));
        assertNull(client.getJet().getJob(job1.getId()));
    }

This is because first job2.getStatus() creates a local job future but is it not yet complete even though the job has already ended on member because the future wait for the member response but getStatus() returns immediately. Seems like yet another flavour of #23082.

@k-jamroz
Copy link
Contributor

k-jamroz commented Jul 17, 2023

Job proxies cannot claim that the job to which they refer doesn't exist.

Currently sometimes they do because the join future is lazily created if you create proxy using id. For example:

    public static DAG blockingBatchDag() {
        DAG dag = new DAG();
        dag.newVertex("v", () -> new MockP().initBlocks());
        return dag;
    }
    
    @Test
    public void jobStatusTest() {
        HazelcastInstance inst = createHazelcastInstance();
        inst.getConfig().getJetConfig().setCooperativeThreadCount(1);
        HazelcastInstance client = createHazelcastClient();

        Job job1 = inst.getJet().newLightJob(blockingBatchDag());
        sleepSeconds(1);
        Job job2 = client.getJet().getJob(job1.getId());
        assertNotNull(job2);

        MockP.unblock();

        job1.join();
        sleepSeconds(1);
        job2.join();  // throws JobNotFoundException
    }

@k-jamroz
Copy link
Contributor

k-jamroz commented Jul 17, 2023

also in the last example you get wrong job status if query on client side twice with sleep: RUNNING then FAILED which is wrong.

@github-actions github-actions bot changed the title getStatus() for job reference of finished light job should throw JobNotFoundException getStatus() for job reference of finished light job should throw JobNotFoundException [HZ-2753] Jul 21, 2023
@github-actions
Copy link
Contributor

Internal Jira issue: HZ-2753

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
7 participants