Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When first and last cycle are out of date, should we try refreshing them and retrying before we throw? #913

Closed
nicktindall opened this issue Sep 29, 2021 · 5 comments
Assignees

Comments

@nicktindall
Copy link
Contributor

In StoreTailer.toEnd, we get the lastCycle which is only checked against the file listing every 60s. In the event we've got lastCycle then failed to retrieve it from the SingleChronicleQueueStore is it worth forcing a directoryListing.refresh(true) and retrying before throwing an exception?

@nicktindall nicktindall changed the title When first and last cycle are out of date, should we try refreshing them before we throw? When first and last cycle are out of date, should we try refreshing them and retrying before we throw? Sep 29, 2021
@glukos
Copy link
Contributor

glukos commented Oct 12, 2021

@JerryShea @nicktindall
NotReachedException is thrown only by originalToEnd(), where this trick is already implemented:

    @NotNull
    private ExcerptTailer callOriginalToEnd() {
        try {
            return originalToEnd();
        } catch (NotReachedException e) {
            queue.refreshDirectoryListing();
            // due to a race condition, where the queue rolls as we are processing toEnd()
            // we may get a NotReachedException  ( see https://github.com/OpenHFT/Chronicle-Queue/issues/702 )
            // hence are are just going to retry.
            try {
                return originalToEnd();
            } catch (Exception ex) {
                Jvm.warn().on(getClass(), "Unable to find toEnd() so winding to the start " + ex);
                return toStart();
            }
        }
    }

In normal flow, when files are not manually moved, shared variable TableDirectoryListing#maxCycleValueis incremented by the roller. Other threads / processes will be aware of the change. I don't see any issues here so I'd close this ticket.

@nicktindall
Copy link
Contributor Author

I think the scenario we saw was in optimizedToEnd, where an IllegalStateException is thrown

final SingleChronicleQueueStore wireStore = queue.storeForCycle(
lastCycle, queue.epoch(), false, this.store);
this.setCycle(lastCycle);
if (wireStore == null)
throw new IllegalStateException("Store not found for cycle " + Long.toHexString(lastCycle) + ". Probably the files were removed? queue=" + queue.fileAbsolutePath());

It's also repeated in approximateLastCycle2:

final SingleChronicleQueueStore wireStore = (cycle == lastCycle) ? this.store : queue.storeForCycle(
lastCycle, queue.epoch(), false, this.store);
this.setCycle(lastCycle);
if (wireStore == null)
throw new IllegalStateException("Store not found for cycle " + Long.toHexString(lastCycle) + ". Probably the files were removed? queue=" + queue.fileAbsolutePath());

but that seems to be retried already
while (prevCycle < lastCycle) {
lastCycle--;
try {
return approximateLastCycle2(lastCycle);
} catch (IllegalStateException e) {
// try again.
}
}

This is probably seperate enough to what you're working on that I could try and do a PR for this today, it doesn't seem that hard

@nicktindall
Copy link
Contributor Author

nicktindall commented Oct 13, 2021

I just made an attempt at this and I don't think it's a good idea.

The problem is, if we implement it, it'll only refresh the cache values if we hit a path that looks for one of these missing files, if we don't, the invalid cache will happily live on.

I think #924 is actually the correct solution. If we have a file watcher we can cheaply (I think) check the validity of the cache (by looking for events in the watcher) every time we use values from it and if it isn't valid we can trigger a refresh. I'm not going to fix this and will close if you agree @glukos @JerryShea ?

@nicktindall
Copy link
Contributor Author

(i.e. maybe we can call refreshDirectoryListing prior to every use of the cached values, and in refresh we can quickly see if there's been any file-system change events, and if not we can skip the refresh)
I think it would make the "force" param redundant because we'd know if were actually necessary to do it or not.
DISCLAIMER: this is based on no real analysis, just spit-balling here

@glukos
Copy link
Contributor

glukos commented Oct 13, 2021

I'm ok with closing this.

In my understanding, #924 is mostly about performance. It will save us from unnecessary refreshes, but it will not fix a data race - obviously, WatchService doesn't priovide any guarantees on how fast events will arrive.
As I see it, the initial purpose of refresh was dealing with manual manipulations with queue dir. Ideally, we shouldn't rely on the refresh to resolve data races that happen in normal workflow.

Some more spit-balling - that's how tailers should find out that a new cycle is created:

                    if (!readOnly && createIfAbsent && wire.writeFirstHeader()) {
                        // implicitly reserves the wireStore for this StoreSupplier
                        wireStore = storeFactory.apply(that, wire);
                        wire.updateFirstHeader();
                        wire.usePadding(wireStore.dataVersion() > 0);

                        wireStore.initIndex(wire);
                        // do not allow tailer to see the file until it's header is written
                        directoryListing.onFileCreated(path, cycle); // increments shared variable
                        // allow directoryListing to pick up the file immediately

Firstly, we write header to a mmapped memory, then we increment shared counter.
Tailer in another thread / process may go here:

                if (!createIfAbsent &&
                        (cycle > directoryListing.getMaxCreatedCycle()
                                || cycle < directoryListing.getMinCreatedCycle()
                                || !path.exists())) {
                    return null;

and find updated getMaxCreatedCycle(), but !path.exists() way be true as per file creation hasn't been synced by OS yet.
Should we consider spinning for a while here before returning null?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants