MNT-18308 - Add flag forceAsyncAclCreation #984

evasques · 2020-05-05T11:07:33Z

If time fixedAclMaxTransactionTime exceeds, force ACL creation to be async if flag is set to true.

I did create a new method isSubjectToAsyncAclCreation to try to improve code legibility and avoid an if-else hell and changed method setFixAclPending to use that method where I kept the previous logic and added the new feature also:

Old logic for reference:

if async call is NOT required, make regular call
else check transaction time and,
- if transaction time is NOT exceeded, make regular method call
- else check if node has children and,
  - if node does not have children, make regular call
  - else, do not make regular call to create ACL and add aspect to be processed by job later.

The new feature's intention is, if the flag is set to true, to force the node to be processed by the job later if of the following conditions are met:

No matter if call is sync or async;
No matter if node has children or not;
Defined max transaction time was exceeded ( system.fixedACLs.maxTransactionTime )

Also found that the current job (FixedAclUpdater) had a bug in it: it is multi-thread and did not control which nodes it has already picked up to process. This causes the batch class to pick up nodes twice in different threads and if the node had meanwhile already been processed in the other thread it would throw a null pointer when trying to access properties that were already unset. It did not affect the overall job behavior but it was not efficient.

In order to fix this I added a control list currentlyProcessingNodes that only adds new nodes for the next batch if they are nor currently being processed. When a new node is picked up it goes into that list and as soon as it is processed it is removed from that list - keeping the overall list size manageable with max objects being batchsize*numThreads. We do not have to keep processed nodes in that list as when they are processed, the aspect is removed and properties are unset, so they have no way of being picked up again by the job.

…g processed in another thread

…ransactionTime exceeds, force ACL creation to be async if flag is set to true

… turned on an flag turned off

killerboot

Please check your formatting settings in the IDE as I can see multiple problems in the changes.

The FixedAclUpdater can be run only once across the entire cluster because it is synchronised via JobLockService, see org.alfresco.repo.domain.permissions.FixedAclUpdater#execute
Once the lock is obtained, the AclWorkProvider counts all nodes with ContentModel.ASPECT_PENDING_FIX_ACL and then gets the batches of nodes to be updated to the workers by the batch size ordering by node IDs, also tracking the min and max of node ID that where sent for processing. See org.alfresco.repo.domain.permissions.FixedAclUpdater.GetNodesWithAspectCallback
So I'm not sure how the same node can appear in two separate batches of one FixedAclUpdater.

killerboot · 2020-05-05T14:29:44Z

src/main/java/org/alfresco/repo/domain/permissions/ADMAccessControlListDAO.java

+     * @param nodeId
+     * @param asyncCall
+     * @return


This is not a valid JavaDoc. Please remove these lines.

killerboot · 2020-05-05T14:39:05Z

Why is it requited to introduce a new system.fixedACLsUpdater.forceAsyncAclCreation flag?
As far as I understood the difference is that we are going to:

force async for already requested async calls even if transaction time was not exceeded
force async for nodes that don't have children
In which category changing the permissions on a large site will fall?

evasques · 2020-05-05T15:16:17Z

About the job: I found the bug by chance when trying out the job and replicated it by running the already existing unit-test (org.alfresco.repo.domain.permissions.FixedAclUpdaterTest) and despite it turning out successful I kept getting null pointer exceptions in the log, like this:

2020-05-05 15:54:58,971 ERROR [domain.permissions.FixedAclUpdater] [main] FixedAclUpdater: 15 error(s) detected. Last error from entry "workspace://SpacesStore/f1c73053-15be-435c-a371-bc0d0f5566fe" java.lang.NullPointerException at org.alfresco.repo.domain.permissions.AclDAOImpl.getInheritedAccessControlList(AclDAOImpl.java:1126) at org.alfresco.repo.domain.permissions.ADMAccessControlListDAO.setFixedAcls(ADMAccessControlListDAO.java:400) at org.alfresco.repo.domain.permissions.ADMAccessControlListDAO.setInheritanceForChildren(ADMAccessControlListDAO.java:337) at org.alfresco.repo.domain.permissions.FixedAclUpdater$AclWorker$1.doWork(FixedAclUpdater.java:274) at org.alfresco.repo.domain.permissions.FixedAclUpdater$AclWorker$1.doWork(FixedAclUpdater.java:1) at org.alfresco.repo.security.authentication.AuthenticationUtil.runAs(AuthenticationUtil.java:602) at org.alfresco.repo.domain.permissions.FixedAclUpdater$AclWorker.process(FixedAclUpdater.java:298) at org.alfresco.repo.domain.permissions.FixedAclUpdater$AclWorker.process(FixedAclUpdater.java:1) at org.alfresco.repo.batch.BatchProcessor$TxnCallback.execute(BatchProcessor.java:723) at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:450) at org.alfresco.repo.batch.BatchProcessor$TxnCallback.run(BatchProcessor.java:767) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:834)

I didn't notice the callback setting the min value of node id until you pointing it out.

Currently it gets the nodes in DESC order, iterates through them and keeps re-setting the maxNodeId with the current node id value. The id with the lowest number is LAST to be processed, so that's the one that sticks as the maxNodeId and we have the problem

A better solution to what I've done would in fact would be resetting the maxNodeId only when the current value is above the previously set maxNodeId

evasques · 2020-05-05T15:28:05Z

Why is it requited to introduce a new system.fixedACLsUpdater.forceAsyncAclCreation flag?
As far as I understood the difference is that we are going to:

force async for already requested async calls even if transaction time was not exceeded

force async for nodes that don't have children
In which category changing the permissions on a large site will fall?

@killerboot the flag is there and set as false by default to keep existing functionality: if we delegate all ACL creations that take more than 10s to the job, without customers having a say it it I believe it is kinda dangerous, thats why I put in the flag: only who want this and know the consequences (more critical one is ACLs will not be enforced when using search on that folder and children until job runs) can set it.

All pre-existing funcionality is kept if the flag is off: currently is only delegates to job if call is async AND node has children. Please note that the manage permissions calls a JS webscript is sync.

If we set the flag to true, in practical terms: even if the call is SYNC (so like when a user manages permissions through Share), if the time is exceeded, we will delegate all ACL creation of that tree to the job, even nodes without children.

killerboot · 2020-05-05T16:42:45Z

@evasques Thanks for clarifications.
In my point of view the property is not particularly useful because it is required to restart the server in order to set it. I don't think that Administrator can predict in advance if something like "large site permission change" is going to happen.
I would like to propose to change the approach a bit. The ACL changes are going to be processed synchronously up until the transaction time is exceeded and then scheduled for async processing for any ADMAccessControlListDAO#setFixedAcls call by default. It should be possible to override this behaviour and schedule the async processing straight away by setting the flag in Java APIs. This parameter (asyncCall) is already present in ADMAccessControlListDAO#setFixedAcls()

evasques · 2020-05-05T17:35:59Z

@killerboot
I need a bit more clarification on your propsal: from what I understood, you are proposing that I switch automatically the current async flag (FixedAclUpdater.FIXED_ACL_ASYNC_CALL_KEY) when time is exceeded, right?

If I understood correctly i have to point two possible issues on this:

We will be changing the default user experience: user manages permissions and even if it takes 20s the user expects that these ACLs will be set. If we change the default behavior it will only set the ACL transactions that it can fit in the first 10s, so the remaing ACL transactions will only be indexed after the job is run (by default it runs at midnight I think). So those nodes will be searchable by previous ACL until job is run and user had no feedback on this happening. The user will leave the 'manage permissions' screen with a successful message and have no idea if it was all set or if some (how many really depends on system's performance) will be done later at night.
When set to async and time is exceeded, it would in fact stop creating ACLs on other folders and start adding the aspect so the job can pick it up BUT it will continue to set ACLs on itself (the first node after time is exceeded) if its not a folder and all siblings that aren't folders. This can still be thousand of nodes.

I know the disadvantage of my approach is that an admin would have to restart the server for this. But the other side setting this would mean that: he knows the impact of doing so and will tweak the other configs also for this to make sense in his environment (like adjusting the fixedAclMaxTransactionTime for a larger number than the default 10s or change when the job runs)

Some systems can live with 10-30s transactions and they won't bat an eye if they know its a large folder and will expect acls created after that time.

But it is bad if we frequently have trees with 6M nodes inside them and need to set ACLs for the first time. I don't know the average transaction time on that case, but it should be minutes, not seconds and those are the ones we want to block from happening synchronously.

I think Alfresco's default behavior works fine for most customers and this only makes sense it really big repos - and big repos have high performance machines. In 10s one system can add 1000 nodes in the transaction (locally that's my average) but on a proper machine we can probably get 5x that. Time passed is hardly a unit of measure for the transaction size so telling that all systems will go async on ACL creation after that 10s mark means really different things depending on the system.

killerboot · 2020-05-06T11:07:47Z

@evasques regarding the issues you pointed out:

This is the same behaviour for the async calls at the moment. My proposal is to change it for sync calls as well. This will have to be documented, but it is the solution to avoid the creation of large transactions in repository. It may be a good idea to log a warning every time we have to do async processing because of time limits, If these situations happen too often the administrator will have to take action.
Correct me if I'm wrong but there is a recursion between ADMAccessControlListDAO#setFixedAcls() and #setFixAclPending(), so the time of txn is constantly checked during propagation to children, and if it is exceeded the processing starts to be async. So setting permission on itself should not take long, but the propagation to children may take longer and that is done async if the time is up.

If the administrator decides that long transactions are fine for the system, the configuration can be changed for bigger limits, but this may result in indexing issues in Search Services.

evasques · 2020-05-06T12:16:40Z

@killerboot

You are completely right on everything and I agree it would be best if this would be the default behavior, its the only way we can in fact mitigate the problem. I'm just predicting user complaints: "I set the permissions on the folder this morning, everything seems OK, look at the manage permissions page, user X does not have permissions on this folder, its all set correctly. How can he sill access the document?!". I think if we do this, then we need to at least inform the user that not all permissions were set and will be processed later by a job. Users don't read release notes and not all system administrators are so keen on informing users of small changes. If this by chance happens months after the upgrade, it will be hard for them to tie it to the upgrade and recheck release notes to see that this is a new feature, they will think there is a big problem. Do you think a message on the POST permissions response saying something like "There are too many permissions to process. Permissions will be set later by scheduled job." could help?
Yes however it does not break the current cycle (on parent folder). It just stops propagating on any other folder it passes by (siblings or next branch in tree), so it goes through all its siblings that are not folders.

killerboot · 2020-05-06T12:31:47Z

The release notes are exactly for things like that. Unfortunately we can't guarantee that people are going to read them.
I'm not sure that the message in the response body helps much as it is not going to be displayed in Share UI. I would consider UI changes to be out of scope of this issue, those will need localisation and potentially substantial amount of work making backport to 5.2 harder. I would add a warning in logs and review the default settings for time limit to be sure that administrator will notice it. Please keep in mind that Share or proxy between repository and Share has a timeout, so it can't wait for too long.
Can you please take a look if this can be fixed so that the timeout breaks the cycle?

evasques · 2020-05-11T09:38:57Z

@killerboot
I opened up a new PR with the solution you provided: #987

evasques added 3 commits May 5, 2020 11:28

MNT-18308 - fix job as it was re-picking nodes that were already bein…

9ace35b

…g processed in another thread

MNT-18308 - Added flag forceAsyncAclCreation to, if time fixedAclMaxT…

d5d795e

…ransactionTime exceeds, force ACL creation to be async if flag is set to true

MNT-18308 - Added two unit tests to assert correct behaviour for flag…

db71d2c

… turned on an flag turned off

evasques requested review from jottley and killerboot May 5, 2020 11:07

jottley approved these changes May 5, 2020

View reviewed changes

killerboot suggested changes May 5, 2020

View reviewed changes

killerboot reviewed May 5, 2020

View reviewed changes

evasques force-pushed the fix/MNT-18308_patch_forceACLcreationAyncFlag branch from 77f3602 to db71d2c Compare May 5, 2020 16:17

simplified job fix and corrected formatting

b9e60ed

evasques mentioned this pull request May 11, 2020

MNT-18308 force async acl creation #987

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MNT-18308 - Add flag forceAsyncAclCreation #984

MNT-18308 - Add flag forceAsyncAclCreation #984

evasques commented May 5, 2020

killerboot left a comment

killerboot May 5, 2020

killerboot commented May 5, 2020

evasques commented May 5, 2020

evasques commented May 5, 2020

killerboot commented May 5, 2020

evasques commented May 5, 2020

killerboot commented May 6, 2020

evasques commented May 6, 2020

killerboot commented May 6, 2020

evasques commented May 11, 2020

MNT-18308 - Add flag forceAsyncAclCreation #984

Are you sure you want to change the base?

MNT-18308 - Add flag forceAsyncAclCreation #984

Conversation

evasques commented May 5, 2020

killerboot left a comment

Choose a reason for hiding this comment

killerboot May 5, 2020

Choose a reason for hiding this comment

killerboot commented May 5, 2020

evasques commented May 5, 2020

evasques commented May 5, 2020

killerboot commented May 5, 2020

evasques commented May 5, 2020

killerboot commented May 6, 2020

evasques commented May 6, 2020

killerboot commented May 6, 2020

evasques commented May 11, 2020