Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements bulk_create for create_batch if available #925

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kingbuzzman
Copy link
Contributor

@kingbuzzman kingbuzzman commented May 6, 2022

What this PR is trying to accomplish:

  • There is a lot of test code linked to a series of nested-russian-doll sub factories of 1-to-3-to-3-to-1 * 10 (create_batch(10)) models that get created constantly throughout our code base.
    • If we can avoid creating all these models one by one, and leverage django's bulk_create, we would increase performance, and most importantly; it would save a lot of time.
  • This feature is turned off by default, and can be turned on via Factory._meta.use_bulk_create.
  • Right now i'm targetting postgres
    • I became aware recently that newer versions of sqlite support it now too, that would be fine, but not my main concern

Whats left to do:

  • Get buy-in from the maintainers
  • Fix a lot of the code / clean up -- this is just a demo.
  • Write more tests
  • Fix a weird error in examples. NM all green 👍
  • More clean up
  • Write docs
  • Works with postgres and >= py3.9 sqlite (possibly a django limitation??)

Depends on:

@kingbuzzman
Copy link
Contributor Author

@francoisfreitag bump

Copy link
Member

@francoisfreitag francoisfreitag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refs #107. The main reason this feature was not implemented seems to be the lack of motivated developers.

Django bulk_create has several documented limitations, but none seem deal breakers.

The opt-in option is probably the best way to achieve backward compatibility and incremental adoption. Also gives an escape hatch for bulk_create limitations.

I see no reason to reject the PR, provided the code gets to a mature state and we can get CI on it and good test coverage. I’m actually pretty excited about it.

The idea of collecting "leaf" models and working up the insert chain should work, though we need to be careful of post generation hooks. Perhaps they should be documented as incompatible with bulk_create(), at least as a first step.

I’m not eager to be introducing more DB-specific code in factory_boy, but that seem unavoidable with this feature and can rely on Django attributes to abstract some of the difficulties.

This patch inspects Django models to describe the relationship between objects, where factory_boy is mostly declarative (SubFactory, RelatedFactory and co). I don’t foresee conflicts arising from that yet, and using Django as the source of truth is probably safe.

factory/django.py Outdated Show resolved Hide resolved
@kingbuzzman
Copy link
Contributor Author

@francoisfreitag What do you think of separating some of these files into a different PR to make it easier to review later.

What do you think?

@francoisfreitag
Copy link
Member

Just to confirm, you’re suggesting to open a PR to run the factory boy Django tests against PostgreSQL ?
If so, yes, please go ahead. It is an improvement regardless of this feature, it increases the test coverage for the library.

@kingbuzzman
Copy link
Contributor Author

@francoisfreitag PR #931 is ready for review. Thanks!

@francoisfreitag
Copy link
Member

francoisfreitag commented May 24, 2022 via email

@kingbuzzman kingbuzzman marked this pull request as draft June 3, 2022 14:09
@kingbuzzman kingbuzzman mentioned this pull request Jun 24, 2022
@kingbuzzman
Copy link
Contributor Author

@francoisfreitag When you can, can you start looking at this, I'm slowly cleaning up the tests. My biggest hurtle is this right here -- I find it nasty / dirty. Let me know your thoughts.

Copy link
Contributor Author

@kingbuzzman kingbuzzman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add #s -- lol i thought i was writing a commit message. The github code integration confused me :/ now i can't delete this

@kingbuzzman
Copy link
Contributor Author

@thedrow Thank you for the review

Copy link
Member

@francoisfreitag francoisfreitag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven’t had much time to look at this lately (and probably won’t until the end of summer), but here’s an idea to address the TODO you asked for feedback on (collecting the built instances). The idea is to have the owner of the list allocate it, and have the step builder cooperate and fill that list.

The suggestion also starts taking care of the incompatibility between post generation hooks and this new feature, which will probably be necessary to land it.

diff --git a/factory/builder.py b/factory/builder.py
index ed39ebc..464a680 100644
--- a/factory/builder.py
+++ b/factory/builder.py
@@ -208,14 +208,18 @@ class BuildStep:
             parent_chain = ()
         return (self.stub,) + parent_chain
 
-    def recurse(self, factory, declarations, force_sequence=None):
+    def recurse(self, factory, declarations, force_sequence=None, collect_instances=None):
         from . import base
         if not issubclass(factory, base.BaseFactory):
             raise errors.AssociatedClassError(
                 "%r: Attempting to recursing into a non-factory object %r"
                 % (self, factory))
         builder = self.builder.recurse(factory._meta, declarations)
-        return builder.build(parent_step=self, force_sequence=force_sequence)
+        return builder.build(
+            parent_step=self,
+            force_sequence=force_sequence,
+            collect_instances=collect_instances,
+        )
 
     def __repr__(self):
         return f"<BuildStep for {self.builder!r}>"
@@ -235,9 +239,8 @@ class StepBuilder:
         self.strategy = strategy
         self.extras = extras
         self.force_init_sequence = extras.pop('__sequence', None)
-        self.created_instances = []
 
-    def build(self, parent_step=None, force_sequence=None):
+    def build(self, parent_step=None, force_sequence=None, collect_instances=None):
         """Build a factory instance."""
         # TODO: Handle "batch build" natively
         pre, post = parse_declarations(
@@ -246,13 +249,6 @@ class StepBuilder:
             base_post=self.factory_meta.post_declarations,
         )
 
-        # TODO: come up with a better solution
-        if parent_step:
-            if hasattr(parent_step, 'builder'):
-                self.created_instances = parent_step.builder.created_instances
-            else:
-                self.created_instances = parent_step.created_instances
-
         if force_sequence is not None:
             sequence = force_sequence
         elif self.force_init_sequence is not None:
@@ -275,21 +271,22 @@ class StepBuilder:
             kwargs=kwargs,
         )
 
-        postgen_results = {}
-        for declaration_name in post.sorted():
-            declaration = post[declaration_name]
-            postgen_results[declaration_name] = declaration.declaration.evaluate_post(
+        if collect_instances is None:
+            postgen_results = {}
+            for declaration_name in post.sorted():
+                declaration = post[declaration_name]
+                postgen_results[declaration_name] = declaration.declaration.evaluate_post(
+                    instance=instance,
+                    step=step,
+                    overrides=declaration.context,
+                )
+            self.factory_meta.use_postgeneration_results(
                 instance=instance,
                 step=step,
-                overrides=declaration.context,
+                results=postgen_results,
             )
-        self.factory_meta.use_postgeneration_results(
-            instance=instance,
-            step=step,
-            results=postgen_results,
-        )
-
-        self.created_instances.append(instance)
+        else:
+            collect_instances.append(instance)
 
         return instance
 
diff --git a/factory/django.py b/factory/django.py
index 25d6966..541e19b 100644
--- a/factory/django.py
+++ b/factory/django.py
@@ -246,13 +246,12 @@ class DjangoModelFactory(base.Factory):
                 "is either not set or False." % dict(f=cls.__name__))
 
         models_to_return = []
-        instances_created = []
+        instances = []
         for _ in range(size):
             step = builder.StepBuilder(cls._meta, kwargs, enums.BUILD_STRATEGY)
-            models_to_return.append(step.build())
-            instances_created.extend(step.created_instances)
+            models_to_return.append(step.build(collect_instances=instances))
 
-        for model_cls, objs in dependency_insert_order(instances_created):
+        for model_cls, objs in dependency_insert_order(instances):
             manager = cls._get_manager(model_cls)
             cls._refresh_database_pks(model_cls, objs)
 

factory/django.py Outdated Show resolved Hide resolved
factory/django.py Show resolved Hide resolved
@tony
Copy link
Contributor

tony commented Sep 19, 2022

@francoisfreitag Do you need any help QA'ing or testing this?

@francoisfreitag
Copy link
Member

Hi Tony,

Feel free to share any bugs or weird behavior that you find. Because tests aren’t passing, there will certainly be changes to the implementation. When I get some free time, I’ll give it a deeper look. In the meantime, solving the test failures is a very helpful task, to both understand the code and help others understand it.

@tony
Copy link
Contributor

tony commented Sep 19, 2022

@francoisfreitag @kingbuzzman

Bikeshed: I won't count on it, but it's a long running PR and the commits have built up. It's sometimes easier to read if commits are squashed and it's rebased against master.

I'm not sure if there's a policy on that, or if it's more harm than good, though.

Either way, I backed up 3666f10 on my fork at pr-925-3666f10 in the event it's ever necessary to look back

@kingbuzzman
Copy link
Contributor Author

@tony everything went awry after 8ec88e9 , going from 4369b72 makes all the tests pass. I'm going to work on this a little today as i have a little time.

@francoisfreitag .. what a mess. :/

@kingbuzzman
Copy link
Contributor Author

kingbuzzman commented Oct 25, 2022

@francoisfreitag the code you gave me to avoid collecting all the created objects instead of my "half-baked" instance level solution does not work. I also don't see it feasible to send the collect_instances down 13 levels to collect the models. Think of a crazy model nesting: A->B <-[C1, C2->D]

What else can we do?

Ps a good test that demonstrates the issue is tests/test_django.py::DependencyInsertOrderTest:test_new_m2m

Stack of how far `created_instances` would need to go (with room to grow)
  /factory_boy/tests/test_django.py(324)test_new_m2m()
    323         created_instances = []
--> 324         a1 = step.build(collect_instances=created_instances)
    325         print('*'*30, created_instances)

  /factory_boy/factory/builder.py(279)build()
    278             declaration = post[declaration_name]
--> 279             postgen_results[declaration_name] = declaration.declaration.evaluate_post(
    280                 instance=instance,

  /factory_boy/factory/declarations.py(608)evaluate_post()
    607         )
--> 608         return self.call(instance, step, postgen_context)
    609 

  /factory_boy/factory/declarations.py(710)call()
    709         parent = super()
--> 710         return [
    711             parent.call(instance, step, context)

  /factory_boy/factory/declarations.py(711)<listcomp>()
    710         return [
--> 711             parent.call(instance, step, context)
    712             for i in range(self.size if isinstance(self.size, int) else self.size())

  /factory_boy/factory/declarations.py(689)call()
    688         )
--> 689         return step.recurse(factory, passed_kwargs)
    690 

  /factory_boy/factory/builder.py(218)recurse()
    217         builder = self.builder.recurse(factory._meta, declarations)
--> 218         return builder.build(
    219             parent_step=self,

  /factory_boy/factory/builder.py(264)build()
    263         )
--> 264         step.resolve(pre)
    265 

  /factory_boy/factory/builder.py(201)resolve()
    200         for field_name in declarations:
--> 201             self.attributes[field_name] = getattr(self.stub, field_name)
    202 

  /factory_boy/factory/builder.py(353)__getattr__()
    352                 try:
--> 353                     value = value.evaluate_pre(
    354                         instance=self,

  /factory_boy/factory/declarations.py(48)evaluate_pre()
     47         context = self.unroll_context(instance, step, overrides)
---> 48         return self.evaluate(instance, step, context)
     49 

  /factory_boy/factory/declarations.py(411)evaluate()
    410         force_sequence = step.sequence if self.FORCE_SEQUENCE else None
--> 411         return step.recurse(subfactory, extra, force_sequence=force_sequence)
    412 

  /factory_boy/factory/builder.py(218)recurse()
    217         builder = self.builder.recurse(factory._meta, declarations)
--> 218         return builder.build(
    219             parent_step=self,

> /factory_boy/factory/builder.py(276)build()
    275 
--> 276         postgen_results = {}
    277         for declaration_name in post.sorted():

@kingbuzzman
Copy link
Contributor Author

@francoisfreitag bump

@francoisfreitag
Copy link
Member

I’ll be busy until the end of the month / year. I’ll do my best to carve out some time for this issue, but the ask if not for a few minutes, it’ll take much longer to find a correct integration of the collected objects in the existing machinery.

@tony
Copy link
Contributor

tony commented Nov 8, 2022

Is there going to be a rebase / squash?

@kingbuzzman
Copy link
Contributor Author

@tony does it matter at the moment? Rather keep the history at this time in case I need to bring something back. I'll squash as this comes closer to finish

@tony
Copy link
Contributor

tony commented Nov 8, 2022

@kingbuzzman No worries! Whatever works best for you.

To minimize noise, I'll stay out of the way unless you / another maintainer requests a review (earlier mention).

(Fine to minimize / hide this comment)

@kingbuzzman
Copy link
Contributor Author

@tony last I saw the tests were passing... weird. If you want to take it out for a spin, you're more than welcome!

Any changes you want you can make PRs over at my fork and I'll merge them in so it can be shown here.

@kingbuzzman
Copy link
Contributor Author

@francoisfreitag any input?

@francoisfreitag
Copy link
Member

I don’t have much time to dedicate to open-source these days, I cannot spend time on this PR for now.

The idea sure has merits, but it needs efforts to get the approach right. My understanding is that #925 (comment) needs solving, and without spending some time figuring out a better idea, I won’t be able to make suggestions.

@tony
Copy link
Contributor

tony commented Jun 23, 2023

@kingbuzzman Thanks for keeping this branch alive!

@kingbuzzman
Copy link
Contributor Author

I honestly don't understand why it passed 3 "merge 'master' into master" ago, and now its failing, nothing has changed..

@jurrian
Copy link

jurrian commented May 13, 2024

What is the progress on this? I appreciate the introduction of the true bulk create.
Any help needed at this point?

@cameron-hobbs
Copy link

I am also keen for bulk create implementation to be introduced - glad to help

@ulgens
Copy link

ulgens commented Aug 27, 2024

It looks like something went wrong with the PR and there are lot of unrelated stuff in it. A rebase would be nice but at this point, recreating it will probably be cleaner.

(Merging from main branch into feature branches generally ends like similar results to this one.)

I'd also recommend that instead of using the main branch in your fork, creating a PR from a feature branch to upstream main gives better/cleaner results.

@kingbuzzman
Copy link
Contributor Author

I'll rebase it, i dont want to lose the history. I need to fix the tests when i get a chance. I've been really busy with the day job 😇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants