Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make FieldValueTypeInformation creators take a TypeDescriptor parameter #32081

Merged
merged 6 commits into from
Oct 30, 2024

Conversation

tilgalas
Copy link
Contributor

@tilgalas tilgalas commented Aug 5, 2024

This is a 2nd PR in the series of PRs (based on the now closed #31648, with the first PR being #31785) whose ultimate goal is to add support for generic classes to schema providers.

FieldValueTypeInformation creators will now accept a TypeDescriptor parameter describing the field's containing class, that will let them infer more accurate type information about that field. For example - consider a MyClass<T> class - now with a TypeDescriptor of MyClass<String> and the getter of type T the type resolver can infer the type of field to be String , see the added test class that shows the new functionality


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Copy link
Contributor

github-actions bot commented Aug 5, 2024

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@tilgalas
Copy link
Contributor Author

tilgalas commented Aug 5, 2024

Run Java PreCommit

@tilgalas
Copy link
Contributor Author

tilgalas commented Aug 6, 2024

Run Java_GCP_IO_Direct PreCommit

@tilgalas
Copy link
Contributor Author

tilgalas commented Aug 6, 2024

Run Java PreCommit

@tilgalas
Copy link
Contributor Author

tilgalas commented Aug 6, 2024

Run Java_Hadoop_IO_Direct PreCommit

@tilgalas
Copy link
Contributor Author

tilgalas commented Aug 6, 2024

assign set of reviewers

Copy link
Contributor

github-actions bot commented Aug 6, 2024

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @m-trieu for label java.
R: @damondouglas for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@damccorm
Copy link
Contributor

@damondouglas could you please take a look at this one?

@damondouglas damondouglas self-requested a review August 14, 2024 19:52
Copy link
Contributor

@damondouglas damondouglas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tilgalas Thank you for working on this! Would you mind to first remove the "nullness" value in the @SuppressWarnings annotation for the involved classes in this PR? Please consider this suggestion as non-blocking but I would like to see if this is possible with your helpful changes. As a second phase, may we also consider removing the "rawtypes"? Again, non-blocking but it's something we would like to strive for to improve the code quality.

@tilgalas
Copy link
Contributor Author

@tilgalas Thank you for working on this! Would you mind to first remove the "nullness" value in the @SuppressWarnings annotation for the involved classes in this PR? Please consider this suggestion as non-blocking but I would like to see if this is possible with your helpful changes. As a second phase, may we also consider removing the "rawtypes"? Again, non-blocking but it's something we would like to strive for to improve the code quality.

sure, with pleasure!

Copy link
Contributor

Reminder, please take a look at this pr: @m-trieu @damondouglas

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Copy link
Contributor

github-actions bot commented Sep 5, 2024

Reminder, please take a look at this pr: @Abacn @Abacn

@Abacn
Copy link
Contributor

Abacn commented Sep 5, 2024

waiting on author

@github-actions github-actions bot added the aws label Sep 13, 2024
Copy link
Contributor

github-actions bot commented Oct 8, 2024

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@tilgalas
Copy link
Contributor Author

hi reviewers, I'm going to split this PR and put the nullness and rawness refactoring commit into its own PR

@tilgalas
Copy link
Contributor Author

since there's a #32757, I'm closing this PR

@tilgalas
Copy link
Contributor Author

reopening after discussing offline

@tilgalas tilgalas reopened this Oct 22, 2024
@github-actions github-actions bot added aws and removed aws labels Oct 22, 2024
@@ -39,9 +41,9 @@
public class CachingFactory<CreatedT> implements Factory<CreatedT> {
private transient @Nullable ConcurrentHashMap<TypeDescriptor<?>, CreatedT> cache = null;

private final Factory<CreatedT> innerFactory;
private final @NotOnlyInitialized Factory<CreatedT> innerFactory;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are these new tags for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so in RowValueGettersFactory there's this code:

    RowValueGettersFactory(Factory<List<FieldValueGetter<T, Object>>> gettersFactory) {
      this.gettersFactory = gettersFactory;
      this.cachingGettersFactory = new CachingFactory<>(this);
    }

since we're in the constructor, the this parameter is still not fully initialized, so by using this annotation we can temporarily (for the duration of the initalization of the CachingFactory object) allow assigning the innerFactory field a value which itself is under initialization (or indeed whose initialization status is unknown, as the signature of the CachingFactory constructor tells us)

see also https://checkerframework.org/manual/#circular-initialization

@@ -29,7 +30,7 @@
* <p>Implementations of this interface are generated at runtime to map object fields to Row fields.
*/
@Internal
public interface FieldValueGetter<ObjectT, ValueT> extends Serializable {
public interface FieldValueGetter<ObjectT extends @NonNull Object, ValueT> extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you have ObjectT extends Object?

Copy link
Contributor Author

@tilgalas tilgalas Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so apparently there's a difference (I don't understand why it would be so, though) between a @NonNull T and T extends @NonNull Object - here's an example, consider:

  public interface I<T extends @NonNull Object> {
      T getT();
  }

with a signature as above, we can only write:

  public static <T extends @NonNull Object> void f(I<T> i) {
    System.out.println(i.getT().getClass());
  }

or alternatively:

  public static <T> void f(I<@NonNull T> i) {
    System.out.println(i.getT().getClass());
  }

The checkerframework will make sure that we propagate the non-nullness of the parameter.
This won't work for example:

  public static <@Nullable T> void f(I<T> i) {
    T t = i.getT();
    if (t != null) {
      System.out.println(t.getClass());
    }
  }

nor will this:

  public static <T> void f(I<@Nullable T> i) {
    T t = i.getT();
    if (t != null) {
      System.out.println(t.getClass());
    }
  }

but it will work if I change the signature of I to:

  public interface I<@NonNull T> {
      T getT();
  }

in other words, the @NonNull T annotation on the interface is not really enforced anywhere, it's more like a suggestion, and so it's not really expressing the concept of the interface only works with @NonNull types

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally in Beam I've mostly seen us using @nullable, not @nonnull. I believe that NonNull is assumed unless @nullable is specified. Are you not finding that to be the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, at least not when we're talking type parameters, for example:

  public interface I<T> {
      T getT();
  }

  public static <T> void f(I<T> i) {
    System.out.println(i.getT().getClass());
  }

gives me

error: [dereference.of.nullable] dereference of possibly-null reference i.getT()
  System.out.println(i.getT().getClass());
                           ^

@@ -126,17 +123,26 @@ public static FieldValueTypeInformation forOneOf(
}

public static FieldValueTypeInformation forField(Field field, int index) {
TypeDescriptor<?> type = TypeDescriptor.of(field.getGenericType());
return forField(null, field, index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class wasn't really meant to be public. I would add an @internal tag to the class, and not bother keeping this backwards-compatible method

}

@Override
public Row apply(T input) {
return Row.withSchema(schema).withFieldValueGetters(getterFactory, input);
return Row.withSchema(schema).withFieldValueGetters((Factory) getterFactory, input);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you need this cast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, good question - I'm playing with that code right now and I think I could get rid of it actually

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -836,10 +838,10 @@ public int nextFieldId() {
}

@Internal
public Row withFieldValueGetters(
Factory<List<FieldValueGetter>> fieldValueGetterFactory, Object getterTarget) {
public <T extends @NonNull Object> Row withFieldValueGetters(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unclear on this bound as well

Copy link
Contributor Author

@tilgalas tilgalas Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change it to <@NonNull T> I suppose - it compiles, but then my IDE complains that "Non-null type argument is expected" on the T parameter of the FieldValueGetter<T, ?> type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer @nonnull T (is this simply a weakness in the IDE?), but I guess I don't have strong feelings here.

super(schema);
this.getterTarget = getterTarget;
this.getters = getterFactory.create(TypeDescriptor.of(getterTarget.getClass()), schema);
}

@Override
@SuppressWarnings({"TypeParameterUnusedInFormals", "unchecked"})
public <T> @Nullable T getValue(int fieldIdx) {
public @Nullable Object getValue(int fieldIdx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove type parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's probably a matter of taste - my view is that this class can't possibly know what type of value a particular getter is going to return, so it's not commiting to anything by making the method generic - "I only know it's an Object, if you need a specific type, you'll have to cast it yourself" - I should probably also change the type of the getters list to List<FieldValueGetter<T, Object>> but then it would probably open another can of worms of covariant types, that is whether a, say, FieldValueGetter<T, String> is a FieldValueGetter<T, Object> - the current type of List<FieldValueGetter<T, ?>> is slightly incorrect as the ? wildcard means an unknown type, but which is at least shared by all members of the list (which is not true)

I can understand the approach where the caller requests a specific type via a generic type parameter and is happy to accept the method throwing a ClassCastException if they had wrong assumptions about the getter - let me know if you want me to make it back a generic method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I will change the type of the list elements to FieldValueGetter<T, Object> anyway, as I find it more correct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The base class uses a parameter here, so I think for now we should just match the base class.

result = false)
private static boolean isNullOrEmpty(@Nullable String str) {
return str == null || str.isEmpty();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is wrong with Strings.isNullOrEmpty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for some reason the checkerframework doesn't understand the nullness assertion when using it:

/usr/local/google/home/mszwaja/src/beam/sdks/java/extensions/avro/src/main/java/org/apache/beam/sdk/extensions/avro/schemas/utils/AvroUtils.java:538: error: [argument] incompatible argument for parameter namespace of AvroUtils.toAvroField.
      org.apache.avro.Schema.Field recordField = toAvroField(field, childNamespace);
                                                                    ^
  found   : @Initialized @Nullable String
  required: @Initialized @NonNull String
/usr/local/google/home/mszwaja/src/beam/sdks/java/extensions/avro/src/main/java/org/apache/beam/sdk/extensions/avro/schemas/utils/AvroUtils.java:541: error: [argument] incompatible argument for parameter name of Schema.createRecord.
    return org.apache.avro.Schema.createRecord(schemaName, "", schemaNamespace, false, fields);
                                               ^
  found   : @Initialized @Nullable String
  required: @Initialized @NonNull String

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this example has to do with isNullOrEmpty (which returns a boolean)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the error that appears when I use Strings.isNullOrEmpty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other words, checkerframework doesn't understand that by using this method (and getting a false returned from it) we've just asserted, that the variable is @NonNull - my "polyfill" method uses the checkerframework's annotation:

  @EnsuresNonNullIf(
      expression = {"#1"},
      result = false)

which tells it that if it returns false the first argument is verified to be @NonNull - I don't know why checkerframwork doesn't understand guava method's postconditions here (maybe it's fixed in some newer version?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

understood

@ahmedabu98
Copy link
Contributor

Fixes #32081

@reuvenlax
Copy link
Contributor

A bunch of tests are failing. PTAL to see if this is due to your PR or just flaky tests.

@github-actions github-actions bot added aws and removed aws labels Oct 29, 2024
@tilgalas
Copy link
Contributor Author

Run Java PreCommit

1 similar comment
@tilgalas
Copy link
Contributor Author

Run Java PreCommit

@tilgalas
Copy link
Contributor Author

Run Java_GCP_IO_Direct PreCommit

@reuvenlax
Copy link
Contributor

lgtm

@reuvenlax reuvenlax merged commit dd556b2 into apache:master Oct 30, 2024
27 of 29 checks passed
@@ -451,8 +484,7 @@ public static Field toBeamField(org.apache.avro.Schema.Field field) {
public static org.apache.avro.Schema.Field toAvroField(Field field, String namespace) {
org.apache.avro.Schema fieldSchema =
getFieldSchema(field.getType(), field.getName(), namespace);
return new org.apache.avro.Schema.Field(
field.getName(), fieldSchema, field.getDescription(), (Object) null);
return new org.apache.avro.Schema.Field(field.getName(), fieldSchema, field.getDescription());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #33102 should fix this (I cherry-picked it to the current release branch as well)

@Abacn
Copy link
Contributor

Abacn commented Nov 15, 2024

This likely breaks DataflowTemplate unit test: https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/2014/checks?check_run_id=33057418339

I'm trying to understand the scope of breaking change

The stacktrace

Caused by: java.lang.IllegalStateException: getters require withGetterTarget.
	at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:512)
	at org.apache.beam.sdk.values.Row$Builder.withFieldValueGetters(Row.java:843)
	at org.apache.beam.sdk.schemas.GetterBasedSchemaProvider$ToRowWithValueGetters.apply(GetterBasedSchemaProvider.java:135)
	at org.apache.beam.sdk.schemas.GetterBasedSchemaProvider$ToRowWithValueGetters.apply(GetterBasedSchemaProvider.java:114)
	at org.apache.beam.sdk.schemas.SchemaCoder.encode(SchemaCoder.java:121)
	at org.apache.beam.sdk.coders.MapCoder.encode(MapCoder.java:92)
	at org.apache.beam.sdk.coders.MapCoder.encode(MapCoder.java:41)
	at org.apache.beam.sdk.coders.MapCoder.encode(MapCoder.java:93)
	at org.apache.beam.sdk.coders.MapCoder.encode(MapCoder.java:41)
	at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:73)
	at org.apache.beam.sdk.coders.KvCoder.encode(KvCoder.java:37)
	at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:86)
	at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:70)
	at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:55)
	at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:168)
	at org.apache.beam.sdk.util.MutationDetectors$CodedValueMutationDetector.<init>(MutationDetectors.java:118)
	at org.apache.beam.sdk.util.MutationDetectors.forValueWithCoder(MutationDetectors.java:49)
	at org.apache.beam.runners.direct.ImmutabilityCheckingBundleFactory$ImmutabilityEnforcingBundle.add(ImmutabilityCheckingBundleFactory.java:115)
	at org.apache.beam.runners.direct.ParDoEvaluator$BundleOutputManager.output(ParDoEvaluator.java:305)

MapCoder.encode -> SchemaCoder.encode ->GetterBasedSchemaProvider.ToRowWithValueGetters passes a null value to withFieldValueGetters somehow, causing IllegalStateException

This PR made substantial change to the related code path. Somehow some encoding worked before now crash with the Exception above

Update: I now suspect it is due to unintended consequence of nullable annotation fix. I checked that where CodedValueMutationDetector.<init>(MutationDetectors.java:118) throws Exception, the element KV Coder trying to encode is sth like

KV{BigQueryTable{project=test-project1, dataset=test-dataset1, tableName=unpartitioned_table,
partitioningColumn=null, partitions=null, lastModificationTime=1731699015913000,
schemaSupplier=com.google.cloud.teleport.v2.utils.SerializableSchemaSupplier@2a2435f9,
dataplexEntityName=unpartitioned_table_entity},
null}, org.apache.beam.sdk.coders.KvCoder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants