[FSTORE-1628] Branch 2.0.7 spark 3.5 #17

Merged
Changes from all commits
23 commits
850a13d Update release version to 2.0.4-spark-3.3 (eycho-am, Aug 7, 2023)
50a5790 Change groupId and publish on archiva (#2) (SirOibaf, Sep 15, 2020)
d91b861 Adapt profiler for hsfs (#1) (moritzmeister, Sep 16, 2020)
bfa462e make tests compile (#3) (moritzmeister, Sep 17, 2020)
d6753db Fix tests checkstyle and 4 tests (#4) (moritzmeister, Sep 17, 2020)
dac9285 Fix test checkstyle (#5) (moritzmeister, Sep 17, 2020)
0007bf1 Hopsify Deequ 1.1.0 (SirOibaf, Jun 1, 2021)
a090645 Bump deequ hops version (SirOibaf, Jun 7, 2021)
556d55a Prepare for 2.0.4.1-SNAPSHOT development (javierdlrm, Oct 30, 2023)
3fe6188 Fix for NaNs and Infinity values in profile JSON (#7) (tdoehmen, Aug 10, 2021)
9e10b6c Fix for NaNs and Infinity values in profile JSON (stylecheck) (#8) (tdoehmen, Aug 10, 2021)
8791fbb [HOPSWORKS-2681] Profiling optimization (#6) (tdoehmen, Aug 26, 2021)
07f15ef Increase scala-style max method parameters check (SirOibaf, Aug 26, 2021)
6ce9015 Support for Decimal-type histograms (#10) (tdoehmen, Sep 14, 2021)
0cd154b Fix NaN bug for histograms (moritzmeister, Dec 1, 2021)
d8a78e3 columns need to be filtered also when getting results (moritzmeister, Apr 25, 2022)
1674da5 fixed NaN issues and improved statistics JSON (tdoehmen, May 20, 2022)
b21d74e fixed stylecheck (tdoehmen, May 27, 2022)
8edc0e9 fixed NaN issues and improved statistics JSON (tdoehmen, May 20, 2022)
05b4e1a Resolve conflicts for 2.0.4 - spark3.3 (javierdlrm, Oct 30, 2023)
766d412 Set better artifact name (SirOibaf, Nov 18, 2023)
c707499 Merge branch 'branch-2.0.7-spark-3.5' into branch-2.0.4 (bubriks, Dec 2, 2024)
cd2ee98 merge hopsworks branch 2.0.4 (bubriks, Dec 2, 2024)
2 changes: 1 addition & 1 deletion deequ-scalastyle.xml
@@ -35,7 +35,7 @@
</check>

<check customId="argcount" level="error" class="org.scalastyle.scalariform.ParameterNumberChecker" enabled="true">
<parameters><parameter name="maxParameters"><![CDATA[10]]></parameter></parameters>
<parameters><parameter name="maxParameters"><![CDATA[20]]></parameter></parameters>
</check>

<check level="error" class="org.scalastyle.scalariform.NoFinalizeChecker" enabled="true"></check>
148 changes: 133 additions & 15 deletions pom.xml
@@ -4,23 +4,12 @@
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.amazon.deequ</groupId>
<artifactId>deequ</artifactId>
<groupId>com.logicalclocks</groupId>
<artifactId>deequ_${scala.major.version}</artifactId>
<!-- <groupId>com.amazon.deequ</groupId>-->
<!-- <artifactId>deequ</artifactId>-->
<version>2.0.7-spark-3.5</version>

<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>

<scala.major.version>2.12</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>${scala.major.version}</artifact.scala.version>
<scala-maven-plugin.version>4.8.1</scala-maven-plugin.version>

<spark.version>3.5.0</spark.version>
</properties>

<name>deequ</name>
<description>Deequ is a library built on top of Apache Spark for defining "unit tests for data",
which measure data quality in large datasets.
@@ -67,6 +56,48 @@
<url>https://github.com/awslabs/deequ</url>
</scm>


<!-- awslabs/[email protected] only has the following properties; it does not define multiple profiles. -->
<!-- <properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>

<scala.major.version>2.12</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>${scala.major.version}</artifact.scala.version>
<scala-maven-plugin.version>4.8.1</scala-maven-plugin.version>

<spark.version>3.3.0</spark.version>
</properties> -->

<properties>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<encoding>UTF-8</encoding>

<!-- Scala -->
<scala.major.version>${scala-212.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<scala-211.major.version>2.11</scala-211.major.version>
<scala-212.major.version>2.12</scala-212.major.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
<scala-maven-plugin.version>4.8.1</scala-maven-plugin.version>

<!-- Spark -->
<spark.version>${spark-35.version}</spark.version>
<spark-22.version>2.2.2</spark-22.version>
<spark-23.version>2.3.2</spark-23.version>
<spark-24.version>2.4.2</spark-24.version>
<spark-30.version>3.0.0</spark-30.version>
<spark-31.version>3.1.1.0</spark-31.version>
<spark-33.version>3.3.0</spark-33.version>
<spark-35.version>3.5.0</spark-35.version>
<artifact.spark.version></artifact.spark.version>
<spark.scope>provided</spark.scope>
</properties>

<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
@@ -86,12 +117,14 @@
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.major.version}</artifactId>
<version>${spark.version}</version>
<scope>${spark.scope}</scope>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.major.version}</artifactId>
<version>${spark.version}</version>
<scope>${spark.scope}</scope>
</dependency>

<dependency>
@@ -422,6 +455,91 @@
</plugins>
</build>
</profile>

<!-- In logicalclocks/[email protected] we have multiple profiles. They no longer exist in awslabs/deequ. -->

<!-- <profile>
<id>spark-2.2-scala-2.11</id>
<properties>
<spark.version>${spark-22.version}</spark.version>
<scala.major.version>${scala-211.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile>
<profile>
<id>spark-2.3-scala-2.11</id>
<properties>
<spark.version>${spark-23.version}</spark.version>
<scala.major.version>${scala-211.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile>
<profile>
<id>spark-2.4-scala-2.11</id>
<properties>
<spark.version>${spark-24.version}</spark.version>
<scala.major.version>${scala-211.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile>
<profile>
<id>spark-3.0-scala-2.12</id>
<properties>
<spark.version>${spark-30.version}</spark.version>
<scala.major.version>${scala-212.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile>
<profile>
<id>spark-3.1-scala-2.12</id>
<properties>
<spark.version>${spark-31.version}</spark.version>
<scala.major.version>${scala-212.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile>
<profile>
<id>spark-3.3-scala-2.12</id>
<properties>
<spark.version>${spark-330.version}</spark.version>
<scala.major.version>${scala-212.major.version}</scala.major.version>
<scala.version>${scala.major.version}.10</scala.version>
<artifact.scala.version>_scala-${scala.major.version}</artifact.scala.version>
<artifact.spark.version>_spark-${spark.version}</artifact.spark.version>
</properties>
</profile> -->
</profiles>

<repositories>
<repository>
<id>Hops</id>
<name>Hops Repo</name>
<url>https://archiva.hops.works/repository/Hops/</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
</repositories>

<distributionManagement>
<repository>
<id>Hops</id>
<name>Hops Repo</name>
<url>https://archiva.hops.works/repository/Hops/</url>
</repository>
</distributionManagement>

</project>
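Taken together, the pom.xml changes repoint the coordinates at com.logicalclocks with a Scala-suffixed artifactId, keep Spark in provided scope, and resolve and publish through the Hops Archiva repository. As a rough sketch of what a downstream build might look like (an sbt consumer is assumed purely for illustration; the coordinates and repository URL mirror what is visible in this diff):

// build.sbt sketch (assumption: an sbt consumer; coordinates copied from this pom.xml)
resolvers += "Hops" at "https://archiva.hops.works/repository/Hops/"

libraryDependencies ++= Seq(
  // artifactId is deequ_${scala.major.version}, so %% appends the _2.12 suffix
  "com.logicalclocks" %% "deequ" % "2.0.7-spark-3.5",
  // Spark is <scope>provided</scope> in the pom, so the consumer supplies its own Spark
  "org.apache.spark" %% "spark-sql" % "3.5.0" % Provided
)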
4 changes: 2 additions & 2 deletions src/main/scala/com/amazon/deequ/analyzers/Completeness.scala
@@ -16,7 +16,7 @@

package com.amazon.deequ.analyzers

import com.amazon.deequ.analyzers.Preconditions.{hasColumn, isNotNested}
import com.amazon.deequ.analyzers.Preconditions.{hasColumn}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{IntegerType, StructType}
import Analyzers._
@@ -45,7 +45,7 @@ case class Completeness(column: String, where: Option[String] = None,
}

override protected def additionalPreconditions(): Seq[StructType => Unit] = {
hasColumn(column) :: isNotNested(column) :: Nil
hasColumn(column) :: Nil
}

override def filterCondition: Option[String] = where
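The only change here is dropping the isNotNested precondition, so Completeness no longer rejects nested struct fields up front. A minimal sketch of what that enables (the SparkSession, DataFrame, and column names are invented for illustration, and it assumes the remaining hasColumn check resolves dotted paths):

import com.amazon.deequ.analyzers.Completeness
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.SparkSession

object NestedCompletenessSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("nested-completeness").getOrCreate()
  import spark.implicits._

  // Hypothetical frame with a nested struct column payload.score
  val df = Seq(("a", Some(1)), ("b", None: Option[Int]))
    .toDF("id", "score")
    .selectExpr("id", "named_struct('score', score) AS payload")

  // Previously blocked by isNotNested in additionalPreconditions(); now only hasColumn applies
  val result = AnalysisRunner
    .onData(df)
    .addAnalyzer(Completeness("payload.score"))
    .run()

  AnalyzerContext.successMetricsAsDataFrame(spark, result).show()
}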
1 change: 1 addition & 0 deletions src/main/scala/com/amazon/deequ/analyzers/DataType.scala
@@ -35,6 +35,7 @@ object DataTypeInstances extends Enumeration {
val Integral: Value = Value(2)
val Boolean: Value = Value(3)
val String: Value = Value(4)
val Decimal: Value = Value(5)
}

case class DataTypeHistogram(
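The new Decimal member extends the set of detected types a profile can report. A tiny hypothetical helper shows how code consuming DataTypeInstances might branch on it (the storage-hint mapping itself is made up):

import com.amazon.deequ.analyzers.DataTypeInstances

object TypeMappingSketch {
  // Hypothetical mapping from a detected deequ type to a SQL storage hint
  def storageHint(detected: DataTypeInstances.Value): String = detected match {
    case DataTypeInstances.Integral   => "BIGINT"
    case DataTypeInstances.Fractional => "DOUBLE"
    case DataTypeInstances.Decimal    => "DECIMAL(38,18)" // now representable thanks to this change
    case DataTypeInstances.Boolean    => "BOOLEAN"
    case _                            => "STRING"
  }
}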
2 changes: 1 addition & 1 deletion src/main/scala/com/amazon/deequ/analyzers/Histogram.scala
@@ -134,7 +134,7 @@ case class Histogram(
}

object Histogram {
val NullFieldReplacement = "NullValue"
val NullFieldReplacement = "-null-"
val MaximumAllowedDetailBins = 1000
val count_function = "count"
val sum_function = "sum"
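With NullFieldReplacement renamed, null values now land in a bucket labelled "-null-" instead of "NullValue". A short sketch of reading that bucket back (the SparkSession and data are made up; the metric access follows the standard deequ AnalyzerContext API):

import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import com.amazon.deequ.metrics.HistogramMetric
import org.apache.spark.sql.SparkSession
import scala.util.Success

object NullBucketSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("null-bucket").getOrCreate()
  import spark.implicits._

  val df = Seq(Some("a"), None, Some("a"), Some("b")).toDF("category")

  val result: AnalyzerContext = AnalysisRunner
    .onData(df)
    .addAnalyzer(Histogram("category"))
    .run()

  result.metric(Histogram("category")).foreach {
    case HistogramMetric(_, Success(distribution)) =>
      // Nulls are keyed by Histogram.NullFieldReplacement, i.e. "-null-" after this change
      distribution.values.get(Histogram.NullFieldReplacement).foreach { bucket =>
        println(s"null bucket: absolute=${bucket.absolute}, ratio=${bucket.ratio}")
      }
    case _ => ()
  }
}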
@@ -47,7 +47,7 @@ object DeequFunctions {

/** Standard deviation with state */
def stateful_stddev_pop(column: Column): Column = withAggregateFunction {
StatefulStdDevPop(column.expr)
StatefulStdDevPop(column.expr, true)
}

/** Approximate number of distinct values with state via HLL's */
@@ -18,7 +18,7 @@ package com.amazon.deequ.analyzers.runners

import com.amazon.deequ.analyzers.{Analyzer, KLLParameters, KLLSketch, KLLState, QuantileNonSample, State, StateLoader, StatePersister}
import com.amazon.deequ.metrics.Metric
import org.apache.spark.sql.types.{ByteType, DoubleType, FloatType, IntegerType, LongType, ShortType, StructType}
import org.apache.spark.sql.types.{ByteType, DecimalType, DoubleType, FloatType, IntegerType, LongType, ShortType, StructType}
import org.apache.spark.sql.{DataFrame, Row}

@SerialVersionUID(1L)
@@ -84,6 +84,13 @@ class FloatQuantileNonSample(sketchSize: Int, shrinkingFactor: Double)
override def itemAsDouble(item: Any): Double = item.asInstanceOf[Float].toDouble
}

@SerialVersionUID(1L)
class DecimalQuantileNonSample(sketchSize: Int, shrinkingFactor: Double)
extends UntypedQuantileNonSample(sketchSize, shrinkingFactor) with Serializable {
override def itemAsDouble(item: Any): Double = item.asInstanceOf[java.math.BigDecimal]
.doubleValue()
}

object KLLRunner {

def computeKLLSketchesInExtraPass(
@@ -139,6 +146,7 @@ object KLLRunner {
case ShortType => new ShortQuantileNonSample(sketchSize, shrinkingFactor)
case IntegerType => new IntQuantileNonSample(sketchSize, shrinkingFactor)
case LongType => new LongQuantileNonSample(sketchSize, shrinkingFactor)
case DecimalType() => new DecimalQuantileNonSample(sketchSize, shrinkingFactor)
// TODO at the moment, we will throw exceptions for Decimals
case _ => throw new IllegalArgumentException(s"Cannot handle ${schema(column).dataType}")
}
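The new DecimalQuantileNonSample plus the DecimalType() case mean the KLL extra-pass runner no longer throws for decimal columns and instead converts java.math.BigDecimal values to doubles. A sketch of exercising that path (session, data, and KLL parameters are invented; KLLSketch and KLLParameters are the existing deequ analyzer API):

import com.amazon.deequ.analyzers.{KLLParameters, KLLSketch}
import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

object DecimalKllSketch extends App {
  val spark = SparkSession.builder().master("local[*]").appName("kll-decimal").getOrCreate()
  import spark.implicits._

  // Amounts cast to DecimalType, which previously hit the IllegalArgumentException branch
  val df = Seq("1.10", "2.25", "3.50", "7.75")
    .toDF("amount")
    .select(col("amount").cast(DecimalType(10, 2)).as("amount"))

  val result = AnalysisRunner
    .onData(df)
    // sketchSize, shrinkingFactor, numberOfBuckets (illustrative values)
    .addAnalyzer(KLLSketch("amount", Some(KLLParameters(2048, 0.64, 10))))
    .run()

  AnalyzerContext.successMetricsAsDataFrame(spark, result).show(truncate = false)
}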
@@ -43,6 +43,7 @@ private[examples] object DataProfilingExample extends App {
any shuffles. */
val result = ColumnProfilerRunner()
.onData(rawData)
.nonOptimized()
.run()

/* We get a profile for each column which allows to inspect the completeness of the column,
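The example now calls nonOptimized() on the profiler builder. A sketch of how the two builder variants might sit side by side (rawData stands for any DataFrame, as in the example above; the assumption is that nonOptimized() opts out of the fork's optimized profiling path from HOPSWORKS-2681 and falls back to the original multi-pass profiler):

import com.amazon.deequ.profiles.ColumnProfilerRunner
import org.apache.spark.sql.DataFrame

object ProfilerComparisonSketch {
  def profileBothWays(rawData: DataFrame): Unit = {
    // Default path: the fork's optimized profiling
    val optimized = ColumnProfilerRunner()
      .onData(rawData)
      .run()

    // Opt-out path, as used in the updated example
    val reference = ColumnProfilerRunner()
      .onData(rawData)
      .nonOptimized()
      .run()

    // Both return profiles keyed by column name
    println(optimized.profiles.keySet == reference.profiles.keySet)
  }
}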