- The official code repository has been moved from GitHub to Wikimedia GitLab (T304544).
- The annoying Pandas warning triggered by `mariadb.run` and `hive.run` has been suppressed (T324135).
- Setuptools is now specified as the build system in `pyproject.toml`, complying with PEP 517 and PEP 518 (T378254).
- The package version is now available at runtime from the `__version__` attribute (T356708).
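  For example (a minimal sketch; the printed value depends on the installed release):

  ```python
  import wmfdata

  # The installed version is now exposed as a standard attribute
  print(wmfdata.__version__)
  ```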
- Conda-Pack (used in the `conda` module) and Tabulate (used in `utils.df_to_remarkup`) are now properly specified as dependencies (T370718, T378430).
- The `presto` module now follows the DNS alias `analytics-presto.eqiad.wmnet` to connect to the current Presto coordinator, rather than being hardcoded to a particular coordinator. This allows Wmfdata to adapt automatically when the coordinator role is switched to a new server (T345482).
- The version pin and warning-handling code for Urllib3 has been removed, thanks to the updated certificate bundle added in 2.2.0.
- The CA bundle used for establishing a TLS connection with Presto has been updated to the new combined bundle. This supports certificates signed by the legacy Puppet 5 built-in certificate authority as well as the newer certificates signed by the WMF PKI system.
- Improved the formatting of `df_to_remarkup`.
- Urllib3 is now pinned below 2.0 to avoid errors when querying Presto (T345309).
- Matplotlib is no longer specified as a dependency, since the `charting` module was removed in 2.0.
- `utils.pd_display_all` uses a different method to disable Pandas's `display.max_colwidth` limit in order to support Pandas 2.0 and higher.
- Various documentation improvements.
- 🚨 Breaking change: `spark.get_session` and `spark.get_custom_session` have been renamed to `spark.create_session` and `spark.create_custom_session`. If a session already exists, these functions now stop it and create a new session rather than returning the existing one and silently ignoring the passed settings. Use the new `get_active_session` function if you just want to retrieve the active session.
- 🚨 Breaking change: `spark.run` no longer has the ability to specify Spark settings; the `session_type` and `extra_settings` parameters have been removed. If no session exists, a default "yarn-regular" session will be created. If you want to customize the session, use `create_session` or `create_custom_session` first.
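  A minimal sketch of the new workflow (the `app_name` argument shown for `create_session` and the exact signatures are assumptions; check the function docstrings for details):

  ```python
  import wmfdata as wmf

  # Explicitly create (or recreate) a customized session first...
  session = wmf.spark.create_session(app_name="my-analysis")  # parameter names are assumptions

  # ...then run SQL; spark.run uses the active session, or creates a default
  # "yarn-regular" one if none exists.
  df = wmf.spark.run("SHOW DATABASES")

  # Retrieve the already-running session without stopping and recreating it
  active = wmf.spark.get_active_session()
  ```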
- Wmfdata no longer tries to close Spark sessions after 30 minutes of inactivity. Please be conscious of your resource use and shut down notebooks when you are done with them.
- Breaking change: the internal `get_application_id`, `cancel_session_timeout`, `stop_session`, and `start_session_timeout` functions used to automatically close sessions have been removed.
- Spark 3 will automatically be used instead of Spark 2 inside new `conda-analytics` environments.
- Breaking change: `hive.run` only provides results as a Pandas dataframe. The "raw" format and the `format` parameter have been removed.
- Breaking change: `hive.run_cli` has been removed. Use `hive.run` instead.
- Breaking change: the `heap_size` and `engine` parameters have been removed from `hive.run`.
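  For example, querying Hive now always yields a Pandas dataframe (a minimal sketch; the query is a placeholder):

  ```python
  import wmfdata as wmf

  # hive.run always returns a Pandas dataframe; the former "raw" format
  # and the format parameter are gone.
  df = wmf.hive.run("SHOW DATABASES")
  print(df.head())
  ```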
- Breaking change: `mariadb.run` only provides results as a Pandas dataframe. The "raw" format and the `format` parameter have been removed.
- Breaking change: The `charting` module has been removed.
- Warnings and notices have been streamlined and improved.
- A completely overhauled quickstart notebook provides a comprehensive introduction to Wmfdata's features.
- `mariadb.run` now uses the MariaDB Python connector library rather than the MySQL one, which fixes several errors (T319360).
- `utils` now includes an `sql_tuple` function which makes it easy to format a Python list or tuple for use in an SQL `IN` clause.
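  A minimal sketch of `sql_tuple` together with `mariadb.run` (the query, wiki names, and the exact `mariadb.run` signature are placeholders/assumptions; check the docstrings for details):

  ```python
  import wmfdata as wmf

  # sql_tuple formats a Python list for an SQL IN clause, e.g. "('bot', 'flood')"
  groups = wmf.utils.sql_tuple(["bot", "flood"])

  query = f"SELECT ug_group, COUNT(*) AS users FROM user_groups WHERE ug_group IN {groups} GROUP BY ug_group"

  # mariadb.run returns a Pandas dataframe; a list of databases can be passed
  results = wmf.mariadb.run(query, ["enwiki", "dewiki"])
  ```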
- Improve handling of nulls and blobs in binary fields (#29)
- Switch to PyHive as a Hive interface (#24)
- `mariadb.run` now returns binary text data as Python strings rather than bytearrays when using version 8.0.24 or higher of `mysql-connector-python`.
- Spark session timeouts now run properly, rather than failing because `spark.stop_session` is called without a required argument. This fixes a bug introduced in version 1.1.
- The recommended install and update command now includes the `--ignore-installed` flag to avoid an error caused by Pip attempting to uninstall the version of the package in the read-only `anaconda-wmf` base environment.
- A new integration test script (`wmfdata_tests/tests.py`) makes it easy to verify that this package and the analytics infrastructure function together to allow users to access data.
- Spark session timeouts now run on a daemon thread so that scripts using the `spark` module can exit correctly.
- `spark.run` no longer produces an error when an SQL command without output is run.
- The documentation now provides detailed instructions for setting up the package in development mode given the new Conda setup on the analytics clients.
- With the help of a new `conda` module, the `spark` module has been revised to take advantage of the capabilities of the new conda-based Jupyter setup currently in development. Most significantly, the module now supports shipping a local conda environment to the Spark workers if custom dependencies are needed.
- The on-import update check now times out after 1 second to prevent long waits when the web request fails to complete (as when the `https_proxy` environment variable is not set).
- `hive.load_csv` now properly defines the file format of the new table. The previous behavior started to cause errors after Hive's default file format was changed from text to Parquet.
- The new `presto` module supports querying the Data Lake using Presto.
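  For example (a sketch assuming the module exposes a `run` function that mirrors `hive.run` and `spark.run`; the query is a placeholder):

  ```python
  import wmfdata as wmf

  # Query the Data Lake via Presto (the run function and its return type
  # are assumed to match the other query modules)
  df = wmf.presto.run("SELECT 1 AS n")
  ```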
- The `spark` module has been refactored to support local and custom sessions.
- A new `utils.get_dblist` function provides easy access to wiki database lists, which is particularly useful with `mariadb.run`.
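  For example (a sketch; the dblist name and the exact `get_dblist` signature are assumptions, and the query is a placeholder):

  ```python
  import wmfdata as wmf

  # Fetch a wiki database list (the "wikipedia" list name is an assumption)
  wikis = wmf.utils.get_dblist("wikipedia")

  # ...and query each of those wikis' MariaDB replicas
  results = wmf.mariadb.run("SELECT COUNT(*) AS pages FROM page", wikis)
  ```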
- The `hive.run_cli` function now creates its temp files in a standard location, to avoid creating distracting new entries in the current working directory.
- The code and documentation now reflect the repository's new location (github.com/wikimedia/wmfdata-python).
- The repository now contains a pointer to the code of conduct for Wikimedia technical spaces.
- The new version notice that displays on import now shows the correct URL for updating.
- The MariaDB module now looks in the appropriate place for the database credentials based on the user's access group.
- The minimum required Pandas version has been set to 0.20.1, which introduced the `errors` module (the version preinstalled on the Analytics Clients is 0.19.2).
- The submodules are now imported in a different order to avoid dependency errors.
- You can now run SQL using Hive's command line interface with `hive.run_cli`.
- The `hive.run` function is now a thin wrapper around `hive.run_cli`.
- Breaking change: the `spark_master`, `spark_settings`, and `app_name` parameters of `hive.run` have been removed, since the function no longer uses Spark.
- Breaking change: The `fmt` parameter of `hive.run` has been renamed to `format` for consistency with the other `run` functions.
- Breaking change: `hive.load_csv` now drops and recreates the destination table if necessary, so the passed field specification is always used.
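  For example (a sketch; the parameter names and order are assumptions, and the file, field, and table names are placeholders):

  ```python
  import wmfdata as wmf

  # Load a local CSV into a Hive table. The destination table is dropped and
  # recreated if needed, so the field specification below is always applied.
  wmf.hive.load_csv(
      "my_data.csv",
      field_spec="wiki string, views int",
      db_name="my_database",
      table_name="my_table",
  )
  ```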
- Spark SQL functionality has been moved from `hive.run` to `spark.run` so that Spark-specific and Hive CLI-specific settings parameters are not mixed in one function.
- `spark.get_session` and `spark.run` now use a preconfigured "regular" session by default, with the ability to switch to a preconfigured "large" session or override individual settings if necessary.
- Breaking change: the deprecated `mariadb.multirun` function has been removed. Pass a list of databases to `mariadb.run` instead.
- Breaking change: In `mariadb.run`, the format name `"tuples"` has been removed, leaving only its synonym `"raw"` for greater consistency with the other `run` functions.
- The `mariadb` module is now available after `import wmfdata`.
- The on-import check for newer versions of the package has been improved.
  - A version message is only shown if a newer version is available.
  - The check emits a warning instead of raising an error if it fails, making it more robust to future changes in file layout or code hosting.
- `charting.set_mpl_style` no longer causes an import error.
- Many other programming, code documentation, and code style improvements.
- If you call `spark.run` or `spark.get_session` while a Spark session is already running, passing different settings will have no effect. You must restart the kernel before selecting different settings.