========
5.5.1
========
----------------
New Features & Enhancements
----------------
* `BertForMultipleChoice` Transformer Added. Enhanced BERT’s capabilities to handle multiple-choice tasks such as standardized test questions and survey or quiz automation.
* Integrated New Tasks and Documentation:
* Added support and documentation for the following tasks:
* Automatic Speech Recognition
* Dependency Parsing
* Image Captioning
* Image Classification
* Landing Page
* Question Answering
* Summarization
* Table Question Answering
* Text Classification
* Text Generation
* Text Preprocessing
* Token Classification
* Translation
* Zero-Shot Classification
* Zero-Shot Image Classification
* `PromptAssembler` Annotator Introduced. Introduced a new annotator that constructs prompts for LLMs using a chat template and a sequence of messages. Accepts an array of tuples with roles (“system”, “user”, “assistant”) and message texts. Utilizes llama.cpp as a backend for template parsing, supporting basic template applications.
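A minimal usage sketch for the new `PromptAssembler` follows; the import path, the `setInputCol`/`setOutputCol` parameter names, and the input column schema (an array of role/text structs) are assumptions based on the description above, not confirmed API.
```python
import sparknlp
from sparknlp.base import PromptAssembler  # import path assumed

spark = sparknlp.start()

# Assumed input schema: one array of (role, text) structs per row.
data = [([("system", "You are a helpful assistant."),
          ("user", "What is Spark NLP?")],)]
df = spark.createDataFrame(data, schema="messages array<struct<role:string,text:string>>")

prompt_assembler = (
    PromptAssembler()          # parameter names below are assumptions
    .setInputCol("messages")
    .setOutputCol("prompt")
)

prompt_assembler.transform(df).select("prompt.result").show(truncate=False)
```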
----------------
Bug Fixes
----------------
* Resolved Pretrained Model Loading Issue on DBFS Systems.
* Fixed a bug where pretrained models were not found when running AutoGGUF model pipelines on Databricks due to incorrect path handling of gguf files.
========
5.5.0
========
----------------
New Features & Enhancements
----------------
* Introduced QWEN2Transformer (#14188)
* Introduced MiniCPM (#14205)
* Introduced NLLB (#14209)
* Implemented Nomic embeddings (#14217)
* Introduced CamemBertForZeroShotClassification annotator (#14354)
* Implemented Mxbai Embeddings (#14355)
* Introduced AlbertForZeroShotClassification (#14361)
* Introduced Phi-3 (#14373)
* Implemented Starcoder2 for causal language modeling (#14358)
* Integrated llama.cpp (#14364)
* Implemented SnowFlake (#14353)
* Introduced ONNX support to vision annotators (#14356)
* Introduced ONNX and OpenVINO support to Missing Annotators (#14359)
* Added OpenVINO install instructions (#14382)
* Exported notebooks for release candidate (#14393)
========
5.4.2
========
----------------
New Features & Enhancements
----------------
* Added demo notebook for Image Classification Annotators
* Added aggressiveMatching parameter to DateMatcher and MultiDateMatcher annotators
* Added aggressiveMatching parameter to DocumentSimilarityRanker annotator
========
5.4.1
========
----------------
New Features & Enhancements
----------------
* Added support for loading duplicate models in Spark NLP, allowing multiple models from the same annotator to be loaded simultaneously.
* Updated the README for better coherence and added new pages to the website.
* Added support for a stop IDs list to halt text generation in Phi, Mistral, and Llama annotators.
----------------
Bug Fixes
----------------
* Fixed the default model names for Phi2 and Mistral AI annotators.
========
5.4.0
========
----------------
New Features & Enhancements
----------------
* Added OpenVINO Runtime integration for various models, enabling enhanced inference performance. (#14246)
* Added Python APIs to incorporate OpenVINO support. (#14242)
* Introduced support for ONNX models and average pooling in ONNX-based annotators. (#14245)
* Implemented MPNet for token classification. (#14244)
* Added support for MistralAI LLM and LLAMA2. (#14243)
* Improved caching mechanisms in Streamlit demos. (#14241)
* Enhanced models' card and README documentation for Models Hub. (#14240)
* Added OpenVINO GPU dependencies. (#14236)
* Locked macOS version for runners and added missing SBT setup. (#14235)
----------------
Bug Fixes
----------------
* Fixed bugs in Colab notebooks. (#14239)
* Resolved issues with BERT backend and broken annotators. (#14238)
* Corrected LLAMA2 position ID and generation bug. (#14237)
========
5.3.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introduce UAEEmbeddings for sentence embeddings using Universal AnglE Embedding, aimed at improving semantic textual similarity tasks
* Introduce critical enhancements and optimizations to the processing of the CoNLL-U format for Dependency Parsers training, including enhanced multiword token handling and improved handling of missing uPos values
* Add example notebook for `DocumentCharacterTextSplitter`
* Add example notebook for `DeBertaForZeroShotClassification`
* Add example notebooks for `BGEEmbeddings` and `MPNetEmbeddings`
* Add example notebook for `MPNetForQuestionAnswering`
* Add example notebook for `MPNetForSequenceClassification`
* Implement cache mechanism for `metadata.json`, enhancing efficiency by avoiding unnecessary downloads
----------------
Bug Fixes
----------------
* Address a bug with serializing ONNX models that lack a `.onnx_data` file, ensuring better reliability in model serialization processes
* Delete redundant `Multilingual_Translation_with_M2M100.ipynb` notebook entries
* Fix Colab link for the M2M100 notebook
========
5.3.2
========
----------------
Bug Fixes
----------------
* Fix and add notebooks to import models from Hugging Face
* Add ONNX and TensorFlow notebooks
* Fix XlnetForSequenceClassification and add XlnetForTokenClassification
* Rename DistilBertForZeroShotClassification
* Add missing notebooks
* Add MPNetEmbeddings to annotator
* Fix XLMRoBertaForQuestionAnswering, XLMRoBertaForTokenClassification, and XLMRoBertaForSequenceClassification: Reverted the change in tfFile naming that was causing exceptions while loading and saving the models
* Fix documentation for sparknlp.start()
========
5.3.1
========
----------------
Bug Fixes
----------------
* Fix M2M100 not working on the second run (closing the ONNX Session by mistake)
* Fix ONNX models failing in clusters like Databricks
* Fix `ZeroShotNerClassification` issue with NerConverter
* Add Colab notebook for M2M100
========
5.3.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing Llama-2 and all the models fine-tuned based on this architecture. This is our very first CausalLM annotator in ONNX, and it comes with support for quantization in INT4 and INT8 for CPUs.
* **NEW:** Introducing `MPNetForSequenceClassification` annotator for sequence classification tasks. This annotator is based on the MPNet architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Introducing `MPNetForQuestionAnswering` annotator for question answering tasks. This annotator is based on the MPNet architecture and is designed to answer questions based on a given context.
* **NEW:** Introducing `M2M100` state-of-the-art multilingual translation. M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation. The model can directly translate between the 9,900 directions of 100 languages.
* **NEW:** Introducing a new `DeBertaForZeroShotClassification` annotator for zero-shot classification tasks. This annotator is based on the DeBERTa architecture and is designed to classify sequences of text into a set of predefined classes.
* **NEW:** Implement a retrieval feature in our `DocumentSimilarityRanker` annotator. The new DocumentSimilarity ranker is a powerful tool for ranking documents based on their similarity to a given query document. It is designed to be efficient and scalable, making it ideal for a variety of RAG applications.
* Add ONNX support for the `BertForZeroShotClassification` annotator.
* Add support for in-memory use of the `WordEmbeddingsModel` annotator in serverless clusters. We initially introduced the in-memory feature for this annotator for users on Kubernetes clusters without any `HDFS`; today it also runs without any issue locally and on Google `Colab`, `Kaggle`, `Databricks`, `AWS EMR`, `GCP`, and `AWS Glue` (see the sketch after this list).
* New Whisper Large and Distil models.
* Update ONNX Runtime to 1.17.0
* Support new Databricks Runtimes of 14.2, 14.3, 14.2 ML, 14.3 ML, 14.2 GPU, 14.3 GPU
* Support new EMR 6.15.0 and 7.0.0 versions
* Add notebook to fine-tune a BERT model for Sentence Embeddings in Hugging Face and import it into Spark NLP
* Add notebook to import BERT for Zero-Shot classification from Hugging Face
* Add notebook to import DeBERTa for Zero-Shot classification from Hugging Face
* Update EntityRuler documentation
* Improve SBT project and resolve warnings (almost!)
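A minimal sketch of the in-memory `WordEmbeddingsModel` mentioned above, assuming the usual `glove_100d` default model; `setEnableInMemoryStorage` is the switch referenced elsewhere in this changelog.
```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Keep the embeddings lookup in memory instead of on-disk storage, which is
# what makes this annotator usable on serverless clusters without HDFS.
embeddings = (
    WordEmbeddingsModel.pretrained("glove_100d", "en")
    .setInputCols(["document", "token"])
    .setOutputCol("embeddings")
    .setEnableInMemoryStorage(True)
)

df = spark.createDataFrame([("Spark NLP keeps these vectors in memory.",)], ["text"])
Pipeline(stages=[document, tokenizer, embeddings]).fit(df).transform(df) \
    .select("embeddings.embeddings").show(1)
```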
----------------
Bug Fixes
----------------
* Fix Spark NLP configuration to set `cluster_tmp_dir` on Databricks' DBFS via `spark.jsl.settings.storage.cluster_tmp_dir` https://github.com/JohnSnowLabs/spark-nlp/issues/14129
* Fix score calculation in `RoBertaForQuestionAnswering` annotator https://github.com/JohnSnowLabs/spark-nlp/pull/14147
* Fix optional input col validations https://github.com/JohnSnowLabs/spark-nlp/pull/14153
* Fix notebooks for importing DeBERTa classifiers https://github.com/JohnSnowLabs/spark-nlp/pull/14154
* Fix GPT2 deserialization over the cluster (Databricks) https://github.com/JohnSnowLabs/spark-nlp/pull/14177
========
5.2.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in XLMRoBertaForQuestionAnswering annotator
* Refactor AWS SDK use in Spark NLP to reduce the overall size of the library. We have dropped the use of `bundle` and started using the `S3` SDK directly. This will also minimize incompatibilities with other libraries that use AWS SDKs
* Add new notebooks to import DeBertaForQuestionAnswering, DebertaForSequenceClassification, and DeBertaForTokenClassification models from HuggingFace
* Add a new `DocumentTokenSplitter` notebook
* Add a new training NER notebook by using DeBerta Embeddings
* Add a new training text classification notebook by using INSTRUCTOR Embeddings
* Update `RoBertaForTokenClassification` notebook
* Update `RoBertaForSequenceClassification` notebook
* Update `OpenAICompletion` notebook with new `gpt-3.5-turbo-instruct` model
----------------
Bug Fixes
----------------
* Fix `BGEEmbeddings` not downloading in Python
========
5.2.2
========
----------------
Enhancements
----------------
* Update `aws-java-sdk-bundle` dependency to a version without any CVEs
----------------
Bug Fixes
----------------
* Fix the missing `BGEEmbeddings` from annotator in Python
* Add a new BGE notebook to import models into Spark NLP
* Upload the new true `BGE` models to Spark NLP for text embeddings
========
5.2.1
========
----------------
New Features & Enhancements
----------------
* Add support for Spark and PySpark 3.5 major release
* Support Databricks Runtimes of 14.0, 14.1, 14.2, 14.0 ML, 14.1 ML, 14.2 ML, 14.0 GPU, 14.1 GPU, and 14.2 GPU
* **NEW:** Introducing the `BGEEmbeddings` annotator for Spark NLP. This annotator enables the integration of `BGE` models, based on the BERT architecture, into Spark NLP. The `BGEEmbeddings` annotator is designed for generating dense vectors suitable for a variety of applications, including `retrieval`, `classification`, `clustering`, and `semantic search`. Additionally, it is compatible with `vector databases` used in `Large Language Models (LLMs)`.
* **NEW:** Introducing support for ONNX Runtime in DeBertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DeBertaForQuestionAnswering annotator
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with TensorFlow format
* Add a new notebook to show how to import any model from `T5` family into Spark NLP with ONNX format
* Add a new notebook to show how to import any model from `MarianNMT` family into Spark NLP with ONNX format
----------------
Bug Fixes
----------------
* Fix serialization issue in `DocumentTokenSplitter` annotator failing to be saved and loaded in a Pipeline
* Fix serialization issue in `DocumentCharacterTextSplitter` annotator failing to be saved and loaded in a Pipeline
========
5.2.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `CLIPForZeroShotClassification` for Zero-Shot Image Classification using OpenAI's CLIP models
* **NEW:** Introducing the `DocumentTokenSplitter` which allows users to split large documents into smaller chunks to be used in RAG with LLM models (see the sketch after this list)
* **NEW:** Introducing support for ONNX Runtime in T5Transformer annotator
* **NEW:** Introducing support for ONNX Runtime in MarianTransformer annotator
* **NEW:** Introducing support for ONNX Runtime in BertSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in XlmRoBertaSentenceEmbeddings annotator
* **NEW:** Introducing support for ONNX Runtime in CamemBertForQuestionAnswering, CamemBertForTokenClassification, and CamemBertForSequenceClassification annotators
* Add caching support for newly imported T5 models in TF format to make their performance competitive with the ONNX version
* Improve ZIP util and add tests for both ZipArchiveUtil and OnnxWrapper
* Refactor ONNX and add OnnxSession to broadcast
* Update ONNX Runtime to 1.16.3
* Add a new notebook for Structured Streaming
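A minimal sketch of the `DocumentTokenSplitter` introduced above for chunking documents ahead of RAG; the import path and the `setNumTokens`/`setTokenOverlap` parameter names are assumptions about the splitter's API.
```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentTokenSplitter  # import path assumed
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Assumed parameters: a chunk size measured in tokens plus an overlap
# between consecutive chunks.
splitter = (
    DocumentTokenSplitter()
    .setInputCols(["document"])
    .setOutputCol("chunks")
    .setNumTokens(128)
    .setTokenOverlap(16)
)

df = spark.createDataFrame([("A long document to be chunked. " * 200,)], ["text"])
Pipeline(stages=[document, splitter]).fit(df).transform(df) \
    .selectExpr("explode(chunks.result) as chunk").show(truncate=80)
```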
----------------
Bug Fixes
----------------
* Fix random dimension mismatch in E5Embeddings and MPNetEmbeddings due to a missing average_pool after last_hidden_state in the output
* Fix batching exception in E5 and MPNet embeddings annotators failing when sentence is used instead of document
* Fix chunk construction when an entity is found
* Fix a bug in library's version in Scala
* Fix Whisper models not downloading due to wrong library's version
* Fix and refactor saving best model based on given metrics during NerDL training
========
5.1.4
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing the `DocumentCharacterTextSplitter` which allows users to split large documents into smaller chunks. `DocumentCharacterTextSplitter` takes a list of separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks (see the sketch after this list).
* **NEW:** Introducing support for ONNX Runtime in RobertaForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in RobertaForQuestionAnswering annotator
* Adding an example to load a model directly from Azure using .load() method. This example helps users to understand how to set Spark NLP to load models from Azure
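A minimal sketch of the `DocumentCharacterTextSplitter` described above; the import path and the `setSplitPatterns`/`setChunkSize`/`setChunkOverlap` parameter names are assumptions inferred from that description.
```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DocumentCharacterTextSplitter  # import path assumed
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Assumed parameters: separators tried in the given order, a maximum chunk
# length in characters, and an optional overlap between chunks.
splitter = (
    DocumentCharacterTextSplitter()
    .setInputCols(["document"])
    .setOutputCol("chunks")
    .setSplitPatterns(["\n\n", "\n", " "])
    .setChunkSize(512)
    .setChunkOverlap(64)
)

df = spark.createDataFrame([("First paragraph.\n\nSecond paragraph.",)], ["text"])
Pipeline(stages=[document, splitter]).fit(df).transform(df) \
    .selectExpr("chunks.result").show(truncate=False)
```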
----------------
Bug Fixes
----------------
* Fix a bug in the `Whisper` annotator that would not allow every model to be imported
* Fix BPE Tokenizer to include a flag whether or not to always prepend a space before words (previous behavior for embeddings)
* Fix BPE Tokenizer to correctly convert and tokenize non-Latin and other special characters/words
* Fix `RobertaForQuestionAnswering` to produce the same logits and indexes as the implementation in the Transformers library
* Fix the return order of logits in `BertForQuestionAnswering` and `DistilBertForQuestionAnswering` annotators
========
5.1.3
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in BertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in BertForQuestionAnswering annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in DistilBertForQuestionAnswering annotator
* **NEW:** Setting ONNX configuration such as GPU device id, execution mode, etc. via Spark NLP configs (see the sketch after this list)
* Update Whisper documentation with minimum required version of Spark/PySpark (3.4)
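A sketch of passing the new ONNX options as Spark configs at session start; the `spark.jsl.settings.onnx.*` keys and their values below are assumptions and should be checked against the configuration documentation.
```python
from pyspark.sql import SparkSession

# All `spark.jsl.settings.onnx.*` keys here are assumed names for the new
# ONNX settings; verify them in the Spark NLP configuration docs.
spark = (
    SparkSession.builder
    .appName("spark-nlp-onnx-config")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")
    .config("spark.jsl.settings.onnx.gpuDeviceId", "0")
    .config("spark.jsl.settings.onnx.executionMode", "sequential")
    .getOrCreate()
)
```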
----------------
Bug Fixes
----------------
* Fix `module 'sparknlp.annotator' has no attribute 'Token2Chunk'` error in Python when using `Token2Chunk` annotator inside loaded PipelineModel
========
5.1.2
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing VisionEncoderDecoder annotator to generate captions from images
* Add missing entries in the docs and update them with the new features
* Improve beam search results in BART Transformer
========
5.1.1
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in MPNet embedding annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForTokenClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForSequenceClassification annotator
* **NEW:** Introducing support for ONNX Runtime in AlbertForQuestionAnswering annotator
* Implement `getVectors` feature in Word2VecModel, Doc2VecModel, and WordEmbeddingsModel annotators. This new feature allows access to the entire tokens and their vectors in the loaded model.
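A minimal sketch of the new `getVectors` accessor on a loaded embeddings model; the exact return type is not shown in this changelog, so treat the access pattern below as an assumption.
```python
import sparknlp
from sparknlp.annotator import WordEmbeddingsModel

spark = sparknlp.start()

embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en")

# New in 5.1.1: expose every token and its vector from the loaded model.
# The concrete return type (e.g. a DataFrame or a token -> vector mapping)
# is assumed here; inspect it before relying on a particular shape.
vectors = embeddings.getVectors()
print(type(vectors))
```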
----------------
Bug Fixes
----------------
* Fix how to save and load `Whisper` models
* Fix saving ONNX model on Windows operating system
========
5.1.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing WhisperForCTC annotator for Automatic Speech Recognition (ASR)
* **NEW:** Introducing OpenAICompletion and OpenAIEmbeddings annotators
* **NEW:** Introducing MPNet Text Embeddings annotators
* **NEW:** Introducing a new BART for Zero-Shot Text Classification annotator
* **NEW:** Adding ONNX support to E5 Embeddings annotator
* **NEW:** New full support for GCP and Azure distributed storages
* New 150+ MPNet models
* New Databricks 13.3 runtime support
* New EMR 6.12.0 version support
----------------
Bug Fixes
----------------
* Fix max sentence length issue in E5Embeddings
========
5.0.2
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in ALBERT, CamemBERT, and XLM-RoBERTa annotators
* **NEW:** Implement ZeroShotNerModel annotator for zero-shot NER based on XLM-RoBERTa architecture
----------------
Bug Fixes
----------------
* Fix MarianTransformers annotator breaking with `java.lang.ClassCastException` in Python
* Fix out of 0.0/1.0 accuracy in SentenceDetectorDL and MultiClassifierDL annotators
* Fix BART issue with a low temperature value that only occurred when there were no non-infinite logits satisfying the low temperature and top_k values
* Add missing E5Embeddings and InstructorEmbeddings annotators to `annotators` in Scala for easy all-in-one import
========
5.0.1
========
----------------
Bug Fixes & Enhancements
----------------
* Fix `multiLabel` param issue in `XXXForSequenceClassification` and `XXXForZeroShotClassification` annotators
* Add the missing `threshold` param to all `XXXForSequenceClassification` annotators in Python
* Fix issue with passing `spark.driver.cores` config as a param into start() function in Python and Scala
* Add new notebooks to export BERT, DistilBERT, RoBERTa, and DeBERTa models to ONNX format
========
5.0.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing support for ONNX Runtime in Spark NLP. ONNX Runtime is a high-performance inference engine for machine learning models in the ONNX format. ONNX Runtime has proved to considerably increase the performance of inference for many models.
* **NEW:** Introducing **InstructorEmbeddings** annotator in Spark NLP 🚀. `InstructorEmbeddings` can load new state-of-the-art INSTRUCTOR Models inherited from T5 for Text Embeddings.
* **NEW:** Introducing **E5Embeddings** annotator in Spark NLP 🚀. `E5Embeddings` can load new state-of-the-art E5 Models inherited from BERT for Text Embeddings.
* **NEW:** Introducing **DocumentSimilarityRanker** annotator in Spark NLP 🚀. `DocumentSimilarityRanker` is a new annotator that uses LSH techniques present in Spark MLlib to execute approximate nearest neighbours search on top of sentence embeddings. It aims to capture the semantic meaning of a document in a dense, continuous vector space and return it to the ranker search.
----------------
Bug Fixes
----------------
* Fix BART issue with maxInputLength
========
4.4.4
========
----------------
New Features & Enhancements
----------------
* Add `Warmup` stage to loading all Transformers for word embeddings: ALBERT, BERT, CamemBERT, DistilBERT, RoBERTa, XLM-RoBERTa, and XLNet. This helps reduce the first inference time and also validates importing external models from HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/13851
* Add new notebooks to import ZeroShot Classifiers for Bert, DistilBERT, and RoBERTa fine-tuned based on NLI datasets https://github.com/JohnSnowLabs/spark-nlp/pull/13845
----------------
Bug Fixes
----------------
* Fix not being able to save models from XXXForSequenceClassification and XXXForZeroShotClassification annotators https://github.com/JohnSnowLabs/spark-nlp/pull/13842
========
4.4.3
========
----------------
New Features & Enhancements
----------------
* New `multilabel` parameter to switch from multi-class to multi-label on all Classifiers in Spark NLP: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, XlnetForSequenceClassification, BertForZeroShotClassification, DistilBertForZeroShotClassification, and RobertaForZeroShotClassification
* Refactor protected Params and Features to avoid unwanted exceptions during runtime https://github.com/JohnSnowLabs/spark-nlp/pull/13797
* Add proper documentation and instructions for ZeroShot classifiers: BertForZeroShotClassification, DistilBertForZeroShotClassification, and RobertaForZeroShotClassification https://github.com/JohnSnowLabs/spark-nlp/pull/13798
* Extend support for downloading models/pipelines directly by given name or S3 path in ResourceDownloader https://github.com/JohnSnowLabs/spark-nlp/pull/13796
----------------
Bug Fixes
----------------
* Fix pretrained pipelines that stopped working since the 4.4.2 release on PySpark 3.0 and 3.1 versions (123 new pipelines were added) https://github.com/JohnSnowLabs/spark-nlp/pull/13805
* Fix pretrained pipelines that stopped working since the 4.4.2 release on PySpark 3.2 and 3.3 versions (120 new pipelines were added) https://github.com/JohnSnowLabs/spark-nlp/pull/13811
* Fix Java compatibility issue caused by SystemUtils dependency https://github.com/JohnSnowLabs/spark-nlp/pull/13806
========
4.4.2
========
----------------
New Features & Enhancements
----------------
* Implement a new Zero-Shot Text Classification for RoBERTa annotator called `RobertaForZeroShotClassification`
* Support Apache Spark 3.4
* Optimize BART models for memory efficiency
* Introducing `cache` feature in BartTransformer
* Improve error handling for max sequence length for transformers in Python
* Improve `MultiDateMatcher` annotator to return multiple dates
----------------
Bug Fixes
----------------
* Fix a bug in Tapas due to exceeding the maximum rank value
* Fix loading Transformer models via loadSavedModel() method from DBFS on Databricks
========
4.4.1
========
----------------
New Features & Enhancements
----------------
* Implement a new Zero-Shot Text Classification for DistilBERT annotator called `DistilBertForZeroShotClassification`
* Adding `threshold` param to `AlbertForSequenceClassification`, `BertForSequenceClassification`, `BertForZeroShotClassification`, `DistilBertForSequenceClassification`, `CamemBertForSequenceClassification`, `DeBertaForSequenceClassification`, `LongformerForSequenceClassification`, `RoBertaForQuestionAnswering`, `XlmRoBertaForSequenceClassification`, and `XlnetForSequenceClassification` annotators
* Add new notebooks to import models for `SwinForImageClassification` and `ConvNextForImageClassification` annotators for Image Classification
========
4.4.0
========
----------------
New Features
----------------
* Implement a new Zero-Shot Text Classification for BERT annotator called `BertForZeroShotClassification`
* Implement a new ConvNextForImageClassification annotator
* Introducing BART Transformer for text-to-text generation tasks like translation and summarization
* Set custom entity name in Data2Chunk via `setEntityName` param
* Add a new `nerHasNoSchema` param for NerConverter when labels coming from NerDLModel and NerCrfModel don't have any schema
----------------
Bug Fixes & Enhancements
----------------
* Fix loading `WordEmbeddingsModel` bug when loading a model from S3 via `cache_folder` config
* Fix `WordEmbeddingsModel` bug failing when it's used with `setEnableInMemoryStorage` set to `True` and LightPipeline
* Remove deprecated parameter enablePatternRegex from EntityRulerApproach & EntityRulerModel
* Deprecate Python 3.6
========
4.3.2
========
----------------
New Features & Enhancements
----------------
* Add S3 support for CoNLL(), POS(), CoNLLU() training classes https://github.com/JohnSnowLabs/spark-nlp/pull/13596
* Add support for non-schema NER (`I-` or `B-`) tags in NerConverter annotator https://github.com/JohnSnowLabs/spark-nlp/pull/13642
* Improve self-hosted examples with better documentation, Docker examples, no broken links, and more https://github.com/JohnSnowLabs/spark-nlp/pull/13575
* Improve error handling for validation evaluation in ClassifierDL and MultiClassifierDL trainable annotators https://github.com/JohnSnowLabs/spark-nlp/pull/13615
----------------
Bug Fixes
----------------
* Fix `Date2Chunk` and `Chunk2Doc` annotators compatibility with PipelineModel https://github.com/JohnSnowLabs/spark-nlp/pull/13609
* Fix `DependencyParserModel` predicting all Chunks as `<no-type>` https://github.com/JohnSnowLabs/spark-nlp/pull/13620
* Remove the `calculationsCol` parameter from MultiDocumentAssembler in Python, which doesn't actually exist https://github.com/JohnSnowLabs/spark-nlp/pull/13594
========
4.3.1
========
----------------
New Features
----------------
* Easily use external Tokenizers such as spaCy in Spark NLP pipeline
* Implement `params` parameter which can supply custom configurations to the SparkSession
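A minimal sketch of the new `params` argument; the keys shown are ordinary Spark settings used only for illustration.
```python
import sparknlp

# Forward custom SparkSession configurations through the new `params` argument.
spark = sparknlp.start(params={
    "spark.driver.cores": "4",
    "spark.driver.memory": "8g",
})

print(spark.version)
```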
----------------
Bug Fixes & Enhancements
----------------
* Add `entity` field to the metadata in Date2Chunk
* Fix ViT models & pipelines examples in Models Hub
========
4.3.0
========
----------------
New Features
----------------
* Implement HubertForCTC annotator for automatic speech recognition
* Implement SwinForImageClassification annotator for Image Classification
* Introducing CamemBERT for Question Answering annotator
* Implement ZeroShotNerModel annotator for zero-shot NER based on RoBERTa architecture
* Implement Date2Chunk annotator
* Enable params argument in spark_nlp start() function
* Allow reading `doc_id` from CoNLL file datasets
----------------
Bug Fixes & Enhancements
----------------
* Relocating all notebooks back to examples directory
* Improve downloading/loading models & pipelines from AWS and GCP. When the `cache_pretrained` directory is set to AWS or GCP, existing models/pipelines will not be copied again
* Improve GitHub templates for bug reports, documentation, and feature requests
* Add documentation to ResourceDownloader
* Refactor `ml` package to allow another DL engine in the future
* Apache Spark 3.3.1 is now the base version of Spark NLP
* Spark NLP supports M2 in addition to M1. Therefore, we are renaming `spark-nlp-m1` to `spark-nlp-silicon` on Maven
* Fix calculating delimiter id in CamemBERT
* Fix loadSavedModel for private buckets
========
4.2.8
========
----------------
Bug Fixes & Enhancements
----------------
* Fix the issue with optional keys (labels) in metadata when using XXXForSequenceClassification annotators. This fixes `Some(neg) -> 0.13602075` to be `neg -> 0.13602075`, in harmony with all the other classifiers. https://github.com/JohnSnowLabs/spark-nlp/pull/13396
* Introducing a config to skip `LightPipeline` validation for `inputCols` on the Python side for projects depending on Spark NLP. This toggle should only be used for specific annotators that do not follow the convention of predefined `inputAnnotatorTypes` and `outputAnnotatorType`.
========
4.2.7
========
----------------
Bug Fixes & Enhancements
----------------
* Fix `outputAnnotatorType` issue in pipelines with `Finisher` annotator. This change adds `outputAnnotatorType` to `AnnotatorTransformer` to avoid loading `outputAnnotatorType` attribute when a stage in pipeline does not use it.
* Fix the wrong sentence index calculation in metadata by annotators in the pipeline when `setExplodeSentences` param was set to `true` in SentenceDetector annotator
* Fix the issue in `Tokenizer` when a custom pattern is used with lookaheads/lookbehinds and it has zero-width matches. This led to indexes not being calculated correctly
* Fix missing to output embeddings in `.fullAnnotate()` method when `parseEmbeddings` param was set to `True/true`
* Fix broken links to the Python API pages, as the generation of the PyDocs was slightly changed in a previous release. This makes the Python APIs accessible from the Annotators and Transformers pages like before
* Change default values of `explodeEntities` and `mergeEntities` parameters to `true`
* Better error handling when there are empty paths/relations in `GraphExtraction` annotator. The new message will better guide the user on how to configure `GraphExtraction` to output meaningful relationships
* Remove the duplicated definition of method `setWeightedDistPath` from `ContextSpellCheckerApproach`
========
4.2.6
========
----------------
Enhancements
----------------
* Updating Spark & PySpark dependencies from 3.2.1 to 3.2.3 in provided scripts and in all the documentation
----------------
Bug Fixes
----------------
* Fix the broken TypedDependencyParserApproach and TypedDependencyParserModel annotators used in Python (this bug was introduced in 4.2.5 release)
* Fix the broken Python API documentation
========
4.2.5
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **CamemBertForSequenceClassification** annotator in Spark NLP 🚀. `CamemBertForSequenceClassification` can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForSequenceClassification` for PyTorch or `TFCamembertForSequenceClassification` for TensorFlow in HuggingFace 🤗
* **NEW:** Add `AnnotatorType` validation in Spark NLP `LightPipeline`. Currently, a misconfiguration of `inputCols` in an annotator in a pipeline raises an exception when using the `transform` method, but in `LightPipeline` it only outputs empty values. This behavior can confuse users; this change introduces a validation that will now raise an exception in `LightPipeline` too.
* Add outputAnnotatorType for all annotators in Python
* Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from `AnnotatorApproach` and `AnnotatorModel`
* Adding AnnotatorType validation in `LightPipeline`
* Add validation for the number and type of columns set in `TFNerDLGraphBuilder` annotator, in an effort to avoid wrong definitions of columns when using Spark NLP annotators in Python
* Add more details to Alphabet error message in `EntityRuler` annotator to better guide users
* Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
* Refactor and implement better error handling in ResourceDownloader. This change removes `getObjectFromS3`, allowing the AWS SDK to raise the corresponding error. In addition, this change also refactors ResourceDownloader to reflect the intention of each credential type on the downloader
* Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
* Upgrade `sbt-assembly` to `1.2.0`, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
* Update `sbt` to `1.8.0` with improvements and bug fixes, but mostly for CVEs fixes:
* Updates to Coursier 2.1.0-RC1 to address https://github.com/advisories/GHSA-wv7w-rj2x-556x
* Updates to Ivy 2.3.0-sbt-a8f9eb5bf09d0539ea3658a2c2d4e09755b5133e to address https://github.com/advisories/GHSA-wv7w-rj2x-556x
* Use the new withIncludeScala in assemblyOption instead of value
----------------
Bug Fixes
----------------
* Fix an issue with the `BigTextMatcher` Annotator, where it would not match entities with overlapping definitions. For Example, if both `lung` and `lung cancer` are defined, `lung` would not be matched in a given text. This was due to an abstraction error of one of the subclasses of the `BigTextMatcher` during construction of the underlying data structure
* Fix indexing issue for `RegexTokenizer` annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
* Refactor the `Resolvers` object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new `sbt`
========
4.2.4
========
----------------
New Features & Enhancements
----------------
* Introduce support for GCP storage to be allowed as `cache_pretrained` directory for keeping all downloaded models and pipelines (see the sketch after this list)
* Update to TensorFlow 2.7.4 with bug and CVEs fixes
* Update documentation on how to use `testDataset` param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach
* Update installation instructions for Apple M1 chip
* Improve error handling while importing external TensorFlow models into Spark NLP
* Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS
* Add support for future decoder-encoder models (2 separate models)
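A sketch of pointing the `cache_pretrained` directory at GCP storage, as mentioned above; the `spark.jsl.settings.pretrained.cache_folder` key and the `gs://` bucket path are assumptions to verify against the storage documentation.
```python
from pyspark.sql import SparkSession

# The cache-folder key below is an assumed name; the GCS connector and its
# credentials must also be configured separately for gs:// paths to resolve.
spark = (
    SparkSession.builder
    .appName("spark-nlp-gcp-cache")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4")
    .config("spark.jsl.settings.pretrained.cache_folder", "gs://my-bucket/cache_pretrained")
    .getOrCreate()
)
```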
----------------
Bug Fixes
----------------
* Add missing setPreservePosition in NerConverter
* Add missing inputAnnotatorTypes to BigTextMatcher, ViveknSentimentModel, and NerConverter annotators
* Fix all wrong example codes provided for LemmatizerModel in Models Hub
* Fix provided notebook to import Longformer models from HF: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/transformers/HuggingFace%20in%20Spark%20NLP%20-%20Longformer.ipynb
* Fix the t5_grammar_error_corrector model to be compatible with Spark NLP 4.0+
========
4.2.3
========
----------------
New Features & Enhancements
----------------
* Implement a new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when a user sets more columns than allowed inside setInputCols
* Add a metadata sentence key parameter to select which metadata field to use as the sentence for the CoNLLGenerator annotator
* Include escaping in CoNLLGenerator annotator when writing to CSV and preserve special char tokens
* Add documentation for new `IAnnotation` feature for Scala users
* Add rules and delimiter parameters to RegexMatcher annotator to support string as input in addition to a file
```python
from sparknlp.annotator import RegexMatcher

# Rules are provided inline as "pattern,identifier" pairs, split on the delimiter.
regexMatcher = RegexMatcher() \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
    .setDelimiter(",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")
```
----------------
Bug Fixes
----------------
* Fix NotSerializableException when WordEmbeddings is used over K8s cluster while `setEnableInMemoryStorage` is set to `true`
* Fix a bug in RegexTokenizer annotator when it outputs the wrong indexes if the pattern includes splits that are not followed by a space
* Fix training module failing on EMR due to bad Apache Spark version detection. The following classes were fixed: `CoNLL()`, `CoNLLU()`, `POS()`, and `PubTator()`
* Fix a bug in CoNLLGenerator annotator where token has non-int metadata
* Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
* Fix `NaNs` result in some ViTForImageClassification models/pipelines
========
4.2.2
========
----------------
New Features & Enhancements
----------------
* Add support for importing TensorFlow SavedModel from remote storages like DBFS, S3, and HDFS
* Add support for `fullAnnotate` in `LightPipeline` for path of images in Scala
* Add `fullAnnotate` method in `PretrainedPipeline` for Scala
* Add `fullAnnotateJava` method in `PretrainedPipeline` for Java
* Add `fullAnnotateImage` to `PretrainedPipeline` for Scala
* Add `fullAnnotateImageJava` to `PretrainedPipeline` for Java
* Add support for QA in `fullAnnotate` method in `PretrainedPipeline`
* Add `Predicted Entities` to all Vision Transformers (ViT) models and pipelines
----------------
Bug Fixes
----------------
* Unify `annotatorType` name in Python and Scala for Spark schema in Annotation, AnnotationImage and AnnotationAudio
* Fix missing indexes in `RecursiveTokenizer` annotator
========
4.2.1
========
----------------
New Features & Enhancements
----------------
* Support for multi-lingual WordSegmenter. Add `enableRegexTokenizer` feature in WordSegmenter to support word segmentation within mixed and multi-lingual content https://github.com/JohnSnowLabs/spark-nlp/pull/12854
* Add support for Audio/ASR (Wav2Vec2) support to LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12895
* Add support for Double type in addition to Float type to AudioAssembler annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12904
* Improve error handling in fullAnnotateImage for LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12868
* Add SpanBertCoref annotator to all docs https://github.com/JohnSnowLabs/spark-nlp/pull/12889
----------------
Bug Fixes
----------------
* Fix feeding `fullAnnotate` in LightPipeline with a list, which started to fail in the 4.2.0 release
* Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true https://github.com/JohnSnowLabs/spark-nlp/pull/12875
* Fix exception in Chunker annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12901
========
4.2.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **Wav2Vec2ForCTC** annotator in Spark NLP 🚀. `Wav2Vec2ForCTC` can load `Wav2Vec2` models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text. It's the first multi-modal model of its kind we welcome in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using `Wav2Vec2ForCTC` for **PyTorch** or `TFWav2Vec2ForCTC` for **TensorFlow** models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767)
* **NEW:** Introducing **TapasForQuestionAnswering** annotator in Spark NLP 🚀. `TapasForQuestionAnswering` can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using `TapasForQuestionAnswering` for **PyTorch** or `TFTapasForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **CamemBertForTokenClassification** annotator in Spark NLP 🚀. `CamemBertForTokenClassification` can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `CamembertForTokenClassification` for PyTorch or `TFCamembertForTokenClassification` for TensorFlow in HuggingFace 🤗
(https://github.com/JohnSnowLabs/spark-nlp/pull/12752)
* Implementing `setTestDataset` to evaluate metrics on an external dataset during training of Text Classifiers in Spark NLP. This feature is similar to NerDLApproach, where metrics are calculated on each epoch, and has been added to the following multi-class/multi-label text classifier annotators: `ClassifierDLApproach`, `SentimentDLApproach`, and `MultiClassifierDLApproach` (https://github.com/JohnSnowLabs/spark-nlp/pull/12796); see the sketch after this list
* Refactoring and improving `EntityRuler` annotator inference, making it up to 24x faster, especially when used with a long list of labels/entities. We speed up the inference process by implementing the Aho-Corasick algorithm to match patterns in a string. This requires the following changes when using `EntityRuler` https://github.com/JohnSnowLabs/spark-nlp/pull/12634
* Add support for S3 storage in the `cache_folder` where models are downloaded, extracted, and loaded from. Previously, we only supported all local file systems, HDFS, and DBFS. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)
* Implementing `lookaround` functionalities in `DocumentNormalizer` annotator. Currently, `DocumentNormalizer` has both `lookahead` and `lookbehind` functionalities. To extend support for more complex normalizations, especially within the clinical text we are introducing the `lookaround` feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)
* Implementing `setReplaceEntities` param to `NerOverwriter` annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
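A minimal sketch of `setTestDataset` on one of the text classifier trainers mentioned above; it assumes the test split was already transformed by the same embedding stages and saved as Parquet, mirroring how the NerDLApproach option works.
```python
from sparknlp.annotator import ClassifierDLApproach

# Assumes "test_data.parquet" already holds the sentence embeddings produced
# by the same pipeline stages that feed the training set.
classifier = (
    ClassifierDLApproach()
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class")
    .setLabelColumn("label")
    .setMaxEpochs(5)
    .setTestDataset("test_data.parquet")
)
```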
----------------
Bug Fixes
----------------
* Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the `TFGraphBuilder` annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by `TFGraphBuilder` won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)
* Fix a bug introduced in the 4.0.0 release between Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU. However, a bug was introduced in sentence indices which, when combined with SentenceEmbeddings for Text Classification tasks (ClassifierDLApproach and SentimentDLApproach), resulted in low accuracy: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)
* Add support for a list of questions and context in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to `fullAnnotate` and `annotate` to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)
* Fix division by zero exception in the `GPT2Transformer` annotator when the `setDoSample` param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)
========
4.1.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **ViTForImageClassification** annotator in Spark NLP 🚀. `ViTForImageClassification` can load Vision Transformer `ViT` Models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using `ViTForImageClassification` for **PyTorch** or `TFViTForImageClassification` for **TensorFlow** models in HuggingFace 🤗
* Provide support for AWS Graviton processors and ARM64 processors with architecture greater than ARMv8
* Introducing **TFNerDLGraphBuilder** annotator. `TFNerDLGraphBuilder` can be used to automatically detect the parameters of a needed NerDL graph and generate the graph within a pipeline when the default NER graphs are not suitable for your training datasets.
* Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. From this release it is possible to access the confidence scores coming from the following annotators via NerConverter: AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification
* Introducing PushToHub Python class to easily push public models/pipelines to Models Hub
* Introducing fullAnnotateImage to existing LightPipeline to support ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline.
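A minimal sketch of `fullAnnotateImage` with an image pipeline; the import paths, the default pretrained ViT model, and the `images/` paths are assumptions or placeholders.
```python
import sparknlp
from sparknlp.base import ImageAssembler, LightPipeline  # import paths assumed
from sparknlp.annotator import ViTForImageClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Placeholder folder of sample images, used only to fit the pipeline.
images = spark.read.format("image").option("dropInvalid", True).load("images/")

image_assembler = ImageAssembler().setInputCol("image").setOutputCol("image_assembler")
classifier = (
    ViTForImageClassification.pretrained()  # default ViT model assumed
    .setInputCols(["image_assembler"])
    .setOutputCol("class")
)

model = Pipeline(stages=[image_assembler, classifier]).fit(images)
light = LightPipeline(model)

# New in 4.1.0: annotate an image directly from its path.
result = light.fullAnnotateImage("images/hippopotamus.JPEG")
print(result[0]["class"])
```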
========
4.0.2
========
----------------
New Features
----------------
* SentenceDetector now comes with a new parameter `customBoundsStrategy` for returning custom bounds https://github.com/JohnSnowLabs/spark-nlp/pull/10567
----------------
Bug Fixes
----------------
* Fix bug that attempts to create spark session on executors when using GraphExtraction https://github.com/JohnSnowLabs/spark-nlp/pull/9905
========
4.0.1
========
----------------
New Features
----------------
* Full support for Apache Spark & PySpark 3.3.0
* Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
* New `-g` option for Google Colab and Kaggle setup on GPU device to upgrade `libcudnn8` to 8.1.0 to solve the issue on GPU
* Support for Databricks Runtime 11.0
----------------
Bug Fixes
----------------
* Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
* Fix and re-upload Dependency and Type Dependency parser pre-trained models
* Update pre-trained pipelines with issues on PySpark 3.2 and 3.3
========
4.0.0
========
----------------
New Features & Enhancements
----------------
* **NEW:** Introducing **AlbertForQuestionAnswering** annotator in Spark NLP 🚀. `AlbertForQuestionAnswering` can load `ALBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `AlbertForQuestionAnswering` for **PyTorch** or `TFAlbertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **BertForQuestionAnswering** annotator in Spark NLP 🚀. `BertForQuestionAnswering` can load `BERT` & `ELECTRA` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `BertForQuestionAnswering` and `ElectraForQuestionAnswering` for **PyTorch** or `TFBertForQuestionAnswering` and `TFElectraForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DeBertaForQuestionAnswering** annotator in Spark NLP 🚀. `DeBertaForQuestionAnswering` can load `DeBERTa` v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForQuestionAnswering` for **PyTorch** or `TFDebertaV2ForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **DistilBertForQuestionAnswering** annotator in Spark NLP 🚀. `DistilBertForQuestionAnswering` can load `DistilBERT` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `DistilBertForQuestionAnswering` for **PyTorch** or `TFDistilBertForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForQuestionAnswering** annotator in Spark NLP 🚀. `LongformerForQuestionAnswering` can load `Longformer` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `LongformerForQuestionAnswering` for **PyTorch** or `TFLongformerForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **RoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `RoBertaForQuestionAnswering` can load `RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `RobertaForQuestionAnswering` for **PyTorch** or `TFRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForQuestionAnswering** annotator in Spark NLP 🚀. `XlmRoBertaForQuestionAnswering` can load `XLM-RoBERTa` Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForQuestionAnswering` for **PyTorch** or `TFXLMRobertaForQuestionAnswering` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **MultiDocumentAssembler** annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in XXXForQuestionAnswering annotators (see the sketch after this list)
* Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations result in performance improvements from +50% to +700% (more details in Benchmarks section)
* **NEW:** Introducing **SpanBertCorefModel** annotator for Coreference Resolution on BERT and SpanBERT models based on [BERT for Coreference Resolution: Baselines and Analysis](https://arxiv.org/abs/1908.09091) paper. An implementation of a SpanBert based coreference resolution model.
* Support for 2 inputs in LightPipeline with MultiDocumentAssembler
* Migrate T5Transformer to TensorFlow v2 architecture with re-uploading all the existing models
* Official support for Apple silicon M1 on macOS devices. From Spark NLP 4.0.0 you can use `spark-nlp-m1` package that supports Apple silicon M1 on your macOS machine
* Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.2.x and supports Spark/PySpark 3.0.x and 3.1.x in addition
* Unifying all supported Apache Spark packages on Maven into `spark-nlp` for CPU, `spark-nlp-gpu` for GPU, and `spark-nlp-m1` for the new Apple silicon M1 on macOS. The need for Apache Spark-specific packages like `spark-nlp-spark32` has been removed.
* Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (`m1=True`)
* Update Colab, Kaggle, and SageMaker scripts
* Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
* Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
* Allow change of case sensitivity. Previously, users could not set the setCaseSensitive param. This allows users to change this value if the model was saved/uploaded with the wrong case sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification).
* Keep accuracy in ClassifierDL and SentimentDL during the training between 0.0 and 1.0
* Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
* Refactor the entire Python module in Spark NLP to make the development and maintenance easier
* Refactor unit tests in Python and migrate to pytest
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.4 LTS
* Databricks 10.4 LTS ML
* Databricks 10.4 LTS ML GPU
* Databricks 10.5
* Databricks 10.5 ML
* Databricks 10.5 ML GPU
* Welcoming a new EMR 6.x series to our Spark NLP family:
* EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
* Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
* Upgrade RocksDB with new enhancements and support for Apple silicon M1
* Upgrade SentencePiece tokenizer TF ops to 2.7.1
* Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
* Upgrade to Scala 2.12.15
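The snippet below is a minimal Python sketch of how the new pieces above fit together: MultiDocumentAssembler feeds a question and a context column into one of the XXXForQuestionAnswering annotators, and the resulting PipelineModel can also be used through a two-input LightPipeline. Column names and example text are illustrative, and `pretrained()` is assumed to resolve the annotator's default model.

```python
import sparknlp
from sparknlp.base import MultiDocumentAssembler, LightPipeline
from sparknlp.annotator import RoBertaForQuestionAnswering
from pyspark.ml import Pipeline

# On Apple silicon, sparknlp.start(m1=True) selects the spark-nlp-m1 package.
spark = sparknlp.start()

# Convert both inputs (question and context) into DOCUMENT columns.
document_assembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

# Any of the new XXXForQuestionAnswering annotators consumes the two DOCUMENT columns.
span_classifier = RoBertaForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, span_classifier])

data = spark.createDataFrame(
    [["What is my name?", "My name is Clara and I live in Berkeley."]]
).toDF("question", "context")

model = pipeline.fit(data)
model.transform(data).select("answer.result").show(truncate=False)

# LightPipeline accepts the two inputs (question, context) directly.
light = LightPipeline(model)
print(light.annotate("What is my name?", "My name is Clara and I live in Berkeley."))
```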
----------------
Bug Fixes
----------------
* Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
* Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
* Fix WordSegmenterModel outputting tokens in the wrong order. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
* Fix sentence encoding in XlmRobertaSentenceEmbeddings not respecting the max sequence length given by the user
* Fix sentence encoding by using SentencePiece to calculate the correct token indexing
* Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
* Remove non-existing parameters from DocumentAssembler in Python
----------------
Backward Compatibility
----------------
* Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 https://github.com/JohnSnowLabs/spark-nlp/pull/8319
* The start() functions in Python and Scala will no longer have `spark23`, `spark24`, and `spark32` parameters. The default `sparknlp.start()` works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
* Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
* Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build
========
3.4.4
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForTokenClassification** annotator in Spark NLP 🚀. `DeBertaForTokenClassification` can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForTokenClassification` for **PyTorch** or `TFDebertaV2ForTokenClassification` for **TensorFlow** models in HuggingFace
* **NEW:** Introducing **CamemBertEmbeddings** annotator in Spark NLP 🚀 (see the sketch after this list)
* Add support for BatchAnnotate to UniversalSentenceEncoder
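As a quick illustration of the CamemBertEmbeddings item above, here is a minimal Python sketch; `pretrained()` is assumed to resolve the default French CamemBERT model, and column names are illustrative.

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, CamemBertEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Loads the default pretrained CamemBERT model; a specific name can be passed to pretrained().
embeddings = CamemBertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])
```

DeBertaForTokenClassification plugs into the same document/token layout, producing a token-level classification column (for example "ner") instead of embeddings.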
----------------
Bug Fixes & Enhancements
----------------
* Optimizing Tokenizer performance up to 400% when there is an exceptions list
* Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts
* Removing trove4j dependency
* Fix a bug that caused the input/output getters and LazyAnnotator to return None
* Fix DeBertaForSequenceClassification in Python failing to load pretrained models
========
3.4.3
========
----------------
New Features
----------------
* **NEW:** Introducing **DeBertaForSequenceClassification** annotator in Spark NLP 🚀. `DeBertaForSequenceClassification` can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2ForSequenceClassification` for **PyTorch** or `TFDebertaV2ForSequenceClassification` for **TensorFlow** models in HuggingFace
* New multi-label feature in all XXXForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer: AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification (see the sketch after this list)
* New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL
* New impossiblePenultimates feature in SentenceDetectorDLModel
* New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol
* New formCol and lemmaCol parameters in Lemmatizer annotator
* Add new functionality to download and extract models from S3 via direct link
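A minimal sketch of the multi-label option referenced above, using DistilBertForSequenceClassification as an example; the sigmoid switch is assumed to be exposed through a setter that mirrors the param name (setActivation).

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Switching the output layer from softmax to sigmoid scores each class independently,
# which is what enables multi-label classification.
sequence_classifier = DistilBertForSequenceClassification.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class") \
    .setActivation("sigmoid")

pipeline = Pipeline(stages=[document_assembler, tokenizer, sequence_classifier])
```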
----------------
Bug Fixes & Enhancements
----------------
* Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
* Update SentenceDetector documentation
* Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation
========
3.4.2
========
----------------
New Features
----------------
* Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%).
This annotator is compatible with all the models trained/fine-tuned by using `DebertaV2Model` for **PyTorch** or `TFDebertaV2Model` for **TensorFlow** models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace
* Introducing a new enableCaching param in Doc2VecApproach and Word2VecApproach which, if enabled, speeds up the training (see the sketch after this list)
* Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
* Support EMR emr-5.34.0 and emr-6.5.0
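A minimal sketch of the items above: loading DeBertaEmbeddings with its default pretrained model and enabling the new caching option on Word2VecApproach; the setter is assumed to mirror the param name (setEnableCaching), and column names are illustrative.

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DeBertaEmbeddings, Word2VecApproach
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Default pretrained DeBERTa embeddings; a specific model name can be passed to pretrained().
embeddings = DeBertaEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# enableCaching caches the intermediate training dataset to speed up Word2Vec/Doc2Vec training.
word2vec = Word2VecApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("word_embeddings") \
    .setEnableCaching(True)

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, word2vec])
```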
----------------
Bug Fixes
----------------
* Fix bestModelMetric param when the set value was ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978
========
3.4.1
========
----------------
New Features & Enhancements
----------------
* Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the session warmup, all inferences, including the first one, now take the same amount of time https://github.com/JohnSnowLabs/spark-nlp/pull/6773
* Add bestModelMetric param to choose between Micro-average and Macro-average F1 for tracking the best model https://github.com/JohnSnowLabs/spark-nlp/pull/6749
* Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806
* Add a new `setSentenceMatchAdd` param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841
* Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time (see the sketch after this list) https://github.com/JohnSnowLabs/spark-nlp/pull/6822
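A minimal sketch of the configuration items referenced above; the RegexTokenizer setters are assumed to mirror the new param names (setTrimWhitespace, setPreservePosition), and the pattern shown is only an example.

```python
import sparknlp
from sparknlp.annotator import RegexTokenizer

# spark32 and real_time_output can now be enabled together in the same session.
spark = sparknlp.start(spark32=True, real_time_output=True)

# The new params control whitespace trimming and whether original character
# positions are preserved in the token results.
regex_tokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPattern("\\s+") \
    .setTrimWhitespace(True) \
    .setPreservePosition(True)
```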
----------------
Bug Fixes
----------------
* Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741
* Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748
* Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750
* Fix ContextSpellCheckerModel returning tokens in the wrong order when it's used with sentence detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799
* Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845
* Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849
* Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867
* Fix missing setExceptionsPath param in Tokenizer when it's used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868
* Fix the wrong metric being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1, but it was in fact Macro-averaged F1 (the new bestModelMetric option now lets users choose which metric to track)
* Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767
========
3.4.0
========
----------------
Major features and improvements
----------------
* **NEW:** Introducing **GPT2Transformer** annotator in Spark NLP 🚀 for text generation with OpenAI GPT-2 (HuggingFace `TFGPT2LMHeadModel`)
* **NEW:** Introducing **RoBertaForSequenceClassification** annotator in Spark NLP 🚀. `RoBertaForSequenceClassification` can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `RobertaForSequenceClassification` for **PyTorch** or `TFRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlmRoBertaForSequenceClassification** annotator in Spark NLP 🚀. `XlmRoBertaForSequenceClassification` can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLMRobertaForSequenceClassification` for **PyTorch** or `TFXLMRobertaForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **LongformerForSequenceClassification** annotator in Spark NLP 🚀. `LongformerForSequenceClassification` can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `LongformerForSequenceClassification` for **PyTorch** or `TFLongformerForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **AlbertForSequenceClassification** annotator in Spark NLP 🚀. `AlbertForSequenceClassification` can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `AlbertForSequenceClassification` for **PyTorch** or `TFAlbertForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing **XlnetForSequenceClassification** annotator in Spark NLP 🚀. `XlnetForSequenceClassification` can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using `XLNetForSequenceClassification` for **PyTorch** or `TFXLNetForSequenceClassification` for **TensorFlow** models in HuggingFace 🤗
* **NEW:** Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML
* Support for Apache Spark and PySpark 3.2.x on Scala 2.12
* Introducing the `useBestModel` param in the NerDLApproach annotator. This param preserves and restores the model that achieved the best performance at the end of training. The priority is metrics from testDataset (micro F1), then metrics from validationSplit (micro F1); if neither is set, it keeps track of the loss during training (see the sketch below)
* Welcoming 6x new Databricks runtimes to our Spark NLP family:
* Databricks 10.0
* Databricks 10.0 ML GPU
* Databricks 10.1
* Databricks 10.1 ML GPU
* Databricks 10.2
* Databricks 10.2 ML GPU
* Welcoming 3x new EMR releases to our Spark NLP family:
* EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
* EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
* EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
* Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (`spark32=True`)
* Add new scripts/notebook to generate custom TensorFlow graphs for the `ContextSpellCheckerApproach` annotator
* Add a new `graphFolder` param to the `ContextSpellCheckerApproach` annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
* Support DBFS file system in the `graphFolder` param. Starting with Spark NLP 3.4.0, you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
* Add new feature to all classifiers (`ForTokenClassification` and `ForSequenceClassification`) to retrieve classes from the pretrained models
* Add `inputFormats` param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input formats via date patterns to search for in the text. Consequently, the output format defines the single output pattern used for every matched date.
* Enable batch processing in T5Transformer and MarianTransformer annotators
* Add Schema to `readDataset` in CoNLL() class
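A minimal sketch of the useBestModel option referenced above (embeddings and training-data preparation omitted; the setter is assumed to mirror the param name).

```python
from sparknlp.annotator import NerDLApproach

# useBestModel keeps the checkpoint with the best micro F1 seen during training
# (from testDataset if set, otherwise from validationSplit, otherwise the lowest loss).
ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setValidationSplit(0.2) \
    .setUseBestModel(True)
```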