[#228] Implemented processing of xls files without transformation to csv #285

MSattrtand · 2024-12-12T11:53:21Z

Resolves #228.

.xls files are now processed separately, without conversion to .csv. This is preliminary code; I'll work on reducing the amount of duplicate code by using adapters and refactor it.

MSattrtand · 2024-12-15T20:46:22Z

XLS and CSV files are now processed using adapters; I will fix the HTML processing soon.

MSattrtand · 2024-12-15T23:53:38Z

Processing HTML tables works again, though it still relies on the inelegant conversion to CSV.

MSattrtand · 2024-12-18T00:09:49Z

HTML tables are now processed directly, without conversion. However, I'm not sure this won't introduce more bugs: HTML doesn't store every cell in every row, so I have to use a workaround to remember where merged cells should be.
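To illustrate the merged-cell problem, here is a minimal sketch of one way to expand HTML rows into a full grid by remembering, per column, how many more rows an earlier rowspan still covers. It is not the code from this PR: `HtmlCell` and `SpanExpander` are hypothetical names, the cells are assumed to be already extracted from the `<tr>` elements, and covered positions are filled with `null`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cell model: text plus the rowspan/colspan attributes of a <td>.
class HtmlCell {
    final String text;
    final int rowspan;
    final int colspan;

    HtmlCell(String text, int rowspan, int colspan) {
        this.text = text;
        this.rowspan = rowspan;
        this.colspan = colspan;
    }
}

class SpanExpander {
    // Expands rows as they appear in the markup into a rectangular grid.
    // Positions covered by a merged cell (other than its top-left corner) are null.
    static List<List<String>> expand(List<List<HtmlCell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        // pending.get(c) = number of following rows still covered by a rowspan in column c
        List<Integer> pending = new ArrayList<>();
        for (List<HtmlCell> row : rows) {
            List<String> out = new ArrayList<>();
            int col = 0;
            for (HtmlCell cell : row) {
                col = skipCovered(pending, out, col);
                for (int k = 0; k < cell.colspan; k++) {
                    while (pending.size() <= col) {
                        pending.add(0);
                    }
                    pending.set(col, cell.rowspan - 1);
                    out.add(k == 0 ? cell.text : null);
                    col++;
                }
            }
            skipCovered(pending, out, col); // trailing columns covered from above
            grid.add(out);
        }
        return grid;
    }

    // Emits null for each consecutive column still covered by an earlier rowspan.
    private static int skipCovered(List<Integer> pending, List<String> out, int col) {
        while (col < pending.size() && pending.get(col) > 0) {
            pending.set(col, pending.get(col) - 1);
            out.add(null);
            col++;
        }
        return col;
    }
}
```

The sketch assumes well-formed tables (no overlapping spans); the merged regions themselves could be recorded at the point where `rowspan` or `colspan` is greater than 1.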

blcham

See my suggestions.

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/FileReaderAdapter.java

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

blcham · 2024-12-18T12:02:47Z

...s-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/XLSFileReaderAdapter.java

+    }
+
+    @Override
+    public String[] getHeader() throws IOException {


should support skipHeader = true

...s-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/XLSFileReaderAdapter.java

blcham · 2024-12-18T12:18:12Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

-                logMissingQuoteError();
-                return getExecutionContext(inputModel, outputModel);
-            }
+            fileReaderAdapter.initialise(new ByteArrayInputStream(sourceResource.getContent()), sourceResourceFormat, processTableAtIndex);


Before, it was implemented like this:

    ICsvListReader listReader = getCsvListReader(csvPreference);
    if (listReader == null) {
        logMissingQuoteError();
        return getExecutionContext(inputModel, outputModel);
    }

but there is no reason to have

    if (listReader == null) {
        logMissingQuoteError();
        return getExecutionContext(inputModel, outputModel);
    }

outside of the getCsvListReader() method.

I did not find it anywhere

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

blcham

see my comments

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

...es-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/StreamReaderAdapter.java

blcham · 2025-01-07T12:17:35Z

...es-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/StreamReaderAdapter.java

+import java.util.List;
+
+public interface StreamReaderAdapter {
+    void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex, boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException;


Not sure it makes sense to have acceptInvalidQuoting for adapters other than CSV?

blcham · 2025-01-07T12:18:41Z

...es-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/StreamReaderAdapter.java

+public interface StreamReaderAdapter {
+    void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex, boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException;
+    String[] getHeader(Boolean skipHeader) throws IOException;
+    boolean hasNextRow() throws IOException;


Why does this throw an exception?

blcham · 2025-01-07T12:19:32Z

...es-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/StreamReaderAdapter.java

+    boolean hasNextRow() throws IOException;
+    List<String> getNextRow() throws IOException;
+    List<Region> getMergedRegions();
+    String getSheetLabel() throws IOException;


Should it throw an exception?

blcham · 2025-01-07T12:21:18Z

...modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/CSVStreamReaderAdapter.java

+    @Override
+    public void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex,
+                           boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException {
+        //listReader = new CsvListReader(new InputStreamReader(inputStream), csvPreference);


blcham · 2025-01-07T12:21:55Z

...modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/XLSStreamReaderAdapter.java

+        return sheet.getSheetName();
+    }
+
+    public String fixNumberFormat (String cellValue){


blcham · 2025-01-07T12:45:28Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

-                logMissingQuoteError();
-                return getExecutionContext(inputModel, outputModel);
-            }
+            streamReaderAdapter.initialise(new ByteArrayInputStream(sourceResource.getContent()),


It does not make sense to have acceptInvalidQuoting here.

I suggest removing from the initialise method the parts that are not common to all adapters.
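One possible shape for this, as a rough sketch only: the shared interface keeps just the input stream, and each adapter receives its format-specific settings through its constructor. The names `ReaderAdapterSketch`, `CsvAdapterSketch` and `XlsAdapterSketch` are hypothetical, and which parameters really are common to all adapters is an assumption here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical common interface: only what every adapter needs.
interface ReaderAdapterSketch {
    void initialise(InputStream inputStream) throws IOException;
}

// CSV-specific options live in the CSV adapter, not in the shared signature.
class CsvAdapterSketch implements ReaderAdapterSketch {
    private final boolean acceptInvalidQuoting;
    private final Charset inputCharset;
    private String content;

    CsvAdapterSketch(boolean acceptInvalidQuoting, Charset inputCharset) {
        this.acceptInvalidQuoting = acceptInvalidQuoting;
        this.inputCharset = inputCharset;
    }

    @Override
    public void initialise(InputStream inputStream) throws IOException {
        // A real adapter would build its CSV reader here; the sketch just decodes the bytes.
        content = new String(inputStream.readAllBytes(), inputCharset);
    }

    boolean acceptsInvalidQuoting() {
        return acceptInvalidQuoting;
    }

    String content() {
        return content;
    }
}

// The table (sheet) index is only meaningful for multi-table formats.
class XlsAdapterSketch implements ReaderAdapterSketch {
    private final int tableIndex;
    private byte[] workbookBytes;

    XlsAdapterSketch(int tableIndex) {
        this.tableIndex = tableIndex;
    }

    @Override
    public void initialise(InputStream inputStream) throws IOException {
        // A real adapter would parse the workbook and select the sheet at tableIndex.
        workbookBytes = inputStream.readAllBytes();
    }

    int tableIndex() {
        return tableIndex;
    }

    int size() {
        return workbookBytes.length;
    }
}
```

With this split, the calling module picks and configures the adapter once and afterwards treats all of them uniformly through `initialise(InputStream)`.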

blcham · 2025-01-07T12:52:22Z

s-pipes-modules/module-tabular/src/main/java/cz/cvut/spipes/modules/TabularModule.java

-                logMissingQuoteError();
-                return getExecutionContext(inputModel, outputModel);
-            }
+            fileReaderAdapter.initialise(new ByteArrayInputStream(sourceResource.getContent()), sourceResourceFormat, processTableAtIndex);


I did not find it anywhere

blcham · 2025-01-07T12:53:25Z

...modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/CSVStreamReaderAdapter.java

+        this.acceptInvalidQuoting = acceptInvalidQuoting;
+        this.inputCharset = inputCharset;
+        this.sourceResource = sourceResource;
+        listReader = getCsvListReader(csvPreference);


I do not see logMissingQuoteError();

blcham · 2025-01-07T12:58:04Z

...modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/CSVStreamReaderAdapter.java

+    }
+
+    @Override
+    public String[] getHeader(Boolean skipHeader) throws IOException {


Suggested change

-    public String[] getHeader(Boolean skipHeader) throws IOException {
+    public String[] getHeader(boolean skipHeader) throws IOException {

skipHeader == null should be handled at a different level, i.e. by the default value of the skipHeader field.
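A minimal sketch of what handling the null at the field level could look like. `TabularConfig` is a hypothetical stand-in for the module's configuration, and the default of `false` for an unset parameter is an assumption, not something confirmed in this PR.

```java
// Hypothetical configuration holder: the nullable parameter is resolved once,
// so adapters only ever receive a primitive boolean.
class TabularConfig {
    private final boolean skipHeader;

    TabularConfig(Boolean configuredSkipHeader) {
        // Assumed default: false (the file has a header) when the parameter is not set.
        this.skipHeader = configuredSkipHeader != null && configuredSkipHeader;
    }

    boolean isSkipHeader() {
        return skipHeader;
    }
}
```

An adapter method declared as `getHeader(boolean skipHeader)` can then be called with `config.isSkipHeader()` without any null check of its own.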

blcham · 2025-01-07T13:04:00Z

...modules/module-tabular/src/main/java/cz/cvut/spipes/modules/util/CSVStreamReaderAdapter.java

+
+    @Override
+    public boolean hasNextRow() throws IOException {
+        return ((listReader.read() != null) || (firstRow != null));


This cannot work properly, because calling hasNextRow multiple times will result in reading different parts of the file.

Maybe the idea should be that we read one line up front.
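The read-ahead idea could look roughly like the sketch below. It uses a plain `BufferedReader` with a naive comma split instead of the CSV library used in the module, and `LookAheadRowReader` is a hypothetical name; the point is only that `hasNextRow()` inspects a buffered row and never consumes input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a row-reading adapter with one-row look-ahead.
class LookAheadRowReader {
    private final BufferedReader reader;
    private List<String> nextRow; // buffered row, or null at end of input

    LookAheadRowReader(BufferedReader reader) throws IOException {
        this.reader = reader;
        this.nextRow = readRow(); // read one line up front
    }

    // Pure check: no I/O, so it is safe to call any number of times.
    boolean hasNextRow() {
        return nextRow != null;
    }

    // Returns the buffered row and reads the following one ahead.
    List<String> getNextRow() throws IOException {
        if (nextRow == null) {
            throw new NoSuchElementException("no more rows");
        }
        List<String> current = nextRow;
        nextRow = readRow();
        return current;
    }

    // Naive comma split for illustration only; a real CSV parser handles quoting.
    private List<String> readRow() throws IOException {
        String line = reader.readLine();
        return line == null ? null : Arrays.asList(line.split(",", -1));
    }
}
```

A side effect of this design is that `hasNextRow()` no longer needs to declare `IOException`, since all reading happens in the constructor and in `getNextRow()`.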

blcham · 2025-01-07T17:31:17Z

@MSattrtand please describe this in the documentation of the module:

SH+ = skipHeader yes
SH- = skipHeader no
S+ = schema yes
S- = schema no



- S+SH+
    - not calling getHeader()
    - assume that the data matches the schema
- S+SH-
    - calling getHeader()
    - adapt schema to match header of the file
        - if ordering is not specified, use the ordering of the header
        - reuse column IRIs from the schema
- S-SH+
    - not calling getHeader()
    - create column names column_1, column_2, etc.
- S-SH-
    - calling getHeader()
    - create schema entirely based on the header
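The four combinations above can be sketched as a single decision function. This is an illustration under assumptions, not the module's code: `ColumnResolver`, the `BASE` namespace, and the representation of the schema as a name-to-IRI map are all hypothetical, and the schema is assumed to carry no explicit column ordering. The header is passed as a `Supplier` so that it is only read in the SH- cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class ColumnResolver {
    // Hypothetical namespace for minting IRIs of columns not covered by a schema.
    static final String BASE = "http://example.org/column/";

    /**
     * @param schemaIris  column name -> column IRI from the schema, or null if there is no schema (S-)
     * @param skipHeader  true if the file has no header row (SH+)
     * @param getHeader   reads the header row; only invoked when skipHeader is false
     * @param columnCount number of data columns; only used for S-SH+
     */
    static List<String> resolveColumnIris(Map<String, String> schemaIris, boolean skipHeader,
                                          Supplier<String[]> getHeader, int columnCount) {
        List<String> iris = new ArrayList<>();
        if (skipHeader) {
            if (schemaIris != null) {
                // S+SH+: assume the data matches the schema as given
                iris.addAll(schemaIris.values());
            } else {
                // S-SH+: generate column_1, column_2, ...
                for (int i = 1; i <= columnCount; i++) {
                    iris.add(BASE + "column_" + i);
                }
            }
            return iris;
        }
        // SH-: the header drives the ordering
        for (String name : getHeader.get()) {
            String iri = schemaIris == null ? null : schemaIris.get(name);
            // S+SH-: reuse the schema IRI when present; S-SH-: build it from the header
            iris.add(iri != null ? iri : BASE + name);
        }
        return iris;
    }
}
```

Passing a map with a stable iteration order (for example a `LinkedHashMap`) matters in the S+SH+ case, because the schema order is then the only source of column order.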

blcham requested changes Dec 18, 2024

MSattrtand force-pushed the 228-tabular-data-processing branch from 1f22c86 to 940f458 Compare December 22, 2024 00:56

MSattrtand closed this Dec 22, 2024

MSattrtand force-pushed the 228-tabular-data-processing branch from 940f458 to d53c74c Compare December 22, 2024 00:59

MSattrtand reopened this Dec 22, 2024

blcham force-pushed the 228-tabular-data-processing branch 2 times, most recently from a4077a9 to d2c8e84 Compare January 6, 2025 17:42

blcham self-requested a review January 6, 2025 22:16

blcham requested changes Jan 6, 2025

MSattrtand force-pushed the 228-tabular-data-processing branch from d2c8e84 to 2c14733 Compare January 7, 2025 09:46

MSattrtand requested a review from blcham January 7, 2025 10:23

blcham added 3 commits January 7, 2025 17:44

Reformat

1c84956

[#228] Refactor using CSVStreamReaderAdapter

a1b4b52

[#228] Tabular Module now uses adapters

03a9272

blcham force-pushed the 228-tabular-data-processing branch from aba244e to 03a9272 Compare January 7, 2025 17:24

blcham reviewed Jan 7, 2025
