[#228] Implemented processing of xls files without transformation to csv #285

Open
wants to merge 3 commits into
base: main
from

Collaborator

@MSattrtand MSattrtand commented Dec 12, 2024

Resolves #228.

.xls files are now processed directly, without conversion to .csv. This is preliminary code; I'll reduce the amount of duplicate code by using adapters and refactor it.

@MSattrtand
Collaborator Author

XLS and CSV files are now processed using adapters. I will fix the HTML processing soon.

@MSattrtand
Collaborator Author

Processing HTML tables works again, though it still relies on the inelegant conversion to CSV.

@MSattrtand
Collaborator Author

HTML tables are now processed directly, without conversion. However, I'm not sure this won't introduce more bugs: HTML doesn't store every cell in every row (cells covered by a merge are omitted), so I have to use a workaround that remembers where merged cells should be.
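
For illustration only, a minimal sketch of what such a workaround could look like. `HtmlCell` and `HtmlGridSketch` are hypothetical names, not classes from this PR, and the sketch assumes the cells have already been parsed together with their `rowspan`/`colspan` values. It keeps a per-column counter of how many upcoming rows are still covered by a merged cell from above and fills those positions with `null`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of one <td>/<th>: its text plus rowspan/colspan attributes.
class HtmlCell {
    final String text;
    final int rowspan;
    final int colspan;

    HtmlCell(String text, int rowspan, int colspan) {
        this.text = text;
        this.rowspan = rowspan;
        this.colspan = colspan;
    }
}

class HtmlGridSketch {
    // Expands rows as they appear in the HTML into a full grid.
    // Positions covered by a merged cell are filled with null.
    static List<List<String>> expand(List<List<HtmlCell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        // pending.get(c): number of upcoming rows in which column c is still covered from above
        List<Integer> pending = new ArrayList<>();
        for (List<HtmlCell> row : rows) {
            List<String> out = new ArrayList<>();
            for (HtmlCell cell : row) {
                skipCovered(pending, out);
                for (int i = 0; i < cell.colspan; i++) {
                    int col = out.size();
                    while (pending.size() <= col) {
                        pending.add(0);
                    }
                    pending.set(col, cell.rowspan - 1);
                    // only the top-left position of a merged cell carries the text
                    out.add(i == 0 ? cell.text : null);
                }
            }
            skipCovered(pending, out); // covered columns at the end of the row
            grid.add(out);
        }
        return grid;
    }

    // Emits null for each consecutive column, starting at the current position,
    // that a rowspan from an earlier row still covers.
    private static void skipCovered(List<Integer> pending, List<String> out) {
        while (out.size() < pending.size() && pending.get(out.size()) > 0) {
            pending.set(out.size(), pending.get(out.size()) - 1);
            out.add(null);
        }
    }
}
```

The sketch does not handle malformed tables where a `colspan` overlaps a column that is still covered by an earlier `rowspan`.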

Contributor

@blcham blcham left a comment

See my suggestions.

}

@Override
public String[] getHeader() throws IOException {
Contributor

should support skipHeader = true

Collaborator Author

Done

logMissingQuoteError();
return getExecutionContext(inputModel, outputModel);
}
fileReaderAdapter.initialise(new ByteArrayInputStream(sourceResource.getContent()), sourceResourceFormat, processTableAtIndex);
Contributor

Previously it was implemented like this:

    ICsvListReader listReader = getCsvListReader(csvPreference);

    if (listReader == null) {
        logMissingQuoteError();
        return getExecutionContext(inputModel, outputModel);
    }

but there is no reason to have

    if (listReader == null) {
        logMissingQuoteError();
        return getExecutionContext(inputModel, outputModel);
    }

outside of the getCsvListReader() method.

Collaborator Author

Done

Contributor

I did not find it anywhere

@MSattrtand MSattrtand force-pushed the 228-tabular-data-processing branch from 1f22c86 to 940f458 Compare December 22, 2024 00:56
@MSattrtand MSattrtand closed this Dec 22, 2024
@MSattrtand MSattrtand force-pushed the 228-tabular-data-processing branch from 940f458 to d53c74c Compare December 22, 2024 00:59
@MSattrtand MSattrtand reopened this Dec 22, 2024
@blcham blcham force-pushed the 228-tabular-data-processing branch 2 times, most recently from a4077a9 to d2c8e84 Compare January 6, 2025 17:42
@blcham blcham self-requested a review January 6, 2025 22:16
Contributor

@blcham blcham left a comment

see my comments

@MSattrtand MSattrtand force-pushed the 228-tabular-data-processing branch from d2c8e84 to 2c14733 Compare January 7, 2025 09:46
@MSattrtand MSattrtand requested a review from blcham January 7, 2025 10:23
@blcham blcham force-pushed the 228-tabular-data-processing branch from aba244e to 03a9272 Compare January 7, 2025 17:24
import java.util.List;

public interface StreamReaderAdapter {
void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex, boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException;
Contributor

Not sure it makes sense to have acceptInvalidQuoting for adapters other than CSV?

public interface StreamReaderAdapter {
void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex, boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException;
String[] getHeader(Boolean skipHeader) throws IOException;
boolean hasNextRow() throws IOException;
Contributor

Why is this throwing an exception?

boolean hasNextRow() throws IOException;
List<String> getNextRow() throws IOException;
List<Region> getMergedRegions();
String getSheetLabel() throws IOException;
Contributor

Should it throw an exception?

@Override
public void initialise(InputStream inputStream, ResourceFormat sourceResourceFormat, int tableIndex,
boolean acceptInvalidQuoting, Charset inputCharset, StreamResource sourceResource) throws IOException {
//listReader = new CsvListReader(new InputStreamReader(inputStream), csvPreference);
Contributor

?

return sheet.getSheetName();
}

public String fixNumberFormat (String cellValue){
Contributor

?

logMissingQuoteError();
return getExecutionContext(inputModel, outputModel);
}
streamReaderAdapter.initialise(new ByteArrayInputStream(sourceResource.getContent()),
Contributor

It does not make sense to have acceptInvalidQuoting here.

Contributor

I suggest removing the parts that are not common to all adapters from the initialise method.
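
One possible shape for this, as a sketch only: the shared method keeps just the arguments every format needs, and CSV-specific options move into the CSV adapter's constructor. All names below (`TableReaderAdapter`, `CsvAdapterSketch`, `XlsAdapterSketch`, `AdapterDemo`) are hypothetical and only illustrate the split; the actual parsing is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical slimmed-down interface: only what every format needs.
interface TableReaderAdapter {
    void initialise(InputStream inputStream, int tableIndex) throws IOException;
}

// CSV-only options travel through the constructor, not the shared initialise().
class CsvAdapterSketch implements TableReaderAdapter {
    private final boolean acceptInvalidQuoting;
    private final Charset inputCharset;
    private String content;

    CsvAdapterSketch(boolean acceptInvalidQuoting, Charset inputCharset) {
        this.acceptInvalidQuoting = acceptInvalidQuoting;
        this.inputCharset = inputCharset;
    }

    @Override
    public void initialise(InputStream inputStream, int tableIndex) throws IOException {
        // tableIndex is ignored here: a CSV file holds a single table
        content = new String(inputStream.readAllBytes(), inputCharset);
    }

    boolean acceptsInvalidQuoting() {
        return acceptInvalidQuoting;
    }

    String content() {
        return content;
    }
}

// A spreadsheet adapter needs neither quoting nor charset settings.
class XlsAdapterSketch implements TableReaderAdapter {
    private int sheetIndex = -1;

    @Override
    public void initialise(InputStream inputStream, int tableIndex) {
        sheetIndex = tableIndex; // real workbook parsing omitted in this sketch
    }

    int sheetIndex() {
        return sheetIndex;
    }
}

class AdapterDemo {
    // Initialises both adapters through the same two-argument method.
    static String run() {
        try {
            byte[] bytes = "a,b".getBytes(StandardCharsets.UTF_8);
            CsvAdapterSketch csv = new CsvAdapterSketch(true, StandardCharsets.UTF_8);
            csv.initialise(new ByteArrayInputStream(bytes), 0);
            XlsAdapterSketch xls = new XlsAdapterSketch();
            xls.initialise(new ByteArrayInputStream(new byte[0]), 2);
            return csv.content() + "|" + xls.sheetIndex();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this split the calling module can pick and configure the adapter once, then treat all formats identically.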

this.acceptInvalidQuoting = acceptInvalidQuoting;
this.inputCharset = inputCharset;
this.sourceResource = sourceResource;
listReader = getCsvListReader(csvPreference);
Contributor

I do not see logMissingQuoteError();

}

@Override
public String[] getHeader(Boolean skipHeader) throws IOException {
Contributor

Suggested change
- public String[] getHeader(Boolean skipHeader) throws IOException {
+ public String[] getHeader(boolean skipHeader) throws IOException {

Contributor

skipHeader == null should be handled at a different level, i.e. via the default value of the skipHeader field.


@Override
public boolean hasNextRow() throws IOException {
return ((listReader.read() != null) || (firstRow != null));
Contributor

This cannot work properly because calling hasNextRow() multiple times will consume a row on each call, so different parts of the file get read.

Contributor

Maybe the idea should be that we read one line up front?
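
A rough sketch of the read-one-line-up-front idea, not the PR's code: `LookAheadRowReader` and `LookAheadDemo` are hypothetical names, and a plain `BufferedReader` with a naive comma split stands in for the real CSV reader. The next row is buffered in a field, so `hasNextRow()` only inspects the buffer and can be called any number of times; it also no longer needs to declare `IOException`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical look-ahead wrapper; a BufferedReader stands in for the CSV reader.
class LookAheadRowReader {
    private final BufferedReader reader;
    private List<String> nextRow; // row read ahead of the caller, null once the input is exhausted

    LookAheadRowReader(BufferedReader reader) throws IOException {
        this.reader = reader;
        advance(); // read the first row up front
    }

    private void advance() throws IOException {
        String line = reader.readLine();
        // naive split, for illustration only (no quoting support)
        nextRow = (line == null) ? null : Arrays.asList(line.split(",", -1));
    }

    // Pure check: does not touch the underlying reader.
    boolean hasNextRow() {
        return nextRow != null;
    }

    // Returns the buffered row and reads the following one ahead.
    List<String> getNextRow() throws IOException {
        if (nextRow == null) {
            throw new NoSuchElementException("no more rows");
        }
        List<String> current = nextRow;
        advance();
        return current;
    }
}

class LookAheadDemo {
    static List<List<String>> readAll(String csv) {
        try {
            LookAheadRowReader rows = new LookAheadRowReader(new BufferedReader(new StringReader(csv)));
            List<List<String>> result = new ArrayList<>();
            // calling hasNextRow() twice per iteration is deliberate: it must not skip rows
            while (rows.hasNextRow() && rows.hasNextRow()) {
                result.add(rows.getNextRow());
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This mirrors the usual `Iterator` contract, where `hasNext()` has no side effects visible to the caller.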

@blcham
Contributor

blcham commented Jan 7, 2025

@MSattrtand please describe this in the documentation of the module:

SH+ = skipHeader yes
SH- = skipHeader no
S+ = schema yes
S- = schema no



- S+SH+
    - not calling getHeader()
    - assume that the data matches the schema
- S+SH-
    - calling getHeader()
    - adapt the schema to match the header of the file
        - if ordering is not specified, use the ordering of the header
        - reuse column IRIs from the schema
- S-SH+
    - not calling getHeader()
    - create column names column_1, column_2, etc.
- S-SH-
    - calling getHeader()
    - create the schema entirely based on the header
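
The four cases above could be sketched roughly as follows. This is an illustration under assumptions, not the module's implementation: `SchemaHeaderSketch` and `resolveColumns` are hypothetical names, the schema is reduced to a map from column name to column IRI, `columnCount` is assumed to be known from the first data row, and header columns missing from the schema simply get a `null` IRI (the list above does not say what should happen to them).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class SchemaHeaderSketch {
    /**
     * @param schema      column name to column IRI in schema order, or null when there is no schema (S-)
     * @param skipHeader  true for SH+, false for SH-
     * @param getHeader   stands in for the adapter's getHeader(); only invoked in the SH- cases
     * @param columnCount number of columns, used only for S-SH+ (assumed known from the data)
     * @return ordered column name to IRI mapping; the IRI is null where the schema supplies none
     */
    static Map<String, String> resolveColumns(Map<String, String> schema, boolean skipHeader,
                                              Supplier<List<String>> getHeader, int columnCount) {
        Map<String, String> columns = new LinkedHashMap<>();
        if (schema != null && skipHeader) {
            // S+SH+: trust the schema, do not read a header
            columns.putAll(schema);
        } else if (schema != null) {
            // S+SH-: header ordering, IRIs reused from the schema
            for (String name : getHeader.get()) {
                columns.put(name, schema.get(name));
            }
        } else if (skipHeader) {
            // S-SH+: generated names column_1, column_2, ...
            for (int i = 1; i <= columnCount; i++) {
                columns.put("column_" + i, null);
            }
        } else {
            // S-SH-: schema built entirely from the header
            for (String name : getHeader.get()) {
                columns.put(name, null);
            }
        }
        return columns;
    }
}
```

Passing the header as a `Supplier` makes it explicit that `getHeader()` is never called in the two SH+ cases.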

Development

Successfully merging this pull request may close these issues.

Excel/html files should be processed directly in tabular module
2 participants