This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Ignoring Non-Textual files in Directory-level scans #30

Open
atharv-phadnis opened this issue Jan 13, 2023 · 2 comments

@atharv-phadnis

Hello,

We were trying to use the tool for directory-level scans (using --dir) over a set of cloned repositories. For instance, scanning gitea results in the following:

$ license-scanner --dir gitea/
Error: failed to normalize data: invalid input text with control characters

We saw the same error on a few more directories that contain non-textual files such as UI assets and binaries.

Would it be possible to emit a warning for such files, skip them, and continue scanning the remaining files? Or perhaps add a command-line argument that enables this behavior?

@markstur
Collaborator

markstur commented Jan 13, 2023

I had a workaround for this. There is a bit more to it that I need to untangle (probably not specific to this issue), but basically the diff below shows where the error can be changed to log-and-continue.

I'll assign this to myself. There is a pending PR and some more repo moving that might delay this, though.

===================================================================

diff --git a/normalizer/normalizer.go b/normalizer/normalizer.go
--- a/normalizer/normalizer.go	
+++ b/normalizer/normalizer.go	
@@ -151,7 +151,13 @@
 	// Check if the text contains control characters indicative of binary or non-text files.
 	// match against /[\u0000-\u0007\u000E-\u001B]/
 	if ControlCharactersRE.MatchString(n.OriginalText) {
-		return fmt.Errorf("failed to normalize data: invalid input text with control characters")
+		if n.IsTemplate {
+			return fmt.Errorf("failed to normalize data: invalid input text with control characters")
+		} else {
+			Logger.Errorf("failed to normalize data: invalid input text with control characters")
+			n.NormalizedText = ""
+			return nil // continue
+		}
 	}
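
For readers who want to try the check in isolation, the following is a small standalone Go sketch of the control-character test. It uses the character class quoted in the code comment of the patch above; the names `controlCharsRE` and `looksBinary` are made up for this example, and the real `ControlCharactersRE` in the normalizer may be defined differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// controlCharsRE mirrors the class quoted in the patch comment:
// U+0000-U+0007 and U+000E-U+001B. Tab, newline, and carriage return
// fall outside both ranges, so ordinary text does not match.
var controlCharsRE = regexp.MustCompile(`[\x00-\x07\x0E-\x1B]`)

// looksBinary reports whether text contains control characters that
// suggest binary or otherwise non-textual content.
func looksBinary(text string) bool {
	return controlCharsRE.MatchString(text)
}

func main() {
	fmt.Println(looksBinary("MIT License\n\tPermission is hereby granted"))
	fmt.Println(looksBinary("\x7fELF\x02\x01\x01\x00"))
}
```

A caller can use a predicate like this to decide between returning an error and logging a warning, which is the choice the patch makes based on `n.IsTemplate`.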

@markstur markstur self-assigned this Jan 13, 2023
@atharv-phadnis
Author

Hey @markstur, thanks for the prompt reply.

I tested your workaround, and it seems to resolve the issue for now. I also ran into another issue with a similar outcome:
Error: file too large (4986500 > 1000000)

I tried changes similar to what you suggested for the earlier issue, like so:

diff --git a/identifier/identifier.go b/identifier/identifier.go
index 4750fa7..7bb47bd 100644
--- a/identifier/identifier.go
+++ b/identifier/identifier.go
@@ -109,7 +109,8 @@ func IdentifyLicensesInFile(filePath string, options Options, licenseLibrary *li
                return IdentifierResults{}, err
        }
        if fi.Size() > 1000000 {
-               return IdentifierResults{}, fmt.Errorf("file too large (%v > 1000000)", fi.Size())
+               Logger.Errorf("file too large (%v > 1000000)", fi.Size())
+               return IdentifierResults{}, nil
        }
 
        b, err := ioutil.ReadFile(filePath)

Could you confirm whether this is the right way to handle the problem, or should it be done differently? Also, would it be possible to incorporate this change as well?
