Trimming strings for advanced datasets #39

Open
ghost opened this issue Sep 13, 2020 · 1 comment

ghost commented Sep 13, 2020

After some digging around, I found a way to clean up the unformatted strings (containing '\r', '\v', '\f', '\n', '\t', ' ') that this library returns when parsing HTML files. A file with long runs of spaces, tabs, and newlines can be very annoying when, for example, you try to train an ML algorithm on data fetched with libcurl. The function 'reduce' trims the string and collapses each run of whitespace into a single space, so it will transform this string:

You can modify the text in the box to the left any way you like, and                               ss
        then click the "Show Page" button below the box to display the
        result here. Go ahead and do this as often and as long as you like.

To something like this:

You can modify the text in the box to the left any way you like, and ss then click the "Show Page" button below the box to display the result here. Go ahead and do this as often and as long as you like.

The code:

#include <string>

// Strips leading and trailing whitespace from str.
std::string trim(
    const std::string& str,
    const std::string& whitespace = " \t\n\r\v\f"
){
    const auto strBegin = str.find_first_not_of(whitespace);
    if (strBegin == std::string::npos)
        return ""; // no content

    const auto strEnd = str.find_last_not_of(whitespace);
    const auto strRange = strEnd - strBegin + 1;

    return str.substr(strBegin, strRange);
}

// Trims str, then replaces every run of whitespace inside it with fill.
std::string reduce(
    const std::string& str,
    const std::string& fill = " ",
    const std::string& whitespace = " \t\n\r\v\f")
{
    // trim first
    auto result = trim(str, whitespace);

    // replace sub ranges
    auto beginSpace = result.find_first_of(whitespace);
    while (beginSpace != std::string::npos)
    {
        const auto endSpace = result.find_first_not_of(whitespace, beginSpace);
        const auto range = endSpace - beginSpace;

        result.replace(beginSpace, range, fill);

        const auto newStart = beginSpace + fill.length();
        beginSpace = result.find_first_of(whitespace, newStart);
    }

    return result;
}

I got this from a Reddit post, but it did not have an author.
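
For large inputs, a single-pass variant might be worth considering, since 'reduce' calls replace() once per whitespace run and each call can shift the rest of the string. The sketch below is not from the Reddit post: the name collapseWhitespace is hypothetical, it always uses a single space as the fill, and it assumes the default "C" locale, where std::isspace matches the same six characters (' ', '\t', '\n', '\r', '\v', '\f').

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of the original post): drops leading and
// trailing whitespace and collapses every inner run of whitespace into a
// single space, in one pass over the input.
std::string collapseWhitespace(const std::string& str)
{
    std::string result;
    result.reserve(str.size());

    bool pendingSpace = false; // a whitespace run was seen since the last word
    for (unsigned char ch : str)
    {
        if (std::isspace(ch))
        {
            pendingSpace = true;
            continue;
        }
        // Emit one separator only between words, never before the first one.
        if (pendingSpace && !result.empty())
            result += ' ';
        pendingSpace = false;
        result += static_cast<char>(ch);
    }
    return result;
}
```

Called on the sample paragraph, collapseWhitespace(text) should give the same single-line result as reduce(text) with its default arguments. It does not support a custom fill string or whitespace set, so 'reduce' remains the more flexible option.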

ghost commented Sep 13, 2020

This was the test HTML file:

<html>
<head>
<title>Something</title>
<style type="text/css">

</style>
</head>
<body bgcolor = "#ffffcc" text = "#000000">

<div id="ly-title">
    <h1>Hello, World!</h1>
</div>

<div id="ly-body">

    <p> 
        You can modify the text in the box to the left any way you like, and                               ss
        then click the "Show Page" button below the box to display the
        result here. Go ahead and do this as often and as long as you like.
    </p>
        
    <p> 
        You can also use this page to test your Javascript functions and local
        style declarations. Everything you do will be handled entirely by your own
        browser; nothing you type into the text box will be sent back to the
        server.
    </p>

</div>    

</body>
</html>
