Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PgEntity convert null value into empty string #75

Open
tlode opened this issue Feb 15, 2017 · 7 comments
Open

PgEntity convert null value into empty string #75

tlode opened this issue Feb 15, 2017 · 7 comments

Comments

@tlode
Copy link
Contributor

tlode commented Feb 15, 2017

In some cases, when using a composite field type and PgEntity converter, a nullable field will be converted to an empty string. This is the case when the nullable field is a string type like varchar, text, uuid, ...

The problem is that PgEntity uses str_getcsv to split the composite row value, essential like this

$data = '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",';    
$values = str_getcsv($data);

foreach ($values as $val) {
  var_dump($val) . PHP_EOL;
}

Unfortunately this will return an empty string, where it should not be.

string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
string(4) "test"
string(0) ""
string(0) ""

Depending on the type declared in the RowStructure definition, an empty string may be interpreted as null value within the hydration plan (e.g. in case of boolean, numeric, int, ...), but it is not possible to determine the correct value for string types just from the empty string.

As PgComposite also uses str_getcsv I would assume the same behaviour for that converter. But I've not tested that.

Any idea how to avoid str_getcsv?

@chanmix51 chanmix51 added this to the 2.1 milestone Feb 15, 2017
@chanmix51 chanmix51 added the bug label Feb 15, 2017
@chanmix51
Copy link
Member

Unfortunately, the actual implementation seems to be a best effort / most efficient solution. Here is the question I left on stack-overflow.

@tlode
Copy link
Contributor Author

tlode commented Feb 20, 2017

I've also experimented with the approach of using a pcre as a specific composite literal parser. Best I've got so far is the one I've found here https://regex101.com/library/oU3yV6

The idea was to keep the quotes in the result fields to distinguish between the empty string or a null value.

With the following modified PgEntity::transformData method, all tests pass, except the one with the escaped hstore value (34,,"",,"{4,3}","""pika"" => ""\\\\\\"chu, rechu""")

private function transformData($data, Projection $projection)
    {
        preg_match_all('/"(?:[^"]|"")*"|[^,\n]+|(?=,)(?<=,)|^|(?<=,)$/', $data, $values);

        $definition     = $projection->getFieldNames();
        $out_values     = [];
        $values_count   = count($values[0]);

        for ($index = 0; $index < $values_count; $index++) {
            $value = '' !== $values[0][$index]
                ? strtr(trim($values[0][$index], '"'), ['""' => '"'])
                : null;

            $out_values[$definition[$index]] = preg_match(':^{.*}$:', $value)
                ? stripcslashes($value)
                : $value;
        }

        return $out_values;
    }

@chanmix51
Copy link
Member

chanmix51 commented Feb 21, 2017

This reminds me the dark age of Pomm 1.x with the performance disaster associated with PHP’s preg_match (which simply segfault when the field gets over 8kb).

@tlode
Copy link
Contributor Author

tlode commented Feb 21, 2017

I agree. This was more of an experiment I wanted to share. So what could be a solution then?

It is obvious to me that str_getcsv isn't the right solution either, because we don't deal with csv in this context. It's close though. A composite literal must have a well defined format. Using a better, performant regular expression to find the outer most field boundaries, might not be completely off track.

@chanmix51
Copy link
Member

The idea raised by Daniel Vérité is to create a state machine for that case. It’s imho a better idea than using a regular expression (which I did in the previous version of Pomm).

@chanmix51
Copy link
Member

Ok, I’ll sum up this thread of Stack-Overflow here about this bug.

There is a mind gap between PHP and Postgres community. To handle polymorphism properly, Pg does type NULL (no defined value) so NULL::text or NULL::int4 do mean no defined value for this type. For PHP, NULL is not of a type because it is undefined. As the CSV files return strings, theses pieces of string cannot be null because they are typed hence ("a","",,"b") means ['a', '', '', 'b'].

This behavior complies with the RFC4180. Even though this RFC is only informational and has been drew to make the mime-type text/csv compatible with Ms/Excel, it is the model for how PHP parses CSV lines.

There is (almost) no hope to patch the code of the underlying C function.

Choices are now:

  • either to code a state machine to parse CSV in PHP with the performance issues and bugs one can expect
  • or to let this bug as is and document it
  • or to remove this functionality with its bug from Pomm

Any thoughts ?

@malenkiki
Copy link

malenkiki commented Jul 4, 2019

I try a short code to have that without regex.

Not perfect, but it is a starting point.

Examples at the end.

Not tested yet onPgEntity::transformData, I wrote just a proof of concept.

class Runner
{
    const CHAR_SEP = ',';
    const CHAR_DBQ = '"';

    protected $inCell      = false;
    protected $cellDbq     = false;
    protected $endCellDbq  = false;
    protected $cell        = null;
    protected $lastIdx     = 0;


    public function __construct(int $lastIdx)
    {
        $this->lastIdx = $lastIdx;
    }

    protected function reinit() {
        $this->cellDbq     = false;
        $this->inCell      = false;
        $this->endCellDbq  = false;
    }

    protected function isLastChar(int $idx): bool
    {
        return $idx === $this->lastIdx;
    }

    public function look(int $idx, string $char)
    {

        if ($this->inCell) {
            // Previous state IN CELL
            if ($this->cellDbq) {
                if (self::CHAR_DBQ === $char) {
                    $this->endCellDbq = true;
                } elseif ($this->endCellDbq && self::CHAR_SEP === $char) {
                    $this->reinit();
                } else {
                    $this->cell .= $char;
                }
            } else {
                if (self::CHAR_SEP !== $char) {
                    $this->cell .= $char;
                }
    
                if (self::CHAR_SEP === $char) {
                    $this->reinit();
                }
            }
        } else {
            // Previous state NOT IN CELL
            if (self::CHAR_SEP === $char) {
                $this->reinit();
            } else {
                $this->inCell = true;
                if (self::CHAR_DBQ === $char) {
                    $this->cellDbq = true;
                    $this->cell = '';
                } else {
                    $this->cellDbq = false;
                    $this->cell .= $char;
                }
            }
        }
    }

    public function canRelease(): bool
    {
        return !$this->inCell && !$this->cellDbq && !$this->endCellDbq;
    }

    public function getCell()
    {
        $out = $this->cell;
        $this->cell = null;

        return $out;
    }
}




class CompositeValueExtractor
{
    protected $encoding;

    public function __construct(string $forceEncoding = '')
    {
        if (!empty($forceEncoding)) {
            $this->encoding = $forceEncoding;
        } else {
            $this->encoding = mb_internal_encoding();
        }
    }

    public function __invoke(string $str): array
    {
        $out = [];

        $len = mb_strlen($str, $this->encoding);
        $end = $len - 1;
        $runner = new Runner($end);

        for ($i = 0; $i <= $end; $i++) {
            $char = mb_substr($str, $i, 1);

            $runner->look($i, $char);

            if ($runner->canRelease()) {
                $out[] = $runner->getCell();
            }
        }

        // end of string, so getting last cell content
        $out[] = $runner->getCell();
        

        return $out;
    }
}


$tests = [
    'cas"tordu,"in","",one,,"pika, chu",,"out"',
    ',"something",',
    ',',
    ',"",',
    '28c2cf5f-4df9-4e6f-a5d6-a41eaba35285,"test","",'
];

$cve = new CompositeValueExtractor();

foreach ($tests as $test) {
    $res = $cve($test);

    var_dump($res);
}

Results are:

array(8) {
  [0]=>
  string(9) "cas"tordu"
  [1]=>
  string(2) "in"
  [2]=>
  string(0) ""
  [3]=>
  string(3) "one"
  [4]=>
  NULL
  [5]=>
  string(9) "pika, chu"
  [6]=>
  NULL
  [7]=>
  string(3) "out"
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(9) "something"
  [2]=>
  NULL
}
array(2) {
  [0]=>
  NULL
  [1]=>
  NULL
}
array(3) {
  [0]=>
  NULL
  [1]=>
  string(0) ""
  [2]=>
  NULL
}
array(4) {
  [0]=>
  string(36) "28c2cf5f-4df9-4e6f-a5d6-a41eaba35285"
  [1]=>
  string(4) "test"
  [2]=>
  string(0) ""
  [3]=>
  NULL
}

What do you think about that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants