Skip to content

Quote top-level string scalars whose bare form re-parses as a container#2656

Open
barry3406 wants to merge 1 commit intomikefarah:masterfrom
barry3406:fix/quote-bare-string-scalar
Open

Quote top-level string scalars whose bare form re-parses as a container#2656
barry3406 wants to merge 1 commit intomikefarah:masterfrom
barry3406:fix/quote-bare-string-scalar

Conversation

@barry3406
Copy link
Copy Markdown

The yaml encoder has a fast-path for unwrapping top-level scalars: when UnwrapScalar is set (the default for yaml output) it writes node.Value verbatim and skips the regular encoder. The fast-path never checks whether the bare form round-trips, so any !!str whose value happens to be a valid YAML mapping or sequence is silently reinterpreted on the next read.

Reproducer:

printf '"this: should really work"\n' | yq -p yaml -o yaml

Before this fix the second invocation parses the output as a single-key map, destroying the original string; the reporter's second example (yq .foo on a foo: | block containing a: a / b: b) hits the same path through the literal style.

Guard the fast-path with bareStringNeedsQuoting, which re-parses node.Value with yaml.v4 and reports whether the result is something other than a scalar. When it returns true the encoder falls through to the normal encoder.Encode path, which picks up LiteralStyle or DoubleQuotedStyle as appropriate. The check is deliberately narrow:

  • only !!str nodes are examined, so existing tag tests that emit !!int/!!str as a bare scalar are untouched;
  • only structural reinterpretations (map/sequence/alias) trigger the fallback, so plain strings that re-read as integers, booleans, or nulls keep their current unwrapped form;
  • an unparseable value (e.g. one containing NUL) stays on the fast-path, so the NUL-aware printer in printer.go still sees the raw bytes.

The new table-driven test TestYamlEncoderUnwrapScalarRoundtripSafety covers the colon, dash, and multiline-map cases from the report and asserts that safe plain strings, digits, null, and !!int still emit bare. The decode yaml document base64 scenario previously asserted the buggy a: apple\n bare output; it now expects the block-literal form that round-trips back to the same string.

Fixes #2608

When UnwrapScalar is enabled (the default for yaml output), the yaml
encoder writes node.Value verbatim as a bare line. Any string whose
content is itself a valid YAML mapping, sequence, or alias then round
trips as that container instead of as a string. For example, the input
document `"this: should really work"` previously re-emitted as the bare
line `this: should really work`, which the next reader parses as a one
key map, destroying the original scalar. The same problem surfaces
whenever a multiline string literal happens to contain `key: value`
lines, which is the form the bug report uses for its second reproducer.

Guard the fast-path by re-parsing node.Value with yaml.v4: if the bare
form decodes to a non-scalar, fall through to the regular encoder so it
can apply the quoting style required by the YAML spec. The check is
limited to `!!str` nodes and to structural reinterpretations, so tag
expressions such as `!!int` and plain strings that re-read as integers,
booleans, or nulls are unaffected. An unparseable value (e.g. one
containing NUL) stays on the fast-path so downstream NUL-aware writers
still see the raw bytes.

Updates the base64 "decode yaml document" scenario whose expected
output was `a: apple\n` bare; it is now emitted as a block literal,
which round-trips back to the same string.

Reproducer:

```
printf '"this: should really work"\n' | yq -p yaml -o yaml
```

Before this fix the second run of yq parses the output as a map;
after, it remains the original string.

Fixes mikefarah#2608
// The decoded payload ("a: apple\n") would re-parse as a map if
// emitted bare, so the yaml encoder keeps it as a block literal to
// preserve roundtrip safety. See issue #2608.
expected: "|\n a: apple\n",
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what worries me:
a) how would you now emit this as a bare string if you wanted to;
b) this change could break a whole bunch of scripts / CI/CD pipelines for people

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's a valid concern. A couple thoughts:

(a) Users can still get the raw value with -o=raw / --outputformat=raw, which bypasses the yaml encoder entirely. So there is an escape hatch.

(b) On breakage — this only kicks in for strings whose bare form re-parses as a map or sequence (like a: b or - item). Plain strings, numbers, bools are untouched. So it's a pretty narrow change, but I get that even narrow changes can bite someone.

If you'd prefer a flag to opt into this behavior instead of making it the default, I can rework it that way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A single string scalar as output is not quoted if necessary (breaking roundtrip safety)

2 participants