Encoding errors on ASCII-8BIT strings (eg: any string from the mysql adapter) #30

Closed
@kindjar

The sanitizer fails when its input is a string in ASCII-8BIT encoding. It prints an encoding error and returns an empty string:

irb(main):006:0* Rails::Html::WhiteListSanitizer.new.sanitize("tooth".encode('ASCII-8BIT'))
output error : unknown encoding ASCII-8BIT
=> ""
irb(main):007:0>

While ASCII-8BIT isn't the default encoding these days, it seems that strings coming from the mysql adapter (but not the mysql2 adapter) are always in ASCII-8BIT encoding, even when the table is using charset utf8:

irb(main):004:0> Day.connection.charset
=> "utf8"
irb(main):005:0> Day.last.notes.encoding
=> #<Encoding:ASCII-8BIT>

This means that sanitizing any string read from the database through the mysql adapter hits this error and comes back empty. I traced the error to Nokogiri's NodeSet#to_s method, but wasn't sure what the right approach was for addressing it.

Switching to the mysql2 adapter makes the issue go away, since it produces all strings in UTF-8. However, folks who've been using the mysql gem (for legacy reasons or whatever) could run into headaches when upgrading to Rails 4.2 because of this. It hit me by way of the highlight method in ActionView::Helpers::TextHelper.
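
For anyone stuck on the mysql gem in the meantime, one possible workaround is to re-tag the string as UTF-8 before handing it to the sanitizer. The sketch below is only an illustration, not a fix adopted by the sanitizer or by Nokogiri. The helper name `retag_as_utf8` is made up for this example, and it assumes the bytes coming back from the database really are UTF-8 (as the utf8 charset above suggests) and are merely labeled ASCII-8BIT.

```ruby
# Hypothetical helper (not part of Rails or rails-html-sanitizer).
# Assumption: the bytes are really UTF-8 and only carry the wrong encoding tag.
def retag_as_utf8(str)
  # Leave strings alone unless they are tagged as binary (ASCII-8BIT).
  return str unless str.encoding == Encoding::ASCII_8BIT

  # force_encoding changes only the encoding tag, not the bytes.
  # dup first so the caller's string is not mutated.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)

  # If the bytes turn out not to be valid UTF-8, replace the invalid
  # sequences with "?" instead of passing a broken string along.
  utf8.valid_encoding? ? utf8 : utf8.scrub("?")
end

notes = "tooth".encode("ASCII-8BIT")
clean = retag_as_utf8(notes)
puts clean.encoding  # prints "UTF-8"

# In the Rails app, the result would then go to the sanitizer, e.g.:
#   Rails::Html::WhiteListSanitizer.new.sanitize(retag_as_utf8(Day.last.notes))
```

Because `force_encoding` does not transcode anything, this only helps when the assumption about the underlying bytes holds. If the data is actually in some other encoding, multibyte characters will come out wrong or be replaced with "?".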
