Coding Problem :
I need to recover all the original HTML entities of a paragraph, especially the accented characters. The methods I know recover only some entities, as in the example below, where “>” is correctly encoded but “ç” is not.
It is important that the code can tell accented characters that were written as entities apart from those that were not (as in
çã ), because the content comes from an external source and may not follow a consistent pattern.
<p>situação > ativo</p>
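A minimal sketch of the symptom, using Python's standard `html` module purely as a stand-in for a parse-and-serialize round trip (the two input strings are hypothetical, not the asker's data). Once the text is decoded, both spellings of "ç" become the same character, and re-escaping only restores the markup-significant characters:

```python
import html

# Two hypothetical inputs: one writes the accents as entities,
# the other as literal characters.
with_entities = "situa&ccedil;&atilde;o &gt; ativo"
with_literals = "situação &gt; ativo"

decoded_a = html.unescape(with_entities)
decoded_b = html.unescape(with_literals)

# After decoding, both inputs are the same string: the information about
# how "ç" was originally written is no longer there.
same_after_decoding = decoded_a == decoded_b

# html.escape only rewrites &, < and > (plus quotes by default),
# so ">" comes back as an entity while "ç" does not.
re_escaped = html.escape(decoded_a)

print(same_after_decoding)
print(re_escaped)
```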
Note: Since, according to the accepted answer by @mgibsonbr, this is not possible, the solution adopted was to use
DOMDocument::saveHTML, which interprets entities the same way the browser does, so the data is the same on both the server and the client.
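The idea behind that workaround, serializing the text one consistent way on both sides instead of trying to recover the original spelling, can be sketched in Python. This is not PHP's `DOMDocument::saveHTML` and makes no claim to match its output. `encode_entities` is a hypothetical helper that assumes it receives plain text content (not markup) and uses the HTML 4 names from `html.entities.codepoint2name`:

```python
from html.entities import codepoint2name


def encode_entities(text: str) -> str:
    """Re-encode text so every character with an HTML 4 entity name uses it.

    Assumption: `text` is text content, not markup, so "<", ">" and "&"
    are escaped as well (they are part of codepoint2name).
    """
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            out.append(f"&{name};")      # named reference, e.g. &ccedil;
        elif ord(ch) > 127:
            out.append(f"&#{ord(ch)};")  # no name known: numeric reference
        else:
            out.append(ch)               # plain ASCII stays as it is
    return "".join(out)


print(encode_entities("situação > ativo"))
```

Both a literal "ç" and a decoded "&ccedil;" end up in the same form, so the server and the client can compare normalized strings even though the original spelling cannot be recovered.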
Answer 1 :
Tokenizing character references
The behavior depends on the identity of the next character (the one immediately after the U+0026 AMPERSAND), as follows:
“#” (U+0023 NUMBER SIGN)
Consume the U+0023 NUMBER SIGN.
Consume all characters that match the applicable range (ASCII hex digits for a hexadecimal reference, ASCII digits for a decimal one).
If no characters match, nothing is consumed and it is a parse error. Otherwise, if the next character is a U+003B SEMICOLON, consume it as well. If it is not, it is a parse error.
Unless the number falls under one of the special cases the specification lists, return a character token for the Unicode character whose code point is that number.
Anything else
Consume as many characters as possible, as long as the characters consumed match one of the identifiers in the first column of the named character references table (case-sensitive).
Return one or two character tokens for the character(s) corresponding to the character reference name (given by the second column of the
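The steps quoted above can be sketched as follows. This is a simplified illustration under stated assumptions, not the full specification algorithm. `consume_char_ref` is a hypothetical function name, Python's `html.entities.html5` dictionary stands in for the named character references table, and the specification's remapping of certain numeric values and its attribute-value rules are left out.

```python
import re
from html.entities import html5  # maps names such as "ccedil;" to characters

# "#", then hex digits after x/X or decimal digits, then an optional ";"
_NUMERIC = re.compile(r"#(?:[xX]([0-9a-fA-F]+)|([0-9]+))(;?)")
_MAX_NAME = max(len(name) for name in html5)


def consume_char_ref(s: str, pos: int):
    """Consume a character reference starting at s[pos], which must be "&".

    Returns (text, new_pos, parse_error). If nothing can be consumed,
    the "&" is returned as a literal character.
    """
    assert s[pos] == "&"
    rest = s[pos + 1:]

    if rest.startswith("#"):
        m = _NUMERIC.match(rest)
        if m is None:
            # No digits after "#": consume nothing, report a parse error.
            return "&", pos + 1, True
        hex_digits, dec_digits, semicolon = m.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        new_pos = pos + 1 + m.end()
        if code == 0 or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            # Simplification: unusable numbers become U+FFFD.
            return "\ufffd", new_pos, True
        # A missing ";" is a parse error, but the character is still returned.
        return chr(code), new_pos, semicolon == ""

    # Anything else: longest case-sensitive match against the name table.
    for length in range(min(len(rest), _MAX_NAME), 0, -1):
        candidate = rest[:length]
        if candidate in html5:
            # html5 values are one or two characters long.
            return html5[candidate], pos + 1 + length, not candidate.endswith(";")
    return "&", pos + 1, False


print(consume_char_ref("&ccedil;a", 0))
print(consume_char_ref("&#231;", 0))
```

The sketch only ever maps a reference to its character. The decoded result carries no record of which spelling produced it, which is why the original entities cannot be recovered after parsing.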