HTML entities are one of those topics developers learn once, use constantly, and somehow never fully nail down. When do you actually need them? What's the difference between &amp;amp; and &amp;#38;? And why does skipping them in the wrong place turn a web form into an attack surface?
This article cuts through the noise. You'll get the five characters you must always encode, a full reference table to bookmark, a clear explanation of how encoding blocks XSS, and the context-specific rules that most tutorials skip.
What an HTML Entity Actually Is
An HTML entity is a text sequence that represents a single character. Every entity starts with & and ends with ;. Between those delimiters is either a name (&amp;) or a number (&#38; in decimal or &#x26; in hexadecimal).
The browser's HTML parser recognises these sequences and replaces them with the corresponding character before rendering. So &lt; in source becomes a visible < on screen: it is shown as text, not treated as a tag opener.
There are two reasons to use entities:
- Safety — to prevent characters with HTML meaning from being parsed as markup when they should display as text.
- Convenience — to insert characters that are hard or impossible to type directly, like © or —.
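As a rough illustration of that replacement step, Python's standard html module can stand in for the browser's parser. The input string is made up for the example, and html.unescape only approximates what a browser does while parsing a full page.

```python
import html

# Source text as it might appear in an HTML file (illustrative input).
source = "5 &lt; 7 &amp;&amp; 7 &gt; 5 &copy; 2026"

# html.unescape resolves each entity to the character it stands for,
# roughly what the browser's parser does before rendering text.
rendered = html.unescape(source)
print(rendered)
```

The printed text contains the literal <, >, & and © characters, which is what a reader would see on screen.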
The Five Characters You Must Always Encode
These five characters have structural meaning in HTML. Any time they appear in text content or attribute values that come from user input or an external source, they must be encoded. Skipping even one is a path to broken markup or an XSS hole.
A quick breakdown of why each one matters:
- & starts every entity sequence. Unencoded ampersands in text cause parsing ambiguity and can make the HTML invalid. In URLs inside href attributes, an unencoded & between query parameters can be read as an entity opener rather than a separator; use &amp;amp; instead.
- < opens HTML tags. Any unencoded < in content can cause the parser to start reading a tag, breaking layout or creating an injection point.
- > closes HTML tags. Less dangerous than < in practice, but encode it anyway for consistency and to avoid edge-case parser issues.
- " terminates double-quoted attribute values. Inside href="..." or any attribute, an unencoded " ends the value early.
- ' terminates single-quoted attribute values. Same issue for href='...'.
A URL with query parameters like ?name=Alice&city=Berlin inside an href attribute should be written as href="?name=Alice&amp;amp;city=Berlin". Without encoding the &, validators may flag the markup, and a parameter whose name happens to match an entity name can be misread by the parser.
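The two layers of that href can be sketched in Python: build the query string first, then HTML-encode it for the attribute. The parameter values come from the example above; the link text "Profile" is a placeholder.

```python
import html
from urllib.parse import urlencode

# Illustrative parameters taken from the example above.
params = {"name": "Alice", "city": "Berlin"}

# URL layer: join the parameters with a plain "&".
query = "?" + urlencode(params)

# HTML layer: encode the finished URL for use inside an attribute,
# which turns the "&" separator into "&amp;".
href = html.escape(query, quote=True)

link = f'<a href="{href}">Profile</a>'
print(link)
```

The browser decodes &amp;amp; back to & when it reads the attribute, so the request still carries the original query string.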
Named vs Numeric Entities
Every HTML entity can be written two ways: by name or by Unicode code point number.
Named:   &amp;copy;   ← readable, human-friendly
Decimal: &amp;#169;   ← Unicode code point in base 10
Hex:     &amp;#xA9;   ← Unicode code point in base 16
All three render as: ©
Named entities are easier to read and write. You see &amp;mdash; and immediately know it's an em dash. The downside is that not every character has a name: HTML5 defines a little over 2,000 named references, compared to over a million Unicode code points.
Numeric entities work for any Unicode character, named or not. If you need a specific emoji or obscure symbol, use &#[codepoint];. They're less readable but universally applicable.
&amp;apos; for the apostrophe was defined in XML and XHTML but was not part of HTML4. It's fully supported in HTML5 and all modern browsers. Use it freely, but be aware that very old or obscure parsers (roughly pre-2008) might not recognise it; use &amp;#39; as a numeric fallback in those cases.
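A small Python sketch of the named-versus-numeric choice: prefer a name when one is known, otherwise fall back to the decimal code point. The helper name to_entity is hypothetical, and codepoint2name covers only the HTML 4.01 names, so it is a conservative lookup rather than the full HTML5 list.

```python
import html
from html.entities import codepoint2name  # HTML 4.01 names only

def to_entity(char: str) -> str:
    """Return a named entity when one is known, else a decimal numeric one."""
    code_point = ord(char)
    name = codepoint2name.get(code_point)
    return f"&{name};" if name else f"&#{code_point};"

print(to_entity("©"))    # has an HTML 4.01 name
print(to_entity("😀"))   # no name, so the numeric form is used
print(to_entity("'"))    # apostrophe has no HTML 4.01 name either

# The three spellings of the same character decode identically.
assert html.unescape("&copy;") == html.unescape("&#169;") == html.unescape("&#xA9;") == "©"
```

Because the lookup uses the HTML 4.01 table, the apostrophe comes out in its numeric form, which matches the fallback advice above.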
How HTML Encoding Stops XSS
Cross-site scripting (XSS) happens when an attacker's JavaScript ends up running in another user's browser. The most common path: a user submits text containing <script> tags or event handler attributes, and the server returns that text inside an HTML page without encoding it.
User input: <script>document.cookie</script>
Inserted into page: <p><script>document.cookie</script></p>
Result: script executes and can read cookies

User input (same): <script>document.cookie</script>
After encoding: <p>&amp;lt;script&amp;gt;document.cookie&amp;lt;/script&amp;gt;</p>
Result: displays as text, does nothing
Encoding works because &amp;lt;script&amp;gt; is text content: the browser displays the characters <script> on screen rather than creating a script element. The attack is neutralised at the parsing stage, before anything can run.
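The same before-and-after can be reproduced with Python's standard library. This is a minimal sketch: render_comment is a hypothetical helper, and a real application would normally rely on its template engine's escaping instead.

```python
import html

def render_comment(comment: str) -> str:
    """Wrap untrusted text in a paragraph, encoding it first."""
    return f"<p>{html.escape(comment)}</p>"

payload = "<script>document.cookie</script>"
safe_markup = render_comment(payload)
print(safe_markup)
```

The output contains no literal <script> tag, so the browser has nothing to execute.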
HTML encoding alone is not sufficient in every context:
- JavaScript strings — data inside <script> tags needs JavaScript escaping (\", \', \\), not HTML entities
- CSS values — data inside style attributes needs CSS escaping
- URL parameters — values in href/src attributes need URL encoding (%20, etc.) in addition to HTML encoding of the surrounding attribute
- JavaScript event handlers — avoid putting untrusted data directly in onclick, onmouseover, etc.
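One way to keep these contexts apart is a separate encoder per context. The sketch below uses only the Python standard library; the function names are hypothetical, and CSS escaping is left out because the standard library has no helper for it.

```python
import html
import json
from urllib.parse import quote

def for_html_attribute(value: str) -> str:
    """HTML context: encode & < > and both quote characters."""
    return html.escape(value, quote=True)

def for_url_param(value: str) -> str:
    """URL context: percent-encode everything outside the unreserved set."""
    return quote(value, safe="")

def for_js_string(value: str) -> str:
    """JavaScript context: emit a quoted string literal.

    json.dumps handles quotes and backslashes; <, > and & are rewritten as
    \\uXXXX escapes so the text "</script>" cannot close the script block.
    """
    literal = json.dumps(value)
    return (literal.replace("<", "\\u003c")
                   .replace(">", "\\u003e")
                   .replace("&", "\\u0026"))

print(for_html_attribute('say "hi" & <wave>'))
print(for_url_param("a b&c"))
print(for_js_string("</script>"))
```

A value that passes through two contexts, such as a URL parameter inside an href, goes through both encoders in order: for_url_param first, then for_html_attribute on the finished URL.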
When You Don't Need to Encode
Modern UTF-8 HTML documents served with <meta charset="UTF-8"> can contain most Unicode characters directly (accented letters, emoji, currency symbols, math operators) without any encoding. You don't need to write &amp;eacute; for é if your document is UTF-8 and your text editor saves in UTF-8.
The only characters you always must encode, regardless of charset, are the five structural ones: &, <, >, ", and '.
Use named entities for these cases:
- Your CMS or template system mangles certain characters on save
- You're working in a legacy ASCII-only environment
- The character is difficult to type and the entity name is more readable (&amp;mdash; for —)
- You want explicit documentation that a space is intentionally non-breaking (&amp;nbsp;)
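The UTF-8 case and the legacy ASCII-only case can be compared side by side in Python. The sample text is invented for the example; xmlcharrefreplace is a standard codec error handler that writes each non-ASCII character as a decimal numeric entity.

```python
import html

text = "café © 2026 <draft> & more"

# UTF-8 document: only the structural characters need encoding,
# so é and © pass through unchanged.
utf8_safe = html.escape(text)

# ASCII-only target: every non-ASCII character becomes a numeric entity.
ascii_safe = utf8_safe.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(utf8_safe)
print(ascii_safe)
```

Escaping the structural characters first matters here: running html.escape after the ASCII conversion would double-encode the ampersands in the numeric entities.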
Typography Entities Worth Knowing
These are the most useful typography-related entities — the ones that separate professional copy from sloppy copy.
Quick usage rules:
- Em dash (—) is used for a strong parenthetical break or interruption. American style sets it closed, with no spaces on either side; other house styles add thin spaces or substitute a spaced en dash.
- En dash (–) is used for ranges (pages 12–18, 2020–2026) and often stands in for a minus sign in text, although the dedicated minus character is &amp;minus; (−).
- Curly quotes (“ ” ‘ ’) are the typographically correct form. Straight quotes (" and ') are typewriter artifacts. Use curly quotes in published content.
- Ellipsis (…) is a single character, not three periods. It spaces differently and doesn't break across lines.
- Non-breaking space (&amp;nbsp;) prevents a line break between two words. Use it between a number and its unit (10 kg) and between a title and a name (Dr. Smith).
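The number-and-unit rule is mechanical enough to automate. The following Python sketch assumes a short, illustrative unit list and two hypothetical helpers; it only handles a single plain space between a digit and a listed unit.

```python
import re

# Illustrative unit list; extend it for the units your content uses.
UNITS = ("kg", "km", "cm", "ms")
NUMBER_UNIT = re.compile(r"(\d) (" + "|".join(UNITS) + r")\b")

def bind_units(text: str) -> str:
    """Swap the plain space between a number and its unit for U+00A0."""
    return NUMBER_UNIT.sub("\\1\u00a0\\2", text)

def to_markup(text: str) -> str:
    """Spell each non-breaking space as &nbsp; so it is visible in source."""
    return text.replace("\u00a0", "&nbsp;")

print(to_markup(bind_units("The parcel weighs 10 kg and travelled 300 km.")))
```

If the text also needs HTML encoding, encode it before calling to_markup, otherwise the ampersand in &amp;nbsp; would be encoded a second time.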
Full Entity Reference Table
The entities developers actually use — grouped by category. Use our HTML Encoder to quickly encode any of these into your markup.
| Char | Named | Numeric | Description |
|---|---|---|---|
| Essential — always encode | | | |
| & | &amp;amp; | &amp;#38; | Ampersand |
| < | &amp;lt; | &amp;#60; | Less-than sign |
| > | &amp;gt; | &amp;#62; | Greater-than sign |
| " | &amp;quot; | &amp;#34; | Double quotation mark |
| ' | &amp;apos; | &amp;#39; | Apostrophe / single quote |
| Spaces & punctuation | | | |
|   | &amp;nbsp; | &amp;#160; | Non-breaking space |
| — | &amp;mdash; | &amp;#8212; | Em dash |
| – | &amp;ndash; | &amp;#8211; | En dash |
| … | &amp;hellip; | &amp;#8230; | Horizontal ellipsis |
| • | &amp;bull; | &amp;#8226; | Bullet point |
| « | &amp;laquo; | &amp;#171; | Left double angle quote |
| » | &amp;raquo; | &amp;#187; | Right double angle quote |
| Typographic quotes | | | |
| “ | &amp;ldquo; | &amp;#8220; | Left double quotation mark |
| ” | &amp;rdquo; | &amp;#8221; | Right double quotation mark |
| ‘ | &amp;lsquo; | &amp;#8216; | Left single quotation mark |
| ’ | &amp;rsquo; | &amp;#8217; | Right single quotation mark (apostrophe) |
| Symbols & intellectual property | | | |
| © | &amp;copy; | &amp;#169; | Copyright sign |
| ® | &amp;reg; | &amp;#174; | Registered trademark |
| ™ | &amp;trade; | &amp;#8482; | Trade mark sign |
| Currency | | | |
| € | &amp;euro; | &amp;#8364; | Euro sign |
| £ | &amp;pound; | &amp;#163; | Pound sterling |
| ¥ | &amp;yen; | &amp;#165; | Yen / Yuan sign |
| ¢ | &amp;cent; | &amp;#162; | Cent sign |
| Math & science | | | |
| ° | &amp;deg; | &amp;#176; | Degree sign |
| ± | &amp;plusmn; | &amp;#177; | Plus-minus sign |
| × | &amp;times; | &amp;#215; | Multiplication sign |
| ÷ | &amp;divide; | &amp;#247; | Division sign |
| ½ | &amp;frac12; | &amp;#189; | Vulgar fraction one half |
| ¼ | &amp;frac14; | &amp;#188; | Vulgar fraction one quarter |
| ¾ | &amp;frac34; | &amp;#190; | Vulgar fraction three quarters |
| ∞ | &amp;infin; | &amp;#8734; | Infinity |
| √ | &amp;radic; | &amp;#8730; | Square root |
| ∑ | &amp;sum; | &amp;#8721; | N-ary summation |
| π | &amp;pi; | &amp;#960; | Greek small letter pi |
| Arrows | | | |
| → | &amp;rarr; | &amp;#8594; | Rightward arrow |
| ← | &amp;larr; | &amp;#8592; | Leftward arrow |
| ↑ | &amp;uarr; | &amp;#8593; | Upward arrow |
| ↓ | &amp;darr; | &amp;#8595; | Downward arrow |
| ↔ | &amp;harr; | &amp;#8596; | Left right arrow |
| Card suits | | | |
| ♠ | &amp;spades; | &amp;#9824; | Black spade suit |
| ♥ | &amp;hearts; | &amp;#9829; | Black heart suit |
| ♦ | &amp;diams; | &amp;#9830; | Black diamond suit |
| ♣ | &amp;clubs; | &amp;#9827; | Black club suit |
Using Entities in Code
Here's how to handle HTML encoding in the languages and frameworks developers use most in 2026.
JavaScript (vanilla)
function htmlEncode(str) {
  // Replace & first so the ampersands added by later steps are not re-encoded
  return String(str)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
// Use it before inserting untrusted content
element.innerHTML = htmlEncode(userInput);

When you just need to display text, assign to element.textContent rather than element.innerHTML. The browser treats the value as literal text, so no manual escaping is needed and there's no risk of accidentally creating markup or a script context.
JavaScript (decode using the DOM)
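function htmlDecode(str) {
  // A textarea parses its content as text: entities are decoded, no elements are created
  const el = document.createElement('textarea');
  el.innerHTML = str;
  return el.value;
}
htmlDecode('&lt;p&gt;Hello &amp; welcome&lt;/p&gt;')
// → '<p>Hello & welcome</p>'
PHP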
// Encode the 5 essential characters
htmlspecialchars($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// Encode every character that has a named entity (rarely needed in UTF-8 docs)
htmlentities($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
// Decode
htmlspecialchars_decode($encoded, ENT_QUOTES | ENT_HTML5);
html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
Python
import html

s = '<p>Hello & World</p>'

# Encode (escapes & < > " ')
html.escape(s)
# → '&lt;p&gt;Hello &amp; World&lt;/p&gt;'

# Encode & < > only, leaving both " and ' untouched
# (fine for text content, not for attribute values)
html.escape(s, quote=False)

# Decode (handles named, decimal, and hex entities)
html.unescape('&lt;p&gt;Hello &amp; World&lt;/p&gt;')
# → '<p>Hello & World</p>'
Template engines (React, Vue, Django, Rails)
Most modern template systems auto-escape output by default:
// React — JSX auto-encodes
<p>{userInput}</p> ← safe, encoded automatically
<p dangerouslySetInnerHTML={{__html: userInput}}/> ← UNSAFE
// Vue — double curly auto-encodes
{{ userInput }} ← safe
v-html="userInput" ← UNSAFE
// Django templates — auto-escape by default
{{ user_input }} ← safe
{{ user_input|safe }} ← UNSAFE — only for trusted HTML
// Rails ERB — auto-escape by default
<%= user_input %> ← safe
<%= raw user_input %> ← UNSAFE
The pattern is consistent: the default interpolation syntax is safe; the "raw" or "unsafe HTML" bypass is the thing to avoid with untrusted data.
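The escape-by-default pattern can be sketched in a few lines of Python. This is a toy model, not how React, Vue, Django, or Rails implement it; TrustedHTML and render are hypothetical names chosen for the example.

```python
import html

class TrustedHTML(str):
    """Marker type: a string the caller vouches for as already-safe HTML."""

def render(template: str, **values: object) -> str:
    """Fill a str.format template, escaping every value unless it is marked trusted."""
    prepared = {
        key: value if isinstance(value, TrustedHTML) else html.escape(str(value))
        for key, value in values.items()
    }
    return template.format(**prepared)

# Default path: untrusted input is encoded.
print(render("<p>{comment}</p>", comment="<script>alert(1)</script>"))

# Explicit opt-out: only for markup the application itself produced.
print(render("<p>{badge}</p>", badge=TrustedHTML("<b>Admin</b>")))
```

The design choice worth copying is that the unsafe path requires an explicit, searchable marker, so a code review can find every place where escaping was bypassed.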
The Misuse Problem
Non-breaking space is probably the most misused entity. Developers sometimes add multiple &amp;nbsp; characters in a row to create visual indentation or padding. This is a bad pattern for several reasons:
- Accessibility — screen readers may announce each non-breaking space, or create awkward pauses
- Maintenance — spacing in markup is brittle; CSS handles it more cleanly
- Semantics — &amp;nbsp; means "these two things should not be separated", not "add space here"
Use &amp;nbsp; only for its intended purpose: preventing a line break between two specific words or a number and its unit. Use padding, margin, or gap in CSS for visual spacing.
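A quick way to spot the padding pattern in existing markup is to search for runs of consecutive non-breaking spaces. The check below is a rough Python sketch with a hypothetical helper name; it looks for the named entity, the decimal entity, and the raw U+00A0 character.

```python
import re

# Two or more non-breaking spaces in a row, as entities or raw U+00A0.
NBSP_RUN = re.compile(r"(?:&nbsp;|&#160;|\u00a0){2,}")

def find_nbsp_runs(markup: str) -> list[str]:
    """Return every run of consecutive non-breaking spaces found in markup."""
    return NBSP_RUN.findall(markup)

print(find_nbsp_runs("<p>&nbsp;&nbsp;&nbsp;Indented text, 10&nbsp;kg</p>"))
```

A single &amp;nbsp; between a number and its unit is not reported, since that is the intended use.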