Recently, a browser implementer asked me for examples of SVG. He was having trouble finding good examples of SVG in use, particularly as parts of an HTML document. This question has come up again and again, actually, and it always vexes me. I’ve been active in the SVG community for close to a decade, and I’ve seen thousands of amazing SVG files (and many more of mediocre to average quality), but somehow they seem to have disappeared or bitrotted over the years. Some of those files only worked with the slightly-unstandard Adobe SVG Viewer, or didn’t quite work with Firefox’s incomplete support, I know, but surely not all of them. Where is all the great SVG content I remember, the games and GUIs and design and development? Where are all those files to be found?
I hear some browser implementers say that people just don’t use SVG. Intuitively, this feels false to me, based on my own experience. But could it be true?
The statistical insignificance of SVG is often cited by some people in the WHATWG, based on a large dataset of Web content indexed by Google. In the WHATWG, where HTML5 started, great stock is placed on statistics, particularly those compiled by the editor, Ian Hickson, a Google employee.
There’s no question about it: HTML is the king of the Web. I did some rough calculations, similar to claims I’ve heard before, by counting the number of returns for HTML files versus SVG files. A search for the filetype “.svg” yields around 18,165,500 hits on Google. (Note that this doesn’t count the false hits on the word “SVG” from St. Vincent and the Grenadines, Stan Van Gundy, the Sexy Valley Girls, or any of the numerous other bizarrities that the acronym stands for.) SVG content makes up just 0.106% of all Web content, by my rough estimation. Flash is almost 5 times as common as SVG. That’s pretty grim for SVG.
But wait, let’s put that into perspective. Flash is about 4.8 times more common than SVG. HTML is roughly 838 times more common than SVG. 838 times. Flash content comprises approximately 0.52% of all Web content, and HTML is roughly 189 times more common than Flash. So, Flash is clearly much more popular than SVG (even when you consider that some large percentage of Flash content is actually just encapsulated video content, these days). But that doesn’t mean that nobody’s using SVG. Nearly 20 million documents is pretty impressive, actually, especially given the fact that SVG has been hindered by a lack of native support in browsers for most of its existence (and more recently, even poor support by the Adobe plugin for IE), and a lack of common authoring tools for dynamic content (Inkscape is an excellent vector editor, but it doesn’t yet do animation or interactivity).
Eighteen million documents. That’s a lot of files. So, given that, why is it so hard to find SVG content?
Maybe because the most popular search engine in the world, Google, doesn’t index SVG.
A long time ago, back in 2002, I made a page discussing my experiments with text search and translation. The results were not very encouraging, but I reckoned it was just a matter of time. I optimistically wrote to Google to encourage them to enable text search and translation of SVG files.
8 years down the line, things don’t seem to have changed much on that front.
To be fair, many SVG files don’t contain any text at all, not even a <title> element, so indexing them might not yield much. But many other files do have at least a title, and SVG infographics and webapps usually have at least labels that might be meaningful as search terms. Often SVG files are even text-heavy.
It’s not that Google doesn’t take note of the files… obviously, you can search for the filetype, or in the worst case, the specific file URL, and normally get back positive results. But Google doesn’t seem to search the contents of the SVG files and present them in the relevant result set. To test this, I tried searching for a few files that I knew to have indexable text content.
As an example, I looked for some SVG files on my little (long out-of-date) SVG promotion site, SVG-Whiz.com. First, I searched for a file I knew to have a cogent block of text, my explanation of the distinctions between ‘display’, ‘visibility’, and ‘opacity’, called HideShow.svg:
This file has been hosted on my site since 2003, I’ve gotten several positive comments about it, and a direct search for that file URL turns up a few hits linking to it, so it seems like a reasonable candidate for indexing. But what are the results of my in-site Google search for the word ‘opacity’? Okay, that just turned up the explanation page linking to the SVG file in question. Fair enough, maybe Google doesn’t treat SVG as a “document” file, only as an image. So, how about an image search for the same term? Nada. So, maybe Google doesn’t consider SVG to be either a “document” or an “image”… let’s search for the word ‘opacity’ in the site ‘svg-whiz.com’ with the filetype ‘svg’. As specific as that is, at the time of writing, I got not a single resulting hit.
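For anyone who wants to repeat the test, here is a minimal sketch of how those three increasingly specific queries can be composed. The `site:` and `filetype:` operators are the ones described above; the helper names `build_query` and `search_url` are made up for illustration, and the code only builds the query strings and URLs, it does not fetch or parse any results.

```python
from urllib.parse import urlencode


def build_query(term, site=None, filetype=None):
    """Combine a search term with optional site: and filetype: operators."""
    parts = [term]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


def search_url(query):
    """Return a Google web-search URL for the query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # From least to most specific, mirroring the searches described above.
    queries = [
        build_query("opacity", site="svg-whiz.com"),
        build_query("opacity", site="svg-whiz.com", filetype="svg"),
    ]
    for q in queries:
        print(q, "->", search_url(q))
```

The image search uses the same terms against Google's image search rather than the web search, so it is not shown separately.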
Google can find the files… why doesn’t it do something with them?
Comparison of File Extension Frequency
So, what criteria does Google use to decide which file types it is going to index?
The Google FAQ on search filetypes lists 23 file extensions that it indexes, and says:
There are 13 main file types searched by Google in addition to standard web formatted documents in HTML. The most common formats are PDF, PostScript, Microsoft Office formats […] Google is also scouring the Web for additional file types that are very rare. You may see them pop up in your results from time to time. […] PDF formatted files are the most popular after HTML files. PostScript and Microsoft Word files are also fairly common. The other file types are relatively uncommon by comparison.
So, I took the liberty of conducting my own survey of the relative frequency of various filetypes, as collected by Google itself, by using the “filetype:extension” query term. I’m not totally convinced this is at all an accurate means to collect and analyze the data, but it’s what I had at hand.
I put together a table that compares the different file types that Google explicitly mentions. (I thought about representing the data as an SVG barchart, but I was afraid it wouldn’t be indexed… just kidding, the sheer volume of HTML files would make every other bar just a blip.)
I also threw in some other filetypes of interest, including some with functional similarity to SVG, such as Illustrator, PhotoShop, and Silverlight. I expected non-Web filetypes such as Illustrator’s “*.ai” to be disproportionately underrepresented in the results compared to their actual usage, and that was indeed borne out. It’s hard to know what percentage of SVG files are intended for and presented on the Web (I’ve spoken to many Inkscape users who only use SVG for print or local hard-drive, which surprised me), but I would guess that SVG is far, far more heavily tilted toward Web usage… and I still thought it would be interesting to compare.
What did surprise me was how “*.svg” compared to such ubiquitous file extensions as “*.txt”, and those for Excel, PowerPoint, and the venerable PostScript. To be frank, the results make me question my methodology, or perhaps the accuracy of Google’s reporting.
|File Type||File Extension||Number of Results||Year Introduced|
|HyperText Markup Language||total||16,574,700,000||1991|
|Adobe Portable Document Format||pdf||281,000,000||1993|
|Scalable Vector Graphics||total||18,165,500||1999|
|Rich Text Format||rtf||5,130,000||1987|
|All Web Content||n/a||17,081,255,598||1989|
Caveat: I originally compiled this information a few months ago, and when rechecking it for accuracy, I got a surprising result. Normally, the Google search for filetype:svg returned 18,100,000 hits, but late one night, it returned only 2,000,000 hits; now, my most recent check showed around 4,300,000 hits. Jumping around in the results for an explanation, I noticed that there is a lot of duplication of Wikipedia content, and since Wikipedia uses SVG, that might account for some discrepancy. One possibility is that the lower figure represents 2 million unique documents, which are duplicated in a lot of places; the same could be said of any HTML content, and probably to a lesser extent of Flash content. I don’t know if this is the right conclusion, but it would be an interesting data point. Even with the much more modest figure of 2 million documents, I still think that represents an impressive body of work, particularly in light of the fact that SVG documents are normally authored individually, not through forum or blog software, or exporting or reformatting of email and text content as HTML.
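The arithmetic behind these comparisons is easy to check. The sketch below uses only hit counts quoted in this post (the table rows for PDF, SVG, RTF, and all Web content, plus the lower 2,000,000 figure from the caveat); the function names are my own for illustration, and the numbers are of course only as reliable as the “filetype:” counts themselves.

```python
# Hit counts as reported in the post; not independently verified.
COUNTS = {
    "pdf": 281_000_000,
    "svg": 18_165_500,
    "rtf": 5_130_000,
}
ALL_WEB_CONTENT = 17_081_255_598
SVG_LOW_ESTIMATE = 2_000_000  # the late-night figure from the caveat


def share_percent(count, total=ALL_WEB_CONTENT):
    """Percentage of all indexed Web content that `count` represents."""
    return 100 * count / total


def times_as_common(a, b):
    """How many times more hits filetype `a` has than filetype `b`."""
    return COUNTS[a] / COUNTS[b]


if __name__ == "__main__":
    print(f"SVG share of Web content: {share_percent(COUNTS['svg']):.3f}%")
    print(f"SVG share, low estimate:  {share_percent(SVG_LOW_ESTIMATE):.3f}%")
    print(f"PDF vs. SVG: {times_as_common('pdf', 'svg'):.1f}x")
    print(f"SVG vs. RTF: {times_as_common('svg', 'rtf'):.1f}x")
```

With the usual figure, SVG comes out at about 0.106% of all Web content and roughly three and a half times as common as RTF, a format that Google does index.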
I don’t think this is some grand conspiracy by Google to suppress SVG. Simple neglect is much more plausible. They don’t seem to see the value in indexing SVG. But the end result is the same: SVG seems to be statistically underrepresented in terms of access through Google searches, and thus, it is harder to find SVG content.
Relying on the results of a search engine that doesn’t index SVG to draw conclusions about how many people are using SVG is not statistically sound. This is a bit like conducting a phone survey of English speakers in China, and concluding that nobody speaks with a Southern US accent. I reckon y’all might could see the problem with that methodology if you lived here in North Carolina.
SVG is at least as plentiful on the Web, by Google’s own reckoning, as most other file types that Google does index. Search engines, Google included, should index SVG files. They should read the text inside the file, and if the file is referenced in an HTML page, they should associate those keywords with the SVG file, just as they do with raster images. SVG files should display in image searches, as well. Here’s a list of the kind of useful content that can be gleaned from most SVGs:
- file name: while pretty primitive, many files give some clue as to their contents in their name
- text elements: there are several elements in SVG that contain text to be rendered to the screen, including <text>, <tspan>, <textPath>, and <textArea>, and the content should be indexed as if it were text in any other format
- embedded HTML: HTML (and other markup) can be embedded inline in SVG, and search engines should look for that and index it as they would standalone HTML content
- links: Google, and probably most other modern search engines, give weight to files that are linked from other files, and files referenced from SVG content should benefit in the same way; the @xlink:title and @rel can help define the relationship between the files
- descriptive elements: like HTML, SVG has a <title> element that doesn’t display, but adds to the information about its parent file or element, and SVG also has a <desc> element for a longer description
- metadata: SVG can contain RDF, RDFa, microformats, and ARIA markup, which search engines are starting to pick up on these days; these metadata can reveal a lot about a file, from its license information to structured content (like calendars or dates or contact info) to intent (such as ARIA roles, which will soon be expanded to include things like different chart types)
And the SVG Working Group would be happy to work with any search engine developers to make improvements to SVG 2 to help make indexing SVG content easier or more fruitful.
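To give a sense of how little work the basic cases in the list above would take, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It pulls out only the title, description, rendered text, and link targets; embedded HTML and RDF or ARIA metadata are left out. The sample document and the `extract_indexable` function are invented for illustration, and this says nothing about how any real search engine actually processes files.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# A made-up sample file, loosely modeled on an explanatory SVG.
SAMPLE = """<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Hide and Show</title>
  <desc>Compares display, visibility, and opacity.</desc>
  <text x="10" y="20">The <tspan>opacity</tspan> property</text>
  <a xlink:href="notes.html" xlink:title="Notes"><text>More</text></a>
</svg>"""


def collapse(el):
    """Join all text inside an element and normalize whitespace."""
    return " ".join("".join(el.itertext()).split())


def extract_indexable(svg_source):
    """Collect title, desc, rendered text, and link targets from SVG markup."""
    root = ET.fromstring(svg_source)
    found = {"title": [], "desc": [], "text": [], "links": []}
    prefix = "{" + SVG_NS + "}"
    for el in root.iter():
        if not el.tag.startswith(prefix):
            continue  # skip elements from other namespaces
        name = el.tag[len(prefix):]
        if name in ("title", "desc"):
            found[name].append(collapse(el))
        elif name in ("text", "textArea"):
            # <tspan> and <textPath> are children of <text>, so
            # itertext() on the parent already includes their content.
            found["text"].append(collapse(el))
        elif name == "a":
            href = el.get("{" + XLINK_NS + "}href") or el.get("href")
            if href:
                found["links"].append(href)
    return found


if __name__ == "__main__":
    for kind, values in extract_indexable(SAMPLE).items():
        print(kind, values)
```

Because SVG is XML, any off-the-shelf XML parser gets an indexer most of the way there; the harder questions are about ranking and presentation, not about reading the text.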
I’m not trying to pick on Google here (though I do note that a Bing search for ‘svg opacity svg-whiz.com’ listed the SVG file as the first hit); I’m just noting a discrepancy and an opportunity for improving the experience that people have on the Web with regard to SVG. At the very least, SVG should be recognized by Google as a legitimate file type, rather than a formata non grata.
Rob Russell delivered great news to us in his SVG Open keynote. As of August 31, 2010, Google now indexes SVG and delivers it in some search results. Kudos to Google for stepping up! I’m very pleased… solid results only 6 weeks later. (I guess I should thank Slashdot, too.)