Google Analytics: Simple RegExp for Advanced Filtration

Photo by miheco on Flickr, https://www.flickr.com/photos/miheco/8043987177/

Just a little bit of special syntax for describing patterns can greatly increase the flexibility of your filters in Google Analytics. This post gives you that bit.

What are we working with?

In Google Analytics you can filter using what I’ll call the basic filtration box, that input box with the magnifying glass button above the table of data, and the advanced filtration area which opens if you click the “advanced” link next to the basic filtration box.

I’ll assume in this post that we’re looking at my craft blog’s analytics, specifically the Behavior > Site Content > All Pages report, with the default primary dimension of Page.

The basic filtration box will give you generic pattern-matching: typing “crochet” will give you all URLs that have “crochet” anywhere from the beginning to the end. In the advanced area you can further specify that the URL begin with, exactly match, or end with your search string. In both locations you can use regular expressions.

Regular expressions are a way to describe a pattern to be matched. In full generality the language is extensive and can express very complex patterns. We don’t need the full language (and GA doesn’t support all parts of it anyway), but a little RegExp goes a long way toward easily filtering to the data you’re interested in.

Your first batch of syntax

Regular expressions work by having a collection of reserved characters, symbols that hold special meaning in the RegExp context.

The most useful in GA is | (pipe), found above the return key along with backslash. It means “or.” For example, I did a series about embroidery on crochet where the introductory post’s slug is embroider-crochet and the later posts’ slugs begin with embroidery-crochet. I can capture both together with
embroider-crochet|embroidery-crochet

Portions of a regular expression can be enclosed in parentheses. This does nothing by itself, but can be combined with other operations. Enclosing an “or” expression in parentheses lets you make it part of a longer expression. This lets me shorten my previous filter to
(embroidery|embroider)-crochet
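
If you want to try these two filters outside of Google Analytics, here is a minimal sketch in Python. It assumes Python's re module behaves closely enough to GA's matcher for this simple syntax, and the page paths are made up for illustration.

```python
import re

# Hypothetical page paths, invented for this example.
pages = [
    "/crochet/embroider-crochet/",
    "/crochet/embroidery-crochet-stitches/",
    "/embroidery/satin-stitch/",
]

long_form = re.compile(r"embroider-crochet|embroidery-crochet")
short_form = re.compile(r"(embroidery|embroider)-crochet")

# re.search looks for the pattern anywhere in the string,
# which is how a "contains"-style filter behaves.
long_hits = [p for p in pages if long_form.search(p)]
short_hits = [p for p in pages if short_form.search(p)]

print(long_hits == short_hits)
print(short_hits)
```

Both patterns keep the two series posts and drop the post that only lives in the embroidery category.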

Since regular expressions are their own singular option in the advanced filters, you have to use RegExp symbols to get “begins with,” “ends with,” and “exactly matches” filters (unless you specify otherwise, a RegExp matches like “contains”). Preceding your expression with ^ means “begins with” and following your expression with $ means “ends with.” Using both gives you “exactly matches.”

For example, if I filtered by /embroidery, I would get both posts in the embroidery category (they begin with /embroidery) and the posts in the “embroidery on crochet” series (which contain /embroidery but begin /crochet). To limit myself to posts in the embroidery category I can filter with ^/embroidery. If for some reason I wanted to filter to just the main blog page, which shows up as /, I could filter with ^/$.
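
The difference between “contains,” “begins with,” and “exactly matches” can be sketched in Python as follows. The paths and the keep helper are hypothetical, and Python's re.search stands in for GA's filter here.

```python
import re

# Hypothetical page paths, invented for this example.
pages = [
    "/",
    "/embroidery/satin-stitch/",
    "/crochet/embroidery-crochet-stitches/",
]

def keep(pattern):
    """Return the pages that the pattern matches anywhere (contains-style)."""
    return [p for p in pages if re.search(pattern, p)]

contains = keep(r"/embroidery")      # category posts and series posts
begins_with = keep(r"^/embroidery")  # only posts in the embroidery category
home_only = keep(r"^/$")             # only the main blog page

print(contains)
print(begins_with)
print(home_only)
```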

Summary

exp1|exp2 : matches strings matching exp1 or exp2
^exp1 : matches strings beginning with a match to exp1
exp1$ : matches strings ending with a match to exp1
(exp1) : allows exp1 to be part of a longer pattern

Special characters versus ordinary characters

What if you need to use a reserved character literally? Very few reserved characters would ever appear in a URL, but they could in page titles and elsewhere.

There is a straightforward means to get your regular expression to interpret a character as the ordinary version and not the special RegExp version: precede it with a backslash. This is called escaping the character. For example, \( and \) get you literal parentheses.

Characters that need to be escaped are: \ ^ $ . | ? * + ( ) [ {

I have a Related Posts plugin on the craft blog that adds query parameters to its links. If I put /?related into the filtration box, it wouldn’t give me what I was expecting: the unescaped ? makes the preceding / optional, so the filter matches any URL containing “related.” The ? needs to be escaped: /\?related.
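
Here is a small Python sketch of what goes wrong without the backslash. The two paths are invented, and the query parameter format is only an assumption about what such a plugin might produce.

```python
import re

# Hypothetical paths: one with a query parameter, one with "related" in the slug.
pages = [
    "/crochet/granny-square/?related=1",
    "/crochet/related-stitches/",
]

unescaped = r"/?related"   # the ? makes the slash optional
escaped = r"/\?related"    # the ? is a literal question mark

unescaped_hits = [p for p in pages if re.search(unescaped, p)]
escaped_hits = [p for p in pages if re.search(escaped, p)]

# re.escape adds the backslashes for you when you want a purely literal match.
literal_hits = [p for p in pages if re.search(re.escape("/?related"), p)]

print(unescaped_hits)
print(escaped_hits)
print(literal_hits)
```

The unescaped filter keeps both pages, while the escaped one keeps only the page that really has /?related in it. Note that re.escape is a Python convenience; in GA itself you type the backslash by hand.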

Cautionary notes

In the basic filtration box, you always need to escape reserved characters since it assumes you’ve typed a regular expression by default (though GA is smart enough to interpret a lone or leading ?, say, as a literal character – meaning in our last example filtering on ?related without the / would work just fine).

In the advanced filtration area, the match type drop-down must be set to “Matching RegExp” for the filter to be interpreted as a regular expression. In that case you must escape special characters, but in any other case the backslash will be interpreted literally and break your filter.

A second batch of syntax

What’s above may meet all of your needs. However, you may find situations in which you can’t quite get where you need to be with pipe, parens, caret and dollar sign, or where filters based on those are cumbersome.

The wildcard

A period in a regular expression will match any single character. For example, /page/./ will match /page/2/ but not /page/10/. /page/../ will match /page/10/ but not /page/2/, unless it happened to actually be /page/2//. Since I know my data doesn’t include any URLs with double slashes, I can see ultra-deep dives into content by filtering on /page/../ to get only pages 10 through 99 (a three-digit page such as /page/100/ would need a third period).
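
To see exactly which page numbers each version catches, here is a quick Python sketch. The list of paths is made up, and Python's re.search is assumed to treat the period the same way GA does.

```python
import re

# Hypothetical paginated paths, invented for this example.
pages = ["/page/2/", "/page/10/", "/page/99/", "/page/100/"]

# One period: exactly one character between the slashes.
one_char = [p for p in pages if re.search(r"/page/./", p)]

# Two periods: exactly two characters between the slashes.
two_chars = [p for p in pages if re.search(r"/page/../", p)]

print(one_char)
print(two_chars)
```

Each period stands for exactly one character, so the two-period filter keeps /page/10/ and /page/99/ but not /page/100/.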

Repeats

Instead of typing some large number of periods to match a longer string that varies, we can use characters that indicate repetition. This also allows us to match when the varying string does not always have the same length.

Repetition is indicated by one of three “suffix” characters: question mark, asterisk, or plus sign. They mean, respectively, 0 or 1 repeat, 0 or more repeats, 1 or more repeats. For an example:
A.? matches A, AB, A5; does not match ABC, AB12
A.* matches A, AB, A5, ABC, AB12
A.+ matches AB, A5, ABC, AB12; does not match A
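(the lists of strings matched or not matched are representative, not comprehensive, and they treat each pattern as an exact match, as if written ^A.?$; used as a plain “contains” filter, A.? would also match ABC and AB12)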
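
The three suffixes can be checked side by side with a short Python sketch. It uses re.fullmatch so that each pattern has to account for the whole string, which is my assumption about how the lists above are meant to be read; the exact helper is hypothetical.

```python
import re

candidates = ["A", "AB", "A5", "ABC", "AB12"]

def exact(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

zero_or_one = exact(r"A.?")   # A followed by at most one character
zero_or_more = exact(r"A.*")  # A followed by anything, including nothing
one_or_more = exact(r"A.+")   # A followed by at least one character

print(zero_or_one)
print(zero_or_more)
print(one_or_more)
```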

Going back to the page number example, I’d like to look at engagement with pages 2 and later of all category archives. I know the URL structure will be /category/[category-name]/page/[number]/, and that the part from “page” on doesn’t exist on the first page.

Basically I need /category/ and /page/ with something in between, so here is my RegExp:
/category/.+/page/
.* could be used interchangeably with .+ here, because there won’t be a match to category//page.
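
A rough Python check of the category filter, using invented paths that follow the URL structure described above:

```python
import re

# Hypothetical paths following /category/[category-name]/page/[number]/.
pages = [
    "/category/crochet/",
    "/category/crochet/page/2/",
    "/category/embroidery/page/3/",
    "/crochet/granny-square/",
]

with_plus = [p for p in pages if re.search(r"/category/.+/page/", p)]
with_star = [p for p in pages if re.search(r"/category/.*/page/", p)]

print(with_plus)
print(with_plus == with_star)
```

The first page of each archive drops out because it has no /page/ part, and the two versions agree as long as no URL contains /category//page/.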

All three modifiers – ?, +, and * – can be used on any character, not just the period. This lets us simplify our “embroidery on crochet” filter even further. The only difference between embroidery-crochet and embroider-crochet is the y, so embroidery?-crochet will match both. It will not match embroiders-crochet, though either embroider.?-crochet or embroider(y|s)?-crochet would match all three.
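
The three variations can be compared in Python like this. The third slug, embroiders-crochet, is only a test case and not a real post, and Python's re module is assumed to handle ? the way GA does.

```python
import re

slugs = ["embroider-crochet", "embroidery-crochet", "embroiders-crochet"]

def keep(pattern):
    """Return the slugs that contain a match for the pattern."""
    return [s for s in slugs if re.search(pattern, s)]

optional_y = keep(r"embroidery?-crochet")        # the y may or may not be there
any_one_char = keep(r"embroider.?-crochet")      # any single character, or none
y_or_s = keep(r"embroider(y|s)?-crochet")        # a y, an s, or neither

print(optional_y)
print(any_one_char)
print(y_or_s)
```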

Summary

. : matches any single character
? : indicates the part of the pattern preceding it can occur 0 or 1 times
* : indicates the part of the pattern preceding it can occur 0 or more times
+ : indicates the part of the pattern preceding it can occur 1 or more times

One little side note

All of my regular expressions so far have matched the case of the URLs I was trying to filter down to. By default, though, Google Analytics makes matches in a case-insensitive manner, meaning “thread” would match “Thread” and “THREAD” as well as the all-lowercase version. This generally is a helpful simplification but if capitalization is meaningful for your site, be aware you can’t filter for it simply by capitalizing in your RegExp.
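
Python's re module is case-sensitive by default, so mimicking GA's behavior takes an extra flag. This sketch only illustrates the difference and says nothing about how GA implements it.

```python
import re

titles = ["thread", "Thread", "THREAD"]

# re.IGNORECASE approximates GA's default case-insensitive matching.
ga_like = [t for t in titles if re.search(r"thread", t, re.IGNORECASE)]

# Without the flag, capitalization in the pattern matters.
case_sensitive = [t for t in titles if re.search(r"Thread", t)]

print(ga_like)
print(case_sensitive)
```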

The full reference list

Characters that need to be escaped (preceded with a backslash) to be interpreted literally:
\ ^ $ . | ? * + ( ) [ {

| (or) : exp1|exp2 matches strings matching exp1 or exp2
^ (beginning) : ^exp1 matches strings beginning with a match to exp1
$ (end) : exp1$ matches strings ending with a match to exp1
() (enclosure) : (exp1) allows exp1 to be part of a longer pattern
. (wildcard) : . matches any single character
? (optional) : AB? matches A and AB
* (unlimited) : AB* matches A, AB, ABB, ABBB, ABBBB, …
+ (at least 1) : AB+ matches AB, ABB, ABBB, ABBBB, … but not A

