Recent news about government agencies pushing legislation through to collect web history data got me interested in analysing this information myself.
Your web history says a lot about you. First of all, it conveys intent: when you search for something you are actively saying “I am interested in this”.
Secondly, there is a freedom associated with the internet. With most of the world’s knowledge at your fingertips, you can freely explore ideas. The path you choose through the web uniquely identifies you and reflects how your mind processes information.
Getting the data
Firefox (and also Chrome) store their history in flat-file SQLite databases in your home directory. On Linux, the Firefox database is stored in ~/.mozilla/firefox/<random>.default/places.sqlite. Locations for other operating systems are here.
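Since the profile directory name is randomised, a short sketch (assuming the default Linux location) can track it down. Firefox keeps the live database locked while running, so it is safer to query a copy:

```r
# Search the default Linux profile directory for the history database.
# Adjust profile_dir for other operating systems.
profile_dir <- "~/.mozilla/firefox"
db_paths <- list.files(profile_dir, pattern = "^places\\.sqlite$",
                       recursive = TRUE, full.names = TRUE)

# Work on a copy so the live (possibly locked) database is left untouched.
if (length(db_paths) > 0) file.copy(db_paths[1], "places.sqlite")
```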
Below is an example of using the RSQLite R package to run queries on the history database. The main tables of interest are moz_places, which lists every page visited along with visit counts; moz_historyvisits, which records each individual visit to a page (place); and moz_hosts, which summarises page visits per domain.
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "places.sqlite")
alltables <- dbListTables(con)

# Hosts ranked by Firefox's "frecency" score, pages by visit count.
hosts  <- dbGetQuery(con, "SELECT * FROM moz_hosts ORDER BY frecency DESC")
places <- dbGetQuery(con, "SELECT * FROM moz_places ORDER BY visit_count DESC")

# One row per visit; visit_date is stored in microseconds since the epoch.
visits <- dbGetQuery(con, "
  SELECT datetime(v.visit_date / 1000000, 'unixepoch', 'localtime') AS last_visit,
         v.id AS visit_id, p.id AS place_id, v.from_visit, v.visit_type,
         p.url, p.title
  FROM moz_historyvisits v
  LEFT JOIN moz_places p ON v.place_id = p.id
  WHERE p.visit_count > 0")
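With the visits data frame in hand, one simple analysis is tallying visits per day. A minimal sketch, shown here on a toy data frame so it stands alone (the column names mirror the query above):

```r
# Toy stand-in for the visits data frame returned by the query above.
visits <- data.frame(
  last_visit = c("2016-04-01 09:15:00", "2016-04-01 18:02:00",
                 "2016-04-02 11:30:00"),
  url = c("https://example.com/a", "https://example.com/b",
          "https://example.com/a"),
  stringsAsFactors = FALSE
)

# Tally visits per calendar day.
visits$day <- as.Date(visits$last_visit)
daily <- aggregate(url ~ day, data = visits, FUN = length)
names(daily)[2] <- "visits"
daily
```

The same two lines work unchanged on the real query result, giving a quick picture of browsing volume over time.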
Text mining
The page titles provide a good source for text mining. Using the tm package, I created a document-term matrix with the frequency of each word. I used this article as a source, which goes into more detail on text mining in R.
library(tm)

# Build a corpus from the page titles.
docs <- Corpus(VectorSource(places$title))

# Clean the text. tolower() is a base R function, not a tm transformation,
# so it must be wrapped in content_transformer().
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, c("bad", "words"))  # custom stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)

# Word frequencies summed across all titles.
dtm  <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
wf <- data.frame(word = names(freq), freq = freq)
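Before plotting, it is worth eyeballing which words clear the frequency threshold used below. A stand-alone sketch with made-up frequencies (the real freq vector comes from the colSums() call above):

```r
# Made-up frequencies standing in for the colSums() result above.
freq <- sort(c(r = 120, data = 95, firefox = 60, sqlite = 45, misc = 3),
             decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq, row.names = NULL)

# Words that would survive a freq > 40 cut-off.
subset(wf, freq > 40)
```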
Plotting the data
Using the ggplot2 and wordcloud packages you can plot and visualise the data. I found that the word cloud produced a surprisingly accurate summary of the topics I had been interested in recently.
library(ggplot2)
library(wordcloud)

# Bar chart of words appearing more than 40 times.
p <- ggplot(subset(wf, freq > 40), aes(word, freq))
p <- p + geom_bar(stat = "identity")
p

# Word cloud of every word appearing at least 10 times.
wordcloud(names(freq), freq, min.freq = 10)