Recent news about government agencies pushing legislation through to collect web history data got me interested in analysing this information myself.
Your web history says a lot about you. First of all, it conveys intent: when you search for something you are actively saying "I am interested in this".
Secondly, there is a freedom associated with the internet. With most of the world's knowledge at your fingertips, you can freely explore ideas. The path you choose through the web uniquely identifies you and reflects how your mind processes information.
Getting the data
Firefox (and also Chrome) stores its history in a flat-file SQLite database kept in your home directory. On Linux, the Firefox database is stored in
~/.mozilla/firefox/<random>.default/places.sqlite. Locations for other
operating systems are here.
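Because the profile folder name varies, a small R sketch like the one below can locate the database for you. The recursive search and the copy step are my own convenience, not part of the original setup; the copy matters because Firefox locks the live database while it is running.

```r
# Find places.sqlite under a Firefox profile directory.
find_places_db <- function(profile_dir) {
  list.files(profile_dir, pattern = "^places\\.sqlite$",
             recursive = TRUE, full.names = TRUE)
}

# On Linux the profile lives under ~/.mozilla/firefox/<random>.default/
db <- find_places_db("~/.mozilla/firefox")[1]

# Firefox locks the live database while running, so query a copy instead.
if (!is.na(db)) file.copy(db, "places.sqlite", overwrite = TRUE)
```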
Below is an example of using the RSQLite R package to run queries on the history database. The main tables of interest are the
moz_places table, with every page visited and its visit count; the
moz_historyvisits table, with every individual visit for each page (place); and the
moz_hosts table, with a summary of page visits for each domain.
```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "places.sqlite")
alltables <- dbListTables(con)

hosts  <- dbGetQuery(con, "select * from moz_hosts order by frecency desc")
places <- dbGetQuery(con, "select * from moz_places order by visit_count desc")
visits <- dbGetQuery(con, "
  select datetime(last_visit_date/1000000, 'unixepoch', 'localtime') as last_visit,
         v.id as visit_id, p.id as place_id, v.from_visit, v.visit_type,
         p.url, p.title
  from moz_historyvisits v
  left join moz_places p on v.place_id = p.id
  where p.visit_count > 0")
```
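With the visits data frame loaded, simple aggregates fall out directly. As a sketch, here is a hypothetical helper (my own addition, not from the original post) that counts visits per day from the last_visit column:

```r
# Count visits per calendar day from a data frame with a last_visit
# column holding "YYYY-MM-DD HH:MM:SS" strings, as produced by the
# datetime() conversion in the query above.
visits_per_day <- function(visits) {
  day <- as.Date(visits$last_visit)
  aggregate(list(n = rep(1L, length(day))),
            by = list(day = day), FUN = sum)
}
```

Something like `visits_per_day(visits)` then gives a small table of daily activity you can plot or summarise further.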
The page titles provide a good source for text mining. Using the tm package,
I created a document-term matrix with the frequency of each word. I used this article as a source, which goes into more detail on text mining in R.
```r
library(tm)

docs <- Corpus(VectorSource(places$title))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, content_transformer(tolower))  # wrap base functions for tm
docs <- tm_map(docs, removeWords, c("bad", "words"))
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)

dtm  <- DocumentTermMatrix(docs)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 14)
wf <- data.frame(word = names(freq), freq = freq)
```
Plotting the data
Using the ggplot2 and wordcloud packages you can plot and visualise the data. I found that the wordcloud produced a surprisingly accurate summary of the topics I had been interested in recently.
```r
library(ggplot2)
library(wordcloud)

p <- ggplot(subset(wf, freq > 40), aes(word, freq))
p <- p + geom_bar(stat = "identity")
p

wordcloud(names(freq), freq, min.freq = 10)
```