For a research project, I’m looking for some real-world time series data. Time series are interesting to study; however, it is hard to get access to non-trivial real-world data.

I was wondering if some people could contribute some summarized web access data; no URLs or IP addresses.

The data I’d like to get can best be explained by the preprocessing step:

... | perl -ne '/\[(\d+\/\w+\/\d{4}:\d\d):\d\d/ && print $1."\n";' | sort | uniq -c

(Sorry if you aren’t fluent in regexps - it extracts the date and hour from a default Apache access log, nothing else. These lines are then summarized by counting their unique occurrences.)
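For illustration, a (made-up) default-format access log line looks like this:

  127.0.0.1 - - [01/Jan/2006:13:02:17 +0100] "GET /index.html HTTP/1.0" 200 2326

and the regexp captures just the 01/Jan/2006:13 part.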

That should produce 24 lines per day (one per hour), looking like this:

  count day-of-month/month/year:hour
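For example (with invented counts):

   17 01/Jan/2006:13
  123 01/Jan/2006:14
   98 01/Jan/2006:15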

It would be cool if you could send me some series for a couple of sites, if you happen to be in a position to provide this data. The data should cover at least a few weeks; the longer the better, even up to a few years.

Very small sites are however not very useful, as they might be too noisy (so probably not the personal home page of your mom). If you are contributing a larger number of series, you are of course free to include them anyway.

I don’t care much about what the site actually contains; I’d just ask you to include a tiny amount of meta information:

  • Server timezone if not UTC
  • Typical user timezone (if applicable, that mostly applies to ‘regional’ sites)
  • Coarse classification of site (e.g. “product website”, “web service”, “search engine”, “company site”, “OSS software site” or something like that …)
  • Redistribution permission (if possible; most likely I’ll only use data series where redistribution is allowed, since some conferences kindly ask you to provide such material)
  • Your email address will be withheld to increase the anonymity of the data

Data use:

The main project idea is to evaluate different distance metrics on their capability of separating the different data sources, assuming that there is some difference in the shape of these curves. A different problem can be constructed by breaking the series into chunks covering approximately a day and then trying to separate different days, the starting hours of the series (or the offset between server timezone and user timezone), and/or weekdays from weekends.
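To make the chunking idea concrete, here is a minimal sketch (in Python with numpy; the function name, the plain Euclidean metric and the Poisson toy data are my placeholder assumptions, not the actual experiment):

import numpy as np

# Split an hourly count series into disjoint day-long chunks (24 values each),
# dropping a trailing partial day.
def day_chunks(counts, hours_per_day=24):
    n = len(counts) // hours_per_day * hours_per_day
    return np.asarray(counts[:n], dtype=float).reshape(-1, hours_per_day)

# Toy data: two hypothetical sites, two weeks of hourly counts each.
rng = np.random.default_rng(0)
site_a = rng.poisson(100, size=24 * 14)
site_b = rng.poisson(80, size=24 * 14)

chunks = np.vstack([day_chunks(site_a), day_chunks(site_b)])
labels = [0] * 14 + [1] * 14  # which site each day-chunk came from

# Pairwise Euclidean distance matrix; a useful metric should place chunks
# from the same site closer together than chunks from different sites.
dist = np.linalg.norm(chunks[:, None, :] - chunks[None, :, :], axis=-1)

Swapping the Euclidean distance for other metrics (Manhattan, correlation-based, dynamic time warping, …) would then be the actual comparison.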

In our experiments, we’ve come to the conclusion that the results are most interesting when there is a sufficient number of classes; so I’d like to get around 20 different interesting data series. At the same time, the series should be long enough that I can break them into multiple chunks to have a reasonable number of ‘sub-series’ per class. If I have really long series, or e.g. series covering the same site but from multiple servers, I could even experiment with taking samples of different lengths from these sets.

(Say I have series covering 2 years, i.e. ~17k hourly samples each, from 3 servers; then I can take 3 × 17 = 51 disjoint sub-series of length 1000, or 3 × 34 = 102 of length 500, …)

But it’s obviously not possible for me to collect this data alone - I don’t operate two dozen such sites myself …

An extra project I’ve been considering for some time is peak prediction for web accesses. Say you’re running some fast-growing site; wouldn’t it be useful to have a prediction of when the number of accesses will likely hit some magical limit (and e.g. overload your server), so you can increase your capacity in time? Of course it would be more sensible to apply this prediction to e.g. CPU usage, predicting when your system might hit a 90% load average over a 5-minute window in regular operation. Network bandwidth and disk IO also come to mind. You get the idea.
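As a deliberately naive sketch of the kind of extrapolation I have in mind (a least-squares trend line over a recent window; the window length and the threshold are made-up parameters):

import numpy as np

# Fit a linear trend to the last `window` hourly counts and extrapolate
# how many hours remain until the trend line crosses `threshold`.
# Returns None if the trend is flat or falling (no crossing ahead).
def hours_until(counts, threshold, window=24 * 7):
    y = np.asarray(counts[-window:], dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    if slope <= 0:
        return None
    current = slope * (len(y) - 1) + intercept
    return max(0.0, (threshold - current) / slope)

A real predictor would of course have to deal with the daily and weekly cycles first (e.g. by extrapolating daily peaks instead of raw hourly counts), but this conveys the idea.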

Please send them via email to erich.schubert AT gmail com

Thank you.

[P.S. Already received the first series, thank you! I can take care of sorting myself, no need to worry about that. And yes, I’m aware that the series will probably all be quite similar - common computer usage patterns such as work hours - but that is typical of real-world data and part of the challenge. Separating apples from dinosaurs is not a challenge.]