Vitavonni

Beware of trolls - do not feed

A particularly annoying troll has been on his hate crusade against systemd for months now.
Unfortunately, he's particularly active on Debian mailing lists (but apparently also on Ubuntu and the Linux kernel mailing lists) and uses tons of fake accounts that he keeps setting up. Our listmasters have a hard time blocking all his hate, sorry.
Obviously, this is also the same troll that has been attacking Lennart Poettering.
There is evidence that this troll used to go by the name "MikeeUSA", and has had quite a reputation for anti-feminist hate for over 10 years now.
Please, do not feed this troll.
Here are some names he uses on YouTube: Gregory Smith, Matthew Bradshaw, Steve Stone.
Blacklisting is the best measure we have, unfortunately.
Even if you don't like the road systemd is taking, or Lennart Poettering personally - the behaviour of that troll is unacceptable, to say the least, and indicates some major psychological problems... I also wouldn't be surprised if he were involved in #GamerGate.
See this example (LKML) if you have any doubts. We seriously must not tolerate such poisonous people.
If you don't like systemd, the acceptable way of fighting it is to write good alternative software (and you should be able to continue using SysV init or OpenRC in Debian - unless there is a bug, in which case, provide a bug fix). End of story.
2014-10-18 18:41 — Categories: English, Debian

Google Earth on Linux

Google Earth for Linux appears to be largely abandoned by Google, unfortunately. The packages available for download cannot be installed on a modern amd64 Debian or Ubuntu system due to dependency issues.
In fact, the amd64 version is a 32 bit build, too. The packages are really low quality: the dependencies are outdated, locale support is busted, etc.
So here are hacky instructions for installing it nevertheless. But beware: these instructions are a really bad hack.
  1. These instructions are appropriate for version 7.1.2.2041-r0. Do not use them for any other version. Things will have changed.
  2. Make sure your system has i386 architecture enabled. Follow the instructions in section "Configuring architectures" on the Debian MultiArch Wiki page to do so
  3. Install lsb-core, and try to install the i386 versions of these packages, too!
  4. Download the i386 version of the Google Earth package
  5. Install the package by forcing dependencies, via
    sudo dpkg --force-depends -i google-earth-stable_current_i386.deb
    
  6. At this point, your package manager will complain and suggest removing the package again. To make it happy, we have to hack the installed packages list. This is ugly, and you should make a backup. You can totally bust your system this way... Fortunately, the change we're doing is rather simple. As admin, edit the file /var/lib/dpkg/status. Locate the section Package: google-earth-stable. In this section, delete the line starting with Depends:. Don't add extra newlines or change anything else!
  7. Now the package manager should believe the dependencies of Google Earth are fulfilled, and no longer suggest removal. But essentially this means you have to take care of them yourself!
Some notes on using Google Earth:
  • Locales are busted. Use LC_NUMERIC=en_US.UTF-8 google-earth to start it. Otherwise, it will fail to parse coordinates if you are in a locale that uses a different number format.
  • You may need to install the i386 versions of some libraries, in particular of your OpenGL drivers! I cannot provide you with a complete list.
  • Search sometimes doesn't work for me.
  • Occasionally, it reports "unknown" network errors.
  • If you upgrade Nvidia graphics drivers, you will usually have to reboot, or you will see graphics errors.
  • Some people have removed/replaced the bundled libQt* and libfreeimage* libraries, but that did not work for me.
2014-10-17 15:59 — Categories: English, Web, Debian

Analyzing Twitter - beware of spam

This year I started to widen my research, and one data source of interest was text: its lack of structure often makes it challenging. One of the data sources that everybody seems to use is Twitter: it has a nice API and few restrictions on using it (except on resharing data). By default, you can get a 1% random sample of all tweets, which is more than enough for many use cases.
We've had some exciting results which a colleague of mine will be presenting tomorrow (Tuesday, Research 22: Topic Modeling) at the KDD 2014 conference:
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
You can also explore some (static!) results online at signi-trend.appspot.com

In our experiments, the "news" data set was more interesting. But after some work, we were able to get reasonable results out of Twitter as well. As you can see from the online demo, most of these fall into pop culture: celebrity deaths, sports, hip-hop. Not much that would change our lives, and even less that wasn't already captured by traditional media.
The focus of this post is on the preprocessing needed for getting good results from Twitter. Because it is much easier to get bad results!
The first thing you need to realize about Twitter is that, due to the media attention/hype it gets, it is full of spam. I'm pretty sure the engineers at Twitter already try to reduce spam, block hosts and fraudulent apps. But a lot of the trending topics we discovered were nothing but spam.
Retweets - the "like" of Twitter - are an easy source to see what is popular, but are not very interesting if you want to analyze text. They just reiterate the exact same text (except for an "RT " prefix) as earlier tweets. We found results to be more interesting if we removed retweets. Our theory is that retweeting requires much less effort than writing a real tweet; and things that are trending "with effort" are more interesting than those that were just liked.
Teenie spam. If you ever search for a teenie idol on Twitter - say this guy I hadn't heard of before, but who has 3.88 million followers - and look at the tweets addressed to him, you will get millions upon millions of results. Many of these tweets look like this: Now if you look at this tweet, there is this odd "x804" at the end. This is there to defeat a simple spam filter by Twitter. Because this user did not tweet this just once: instead, it is common amongst teenies to spam their idols with follow requests by the dozen, probably using some JavaScript hack or a third-party Twitter client. Occasionally, you see hundreds of such tweets, each sent within a few seconds of the previous one. Even in a 1% sample, you still get a fair number of these...
Even worse (for data analysis) than teenie spammers are commercial spammers and wannabe "hackers" who exercise their "sk1llz" by spamming Twitter. To get a sample of such spam, just search for weight loss on Twitter. There is plenty of fresh spam there, usually consisting of some text pretending to be news, and an anonymized link (there is no need to use a URL shortener such as bit.ly on Twitter, since Twitter has its own URL shortener t.co; you'll just end up with double-shortened URLs). And the hacker spam is even worse (e.g. #alvianbencifa), as he seems to have trojaned hundreds of users, and his advertisement is a nonexistent hash tag which he tries to get into Twitter's "trending topics".
And then there are the bots. Plenty of bots spam Twitter with their analysis of trending topics, reinforcing the trending topics. In my opinion, bots such as trending topics indonesia are useless. No wonder it has only 280 followers. And most of the trending topics it reports seem to be spam topics...

Bottom line: if you plan on analyzing Twitter data, spend considerable time on preprocessing to filter out spam of various kinds. For example, we remove singletons and digits, then feed the data through a duplicate detector. We end up discarding 20%-25% of tweets. But we still get some of the spam, such as that hacker spam.
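To make this concrete, here is a minimal sketch of such a filter (an illustration of the idea only, not our actual pipeline): drop retweets, normalize the text by stripping links and digits (which also defeats "x804"-style suffixes), and keep only the first occurrence of each normalized text:
import java.util.HashSet;
import java.util.Set;

public class TweetFilter {
  private final Set<String> seen = new HashSet<String>();

  /** Returns true if the tweet should be kept for analysis. */
  public boolean keep(String tweet) {
    // Drop retweets: they only repeat earlier text with an "RT " prefix.
    if (tweet.startsWith("RT ")) {
      return false;
    }
    // Normalize: strip links (including t.co) and digits, collapse whitespace.
    String normalized = tweet
        .replaceAll("https?://\\S+", " ")
        .replaceAll("[0-9]+", " ")
        .replaceAll("\\s+", " ")
        .trim().toLowerCase();
    // Simple duplicate detection: keep only the first tweet with this normalized text.
    return seen.add(normalized);
  }
}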
All in all, real data is just messy. People posting there have an agenda that might be the opposite of yours. And if someone (or some company) promises you wonders from "Big data" and "Twitter", you had better have them demonstrate their usefulness first, before buying their services. Don't trust visions of what could be possible, because the first rule of data analysis is: garbage in, garbage out.
2014-08-25 12:40 — Categories: English, Web, Research

Kernel-density based outlier detection and the need for customization

Outlier detection (also: anomaly detection, change detection) is an unsupervised data mining task that tries to identify the unexpected.
Most outlier detection methods are based on some notion of density: in an appropriate data representation, "normal" data is expected to cluster, and outliers are expected to be further away from the normal data.
This intuition can be quantified in different ways. Common heuristics include kNN outlier detection and the Local Outlier Factor (which uses a density quotient). One of the directions in my dissertation was to understand (also from a statistical point of view) how the output and the formal structure of these methods can be best understood.
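To illustrate how simple such heuristics are at their core, here is a naive sketch of the kNN outlier score (the distance to the k-th nearest neighbor), computed by brute force; real implementations, e.g. in ELKI, use index structures and a more flexible API:
import java.util.Arrays;

public class KNNOutlier {
  /** Naive O(n^2) kNN outlier scores: distance to the k-th nearest neighbor. */
  static double[] knnOutlierScores(double[][] data, int k) {
    double[] scores = new double[data.length];
    for (int i = 0; i < data.length; i++) {
      double[] dists = new double[data.length];
      for (int j = 0; j < data.length; j++) {
        dists[j] = distance(data[i], data[j]);
      }
      Arrays.sort(dists);
      scores[i] = dists[k]; // dists[0] is the distance of the point to itself
    }
    return scores;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}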
I will present two smaller results of this analysis at the SIAM Data Mining 2014 conference: instead of the very heuristic density estimation found in the above methods, we design a method (using the same generalized pattern) that uses a best practice from statistics: Kernel Density Estimation. We aren't the first to attempt this (cf. LDF), but we actually retain the properties of the kernel, whereas the authors of LDF tried to mimic the LOF method too closely, and this way damaged the kernel.
The other result presented in this work is the need to customize. When working with real data, using "library algorithms" will more often than not fail. The reason is that real data isn't as nicely behaved - it's dirty, and it is seldom normally distributed. And the problem that we're trying to solve is often much narrower. For best results, we need to integrate our preexisting knowledge of the data into the algorithm. Sometimes we can do so by preprocessing and feature transformation. But sometimes, we can also customize the algorithm easily.
Outlier detection algorithms aren't black magic, or carefully adjusted. They follow a rather simple logic, and this means that we can easily take only parts of these methods, and adjust them as necessary for our problem at hand!
The article presented at SDM will demonstrate such a use case: analyzing 1.2 million traffic accidents in the UK (from data.gov.uk), we are not interested in "classic" density-based outliers - this would be a rare traffic accident on a small road somewhere in Scotland. Instead, we're interested in unusual concentrations of traffic accidents, i.e. blackspots.
The generalized pattern can be easily customized for this task. While this data does not allow automatic evaluation, many outliers could be easily verified using Google Earth and search: often, historic imagery on Google Earth showed that the road layout was changed, or there are many news reports about the dangerous road. The data can also be nicely visualized, and I'd like to share these examples with you. First, here is a screenshot from Google Earth for one of the hotspots (Cherry Lane Roundabout, north of Heathrow airport, which used to be a double cut-through roundabout - one of the cut-throughs has since been removed):
Screenshot of Cherry Lane Roundabout hotspot
Google Earth is best for exploring this result, because you can hide and show the density overlay to see the crossroad below; and you can go back in time to access historic imagery. Unfortunately, KML does not allow easy interactions (at least it didn't last time I checked).
I have also put the KML file on Google Drive. It will automatically display it on Google Maps (nice feature of Drive, kudos to Google!), but it should also allow you to download it. I've also explored the data on an Android tablet (but I don't think you can hide elements there, or access historic imagery as in the desktop application).
With a classic outlier detection method, this analysis would not have been possible. However, it was easy to customize the method, and the results are actually more meaningful: instead of relying on some heuristic to choose the kernel bandwidth, I opted for choosing the bandwidth by physical arguments: 50 meters is a reasonable bandwidth for a crossroad / roundabout, and for comparison a radius of 2 kilometers is used to model the typical accident density in this region (there should be other crossroads within 2 km in Europe).
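To illustrate the kind of customization (a simplified sketch, not the exact code behind the paper; it assumes projected coordinates in meters and a Gaussian kernel): score each accident location by the quotient of a kernel density estimate with a 50 m bandwidth and one with a 2 km bandwidth, so that locally dense spots stand out against the regional accident density:
public class BlackspotScore {
  /** Gaussian kernel density estimate at query point q, with the given bandwidth in meters. */
  static double density(double[] q, double[][] points, double bandwidth) {
    double sum = 0;
    for (double[] p : points) {
      double dx = q[0] - p[0], dy = q[1] - p[1];
      sum += Math.exp(-(dx * dx + dy * dy) / (2 * bandwidth * bandwidth));
    }
    // Normalization constant of the 2D Gaussian kernel:
    return sum / (points.length * 2 * Math.PI * bandwidth * bandwidth);
  }

  /** Ratio of local (50 m) to regional (2 km) accident density. */
  static double blackspotScore(double[] accident, double[][] accidents) {
    return density(accident, accidents, 50.0) / density(accident, accidents, 2000.0);
  }
}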
Since I advocate reproducible science, the source code of the basic method will be in the next ELKI release. For the customization case studies, I plan to share them as a how-to or tutorial type of document in the ELKI wiki; probably also detailing data preprocessing and visualization aspects. The code for the customizations is not really suited for direct inclusion in the ELKI framework, but can serve as an example for advanced usage.
Reference:
E. Schubert, A. Zimek, H.-P. Kriegel
Generalized Outlier Detection with Flexible Kernel Density Estimates
In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014.
So TLDR of the story: A) try to use more established statistics (such as KDE), and B) don't expect an off-the-shelf solution to do magic, but customize the method for your problem.
P.S. if you happen to know nice post-doc positions in academia:
I'm actively looking for a position to continue my research. I'm working on scaling these methods to larger data and to make them work with various real data that I can find. Open-source, modular and efficient implementations are very important to me, and one of the directions I'd like to investigate is porting these methods to a distributed setting, for example using Spark. In order to get closer to "real" data, I've started to make these approaches work e.g. on textual data, mixed type data, multimedia etc. And of course, I like teaching; which is why I would prefer a position in academia.
2014-04-22 14:39 — Categories: English, Research

Google Summer of Code 2014 - not participating

Google Summer of Code 2014 is currently open for mentoring organizations to register.

I've decided to not apply for GSoC with my data mining open source project ELKI anymore.

  • You don't get any feedback for your application. So unless you were accepted, you have no idea if it is worth trying again (I tried twice).
  • As far as I can tell, you only have a chance if there is someone at Google advocating your project. For example, someone involved in your project actually working at Google.
  • I don't really trust Google anymore. They've been exploiting their market dominance too much, leveraging YouTube and Android to force people into Google+, etc. (see my Google turning evil blog post).
  • I'll be looking for a Post-Doc position in fall, so the timing of the GSoC isn't ideal anyway.
2014-02-13 17:17 — Categories: English, Coding, Research

Debian chooses systemd as default init

It has been all over the place. The Debian CTTE has chosen systemd over upstart as the default init system by chairman call. This decision was overdue, as it was pretty clear that the tie would not change, and thus it would be up to the chairman. There were no new positions presented, and nobody was being convinced of a different preference. The whole discussion had been escalating, and had started to harm Debian.

Some people may not want to hear this, but another 10 ballots and options would not have changed this outcome. Repeating essentially the same decision (systemd, upstart, or "other things nobody prefers") will do no good, but turn out the same result: a tie. Every vote count I saw would have turned out this tie. Half of the CTTE members prefer systemd, the other half upstart.

The main problems, however, are these:

  • People do not realize this is about the default init for jessie, not about the only init to support. Debian was and is about choice, but even then you need to make something the default... If you prefer SysV init or OpenRC, you will still be able to use it (there will just be fewer people debugging these startup scripts).
  • Some people (not part of the CTTE) are just socially inept trolls, and have rightfully been banned from the list.
  • The discussion has been so heated up that a number of important secondary decisions have not yet been discussed in a civil manner. For example, we will have some packages (for example, Gnome Logs) which are specific to a particular init system, but not part of the init system. A policy needs to be written with these cases in mind that states when and how a package may depend on a particular init system. IMHO, a package that cannot support other init systems without major changes should be allowed to depend on that system, but then meta-packages such as the Gnome meta packages must not require this application.

So please, everybody get back to work now. If there ever is enough reason to overturn this decision, there are formal ways to do this, and there are people in the CTTE (half of the CTTE, actually) that will take care of this.

Until then, live with the result of a 0.1 votes lead for systemd. Instead of pursuing a destructive hate campaign, why not improve your favorite init system?

Oh, and don't forget about the need to spell out a policy for init system support requirements of packages.

2014-02-11 22:07 — Categories: English, Linux, Debian

Minimizing usage of debian-multimedia packages

How to reduce your usage of debian-multimedia packages:

As you might be aware, the "deb-multimedia" packages have seen their share of chaos.

Originally, they were named "debian-multimedia", but it was then decided that they should better be called "deb-multimedia" as they are not an official part of Debian. The old domain was then grabbed by cybersquatters when it expired.

While a number of packages remain undistributable for Debian due to legal issues - such as decrypting DVDs - and thus "DMO" remains useful to many desktop users, please note that for many of the packages, a reasonable version exists within Debian "main".

So here is a way to prevent automatically upgrading to DMO versions of packages that also exist in Debian main:

We will use a configuration option known as APT pinning. We will modify the priority of DMO packages to below 100, which means they will not be automatically upgraded to, but they can still be installed when e.g. no version of the package exists within Debian main. In other words, the packages can be easily installed, but APT will prefer the official versions.

For this we need to create a file I named /etc/apt/preferences.d/unprefer-dmo with the contents:

Package: *
Pin: release o=Unofficial Multimedia Packages
Pin-Priority: 123

As long as the DMO archive doesn't rename itself, this pin will work; it will continue to work if you use a mirror of the archive, as it is not tied to the URL.

It will not downgrade packages for you. If you want to downgrade, this will be a trickier process. You can use aptitude for this, and start with a filter (l key, as in "limit") of ?narrow(~i,~Vdmo). This will only show packages installed where the version number contains dmo. You can now patrol this list, enter the detail view of each package, and check the version list at the end for a non-dmo version.

I cannot claim this will be an easy process. You'll probably have a lot of conflicts to clear up, due to different libavcodec* API versions. If I recall correctly, I opted to uninstall some dmo packages such as ffmpeg that I do not really need. In other cases, the naming is slightly different: Debian main has handbrake, while dmo has handbrake-gtk.

A simple approach is to consider uninstalling all of these packages. Then reinstall as needed; since installing Debian packages is trivial, it does not hurt to deinstall something, does it? When reinstalling, Debian packages will be preferred over DMO packages.

I prefer to use DFSG-free software wherever possible. And as long as I can watch my favorite series (Tatort, started in 1970 and airing episode 900 this month) and the occasional DVD movie, I have all I need.

An even more powerful aptitude filter to review installed non-Debian software is ?narrow(~i,!~ODebian), or equivalently ?narrow(?installed,?not(?origin(Debian))). This will list all package versions which cannot be installed from Debian sources. In particular, this includes versions that are no longer available (they may have been removed because of security issues!), software that has been installed manually from .deb files, or any other non-Debian source such as Google or DMO. (You should check the output of apt-cache policy to make sure no source claims to be o=Debian that isn't actually Debian, though.)

This filter is a good health check for your system. Debian packages receive a good amount of attention with respect to packaging quality, maintainability, security and all of these aspects. Third-party packages usually fail the Debian quality check in at least one respect. For example, Sage isn't in Debian yet, mostly because it has so many dependencies that making a reliable and upgradeable installation is a pain. Similarly, there is no Hadoop in Debian yet. If you look at Hadoop packaging efforts such as Apache Bigtop, they do not live up to Debian quality yet. In particular, the packages have high redundancy, and re-include various .jar copies instead of sharing access to an independently packaged dependency. If a security issue arises with any of these jars, all the Hadoop packages will need to be rebuilt.

As you can see, it is usually not because of Debian being too strict or too conservative about free software that software is not packaged. More often than not, it's because the software in question itself currently does not live up to Debian packaging quality requirements. And of course sometimes because there is too little community backing the software in question that would improve the packaging. If you want your software to be included in Debian, try to make it packaging friendly. Don't force-include copies of other software, for example. Allow the use of system-installed libraries where possible, and provide upgrade paths and backwards compatibility. Make dependencies optional, so that your software can be packaged even if a dependency is not yet available - and does not have to be removed if the dependency becomes a security risk. Maybe learn how to package yourself, too. This will make it easier for you to push new features and bug fixes to your userbase: instead of requiring your users to manually upgrade, make your software packaging friendly. Then your users will be auto-upgraded by their operating system to the latest version.

Update: If you are using APT::Default-Release or other pins, these may override the above pin. You may need to use apt-cache policy to find out whether the pins are in effect, and rewrite your default-release pin into e.g.

Package: *
Pin: release a=testing,o=Debian
Pin-Priority: 912
For debugging, use a different priority for each pin, so you can see which one is in effect. Notice the o=Debian in this pin, which makes it apply only to Debian repositories.

2014-02-09 12:35 — Categories: English, Linux, Debian

Definition of Data Science

Everything is "big data" now, and everything is "data science". Because these terms lack a proper, falsifiable definition.
A number of attempts to define them exist, but they usually only consist of a number of "nice to haves" strung together. For Big Data, it's the 3+ V's, and for Data Science, this diagram on Wikipedia is a typical example.
This is not surprising: effectively, these terms are all marketing, not scientific attempts at defining a research domain.
Actually, my favorite definition is this, except that it should maybe read pink pony in the middle, instead of unicorn.
Data science has been called "the sexiest job" so often, this has recently led to an integer overflow.
The problem with these definitions is that they are open-ended. They name some examples (like "volume"), but they essentially leave it open to call anything "big data" or "data science" that you would like to. This is, of course, a marketer's dream buzzword. There is nothing saying that "picking my nose" is not big data science.

If we ever want to get to a usable definition and get rid of all the hype, we should consider a more precise definition, even when this means making it more exclusive (funnily enough, some people have already called the above open-ended definitions "elitist"...).
Big data:
  • Must involve distributed computation on multiple servers
  • Must intermix computation and data management
  • Must advance over the state-of-the-art of relational databases, data warehousing and cloud computing in 2005
  • Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
  • Must be disruptively more data-driven
Data science:
  • Must incorporate domain knowledge (e.g. business, geology, etc.).
  • Must take computational aspects into account (scalability etc.).
  • Must involve scientific techniques such as hypothesis testing and result validation.
  • Results must be falsifiable.
  • Should involve more mathematics and statistics than earlier approaches.
  • Should involve more data management than earlier approaches (indexing, sketching&hashing etc.).
  • Should involve machine learning, AI or knowledge discovery algorithms.
  • Should involve visualization and rapid prototyping for software development.
  • Must satisfy at least one of these "shoulds" at a disruptive level.
But this is all far from a proper definition. Partially because these fields are so much in flux; but largely because they're just too ill-defined.
There is a lot of overlap, that we should try to flesh out. For example, data science is not just statistics. Because it is much more concerned with how data is organized and how the computations can be made efficiently. Yet often, statistics is much better at integrating domain knowledge. People coming from computation, on the other hand, usually care too little about the domain knowledge and falsifiability of their results - they're happy if they can compute anything.
Last but not least, nobody will be in favor of such a rigid definition and requirements. Because most likely, you will have to strip that "data scientist" label off your business card - and why bite the hand that feeds? Most of what I do certainly would not qualify as data science or big data anymore with an "elitist" definition. While this doesn't lessen my scientific results, it makes them less marketable.
Essentially, this is like a global "gentlemen's agreement". Buzz these words while they last, then move on to the next similar "trending topic".

Maybe we should just leave these terms to the marketing folks, and let them bubble them till it bursts. Instead, we should just stick to the established and better defined terms...
  • When you are doing statistics, call it statistics.
  • When you are doing unsupervised learning, call it machine learning.
  • When your focus is distributed computation, call it distributed computing.
  • When you do data management, continue to call it data management and databases.
  • When you do data indexing, call it data indexing.
  • When you are doing unsupervised data mining, call it cluster analysis, outlier detection, ...
  • Whatever it is, try to use a precise term, instead of a buzzword.
Thank you.
Of course, sometimes you will have to play Buzzword Bingo. Nobody is going to stop you. But I will understand that you are just playing buzzword bingo, unless you get more precise.
Once you have results that are massively better, and have really disrupted science, then you can still call it "data science" later on.
As you will have noticed, I've been picking on the word "disruptive" a lot. As long as you are doing "business as usual", and focusing on off-the-shelf solutions, it will not be disruptive. And then it won't be big data science, or a big data approach that yields major gains. It will be just "business as usual" with different labels, and return results as usual.
Let's face it. We don't just want big data or data science. What everybody is looking for is disruptive results, which will require a radical approach, not a slight modification of what you have been doing all along, involving slightly more computers.
2014-01-24 16:19 — Categories: English, Research

The init wars

The init wars have recently caught a lot of media attention (e.g. heise, prolinux, phoronix). However, one detail that is often overlooked: Debian is debating over the default, while all of them are actually already supported to a large extent. Most likely, at least two of them will be made mandatory to support, IMHO.
The discussion seems to be quite heated, with lots of people trying to evangelize for their preferred system. This actually only highlights that we need to support more than one, as Debian has always been about choice. This may mean some extra work for the debian-installer developers, because choosing the init system at install time (instead of switching later) will be much easier. More often than not, when switching from one init system to another you will have to perform a hard reset.
If you want to learn about the options, please go to the formal discussion page, which does a good job at presenting the positions in a neutral way.
Here is my subjective view of the init systems:
  • SysV init is the current default, and thus deserves to be mentioned first. It is slow, because it is based on a huge series of shell scripts. It can often be fragile, but at the same time it is very transparent. For a UNIX system administrator, SysV init is probably the preferred choice. You only reboot your servers every year anyway.
  • upstart seems to be a typical Canonical project. It solves a great deal of problems, but apparently isn't good enough at it for everybody, and they fail at including anyone in their efforts. Other examples of these fails include Unity and Mir, where they also announced the project as-is, instead of trying to get other supporters on board early (AFAICT). The key problem to widespread upstart acceptance seems to be the Canonical Contributor License Agreement that many would-be contributors are unwilling to accept. The only alternative would be to fork upstart completely, to make it independent of Canonical. (Note that upstart nevertheless is GPL, which is why it can be used by Debian just fine. The CLA only makes getting patches and enhancements included in the official version hard.)
  • systemd is the rising star in the init world. It probably has the best set of features, and it has started to incorporate/replace a number of existing projects such as ConsoleKit. I.e. it not only manages services, but also user sessions. It can be loosely tied to the GNOME project, which has started to rely on it more and more (much to the unhappiness of Canonical, who used to be a key player for GNOME; note that officially, GNOME chose not to depend on systemd, yet I see this as the only reliable combination to get a complete GNOME system running, and since "systemd can eventually replace gnome-session" I foresee this tie becoming closer). As the main drawback, systemd as-is will (apparently) only work with the Linux kernel, whereas Debian has to also support kFreeBSD, NetBSD, Hurd and the OpenSolaris kernels (some aren't officially supported by Debian, but by separate projects).
So my take: I believe the only reasonable default is systemd. It has the most active development community and widest set of features. But as it cannot support all architectures, we need mandatory support for an alternative init system, probably SysV. Getting both working reliably will be a pain, in particular since more and more projects (e.g. GNOME) tie themselves closely to systemd, and would then become Linux-only or require major patches.
I have tried only systemd on a number of machines, and unfortunately I cannot report it as "prime time ready" yet. You do have the occasional upgrade problems and incompatibilities, as it is quite invasive. From screensavers activating during movies to double suspends, to being unable to shutdown my system when logged in (systemd would treat the login manager as separate session, and not being the sole user it would not allow me to shut down), I have seen quite a lot of annoyances happen. This is an obvious consequence of the active development on systemd. This means that we should make the decision early, because we will need a lot of time to resolve all these bugs for the release.
There are more disruptions coming on the way. Nobody seems to have talked about kDBUS yet, the integration of an IPC mechanism like DBUS into the Linux kernel. It IMHO has a good chance of making it into the Linux kernel rather soon, and I wouldn't be surprised if it became mandatory for systemd soon after. Which then implies that only a recent kernel (say, mid-2014) version might be fully supported by systemd soon.
I would also like to see less GNOME influence in systemd. I have pretty much given up on the GNOME community, which is moving into a UI direction that I hate: they seem to only care about tablet and mobile phones for dumb users, and slowly turn GNOME into an android UI; selling black background as major UI improvements. I feel that the key GNOME development community does not care about developers and desktop users like me anymore (but dream of being the next Android), and therefore I have abandoned GNOME and switched to XFCE.
I don't give upstart much of a chance. Of course there are some Debian developers already involved in its development (employed by Canonical), so this will cause some frustration. But so far, upstart is largely an Ubuntu-only solution. And just like Mir, I don't see much future in it; instead I foresee Ubuntu going systemd within a few years, because it will want to get all the latest GNOME features. Ubuntu relies on GNOME, and apparently GNOME already has chosen systemd over upstart (even though this is "officially" denied).
Sticking with SysV is obviously the easiest choice, but it does not make a lot of sense to me technically. It's okay for servers, but more and more desktop applications will start to rely on systemd. For legacy reasons, I would however like to retain good SysV support for at least 5-10 more years.

But what is the real problem? After all, this is a long overdue decision.
  • There is too much advocacy and evangelism, from either side. The CTTE isn't really left alone to do a technical decision, but instead the main factors have become of political nature, unfortunately. You have all kinds of companies (such as Spotify) weigh in on the debate, too.
  • The tone has become quite aggressive and emotional, unfortunately. I can already foresee some comments on this blog post "you are a liar, because GNOME is now spelled Gnome!!1!".
  • Media attention. This upcoming decision has been picked up by various Linux media already, increasing the pressure on everybody.
  • Last but not least, the impact will be major. Debian is one of the largest distributions, not least because it is used by Ubuntu and Steam, amongst others. Debian preferring one over the other will be a slap in somebody's face, unfortunately.
So how to solve it? Let the CTTE do their discussions, and stop flooding them with mails trying to influence them. There has been so much influencing going on, it may even backfire. I'm confident they will find a reasonable decision, or they'll decide to poll all the DDs. If you want to influence the outcome, provide patches to anything that doesn't yet fully support your init system of choice! I'm sure there are hundreds of packages which have neither upstart nor systemd support yet (as is, I currently have 27 init.d scripts launched by systemd, for example). IMHO, nothing is more convincing than having things just work, and of course, contributing code. We are in open source development, and the one thing that gets you sympathy in the community is to contribute code to someone else's project. For example, contribute fully integrated power-management support into XFCE, if you include power management functionality.
As is, I have apparently 7 packages installed with upstart support, and 25 with systemd support. So either everybody is crazy about systemd, or they have the better record of getting their scripts accepted upstream. (Note that this straw poll is biased - with systemd, the benefits of not using "legacy" init.d scripts may just be larger.)
2014-01-22 16:39 — Categories: English, Linux, Debian

Java Hotspot Compiler - a heavily underappreciated technology

When I had my first contacts with Java, probably around Java 1.1 or Java 1.2, it felt all clumsy and slow. And this is still the reputation that Java has to many developers: bloated source code and slow performance.
The last years I've worked a lot with Java; it would not have been my personal first choice, but as this is usually the language the students know best, it was the best choice for this project, the data mining framework ELKI.
I've learned a lot about Java since, also on debugging and optimizing Java code. ELKI contains a number of tasks that require good number crunching performance; something where Java in particular had the reputation of being slow.
I must say, this is not entirely fair. Sure, the pure matrix multiplication performance of Java is not up to Fortran (BLAS libraries are usually implemented in Fortran, and many tools such as R or NumPy will use them for the heavy lifting). But there are other tasks than matrix multiplication, too!
There are a number of things where Java could be improved a lot. Some of these will be coming with Java 8, others are still missing. I'd particularly like to see native BLAS support and multi-valued on-stack returns (to allow an intrinsic sincos, for example).

In this post, I want to emphasize that usually the Hotspot compiler does an excellent job.
A few years ago, I have always been laughing at those that claimed "Java code can even be faster than C code"; because the Java JVM is written in C. Having had a deeper look at what the hotspot compiler does, I'm now saying: I'm not surprised that quite often, reasonably good Java code outperforms reasonably good C code.
In fact, I'd love to see a "hotspot" optimizer for C.
So what is it what makes Hotspot so fast? In my opinion, the key ingredient to hotspot performance is aggressive inlining. And this is exactly why "reasonably well written" Java code can be faster than C code written at a similar level.
Let me explain this with an example. Assume we want to compute a pairwise distance matrix, but we want the code to be able to support arbitrary distance functions. The code will roughly look like this (not heavily optimized):
for (int i = 0; i < size; i++) {
  for (int j = i + 1; j < size; j++) {
    matrix[i][j] = computeDistance(data[i], data[j]);
  }
}
In C, if you want to be able to choose computeDistance at runtime, you would likely make it a function pointer, or in C++ use e.g. boost::function or a virtual method. In Java, you would use an interface method instead, i.e. distanceFunction.distance().
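For illustration, the interface-based variant might look roughly like this (hypothetical names, just to show the call site that Hotspot will try to inline):
interface DistanceFunction {
  double distance(double[] a, double[] b);
}

class EuclideanDistance implements DistanceFunction {
  public double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}

// In the loop above, computeDistance(data[i], data[j]) then becomes:
//   matrix[i][j] = distanceFunction.distance(data[i], data[j]);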
In C, your compiler will most likely emit a jmp *%eax instruction to jump to the method to compute the distance; with virtual methods in C++, it would load the target method from the vtable and then jmp there. Technically, it will likely be a "register-indirect absolute jump". Java will, however, try to inline this code at runtime, i.e. it will often insert the actual distance function used at the location of this call.
Why does this make a difference? CPUs have become quite good at speculative execution, prefetching and caching. Yet, it can still pay off to save those jmps as far as I can tell, even if it is just to allow the CPU to apply these techniques to predict another branch better. But there is also a second effect: the hotspot compiler will be optimizing the inlined version of the code, whereas the C compiler has to optimize the two functions independently (as it cannot know they will only be used in this combination).
Hotspot can be quite aggressive there. It will even inline and optimize when it is not 100% sure that these assumptions are correct. It will just add simple tests (e.g. adding some type checks) and jump back to the interpreter/compiler when these assumptions fail and then reoptimize again.
You can see the inlining effect in Java when you use the -Xcomp flag, telling the Java VM to compile everything at load time. It cannot do as much speculative inlining there, as it does not know which method will be called and which class will be seen. Instead, it will have to compile the code using virtual method invocations, just like C++ would use for executing this. Except that in Java, every single method will be virtual (in C++, you have to be explicit). You will likely see a substantial performance drop when using this flag; it is not recommended. Instead, let hotspot perform its inlining magic. It will inline a lot by default - in particular tiny methods such as getters and setters.
I'd love to see something similar in C or C++. There are some optimizations that can only be done at runtime, not at compile time. Maybe not even at linking time; but only with runtime type information, and that may also change over time (e.g. the user first computes a distance matrix for Euclidean distance, then for Manhattan distance).

Don't get me wrong. I'm not saying Java is perfect. There are a lot of common mistakes, such as using java.util.Collections for primitive types, which comes at a massive memory cost and garbage collection overhead. The first thing to debug when optimizing Java applications is to check for memory usage overhead. But all in all, good Java code can indeed perform well, and may even outperform C code, due to the inlining optimization I just discussed; in particular on large projects where you cannot fine-tune inlining in C anymore.

Sometimes, Hotspot may also fail. Which is largely why I've been investigating these issues recently. In ELKI 0.6.0 I'm facing a severe performance regression with linear scans (which is actually the simpler codepath, not using indexes but using a simple loop as seen above). I had this with 0.5.0 before, but back then I was able to revert back to an earlier version that still performed well (even though the code was much more redundant). This time, I would have had to revert a larger refactoring that I wanted to keep, unfortunately.
Because the regression was quite severe - from 600 seconds to 1500-2500 seconds (still clearly better than -Xcomp) - I first assumed I was facing an actual programming bug. Careful inspection down to the assembler code produced by the hotspot VM did not reveal any such error. Then I tried Java 8, and the regression was gone.
So apparently, it is not a programming error, but Java 7 failed at optimizing it remotely as well as it did with the previous ELKI version!
If you are a Java guru interested in tracking down this regression, feel free to contact me. It's in an open source project, ELKI. I'd be happy to have good performance even for linear scans, and Java 7. But I don't want to waste any more hours on this, and instead plan to move on to Java 8 for other reasons (lambda expressions, which will greatly reduce the amount of glue code needed), too. Plus, Java 8 is faster in my benchmarks.
2013-12-16 15:13 — Categories: English, Coding

Numerical precision and "Big Data"

Everybody is trying (or pretending) to be "big data" these days. Yet, I have the impression that there are very few true success stories. People fight with the technology to scale, and have not even arrived at the point of doing detailed analysis - or even meta-analysis, i.e. checking whether the new solutions actually perform better than the old "small data" approaches.
In fact, a lot of the "big data" approaches just reason "you can't do it with the existing solutions, so we cannot compare". Which is not exactly true.

My experience from experiments with large data (not big; it still fits into main memory) is that you have to be quite careful with your methods. Just scaling up existing methods does not always yield the expected results. The larger your data set, the more surface there is for tiny problems to ruin your computations.
"Big data" is often based on the assumption that just by throwing more data at your problem, your results will automatically become more precise. This is not true. On contrary: the larger your data, the more likely you have some contamination that can ruin everything.
We tend to assume that numerics and such issues have long been resolved. But while there are some solutions, it should be noted that they come at a price: they are slower, have some limitations, and are harder to implement.
Unfortunately, such issues are just about everywhere in data analysis. I'll demonstrate it with a very basic example. Assume we want to compute the mean of the following series: [1e20, 3, -1e20]. Computing the mean - everybody should be able to do this, right? Well, let's agree that the true solution is 1, as the first and last term cancel out. Now let's try some variants:
  • Python, naive: sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy sum: numpy.sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy mean: numpy.mean([1e20, 3, -1e20]) yields 0.0
  • Python, less-known function: math.fsum([1e20, 3, -1e20])/3 yields 1.0
  • Java, naive: System.out.println( (1e20+3-1e20)/3 ); yields 0.0
  • R, mean: mean( c(1e20,3,-1e20) ) yields 0
  • R, sum: sum( c(1e20,3,-1e20) )/3 yields 0
  • Octave, mean: mean([1e20,3,-1e20]) yields 0
  • Octave, sum: sum([1e20,3,-1e20])/3 yields 0
So what is happening here? All of these functions (except Python's less-known math.fsum) use double precision. With double precision, 1e20 + 3 = 1e20, as a double can only retain 15-16 digits of precision. To actually get the correct result, you need to keep track of your error using additional doubles.
Now you may argue that this only happens with large differences in magnitude. Unfortunately, this is not true. It also surfaces when you have a large number of observations! Again, I'm using Python to exemplify (because math.fsum is accurate).
> a = array(range(-1000000,1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)
As can be seen from the math.fsum function, solutions exist - for example Shewchuk's algorithm (which is probably what powers math.fsum). For many cases, Kahan summation will also be sufficient, which essentially gives you twice the precision of doubles.
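For illustration, here is a minimal Kahan summation sketch in Java (it roughly doubles the effective precision, but it is still not exact the way Shewchuk's algorithm / math.fsum is, so it will not rescue the extreme 1e20 example above):
/** Kahan (compensated) summation: carries the lost low-order bits in c. */
static double kahanSum(double[] values) {
  double sum = 0.0;
  double c = 0.0;          // running compensation for lost low-order bits
  for (double v : values) {
    double y = v - c;      // correct the next value by the carried error
    double t = sum + y;    // big + small: low-order bits of y may get lost here
    c = (t - sum) - y;     // algebraically zero; numerically, the lost part
    sum = t;
  }
  return sum;
}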
Note that these issues become even worse once you use subtraction, such as when computing variance. Never use the famous E[X^2]-E[X]^2 formula. It is mathematically correct, but when your data is not centered (i.e. E[X] is not close to 0, but instead much larger than your standard deviation) then you will see all kinds of odd errors, including negative variance, which may then yield a NaN standard deviation:
> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> mean(a**2)-mean(a)**2, mean(b**2)-mean(b)**2
(0.33333366666666647, -21532835718365184.0)
(As you can see, numpy.var does not use the naive single-pass formula; they probably use the classic straightforward two-pass approach.)
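For reference, a two-pass variance sketch in Java: center on the mean first, then average the squared deviations; this avoids the cancellation of the single-pass shortcut (a numerically stable single-pass alternative would be Welford's online algorithm):
/** Two-pass variance: compute the mean first, then average squared deviations. */
static double variance(double[] values) {
  double mean = 0.0;
  for (double v : values) {
    mean += v;
  }
  mean /= values.length;
  double sumdev = 0.0;
  for (double v : values) {
    double dev = v - mean;
    sumdev += dev * dev;
  }
  return sumdev / values.length; // population variance; divide by (n - 1) for the sample variance
}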
So why do we not always use the accurate computations? Well, we use floating point with fixed precision because it is fast. And most of the time, when dealing with well conditioned numbers, it is easily accurate enough. To show the performance difference:
> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print timeit.timeit(f, setup="import math; a=range(0,1000)")
30.6121790409
202.994441986
So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude), we might prefer the simpler approach, which is roughly 3-6x faster (at least as long as pure CPU performance is concerned; once I/O gets into play, the difference might just disappear altogether). Which is probably why all but the fsum function show the same inaccuracy: performance. In particular, since in 99% of situations the problems won't arise.

Long story. Short takeaway: When dealing with large data, pay extra attention to the quality of your results. In fact, even do so when handling small data that is dirty and contains outliers and different magnitudes. Don't assume that computer math is exact, floating point arithmetic is not.
Don't just blindly scale up approaches that seemed to work on small data. Analyze them carefully. And last but not least, consider if adding more data will actually give you extra precision.
2013-11-02 23:47 — Categories: English, Technology, Research, Coding

Big Data madness and reality

"Big Data" has been hyped a lot, and due to this now is burning down to get some reality checking. It's been already overdue, but it is now happening.
I have seen a lot of people be very excited about NoSQL databases; and these databases do have their use cases. However, you just cannot put everything into NoSQL mode and expect it to just work. That is why we have recently seen NewSQL databases, query languages for NoSQL databases etc. - and in fact, they all move towards relational database management again.
Sound new database systems seem to be mostly made up of three aspects: in-memory optimization instead of optimizing for on-disk operation (memory just has become a lot cheaper over the last 10 years), the execution of queries on the servers that store the actual data (you may want to call this "stored procedures" if you like SQL, or "map-reduce" if you like buzzwords), and optimized memory layouts (many of the benefits of "column store" databases come from having a single, often primitive, datatype for the full column to scan, instead of alternating datatypes in records).
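To see why the uniform, primitive column layout matters, compare these two simplified layouts: scanning the primitive double[] column is a tight loop over contiguous memory, while the row/record layout has to chase object references and unbox values:
// Row layout: one object per record, mixed types, values behind references.
class Record {
  String name;
  Double value; // boxed
}

// Column layout: one primitive array per column, uniform type, contiguous in memory.
class Column {
  double[] values;

  double sum() {
    double s = 0;
    for (double v : values) { // tight scan: no unboxing, no pointer chasing
      s += v;
    }
    return s;
  }
}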

However, here is one point you need to consider:
is your data actually this "big"? Big as in: Google scale.
I see people use big data and Hadoop a lot when they just shouldn't. I see a lot of people run Hadoop in a VM on their laptop. Ouch.
The big data technologies are not a one-size-fits-all solution. They are the supersize-me solution, and supersize just does not fit every task.
When you look at the cases where Hadoop is really successful, it is mostly in keeping the original raw data, and enabling people to re-scan the full data again when e.g. their requirements changed. This is where Hadoop is really good at: managing 100 TB of data, and allowing you to quickly filter out the few GB that you really need for your current task.
For the actual analysis - or when you don't have 100 TB and a large cluster anyway - just don't try to hadoopify everything.
Here is a raw number from a current experiment. I have a student work on indexing image data; he is implementing some state of the art techniques. For these, a large number of image features are extracted, and then clustering is used to find "typical" features to improve image search.
The benchmarking image set is 56 GB (I have others with 2 TB to use next). The subset the student is currently processing is 13 GB. Extracting 2.3 million 128 dimensional feature vectors reduces this to about 1.4 GB. As you can see, the numbers drop quickly.
State of the art seems to be to load the data into Hadoop, and run clustering (actually, this is more of a vector quantization than clustering) into 1000 groups. Mahout is the obvious candidate to run this on Hadoop.
However, as I've put a lot of effort into the data mining software ELKI, I also considered processing it in ELKI.
By cutting the data into 10 MB blocks, Mahout/Hadoop can run the clustering in 52x parallel mappers. k-Means is an iterative algorithm, so it needs multiple processes. I have fixed the number of iterations to 10, which should produce a good enough approximation for my use cases.
K-means is embarrassingly parallel, so one would expect the cluster to really shine at this task. Well, here are some numbers:
  • Mahout k-Means took 51 minutes on average per iteration (the raw numbers are 38:24, 62:29, 46:21, 56:20, 52:15, 45:11, 57:12, 44:58, 52:01, 50:26, so you can see there is a surprisingly high amount of variance there).
  • ELKI k-Means on a single CPU core took 4 minutes 25 seconds per iteration, and 45 minutes total, including parsing the input data from an ASCII file. Maybe I will try a parallel implementation next.

So what is happening? Why is ELKI beating Mahout by a factor of 10x?
It's (as always) a mixture of a number of things:
  • ELKI is quite well optimized. The Java Hotspot VM does a good job at optimizing this code, and I have seen it to be on par with R's k-means, which is written in C. I'm not sure if Mahout has received a similar amount of optimization yet. (In fact, 15% of the Mahout runtime was garbage collection runtime - indicating that it creates too many objects.)
  • ELKI can use the data in a uniform way, similar to a column store database. It's literally crunching the raw double[] arrays. Mahout on the other hand - as far as I can tell - is getting the data from a sequence file, which then is deserialized into a complex object. In addition to the actual data, it might be expecting sparse and dense vectors mixed etc.
  • Size: this data set fits well into memory. Once this no longer holds, ELKI will no longer be an option. Then MapReduce/Hadoop/Mahout wins. In particular, such an implementation will by design not keep the whole data set in memory, but need to de-serialize it from disk again on each iteration. This is overhead, but saves memory.
  • Design: MapReduce is designed for huge clusters, where you must expect nodes to crash during your computation. Well, chances are that my computer will survive 45 minutes, so I do not need this for this data size. However, when you really have large data, and need multiple hours on 1000 nodes to process it, then this becomes important to survive losing a node. The cost is that all interim results are written to the distributed filesystem. This extra I/O comes, again, at a cost.
Let me emphasize this: I'm not saying, Hadoop/Mahout is bad. I'm saying: this data set is not big enough to make Mahout beneficial.

Conclusions: As long as your data fits into your main memory and takes just a few hours to compute, don't go for Hadoop.
It will likely be faster on a single node by avoiding the overhead associated with (current) distributed implementations.
Sometimes, it may also be a solution to use the cluster only for preparing the data, then get it to a powerful workstation, and analyze it there. We did do this with the images: for extracting the image features, we used a distributed implementation (not on Hadoop, but on a dozen PCs).
I'm not saying it will stay that way. I have plans for starting "Project HELKI", aka "ELKI on Hadoop". Because I do sometimes hit the barrier of what I can compute on my 8-core CPU in a "reasonable" amount of time. And of course, Mahout will improve over time, and hopefully lose some of its "Java boxing" weight.
But before trying to run everything on a cluster, always try to run it on a single node first. You can still figure out how to scale up later, once you really know what you need.
And last but not least, consider whether scaling up really makes sense. K-means results don't really get better with a larger data set. They are averages, and adding more data will likely only change the last few digits. Now if you have an implementation that doesn't pay attention to numerical issues, you might end up with more numerical error than you gained from that extra data.
In fact, k-means can effectively be accelerated by processing a sample first, and only refining this on the full data set then. And: sampling is the most important "big data" approach. In particular when using statistical methods, consider whether more data can really be expected to improve your quality.
Better algorithms: K-means is a very crude heuristic. It optimizes the sum of squared deviations, which may not be too meaningful for your problem. It is not very noise tolerant, either. And there are thousands of variants. For example bisecting k-means, which is no longer embarrassingly parallel (i.e. it is not as easy to implement on MapReduce), but took just 7 minutes doing 20 iterations for each split. The algorithm can be summarized as starting with k=2, then always splitting the largest cluster in two until you have the desired k. For many use cases, this result will be just as good as the full k-means result.
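A rough sketch of that bisecting strategy (a simplification for illustration, not ELKI's implementation; the 2-means step is a minimal Lloyd iteration with naive seeding):
import java.util.ArrayList;
import java.util.List;

public class BisectingKMeans {
  /** Split the largest cluster in two until we have k clusters (iters >= 1). */
  static List<List<double[]>> bisect(List<double[]> data, int k, int iters) {
    List<List<double[]>> clusters = new ArrayList<List<double[]>>();
    clusters.add(new ArrayList<double[]>(data));
    while (clusters.size() < k) {
      // Find and remove the largest cluster ...
      int largest = 0;
      for (int i = 1; i < clusters.size(); i++) {
        if (clusters.get(i).size() > clusters.get(largest).size()) {
          largest = i;
        }
      }
      // ... and split it with an ordinary 2-means run.
      clusters.addAll(twoMeans(clusters.remove(largest), iters));
    }
    return clusters;
  }

  /** Minimal Lloyd k-means with k=2, naively seeded with the first and last point. */
  static List<List<double[]>> twoMeans(List<double[]> data, int iters) {
    double[][] means = { data.get(0).clone(), data.get(data.size() - 1).clone() };
    List<List<double[]>> parts = null;
    for (int it = 0; it < iters; it++) {
      parts = new ArrayList<List<double[]>>();
      parts.add(new ArrayList<double[]>());
      parts.add(new ArrayList<double[]>());
      for (double[] p : data) { // assign each point to the nearer mean
        parts.get(dist2(p, means[0]) <= dist2(p, means[1]) ? 0 : 1).add(p);
      }
      for (int m = 0; m < 2; m++) { // recompute the means
        if (parts.get(m).isEmpty()) {
          continue;
        }
        double[] mean = new double[means[m].length];
        for (double[] p : parts.get(m)) {
          for (int d = 0; d < mean.length; d++) {
            mean[d] += p[d];
          }
        }
        for (int d = 0; d < mean.length; d++) {
          mean[d] /= parts.get(m).size();
        }
        means[m] = mean;
      }
    }
    return parts;
  }

  static double dist2(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return sum;
  }
}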
Don't get me wrong. There are true big data problems. Web scale search, for example. Or product and friend recommendations at Facebook scale. But chances are that you don't have this amount of data. Google probably doesn't employ k-means at that scale either. (Actually, Google runs k-means on 5 million keypoints for vector quantization; which, judging from my experience here, can still be done on a single host, in particular with hierarchical approaches such as bisecting k-means.)
Don't choose the wrong tool for the wrong problem!
2013-09-27 19:34 — Categories: English Coding Technology ResearchPermaLink & Comments

Google turning evil

I used to be a fan of the Google company. I used to think of this as a future career opportunity, as I've received a number of invitations to apply for a job with them.
But I am no longer a fan.
Recently, the company has in my opinion turned to the "evil" side; they have probably become worse than Microsoft ever was, and the "don't be evil" principle is long gone.
So what has changed? In my perception, Google:
  • Is much more focused on making money now than on making technology happen.
  • Instead of making cool technology, it tries more and more to be a "hipster" thing, following the classic old 80-20 rule: get 80% of the users with 20% of the effort. Think of the GMail changes that received so much hate, the various services it shuts down (Code Search, Reader, Google Code downloads), and how much Google is becoming a walled garden.
  • Google leverages its search market dominance as well as Android to push its products that don't perform as well as desired. Google+, for example. There are a lot of things I like about Plus. In particular, when you use it as a "blog" type of conversation tool instead of a site for sharing your private data, it's much better than Facebook because of the search function and communities. (And I definitely wouldn't cry if a Facebook successor emerged.) But what I hate is how Google tries hard to leverage its Android and YouTube power to force people onto Plus. And now they are spreading even further, into the TV (ChromeCast) and ISP (Google Fiber) markets.
  • Hangups, oops, Hangouts is another such example. I've been using XMPP for a long time. At some point, Google started operating an XMPP server, and it would actually "federate" with other servers, and it worked quite nicely. But at some point they decided they needed more "hipster" features, like ponies. Or more likely, they decided that they want all these users to use Plus. So now they are moving it away from an open standard towards a walled garden. But since 90% of my use is XMPP outside of Google, I will instead just move away from them. Worse.
  • Reader. They probably shut this down to support Currents and, in the long run, to move people over to Plus, too. Fortunately, here there now exist a number of good alternatives such as Feedly. But again: Google failed my needs.
  • Maps. Again an example where they moved from "works well" to "hipster crap". The first thing the new Maps always greets me with is a gray screen. The new 3D view looks like an earthquake happened, without offering any actual benefit over the classic (and fast) satellite imagery. Worse. In fact, my favorite map service right now is an OpenStreetMap offline map.
  • Google Glass is pure hipsterdom crap. I have yet to see an actual use case for it. People use it to show off, and that's about it. If you are serious about photos, you use a DSLR camera with a large lens and sensor. But face it: it's just distraction. If you want to be productive, it's actually best to turn off all chat and email and focus. Best productivity tip ever: turn off email notifications. And turn off chat, always-on and Glass, too.
  • Privacy. When privacy-aware US providers actually recommend against anyone trusting their private data to a company with physical ties to the United States, then maybe it's time to look out for services in countries that value freedom higher than spying. I cannot trust Google to keep my data private.
We are living in worrisome times. The U.S.A., once proud defenders of freedom, have become the most systematic spies in the world, even on their own people. The surveillance in the Eastern Bloc is no match for this machinery built over the last decade. Say hello to Nineteen Eighty-Four. Surveillance, in a mixture of multi-state and commercial forms, has indeed become omnipresent. Wireless sensors in trash cans track your device MACs. Your email is automatically scanned both by the NSA and Google. Your social friendships are tracked by Facebook and your iPhone.
I'm not being paranoid. These things are real, and they can only be explained by a "brainwashing" with both the constantly raised fear of terror (note that you are much more likely to be killed in a traffic accident or by a random gunman in the U.S.A. than by terrorists) and the constant advertisement for useless technology such as Google Glass. Why have so many people stopped fighting for their freedom?
So what now? I will look into options to move my stuff away from Google (that is mostly my email -- although I'd really like to see a successor to email finally emerge). Maybe I can find a number of good services located e.g. in Switzerland or Norway (who still seem to hold freedom in high respect - neither the UK nor Germany is an alternative these days). And I hope that some politicians will have the courage to openly discuss whether it may be necessary to break up Google into "Boogles" (Baby-Googles), just as was done with AT&T in the Reagan era. But unfortunately, today's politicians are really bad at such decisions, in particular when it might lower their short-run popularity. They are only good at making excuses.
2013-08-20 14:25 — Categories: English Web CodingPermaLink & Comments

Google Hangouts drops XMPP support

Update: today I've been receiving XMPP messages in the Google+ variant of Hangouts again. It looks as if it is currently back (at least while you are logged in via XMPP - I haven't tried without Pidgin running at the same time yet). Let's just hope that XMPP federation will continue to be supported in the long run.
It's been all over the internet, so you probably heard it already: Google Hangouts no longer receives messages from XMPP users. Before, you could easily chat with "federated" users from other Jabber servers.
While of course the various open-source people are not amused -- for me, most of my contacts disappeared, so I uninstalled Hangouts to get Google Talk back (apparently this works if Talk was preinstalled in your phone's firmware) -- this bears some larger risks for Google:
  • Reputation: Google used to have the reputation of being open. XMPP support was open; the current "Hangups" protocol is not. This continuing trend of abandoning open standards and moving to "walled garden" solutions will likely harm the company's reputation in the open source community.
  • Legal risk of an antitrust action: Before, other competitors could interface with Google using an independent and widely accepted standard. An example is United Internet in Germany, which operates for example the Web.de and GMX platforms, mail.com, and the 1&1 internet provider. Effectively locking out its competitors - without an obvious technical reason, as XMPP was working fine just before, and apparently continues to be used at Google, for example in AppEngine - bears a high risk of running into an antitrust action in Europe. If I were 1&1, I would try to get my lawyers started... or if I were Microsoft, who apparently just wanted to add XMPP messaging to Hotmail?
  • Users: Google+ is not that big yet, especially in Germany. Since 90% of my contacts were XMPP contacts, where am I likely going to move: to Hangouts, or to another XMPP server? Or back to Skype? I still use Skype for more voice calls than Google (which I used like twice), because there are some people that prefer Skype. One of these calls probably was not even using the Google plugin, but an open source phone - because with XMPP and Jingle, my regular chat client would interoperate. And in fact, the reason I started using Google Talk in the first place was that it would interoperate with other networks, too, and I assumed they would be good at operating a Jabber server.
In my opinion, Google needs to quickly restore a functioning XMPP bridge. It is okay if they offer add-on functionality only for Hangouts users (XMPP was always designed to allow for add-on functionality); it is also okay if they propose an entirely new open protocol to migrate to in the long run, if they can show good reasons such as scalability issues. But the way they approached the Hangouts rollout looks like a big #fail to me.
Oh, and there are other issues, too. For example, Linus Torvalds complains about the fonts being screwed up (not hinted properly) in the new Google+, and others complain about broken presence indicators (but then you might as well just send an email, if you can't tell whether the recipient will be able to receive and answer right away). Using Hangouts will apparently also (for now -- rumor has it that Voice will be replaced by Hangouts entirely) lose you Google Voice support. The only things that seem to earn positive press are the easter eggs...
All in all, I'm not surprised to see over 20% of users giving the lowest rating in the Google Play Store, and less than 45% giving the highest rating - for a Google product, this must be really low.
2013-05-21 15:41 — Categories: English Web Google TechnologyPermaLink & Comments

ELKI data mining in 2013

ELKI, the data mining framework I use for all my research, is coming along nicely, and will see continued progress in 2013. The next release is scheduled for SIGMOD 2013, where we will be presenting the novel 3D parallel coordinates visualization we recently developed. This release will bear the version number 0.6.0.
Version 0.5.5 of ELKI has been in Debian unstable since December (version 0.5.0 will be in the next stable release) and in Ubuntu raring. The packaged installation can share its dependencies with other Debian packages, so it is smaller than the download from the ELKI web site.
If you are developing cluster analysis or outlier detection algorithms, I would love to see them contributed to ELKI. If I get clean and well-integrated code by mid-June, your algorithm could be included in the next release, too. Publishing your algorithms in source code in a larger framework such as ELKI will often give you more citations, because it is then easier to compare with your algorithm and to try it on new problems. And, well, citation counts are a measure that administrations love to use when judging researchers ...
So what else is happening with ELKI:
  • The new book "Outlier Analysis" by C. C. Aggarwal mentions ELKI for visual evaluation of outlier results as well as in the "Resources for the Practitioner" section, and cites around 10 publications closely related to ELKI.
  • Some classes for color feature extraction of ELKI have been contributed to jFeatureLib, a Java library for feature detection in image data.
  • I'd love to participate in the Google Summer of Code, but I need a contact at Google to "vouch" for the project, otherwise it is hard to get in. I've been sending a couple of emails, but so far have not heard back much yet.
  • As the performance of SVG/Batik is not too good, I'd like to see more OpenGL based visualizations. This could also lead to an Android based version for use on tablets.
  • As I'm not a UI guy, I would love to have someone build a fancier UI that still exposes all the rich functionality we have. The current UI is essentially an automatically generated command line builder - which is nice, as new functionality shows up without the need to modify UI code. It's good for experienced users like me, but hard for beginners to get started with.
  • I'd love to see integration of ELKI with e.g. OpenRefine / Google Refine to make it easier to do appropriate data cleaning and preprocessing
  • There is work underway for a distributed version running on Hadoop/YARN.
2013-02-28 10:55 — Categories: English Coding Debian ResearchPermaLink & Comments

Phoronix GNOME user survey

While not everybody likes Phoronix (common complaints include tabloid journalism), they are doing a GNOME user survey again this year. If you are concerned about Linux on the desktop, you might want to participate; it is not particularly long.
Unfortunately, "the GNOME Foundation still isn't interested in having a user survey", and may again ignore the results; and already last year you could see a lot of articles along the lines of The Survey That GNOME Would Rather Ignore. One more reason to fill it out.
2012-11-21 09:30 — Categories: English LinuxPermaLink & Comments

Migrating from GNOME3 to XFCE

I have been a GNOME fan for years. I actually liked the switch from 1.x to 2.x, and at some point switched to 3.x when it became somewhat usable. At some point, I even started some small Gnome projects, one of which was uploaded to the Gnome repositories. But I didn't have much time for my Linux hobby anymore back then.
However, I am now switching to XFCE. And for all I can tell, I am about the last one to make that switch. Everybody I know hates the new Gnome.
My reason is not emotional. It's simple: I have systems that don't work well with OpenGL, and thus don't work well with Gnome Shell. Up to now, I can live fine with "fallback mode" (aka Gnome Classic). It works really well for me, and does exactly what I need. But it has been all over the media: Gnome 3.8 will drop 'fallback' mode.
Now the choice is obvious: instead of switching to shell, I go to XFCE. Which is much closer to the original Gnome experience, and very productivity oriented.
There are tons of rants on GNOME 3 (for one of the most detailed ones, see Gnome rotting in threes, going through various issues). Something must be very wrong about what they are doing to receive this many sh*tstorms all the time. Every project receives some. I even received a share of the Gnome 2 storms when Galeon (an early Gnome browser) made the move and started dropping some of the hard-to-explain and barely used options that would break with every other Mozilla release - and Mozilla embedding was a major pain in those days. Yet, for every feature there would be some user somewhere that loved it, and as the Debian maintainer of Galeon, I got to see all the complaints (and at the same time was well aware of the bugs caused by the feature overload).
Yet with Gnome 3, things are IMHO a lot different. In Gnome 2, it was a lot about making existing things more usable, a bit cleaner and more efficient. With Gnome 3, it seems to be about experimenting with new stuff, which is why it keeps on breaking APIs all the time. For example, theming GTK 3 is constantly broken; most of the themes available just don't work. The same goes for Gnome Shell extensions - most of them work with exactly one version of Gnome Shell (doesn't this indicate the authors have abandoned Gnome Shell?).
But the one thing that really stuck out was when I updated my dad's PC. Apart from some glitches, he could not even shut down his PC with Gnome Shell, because you needed to press the Alt key to actually get a shutdown option.
This is indicative of where Gnome is heading: something undefined in between PCs, tablets, media centers and mobile phones. They just decided that users don't need to shut down anymore, so they could as well drop that option.
But the worst thing about the current state of GNOME is: they happily live with it. They don't care that they are losing users by the dozens, because to them, these are just "complainers". Of course there is some truth in "complainers gonna complain and haters gonna hate". But what Gnome is receiving is way above average. At some point, they should listen. 200-post comment threads from dozens of people on LWN are not just your average "complaints". They are an indicator that a key user base is unhappy with the software. In 2010, GNOME 2 had 45% market share in the LinuxQuestions poll, and XFCE had 15%. In 2011, GNOME 3 had 19%, and XFCE jumped to 28%. And I wouldn't be surprised if GNOME 3 Shell (not counting fallback mode) clocked in at less than 10% in 2012 - despite being the default.
Don't get me wrong: there is a lot about Gnome that I really like. But as they decided to drop my preferred UI, I am of course looking for alternatives - in particular, as I can get a lot of the Gnome 3 benefits with XFCE. There is a lot in the Gnome ecosystem that I value, and that IMHO is driving Linux forward: NetworkManager, Poppler, PulseAudio, Clutter, just to name a few. Usually, the stuff that is modular is really good. And in fact I have been a happy user of the "fallback" mode, too. Yet, the overall "desktop" Gnome 3 goals are in my opinion targeting the wrong user group. Gnome might need to target Linux developers more again, to keep a healthy development community around. Frequently triggering sh*tstorms from high-profile people such as Linus Torvalds is not going to strengthen the community. There is nothing wrong with people in the FL/OSS community encouraging others to use XFCE - but these are developers that Gnome might need at some point.
On a backend/technical level (away from the Shell/UI stuff that most of the rants are about), my main concern about the Gnome future is GTK3. GTK2 was a good toolkit for cross-platform development. GTK3 as of now is not; it is largely a Linux/Unix-only toolkit - in particular, because there apparently is no up-to-date Win32 port. With GTK 3.4 it was said that they were now working on Windows - but as of GTK 3.6, such a port is still nowhere to be found. So if you want to develop cross-platform, as of now, you better stay away from GTK 3. If this doesn't change soon, GTK might sooner or later lose the API battle to more portable libraries.
Update: Some people on reddit seem to read this as if I am switching out of protest. This is incorrect. As "fallback" mode is now officially discontinued, I am switching to the next best choice for me: XFCE. And I am doing this switch before things start breaking with some random upgrade. I know that XFCE is a good choice, so why not switch early? In fact, so far I have only given XFCE a test drive, but it already feels right, and maybe even slightly better than fallback mode.
2012-11-13 14:14 — Categories: English Linux CodingPermaLink & Comments

DBSCAN and OPTICS clustering

DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.
An interesting property of density based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow "noise" objects that do not belong to any of the clusters. K-means, for example, partitions the data space into Voronoi cells (some people claim it produces spherical clusters - that is incorrect). See Wikipedia for the true shape of K-means clusters and an example that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects) - not surprisingly, as we tend to experiment with artificial data generated by a number of Gaussian distributions.
The key parameter to DBSCAN and OPTICS is the "minPts" parameter. It roughly controls the minimum size of a cluster. If you set it too low, everything will become clusters (OPTICS with minPts=2 degenerates to a type of single link clustering). If you set it too high, at some point there won't be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you for example expect clusters to typically have 100 objects, I'd start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.
The more difficult parameter for DBSCAN is the radius. In some cases, it will be very obvious. Say you are clustering users on a map. Then you might know that a good radius is 1 km. Or 10 km. Whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
OPTICS is based on a very clever idea: instead of fixing minPts and the radius, we only fix minPts, and plot the radius at which an object would be considered dense by DBSCAN. In order to sort the objects in this plot, we process them with a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example of such a plot.
OPTICS comes at a cost compared to DBSCAN, largely because of the priority heap, but also because the nearest neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won't produce a strict partitioning. Primarily it produces this plot, and in many situations you will actually want to visually inspect the plot. There are some methods to extract a hierarchical partitioning out of this plot, based on detecting "steep" areas.
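To make the roles of the two parameters concrete, here is a deliberately naive DBSCAN sketch in plain Java (linear-scan neighborhood queries, no index; this is illustrative only and not the ELKI implementation):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Naive DBSCAN sketch: a point with at least minPts neighbors within eps is a
// "core" point; clusters are grown by expanding the neighborhoods of core points.
public class NaiveDBSCAN {
  static final int UNVISITED = 0, NOISE = -1;

  static int[] dbscan(double[][] data, double eps, int minPts) {
    int[] label = new int[data.length]; // 0 = unvisited, -1 = noise, >0 = cluster id
    int clusterId = 0;
    for (int i = 0; i < data.length; i++) {
      if (label[i] != UNVISITED) continue;
      List<Integer> neighbors = regionQuery(data, i, eps);
      if (neighbors.size() < minPts) { label[i] = NOISE; continue; }
      clusterId++; // start a new cluster at this core point
      label[i] = clusterId;
      ArrayDeque<Integer> seeds = new ArrayDeque<>(neighbors);
      while (!seeds.isEmpty()) {
        int j = seeds.poll();
        if (label[j] == NOISE) label[j] = clusterId; // border point: assign, don't expand
        if (label[j] != UNVISITED) continue;
        label[j] = clusterId;
        List<Integer> jNeighbors = regionQuery(data, j, eps);
        if (jNeighbors.size() >= minPts) seeds.addAll(jNeighbors); // j is a core point, too
      }
    }
    return label;
  }

  // Linear-scan radius query; this is exactly what an index (e.g. an R-tree) accelerates.
  static List<Integer> regionQuery(double[][] data, int q, double eps) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      double dist = 0;
      for (int d = 0; d < data[q].length; d++) {
        double diff = data[q][d] - data[i][d];
        dist += diff * diff;
      }
      if (dist <= eps * eps) result.add(i);
    }
    return result;
  }
}

OPTICS keeps the same minPts logic, but instead of testing against a fixed eps it records, for each point, the smallest radius (the "reachability distance") at which the point would become density-reachable, and uses a priority heap so that nearby points end up next to each other in the output plot.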
The open source ELKI data mining framework (package "elki" in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I've benchmarked this against GNU R (the "fpc" package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes; with an index this goes down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka takes 11 hours (update 11.1.2013: 84 minutes with the 1.0.3 revision of the DBSCAN extension, which resolved some performance issues) and GNU R/fpc takes 100 minutes (DBSCAN only, no OPTICS available). And the implementation of OPTICS in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version ...
ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.
Some parts of the code may be a bit confusing at first. The "Parameterizer" classes, for example, serve the purpose of allowing automatic UI generation, so there is quite a bit of meta code involved.
Plus, ELKI is quite extensively optimized. For example, it does not use the Java Collections API much anymore. Java Iterators are required to return an object from next(); the C++-style iterators used by ELKI can instead expose multiple values, including primitive values.
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())
is a typical for loop in ELKI, iterating over all objects of a relation, yet the whole loop requires creating (and GC'ing) only a single object. And it still reads just like a plain for loop.
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);
is another example. Essentially, this is like a HashSet<DBID>, except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
Java advocates always accuse you of premature optimization when you avoid creating objects for primitives. Yet, in all my benchmarking, I have consistently seen the number of objects you allocate have a major impact, at least inside a heavily used loop. Java collections with boxed primitives just eat a lot of memory, and the memory management overhead often makes a huge difference. This is why libraries such as Trove (which ELKI uses a lot) exist: memory usage does make a difference.
(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)
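As a toy illustration of the boxing overhead (a minimal sketch, not a rigorous benchmark; it assumes the Trove library is on the classpath and a reasonably large heap, e.g. -Xmx2g, and the exact numbers will depend on your JVM):

import gnu.trove.set.hash.TIntHashSet;
import java.util.HashSet;

// Compare inserting ints into a boxed HashSet<Integer> vs. a primitive Trove TIntHashSet.
public class BoxingDemo {
  public static void main(String[] args) {
    final int n = 5_000_000;

    long t0 = System.nanoTime();
    HashSet<Integer> boxed = new HashSet<>(n);
    for (int i = 0; i < n; i++) {
      boxed.add(i); // each add autoboxes the int into a separate Integer object
    }
    long t1 = System.nanoTime();

    TIntHashSet primitive = new TIntHashSet(n);
    for (int i = 0; i < n; i++) {
      primitive.add(i); // stored in an int[] internally, no per-entry object
    }
    long t2 = System.nanoTime();

    System.out.printf("boxed:     %d ms%n", (t1 - t0) / 1_000_000);
    System.out.printf("primitive: %d ms%n", (t2 - t1) / 1_000_000);
  }
}

The difference in memory footprint (one Integer object plus a hash table node per boxed entry, versus a flat int[]) is typically even more important than the raw runtime difference, especially inside hot loops.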
2012-11-02 14:52 — Categories: English CodingPermaLink & Comments

Changing Gnome 3 colors

One thing that many people dislike about Gnome 3, in my opinion, is that the authors/maintainers impose a lot of decisions on you. These are in fact not really hard-coded, but I found the documentation on how to change them to be really inaccessible.
For example colors. How to customize GTK colors is extremely badly documented, and at the same time, a lot of the themes do not work reliably across different Gnome versions. For example, the unico engine in Debian experimental is currently incompatible with the main GTK version there (and even worse, GTK does not detect this, so it will not refuse to load the incompatible engine). A lot of the themes you can get on gnome-look.org, for example, use unico. So it is pretty easy to end up with a non-working GTK 3; this really should not happen that easily. (I do not blame the Debian maintainers for not having worked around this with package conflicts yet - it is in experimental, after all. But upstream should know when they are breaking APIs!)
For my work on the ELKI data mining framework I do a lot of work in Eclipse. And here GTK3 really is annoying, in particular the default theme. Next to unusable, actually, as code documentation tooltips show up black-on-black.
Recently, Gnome seems to be mostly driven by a mix of design and visual motivations. Gnome Shell is a good example: no classic Linux user I've met likes it; even my dad immediately asked me how to get the classic panels back. It is only the designers that seem to love it. I'm concerned that they are totally off on their audience; they seem to target Mac OS X users instead of Linux users. This is a pity, and probably a large part of the reason why Gnome so far does not succeed on the desktop: it keeps on forgetting the users it already has. Those users by now seem to move to XFCE and LXDE, because neither the KDE nor the Gnome crowd cares about classic Linux users in the hunt for copying OS X & Co.
Anyway, enough ranting. Here is a simple workaround -- that hopefully is more stable across GTK/Gnome versions than all those themes out there -- that just slightly adjusts the default theme:
$ gsettings set \
org.gnome.desktop.interface gtk-color-scheme '
os_chrome_fg_color:black;
os_chrome_bg_color:#FED;
os_chrome_selected_fg_color:black;
os_chrome_selected_bg_color:#f5a089;
selected_fg_color:white;
selected_bg_color:#c50039;
theme_selected_fg_color:black;
theme_selected_bg_color:#c50039;
tooltip_fg_color:black;
tooltip_bg_color:#FFC;
'
This will turn your panel from designer-hip black back to a regular grayish work panel. If you are working a lot with Eclipse, you'll love the last two options: they make the tooltips readable again! Isn't that great? Instead of caring about the latest hipster colors, we now have readable tooltips for developers again, instead of all those fancy-schmancy designer orgasms!
Alternatively, you can use dconf-editor to set and edit the value. The tricky part was to find out which variables to set. The (undocumented?) os_chrome stuff seems to be responsible for the panel. Feel free to change the colors to whatever you prefer!
GTK is quite customizable, and the gsettings mechanism actually is quite nice for this. It just seems to be really badly documented. The Adwaita theme in particular seems to have quite some hard-coded relationships, also for the colors. And I haven't found a way (without doing a complete new theme) to just reduce padding, for example - in particular, as there are probably a hundred CSS parameters that one would need to override to apply it everywhere (and with the next Gnome, there will again be two dozen more to add?).
The above method just seems to be the best way to tweak the looks - at least the colors, since that is all you can do this way. If you want to customize more, you probably have to do a complete theme, at which point you probably have to redo it for every new version. And to pick on Miguel de Icaza: the kernel APIs are extremely stable, in particular compared to the mess that Gnome has been across versions. And at every new iteration, they manage to offend a lot of their existing users (and end up looking more and more like Apple - maybe we should copy more from what we are good at, instead of copying OS X and .NET?).
2012-10-22 17:31 — Categories: English LinuxPermaLink & Comments

Google Plus replacing blogs not Facebook

When Google launched Google+, a lot of people were very sceptical. Some outright claimed it to be useless. I must admit, it has a number of functions that really rock.
Google Plus is not a Facebook clone; it does not try to mimic Facebook that much. To me, it looks much more like a blog thing: a blog system where everybody has to have a Google account and can then comment (plus, you can restrict access and share only with some people). It also encourages you to share shorter posts. Successful blogs always tried to make their posts "articles"; now the posts themselves are merely comments - though not as crazily short as on Twitter (it is not a Twitter clone either), and it does have rich media content, too.
Those who expect it to replace their Facebook, where the interaction is all about personal stuff, will be somewhat disappointed, because IMHO it encourages the small-talk type of interaction much less.
However, it won a couple of pretty high profile people to share their thoughts and web discoveries with the world. Some of the most active users I follow on Google Plus are: Linus Torvalds and Tim O'Reilly (of the publishing house O'Reilly)
Of course I also have a number of friends that share private stuff on Google Plus. But in my opinion the strength of Google Plus is in sharing publicly. Since Google is the king of search, they can feed your friends' shares into your regular search results, and there is also a pretty interesting search within Google Plus itself. The key difference is that with this search, the focus is on what is new. Regular web search is also a lot about searching for old things (where you did not bother to remember the address or bookmark the site - and mind you, today a lot of people even "google for Google" ...). For example, I like the Plus search for data mining because it occasionally has some interesting links in it. A lot of the stuff comes in again and again, but using the "j" and "k" keys, I can quickly scroll through these results to see if there is anything interesting. And there are quite a lot of interesting things I've discovered this way.
Note that this can change anytime. And maybe it is because I'm interested in technology stuff that it works well for me. But say you are more into HDR photography than me (I think HDR images look unreal, as if someone has done way too much contrast and edge enhancement on the image). Go there, and press "j" a number of times to browse through some HDR shots. That is a pretty neat search function. And if you come back tomorrow, there will likely be new images!
Facebook tried to clone this functionality. Google+ launched in June 2011, and in September 2011, Facebook added "subscribers". So they realized the need for having "non-friends" that are interested in what you are doing. Yet, I don't know anybody actually using it. And the public posts search is much less interesting than that of Google Plus, and the nice keyboard navigation is also missing.
Don't get me wrong, Facebook still has its uses. When I travel, Facebook is great for getting in contact with locals to go swing dancing. There are a number of events where people only invite you on Facebook (and that is one of the reasons why I've missed a number of events - because I don't use Facebook that much). But mind you, a lot of the stuff that people share on Facebook is also really boring.
And that will actually be the big challenge for Google: keeping the search results interesting. Once you have millions of people there sharing pictures of lolcats - will it still return good results? Or will just about every search give you more lolcats?
And of course, spam. The SEO crowd is just warming up to exploring the benefits of Google Plus. And there are quite some benefits to be gained from connecting web pages to Google Plus, as this will make your search results stick out somehow, or maybe give them that little extra edge over other results. But just like Facebook was at some point heavily spammed, when every little shop was setting up its Facebook pages, inviting everyone to all the events and so on - this is bound to happen on Google Plus, too. We'll see how Google then reacts, and how quickly and effectively.
2012-09-09 16:01 — Categories: English Web SEOPermaLink & Comments

ELKI call for contributions

ELKI is a data mining software project that I have been working on for the last years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the script elki to launch the MiniGUI and elki-cli to run from the command line.

The key feature that sets ELKI apart from existing open source tools used in data mining (e.g. Weka and R) is that it has support for index structures to speed up algorithms, and a very modular architecture that allows various combinations of data types, distance functions, index structures and algorithms. When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set with 110250 images described by 8-dimensional color histograms. This is a decently sized dataset: it takes long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice the speed of the more flexible "flexclust" version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as opposed to 11 hours (Weka; update 11.1.2013: 84 minutes, after some code cleanup).
The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which for example does not assume distances to be of type double, and just has a lot of glue code in between. However, the Java HotSpot compiler obviously lives up to expectations and manages to inline the whole distance computation into k-means, and then compiles it to a level comparable to C. R executes vectorized operations quite fast, but on non-native code as in the LOF example it can become quite slow, too. (I would not take Weka as a reference; in particular with DBSCAN and OPTICS there seems to be something seriously broken. Update 11.1.2013: Eibe Frank from Weka had a look at Weka's DBSCAN and removed some unnecessary safety checks in the code, yielding a 7.5x speedup. Judging from a quick look at it, the OPTICS implementation actually is not even complete, and both implementations copy all data out of Weka into a custom linear database, process it there, then feed the result back into Weka. They should just drop that "extension" altogether. The much newer and more Weka-like LOF module is much more comparable.)

Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm, because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain - I dare say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end they do not really fit together.

Anyway, ELKI is a cool research project. It keeps on growing, we have a number of students writing extensions as part of their thesis. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and use my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and such hurdles), but then it is a very powerful research tool.

But there are just many more algorithms, published sometime, somewhere, but rarely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them on their actual data. So far, ELKI was mostly used for algorithmic research, but it is starting to move out into the "real" world. More and more people that are not computer scientists are starting to use ELKI to analyze their data, because it has algorithms that no other tools have.

I tried to get ELKI into the "Google Summer of Code", but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will not be able to do myself the next years, unfortunately.

  • A web browser frontend would be cool. Maybe even connected to Google Refine, using Refine for preprocessing the data, then migrating it into ELKI for analysis. The current visualization engine of ELKI is using SVG - this should be fairly easy to port to the web browser. Likely, the web browser will even be faster than the current Apache Batik renderer.
  • Visual programming frontend. Weka, RapidMiner, Orange: they all have visual programming style UIs. This seems to work quite well to model the data flow within the analysis. I'd love to see this for ELKI, too.
  • Cluster/cloud backend. ELKI can already handle fairly large data sets on a big enough system. If someone spends extra effort on the index structures, the data won't even need to fit into main memory anymore. Yet, everybody now wants "big data", and parallel computation probably is the future. I'm currently working on some first Hadoop YARN based experiments with ELKI. But this is a huge field, turning this into a true "elk yarn". I will likely only lay some foundations (unless I get funding to continue as a PostDoc on this project; I sure hope to get to do at least a few years of postdoc somewhere, as I really enjoy working with students on this kind of project).
  • New visualization engine. The current visualization engine, based on Apache Batik and SVG, is quite okay. It does what I need, which is to get a quick glance at the results and the ability to export them for publications in print quality (in particular, I can easily edit the SVG files with Inkscape). But it is not really something fancy (although we have a lot of cool visualizations). And it is slow. I haven't found a portable and fast graphics toolkit for Java yet that can produce SVG files. There is a lot of hype around Processing, for example, but it seems to be too much about art for me. In fact, I'd love to use something like Clutter or Cairo, but getting them to work on Windows and Mac OS X will likely be a pain.
  • Human Computer Interaction (HCI). This is in my opinion the biggest challenge we are facing with all the "big data" stuff. If you really go into big data (and not just run Hadoop and Mahout on a single system; yes - a lot of people seem to do this), you will at some point need to go beyond just crunching the numbers. So far, the challenges that we are tackling are largely data summarization and selection. TeraSort is a cool project, and a challenge. Yet, what do you actually get from sorting this large amount of data? What do you get from running k-means on a terabyte? When doing data mining on a small data set, you quickly learn that the main challenge actually is preprocessing the data and choosing the parameters the right way, so that your result is not bogus. Unless you are doing simple prediction tasks, you often don't have a clearly defined objective. Sure, when predicting churn rates, you can hope to just throw all the data into a big cloud and hope you get some enlightenment out of it. But when you are doing cluster analysis or outlier detection - unsupervised methods - the actual objective by definition cannot be hardcoded into a computer. The key objective then is to learn something new about the data set. But if you want your user to learn something about the data set, you will have to have the user guide the whole process, and you will have to present results to the user. This gets immensely more difficult with larger data; big data just no longer looks like this. And neither are the algorithms as simple as k-means or hierarchical clustering. Hierarchical clustering is good for teaching the basic ideas of clustering, but you will not be using a dendrogram for a huge data set. Plus, it has a naive complexity of O(n^3), and O(n^2) for some special cases - too slow for truly big data.
    For the "big data future", once we get over all the excitement of being able to just somehow crunch these numbers, we will need to seriously look into what to do with the results (in particular, how to present them to the user), and how to make the algorithms accessible and usable for non-techies. Right now, you cannot expect a social sciences researcher to be able to use a Hadoop cluster, yet still make sense of the results. But if you are smart enough to actually solve this, and open up "big data processing" to the average non-IT user, this will be big.
  • Oh, and of course there are just hundreds of algorithms not yet available (accessible) as open source. Not in ELKI, and usually not anywhere else either. Just to name a few from my wishlist (I could probably implement many of them in a few hours in ELKI, but I don't have the time to do so myself; plus they are good student or starter projects to get used to ELKI): BIRCH, CLARA, CLARANS, CLINK, COBWEB, CURE, DOC, DUSC, EDSC, INSCY, MAFIA, P3C, SCHISM, STATPC, SURFING, and many more.

If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them and adding some documentation. Because if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits just do get cited more, because people compare against them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented at least to some extent for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.

2012-09-02 13:40 — Categories: English Research Java Coding ELKIPermaLink & Comments

ResearchGate Spam

Update Dec 2012: ResearchGate still keeps on sending me their spam. Most of the colleagues I had who tried out RG have apparently deleted their accounts there by now, so the invitation mails are becoming fewer.

Please do not try to push this link on Wikipedia just because you are also annoyed by their emails. My blog is not a "reliable source" by Wikipedia standards. It solely reflects my personal view of that web site, not journalistic or scientific research.

The reason why I call ResearchGate spam is the weasel words they use to trick authors into sending the invitation spam. Here is the text that comes with the checkbox you need to uncheck (from the ResearchGate "blog"):

Add my co-authors that are already using ResearchGate as contacts and invite those who are not yet members.
See how it is worded so that it sounds much more like "link my colleagues that are already on ResearchGate" instead of "send invitation emails to my colleagues"? It deliberately avoids mentioning "email", too. And according to the ResearchGate news post, this is hidden in "Edit Settings", too (I never bothered to try it -- I do not see any benefit to me in their offers, so why should I?).

Original post below:


If you are in science, you probably already received a couple of copies of the ResearchGate spam. They are trying to build a "Facebook for scientists", and so far, their main strategy seems to be aggressive invitation spam.

So far, I've received around 5 of their "invitations", which essentially sound like "Claim your papers now!" (without actually getting any benefit). When I asked my colleagues about these invitations, none of them actually meant to invite me! This is why I consider this behaviour of ResearchGate to be spam. Plus, at least one of these messages was a reminder, not triggered by user interaction.

Right now, they claim to have 1.9 million users. They also claim "20% interact at least once a month". However, they have around 4000 Twitter followers and Facebook fans, and the top topics on their network are at around 10000-50000 users. That is probably a much more realistic estimate of the user count: 4k-40k. And these "20%" that interact might just be the 20% the site grew by in this timeframe, who happened to click on the sign-up link. For a "social networking" site, these numbers are pointless anyway; that is probably even less than MySpace.

Because I do not see any benefit in their offers! Before going on an extremely aggressive marketing campaign like this, they really should consider actually having something to offer...

And the science community is very much about not wasting time. It is a dangerous game that ResearchGate is playing here. It may appeal to their techies and investors to artificially inflate their user numbers into the millions. But if you pay for the user numbers with your reputation, that is a bad deal! Once you have the reputation of being a spammer (and mind you, every scientist I've talked to so far complained about the spam and said "I clicked on it only to make it stop sending me emails"), it is hard to be taken seriously again. The scientific community is a lot about reputation, and ResearchGate is screwing up badly on this.

In particular, according to the ResearchGate founder on Quora, the invitations are opt-out when "claiming" a paper. Sorry, this is wrong. Don't make users annoy other users by sending them unwanted invitations to a worthless service!

And after all, there are alternatives such as Academia.edu and Mendeley that do offer much more benefit. (I do not use these either, though. In my opinion, they also do not offer enough benefit to bother going to their website. I have mentioned the inaccuracy of Mendeley's data - and the lack of an option to get it corrected - before in an earlier blog post. Don't rely on Mendeley as a citation manager! Their citation data is unreviewed.)

I'm considering sending ResearchGate (they're Berlin-based, but there may also be a US office you could direct this to) a cease and desist letter, denying them the right to store personal information on me and to use my name on their websites to promote their "services". They may have visions of a more connected and more collaborative science, but they actually don't have new solutions. You can't solve everything by creating yet another web forum and "web-2.0-izing" everything. Although many of the web 2.0 bubble boys don't want to hear it: you won't solve world hunger and AIDS by doing another website. And there is a life outside the web.

2012-08-20 20:20 — Categories: English Research WebPermaLink & Comments

DMOZ dieing

Sometime in the late 1990s I became a DMOZ editor for my local area. At that time, when the internet was a niche thing and I was still a kid, I was actually operating a web site for a non-profit organization that had a similar goal as the corresponding DMOZ category.
In the following years, I would occasionally log in and try to review some pages. It was a really scary experience: it was still exactly the same web-0.5 interface. You had a spreadsheet type of view, tons of buttons, and it would take some 10 page loads just to review a single site. A lot of the time, you would end up searching for a more appropriate category, copying the URL, replacing some URL-encoded special characters, and pasting it into one of 30 fields on the form, just to move the suggested site to a more appropriate category. Most of the edits would be by bots that detected a dead link and disabled it by moving it to the review stage. At the same time, every SEO manual said you need to be listed in DMOZ, so people would mass-submit all kinds of stuff to DMOZ in any category it could in any way fit in.
Then AOL announced DMOZ 2.0, and everybody probably thought: about time to refresh the UI and make everything more usable. But it didn't. First of all, it came late (announced in 2008, actually delivered sometime in 2010), and then it was incredibly buggy in the beginning. They re-launched 2.0 at least two times. For quite some time, editors were unable to log in.
When DMOZ 2.0 came, my account was already "inactive", but I was able to get it re-activated. And it still looked the same. I read they changed from Times to Arial, and probably changed some CSS. But other than that, it was still as complicated to edit links as it could possibly be. So I did just a few changes, then largely lost interest again.
Over the last year, I must have tried to give it another go multiple times. But my account had expired again, and I never got a reply to my reinstatement request.
A year ago, finally, Google Directory - the most prominent use of DMOZ/ODP data, although its users were totally unaware of it - was discontinued, too.
So by now, DMOZ seems to be as dead as it can get (they don't even bother to answer former contributors that want to get reinstated). The links are old, and if it weren't for bots disabling dead sites, it would probably look like an internet graveyard. But this poses an interesting question: will someone come up with a working "web 2.0 social" version of the "directory" concept (I'm not talking about Digg and those classic "social bookmarking" dead ducks)? Something that strikes the right balance between, on the one hand, allowing web page admins (and the SEO gold diggers) to promote their sites (and keep the data accurate), and on the other hand crowd-sourcing the quality control, while also opening the data? To some extent, Facebook and Google+ can do this, but they are largely walled gardens, and they don't have real social quality assurance; money is key there.
2012-06-09 18:53 — Categories: English WebPermaLink & Comments

Are likes still worth anything?

When Facebook became "the next big thing", "like" buttons popped up on various web sites. And of course "going viral" was the big thing everybody talked about, in particular SEO experts (or those who would like to be).
But things have changed. In particular Facebook has. In the beginning, any "like" would be announced in the newsfeed to all your friends. This was what allowed likes to go viral, when your friends re-liked the link. This is what made it attractive to have like buttons on your web pages. (Note that I'm not referring to "likes" of a single Facebook post; they are something quite different!)
Once everybody "knew" how important this was, everybody tried to make the most out of it - in particular scammers, viruses and SEO people. Every other day, some clickjacking application would flood Facebook with likes. Every backwater website was trying to get a larger audience by getting "liked". But at some point Facebook just stopped showing "likes". This is not bad; it is the obvious reaction when people get too annoyed by the constant "like spam". Facebook had to put an end to this.

But by now, a "like" is pretty much worthless (in my opinion). Still, many people following "SEO tutorials" are all crazy about likes. Instead, we should reconsider whether we really want to slow down our site loading by having like buttons on every page. A like button is not as lightweight as you might think it is. It is a complex piece of JavaScript that tries to detect clickjacking attacks, and it in fact invades your users' privacy, up to the point where, for example in Germany, it may even be illegal to use the Facebook like button on a web site.
In a few months, the SEO people will realize that "likes" are a fad, and will likely all try to jump on the Google+ bandwagon. Google+ is probably not half as much a "dud" as many think it is (because their friends are still on Facebook, and because you cannot scribble birthday wishes on a wall in Google+). The point is that Google can actually use the "+1" likes to improve everyday search results. Google for something a friend liked, and it will show up higher in the search results, and Google will show the friend who recommended it. Facebook cannot do this, because it is not a search engine (well, you can use it for searching for people, although Ark probably is better at this, and nobody searches for people anywhere near as often as they do regular web searches). Unless they go into a strong partnership with Microsoft Bing or Yahoo, Facebook "likes" can never be as important as Google "+1" likes. So don't underestimate the Google+ strategy in the long run.
There are more points where Facebook by now is much less useful than it used to be. For example, event invitations. When Facebook was in full growth, you could essentially invite all your friends to your events. You could also use lists to organize your friends, and invite only the appropriate subset, if you cared enough. The problem again was: nobody cared enough. Everybody would just invite all their friends, and you would end up getting "invitation spam" several times a day. So again, Facebook had to change and limit the invitation capabilities. You can no longer invite everyone, or even just everyone on one particular list. There are some tools and tricks that can work around this to some extent, but once everybody uses them, Facebook will just have to cut it down even further.
Similarly, you might remember "superpoke" and all the "gift" applications. Facebook (and the app makers) probably made a fortune on them with premium pokes and gifts. But then this too reached a level that started to annoy the users, so they had to cut down the ability of applications to post to walls. And boom, this segment essentially imploded. I haven't seen numbers on Facebook gaming, and I figure that by doing some special setup for the games Facebook managed to keep them somewhat happy. But many will remember the time when the newsfeed would be full of Farmville and Mafia Wars crap ... it just no longer works this way.

So when working with Facebook and the like, you really need to be on the move. Right now it seems that groups and applications are more useful for getting that viral dream going. A couple of apps, such as Yahoo's, currently require you to install their app (which then may post to your wall on your behalf and get your personal information!) to follow a link shared this way, and can then actively encourage you to reshare. And messages sent to a "Facebook group" are more likely to reach people that aren't direct friends of yours. When friends actually "join" an event, this currently shows up in the news feed. But all of this can change with zero days' notice.
It will be interesting to see if Facebook can, in the long run, keep up with Google's ability to integrate the +1 likes into search results. It probably takes just a few success stories in the SEO community for collecting +1s instead of Facebook likes to become the "next big thing" in SEO. Then Google just has to wait for them to virally spread +1 adoption. Google can wait - its Google Plus growth rates aren't bad, and they already have a working business model that doesn't rely on the extra growth - they are big already and make good profits.
Facebook, however, is walking a surprisingly thin line. They need tight control on the amount of data shared (which is probably why they try to do this with "magic"). People don't want to have the impression that Facebook is hiding something from them (although it is in fact suppressing a huge part of your friends' activity!), but they also don't want all this data spammed onto them. And in particular, Facebook needs to give web publishers and app developers the right amount of extra access to the users, while in turn keeping the major spam away from the users.

Independent of the technology and the actual products, it will be really interesting to see if we manage to find some way to keep the balance in "social" one-to-many communication right. It is not the fault of Facebook that many people "spam" all their friends with all their "data". Google's Circles probably aren't the final answer either. The reason why email still works rather well is probably that it makes one-to-one communication easier than one-to-many, that it isn't realtime, and that people expect you to put enough effort into composing your mails and choosing the right recipients for the message. Current "social" communication is pretty much posting everything to everyone you know, addressed "to whom it may concern". Much of it is in fact pretty non-personal or even non-social. We have definitely reached the point where more data is shared than is being read. Twitter is probably the most extreme example of a "write-only" medium. The average number of times a tweet is read by a human other than the original poster must be way below 1, and definitely much less than the average number of "followers".
So in the end, the answer may actually be a good automatic address book, with automatic groups and rich clients, to enable everybody to use email more efficiently. On the other hand, separating "serious" communication from "entertainment" communication may well be worth a separate communications channel - and email definitely is dated and has spam problems.
2012-04-11 21:07 — Categories: English SEO WebPermaLink & Comments

Google Scholar, Mendeley and unreliable sources

Google Scholar and Mendeley need to do more quality control.
Take for example the article
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Scientific and Statistical Database Management (SSDBM 2008)
(part of my diploma thesis).
Apparently, someone screwed up when entering the data into Mendeley and added the editors to the author list. Google then happily imported this data into Google Scholar, which now keeps on reporting the authors incorrectly, too. Of course, many people will in turn import this incorrect data into their BibTeX files, upload it to Mendeley, and so on ...
Yet neither Google Scholar nor Mendeley has an option for reporting such an error. They don't even seem to realize that SpringerLink - where the DOI points to - may be the more reliable source.
On the contrary, Google Scholar just started suggesting to me that the editors are coauthors ...
They really need to add an option to fix such errors. There is nothing wrong with having errors in gathered data, but you need to have a way of fixing them.
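For reference, a correct BibTeX entry for that paper would look roughly like the sketch below. This contains only the fields stated above; the entry key is made up, and page numbers and DOI are omitted rather than guessed:

@inproceedings{kriegel2008robustpca,
  author    = {Hans-Peter Kriegel and Peer Kr{\"o}ger and Erich Schubert and Arthur Zimek},
  title     = {A General Framework for Increasing the Robustness of {PCA}-Based Correlation Clustering Algorithms},
  booktitle = {Scientific and Statistical Database Management (SSDBM 2008)},
  year      = {2008}
}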
2012-03-14 12:52 — Categories: English ResearchPermaLink & Comments

Faster web with NoScripts ABE

NoScript (sorry, Firefox only - there is no comparable functionality available in Chrome) is a must-have add-on for safer web surfing. It not only prevents many clickjacking attacks, it can do much more for you.

In the default setting, NoScript will block any script that you did not explicitly whitelist. While this is a bit annoying in the beginning - you will have to whitelist most of your everyday web pages - it gives you quite some insight into the amount of tracking you are exposed to. A recent test showed that a typical newspaper website includes tracking code from more than 10 other web sites (mostly ad networks and social networks), and accepting those will probably pull in yet another set. Most of this happens in the background, tracking you across various web sites.

NoScript essentially forces you to make a decision for each site: permanently allow it, temporarily allow it, or block it. Since it blocks by default, you quickly see which sites work without scripts and which do not - and if a site doesn't work as expected and you need it, you can allow it with just a few clicks.

But there is more functionality hidden in there. NoScript has a feature called ABE, the "Application Boundaries Enforcer". It can be seen as a refinement of the whitelisting approach: you don't just whitelist web sites, but combinations of web sites. Here is a simple example of why and how this is useful. Consider these ABE rules:

# Only Facebook may embed Facebook
Site        .facebook.com .fbcdn.net .facebook.net
Accept from .facebook.com .fbcdn.net .facebook.net
Deny        INCLUSION POST

# Only Google may embed Google +1
Site        plusone.google.com
Accept from         google.com
Deny        INCLUSION POST

These rules are quite simple: essentially they say that no web site may embed Facebook content except Facebook itself, and no web site may embed the Google +1 button except Google. I chose these rules for two reasons:

  1. I never click "like" or "+1" buttons, so I might as well not load them in the first place.
  2. These websites tend to track your behaviour.

Note that I did not block these sites altogether. I can still visit them as usual if I want to. I even allowed plain links to them - just not scripts and similar embeddings.
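The same pattern works for other widely embedded widgets. For example, a rule along these lines should keep Twitter buttons from loading on third-party pages (an untested sketch - I have not verified that these are all the domains the Twitter widgets actually use):

# Only Twitter may embed Twitter
Site        .twitter.com .twimg.com
Accept from .twitter.com .twimg.com
Deny        INCLUSION POST

The structure is the same as above: only the sites listed after "Accept from" may include the content; everywhere else, inclusion and POST requests are denied, while plain navigation to the site still works.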

And this doesn't just increase your privacy (read this current article in the NYT for an example of the amount of tracking happening these days). It also makes web pages load faster, because you don't load all their cruft all the time - and you can live without them showing you videos from three different domains next to the article you actually want to read ...

Update: I've learned that newer versions of Chrome can actually filter on load (and not just on display), and there is a similar extension available called ScriptNo. The main reason I'm currently moving away from Chromium is that it wastes more memory than Firefox, and I'm always short on RAM.

2012-02-22 21:23 — Categories: English Web SecurityPermaLink & Comments

Class management system

Dear Lazyweb.
A friend of mine is looking for a small web application to manage tiny classes (as in courses, not as in computing). They usually span just four dates, and people will then often sign up for the next class. Usually 10-20 people per class, although some might not sign up via the internet.
We deliberately don't want to require them to fully register for the web site and go through all the registration and email verification trouble. Anything that takes more than filling out the obviously required form will just cause trouble.
At first this sounded like a common task, but all the systems I've seen so far are totally overpowered for it. There are no grades, no working groups, no "customer relationship management". Not much more is needed than the ability to easily configure the classes, have people book them, and get the list of signed-up users into a spreadsheet easily (CSV will do).
It must be able to run on a typical PHP+MySQL web host and be open source.
Any recommendations? Drop me a comment or an email at erich () debian org. Thank you.
2011-10-09 15:04 — Categories: English LinuxPermaLink & Comments

Privacy in the public opinion

Many people in the United States seem to hold the opinion that the "public is willing to give up most of their privacy", in particular when dealing with online services such as Facebook. I believe that in his keynote at ECML-PKDD, Albert-László Barabási of Harvard University expressed such a view, that this data will just become more and more available. I'm not sure if it was him or someone else (I believe it was someone else) who essentially claimed that "privacy is irrelevant". Another popular opinion is that "it's only old people who care about privacy".
However, just like politics, these things tend to oscillate from one extreme to the other. For example, in recent years in Europe, conservative parties were winning one election after another. Now, in France, the socialists have just won the Senate, the conservative parties in Germany are losing in one state after the other, and so on. And this will change back again, too. Democracy also lives off changing roles in government, as this both drives progress and fights corruption.
We might be seeing the one extreme in the United States right now, where people readily give away their location and interests for free access to a web site. This can swing back any time.
In Germany, one of the federal government parties - the liberal democrats, FDP - just dropped out of the Berlin state parliament, down to 1.8% of the vote. Yes, this is the party the German foreign minister, Guido Westerwelle, comes from. The Pirate Party [en.wikipedia.org] - much of whose program is about privacy, civil rights, copyright reform and the internet, and which didn't even run in the previous election since it was founded just 5 years ago - jumped to 8.9%, scoring higher than the liberal democrats did in the previous election. In 2009 they scored a surprisingly high 2% in the federal elections; current polls see them anywhere from 4% to 7% at the federal level, so they will probably get seats in parliament in 2013. (There are also other reasons why the liberal democrats have been losing voters so badly, though! Their current numbers indicate they might drop out of the federal parliament in 2013.)
The Greens in Germany, who are also very much oriented towards privacy and civil rights, are also on the rise; in March they became the second strongest party and senior partner in the governing coalition of Baden-Württemberg, which historically was a heartland of the conservatives.
So don't assume that privacy is irrelevant nowadays. Public opinion can swing quickly. In particular in democratic systems that have room for more than two parties - so probably not in the United States - such topics can actually influence elections a lot. Within 30 years of their founding, the Greens have come to regularly reach 20% in federal polls and up to 30% in some states. It doesn't look as if they are going to go away soon.
Also, don't assume that it's only old people who care about privacy - in Germany, the Pirate Party and the Greens in particular are favored by young people. The typical Pirate voter is less than 30 years old, male, has a higher education and works in the media or internet business.
In Germany, much of the protest for more privacy - and against the all-too-eager data collection by companies such as Facebook and Google - is driven by young internet users and workers. I believe this is similar in other parts of Europe; there are pirate parties all over Europe. And this can happen in the United States any time, too.
Electronic freedom - pushed for example by the Electronic Frontier Foundation, but also by the open source movement - does have quite a history in the United States. But open source in particular has made such huge progress in the last decade that these movements in the US may just be a bit out of breath right now. I'm sure they will come back with a strong push against the privacy invasions we're seeing right now. And that could well take down a giant like Facebook, too. So don't bet on people continuing to give up their privacy!
2011-09-28 09:29 — Categories: English PoliticsPermaLink & Comments

ELKI 0.4 beta release

Two weeks ago, I published the first beta of the upcoming ELKI 0.4 release. The accompanying publication at SSTD 2011 won the "Best Demonstration Paper Award"!
ELKI is a Java framework for developing data mining algorithms and index structures. It has indexes such as the R*-tree and M-tree, and a huge collection of algorithms and distance functions. These are all written rather generically, so you can build arbitrary combinations of indexes, algorithms and distances. There are also evaluation and visualization modules.
Note that I'm using "data mining" in the broad, original sense that focuses on knowledge discovery by unsupervised methods such as clustering and outlier detection. Today, many people just think of machine learning and "artificial intelligence" - or even worse: large-scale data collection - when they hear data mining. But there is much more to it than just learning!
Java comes at a certain price. The latest version already got around 50% faster than the previous release just by reducing Java boxing and unboxing, which puts quite some pressure on the memory management. So yes, you could implement these things in C and be a lot faster; but this is not production software. I need code that I can put students on, so they can work with it and extend it - that is much more important than getting maximum speed. You can still use it for prototyping: see what works, then implement just the parts you really need in a low-level language for maximum performance.
You can do some of that in Java, too. You could work on a large chunk of doubles and access them via the Unsafe class. But is that still Java, or aren't you actually writing plain C? In our framework, we want to be able to support non-numerical vectors and non-double distances, too, even when they are only applicable to certain specialized use cases. Plus, generic, Java-style code is usually much more readable, and the performance cost is not critical for research use.
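To illustrate the kind of boxing overhead I mean, here is a deliberately simplified toy example (this is not ELKI code, and the class name is made up): summing boxed Double objects allocates one object per value and stresses the garbage collector, while a primitive double[] is a single flat allocation.

import java.util.ArrayList;
import java.util.List;

public class BoxingDemo {
  public static void main(String[] args) {
    final int n = 1000000;

    // Boxed version: every element is a separate Double object on the heap.
    List<Double> boxed = new ArrayList<Double>(n);
    for (int i = 0; i < n; i++) {
      boxed.add((double) i); // autoboxing allocates a Double
    }
    double sum1 = 0;
    for (Double d : boxed) {
      sum1 += d; // unboxing on every access
    }

    // Primitive version: one flat array, no per-element objects.
    double[] primitive = new double[n];
    for (int i = 0; i < n; i++) {
      primitive[i] = i;
    }
    double sum2 = 0;
    for (double d : primitive) {
      sum2 += d;
    }

    System.out.println(sum1 + " " + sum2);
  }
}

The exact speedup obviously depends on the JVM and the workload, but the allocation pattern of the boxed version is exactly the kind of memory pressure that the 0.4 release reduces.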
Release 0.4 has plenty of under-the-hood changes. It allows multiple indexes to exist in parallel, and it supports multi-relational data. There are also about a dozen new algorithms, mostly from the geo/spatial outlier detection field, which were used for the demonstration. It also includes, for example, methods for rescaling the output of outlier detection methods to a more sensible numerical scale, for visualization and comparison.
You can install ELKI on a Debian testing or unstable system with the usual "aptitude install elki" command. It will install a menu entry for the GUI and also includes the command-line launcher "elki-cli" for batch operation. The "-h" flag produces extensive online help, or you can just copy the parameters from the GUI. Because it reuses Java packages already in Debian, such as Batik and FOP, it is also a smaller download. I guess the package will at some point also transition to Ubuntu - and since it is Java, you can just download and install it there anyway.
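For reference, this is all it should take on Debian testing or unstable, as described above (run the install as root or via sudo):

aptitude install elki   # installs the GUI menu entry and elki-cli
elki-cli -h             # print the extensive online help for batch use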
2011-08-29 18:20 — Categories: English Coding Research ELKI JavaPermaLink & Comments

Missing in Firefox: anon and regular mode in parallel

The killer feature that Chrome has and Firefox is missing is quite simple: the ability to have "private" (aka "anonymous") and regular tabs open at the same time. As far as I can tell, Firefox can only be in one of these modes at a time.
My vision of a next-generation privacy browser would essentially allow you to have tabs in individual modes, with some simple rules for mode switching. Essentially, unknown sites should always be opened in anonymous mode. Only sites that I register for should automatically switch to a tracked mode where cookies are kept, for my own convenience. And then there are sites that should be isolated from any other content for safety, such as my banking site. Going to these sites should require me to manually switch modes (except when using my bookmark), and embedding or framing them should be impossible.
On a side note, having Tor tabs would also be nice.
2011-08-22 18:46 — Categories: English WebPermaLink & Comments