Vitavonni

Google Hangouts drops XMPP support

Update: today I've been receiving XMPP messages in the Google+ variant of Hangouts. It looks as if it is currently back (at least while you are logged in via XMPP - I haven't tried without pidgin running at the same time yet). Let's just hope that XMPP federation will continue to be supported in the long run.
It's been all over the internet, so you probably heard it already: Google Hangouts no longer receives messages from XMPP users. Before, you could easily chat with "federated" users from other Jabber servers.
While of course the various open-source people are not amused -- for me, most of my contacts disappeared, so I then uninstalled Hangouts to get back Google Talk (apparently this works if Talk was preinstalled in your phone's firmware) -- this bears some larger risks for Google:
  • Reputation: Google used to have the reputation of being open. XMPP support was open, the current "Hangups" protocol is not. This continuing trend of abandoning open standards and moving to "walled garden" solutions will likely harm the company's reputation in the open source community.
  • Legal risk of an antitrust action: Before, other competitors could interface with Google using an independent and widely accepted standard. An example is United Internet in Germany, which operates, among others, the Web.de and GMX platforms, mail.com, and the 1&1 internet provider. By effectively locking out its competitors - without an obvious technical reason, as XMPP was working fine just before, and apparently continues to be used at Google, for example in AppEngine - Google runs a high risk of an antitrust action in Europe. If I were 1&1, I would try to get my lawyers started... or if I were Microsoft, who apparently just wanted to add XMPP messaging to Hotmail?
  • Users: Google+ is not that big yet. Especially in Germany. Since 90% of my contacts were XMPP contacts, where am I likely going to move to: Hangouts or another XMPP server? Or back to Skype? I still use Skype for more voice calls than Google (which I used like twice), because there are some people that prefer Skype. One of these calls probably was not using the Google plugin, but an open source phone. Because with XMPP and Jingle, my regular chat client would interoperate. And in fact, the reason I started using Google Talk in the first place was because it would interoperate with other networks, too, and I assumed they would be good at operating a Jabber server.
In my opinion, Google needs to quickly restore a functioning XMPP bridge. It is okay if they offer add-on functionality only for Hangout users (XMPP was always designed to allow for add-on functionality); it is also okay if they propose an entirely new open protocol to migrate to in the long run, if they can show good reasons such as scalability issues. But the way they approached the Hangup rollout looks like a big #fail to me.
Oh, and there are other issues, too. For example, Linus Torvalds complains about the fonts being screwed up (not hinted properly) in the new Google+, and others complain about broken presence indicators (but then you might as well just send an email, if you can't tell whether the recipient will be able to receive and answer right away). Using Hangouts will apparently also lose you Google Voice support (for now -- rumor has it that Voice will also be replaced by Hangups entirely). The only thing that seems to get positive press is the easter eggs...
All in all, I'm not surprised to see over 20% of users giving the lowest rating in the Google Play Store, and less than 45% giving the highest rating - for a Google product, this must be really low.
2013-05-21 15:41 — Categories: English Web Google Technology

ELKI data mining in 2013

ELKI, the data mining framework I use for all my research, is coming along nicely, and will see continued progress in 2013. The next release is scheduled for SIGMOD 2013, where we will be presenting the novel 3D parallel coordinates visualization we recently developed. This release will bear the version number 0.6.0.
Version 0.5.5 of ELKI has been in Debian unstable since December (version 0.5.0 will be in the next stable release) and is in Ubuntu raring. The packaged installation can share its dependencies with other Debian packages, so it is smaller than the download from the ELKI web site.
If you are developing cluster analysis or outlier detection algorithms, I would love to see them contributed to ELKI. If I get clean and well-integrated code by mid-June, your algorithm could be included in the next release, too. Publishing your algorithms as source code in a larger framework such as ELKI will often give you more citations, because it is then easier to compare with your algorithm and to try it on new problems. And, well, citation counts are a measure that administration loves to use to judge researchers ...
So what else is happening with ELKI:
  • The new book "Outlier Analysis" by C. C. Aggarwal mentions ELKI for visual evaluation of outlier results as well as in the "Resources for the Practitioner" section and cites around 10 publications closely related to ELKI.
  • Some classes for color feature extraction of ELKI have been contributed to jFeatureLib, a Java library for feature detection in image data.
  • I'd love to participate in the Google Summer of Code, but I need a contact at Google to "vouch" for the project, otherwise it is hard to get in. I've been sending a couple of emails, but so far have not heard back much yet.
  • As the performance of SVG/Batik is not too good, I'd like to see more OpenGL based visualizations. This could also lead to an Android based version for use on tablets.
  • As I'm not a UI guy, I would love to have someone make a fancier UI that still exposes all the rich functions we have. The current UI is essentially an automatically generated command line builder - which is nice, as new functionality shows up without the need to modify UI code. It's good for experienced users like me, but makes it hard for beginners to get started.
  • I'd love to see integration of ELKI with e.g. OpenRefine / Google Refine to make it easier to do appropriate data cleaning and preprocessing.
  • There is work underway for a distributed version running on Hadoop/YARN.
2013-02-28 10:55 — Categories: English Coding Debian Research

Phoronix GNOME user survey

While not everybody likes Phoronix (common complaints include tabloid journalism), they are doing a GNOME user survey again this year. If you are concerned about Linux on the desktop, you might want to participate; it is not particularly long.
Unfortunately, "the GNOME Foundation still isn't interested in having a user survey", and may again ignore the results; and already last year you could see a lot of articles along the lines of The Survey That GNOME Would Rather Ignore. One more reason to fill it out.
2012-11-21 09:30 — Categories: English Linux

Migrating from GNOME3 to XFCE

I have been a GNOME fan for years. I actually liked the switch from 1.x to 2.x, and at some point switched to 3.x when it became somewhat usable. At some point, I even started some small Gnome projects, one even was uploaded to the Gnome repositories. But I didn't have much time for my Linux hobby anymore back then.
However, I am now switching to XFCE. And for all I can tell, I am about the last one to make that switch. Everybody I know hates the new Gnome.
My reason is not emotional. It's simple: I have systems that don't work well with OpenGL, and thus don't work well with Gnome shell. Up to now, I have been able to live fine with "Fallback mode" (aka: Gnome classic). It works really well for me, and does exactly what I need. But it has been all over the media: Gnome 3.8 will drop 'fallback' mode.
Now the choice is obvious: instead of switching to shell, I go to XFCE. Which is much closer to the original Gnome experience, and very productivity oriented.
There are tons of rants on GNOME 3 (for one of the most detailed ones, see Gnome rotting in threes, going through various issues). Something must be very wrong about what they are doing to receive this many sh*tstorms all the time. Every project receives some. I've even received a share of the Gnome 2 storms when Galeon (an early Gnome browser) made the move and started dropping some of the hard-to-explain and barely used options that would break with every other Mozilla release. And Mozilla embedding was a major pain these days. Yet, for every feature there would be some user somewhere that loved it, and as Debian maintainer of Galeon, I got to see all the complaints (and at the same time was well aware of the bugs caused by the feature overload).
Yet with Gnome 3, things are IMHO a lot different. In Gnome 2, it was a lot about making things more usable as they are, a bit cleaner and more efficient. With Gnome 3, it seems to be about experimenting with new stuff. Which is why it keeps on breaking APIs all the time. For example, theming GTK 3 is constantly broken; most of the themes available just don't work. Similarly, Gnome Shell extensions - most of them work with exactly one version of Gnome Shell (doesn't this indicate the author has abandoned Gnome shell?).
But the one thing that was really sticking out was when I updated the PC of my dad. Apart from some glitches, he could not even shut down his PC with Gnome-shell. Because you needed to press the Alt button to actually get a shutdown option.
This is indicative of where Gnome is heading: something undefined in between PCs, tablets, media centers and mobile phones. They just decided that users don't need to shut down anymore, so they could as well drop that option.
But the worst thing about the current state of GNOME is: They happily live with it. They don't care that they are losing users by the dozens. Because to them, these are just "complainers". Of course there is some truth in "Complainers gonna complain and haters gonna hate". But what Gnome is receiving is way above average. At some point, they should listen. Comment chains 200 posts long, from dozens of people on LWN, are not just your average "complaints". They are an indicator that a key user base is unhappy with the software. In 2010 GNOME 2 had 45% market share in the LinuxQuestions poll, XFCE had 15%. In 2011, GNOME 3 had 19%, and XFCE jumped to 28%. And I wouldn't be surprised if GNOME 3 shell (not counting fallback mode) would clock in at less than 10% in 2012 - despite being the default.
Don't get me wrong: there is a lot about Gnome that I really like. But as they decided to drop my preferred UI, I am of course looking for alternatives. In particular, as I can get lots of the Gnome 3 benefits with XFCE. There is a lot in the Gnome ecosystem that I value, and that IMHO is driving Linux forward. Network-manager, Poppler, Pulseaudio, Clutter just to name a few. Usually, the stuff that is modular is really good. And in fact I have been a happy user of the "fallback" mode, too. Yet, the overall "desktop" Gnome 3 goals are in my opinion targeting the wrong user group. Gnome might need to target Linux developers more again, to keep a healthy development community around. Frequently triggering sh*tstorms by high-profile people such as Linus Torvalds is not going to strengthen the community. There is nothing wrong in the FL/OSS community with encouraging people to use XFCE. But these are developers that Gnome might need at some point.
On a backend / technical level (away from the Shell/UI stuff that most of the rants are about), my main concern about the Gnome future is GTK3. GTK2 was a good toolkit for cross-platform development. GTK3 as of now is not, but is largely a Linux/Unix only toolkit - in particular, because there apparently is no up to date Win32 port. With GTK 3.4 it was said that they are now working on Windows - but as of GTK 3.6 they are still nowhere to be found. So if you want to develop cross-platform, as of now, you better stay away from GTK 3. If this doesn't change soon, GTK might sooner or later lose the API battle to more portable libraries.
Update: Some people at reddit seem to read this as if I am switching out of protest. This is incorrect. As "fallback" mode is now officially discontinued, I switch to the next best choice for me: XFCE. And I do this switch before things start breaking with some random upgrade. I know that XFCE is a good choice, so why not switch early? In fact, I've right now only given XFCE a test drive, but it already feels right, and maybe even slightly better than fallback mode.
2012-11-13 14:14 — Categories: English Linux Coding

DBSCAN and OPTICS clustering

DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.
An interesting property of density based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow "noise" objects that do not belong to any of the clusters. K-means, for example, partitions the data space into Voronoi cells (some people claim it produces spherical clusters - that is incorrect). See Wikipedia for the true shape of K-means clusters and an example that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects) - not surprisingly, as we tend to experiment with artificial data generated by a number of Gaussian distributions.
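To illustrate the Voronoi cell remark: the assignment step of k-means simply gives each point to its nearest center, and "nearest center" is exactly the definition of a Voronoi cell. The following is a minimal sketch with made-up centers and a hypothetical class name; it is not code from ELKI.

```java
// Sketch of the k-means assignment step (hypothetical class, not ELKI code).
// A point belongs to the center it is closest to, i.e. to that center's
// Voronoi cell. Cell boundaries are straight lines (hyperplanes), not circles.
class NearestCenterSketch {
  /** Index of the center closest to p, by squared Euclidean distance. */
  static int nearestCenter(double[] p, double[][] centers) {
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centers.length; c++) {
      double d = 0;
      for (int i = 0; i < p.length; i++) {
        double diff = p[i] - centers[c][i];
        d += diff * diff;
      }
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Two example centers; the cell boundary is the vertical line x = 5.
    double[][] centers = { { 0, 0 }, { 10, 0 } };
    System.out.println(nearestCenter(new double[] { 4, 3 }, centers));
    System.out.println(nearestCenter(new double[] { 6, -7 }, centers));
  }
}
```

With two centers, every point left of the line x = 5 lands in the first cell and every point right of it in the second, no matter how far away it is.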
The key parameter to DBSCAN and OPTICS is the "minPts" parameter. It roughly controls the minimum size of a cluster. If you set it too low, everything will become clusters (OPTICS with minPts=2 degenerates to a type of single link clustering). If you set it too high, at some point there won't be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you for example expect clusters to typically have 100 objects, I'd start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.
The more difficult parameter for DBSCAN is the radius. In some cases, it will be very obvious. Say you are clustering users on a map. Then you might know that a good radius is 1 km. Or 10 km. Whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
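To show how the two parameters interact, here is a minimal DBSCAN sketch using plain arrays and a linear scan for the radius query. The class name, the label constants and the Euclidean distance are assumptions made for illustration; this is not the ELKI implementation, and without an index it runs in quadratic time.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal DBSCAN sketch (hypothetical class, not the ELKI code).
class DbscanSketch {
  static final int UNVISITED = -2, NOISE = -1;

  /** Returns a cluster label per point; NOISE (-1) marks noise objects. */
  static int[] dbscan(double[][] data, double eps, int minPts) {
    int[] label = new int[data.length];
    Arrays.fill(label, UNVISITED);
    int cluster = 0;
    for (int i = 0; i < data.length; i++) {
      if (label[i] != UNVISITED) continue;
      List<Integer> neighbors = rangeQuery(data, i, eps);
      if (neighbors.size() < minPts) { // not dense: noise, for now
        label[i] = NOISE;
        continue;
      }
      label[i] = cluster; // i is a core point and starts a new cluster
      ArrayDeque<Integer> queue = new ArrayDeque<>(neighbors);
      while (!queue.isEmpty()) {
        int j = queue.poll();
        if (label[j] == NOISE) label[j] = cluster; // border point
        if (label[j] != UNVISITED) continue;
        label[j] = cluster;
        List<Integer> nj = rangeQuery(data, j, eps);
        if (nj.size() >= minPts) queue.addAll(nj); // j is core: expand
      }
      cluster++;
    }
    return label;
  }

  /** All points within distance eps of point q (including q itself). */
  static List<Integer> rangeQuery(double[][] data, int q, double eps) {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      double sum = 0;
      for (int d = 0; d < data[q].length; d++) {
        double diff = data[q][d] - data[i][d];
        sum += diff * diff;
      }
      if (sum <= eps * eps) result.add(i);
    }
    return result;
  }

  public static void main(String[] args) {
    double[][] data = { { 0, 0 }, { 0, 1 }, { 1, 0 }, { 1, 1 },
        { 10, 10 }, { 10, 11 }, { 11, 10 }, { 11, 11 }, { 50, 50 } };
    System.out.println(Arrays.toString(dbscan(data, 1.5, 3)));
  }
}
```

In this sketch the point itself counts towards minPts. The isolated point at (50, 50) ends up as noise, while the two squares form one cluster each.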
OPTICS is based on a very clever idea: instead of fixing minPts and the radius, we only fix minPts, and plot the radius at which an object would be considered dense by DBSCAN. In order to sort the objects on this plot, we process them in a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example of such a plot.
OPTICS comes at a cost compared to DBSCAN. Largely because of the priority heap, but also as the nearest neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won't produce a strict partitioning. Primarily it produces this plot, and in many situations you will actually want to visually inspect the plot. There are some methods to extract a hierarchical partitioning out of this plot, based on detecting "steep" areas.
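The "radius at which an object would be considered dense" is usually called the core distance: the distance to the minPts-th nearest neighbor. The sketch below computes only this quantity with a brute-force scan; it is a hypothetical helper rather than a full OPTICS implementation, which additionally orders the objects with the priority heap and plots reachability distances. It assumes the point itself counts as its own first neighbor.

```java
import java.util.Arrays;

// Core distance sketch (hypothetical class; only one ingredient of OPTICS).
class CoreDistanceSketch {
  /** Distance from each point to its minPts-th nearest neighbor (self included). */
  static double[] coreDistances(double[][] data, int minPts) {
    int n = data.length;
    double[] core = new double[n];
    for (int i = 0; i < n; i++) {
      double[] dists = new double[n];
      for (int j = 0; j < n; j++) {
        dists[j] = distance(data[i], data[j]);
      }
      Arrays.sort(dists); // dists[0] is 0, the distance to the point itself
      core[i] = minPts <= n ? dists[minPts - 1] : Double.POSITIVE_INFINITY;
    }
    return core;
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[][] data = { { 0 }, { 1 }, { 2 }, { 10 } };
    System.out.println(Arrays.toString(coreDistances(data, 2)));
  }
}
```

Objects in dense regions get small values and isolated objects get large ones, which is what makes the valleys in the plot correspond to clusters.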
The open source ELKI data mining framework (package "elki" in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I've benchmarked this against GNU R ("fpc" package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes, with index down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka takes 11 hours (update 11.1.2013: 84 minutes with the 1.0.3 revision of the DBSCAN extension, in which some performance issues were resolved) and GNU R/fpc takes 100 minutes (DBSCAN, no OPTICS available). And the implementation of OPTICS in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version ...
ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.
Some part of the code may be a bit confusing at first. The "Parameterizer" classes serve the purpose of allowing automatic UI generation, for example. So there is quite a bit of meta code involved.
Plus, ELKI is quite extensively optimized. For example, it does not use Java Collections much anymore. Java Iterators, for example, require returning an object on next();. The C++ style iterators used by ELKI can have multiple values, and primitive values.
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())
is a typical for loop in ELKI, iterating over all objects of a relation, but the whole loop requires creating (and GC'ing) only a single object. And actually, this is as literal as a for loop can get.
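The iterator style of this loop can be sketched as follows. The class IntArrayIter and its intValue() method are hypothetical names chosen for illustration and are not part of the ELKI API; only the valid()/advance() pattern is taken from the loop above.

```java
// Sketch of a C++ style iterator over primitive ints (hypothetical class,
// not the ELKI API). Unlike java.util.Iterator<Integer>, reading the current
// value never creates an Integer object.
class IntArrayIter {
  private final int[] ids;
  private int pos = 0;

  IntArrayIter(int[] ids) {
    this.ids = ids;
  }

  /** True while the iterator points at an element. */
  boolean valid() {
    return pos < ids.length;
  }

  /** Move to the next element. */
  void advance() {
    pos++;
  }

  /** Current element as a primitive. */
  int intValue() {
    return ids[pos];
  }
}

class IterSketch {
  static long sum(int[] ids) {
    long total = 0;
    // The only allocation in the whole loop is the iterator itself.
    for (IntArrayIter it = new IntArrayIter(ids); it.valid(); it.advance()) {
      total += it.intValue();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(sum(new int[] { 1, 2, 3, 4 }));
  }
}
```

Because the iterator is separate from the value it points at, it could also expose several accessors at once (for example an ID and a distance), which a single next() return value cannot do without a wrapper object.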
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);
is another example. Essentially, this is like a HashSet<DBID>. Except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
Java advocates always accuse you of premature optimization when you avoid creating objects for primitives. Yet, in all my benchmarking, I have consistently seen that the number of objects you allocate has a major impact, at least inside a loop that is heavily used. Java collections with boxed primitives just eat a lot of memory, and the memory management overhead often makes a huge difference. This is why libraries such as Trove (which ELKI uses a lot) exist: memory usage does make a difference.
(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)
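To give an idea of what storing IDs as primitives looks like, here is a minimal open-addressing hash set for int keys. It is a sketch under simplifying assumptions (no removal, a hypothetical class name and hash mixing constant) and is neither the ELKI nor the Trove implementation.

```java
// Minimal primitive int hash set (hypothetical sketch, not ELKI or Trove).
// Keys live directly in an int[]; no Integer objects are ever created.
class IntHashSetSketch {
  private int[] keys;
  private boolean[] used;
  private int size = 0;

  IntHashSetSketch(int expected) {
    int cap = 4;
    while (cap < expected * 2) cap <<= 1; // power of two, load factor <= 0.5
    keys = new int[cap];
    used = new boolean[cap];
  }

  /** Adds the key; returns false if it was already present. */
  boolean add(int key) {
    if ((size + 1) * 2 > keys.length) grow();
    int i = slot(key);
    if (used[i]) return false;
    keys[i] = key;
    used[i] = true;
    size++;
    return true;
  }

  boolean contains(int key) {
    return used[slot(key)];
  }

  int size() {
    return size;
  }

  /** Slot holding the key, or the free slot where it would be inserted. */
  private int slot(int key) {
    int mask = keys.length - 1;
    int h = key * 0x9E3779B9; // simple multiplicative mixing
    int i = (h ^ (h >>> 16)) & mask;
    while (used[i] && keys[i] != key) {
      i = (i + 1) & mask; // linear probing
    }
    return i;
  }

  /** Doubles the capacity and reinserts all keys. */
  private void grow() {
    int[] oldKeys = keys;
    boolean[] oldUsed = used;
    keys = new int[oldKeys.length * 2];
    used = new boolean[oldKeys.length * 2];
    size = 0;
    for (int i = 0; i < oldKeys.length; i++) {
      if (oldUsed[i]) add(oldKeys[i]);
    }
  }

  public static void main(String[] args) {
    IntHashSetSketch processed = new IntHashSetSketch(16);
    processed.add(42);
    System.out.println(processed.contains(42) + " " + processed.contains(7));
  }
}
```

A HashSet<Integer> holding the same keys needs an entry object and a boxed Integer per element, whereas this sketch only keeps two flat arrays.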
2012-11-02 14:52 — Categories: English Coding

Changing Gnome 3 colors

One thing that many people dislike about Gnome 3, in my opinion, is that the authors/maintainers impose a lot of decisions on you. These decisions are in fact not really hard coded, but I found the documentation on how to change them to be really inaccessible.
For example colors. I found it extremely badly documented on how to customize GTK colors. And at the same time, a lot of the themes do not work reliably across different Gnome versions. For example the unico engine in Debian experimental is currently incompatible with the main GTK version there (and even worse, GTK does not realize this and refuse to load the incompatible engine). A lot of the themes you can get on gnome-look.org for example use unico. So it's pretty easy to get stuck with a non-working GTK 3, this really should not happen that easily. (I do not blame the Debian maintainers to not have worked around this using package conflicts yet - it's in experimental after all. But upstream should know when they are breaking APIs!)
For my work on the ELKI data mining framework I do a lot of work in Eclipse. And here GTK3 really is annoying, in particular the default theme. Next to unusable, actually, as code documentation tooltips show up black-on-black.
Recently, Gnome seems to be mostly driven by a mix of design and visual motivation. Gnome shell is a good example. No classic Linux user I've met likes it, even my dad immediately asked me how to get the classic panels back. It is only the designers that seem to love it. I'm concerned that they are totally off on their audience, they seem to target the mac OSX users instead of the Linux users. This is a pity, and probably much more a reason why Gnome so far does not succeed on the Desktop: it keeps on forgetting the users it already has. They by now seem to move to XFCE and LXDE because neither the KDE nor the Gnome crowd care about classic Linux users in the hunt for copying OSX & Co.
Anyway, enough ranting. Here is a simple workaround -- that hopefully is more stable across GTK/Gnome versions than all those themes out there -- that just slightly adjusts the default theme:
$ gsettings set \
org.gnome.desktop.interface gtk-color-scheme '
os_chrome_fg_color:black;
os_chrome_bg_color:#FED;
os_chrome_selected_fg_color:black;
os_chrome_selected_bg_color:#f5a089;
selected_fg_color:white;
selected_bg_color:#c50039;
theme_selected_fg_color:black;
theme_selected_bg_color:#c50039;
tooltip_fg_color:black;
tooltip_bg_color:#FFC;
'
This will turn your panel from a designer-hip black back to a regular grayish work panel. If you are working a lot with Eclipse, you'll love the last two options. That part makes the tooltips readable again! Isn't that great? Instead of caring about what the latest hipster colors are, we now have readable tooltips for developers again instead of all those fancy-schmancy designer orgasms!
Alternatively, you can use dconf-editor to set and edit the value. The tricky part was to find out which variables to set. The (undocumented?) os_chrome stuff seems to be responsible for the panel. Feel free to change the colors to whatever you prefer!
GTK is quite customizable. And the gsettings mechanism actually is quite nice for this. It just seems to be really badly documented. The Adwaita theme in particular seems to have quite some hard-coded relationships also for the colors. And I haven't found a way (without doing a complete new theme) to just reduce padding, for example. In particular, as there probably are a hundred CSS parameters that one would need to override to get it into everywhere (and with the next Gnome, there will be again two dozen to add?).
The above method just seems to be the best way to tweak the looks. At least the colors, since that is all that you can do this way. If you want to customize more, you probably have to do a complete theme. At which point, you probably have to redo this at every new version. And to pick on Miguel de Icaza: the kernel APIs are extremely stable, in particular compared to the mess that Gnome has been across versions. And at every new iteration, they manage to offend a lot of their existing users (and end up looking more and more like Apple - maybe we should copy more from where we are good at, instead of copying OSX and .NET?).
2012-10-22 17:31 — Categories: English Linux

Google Plus replacing blogs not Facebook

When Google launched Google+, a lot of people were very sceptical. Some outright claimed it to be useless. I must admit, it has a number of functions that really rock.
Google Plus is not a Facebook clone. It does not try to mimic Facebook that much. To me, it looks much more like a blog thing. A blog system, where everybody has to have a Google account, and then can comment (plus, you can then restrict access and share only with some people). It also encourages you to share shorter posts. Successful blogs always tried to make their posts "articles". Now the posts themselves are merely comments; but not as crazy short as Twitter (it is not a Twitter clone either), and it does have rich media contents, too.
Those who expect it to replace their Facebook where the interaction is all about personal stuff will be somewhat disappointed. Because it IMHO much less encourages the smalltalk type of interaction.
However, it won a couple of pretty high profile people to share their thoughts and web discoveries with the world. Some of the most active users I follow on Google Plus are Linus Torvalds and Tim O'Reilly (of the publishing house O'Reilly).
Of course I also have a number of friends that share private stuff on Google Plus. But in my opinion the strength of Google Plus is in sharing publicly. Since Google is the king of search, they can feed shares of your friends into your regular search results, and there is also a pretty interesting search in Google Plus itself. The key difference is that with this search, the focus is on what is new. Regular web search is also a lot about searching for old things (where you did not bother to remember the address or bookmark the site - and mind it, today a lot of people even "google for Google" ...) For example I like the plus search for data mining because it occasionally has some interesting links in it. A lot of the stuff is coming in again and again, but using the "j and k" keys, I can quickly scroll through these results to see if there is anything interesting. And there are quite a lot of interesting things I've discovered this way.
Note that this can change anytime. And maybe it is because I'm interested in technology stuff that it works well for me. But say, maybe you are more into HDR photography than me (I think they look unreal, as if someone has done way too much contrast and edge enhancing on the image). But go there, and press "j" a number of times to browse through some HDR shots. That is a pretty neat search function there. And if you come back tomorrow, there will likely be new images!
Facebook tried to clone this functionality. Google+ launched in June 2011, and in September 2011, Facebook added "subscribers". So they realized the need for having "non-friends" that are interested in what you are doing. Yet, I don't know anybody actually using it. And the public posts search is much less interesting than that of Google Plus, and the nice keyboard navigation is also missing.
Don't get me wrong, Facebook still has its uses. When I travel, Facebook is great for me to get into contact with locals to go swing dancing. There are a number of events where people only invite you on Facebook (and that is one of the reasons why I've missed a number of events - because I don't use Facebook that much). But mind it, a lot of the stuff that people share on Facebook is also really boring.
And that will actually be the big challenge for Google: keeping the search results interesting. Once you have millions of people there sharing pictures of lolcats - will it still return good results? Or will just about every search give you more lolcats?
And of course, spam. The SEO crowd is just warming up in exploring the benefits of Google Plus. And there are quite some benefits to be gained from connecting web pages to Google Plus, as this will make your search results stick out somehow, or maybe give them that little extra edge over other results. But just as Facebook at some point was heavily spammed when every little shop was setting up its Facebook page, inviting everyone to all the events and so on - this is bound to happen on Google Plus, too. We'll see how Google then reacts, and how quickly and effectively.
2012-09-09 16:01 — Categories: English Web SEO

ELKI call for contributions

ELKI is a data mining software project that I have been working on for the last years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from the command line.

The key feature that sets ELKI apart from existing open source tools used in data mining (e.g. Weka and R) is that it has support for index structures to speed up algorithms, and a very modular architecture that allows various combinations of data types, distance functions, index structures and algorithms. When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set with 110250 images described by 8 dimensional color histograms. This is a decently sized dataset: it takes long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice that of the more flexible "flexclust" version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as opposed to 11 hours (Weka; update 11.1.2013: 84 minutes, after some code cleanup).
The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which for example does not assume distances to be of type double, and just has a lot of glue code in between. However, obviously, the Java Hotspot compiler actually lives up to its expectations and manages to inline the whole distance computations into k-means, and then compiles it at a level comparable to C. R executes vectorized operations quite fast, but on non-native code as in the LOF example it can become quite slow, too. (I would not take Weka as reference, in particular with DBSCAN and OPTICS there seems to be something seriously broken. Update 11.1.2013: Eibe Frank from Weka had a look at Weka DBSCAN, and removed some unnecessary safety checks in the code, yielding a 7.5x speedup. Judging from a quick look at it, the OPTICS implementation actually is not even complete, and both implementations actually copy all data out of Weka into a custom linear database, process it there, then feed back the result into Weka. They should just drop that "extension" altogether. The much newer and Weka-like LOF module is much more comparable.)

Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain, I dare to say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end they do not really fit together.

Anyway, ELKI is a cool research project. It keeps on growing, we have a number of students writing extensions as part of their thesis. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and use my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and such hurdles), but then it is a very powerful research tool.

But there are just many more algorithms, published sometime, somewhere, but barely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them for their actual data. So far, ELKI was mostly used for algorithmic research, but it's starting to move out into the "real" world. More and more people that are not computer scientists start using ELKI to analyze their data. Because it has algorithms that no other tools have.

I tried to get ELKI into the "Google Summer of Code", but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will not be able to do myself the next years, unfortunately.

  • A web browser frontend would be cool. Maybe even connected to Google Refine, using Refine for preprocessing the data, then migrating it into ELKI for analysis. The current visualization engine of ELKI is using SVG - this should be fairly easy to port into the web browser. Likely, the web browsers will even be faster than the current Apache Batik renderer.
  • Visual programming frontend. Weka, RapidMiner, Orange: they all have visual programming style UIs. This seems to work quite well to model the data flow within the analysis. I'd love to see this for ELKI, too.
  • Cluster/Cloud backend. ELKI can already handle fairly large data sets on a big enough system. If someone spends extra effort on the index structures, the data won't even need to fit into main memory anymore. Yet, everybody now wants "big data", and parallel computation probably is the future. I'm currently working on some first Hadoop YARN based experiments with ELKI. But this is a huge field, turning this into true "elk yarn". I will likely only lay some foundations (unless I get funding to continue as a PostDoc on this project. I sure hope to get to do at least a few years of postdoc somewhere, as I really enjoy working with students on this kind of project)
  • New visualization engine. The current visualization engine, based on Apache Batik and SVG, is quite okay. It does what I need, which is to get a quick glance at the results and the ability to export them for publications in print quality (in particular, I can easily edit the SVG files with Inkscape). But it is not really something fancy (although we have a lot of cool visualizations). And it is slow. I haven't found a portable and fast graphics toolkit for Java yet that can produce SVG files. There is a lot of hype around Processing, for example, but it seems to be too much about art for me. In fact, I'd love to use something like Clutter or Cairo. But getting them to work on Windows and Mac OS X will likely be a pain.
  • Human Computer Interaction (HCI). This is in my opinion the biggest challenge we are facing with all the "big data" stuff. If you really go into big data (and not just run Hadoop and Mahout on a single system; yes - a lot of people seem to do this), you will at some point need to go beyond just crunching the numbers. So far, the challenges that we are tackling are largely data summarization and selection. TeraSort is a cool project, and a challenge. Yet, what do you actually get from sorting this large amount of data? What do you get from running k-means on a terabyte? When doing data mining on a small data set, you quickly learn that the main challenge actually is preprocessing the data and choosing parameters the right way so that your result is not bogus. Unless you are doing simple prediction tasks, you often don't have a clearly defined objective. Sure, when predicting churn rates, you can just throw all the data into a big cloud and hope you get some enlightenment out. But when you are doing cluster analysis or outlier detection - unsupervised methods - the actual objective by definition cannot be hardcoded into a computer. The key objective then is to learn something new about the data set. But if you want to have your user learn something about the data set, you will have to have the user guide the whole process, and you will have to present results to the user. Which gets immensely more difficult with larger data. Big data just no longer looks like this. And neither are the algorithms as simple as k-means or hierarchical clustering. Hierarchical clustering is good for teaching the basic ideas of clustering. But you will not be using a dendrogram for a huge data set. Plus, it has a naive complexity of O(n^3), and O(n^2) for some special cases - too slow for truly big data.
    For the "big data future", once we get over all the excitement of being able to just somehow crunch these numbers, we will need to seriously look into what to do with the results (in particular, how to present them to the user), and how to make the algorithms accessible and usable for non-techies. Right now, you cannot expect a social sciences researcher to be able to use a Hadoop cluster, let alone to make sense of the results. But if you are smart enough to actually solve this, and open up "big data processing" to the average non-IT user, this will be big.
  • Oh, and of course there are just hundreds of algorithms not yet available (accessible) as open source. Not in ELKI, and usually not anywhere else either. Just to name a few from my wishlist (I could probably implement many of them in a few hours in ELKI, but I don't have the time to do so myself, plus they are good student or starter projects to get used to ELKI): BIRCH, CLARA, CLARANS, CLINK, COBWEB, CURE, DOC, DUSC, EDSC, INSCY, MAFIA, P3C, SCHISM, STATPC, SURFING, ...
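
To make the complexity remark in the HCI item above concrete, here is a minimal Python sketch of naive single-linkage agglomerative clustering. This is not ELKI code; the function name and the toy data are made up for illustration. Each of the up to n-1 merge steps scans all pairs of clusters and all point pairs between them, which is where the roughly cubic cost of the naive approach comes from.

```python
from itertools import combinations
from math import dist


def naive_single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Every merge step compares all pairs of clusters (and all point pairs
    between them), so the total cost is roughly cubic in len(points).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair so far
        for a, b in combinations(range(len(clusters)), 2):
            # single linkage: distance of the closest pair of members
            d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]  # b > a, so index a stays valid
    return clusters


# Hypothetical toy data: two well separated groups of points.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
print(sorted(len(c) for c in naive_single_linkage(toy, 2)))
```

On five points this is instant. The point is only that the nested scan repeats for every merge, so doubling the data set makes the run take roughly eight times as long.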

If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them, adding some documentation. Because, if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits just do get cited more, because people compare to them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented at least to some extent for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.

2012-09-02 13:40 — Categories: English Research Java Coding ELKI

ResearchGate Spam

Update Dec 2012: ResearchGate still keeps on sending me their spam. Most of my colleagues who tried out RG have apparently deleted their accounts by now, so the invitation mails are becoming fewer.

Please do not try to push this link on Wikipedia just because you are also annoyed by their emails. My blog is not a "reliable source" by Wikipedia standards. It solely reflects my personal view of that web site, not journalistic or scientific research.

The reason why I call ResearchGate spam is the weasel words they use to trick authors into sending the invitation spam. Here's the text coming with the checkbox you need to uncheck (from the ResearchGate "blog")

Add my co-authors that are already using ResearchGate as contacts and invite those who are not yet members.
See how it is worded so it sounds much more like "link my colleagues that are already on ResearchGate" instead of "send invitation emails to my colleagues"? It deliberately avoids mentioning "email", too. And according to the ResearchGate news post, this is hidden in "Edit Settings", too (I never bothered to try it -- I do not see any benefit to me in their offers, so why should I?).

Original post below:


If you are in science, you probably already received a couple of copies of the ResearchGate spam. They are trying to build a "Facebook for scientists", and so far, their main strategy seems to be aggressive invitation spam.

So far, I've received around 5 of their "invitations", which essentially sound like "Claim your papers now!" (without actually getting any benefit). When I asked my colleagues about these invitations, none actually meant to invite me! This is why I consider this behaviour of ResearchGate to be spam. Plus, at least one of these messages was a reminder, not triggered by user interaction.

Right now, they claim to have 1.9 million users. They also claim "20% interact at least once a month". However, they have around 4000 Twitter followers and Facebook fans, and the top topics on their network are at like 10000-50000 users. That is probably a much more realistic estimate of the user count: 4k-40k. And these "20%" that interact might just be the 20% by which the site grew in this timeframe, users that happened to click on the sign up link. For a "social networking" site, these numbers are pointless anyway. That is probably even less than MySpace.

Because I do not see any benefit in their offers! Before going on an extremely aggressive marketing campaign like this, they really should consider actually having something to offer...

And the science community is a lot about not wasting their time. It is a dangerous game that ResearchGate is playing here. It may appeal to their techies and investors to artificially inflate their user numbers into the millions. But if you pay for the user numbers with your reputation, that is a bad deal! Once you have the reputation of being a spammer (and mind you, every scientist I've talked to so far complained about the spam and "I clicked on it only to make it stop sending me emails") it's hard to be taken seriously again. The scientific community is a lot about reputation, and ResearchGate is screwing up badly on this.

In particular, according to the ResearchGate founder on Quora, the invitations are opt-out on "claiming" a paper. Sorry, this is wrong. Don't make users annoy other users by sending them unwanted invitations to a worthless service!

And after all, there are alternatives such as Academia and Mendeley that do offer much more benefit. (I do not use these either, though. In my opinion, they also do not offer enough benefit to bother going to their website. I've mentioned the inaccuracy of Mendeley's data - and the lack of an option to get it corrected - in an earlier blog post. Don't rely on Mendeley as a citation manager! Their citation data is unreviewed.)

I'm considering sending ResearchGate (they're Berlin based, but there may also be a US office you could direct this to) a cease and desist letter, forbidding them to store personal information on me and to use my name on their websites to promote their "services". They may have visions of a more connected and more collaborative science, but they actually don't have new solutions. You can't solve everything by creating yet another web forum and "web2.0izing" everything. Although many of the web 2.0 bubble boys don't want to hear it: you won't solve world hunger and AIDS by doing another website. And there is a life outside the web.

2012-08-20 20:20 — Categories: English Research Web

DMOZ dying

Sometime in the late 1990s I became a DMOZ editor for my local area. At that time, when the internet was a niche thing and I was still a kid, I was actually operating a web site that had a similar goal as the corresponding category for a non-profit organization.
In the following years, I would occasionally log in and try to review some pages. It was a really scary experience: it was still exactly the same web 0.5 experience. You had a spreadsheet type of view, tons of buttons, and it would take like 10 page loads to just review a single site. A lot of the time, you would end up searching for a more appropriate category, copying the URL, replacing some URL-encoded special characters, and pasting it into one out of 30 fields on the form just to move the suggested site to a more appropriate category. Most of the edits would be by bots that detected a dead link and disabled it by moving it to the review stage. At the same time, every SEO manual said you need to be listed on DMOZ, so people would mass-submit all kinds of stuff to DMOZ in any category that it could in any way fit in.
Then AOL announced DMOZ 2.0. And everybody probably thought: about time to refresh the UI and make everything more usable. But it didn't. First of all, it came late (announced in 2008, actually delivered sometime in 2010), then it was incredibly buggy in the beginning. They re-launched 2.0 at least two times. For quite some time, editors would be unable to log in.
When DMOZ 2.0 came, my account was already "inactive", but I was able to get it re-activated. And it still looked the same. I read they changed from Times to Arial, and probably changed some CSS. But other than that, it was still as complicated to edit links as you could make it. So I did just a few changes then lost interest largely again.
During the last year I must have tried several times to give it another go. But my account had expired again, and I never got a reply to my reinstatement request.
A year ago, Google Directory - the most prominent use of DMOZ/ODP data, although the users were totally unaware of it - was finally discontinued, too.
So by now, DMOZ seems to be as dead as it can get (they don't even bother to answer former contributors that want to get reinstated). The links are old, and if it weren't for bots to disable dead sites, it would probably look like an internet graveyard. But this poses an interesting question: will someone come up with a working "web 2.0 social" idea of the "directory" concept (I'm not talking about Digg and these classic "social bookmarking" dead ducks)? Something that strikes the right balance between, on one hand, allowing the web page admins (and the SEO gold diggers) to promote their sites (and keep the data accurate) and, on the other hand, crowd-sourcing the quality control, while also opening the data? To some extent, Facebook and Google+ can do this, but they're largely walled gardens, and they don't have real social quality assurance; money is key there.
2012-06-09 18:53 — Categories: English Web

Are likes still worth anything?

When Facebook became "the next big thing", you had the "like" buttons pop up on various web sites. And of course "going viral" was the big thing everybody talked about, in particular SEO experts (or those that would like to be that).
But things have changed. In particular Facebook has. In the beginning, any "like" would be announced in the newsfeed to all your friends. This was what allowed likes to go viral, when your friends re-liked the link. This is what made it attractive to have like buttons on your web pages. (Note that I'm not referring to "likes" of a single Facebook post; they are something quite different!)
Once everybody "knew" how important this was, everybody tried to make the most out of it. In particular scammers, viruses and SEO people. Every other day, some clickjacking application would flood Facebook with likes. Every backwater website was trying to get more audience by getting "liked". But at some point Facebook just stopped showing "likes". This is not bad. It is the obvious reaction when people get too annoyed by the constant "like spam". Facebook had to put an end to this.

But now a "like" is pretty much worthless (in my opinion). Still, many people following "SEO Tutorials" are all crazy about likes. Instead, we should reconsider whether we really want to slow down our site loading by having like buttons on every page. A like button is not as lightweight as you might think it is. It's a complex JavaScript that tries to detect clickjacking attacks, and in fact invades your users' privacy, up to the point where for example in Germany it may even be illegal to use the Facebook like button on a web site.
In a few months, the SEO people will realize that the "like"s are a fad now, and will likely all try to jump on the Google+ bandwagon. Google+ is probably not half as much a "dud" as many think it is (because their friends are still on Facebook and because you cannot scribble birthday wishes on a wall in Google+). The point is that Google can actually use the "+1" likes to improve everyday search results. Google for something a friend liked, and it will show up higher in the search results, and Google will show the friend who recommended it. Facebook cannot do this, because it is not a search engine (well, you can use it for searching people, although Ark probably is better at this, and one searches for people nowhere near as often as one does regular web searches). Unless they go into a strong partnership with Microsoft Bing or Yahoo, the "like"s can never be as important as Google "+1" likes. So don't underestimate the Google+ strategy in the long run.
There are more points where Facebook by now is much less useful than it used to be. For example event invitations. When Facebook was in full growth, you could essentially invite all your friends to your events. You could also use lists to organize your friends, and invite only the appropriate subset, if you cared enough. The problem again was: nobody cared enough. Everybody would just invite all their friends, and you would end up getting "invitation spam" several times a day. So again Facebook had to change and limit the invitation capabilities. You can no longer invite all, or even just all on one particular list. There are some tools and tricks that can work around this to some extent, but once everybody uses them, Facebook will just have to cut it down even further.
Similarly, you might remember "superpoke" and all the "gift" applications. Facebook (and the app makers) probably made a fortune on them with premium pokes and gifts. But then this too reached a level that started to annoy the users, so they had to cut down the ability of applications to post to walls. And boom, this segment essentially imploded. I haven't seen numbers on Facebook gaming, and I figure that by doing some special setup for the games Facebook managed to keep them somewhat happy. But many will remember the time when the newsfeed would be full of Farmville and Mafia Wars crap ... it just no longer works this way.

So when working with Facebook and such, you really need to be on the move. Right now it seems that groups and applications are more useful to get that viral dream going. A couple of apps such as Yahoo currently require you to install their app (which then may post to your wall on your behalf and get your personal information!) to follow a link shared this way, and then can actively encourage you to reshare. And messages sent to a "Facebook group" are more likely to reach people that aren't direct friends of yours. When friends actually "join" an event, this is currently showing up in the news feed. But all of this can change with 0 days notice.
It will be interesting to see if Facebook can in the long run keep up with Google's ability to integrate the +1 likes into search results. It probably takes just a few success stories in the SEO community for getting +1 instead of Facebook likes to become the "next big thing" in SEO. Then Google just has to wait for them to virally spread +1 adoption. Google can wait - its Google Plus growth rates aren't bad, and they have a working business model already that doesn't rely on the extra growth - they are big already and make good profits.
Facebook however is walking on a surprisingly thin line. They need tight control over the amount of data shared (which is probably why they try to do this with "magic"). People don't want to have the impression that Facebook is hiding something from them (although it is in fact suppressing a huge part of your friends' activity!), but they also don't want to get all this data spammed onto them. And in particular, it needs to give the web publishers and app developers the right amount of that extra access to the users, while in turn keeping the major spam away from the users.

Independent of the technology and actual products, it will be really interesting to see if we manage to find some way to keep the balance in "social" one-to-many communication right. It's not a fault of Facebook that many people "spam" all their friends with all their "data". Google's Circles probably isn't the final answer either. The reason why email still works rather well is probably that it makes one-to-one communication easier than one-to-many, that it isn't realtime, and that people expect you to put enough effort into composing your mails and choosing the right recipients for the message. Current "social" communication is pretty much posting everything to everyone you know, addressed as "to whoever it may concern". Much of it is in fact pretty non-personal or even non-social. We have definitely reached the point where more data is shared than is being read. Twitter is probably the most extreme example of a "write-only" medium. The average number of times a tweet is read by a human other than the original poster must be way below 1, and definitely much less than the average number of "followers".
So in the end, the answer may actually be a good automatic address book, with automatic groups and rich clients, to enable everybody to easily use email more efficiently. On the other hand, separating "serious" communication from "entertainment" communication may be well worth having a separate communications channel, and email definitely is dated and is having spam problems.
2012-04-11 21:07 — Categories: English SEO Web

Google Scholar, Mendeley and unreliable sources

Google Scholar and Mendeley need to do more quality control.
Take for example the article
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Scientific and Statistical Database Management (SSDBM 2008)
(part of my diploma thesis).
Apparently, someone screwed up entering the data into Mendeley and added the editors to the authors. Now Google happily imported this data into Google Scholar, and keeps on reporting the authors incorrectly, too. Of course, many people will again import this incorrect data into their bibtex files, upload it to Mendeley and others...
Yet, neither Google Scholar nor Mendeley has an option for reporting such an error. They don't even realize that maybe Springerlink - where the DOI points to - is the more reliable source.
On the contrary, Google Scholar just started suggesting to me that the editors are coauthors ...
They really need to add an option to fix such errors. There is nothing wrong with having errors in gathered data, but you need to have a way of fixing them.
2012-03-14 12:52 — Categories: English Research

Faster web with NoScript's ABE

NoScript (sorry, Firefox only - there is no comparable functionality available in Chrome) is a must-have add-on for safer web surfing. It does not only prevent many clickjacking attacks, but it can do much more for you.

In the default setting, NoScript will block any script that you did not explicitly whitelist. While this is a bit annoying in the beginning - you will have to whitelist most of your everyday web pages - it will give you quite some insight into the amount of tracking that you are exposed to. A recent test showed that on a typical newspaper website, there will be tracking codes of more than 10 web sites (mostly ad websites and social networks). Accepting these will probably pull in another set. Most of this happens in the background, tracking you across various web sites this way.

NoScript will essentially force you to make a decision for each site: permanently allow it, temporarily allow it, or block it. Since it blocks by default, you will easily see what works without and what does not - if it doesn't work as expected, and you need the site, you can allow it with just a few clicks.

But there is more functionality hidden. NoScript has a function called ABE, "Application Boundaries Enforcer". This can be seen as a refinement of NoScript: you don't only whitelist web sites, but actually web site combinations. I'll give you a simple example of why and how this is useful. Consider these ABE rules:

# Only Facebook may embed Facebook
Site        .facebook.com .fbcdn.net .facebook.net
Accept from .facebook.com .fbcdn.net .facebook.net
Deny        INCLUSION POST

# Only Google may embed Google +1
Site        plusone.google.com
Accept from         google.com
Deny        INCLUSION POST

These rules are quite simple: essentially they say that no website may embed Facebook except Facebook, and no website may embed Google +1 except Google. I chose these rules for multiple reasons:

  1. I don't ever click on a "like" or "+1" button. I could as well not load them at all in the first place.
  2. These websites tend to track your behaviour.

Note that I did not block them altogether. I can still access the web pages as usual, if I want to. I even allowed links, but not scripts and similar embeddings.
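
A toy model can make the evaluation order concrete. The Python sketch below is only how I read the two rules above, not NoScript's actual implementation. The data layout, the function names, the request kind "LINK" for a plain click and the example hosts are all invented for illustration. It assumes that a pattern with a leading dot matches the domain and all its subdomains, that a pattern without a dot matches exactly, and that the first matching predicate of a matching Site decides.

```python
def host_matches(pattern, host):
    """Assumed matching: '.example.com' covers example.com and subdomains,
    a pattern without leading dot must match the host exactly."""
    if pattern.startswith("."):
        return host == pattern[1:] or host.endswith(pattern)
    return host == pattern


FACEBOOK = [".facebook.com", ".fbcdn.net", ".facebook.net"]

# Each predicate is (action, request kinds or None for any, origins or None for any).
RULES = [
    {"site": FACEBOOK,
     "predicates": [("Accept", None, FACEBOOK),
                    ("Deny", {"INCLUSION", "POST"}, None)]},
    {"site": ["plusone.google.com"],
     "predicates": [("Accept", None, ["google.com"]),
                    ("Deny", {"INCLUSION", "POST"}, None)]},
]


def decide(target_host, origin_host, kind):
    """kind is 'INCLUSION', 'POST' or 'LINK' (hypothetical name for a plain click)."""
    for rule in RULES:
        if not any(host_matches(p, target_host) for p in rule["site"]):
            continue  # this rule does not cover the requested site
        for action, kinds, origins in rule["predicates"]:
            kind_ok = kinds is None or kind in kinds
            origin_ok = origins is None or any(
                host_matches(p, origin_host) for p in origins)
            if kind_ok and origin_ok:
                return action  # first matching predicate wins
    return "Accept"  # no rule matched: the request passes through


print(decide("www.facebook.com", "news.example.org", "INCLUSION"))
print(decide("www.facebook.com", "news.example.org", "LINK"))
```

In this model an embedded like button on a newspaper page is denied, while following an ordinary link to Facebook from that same page still goes through. That is exactly the "not blocked altogether" behaviour described above.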

And it doesn't just increase your privacy (read this current article in the NYT for an example of the amount of tracking happening these days). It also makes web pages load faster, because you don't load all their cruft all the time, and can live without them showing you videos from 3 different domains next to the actual article that you want to read ...

Update: I've learned that newer versions of Chrome actually can filter on load (and not just display), and there is a similar extension available called ScriptNo. The main reason I'm currently moving away from Chromium is that it wastes more memory than Firefox, and I'm always short on RAM.

2012-02-22 21:23 — Categories: English Web Security

Class management system

Dear Lazyweb.
A friend of mine is looking for a small web application to manage tiny classes (as in course, not as in computing). They usually span just four dates, and people will often sign up for the next class then. Usually 10-20 people per class, although some might not sign up via internet.
We deliberately don't want to require them to fully register for the web site and go through all that registration, email verification etc. trouble. Anything that takes more than filling out the obviously required form will just cause trouble.
At first it sounded like this is a common task, but in essence all the systems I've seen so far are totally overpowered for this. There are no grades, no working groups, no "customer relationship management". There isn't much more needed than the ability to easily configure the classes, have people book them, and get the list of signed up users into a spreadsheet easily (CSV will do).
It must be able to run on the typical PHP+MySQL web hoster and be open source.
Any recommendations? Drop me a comment or email at erich () debian org Thank you.
2011-10-09 15:04 — Categories: English Linux

Privacy in the public opinion

Many people in the United States seem to hold the opinion that the "public is willing to give up most of their privacy", in particular when dealing with online services such as Facebook. I believe in his keynote at ECML-PKDD, Albert-László Barabási of Harvard University expressed such a view, that this data will just become more and more available. I'm not sure if it was him or someone else (I believe it was someone else) that essentially claimed "privacy is irrelevant". Another popular opinion is that "it's only old people caring for privacy".
However, just like politics, these things tend to oscillate from one extreme to another. For example, in recent years in Europe, conservative parties were winning one election after another. Now in France, the socialist parties have just won the senate, the conservative parties in Germany are losing in one state after the other and so on. And this will change back again, too. Democracy also lives from changing roles in government, as this both drives progress and fights corruption.
We might be seeing the one extreme in the United States right now, where people are readily giving away their location and interests for free access to a web site. This can swing back any time.
In Germany, one of the government parties - the liberal democrats, FDP - just dropped out of the Berlin city government, down to 1.8% of voters. Yes, this is the party the German foreign minister, Guido Westerwelle, is from. The pirate party - much of their program is about privacy, civil rights, copyright reforms and the internet - which didn't even participate in the previous elections since it was founded just 5 years ago, jumped to 8.9%, scoring higher than the liberal democrats did in the previous elections. In 2009 they scored a surprisingly high 2% in the federal elections - current polls see them anywhere from 4% to 7% at the federal level, so they will probably get seats in parliament in 2013. (There are also other reasons why the liberal democrats have been losing voters so badly, though! Their current numbers indicate they might drop out of parliament in 2013.)
The Greens in Germany, which are also very much oriented towards privacy and civil rights, are also on the rise, and in March just became the second strongest party and senior partner in the governing coalition of Baden-Württemberg, which historically was a heartland of the conservatives.
So don't assume that privacy is irrelevant nowadays. The public opinion can swing quickly. In particular in democratic systems that have room for more than two parties - so probably not in the United States - such topics can actually influence elections a lot. Within 30 years, the Greens now frequently reach values of 20% in federal polls and up to 30% in some states. It doesn't look as if they are going to go away soon.
Also don't assume that it's just old people caring about privacy - in Germany, in particular the pirate party and the Greens are very much favored by the young people. The typical voter for the pirates is less than 30 years old, male, has a higher education and works in the media or internet business.
In Germany, much of the protest for more privacy - and against the all too ready data collection by companies such as Facebook and Google - is driven by the young internet users and workers. I believe this will be similar in other parts of Europe - there are other pirate parties all over Europe. And this can happen to the United States any time, too.
Electronic freedom - e.g. pushed by the Electronic Frontier Foundation, but also the open source movement - does have quite a history in the United States. But open source in particular has made such huge progress in the last decade that these movements in the US could just be a bit out of breath right now. I'm sure they will come back with a strong push against the privacy invasions we're seeing right now. And that can likely take down a giant like Facebook, too. So don't bet on people continuing to give up their privacy!
2011-09-28 09:29 — Categories: English Politics

ELKI 0.4 beta release

Two weeks ago, I published the first beta of the upcoming ELKI 0.4 release. The accompanying publication at SSTD 2011 won the "Best Demonstration Paper Award"!
ELKI is a Java framework for developing data mining algorithms and index structures. It has indexes such as the R*-Tree and M-Tree, and a huge collection of algorithms and distance functions. These are all written rather generically, so you can build all the combinations of indexes, algorithms and distances. There are evaluation and visualization modules.
Note that I'm using "data mining" in the broad, original sense that focuses on knowledge discovery by unsupervised methods such as clustering and outlier detection. Today, many people just think of machine learning and "artificial intelligence" - or even worse: large scale data collecting - when they hear data mining. But there is much more to it than just learning!
Java comes at a certain price. The latest version already got around 50% faster than the previous release just by reducing Java boxing and unboxing, which puts quite some pressure on the memory management. So you could implement these things in C to become a lot faster; but this is not production software. I need code that I can put students to work on and have them extend; this is much more important than getting the maximum speed. You can probably still use this for prototyping: see what works, then implement just what you really need in a low-level language for maximum performance.
You can do some of that in Java. You could work on a large chunk of doubles, and access them via the Unsafe class. But is that then still Java, or aren't you actually doing just plain C? In our framework, we want to be able to support non-numerical vectors and non-double distances, too. Even when they are only applicable to certain specialized use cases. Plus, generic and Java-style code is usually much more readable, and the performance cost is not critical for research use.
Release 0.4 has plenty of under the hood changes. It allows multiple indexes to exist in parallel, and it supports multi-relational data. There are also a dozen new algorithms, mostly from the geo/spatial outlier field, which were used for the demonstration. But for example, it also includes methods for rescaling the output of outlier detection methods to a more sensible numerical scale for visualization and comparison.
You can install ELKI on a Debian testing and unstable system by the usual "aptitude install elki" command. It will install a menu entry for the UI and also includes the command-line launcher "elki-cli" for batch operation. The "-h" flag can produce an extensive online help, or you can just copy the parameters from the GUI. By reusing Java packages such as Batik and FOP already in Debian, this also is a smaller download. I guess the package will at some point also transition to Ubuntu - since it is Java you can just download and install it anyway I guess.
2011-08-29 18:20 — Categories: English Coding Research ELKI Java

Missing in Firefox: anon and regular mode in parallel

The killer feature that Chrome has and Firefox is missing is quite simple: the ability to have "private" aka "anonymous" and non-private tabs open at the same time. As far as I can tell, with Firefox you can only be in one of these modes.
My vision of a next generation privacy browser would essentially allow you to have tabs in individual modes, with some simple rules for mode switching. Essentially, unknown sites should always be opened in anonymous mode. Only sites that I register for should automatically switch to a tracked mode where cookies are kept, for my own convenience. And then there are sites in a safety category, such as my banking site, that should be isolated from any other content. Going to these sites should require me to manually switch modes (except when using my bookmark). Embedding, framing and such things to these sites should be impossible.
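The mode-switching rules above are simple enough to write down as a small decision function. The following Python sketch is only an illustration of the rules as stated; the names Mode and choose_mode, the parameter names, and the example host names are all made up, and no existing browser API is implied.

```python
from enum import Enum
from typing import Optional, Set


class Mode(Enum):
    ANONYMOUS = "anonymous"   # nothing is kept after the tab closes
    TRACKED = "tracked"       # cookies are kept for convenience
    ISOLATED = "isolated"     # separated from all other content


def choose_mode(host: str, registered: Set[str], safety: Set[str],
                explicit: bool = False) -> Optional[Mode]:
    """Pick the mode for a new tab, or None if the navigation is refused.

    `explicit` means the user switched modes manually or used a bookmark.
    """
    if host in safety:
        # Links, embeds and frames must never reach these sites.
        return Mode.ISOLATED if explicit else None
    if host in registered:
        return Mode.TRACKED
    # Everything unknown defaults to anonymous mode.
    return Mode.ANONYMOUS
```

With registered = {"forum.example"} and safety = {"bank.example"} (hypothetical hosts), a link to bank.example is refused unless it was opened explicitly, forum.example opens tracked, and everything else opens anonymously.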
On a side note, having TOR tabs would also be nice.
2011-08-22 18:46 — Categories: English Web

Documenting fat-jar licenses

Dear Lazyweb,
What is the appropriate way to document the individual licenses of a fat jar (a jar archive that includes the program along with all its dependencies)?
I've been living in the happy world of Linux distributions, where one would just package the dependencies independently (or just specify which existing packages you use, actually, since most are already packaged by other helpful developers). But for the non-Linux people, a fat jar seems to be important so they can double-click it to run the application.
When building larger Java applications, you end up using a number of external libraries such as various Apache Commons and Apache Batik. I'm currently including all their LICENSE.txt files in a directory named "legal", and I'm trying to make it as obvious as possible which license applies to which parts of the jar archive. Is there any best practice for doing this? I don't want to reinvent the wheel; I'd also like to avoid any common legal pitfalls, obviously.
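Whatever convention turns out to be best, it is easy to check that the license files really end up in the archive, since a jar is just a zip file. A minimal Python sketch, assuming the "legal" directory described above; the function name is made up for illustration.

```python
import zipfile


def list_license_files(jar_path: str, directory: str = "legal/") -> list:
    """Return the names of all files stored below `directory` in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name for name in jar.namelist()
            # Skip the directory entries themselves, keep only files.
            if name.startswith(directory) and not name.endswith("/")
        )
```

Running this as part of the build and comparing the result against the list of bundled dependencies would at least catch a forgotten license file.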
Feel free to respond either using the Disqus comment function or by email via erich () debian org
2011-08-16 17:23 — Categories: English Coding Java

Restricting Skype via iptables

Whenever I launch Skype on my computer, it gets banned from the university network within a few minutes; the ban expires again a few minutes after I close Skype. This is likely due to the aggressive network behavior of Skype; maybe the firewalls think it is attempting a DDoS attack. This is one of the known big issues of using Skype.
For Windows users, there are some known workarounds to limit Skype, usually involving registry editing. Unfortunately, these are not available on Linux.
Therefore, I decided to play around with advanced iptables functionality. You cannot reliably match the originating process (the owner match module seemed to include such functionality at some point, but it was deemed unreliable on multi-core systems). However, there are other and more efficient methods of achieving the same effect: matching on the group ID that the process runs under.
Here's my setup:
# Add a system group for Skype
addgroup --system skype
# Make the binary setgid "skype" so it always runs with that group
# (assuming Debian package!)
dpkg-statoverride --update --add root skype 2755 "$(which skype)"
And these are the iptables rules I use:
iptables -I OUTPUT -p tcp -m owner --gid-owner skype \
    -m multiport ! --dports 80,443 -j REJECT
iptables -I OUTPUT -p udp -m owner --gid-owner skype -j REJECT
They allow outgoing connections by Skype only on ports 80 and 443, which supposedly do not trigger the firewall (in fact, this filter is recommended by our network administration for Skype).
Or wrapped as pyroman (my firewall configuration tool; aptitude install pyroman) module:
"""
Skype restriction to avoid firewall block.

Raw iptables commands.
"""
iptables(Firewall.output, "-p tcp -m owner --gid-owner skype -m multiport ! --dports 80,443 -j %s" % Firewall.reject)
iptables(Firewall.output, "-p udp -m owner --gid-owner skype -j %s" % Firewall.reject)
which I've put just after the conntrack default module, as 05_skype.py
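Since the rules match on the group ID, they only take effect if the Skype binary really is setgid to the skype group. The following is a minimal Python sketch for checking this. The function names are made up for illustration, and the path in the usage note is an assumption about where the package installs the binary.

```python
import grp
import os
import stat


def has_setgid(mode: int) -> bool:
    """True if the setgid bit is set in a file mode such as 0o2755."""
    return bool(mode & stat.S_ISGID)


def runs_as_group(path: str, group: str) -> bool:
    """True if the file at `path` is setgid and owned by the given group."""
    st = os.stat(path)
    if not has_setgid(st.st_mode):
        return False
    # Translate the numeric group ID of the file into a group name.
    return grp.getgrgid(st.st_gid).gr_name == group
```

After the dpkg-statoverride call above, runs_as_group("/usr/bin/skype", "skype") should return True if the binary lives at that (assumed) path.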
2011-07-26 18:45 — Categories: English Linux

Google vs. Facebook

Let the games begin. It looks like Google and Facebook are going to fight it out.
Given the spectacular failures of Wave and Buzz, people are of course skeptical about the success of Google Plus. However, I'm rather confident it is going to stay. Here are some things I like about it:
  • Privacy: as far as I can tell, Plus design had started with privacy in mind, whereas for Facebook it is still an unloved child, a spare tire. Facebook keeps on getting bad reviews here; people don't get their UI and mis-share things. I read somewhere that FB is actually losing users in the US: kids who leave Facebook because their parents are getting on, and they don't want them to see what they shared with their friends.
  • UI: the Circles UI is very easy to use. The same functionality on Facebook is awful to manage. And the notifications UI in Plus is also a lot better than the tiny indicator in Facebook.
  • New stuff: Hangout and group chats are pretty interesting (well, actually I don't like Video at all ...). This puts pressure on Facebook to move, and it looks like they might present some Skype integration soon. But they need to make this really good to keep up with Hangout. Just adding "Call me" buttons for all users that have specified a Skype account in their profile won't do the trick.
  • Future integration: Wave already had many of these things, but the wrong way. You would put a map into a chat; the Plus way will probably be to share a map session with a circle and add chat and collaborative editing there. Google "Office" is another place where it is trivial but very useful to add Plus.
Some people think that Google will not stand a chance against social giant Facebook. But after all, Google has more users - and they have tons of services people like to use. So when Google Maps has Plus integration, will the users use it to chat about where to meet, or will they go the long way and post the map URL on Facebook, without the option of updating it collaboratively?
Google's position is much stronger than many people believe, once you think about integration possibilities. The current Plus is just a fragment, the missing puzzle piece connecting the other apps. But imagine that Google now Plus-connects its various services: YouTube, Maps, Mail, Talk, ... - for them, this is just a matter of some engineering. I guess half of these are probably already in internal testing. And Facebook just can't keep up there. Sure, they do have Facebook Video. But actually, most people prefer YouTube. And while Google can integrate Plus all the way there, Facebook cannot. And while Google is the master of search, Facebook is particularly weak there - they can't even properly search their own data. Google however will at some point offer a "find things that interest me" button; Sparks is just the beginning, where you have to manually define your interests (which will probably remain an option due to privacy concerns!); it is way too static right now.
So essentially, Google doesn't need to copy Facebook. They just need to do what is obvious on their own products, and Facebook will have a hard time keeping up.
Plus, in my opinion, Google got the timing just right. Users aren't too happy with Facebook these days, but there is just no big alternative around anymore; their friends are on Facebook and not on some other social network. Facebook doesn't seem to evolve much anymore. The mail functionality opened more security holes (apparently you can post to a group with a fake user name when you spoof the sender's email address) than it contributed in functionality and usefulness. Privacy is still not in line with the requirements of all countries, Germany for example; but Facebook keeps telling those users essentially that it doesn't care. Spam and fraud still reappear every month, following the same pattern again and again (clickjacking). The search function of Facebook is still usually described as "useless" ... People waste time in games and annoy their friends with random game invitations and posts. Facebook had better make a major move now, too: more than a demo of Skype integration. But whenever Facebook changed, their users complained ...

Of course, Google+ still has a long way to go, too. Many things are still missing, for example groups and events. I figure Google is already testing them in "dogfood", and they'll actually come out within the month. With groups I do not refer to the existing Google product, but to what would be "public circles" that you need to join and that are accessible to all the circle members instead of just the creator. And events are also a key function; probably one of the most used on Facebook. These may require much more careful design to integrate well with Calendar. But given the visual update of Calendar these days, this may just be around the corner, too.
2011-07-03 12:21 — Categories: English Web

Managing user configuration files

Dear Lazyweb,
How do you manage your user configuration files? I have around four home directories I frequently use. They are sufficiently in sync, but I have been considering actually using some file management to synchronize them better. I'm talking about files such as shell config, ssh config, .vimrc etc.
I had some discussions about this before, and the consensus had been that some version control system probably is best. Git seemed to be a good candidate; I remember having read about things like this a dozen years ago when CVS was still common and Subversion was new.
So dear lazyweb, what are your experiences with managing your user configuration? What setup would you recommend?
Update: See vcs-home for various related links and at least five different ways of doing this. mr, a multi-repository VCS wrapper, seems particularly good at this.
2011-06-02 16:33 — Categories: English Linux

Dear Lazyweb, how to write multi-locale python code

Dear Lazyweb,
I've been toying around with a Python WSGI application, i.e. a multi-threaded persistent web application. Now I'd like to add multi-language support to this application. I need to format datetimes to human-readable formats, but I haven't yet found a sane way to do this using strftime. Essentially, strftime will use the current application locale; however, since I'm running multi-threaded, different threads might want to use different locales. So changing the locale is bound to cause race conditions.
So what is the best way to pretty-print (including weekday names!) datetime, currency and similar values in a multi-threaded multi-locale context in Python? Gettext and manually emulating strftime don't sound that sensible to me. And of course, I don't want to have to translate the weekday names myself into any language I choose to support...
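The race described above comes from locale.setlocale changing state for the whole process. One possible workaround, sketched below, is to serialize the locale switch with a lock. It assumes that every piece of code that formats dates goes through the same helper. The helper names are hypothetical, and this trades concurrency for safety. A library that takes the locale as an explicit argument (Babel's format_datetime does, for example) avoids the global state altogether.

```python
import contextlib
import locale
import threading
from datetime import datetime

# One lock for the whole process, because the locale is process-wide state.
_locale_lock = threading.Lock()


@contextlib.contextmanager
def time_locale(name: str):
    """Temporarily switch LC_TIME while holding the process-wide lock."""
    with _locale_lock:
        saved = locale.setlocale(locale.LC_TIME)  # query the current setting
        try:
            locale.setlocale(locale.LC_TIME, name)
            yield
        finally:
            # Restore the previous setting even if formatting failed.
            locale.setlocale(locale.LC_TIME, saved)


def format_weekday(when: datetime, locale_name: str) -> str:
    """Return the full weekday name of `when` in the given locale."""
    with time_locale(locale_name):
        return when.strftime("%A")
```

A call such as format_weekday(datetime(2011, 5, 27), "de_DE.UTF-8") only works if that locale has been generated on the system; otherwise setlocale raises locale.Error.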
2011-05-27 18:23 — Categories: English Coding

AMD64 broken on Debian unstable - avoid libc6 2.13-3

Beware of upgrading on AMD64. Make sure to avoid version 2.13-3, as this will render your system unbootable and unusable. The cause is as simple (a missing link) as it is severe.
Bug report with instructions on how to recover. If you are lucky you have a root shell open to restore the missing link. Otherwise, you need to reboot with parameters break=init rw, recover the link with cd root; ln -s lib lib64, sync, unmount, reboot. It's not really hard to do when you know how. But it is a lot easier to avoid upgrading to this version. My i386 mirror already has the fixed upload (but i386 is not affected anyway). So by tomorrow, it should be safe again (depending on your mirror's delay).
2011-05-12 20:39 — Categories: English Linux Debian

Upcoming publications in data mining

Upcoming 2011 publications of my research:
Just presented at the SDM11 last weekend:
H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Interpreting and Unifying Outlier Scores
In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, 2011.
To be presented and published end of August:
T. Bernecker, M. E. Houle, H.-P. Kriegel, P. Kröger, M. Renz, E. Schubert, A. Zimek
Quality of Similarity Rankings in Time Series
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
E. Achtert, A. Hettab, H.-P. Kriegel, E. Schubert, A. Zimek
Spatial Outlier Detection: Data, Algorithms, Visualizations
In Proceedings of the 12th International Symposium on Spatial and Temporal Databases (SSTD), Minneapolis, MN, 2011.
The latter will also accompany the release of version 0.4 of our data mining research software ELKI.
2011-05-05 10:57 — Categories: English Research ELKI

SIAM SDM11 - Unified Outlier Scores

I'm currently in Phoenix, AZ at the 2011 SIAM International Conference on Data Mining.
My contribution is titled "Interpreting and Unifying Outlier Scores", a method that allows the combination, interpretation, visualization etc. of existing outlier algorithms. The method brings back a bit more statistics into a data mining area that has drifted away from the statistical roots.
We apply the method to a couple of outlier detection algorithms and combine them using a naive ensemble approach that still outperforms existing outlier ensembles.
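The general idea of bringing scores onto a common scale and then combining them can be sketched in a few lines of Python. This is a simplified illustration and not the procedure from the paper: it assumes that each detector's scores are roughly normally distributed, maps them to [0, 1] with a Gaussian scaling via the error function, and then simply averages the scaled scores. The function names and any example scores are made up.

```python
import math
from statistics import mean, pstdev


def gaussian_scale(scores):
    """Map raw scores to [0, 1]; values at or below the mean become 0."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing stands out.
        return [0.0] * len(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]


def mean_ensemble(score_lists):
    """Average the scaled scores of several detectors, object by object."""
    scaled = [gaussian_scale(scores) for scores in score_lists]
    return [mean(column) for column in zip(*scaled)]
```

Each inner list passed to mean_ensemble holds one detector's scores for the same objects in the same order. Without the scaling step, a detector with a larger numeric range would dominate the average.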
2011-04-28 15:24 — Categories: English Research

Finding packages for deinstallation

On my netbook, I try to keep the amount of installed software limited. Aptitude's "automatically installed" markers are very helpful here, since they allow you to differentiate between packages that were deliberately installed and packages that were only pulled in automatically as dependencies. I quite often browse through the list of installed packages and recheck those that are not marked as "A".
However, packages that are "suggested" by some other package (but not "required") will be kept even when marked as automatically installed. This is quite sensible: when you deinstall the package that "suggested" them, they will be removed as well. So this is nice for having optional software removed automatically, too.
However sometimes you need the core package but not this optional functionality. Aptitude can help you there, too. Here's an aptitude filter I used to find some packages for removal:
!?reverse-depends(~i) ~M !?essential
It will display only packages that no other installed package directly depends on and that are marked as automatically installed (so they must be kept installed because of a weaker dependency).
Some examples of "suggested but not required" packages:
  • Accessibility extensions of Gnome
  • Spelling dictionaries
  • Optional functionality / extensions
Depending on your requirements, you might want to keep some of these and remove others.

Here is also a filter to find packages that you can put on "automatically installed":
~i !~M ?reverse-depends(~i) !?essential
This will catch installed packages that are not marked as automatically installed, but that another installed package depends on. Note that you should not blindly put all of these to "automatic" mode. For example "logrotate" depends on "cron | anacron | fcron". If you have both cron and anacron installed, aptitude will consider anacron to be unnecessary (it is - on a system with 24h uptime). So review this list, see what happens when you set packages to "A", and reconsider your intentions. If it is software you definitely want, leave it on manual.
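To make the logic of the two filters explicit, here is a toy model in Python. It only illustrates the set logic and is not how aptitude evaluates search patterns. The Package class, the function names and the package names are made up, and alternative dependencies such as "cron | anacron | fcron" are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    name: str
    auto: bool                                  # the "A" marker (~M)
    essential: bool = False
    depends: set = field(default_factory=set)   # hard dependencies only


def depended_on(pkgs):
    """Names that some installed package declares a hard dependency on."""
    return set().union(*(p.depends for p in pkgs))


def removal_candidates(pkgs):
    """Toy model of: !?reverse-depends(~i) ~M !?essential"""
    needed = depended_on(pkgs)
    return sorted(p.name for p in pkgs
                  if p.auto and not p.essential and p.name not in needed)


def automatic_candidates(pkgs):
    """Toy model of: ~i !~M ?reverse-depends(~i) !?essential"""
    needed = depended_on(pkgs)
    return sorted(p.name for p in pkgs
                  if not p.auto and not p.essential and p.name in needed)


# Hypothetical set of installed packages.
installed = [
    Package("editor", auto=False, depends={"libfoo", "libbar"}),
    Package("libfoo", auto=False),   # manual, but the editor needs it anyway
    Package("libbar", auto=True),    # automatic and still needed
    Package("dict-de", auto=True),   # automatic, kept only by a suggestion
]
```

On this toy data, removal_candidates(installed) yields ["dict-de"] and automatic_candidates(installed) yields ["libfoo"], which mirrors the two cases discussed above.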
2011-03-15 14:04 — Categories: English Debian Linux

GNOME3 in Debian experimental - python and dconf

As GNOME3 slowly enters Debian experimental, things become a bit ... experimental.
The file manager can be set to still draw icons on the desktop, but that doesn't entirely work yet (it will also open folders as desktop then...)
One machine had lost the keyboard settings. I could not set the fonts I wanted...
There is a tool called dconf-editor that will allow you to manually tweak some settings such as the fonts. But it doesn't seem to have support for value lists yet - and the keyboard mappings setting is a string list.
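So here's sample Python code to modify such a value:
from gi.repository import Gio
# Open the keyboard schema and replace the layout list (a string list)
s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
s.set_strv("layouts", ["de"])
# Writes are asynchronous; flush them before the script exits
Gio.Settings.sync()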
Update: you could also install the optional libglib2.0-bin and use the gsettings command.
2011-03-15 00:41 — Categories: English Linux Debian Gnome

What is really happening at Fukushima?

As far as I can tell, neither the Japanese government nor the operating companies are telling the truth.
Here's my take of the story:
  • The tsunami not only destroyed the generators, but the complete cooling systems.
  • Thus, the cores will overheat and melt. They have no way to prevent this; they can only try to keep the damage as low as possible.
  • The cooling water turned into gas and dissociated into hydrogen and oxygen. Unfortunately, this is highly explosive. So the best they can do is to try to get the gas out of the core containment, and let it explode outside where it does not cause too much radioactive pollution (a core explosion would be really bad). This already happened at reactors 1 and 3, and will happen at reactor 2 within the next one or two days.
  • They use the sea water to slow down the core meltdown and keep the core containment stable. They probably do this by flooding the second containment, where there is no directly radioactive material.
So essentially, I don't see much of a risk of a nuclear explosion, and I expect the radioactive pollution to be quite low, mostly via indirect radiation from the coolant in the outer containment. The reactor however is trashed and full of highly radioactive waste that will require constant cooling for the next few years.
However, I'm really concerned that apparently, the government and the companies involved lied to us. This is happening a lot when it comes to nuclear power; they are barely truthful about what is happening.
Nuclear power is only safe as long as it is operated by altruistic and responsible people. Once you bring free market, politics and money in, it is a dangerous toy that should not be toyed with by humans.
2011-03-14 16:51 — Categories: English Politics

Google Circles rumor

The media are spreading a rumor about a social networking platform by Google, called Circles.
So far, these things have been debunked.
I do not believe that Google could succeed with a "Facebook clone". The market is already taken, Facebook is too big and had successfully been taking away international markets from local clones. For example in Germany, the other social networks used to be a lot bigger, but Facebook overtook them in just a few months, and the people I know pretty much quit the other networks then.
And I assume Google is aware that unless they have a strong strategy, the network would go the Wave-Buzz route.
At least here in Germany, people are surprisingly concerned by the amount of data Google might have on them, while at the same time they give it to Facebook for free. This is probably due to the media attention received with StreetView. It's not fair, but that's life.
However, "social network" is a broad term. Just think of what people actually use Facebook for:
  • Games (often not even with their social circle!)
  • Microblogging
  • Photo sharing
  • Email
  • Automatic address book
Now, if you look closely at this list, Google has pretty much all of these. If we ignore the browser games (and they are all over the internet by now, with and without Facebook!), there is just one key ingredient missing for Google: the address book thing.
I believe this "Google circles" thing will largely be a "smart address book" that helps people manage their social contacts in a social network style. And obviously, this can be integrated into various products such as Mail, Picasa, Buzz, ...
If Google manages to launch a "contact manager" that makes it really easy for people to manage their social circles, this can be quite a killer. Facebook has "lists" but they are awful to use. It likes to "hide" friends to reduce the amount of information it throws at you. But it doesn't really organize it for you, for example by social circles or topics. These days for example Facebook could split the news feed into "Japan" and "everything else" I guess.
2011-03-14 13:05 — Categories: Web English Google

Taking Google Calendar to the limits

The Global Lindy Hop Map I've built as a toy project is actually a calendar with geo-annotated events. It is currently backed by a custom database that uses Xapian for the search functionality to improve performance.
The data comes from around 150 Google calendars from various dancing communities. I'm preprocessing the data to have reliable geo information as well as doing some filtering and HTML formatting.
Instead of putting everything into my own database - which doesn't know about recurrence rules, but relies on materialized recurrences - I've also tried to sync all 150 calendars into one huge "master" calendar.
However, using the Google Gdata APIs to access this calendar takes way too long for the website to be usable. This is not too surprising: there are about 3500 instances in the calendar, and some of these require the computation of recurrence rules. And I cannot do the synchronization without a local ID mapping cache anyway (I can recover a lost cache from additional information I put into the calendar, though). There are around 60 event instances per day, since most come from weekly repetitions.
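To illustrate what "materialized recurrences" means, here is a minimal Python sketch that expands a weekly repetition into concrete dates within a time window, so that a database without any knowledge of recurrence rules can store and query them. The function name is made up, and real calendar data needs more than this (exceptions, end dates, time zones).

```python
from datetime import date, timedelta


def materialize_weekly(first: date, window_start: date, window_end: date):
    """Return every weekly occurrence within [window_start, window_end]."""
    if first < window_start:
        # Skip ahead to the first occurrence inside the window
        # (ceiling division of the gap in days by 7).
        weeks = -(-(window_start - first).days // 7)
        current = first + timedelta(weeks=weeks)
    else:
        current = first
    occurrences = []
    while current <= window_end:
        occurrences.append(current)
        current += timedelta(weeks=1)
    return occurrences
```

The price of this approach is that the window has to be re-expanded regularly, while a rule-based calendar computes the instances on demand.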
The HTML embed rendering can take quite a while to load (although the results apparently are cached somewhere) - looking at December either hits some processing limit or times out after around 40 seconds.
Looks like I'm going beyond the scope Google Calendar was designed for. :-)
2011-02-22 18:14 — Categories: English Google Web