> a = array(range(-1000000,1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)
> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> mean(a**2)-mean(a)**2, mean(b**2)-mean(b)**2
(0.33333366666666647, -21532835718365184.0)
(As you can see, numpy.var does not use the naive single-pass formula; they probably use the classic, straightforward two-pass approach.)
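For illustration, here is a minimal sketch of both formulas in plain Python (my own code, not numpy's actual implementation):

def var_naive(xs):
    # Single-pass textbook formula E[X^2] - E[X]^2: it subtracts two
    # huge, nearly equal numbers, which cancels catastrophically
    # when the mean is large compared to the variance.
    n = float(len(xs))
    m = sum(xs) / n
    return sum(x * x for x in xs) / n - m * m

def var_twopass(xs):
    # Classic two-pass approach: compute the mean first, then
    # average the squared deviations. The deviations are small
    # numbers, so no catastrophic cancellation occurs here.
    n = float(len(xs))
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / n

On b above, var_naive reproduces the broken mean(b**2)-mean(b)**2 result, while the two-pass form is far more robust (numpy presumably also sums more carefully than a plain Python sum when computing the mean).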
> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print timeit.timeit(f, setup="import math; a=range(0,1000)")
30.6121790409
202.994441986
So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude), we might prefer the simpler approach, which is roughly 3-6x faster (at least as far as pure CPU performance is concerned; once I/O comes into play, the difference might disappear altogether). This is probably why all functions except fsum show the same inaccuracy: performance - especially since in 99% of situations the problems won't arise.
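If the exactly-rounded math.fsum is overkill but the plain sum is too lossy, compensated (Kahan) summation is a classic middle ground in both cost and accuracy. A minimal sketch (my own code - this is not what numpy.sum or math.fsum use internally):

def kahan_sum(xs):
    # Carry a compensation term holding the low-order bits
    # that each addition would otherwise discard.
    total = 0.0
    comp = 0.0
    for x in xs:
        y = x - comp            # re-inject the previously lost bits
        t = total + y           # big + small: low bits of y are lost
        comp = (t - total) - y  # recover what was actually lost
        total = t
    return total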
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())
is a typical for loop in ELKI, iterating over all objects of a relation - yet the whole loop requires creating (and GC'ing) only a single object. And it is about as literal as a for loop can get.
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);
is another example. Essentially, this is like a HashSet<DBID>. Except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
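To see why primitive storage matters, here is a rough Python analogy (not ELKI code): a packed array of primitive 32-bit integers takes about half the memory of a list's pointer array alone - before even counting the boxed objects those pointers refer to.

import sys
from array import array

n = 1000000
boxed = list(range(n))         # one pointer per entry, plus an object each
packed = array('i', range(n))  # contiguous primitive 32-bit ints

print(sys.getsizeof(boxed))    # ~8 MB of pointers alone (on 64 bit)
print(sys.getsizeof(packed))   # ~4 MB total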
$ gsettings set \
  org.gnome.desktop.interface gtk-color-scheme '
  os_chrome_fg_color:black;
  os_chrome_bg_color:#FED;
  os_chrome_selected_fg_color:black;
  os_chrome_selected_bg_color:#f5a089;
  selected_fg_color:white;
  selected_bg_color:#c50039;
  theme_selected_fg_color:black;
  theme_selected_bg_color:#c50039;
  tooltip_fg_color:black;
  tooltip_bg_color:#FFC;
'
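The same key can presumably also be set from Python via GIO, analogous to the gsettings command above - an untested sketch; gtk-color-scheme is a plain string key, so set_string should apply:

from gi.repository import Gio

s = Gio.Settings.new("org.gnome.desktop.interface")
# Same key as above, shortened to two colors for brevity:
s.set_string("gtk-color-scheme",
             "selected_fg_color:white; selected_bg_color:#c50039;")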
ELKI is a data mining software project that I have been working on for the past few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from the command line.
The key feature that sets ELKI apart from existing open source tools used in
data mining (e.g. Weka and R) is that it has support for index structures to
speed up algorithms, and a very modular architecture that allows various
combinations of data types, distance functions, index structures and
algorithms. When looking for performance regressions and optimization potential
in ELKI, I recently ran some
benchmarks on a data
set with 110250 images described by 8-dimensional color histograms. This is a
decently sized dataset: it takes long enough (usually in the range of 1-10
minutes) to measure true hotspots. When including Weka and R in the comparison I
was quite surprised: our k-means implementation runs at the same speed as R's
implementation in C (and around twice the speed of the more flexible "flexclust"
version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order
of magnitude faster than Weka and R, and adding index support speeds up the
computation by another factor of 5-10x. In the most extreme case - DBSCAN in
Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2
minutes (ELKI) as opposed to 11 hours (Weka; update 11.1.2013: 84 minutes,
after some code cleanup).
The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which, for example, does not assume distances to be of type double, and just has a lot of glue code in between. However, the Java Hotspot compiler obviously lives up to its name: it manages to inline the whole distance computation into k-means, and then compiles it at a level comparable to C. R executes vectorized operations quite fast, but on non-native code as in the LOF example it can become quite slow, too. (I would not take Weka as a reference;
in particular with DBSCAN and OPTICS there seems to be
something seriously broken. Update 11.1.2013: Eibe Frank from Weka
had a look at Weka's DBSCAN and removed some unnecessary safety checks in the code, yielding a 7.5x speedup. Judging from a quick look at it, the OPTICS
implementation is not even complete, and both implementations actually
copy all data out of Weka into a custom linear database, process it there, then
feed the result back into Weka. They should just drop that "extension"
altogether. The much newer and more Weka-like LOF module is much more comparable.)
Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm, because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain - I dare say the largest collection. In particular, they are all in the same framework, so they can easily be compared. R of course has an impressive collection in CRAN, but in the end the packages do not really fit together.
Anyway, ELKI is a cool research project. It keeps on growing; we have a number of students writing extensions as part of their theses. It has been extremely helpful in my own research, as I could quickly prototype algorithms, then try different combinations and reuse my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and similar hurdles), but then it is a very powerful research tool.
But there are many more algorithms, published sometime, somewhere, and rarely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out, enhance them, and use them on their actual data. So far, ELKI has mostly been used for algorithmic research, but it is starting to move out into the "real" world. More and more people who are not computer scientists are starting to use ELKI to analyze their data, because it has algorithms that no other tools have.
I tried to get ELKI into the Google Summer of Code, but it was not accepted. I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will, unfortunately, not be able to do myself in the next few years.
If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them and adding some documentation. Because if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can earn you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits simply get cited more, because people compare against them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented, at least to some extent, for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.
Update Dec 2012: ResearchGate still keeps sending me their spam. Most of my colleagues who had tried out RG have apparently deleted their accounts there, so the invitation mails are becoming fewer.
Please do not try to push this link on Wikipedia just because you are also annoyed by their emails. My blog is not a "reliable source" by Wikipedia standards. It solely reflects my personal view of that web site, not journalistic or scientific research.
The reason why I call ResearchGate spam is the weasel wording they use to trick authors into sending the invitation spam. Here's the text that comes with the checkbox you need to uncheck (from the ResearchGate "blog"):
Add my co-authors that are already using ResearchGate as contacts and invite those who are not yet members.
See how it is worded so that it sounds much more like "link my colleagues that are already on ResearchGate" than "send invitation emails to my colleagues"? It deliberately avoids mentioning "email", too. And according to the ResearchGate news post, this is hidden in "Edit Settings" as well (I never bothered to try it - I do not see any benefit to me in their offers, so why should I?).
Original post below:
If you are in science, you probably already received a couple of copies of the ResearchGate spam. They are trying to build a "Facebook for scientists", and so far, their main strategy seems to be aggressive invitation spam.
So far, I've received around 5 of their "invitations", which essentially sound like "Claim your papers now!" (without offering any actual benefit). When I asked my colleagues about these invitations, none of them had actually meant to invite me! This is why I consider this behaviour of ResearchGate to be spam. Plus, at least one of these messages was a reminder, not triggered by any user interaction.
Right now, they claim to have 1.9 million users. They also claim that "20% interact at least once a month". However, they have around 4000 Twitter followers and Facebook fans, and the top topics on their network have somewhere around 10000-50000 users. That yields a much more realistic estimate of the user count: 4k-40k. And the "20%" that interact might just be the 20% the site grew by in that timeframe - people who happened to click on the sign-up link once. For a "social networking" site, these numbers are pointless anyway; that is probably even less than MySpace.
Because I do not see any benefit in their offers! Before going on an extremely aggressive marketing campaign like this, they really should make sure they actually have something to offer...
And the science community is very much about not wasting time. It is a dangerous game that ResearchGate is playing here. It may appeal to their techies and investors to artificially inflate their user numbers into the millions. But if you pay for user numbers with your reputation, that is a bad deal! Once you have a reputation for being a spammer (and mind you, every scientist I've talked to so far complained about the spam and said "I clicked on it only to make it stop sending me emails"), it's hard to be taken seriously again. The scientific community is very much about reputation, and ResearchGate is screwing up badly on this.
In particular, according to the ResearchGate founder on Quora, the invitations are opt-out when "claiming" a paper. Sorry, this is wrong. Don't make users annoy other users by sending them unwanted invitations to a worthless service!
And after all, there are alternatives such as Academia and Mendeley that offer much more benefit. (I do not use these either, though. In my opinion, they also do not offer enough benefit to bother going to their website. I've mentioned the inaccuracy of Mendeley's data - and the lack of an option to get it corrected - before in an earlier blog post. Don't rely on Mendeley as a citation manager! Their citation data is unreviewed.)
I'm considering sending ResearchGate (they're based in Berlin, but there may also be a US office you could direct this to) a cease and desist letter, forbidding them to store personal information about me and to use my name on their websites to promote their "services". They may have visions of a more connected and more collaborative science, but they don't actually have new solutions. You can't solve everything by creating yet another web forum and "web2.0izing" everything. Although many of the web 2.0 bubble boys don't want to hear it: you won't solve world hunger and AIDS by doing another website. And there is a life outside the web.
A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms (part of my diploma thesis).
Hans-Peter Kriegel, Peer Kröger, Erich Schubert, Arthur Zimek
Scientific and Statistical Database Management (SSDBM 2008)
NoScript (sorry, Firefox only - there is no comparable functionality available in Chrome) is a must-have add-on for safer web surfing. It not only prevents many clickjacking attacks; it can do much more for you.
In the default setting, NoScript will block any script that you did not explicitly whitelist. While this is a bit annoying in the beginning - you will have to whitelist most of your everyday web pages - it will give you quite some insight into the amount of tracking that you are exposed to. A recent test showed that a typical newspaper website will carry tracking code from more than 10 other sites (mostly ad networks and social networks), and accepting these will probably pull in yet another set. Most of this happens in the background, tracking you across various web sites.
NoScript will essentially force you to make a decision for each site: permanently allow it, temporarily allow it, or block it. Since it blocks by default, you will easily see what works without and what does not - if it doesn't work as expected, and you need the site, you can allow it with just a few clicks.
But there is more functionality hidden inside. NoScript has a feature called ABE, the "Application Boundaries Enforcer". This can be seen as a refinement of NoScript: you whitelist not just individual web sites, but combinations of sites. I'll give you a simple example of why and how this is useful. Consider these ABE rules:
# Only Facebook may embed Facebook
Site .facebook.com .fbcdn.net .facebook.net
Accept from .facebook.com .fbcdn.net .facebook.net
Deny INCLUSION POST

# Only Google may embed Google +1
Site plusone.google.com
Accept from google.com
Deny INCLUSION POST
These rules are quite simple: essentially, they say that no website may access Facebook except Facebook itself, and no website may access Google +1 except Google. I chose these rules for multiple reasons.
Note that I did not block them altogether. I can still access the web pages as usual, if I want to. I even allowed links, but not scripts and similar embeddings.
And it doesn't just increase your privacy (read this current article in the NYT for an example of the amount of tracking happening these days). It also makes web pages load faster, because you don't load all their cruft all the time, and you can live without them showing you videos from 3 different domains next to the actual article you want to read ...
Update: I've learned that newer versions of Chrome can actually filter at load time (and not just at display time), and there is a similar extension available called ScriptNo. The main reason I'm currently moving away from Chromium is that it wastes more memory than Firefox, and I'm always short on RAM.
# Add a system group for Skype
addgroup --system skype
# Override permissions of skype (assuming Debian package!)
# 2755 makes the binary setgid, so it always runs with group "skype"
dpkg-statoverride --update --add root skype 2755 `which skype`
The following rules allow outgoing connections by Skype only on ports 80 and 443, which supposedly do not trigger the firewall (in fact, this filter is recommended by our network administration for Skype):

iptables -I OUTPUT -p tcp -m owner --gid-owner skype \
  -m multiport ! --dports 80,443 -j REJECT
iptables -I OUTPUT -p udp -m owner --gid-owner skype -j REJECT
which I've put just after the conntrack default module, as 05_skype.py:

""" Skype restriction to avoid firewall block. Raw iptables commands. """
iptables(Firewall.output,
         "-p tcp -m owner --gid-owner skype -m multiport ! --dports 80,443 -j %s" % Firewall.reject)
iptables(Firewall.output,
         "-p udp -m owner --gid-owner skype -j %s" % Firewall.reject)
!?reverse-depends(~i) ~M !?essential
This will display only packages that no other installed package directly depends on, and that are marked as automatically installed (so they are apparently being kept installed because of a weaker dependency).
~i !~M ?reverse-depends(~i) !?essential
This will catch installed but not automatically installed packages that another installed package depends on. Note that you should not blindly set all of these to "automatic". For example, "logrotate" depends on "cron | anacron | fcron". If you have both cron and anacron installed, aptitude will consider anacron to be unnecessary (which it is - on a system with 24h uptime). So review this list, see what happens when you set packages to "A", and reconsider your intentions. If it is software you definitely want, leave it on manual.
from gi.repository import Gio

s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
s.set_strv("layouts", ["de"])
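To check that the change took effect, the same settings object can read the key back with the matching getter (a minimal sketch):

from gi.repository import Gio

s = Gio.Settings.new("org.gnome.libgnomekbd.keyboard")
print(s.get_strv("layouts"))   # expected: ['de']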