Vitavonni

ELKI data mining in 2013

ELKI, the data mining framework I use for all my research, is coming along nicely, and will see continued progress in 2013. The next release is scheduled for SIGMOD 2013, where we will be presenting the novel 3D parallel coordinates visualization we recently developed. This release will bear the version number 0.6.0.
Version 0.5.5 of ELKI has been in Debian unstable since December (version 0.5.0 will be in the next stable release) and is also in Ubuntu raring. The packaged installation can share its dependencies with other Debian packages, so it is smaller than the download from the ELKI web site.
If you are developing cluster analysis or outlier detection algorithms, I would love to see them contributed to ELKI. If I get clean and well-integrated code by mid-June, your algorithm could be included in the next release, too. Publishing your algorithms as source code in a larger framework such as ELKI will often give you more citations, because it is then easier to compare with your algorithm and to try it on new problems. And, well, citation counts are a measure that administration loves to judge researchers by ...
So what else is happening with ELKI:
  • The new book "Outlier Analysis" by C. C. Aggarwal mentions ELKI for visual evaluation of outlier results as well as in the "Resources for the Practitioner" section and cites around 10 publications closely related to ELKI.
  • Some classes for color feature extraction of ELKI have been contributed to jFeatureLib, a Java library for feature detection in image data.
  • I'd love to participate in the Google Summer of Code, but I need a contact at Google to "vouch" for the project, otherwise it is hard to get in. I've been sending a couple of emails, but so far have not heard back much yet.
  • As the performance of SVG/Batik is not too good, I'd like to see more OpenGL based visualizations. This could also lead to an Android based version for use on tablets.
  • As I'm not a UI guy, I would love to have someone make a fancier UI that still exposes all the rich functions we have. The current UI is essentially an automatically generated command line builder - which is nice, as new functionality shows up without the need to modify UI code. It's good for experienced users like me, but it makes it hard for beginners to get started.
  • I'd love to see integration of ELKI with e.g. OpenRefine / Google Refine to make it easier to do appropriate data cleaning and preprocessing.
  • There is work underway for a distributed version running on Hadoop/YARN.
2013-02-28 10:55 — Categories: English Coding Debian Research

Migrating from GNOME3 to XFCE

I have been a GNOME fan for years. I actually liked the switch from 1.x to 2.x, and at some point switched to 3.x when it became somewhat usable. At some point, I even started some small Gnome projects, one even was uploaded to the Gnome repositories. But I didn't have much time for my Linux hobby anymore back then.
However, I am now switching to XFCE. And for all I can tell, I am about the last one to make that switch. Everybody I know hates the new Gnome.
My reason is not emotional. It's simple: I have systems that don't work well with OpenGL, and thus don't work well with Gnome Shell. Up to now, I have been living fine with "Fallback mode" (aka: Gnome classic). It works really well for me, and does exactly what I need. But it has been all over the media: Gnome 3.8 will drop 'fallback' mode.
Now the choice is obvious: instead of switching to shell, I go to XFCE. Which is much closer to the original Gnome experience, and very productivity oriented.
There are tons of rants on GNOME 3 (for one of the most detailed ones, see Gnome rotting in threes, going through various issues). Something must be very wrong about what they are doing to receive this many sh*tstorms all the time. Every project receives some. I've even received a share of the Gnome 2 storms when Galeon (an early Gnome browser) made the move and started dropping some of the hard-to-explain and barely used options that would break with every other Mozilla release. And Mozilla embedding was a major pain these days. Yet, for every feature there would be some user somewhere that loved it, and as Debian maintainer of Galeon, I got to see all the complaints (and at the same time was well aware of the bugs caused by the feature overload).
Yet with Gnome 3, things are IMHO a lot different. In Gnome 2, it was a lot about making things more usable as they are, a bit cleaner and more efficient. With Gnome 3, it seems to be about experimenting with new stuff. Which is why it keeps on breaking APIs all the time. For example, theming GTK 3 is constantly broken; most of the themes available just don't work. Similarly with Gnome Shell extensions: most of them work with exactly one version of Gnome Shell (doesn't this indicate the author has abandoned Gnome Shell?).
But the one thing that really stuck out was when I updated my dad's PC. Apart from some glitches, he could not even shut down his PC with Gnome Shell, because you needed to press the Alt key to actually get a shutdown option.
This is indicative of where Gnome is heading: something undefined in between PCs, tablets, media centers and mobile phones. They just decided that users don't need to shut down anymore, so they might as well drop that option.
But the worst thing about the current state of GNOME is: they happily live with it. They don't care that they are losing users by the dozens, because to them, these are just "complainers". Of course there is some truth in "Complainers gonna complain and haters gonna hate". But what Gnome is receiving is way above average. At some point, they should listen. 200-post comment chains from dozens of people on LWN are not just your average "complaints". It's an indicator that a key user base is unhappy with the software. In 2010, GNOME 2 had a 45% market share in the LinuxQuestions poll, and XFCE had 15%. In 2011, GNOME 3 had 19%, and XFCE jumped to 28%. And I wouldn't be surprised if GNOME 3 shell (not counting fallback mode) clocked in at less than 10% in 2012, despite being the default.
Don't get me wrong: there is a lot about Gnome that I really like. But as they decided to drop my preferred UI, I am of course looking for alternatives. In particular, as I can get lots of the Gnome 3 benefits with XFCE. There is a lot in the Gnome ecosystem that I value, and that IMHO is driving Linux forward. Network-manager, Poppler, Pulseaudio, Clutter just to name a few. Usually, the stuff that is modular is really good. And in fact I have been a happy user of the "fallback" mode, too. Yet, the overall "desktop" Gnome 3 goals are in my opinion targeting the wrong user group. Gnome might need to target Linux developers more again, to keep a healthy development community around. Frequently triggering sh*tstorms by high-profile people such as Linus Torvalds is not going to strengthen the community. There is nothing wrong in the FL/OSS community with encouraging people to use XFCE. But these are developers that Gnome might need at some point.
On a backend / technical level (away from the Shell/UI stuff that most of the rants are about), my main concern about the Gnome future is GTK3. GTK2 was a good toolkit for cross-platform development. GTK3 as of now is not, but is largely a Linux/Unix only toolkit - in particular, because there apparently is no up to date Win32 port. With GTK 3.4 it was said that Windows builds were being worked on - but as of GTK 3.6 they are still nowhere to be found. So if you want to develop cross-platform, as of now, you had better stay away from GTK 3. If this doesn't change soon, GTK might sooner or later lose the API battle to more portable libraries.
Update: Some people at reddit seem to read this as if I am switching out of protest. This is incorrect. As "fallback" mode is now officially discontinued, I switch to the next best choice for me: XFCE. And I do this switch before things start breaking with some random upgrade. I know that XFCE is a good choice, so why not switch early? In fact, I've right now only given XFCE a test drive, but it already feels right, and maybe even slightly better than fallback mode.
2012-11-13 14:14 — Categories: English Linux Coding

DBSCAN and OPTICS clustering

DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.
An interesting property of density based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow "noise" objects that do not belong to any of the clusters. K-means, for example, partitions the data space into Voronoi cells (some people claim it produces spherical clusters - that is incorrect). See Wikipedia for the true shape of K-means clusters and an example that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects) - not surprisingly, as we tend to experiment with artificial data generated by a number of Gaussian distributions.
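To make the Voronoi point concrete, here is a minimal Java sketch of the k-means assignment step. The class and method names are made up for illustration, and this is not ELKI code. Each point goes to its nearest center, so the region assigned to one center is exactly that center's Voronoi cell, no matter how the data itself is shaped.

```java
// Illustrative only: the assignment step of k-means, not ELKI's implementation.
final class NearestCenterSketch {
  // Returns the index of the center closest to the point
  // (squared Euclidean distance, which gives the same ranking as Euclidean).
  static int nearestCenter(double[] point, double[][] centers) {
    int best = -1;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centers.length; c++) {
      double dist = 0;
      for (int d = 0; d < point.length; d++) {
        double diff = point[d] - centers[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }
}
```

With two centers, the boundary between the two cells is the perpendicular bisector of the line connecting them; nothing in this step knows about spheres.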
The key parameter to DBSCAN and OPTICS is the "minPts" parameter. It roughly controls the minimum size of a cluster. If you set it too low, everything will become clusters (OPTICS with minPts=2 degenerates to a type of single link clustering). If you set it too high, at some point there won't be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you for example expect clusters to typically have 100 objects, I'd start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.
The more difficult parameter for DBSCAN is the radius. In some cases, it will be very obvious. Say you are clustering users on a map. Then you might know that a good radius is 1 km. Or 10 km. Whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
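To show how the two parameters interact, here is a minimal DBSCAN sketch in Java. It is a teaching version under my own assumptions (Euclidean distance, a linear scan instead of an index, the query point counting towards minPts), with made-up names, and it is not the ELKI implementation.

```java
import java.util.Arrays;

// Illustrative DBSCAN with linear-scan range queries; not ELKI's code.
final class DbscanSketch {
  static final int UNVISITED = -2, NOISE = -1;

  // Returns one label per point: NOISE or a cluster number 0, 1, 2, ...
  static int[] dbscan(double[][] data, double epsilon, int minPts) {
    int n = data.length;
    int[] labels = new int[n];
    Arrays.fill(labels, UNVISITED);
    int[] stack = new int[n]; // each point is pushed at most once per run
    int cluster = 0;
    for (int i = 0; i < n; i++) {
      if (labels[i] != UNVISITED) continue;
      if (rangeQuery(data, i, epsilon).length < minPts) {
        labels[i] = NOISE; // may later become a border point
        continue;
      }
      // i is a core point: start a new cluster and expand it.
      int top = 0;
      labels[i] = cluster;
      stack[top++] = i;
      while (top > 0) {
        int j = stack[--top];
        int[] neighbors = rangeQuery(data, j, epsilon);
        if (neighbors.length < minPts) continue; // border point: do not expand
        for (int m : neighbors) {
          if (labels[m] == NOISE) {
            labels[m] = cluster; // former noise becomes a border point
          } else if (labels[m] == UNVISITED) {
            labels[m] = cluster;
            stack[top++] = m;
          }
        }
      }
      cluster++;
    }
    return labels;
  }

  // Indices of all points within epsilon of point q (including q itself).
  static int[] rangeQuery(double[][] data, int q, double epsilon) {
    int[] result = new int[data.length];
    int count = 0;
    for (int i = 0; i < data.length; i++) {
      if (distance(data[q], data[i]) <= epsilon) result[count++] = i;
    }
    return Arrays.copyOf(result, count);
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

The linear scan in rangeQuery is what makes this quadratic overall; replacing it with an index lookup is where a spatial index pays off.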
OPTICS is based on a very clever idea: instead of fixing minPts and the radius, we only fix minPts, and plot the radius at which an object would be considered dense by DBSCAN. In order to sort the objects on this plot, we process them in a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example of such a plot.
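The "radius at which an object would be considered dense" is usually called the core distance: the distance to the minPts-th nearest neighbor. Here is a rough Java sketch of just that quantity, assuming Euclidean distance and counting the point itself as its own first neighbor. The names are hypothetical, and the full OPTICS algorithm additionally needs the heap-based ordering and reachability distances, which this sketch leaves out.

```java
import java.util.Arrays;

// Illustrative only: the core distance used by OPTICS, via a brute-force scan.
final class CoreDistanceSketch {
  // Smallest radius at which point q has at least minPts neighbors,
  // counting q itself (an assumption of this sketch).
  static double coreDistance(double[][] data, int q, int minPts) {
    if (minPts < 1 || minPts > data.length) {
      throw new IllegalArgumentException("minPts must be in 1..n");
    }
    double[] dists = new double[data.length];
    for (int i = 0; i < data.length; i++) {
      dists[i] = distance(data[q], data[i]);
    }
    Arrays.sort(dists);
    return dists[minPts - 1];
  }

  static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}
```

Running DBSCAN with a radius at or above this value makes q a core point; below it, q is at best a border point.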
OPTICS comes at a cost compared to DBSCAN. Largely because of the priority heap, but also as the nearest neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won't produce a strict partitioning. Primarily it produces this plot, and in many situations you will actually want to visually inspect the plot. There are some methods to extract a hierarchical partitioning out of this plot, based on detecting "steep" areas.
The open source ELKI data mining framework (package "elki" in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I've benchmarked this against GNU R ("fpc" package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes, with index down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka took 11 hours (Update 11.1.2013: 84 minutes with the 1.0.3 revision of the DBSCAN extension, which resolved some performance issues) and GNU R/fpc takes 100 minutes (DBSCAN, no OPTICS available). And the implementation of OPTICS in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version ...
ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.
Some part of the code may be a bit confusing at first. The "Parameterizer" classes serve the purpose of allowing automatic UI generation, for example. So there is quite a bit of meta code involved.
Plus, ELKI is quite extensively optimized. For example, it does not use Java Collections much anymore. Java Iterators, for example, require returning an object on next();. The C++ style iterators used by ELKI can have multiple values, and primitive values.
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())
is a typical for loop in ELKI, iterating over all objects of a relation, yet the whole loop requires creating (and GC'ing) only a single object. And actually, this is as literal as a for loop can get.
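For readers who have not seen this iterator style in Java, here is a small self-contained sketch of the pattern. The interface and class names are hypothetical and are not ELKI's actual DBID classes; the point is only that the iterator exposes a primitive value and never returns a new object per element.

```java
// Hypothetical C++-style iterator: no object is returned per element.
interface IntIter {
  boolean valid();  // true while the iterator points at an element

  void advance();   // move to the next element

  int intValue();   // current element as a primitive, no boxing
}

// Iterates over a plain int array.
final class ArrayIntIter implements IntIter {
  private final int[] data;
  private int pos = 0;

  ArrayIntIter(int[] data) {
    this.data = data;
  }

  public boolean valid() {
    return pos < data.length;
  }

  public void advance() {
    pos++;
  }

  public int intValue() {
    return data[pos];
  }
}

final class IterSketch {
  // Sums all ids; the only allocation in the loop is the iterator itself.
  static long sum(int[] ids) {
    long total = 0;
    for (IntIter it = new ArrayIntIter(ids); it.valid(); it.advance()) {
      total += it.intValue();
    }
    return total;
  }
}
```

The same interface could expose several getters at once (say, an id and a distance), which a java.util.Iterator cannot do without wrapping them in an object.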
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);
is another example. Essentially, this is like a HashSet<DBID>. Except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
Java advocates always accuse you of premature optimization when you avoid creating objects for primitives. Yet, in all my benchmarking, I have consistently seen that the number of objects you allocate has a major impact, at least inside a loop that is heavily used. Java collections with boxed primitives just eat a lot of memory, and the memory management overhead often makes a huge difference. Which is why libraries such as Trove (which ELKI uses a lot) exist. Because memory usage does make a difference.
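To give an idea of what "stored more efficiently" can mean, here is a toy hash set over primitive ints using open addressing with linear probing. This is my own simplified sketch with made-up names, not the ELKI or Trove implementation; it only shows that membership tests can work on plain arrays without any Integer objects.

```java
// Toy primitive int hash set (open addressing, linear probing).
// Illustrative only; real libraries are considerably more refined.
final class IntHashSetSketch {
  private int[] table;     // slots holding the stored ids
  private boolean[] used;  // marks which slots are occupied
  private int size = 0;

  IntHashSetSketch(int expectedSize) {
    int capacity = 16;
    while (capacity < expectedSize * 2) capacity <<= 1; // keep load <= 0.5
    table = new int[capacity];
    used = new boolean[capacity];
  }

  // Adds the id; returns false if it was already present.
  boolean add(int id) {
    if ((size + 1) * 2 > table.length) grow();
    int slot = find(id);
    if (used[slot]) return false;
    table[slot] = id;
    used[slot] = true;
    size++;
    return true;
  }

  boolean contains(int id) {
    return used[find(id)];
  }

  int size() {
    return size;
  }

  // Returns the slot holding id, or the first free slot on its probe path.
  private int find(int id) {
    int mask = table.length - 1;
    int slot = ((id * 0x9E3779B9) >>> 16) & mask; // simple multiplicative hash
    while (used[slot] && table[slot] != id) slot = (slot + 1) & mask;
    return slot;
  }

  // Doubles the capacity and reinserts all stored ids.
  private void grow() {
    int[] oldTable = table;
    boolean[] oldUsed = used;
    table = new int[oldTable.length * 2];
    used = new boolean[table.length];
    size = 0;
    for (int i = 0; i < oldTable.length; i++) {
      if (oldUsed[i]) add(oldTable[i]);
    }
  }
}
```

Compared to a HashSet<Integer>, there is no entry object and no boxed Integer per element, just two flat arrays.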
(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)
2012-11-02 14:52 — Categories: English Coding

ELKI call for contributions

ELKI is a data mining software project that I have been working on for the last few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from command line.

The key feature that sets ELKI apart from existing open source tools used in data mining (e.g. Weka and R) is that it has support for index structures to speed up algorithms, and a very modular architecture that allows various combinations of data types, distance functions, index structures and algorithms. When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set with 110250 images described by 8 dimensional color histograms. This is a decently sized dataset: it takes long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice that of the more flexible "flexclust" version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as opposed to 11 hours (Weka). Update 11.1.2013: after some code cleanup in Weka, this is down to 84 minutes.
The reason why I was surprised is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which for example does not assume distances to be of type double, and just has a lot of glue code in between. However, obviously, the Java HotSpot compiler actually lives up to expectations and manages to inline the whole distance computation into k-means, and then compiles it at a level comparable to C. R executes vectorized operations quite fast, but on non-native code as in the LOF example it can become quite slow, too. (I would not take Weka as reference; in particular with DBSCAN and OPTICS there seems to be something seriously broken. Update 11.1.2013: Eibe Frank from Weka had a look at Weka DBSCAN, and removed some unnecessary safety checks in the code, yielding a 7.5x speedup. Judging from a quick look at it, the OPTICS implementation actually is not even complete, and both implementations actually copy all data out of Weka into a custom linear database, process it there, then feed back the result into Weka. They should just drop that "extension" altogether. The much newer and Weka-like LOF module is much more comparable.)

Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain, I dare to say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end they do not really fit together.

Anyway, ELKI is a cool research project. It keeps on growing, we have a number of students writing extensions as part of their thesis. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and use my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and such hurdles), but then it is a very powerful research tool.

But there are just many more algorithms, published sometime, somewhere, but barely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them for their actual data. So far, ELKI was mostly used for algorithmic research, but it's starting to move out into the "real" world. More and more people that are not computer scientists start using ELKI to analyze their data. Because it has algorithms that no other tools have.

I tried to get ELKI into the "Google Summer of Code", but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will not be able to do myself the next years, unfortunately.

  • A web browser frontend would be cool. Maybe even connected to Google Refine, using Refine for preprocessing the data, then migrating it into ELKI for analysis. The current visualization engine of ELKI is using SVG - this should be fairly easy to port into the web browser. Likely, the web browsers will even be faster than the current Apache Batik renderer.
  • Visual programming frontend. Weka, RapidMiner, Orange: they all have visual programming style UIs. This seems to work quite well to model the data flow within the analysis. I'd love to see this for ELKI, too.
  • Cluster/Cloud backend. ELKI can already handle fairly large data sets on a big enough system. If someone spends extra effort on the index structures, the data won't even need to fit into main memory anymore. Yet, everybody now wants "big data", and parallel computation probably is the future. I'm currently working on some first Hadoop YARN based experiments with ELKI. But this is a huge field, turning this into true "elk yarn". I will likely only lay some foundations (unless I get funding to continue as a PostDoc on this project. I sure hope to get to do at least a few years of postdoc somewhere, as I really enjoy working with students on this kind of project)
  • New visualization engine. The current visualization engine, based on Apache Batik and SVG, is quite okay. It does what I need, which is to get a quick glance at the results and the ability to export them for publications in print quality (in particular, I can easily edit the SVG files with Inkscape). But it is not really something fancy (although we have a lot of cool visualizations). And it is slow. I haven't found a portable and fast graphics toolkit for Java yet that can produce SVG files. There is a lot of hype around Processing, for example, but it seems to be too much about art for me. In fact, I'd love to use either something like Clutter or Cairo. But getting them to work for Windows and Mac OS X will likely be a pain.
  • Human Computer Interaction (HCI). This is in my opinion the biggest challenge we are facing with all the "big data" stuff. If you really go into big data (and not just run Hadoop and Mahout on a single system; yes - a lot of people seem to do this), you will at some point need to go beyond just crunching the numbers. So far, the challenges that we are tackling are largely data summarization and selection. TeraSort is a cool project, and a challenge. Yet, what do you actually get from sorting this large amount of data? What do you get from running k-means on a terabyte? When doing data mining on a small data set, you quickly learn that the main challenge actually is preprocessing the data and choosing parameters the right way so that your result is not bogus. Unless you are doing simple prediction tasks, you often don't have a clearly defined objective. Sure, when predicting churn rates, you can just throw all the data into a big cloud and hope you get some enlightenment out. But when you are doing cluster analysis or outlier detection - unsupervised methods - the actual objective by definition cannot be hardcoded into a computer. The key objective then is to learn something new about the data set. But if you want to have your user learn something about the data set, you will have to have the user guide the whole process, and you will have to present results to the user. Which gets immensely more difficult with larger data. Big data just does no longer look like this. And neither are the algorithms as simple as k-means or hierarchical clustering. Hierarchical clustering is good for teaching the basic ideas of clustering. But you will not be using a dendrogram for a huge data set. Plus, it has a naive complexity of O(n^3) and for some special cases O(n^2) - too slow for truly big data.
    For the "big data future", once we get over all the excitement of being able to just somehow crunch these numbers, we will need to seriously look into what to do with the results (in particular, how to present them to the user), and how to make the algorithms accessible and usable for non-techies. Right now, you cannot expect a social sciences researcher to be able to use a Hadoop cluster, let alone to make sense of the results. But if you are smart enough to actually solve this, and open up "big data processing" to the average non-IT user, this will be big.
  • Oh, and of course there are just hundreds of algorithms not yet available (accessible) as open source. Not in ELKI, and usually not anywhere else either. Just to name a few from my wishlist (I could probably implement many of them in a few hours in ELKI, but I don't have the time to do so myself, plus they are good student or starter projects to get used to ELKI): BIRCH, CLARA, CLARANS, CLINK, COBWEB, CURE, DOC, DUSC, EDSC, INSCY, MAFIA, P3C, SCHISM, STATPC, SURFING, ...

If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them, adding some documentation. Because, if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits just do get cited more, because people compare to them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented at least to some extent for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.

2012-09-02 13:40 — Categories: English Research Java Coding ELKI

ELKI 0.4 beta release

Two weeks ago, I published the first beta of the upcoming ELKI 0.4 release. The accompanying publication at SSTD 2011 won the "Best Demonstration Paper Award"!
ELKI is a Java framework for developing data mining algorithms and index structures. It has indexes such as the R*-Tree and M-Tree, and a huge collection of algorithms and distance functions. These are all written rather generically, so you can build all the combinations of indexes, algorithms and distances. There are evaluation and visualization modules.
Note that I'm using "data mining" in the broad, original sense that focuses on knowledge discovery by unsupervised methods such as clustering and outlier detection. Today, many people just think of machine learning and "artificial intelligence" - or even worse: large scale data collecting - when they hear data mining. But there is much more to it than just learning!
Java comes at a certain price. The latest version is already around 50% faster than the previous release, just from reducing Java boxing and unboxing, which puts quite some pressure on the memory management. So you could implement these things in C to become a lot faster; but this is not production software. I need code that students can work with and extend; this is much more important than getting the maximum speed. You can probably still use this for prototyping. See what works, then implement just that which you really need in a low level language for maximum performance.
You can do some of that in Java. You could work on a large chunk of doubles, and access them via the Unsafe class. But is that then still Java, or aren't you actually doing just plain C? In our framework, we want to be able to support non-numerical vectors and non-double distances, too. Even when they are only applicable to certain specialized use cases. Plus, generic and Java-style code is usually much more readable, and the performance cost is not critical for research use.
Release 0.4 has plenty of under the hood changes. It allows multiple indexes to exist in parallel, and it supports multi-relational data. There are also a dozen new algorithms, mostly from the geo/spatial outlier field, which were used for the demonstration. But for example, it also includes methods for rescaling the output of outlier detection methods to a more sensible numerical scale for visualization and comparison.
You can install ELKI on a Debian testing and unstable system by the usual "aptitude install elki" command. It will install a menu entry for the UI and also includes the command-line launcher "elki-cli" for batch operation. The "-h" flag can produce an extensive online help, or you can just copy the parameters from the GUI. By reusing Java packages such as Batik and FOP already in Debian, this also is a smaller download. I guess the package will at some point also transition to Ubuntu - since it is Java you can just download and install it anyway I guess.
2011-08-29 18:20 — Categories: English Coding Research ELKI Java

Pyroman IPv6 support

I've added IPv6 support to my firewall tool Pyroman, and uploaded a package to experimental. But of course you can just check out the source code from Subversion and call it as bin/pyroman without installation.
Pyroman will try to produce a consistent set of rules for IPv4 and IPv6. Originally it was designed for complex firewalls with multiple interfaces, various rules and NAT. I have so far only tested this version on my single-host setup at home, in particular NAT might break.
Pyroman has extensive debug functions. You can try --print-verbose to see why it produced which rules. By invoking pyroman safe you will tell it to revert any changes unless you type OK at the prompt.
And if it fails to compute firewall rules, or there is some iptables error, it will also restore the previous state.
So you have plenty of options to give it a try without risking a mess. Just start with configuring it to the point where you like the "--print" output. Then give the "safe" mode a try next.
Check the Pyroman homepage for the features. There is more. Pyroman is a lot faster than most other firewall tools, because it does not perform hundreds of iptables invocations but uses iptables-restore to bulk load the rules. This is the fastest way to bring the firewall from one configured state into another. For the 0.6 version of pyroman I plan to offer precomputing the firewall rules, and to use a single iptables-restore call at bootup to set up your firewall, with dependency tracking to see if the precomputed file is still up to date.
2011-08-17 20:54 — Categories: Linux Debian Coding

Documenting fat-jar licenses

Dear Lazyweb,
What is the appropriate way to document the individual licenses of a fat jar (a jar archive that includes the program along with all its dependencies)?
I've been living in the happy world of Linux distributions, where one would just package the dependencies independently (or just specify which existing packages you use, actually, since most is already packaged by helpful other developers). But for the non-Linux people, a fat jar seems to be important so they can double-click it to run the application.
When building larger Java applications, you end up using a couple of external libraries such as various Apache Commons components and Apache Batik. I'm currently including all their LICENSE.txt files in a directory named "legal", and I'm trying to make it as obvious as possible which license applies to which parts of the jar archive. Is there any best practice for doing this? I don't want to reinvent the wheel; I'd also like to avoid any common legal pitfalls, obviously.
Feel free to respond either using the Disqus comment function or by email via erich () debian org
2011-08-16 17:23 — Categories: English Coding Java

Dear Lazyweb, how to write multi-locale python code

Dear Lazyweb,
I've been toying around with a Python WSGI application, i.e. a multi-threaded persistent web application. Now I'd like to add multi-language support to this application. I need to format datetimes to human readable formats, but I haven't found a way yet to do this in a sane way using strftime. Essentially, strftime will use the current application locale; however, since I'm running multi-threaded, different threads might want to use different locales. So changing the locale is bound to cause race conditions.
So what is the best way to pretty-print (including week day names!) datetime, currency and similar values in a multi-threaded multi-locale context in python? Gettext and manually emulating strftime doesn't sound that sensible to me. And of course, I don't want to have to translate the weekday names myself into any language I choose to support...
2011-05-27 18:23 — Categories: English Coding

Partial subversion mirror

Does anyone know how to set up a partial subversion mirror?

Essentially we have a source code tree, containing various subdirectories, including doc, test, src/tld/domain and src/experimentalcode/usernames.
We'd like to allow public access to the main source subtree, while not exposing the various user directories to the web server (the master subversion is separate anyway; since I want to use Trac I need a local SVN copy!)

I'm currently trying to set up "tailor" to do this, but it is not trivial: it seems easy at first to just ignore any change in a particular folder. But when you encounter "move" aka "rename" operations that rename a file from the non-exposed folder to the exposed folder, they will obviously fail on the "stripped" repository. So they need to be mapped to an "add" operation there!

Any hints? I might have it working using some hacks in tailor, but I'd like to know any simple solution, if there is any ...

2010-11-11 17:32 — Categories: English Linux Coding