Flipping coins securely over the Internet

I have built Live Coin Flips after I’ve spent some time thinking how can two persons, in remote locations, flip a coin and agree on the result. It’s more challenging than what it looks like at first; flipping a coin while on a live video call is probably the easiest way, but it is not always feasible, and also any sleight-of-hand trick is much much easier to do remotely.

See for example these guys on YouTube predicting the coin flip outcome every time:

So I’ve come up with the idea of using a third-party, independent, unpredictable, universally agreed upon, continuous source of randomness. After some candidates I’ve singled out one: the Bitcoin price.

To make everything simpler, I’ve built a web application to calculate the outcome every minute. By deciding on a date in the future (it doesn’t need to be in the far future, just far enough for the Bitcoin price to not be published yet), you also commit on a certain coin flip outcome which will happen on that date.

You can see it in action here: Live Coin Flips

Let’s go back to static websites

In the era of continuously generated AI content, we can hardly describe the web as static. Yet, Amazon’s “static website” tool seems may be good enough for most.

I’ve collected a couple tools to create static websites on AWS S3, and manage the basic tasks (update SSL certificates, upload files and the like). Want to know why? Read on.

First, even by AWS price standards, static websites on S3 are dirt cheap. They scale with the amount of users, so for a small “experimental” website it might be pennies. This should be compared with the cost of a VPS that is in the 5$/month or so at the cheapest. If the traffic stays small, I have estimated that for around 100kB of data a page, the final cost for a million page loads would be around 10$. Egress fees is about the only cost here.

The other big advantage is scale. Serving a million page loads with a 5$/month VPS can be challenging. The few times by blog had thousands of visitors/hour, WordPress would struggle keeping up. For an S3 bucket however, that’s hardly a problem.

Third, why not; most of my blog viewers would perceive this website as “static” anyway (at least those that don’t leave comments… would you believe this was a thing? Leaving a comment in a website?). I have a feeling that anything that doesn’t need synchronous server-side logic can magically become “static”. Also platform maintenance is greatly simplified.

I’m probably not saying anything new here, but I always think about how much “the cloud” can scale up, without spending a minute thinking how much it can scale down. This is one of those occasions where it does. And actually, I wish it would happen more often.

Realistic, continuous hi-hat control module for Hydrogen (for edrums with CC pedal, with bonus cymbal choke)

I wrote this article back in 2012, to workaround some limitations in the Hydrogen drum machine. It was impossible to control the hi-hat from closed to open (and viceversa) seamlessly, so I wrote a MIDI router to fix it.

Here you can see it in action:

I wrote this script mostly in order to improve the Hydrogen hi-hat feeling, but I accidentally added a couple other features too.

It can seamlessly switch from closed to open hihat (and vice-versa), supports cymbals choke (both the ride, crash and via the hi-hat pedal), and slightly improves the dynamics range.

To use it, you will need:

  • an edrum (I used a Roland TD9)
  • several (at least 3) hi-hat samples from closed to open, i used those http://www.freesound.org/people/TicTacShutUp/packs/17/ plus some closed hihats from other sets
  • mididings (on Debian or Ubuntu, sudo apt-get install mididings)
  • this script: https://github.com/simonebaracchi/continuous-hh (if you have less or more than 7 hi-hat samples like I did, you’d better modify “hihats_pedal_range” in this script to make it match your number of samples)
  • in Hydrogen, setup your kit like this:
    • first sample from the top: footclick (will be note 36)
    • sample #2: completely closed hihat (will be note 37)
    • sample #3: slightly less closed hihat (will be note 38)
    • … and so on…
    • then, all the other instruments (it is crucial that all the hihats are consecutive, and my script expects to find the first on note #37)
  • in Hydrogen, disable all mute groups (they don’t sound great for this purpose), disable “ignore note-off”, and make it listen to channel 9 if it’s not already so (channel 10 does not seem to be good for noteoffs, maybe it’s faulty in mididings, i don’t know)
  • in your edrum: use channel 9, set footclick to note #36, hihat to note #37 (both open and closed if you have both), and hihat pedal to CC#04 (it’s called “foot(4)” on mine)

launch script (“mididings -f <script>”), route your midi data into mididings, then from mididings to hydrogen (or another sequencer, maybe for recording, if you like), start banging on your edrum, you should be good to go.

There used to be a discussion about it on the Hydrogen forums which are now unavailable, but you can still read it thanks to the Internet Wayback Machine.

 

Spaghetti Distance Index: measuring set similarity for sentences with a bag-of-words model (too literally)

Today we talk about how you shouldn’t measure sentence similarity.

“Bag of words” is a simplified model of handling sentences in natural language processing, which disregards grammar, punctuation, and word order and manages sentences as sets of words.

“Set similarity” is a measure of the distance between two sets (two collections of arbitrary items). There are many approaches in literature; I recently came across the Jaccard Index, while other alternatives are Sorensen Similarity Index and Hamming distance.
Jaccard Index (or Jaccard Distance) seems to be commonly used in text summarization algorithms, which use it to measure the similarity between two sentences (using a sentence as a set of words).

However Jaccard Index has some sub-optimal quirks. Its score is based on the number of items shared between the sets; when applied to sentence comparison, there are words that are very common (e.g. stopwords), especially in modern social network environments (e.g. Instagram hashtags), often used across different contexts (e.g. #picoftheday); I quickly realized not all words should be treated equally when measuring sentence similarity, and it is not trivial to evaluate if words are important or not.

Its score is also weighted on the set cardinality (the total number of words in the sentence); however this favours short sentences, for example it is way easier to have two sentences share all of their words when there is only one word in them. Big sets that have a large percentage of their items in common should get higher matching scores than smaller sets that have a similar percentage of matching items, since this has a smaller probability to happen for large sets.

This is where I tried to formulate a different, more context-aware, non-normalized, language-agnostic measure of similarity between sentences. I called this algorithm Spaghetti Distance Index. Its idea is, given two sentences, to measure the log-likelihood of ending up with a certain number of shared words. In formulas:

SDI(X,Y) = - \displaystyle\sum_{w \epsilon (X \cap Y)} log \left( |X| \ |Y| \ p(w) ^ 2 \right)

X and Y are the two sets, and |X| and |Y| are their size. p(w) is the probability of item w (simply put, the frequency of word w among all sentences). |X| p(w) is the probability of w appearing in set X, and similarly for Y. |X| \ |Y| \ p(w) ^ 2 is the probability an item w being in both sets X and Y. The less frequent is the item, the greater the probability of a relationship between the sets.

This algorithm has some advantages over Jaccard Index:

  • the similarity score is unbounded, so longer sentences can have higher scores;
  • it favours high percentages of matching words, non-shared words have a negative impact on score;
  • uncommon words contribute more to the score;
  • it maintains language agnosticity, and can be used to compare any set of items.

This algorithm assumes each word as chosen indipendently from the others; this is literally what a “bag of word” would get you if words are chosen blindly and randomly. On the other hand, this assumption is certainly not true: words in a sentence have a correlation between them. This is especially true for sentences belonging to a certain topic, i.e. if sentences are about pasta recipes, and one word is spaghetti, then there would be a somewhat high probability of another word being tomato.
Factorizing in the correlation between words would bring us to the Mutual Information algorithm, which relies exactly on the probability of a certain word given all the other words.

But if the problem is comparing sentence similarity, then possibly there are other domain-specific similarity measures such as tf-idf (which accounts for both term frequency in sentences, and for general term frequency among all words), BM25, or sentence clustering algorithms such as Latent Dirichlet allocation.

You can see the source code for Spaghetti Distance Index on GitHub.

POSIX Signals are not just bad, they’re awful

It all started with a problem which seem to be simple and already seen around in the real world:

I just need a simple script to restart a process if it crashes.

Simple enough right? So simple that I wanted to use a short bash script to do it. Turns out it’s not really that simple, especially if your thought process continues with

And I will use signals to control it.

Okay, now it’s really a bad idea which can fail for a lot of reasons:

 1. Checking if a process has crashed is not idiot-proof

The idiot being the one that writes a C program that exits main() with “return(-1);” or any other negative value. The exit status is a signed byte, thus a negative value will wrap around to a > 127 value and be indistinguishable from a process crash.

A program exiting with “return(-119);” has the same exit status as one killed with kill -9 (SIGKILL). The fun!

 2. You can’t always control signals in bash

According to POSIX specs, bash can’t change the signal handler for signals which were ignored in the parent process. So this script would be unusable if you happen to launch it from a process which has the relevant signals ignored. That’s where I got the idea to use a one-line python wrapper around my bash script to enable all signals before giving control to bash.

 3. Using a signal wrapper spawns a second process

It turned out quickly that it wasn’t a great idea. Using a python wrapper around a bash script will obviously generate a second process, one for bash and one for python.
So in just one move I spawned three more problems:

– the python wrapper has the process name you would expect (the wrapper name) and has every signal enabled, i.e. any signal would kill it
– the bash script has a different process name which makes it counter-intuitive which one is the process to send signals to
– the bash script starts after the python wrapper so if two wrappers start simultaneously, we have a funny race condition to deal with.

So, I just decided to drop bash and rewrite everything in python instead.

 4. There is no standard signal to ask a process to restart

Usually it’s SIGHUP, but it’s not universally true. If your controlled process can be restarted with a signal, that signal should be SIGHUP, but no guarantees.

 5. Signals are not setup immediately at startup

When your control script starts, all signals have their default handler. So for example if you launch your control script and then immediately decide to restart the controlled process (with a SIGHUP signal), it may happen that the controller gets killed instead, and the child process is left with no control.

 6. SIGKILL can’t be handled

If your control script is killed with SIGKILL, the child process is left running with no control.

 7. The child process exiting will have interesting race conditions

Say for example that the restart behaviour will be to send SIGHUP to the child process, wait for its exit, and launch the child process again.
If your control script is asked to restart the child process several times in a row, it might be sending several signals to a PID, which after the first SIGHUP might not correspond to any running process (not so bad) or correspond to a different process recently spawned (definitely not good).

In certain *NIX flavours you can setup a signal handler that will fire when the child process exits, but based on my research this is not true for every OS out there.

 8. There is no alternative to signals

I might have finished this list with better news, but unfortunately the only way to ask a process to terminate is, yes, signals.

The DOs:

 Use a lockfile to ensure a single instance of your controller is running.

 Use process groups.

 Handle SIGCHLD if your OS uses it to signal child processes exiting.

 Use systemd if your OS has it. It might get kind of long to config properly (especially for a dynamic list of processes), though.

PHP not parsing POST data?

If your PHP (or any other server side code) is receiving a single $_POST field of raw data, looking more or less like this, instead of receiving your form fields nicely:

——WebKitFormBoundaryVmWPyJJIBR3n18Wk
Content-Disposition: form-data; name=”form-file-1″; filename=”my document.doc”
Content-Type: application/octet-stream
——WebKitFormBoundaryVmWPyJJIBR3n18Wk
Content-Disposition: form-data; name=”form-field-1″

form-value-1
——WebKitFormBoundaryVmWPyJJIBR3n18Wk–

(that WebKitFormBoundary is probably the sign that you are using Chrome)

Then, chances are your javascript is overriding the content-type header. Check out the request headers, and if you find anything looking like this:

Content-Type:application/x-www-form-urlencoded;

or really anything else which is not this:

Content-Type:multipart/form-data; boundary=—-WebKitFormBoundaryVmWPyJJIBR3n18Wk

then you need to fix the content-type header. It might be simpler than you think; try setting it to `false` (in jQuery / Prototype / …) or disable anything javascript is setting there. The browser will take care of it better than what JS is doing.

In my case, I was getting this problem because I was using the js FormData object to extract form data and then sending it the same way it used to do with “regular” form data. Turns out the content-type header should be set differently. Maybe I saved you a couple of hours of searching.

Anagrhash, a anagram generator & reverse hash searching tool

Anagrhash is a tool which can generate anagrams, and test them against a hash.

It is a tool I wrote many years ago for educative purposes, to try out some different concurrency models and inter-thread communication methods in C, and get some basics of programming in the GNU/Linux environment.

It serves as an anagram generator, and can shuffle letters, but can also be set to shuffle whole words in a sentence, use tokens from a list (words which you don’t want to appear simultaneously), and specify separators between tokens (like spaces or commas).

It can also test anagrams which satisfies an exact target hash, or a hash specified with a regular expression.

You can get it on my GitHub page.

InconTRAMi, a “route planner for meeting points” with Milan public transports.

preview_incontrami

InconTRAMi is a route planner for Milan’s public transports network ATM, differently from “usual” route planners it will try to find the best meeting points for two persons. By inputing your starting points, it will show the more easily reachable places for both persons. By moving the mouse over the map it will show which lines will permit you to reach that place.

Short usage policy: it is very experimental, I decline every responsibility coming from the use or misuse of it. It considers just ATM’s network; it doesn’t consider line hopping; the data it’s using may be not up to date. It uses data kindly made available by Comune di Milano. See the application page for the full policy (in italian).

The source code is available on GitHub.

p.s. “route planner for meeting points” is a horrible name, I’m taking suggestions.