These days I built up an inter-thread communication via os.pipe(). While one thread is only writing to the “write end” of the pipe, the other thread is only reading from the “read end”. The latter is realized with select.select(), which observes the read end and raises an os.read() event whenever it recognizes data in the pipe. Hence, whenever something is written to the pipe on the one side, it is immediately read on the other side. This means one string written by one call to os.write() is read entirely by one call to os.read().
Really? What happens, when the writing thread pushes a hugh amout of data into the pipe — with high frequency? I wanted to test this.
Analysis
From the writing thread, I pushed a string with fixed length L (and terminated by a newline) through the pipe; N times — one after the other, within a loop. In the other thread, I os.read() from the pipe, based on select.select() events and checked the length of each line returned by one os.read(). To evaluate the data, I needed a basic histogram. How to realize fast counting in Python? Here is a neat way to count events:
import collections histogram = collections.defaultdict(int) for event in events: histogram[event] +=1
If you access a non-existing key in a defaultdict(callable), it is initialized automatically by calling callable() without arguments. For the histogram this can be used by creating a defaultdict of type int. Accessing histogram[non-existing-key] first initializes the value with int(), which gives zero. So histogram[event] +=1 is secure and just counts your events without a need to pre-define them
This is the way I used it:
lines = os.read(pipe_read,9999999).splitlines() for line in lines: histogram[len(line)] += 1
I don’t need a big visualization. A sorted listing in histogram style is enough for me:
for key in sorted(histogram, key=histogram.get, reverse=1): print '%7d %s' % (histogram[key], key)
This sorts the dictionary by its values in reversed order (small first) and prints the key/value pairs with a bit of formatting.
The optimal result would be only one datapoint in the histogram: N times L. This is what I got for short strings. But with increasing L, I got a probability of about 10^-5 for chopped lines. For L = 150, I e.g. got ~100000 times 150 chars, and 1-3 times 1-149 chars.
This means that in some cases strings returned by os.read() looked like "thisisaline\nthisisaline\nthisis": The last few characters of the last line are chopped and will be returned with the next os.read(): “aline\nthisisaline\nthisisaline\n” — and so on. The good thing was, that the sum of all received characters always matched the number of chars written to the pipe (as expected, but I felt good by measuring it).
Result and solution
In extreme situations (long strings, high writing frequency), relying on the “one os.write(string), one os.read() returns entire string” principle is a bad idea, because in seldom cases lines will not be read entirely and, hence, get returned chopped. The solution is to accomplish some post-processing to reassamble chopped lines:
new_data = prefix + os.read(pipe_read,9999999)
lines = new_data.splitlines(True)
prefix = ''
if not lines[-1].endswith("\n"):
prefix = lines[-1]
del lines[-1]
for line in lines:
histogram[len(line.rstrip())] += 1
In line 2, I keep the trailing newline characters in the splitted parts. In lines 4-6, I check if the last part has a trailing newline. If not, then this is a chopped line! It is deleted from the list of lines and prepended to the next string returned by os.read() (see line 1). Before putting resulting lines into the histogram, trailing newlines are .rstrip()ped. Can you imagine what the histogram says now? ![]()
E.g. "6000000 150", which means 6000000 times received a string of length 150. That’s the N times L datapoint I wanted!
Conclusion
It was shown that one should not rely on simple os.write() and os.read() calls on an os.pipe() to accomplish a communication channel between two threads. A “communication protocol” has to be applied. Then, with a small post-processing and reassembling piece of code, the communication between the two treads is reliable and foreseeable: every trailing-newline-labeled-line written to the pipe by Thread 1 is read and evaluated entirely by Thread 2. Great
.