Thursday, February 22, 2007

Not code, but WYSIWYG presentation production in your browser

As the title suggests, this post isn't code related. It's merely a quick (and obviously unofficial) video demonstration of a web based ad creation product I've been working on. The design applet is made in Adobe Flash, and lets the user create templates and template-based banners, in a WYSIWYG way. Slides, transitions and in-slide animations can be added, and output is instantly generated as a standalone .swf which can be uploaded and used in just about any ad feed system.

Some have pointed out the ironic nature of me writing an ad design application, seeing as I've already released the "FlashMute" utility to control the volume of flash movies playing in your browser.

Anyway, the video can be found at http://itu.dk/people/einar/bannerfactory.avi, and there's an alternative WMV version at http://itu.dk/people/einar/bannerfactory.wmv. It's all in Norwegian, and the pictures are less-than fancy shots of random people at my office. You'll have to excuse that :-)

Saturday, February 10, 2007

Iteration techniques put on a bench(mark)

A couple of weeks ago I covered some of the additions Microsoft has made to native C++ for VC7-8. One of these additions is the "for each" loop, which can e.g. be used for STL container iteration.

While the loop certainly looks nice; how good does it optimize? Or concretely, how fast is it? To shed some light on this, I did a series of tests on three different iteration techniques. All code generations are done using VS8 with SP1, and a plain Release Build.

First off is a plain for loop, which iterates by a subscript index:

int sum = 0;
vector<int> v;
// ...

size_t l = v.size();
for(size_t i = 0; i < l; ++i)
{
sum = v[i];
}


The second variation is also a plain for loop, but this time iterators are used:

for(vector<int>::iterator i = v.begin(); i != v.end(); ++i)
{
sum = *i;
}


And finally, the for each loop:

for each (int i in v)
{
sum = i;
}


In analyzing these iterations, two main concerns will have to be included in the comparison.
  • How well does the technique handle a high number of elements to iterate?
  • How well does the technique handle being called often?


In addition, one analysis will be done with _SECURE_SCL switched on, and one when it's switched off.

With _SECURE_SCL

Listing A: Index iteration
Listing B: Iterators
Listing C: for each

What's worth noticing here is the mainly the calls to _invalid_parameter_noinfo, which will go through, granted events such as either of the following
  • the v.begin() iterator points to null.

  • v.end() - v.begin() is suddenly less than i mid-loop, when the prerequisite for the iteration was that i < the vector length determined prior to entering the loop.
You will also spot that there's a fair difference in the generated code, for one thing the code size. Of course, instruction count doesn't translate directly to execution overhead. The actual speed depends on your CPU architecture, and the fact that some instruction sequences optimize better (e.g. execute in parallell on some CPUs). Nevertheless, here's a table of comparison for the three techniques:

_SECURE_SCLNO _SECURE_SCL
First runEach iterationFirst runEach iteration
Index2413156
Iterator201495
for each2816105


In the above table, the "first run" column indicates how many instructions are required to setup the iteration sequence, and complete the first run of the iteration body. The "each iteration" columns indicate how many instructions are used for each following iteration. Again, the instruction count doesn't directly relate to the execution speed, but there it is. As seen from Listing C above, the for each loop has more verification code than that of the iterator technique. I actually find the generated code somewhat odd,

mov eax,dword ptr [v+8]
mov ecx,dword ptr [v+4]
cmp ecx,eax
push ebx
push esi
push edi
mov ebx,eax
jbe noproblem
call _invalid_parameter_noinfo
mov eax,dword ptr [v+8]
mov ecx,dword ptr [v+4]
noproblem:
cmp ecx,eax
mov esi,ecx


Notice the second initialization of eax and ecx. That code is never reached, as the application would terminate given that _invalid_parameter_noinfo is called. Now this particular construct may be there for the sake of optimizations, but I can't see how it would help.

Further down, there's another peculiar thing going on:

test edi,edi
je problem
cmp edi,offset v
je noproblem
problem:
call _invalid_parameter_noinfo
noproblem:
cmp esi,ebx
je endofloop
test edi,edi
...


The test at the top of this source block is the first instruction for each repeated iteration. The edi register will simply hold a pointer to the vector. Granted that no problems are detected, a typical flow (not listing the jumps) will look like:

test edi,edi
cmp edi,offset v
cmp esi,ebx
test edi,edi
...


So edi is being tested twice, and there's no way it can be changed inbetween. That really seems like redundancy to me.

Those remarks having been made, none of it really matters if you compile your application with _SECURE_SCL switched off.

Without _SECURE_SCL

Listing D: Index iteration
Listing E: Iterators
Listing F: for each

As the table shown earlier indicates, the instruction count is quite different for all three of the techniques, and there's really not much to say about any of them at this point. What does remain, is to see an indication of how fast they all run:

_SECURE_SCLNO _SECURE_SCL
100M items, 1 run10 items, 10M runs100M items, 1 run10 items, 10M runs
Index350410315280
Iterator490695350385
for each585800340385


All results in the above table are in milliseconds. 'M' indicates millions, 'k' thousands.

While this has not been a scientific test, but rather my mere findings on a standard desktop computer; it does give an indication of how the iterations may perform compared to each other. The index iteration is the quicker, regardless of the _SECURE_SCL setting, how many items are iterated and how frequently the iteration sequence is made. What's interesting to see, is that the slowdown experienced from the "for each" to the iterator technique goes away entirely when _SECURE_SCL is disabled. This may strengthen my notion of the "for each" not being optimally generated, but jumping to that conclusion from these (highly) non-scientific benchmarks is a long stretch. It does make you wonder, though. In either case, the speed difference isn't horrific (except perhaps that of between the index iteration and "for each" seen in the seconc column).