0
Answered

Read and write performance in arrays

Michael_ 5 years ago updated by Pavel Holoborodko 5 years ago 15
Hello.

Following example:

Double values:
A=zeros(6000,6000);
tic
for ii = 1:6000
for jj = 1:6000
A(ii,jj)=1.234;
end
end
toc
Elapsed time is 2.038870 seconds

Now without defining A before the for-loop
tic
for ii = 1:6000
for jj= 1:6000
A(ii,jj)=1.234;
end
end
toc
Elapsed time is 120.038870 seconds

Of course the performance is better if I define the array in advance.

Now the same for an mp array with 34 digits:
A=mp(zeros(200,200));
tic
for ii =1:200
for jj = 1:200
A(ii,jj)=mp('1.234');
end
end
toc
Elapsed time is 22.309700 seconds.

Now without defining A before the for-loop
tic
for ii =1:200
for jj = 1:200
A(ii,jj)=mp('1.234');
end
end
toc
Elapsed time is 14.581011 seconds.

How is this possible? This even becomes worse if the matrix is bigger.

Thanks for your answer.
Under review
Hello Michael,

This is a very good question!

At first, let me show results for 'vpa' matrices for the sake of completeness and fairness:

A=vpa(zeros(200,200));
tic
for ii =1:200
for jj = 1:200
A(ii,jj)=vpa('1.234');
end
end
toc

Elapsed time is 128.775 seconds.

tic
for ii =1:200
for jj = 1:200
A(ii,jj)=vpa('1.234');
end
end
toc

Elapsed time is 370.221 seconds.

Our computers are comparable in performance (as I see from your timings) - so that you should see approximately the same values on your system.

***
There are two natural questions:
1. Why we are faster (with or without pre-allocation)?
2. Why pre-allocation doesn't make any difference in our case?

Although MATLAB's language is simple to learn and great for computations, it was born as procedural, not object-oriented (which is needed to support custom types, like "mp", efficiently). 

Recently TMW has been working on adding OOP features, but still MATLAB provides only limited OOP-capabilities.
In particular, it doesn't allow to overload the destructor and basic assignment operator "=" for custom types (e.g. in case of A = B).

As a result, third-party toolboxes are not able to manage its own data and have to rely on legacy MEX API, which is slow and heavily based on copying the data on every call (copy-overhead).

When it comes to basic array operations (e.g. indexing, assignment, etc.) MATLAB provides its own routines for that. Such routines are highly optimized for usual arrays of "double" numbers. But they are extremely slow for arrays of custom objects.

If we combine all of these imperfections - we will get the terrible performance. Good example is VPA - it is based on standard OOP capabilities of  MATLAB and it is slower than "mp" by a large margin.

We do not think this is acceptable and we have been fighting with the situation since the beginning.
First, we are using undocumented MEX API to avoid the copy-overhead [1].
Secondly, we re-implemented all array manipulation operations from the scratch - in a specific way for "mp" objects [2].

Although this allows us to be much faster than VPA, still MEX/MATLAB overhead is of 70-80% of the total time!
Pre-allocation cannot provide predictable merit in these conditions too.

Next version of array manipulation engine is under development - it will introduce some new workarounds and hopefully it will be 5-10 times faster (at least for the simple case as you demonstrated). Pre-allocation will start to play an important role then.

Unfortunately most of the issues are related to restrictions in MATLAB itself. If they would allow more flexible plug-in architecture & OOP features - such operations won't take much longer than with built-in "double" precision arrays.

Sorry for such long e-mail, this is just ongoing PITA topic for us for a long time.

[1] http://goo.gl/6AvHfH
[2] http://goo.gl/AJGPg2

Hello.

Thank you for the quick answer. The increasing performance for array operations will give a great overall performance improvement. In my application, 98% of the time is needed to build the system of linear equations Ax=b (A[600,600]) and only 2% for solving it. :-) If I use the double precision from matlab it is like 40% to 60%.
Is there any possibility for vectorization?
A(1:n,1:m) = ...
The actual problem is a bit more complicated as the shown one. :) I already use as much vectorization as possible. (A(ii,1:end)) I also have a lot of function evaluations which take a long time in mp as well.
Michael,

Just want to let you know that the newest version of toolbox (available for download from website as usual) gives the following result:

A = mp(zeros(200,200));
p = mp('pi');
tic
for ii = 1:200
for jj = 1:200
  A(ii,jj) = p;
end
end
toc

Elapsed time is 2.97645 seconds.


It is approximately 5 times faster now :-). Referencing and concatenation is faster too (up to 2 times).  

I think I finally found the way to access the internal data of MATLAB directly without crashes.

Please test it in your environment. Would appreciate your feedback. 
Hello.

Thanks for the update.
I used the new version to build my system of equations A*x=b and it was about two times faster than before. (old 1100 seconds, new 540 seconds) The time to build the system in double precision is around 4 seconds. I do like 1.5 million function evalutions of nested functions with vectors of the length 2000. I get a slightly different result compared to the old version, but i will check again if I do not have an error here.
Great, this is at least something.

I guess now the major bottleneck is not in assignment and referencing.

By any chance, could you send me the code so that I can trace it?
I am sorry but I can not give you the complete code. I will try to make some minimal examples of routines which take the most time. But I guess one part will be the call of functions which have mp values as input and mp values inside the function.
Just minimal example would be enough. Very interesting what routines are called from toolbox in your functions.
Hi.

I went through my complete code step by step now.

Most evaluations takes less than 20 times that of double precision like the evaluation of nested functions. The evaluation of nested functions involving bessel functions took me 200 times that of double, but thats not a big problem because they take only a small portion of the complete evaluation time. Following is the problem in my case:

Only a dummy multiplication:

mp.Digits(34);
A=mp(ones(600,6000))+mp(ones(600,6000))*i;
B=mp(ones(600,6000))+mp(ones(600,6000))*i;
tic
A*B.';
toc
Elapsed time is 88.208659 seconds.

In double:

A=ones(600,6000)+ones(600,6000)*i;
B=ones(600,6000)+ones(600,6000)*i;
tic
A*B.';
toc
Elapsed time is 0.368915 seconds.

I need to do such multiplications like 4 times in my code which means about 360 seconds are needed in mp precision for that compared to 1.6 seconds in double precision. If the performance in all functions would be 20 times that of double would be great. :-)

Another question: Did you change the bessel functions already? Because I get a slightly different result since the update.

Thanks.
Hi Michael!

Matrix multiplication is already polished to the extreme.
At the moment, the only thing what we can do for speed-up - is to use CPU with more cores :). 

This will change in future, when we will finish porting to GPU. Then we can expect x5-x10 times increase in performance.

The Bessel module is being actively modified. In the last release I changed algorithms for Y0, Y1, J0 and J1. Still I do not consider it finished or mature. Classical recommended algorithms (I use) suffer from accuracy drop when argument is large and close to being pure imaginary. No publicly available references mention the situation. 

The main issue is information on good algorithms on computing Bessel for complex arguments is scarce. Actually there are only two published papers on the topic.

MATLAB itself uses 30-year old FORTRAN library developed by Amos (http://www.netlib.org/amos/). Now I am trying to make the underling algorithms precision-independent and convert it to C++.

Until then please increase precision of argument proportionally to its magnitude, or/and use the built-in MATLAB function.
Sorry for such inconvenience - hopefully will fix it soon.
Hello.

Thank you for the answer. I figured that quadruple precision should take about 20 times that of double according to this poster. (http://atip.org/SGWORKSHOP/Posters/Poster%20-%20MUKINOKI.pdf)
But maybe the reason for that is, that they use GPUs and not CPUs.
There are several ways to represent "quadruple".

We rely on IEEE 754 standard - http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format

Whereas what the authors use is not actually quadruple, but 31-decimal digit surrogate with reduced exponent range and accuracy issues with really small numbers.

However "double-double" has several advantages - it is better suitable for GPU and low-level CPU optimization tricks.

Some time ago I have implemented my own "double-double" library for tests on GPU. If I find it superior in speed - I will add it to the toolbox as an option.
Dear Michael,

Please update the toolbox to 3.7.6.8114 (released today - Dec. 31).
It has a little bit faster matrix multiplication (10%-50% depending on matrix) and more functions are using the multi-core optimizations among other things.