VAST is a family of powerful software tools
designed to automatically parallelize or vectorize code for improved application performance. All code restructuring is done automatically at the click
of a mouse, and the original source files are retained in unmodified form.
VAST-F/Parallel can automatically convert serial code into code designed for multi-core or multi-processor systems and is ideal for use with compilers that do not include auto-parallel capabilities. VAST-F/Parallel is also an excellent option for use with auto-parallelizing compilers (such as Absoft) to apply additional optimizations and further explore multi-threading opportunities. VAST-F/Parallel also provides OpenMP support.
VAST-F/Vector is designed to fully leverage the added performance capabilities of the hardware vector (AltiVec) units in IBM's POWER family of processors (G4/G5 Mac/PPC and Linux/POWER). On certain applications, customers have reported up to a 10x increase in application performance when using Absoft Fortran with VAST-F/Vector.
Key Features of VAST Include:
- Full Loop Nest Analysis. Loops are analyzed in simple and complicated
loop nests; loops containing the largest amount of work are parallelized.
Loops do not have to be tightly nested.
- Extended Parallel Regions. VAST/Parallel extends parallel regions
to include multiple parallel loops and intervening scalar code. This
cuts down on parallel overhead.
- Threshold testing. All parallel systems have some overhead. When
VAST/Parallel finds a parallel region whose amount of work is not
known at compile time, it creates a run-time test: the parallel region
is executed only if there is enough work; otherwise, the original
serial version is executed.
- Dependence Analysis. VAST/Parallel has very sophisticated data dependency
analysis capabilities that allow it to optimize complicated situations.
All loop nests are examined to see if they can be executed in parallel
safely. VAST/Parallel can resolve ambiguous subscripting by examining
variable assignments outside of loops, and restructure the use of
variables to avoid certain other dependencies.
- Potential Dependence Testing. When
dependencies are unclear at compile time, sometimes VAST/Parallel
can generate run-time tests to allow parallelism to proceed.
- Special Reduction Optimization. Summations and other reductions
are parallelized through the use of locks or critical regions.
- Shared/Private Determination. All variables in a parallel loop are
categorized as shared (seen by all threads) or private (copy in each
thread). VAST/Parallel can detect and create private arrays.
- Interprocedural Analysis for Parallel Calls. VAST/Parallel can examine
call chains to determine their dependencies, and then parallelize
loops containing calls or groups of calls outside loops.
- Automatic recognition of parallel cases. When sections of code deal
with disjoint operations, VAST/Parallel can process each section in
a separate parallel case.
- Superscalar optimizations. VAST/Parallel includes scalar optimizations
to boost performance even in a single thread. Parallel optimizations
can be done to outer loops while inner loops are optimized for efficient
execution on one thread.
- Array Syntax. VAST/Parallel can in general parallelize and optimize
multi-dimensional array syntax just as efficiently as loop nests.
- Choice of static or dynamic partitioning of loop iterations. Load
balancing trades off against loop overhead. Use dynamic partitioning
when you need better load balancing, static partitioning when you are
concerned about overhead.
- Number of threads can be set with an environment variable. This
allows the degree of parallelism to be changed from run to run: when
the system is busy you can run with two threads; when it is idle you
can run with eight, without recompiling your program.
- Choice of thread waiting strategy. You can select either busy waiting
or sleep waiting for threads, so that the parallel program can adapt
to loaded or dedicated workloads on the target system. Use busy waiting
on a lightly loaded system, and sleep waiting when another job might
need the cycles.
- Optimization of entire loop nests, not just inner loops. Critical
optimizations include loop fusion (squeezing multiple loops into one
loop), outer loop unrolling (unrolling an outer loop inside an inner
loop), loop collapse (making one long loop from a multiple dimension
loop), and loop interchange (changing the order of the loops in a
loop nest to get more efficient memory access).
- Unrolled vector loops. Unrolling vectorized loops is very important
in making sure that the vector instructions are overlapped to the
maximum extent possible.
- Vectorization of reduction loops. Includes array summations, dot
products, minimum and maximum element of an array, product of array
elements, etc. These operations take a large fraction of the CPU time
for many programs.
- Vectorization of conditional loops. "if" statements and
conditional operators are vectorized.
- Non-aligned vectors can be vectorized efficiently. VAST introduces
"permute" operations to align vectors "on the fly"
prior to computation.
- 32-bit float and 8, 16 and 32-bit integer vectorization. Integers
can be signed and unsigned. Also, VAST can vectorize loops that contain
mixed data sizes.
- ALIGNED pragma so that the user can inform VAST-C about arrays that
are aligned on 16-byte boundaries. A -Valigned command-line switch is
also available.
- -Vmessages switch to get vectorization messages for all loops in
the program. Find out what constructs are inhibiting vectorization
of your important loops.
- DISJOINT, NODEPCHK pragmas for disambiguating data dependencies.
Especially useful if the target program uses lots of pointers rather
than array notation.
- -L parameter for assertion levels to allow vectorization in the
presence of pointer arguments. Can be very useful if the program is
written to pass most of the data as pointer arguments.
- Vector load lifting. Moves all loads to the top of the loop, as far
as they can safely go, allowing the compiler to do a better job of scheduling.
- Vectorization of complex data type. Uses the permute instructions
to reorder interleaved complex data so that it can be operated on
with the vector unit.
- Testing for stride one on loops with variable stride. Inserts a
run-time test to see if variable array strides are all one; executes
a vector version of the loop if the strides are one, otherwise executes
the original scalar loop.
- Partial vectorization of loops with strided or gather/scatter vectors.
- Vectorization of "table lookup" loops. Loops that have
a branch out of the loop can be vectorized in certain cases.
Performance Gains on a Single CPU system:
VAST/Parallel's superscalar optimization technology can enhance the
performance of certain types of code on standard, single CPU systems.
If your programs spend large amounts of time in nested loops or operating
on large arrays, a performance improvement of over 35% may be possible.
On other types of code, VAST/Parallel may have little impact.
Performance Gains on Dual CPU System:
VAST/Parallel can automatically parallelize your code and also provides
full OpenMP support to enable user-directed parallelization. VAST/Parallel
contains sophisticated data dependency analysis technology to detect
when optimized execution will be safe, has very advanced in-lining capabilities,
and uses interprocedural analysis to optimize across procedure boundaries.
Ease of Use
VAST/Parallel fully supports the OpenMP standard. For calculations
where you know exactly what you want parallelized, OpenMP provides a
portable way to specify this. VAST/Parallel supports all OpenMP directives/pragmas
and functions, and provides diagnostics on incorrect use of the directives.
- Thread private common (choice of methods)
- Orphan directives
- Nested parallelism
- Reduction optimizations
- Environment variables
- Efficient library implementation
The driver that comes with VAST Vector
(AltiVec) combines VAST and the compiler(s) in a transparent way, so
that (for example) compilation can be as easy as replacing gcc with
vcc or f90 with v90 in your makefiles.
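Assuming the vcc and v90 drivers are on your PATH, the makefile change might look like the following. The file name Makefile.demo and its contents are invented for the example; the substitution itself is ordinary sed.

```shell
# Create a toy makefile fragment, then swap the compilers for the
# VAST driver scripts (gcc -> vcc, f90 -> v90).
printf 'CC = gcc\nFC = f90\n' > Makefile.demo
sed -e 's/gcc/vcc/' -e 's/f90/v90/' Makefile.demo > Makefile.vast
cat Makefile.vast
# prints:
# CC = vcc
# FC = v90
```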
There are several ways to use VAST. If your program
spends most of its time in clean loops, then VAST may be able to vectorize
your program automatically. Often with C programs, depending on the
style in which they are written, a potential "data dependency"
between pointers and arrays may prevent some vectorization, and a few
simple assertions from the user can improve the amount of vectorization.
VAST can provide messages that help you understand what parts of your
program have been successfully optimized and what parts have not.
Advanced users may
choose to write clean loops
for new applications and have VAST automatically generate AltiVec code,
rather than doing AltiVec coding by hand. Very advanced users may wish
to modify the VAST intermediate C code and change the order or nature
of vector operations that VAST generates.
"We bought the VAST auto-parallelization tool to use with Absoft Fortran on
our dual-CPU machine. As a serial program our code ran quite quickly, but
we wanted to leverage both processors. VAST automatically converted our
serial code into code designed for the dual processors. Before VAST the
run time was 18.5 seconds; after VAST it was reduced to 8.1 seconds!
This was a dramatic improvement, especially since it did not require
any recoding on our part. Everybody here was elated. We think that if
we install dual-core CPUs, i.e., four CPUs, we will be where we dreamed
to be in terms of speed."
Dr. Kosta J. Leontaritis
Flow Assurance Advisor - AsphWax, Inc.
All Absoft Compilers Include FREE Technical Support!
Experienced Support Engineers are available via phone at
248-853-0095 or email
9am to 3pm EST (M-F)
to answer your Absoft Fortran questions!
VAST-F/Vector (Mac/Linux PPC Only)