NOTE: Please see the original version of this post HERE. Crossposted here per suggestion of @Vitaliy Kaurov.
Since Mathematica is a symbolic system, with a symbolic evaluator much more general than in numerically-based languages, it is not surprising that performance tuning can be trickier here. There are many techniques, but they can all be understood from a single main principle. It is: avoid the full Mathematica symbolic evaluation process as much as possible. All techniques seem to reflect some facet of it. The main idea here is that, most of the time, a slow Mathematica program is slow because many Mathematica functions are very general. This generality is a great strength, since it enables the language to support better and more powerful abstractions, but in many places in a program such generality, used without care, can be a (huge) overkill. I won't be able to give many illustrative examples in the limited space, but they can be found in several places, including some WRI technical reports (Daniel Lichtblau's one on efficient data structures in Mathematica comes to mind), a very good book by David Wagner on Mathematica programming, and, most notably, many MathGroup posts. I also discuss a limited subset of them in my book. I will supply more references soon. Here are a few of the most common techniques (I only list those available within the Mathematica language itself, not mentioning CUDA / OpenCL, or links to other languages, which are of course also possibilities):
1. Push as much work into the kernel at once as possible; work with as large chunks of data at a time as possible, without breaking them into pieces.

1.1. Use built-in functions whenever possible. Since they are implemented in the kernel, in a lower-level language (C), they are typically (but not always!) much faster than user-defined ones solving the same problem. The more specialized a version of a built-in function you are able to use, the more chances you have for a speed-up.

1.2. Use functional programming (Map, Apply, and friends). Also, use pure functions in #& notation when you can; they tend to be faster than Functions with named arguments or those based on patterns (especially for not computationally intensive functions mapped over large lists).

1.3. Use structural and vectorized operations (Transpose, Flatten, Partition, Part and friends); they are even faster than functional ones.

1.4. Avoid procedural programming (loops etc.), because this style tends to break large structures into pieces (array indexing etc.). This pushes a larger part of the computation outside the kernel and makes it slower.
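As an illustrative sketch of 1.1–1.4 (timings are machine-dependent; the point is the relative difference), compare a procedural sum of squares with the vectorized built-in version:

```mathematica
data = RandomReal[1, 10^6];

(* procedural style: indexing breaks the array into pieces, *)
(* and each iteration goes through the full evaluator *)
AbsoluteTiming[Module[{s = 0.}, Do[s += data[[i]]^2, {i, Length[data]}]; s]]

(* vectorized style: Listable arithmetic plus Total, one pass through kernel C code *)
AbsoluteTiming[Total[data^2]]
```

On typical hardware the vectorized version is faster by two orders of magnitude or more.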
2. Use machine precision whenever possible.

2.1. Be aware of, and use, the Listability of built-in numerical functions, applying them to large lists of data rather than using Map or loops.

2.2. Use Compile, when you can. Use the newer capabilities of Compile, such as CompilationTarget -> "C", and make your compiled functions parallel and Listable.

2.3. Whenever possible, use vectorized operations (UnitStep, Clip, Sign, Abs, etc.) inside Compile to realize "vectorized control flow" constructs such as If, so that you can avoid explicit loops (at least as innermost loops) inside Compile as well. This can move you in speed from Mathematica byte code to almost native C speed, in some cases.

2.4. When using Compile, make sure that the compiled function doesn't bail out to non-compiled evaluation. See examples in this MathGroup thread.
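A minimal sketch combining 2.2 and 2.3 (assuming a C compiler is installed; without one, Compile falls back to the Mathematica virtual machine with a message):

```mathematica
(* a compiled ramp function: x UnitStep[x] replaces If[x > 0, x, 0] *)
ramp = Compile[{{x, _Real}},
   x UnitStep[x],                  (* vectorized control flow, no branch *)
   CompilationTarget -> "C",        (* compile down to C *)
   RuntimeAttributes -> {Listable}, (* apply directly to whole arrays *)
   Parallelization -> True          (* use multiple cores on large inputs *)
];

ramp[RandomReal[{-1, 1}, 10^6]] // Short
```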
3. Be aware that Lists are implemented as arrays in Mathematica.

3.1. Preallocate large lists.

3.2. Avoid Append, Prepend, AppendTo and PrependTo in loops, for building lists etc. (because they copy the entire list to add a single element, which leads to quadratic rather than linear complexity for list-building).

3.3. Use linked lists (structures like {1,{2,{3,{}}}}) instead of plain lists for list accumulation in a program. The typical idiom is a = {new element, a}. Because a is a reference, a single assignment is constant-time.

3.4. Be aware that pattern-matching for sequence patterns (BlankSequence, BlankNullSequence) is also based on Sequences being arrays. Therefore, a rule {fst_,rest___}:>{f[fst],g[rest]} will copy the entire list when applied. In particular, don't use recursion in a way which may look natural in other languages. If you want to use recursion on lists, first convert your lists to linked lists.
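The linked-list idiom of 3.3 in a self-contained sketch (this variant assumes the accumulated elements are atoms, so that a single Flatten at the end recovers a flat list):

```mathematica
(* each step prepends in constant time; no copying of the growing list *)
accumulate[n_] := Module[{a = {}},
   Do[a = {i, a}, {i, n}];
   Reverse @ Flatten[a]   (* one linear-time conversion back to a flat list *)
];

accumulate[5]   (* {1, 2, 3, 4, 5} *)
```

Contrast with AppendTo in the same loop, which copies the whole list on every iteration and is quadratic overall.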
4. Avoid inefficient patterns; construct efficient patterns.

4.1. Rule-based programming can be both very fast and very slow, depending on how you build your structures and rules, but in practice it is easier to inadvertently make it slow. It will be slow for rules which force the pattern-matcher to make many a priori doomed matching attempts, for example by under-utilizing each run of the pattern-matcher through a long list (expression). Sorting elements is a good example: list //. {left___, x_, y_, right___} /; x > y :> {left, y, x, right} has cubic complexity in the size of the list (an explanation can be found e.g. here).

4.2. Build efficient patterns, and corresponding structures to store your data, making the pattern-matcher waste as little time on false matching attempts as possible.

4.3. Avoid using patterns with computationally intensive conditions or tests. The pattern-matcher will give you the most speed when patterns are mostly syntactic in nature (testing structure, heads, etc.). Every time a condition (/;) or a pattern test (?) is used, the evaluator is invoked by the pattern-matcher for every potential match, and this slows it down.
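A small sketch of the syntactic-versus-tested distinction in 4.3 (timings are indicative only):

```mathematica
list = RandomInteger[10^6, 10^5];

(* purely syntactic pattern: matched structurally, no evaluator round-trips *)
AbsoluteTiming[Cases[list, _Integer];]

(* pattern test: the evaluator is invoked once per candidate element *)
AbsoluteTiming[Cases[list, _?(# > 500 &)];]
```

When a test is unavoidable, keeping its body trivial (as here) limits the damage; an expensive test function multiplies the per-element cost.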
5. Be aware of the immutable nature of most Mathematica built-in functions.

Most Mathematica built-in functions which process lists create a copy of the original list and operate on that copy. Therefore, they may have linear time (and space) complexity in the size of the original list, even if they modify the list in only a few places. One universal built-in function that does not create a copy, but instead modifies the original expression, and thus does not have this issue, is Part.

5.1. Avoid using most list-modifying built-in functions for a large number of small independent list modifications which cannot be formulated as a single step (for example, NestWhile[Drop[#, 1] &, Range[1000], # < 500 &]).

5.2. Use the extended functionality of Part to extract and modify a large number of list (or more general expression) elements at the same time. This is very fast, and not just for packed numerical arrays (Part modifies the original list).

5.3. Use Extract to extract many elements at different levels at once, passing to it a possibly large list of element positions.
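A short sketch of 5.2 and 5.3 on a small list:

```mathematica
lst = Range[10];

lst[[{2, 4, 6}]]                 (* extract many elements at once: {2, 4, 6} *)

lst[[{1, 3, 5}]] = {-1, -3, -5}; (* modify many positions in one in-place step *)
lst                              (* {-1, 2, -3, 4, -5, 6, 7, 8, 9, 10} *)

(* Extract takes full position specs, and can reach different levels at once *)
Extract[{{1, 2}, {3, 4}}, {{1, 2}, {2, 1}}]   (* {2, 3} *)
```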
6. Use efficient built-in data structures.

The following internal data structures are very efficient and can be used in many more situations than may appear from their stated main purpose. Lots of such examples can be found by searching the MathGroup archive, particularly the contributions of Carl Woll.

6.1. Packed arrays

6.2. Sparse arrays
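A brief sketch of both structures. Packed arrays store homogeneous machine numbers contiguously (and silently unpack when a non-conforming element sneaks in), while SparseArray stores only the nonzero entries:

```mathematica
Needs["Developer`"];

packed = ToPackedArray @ Range[10^6];  (* contiguous machine-integer storage *)
unpacked = Append[packed, "x"];        (* a non-numeric element forces unpacking *)
{PackedArrayQ[packed], PackedArrayQ[unpacked]}   (* {True, False} *)

(* a 1000 x 1000 identity-like matrix stored via its 1000 nonzero entries only *)
sparse = SparseArray[{i_, i_} -> 1., {1000, 1000}];
ByteCount[sparse] < ByteCount[Normal[sparse]]    (* True *)
```

Unpacking is a common silent performance killer; On["Packing"] makes the kernel report when it happens.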
7. Use hash tables.

Starting with version 10, immutable associative arrays (Associations) are available in Mathematica.

7.1. Associations. The fact that they are immutable does not prevent them from having efficient insertion and deletion of key-value pairs (cheap copies differing from the original association by the presence, or absence, of a given key-value pair). They represent the idiomatic associative arrays in Mathematica, and have very good performance characteristics. For earlier versions, the following alternatives work pretty well, being based on Mathematica's internal hash-tables:

7.2. Hash-tables based on DownValues or SubValues

7.3. Dispatch
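A minimal sketch of 7.1 and 7.2 side by side:

```mathematica
(* version 10+: Associations, the idiomatic associative array *)
assoc = <|"a" -> 1, "b" -> 2|>;
assoc["a"]                      (* 1 *)
AssociateTo[assoc, "c" -> 3];   (* efficient insertion into a held symbol *)
KeyDropFrom[assoc, "b"];        (* efficient deletion *)
Keys[assoc]                     (* {"a", "c"} *)

(* pre-version-10 alternative: a DownValues-based hash table; *)
(* lookup goes through the kernel's internal hashing of DownValues *)
hash["a"] = 1; hash["b"] = 2;
hash["b"]                       (* 2 *)
```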
8. Use element-position duality.

Often you can write faster functions that work with positions of elements rather than the elements themselves, since positions are integers (for flat lists). This can give you up to an order of magnitude speed-up, even compared to generic built-in functions (Position comes to mind as an example).
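One small sketch of the idea: compute the integer positions once, then do all subsequent work through Part and Delete on those positions rather than by repeatedly searching for the elements themselves:

```mathematica
list = RandomInteger[100, 10];

pos = Flatten @ Position[list, _?EvenQ];  (* integer positions, computed once *)

list[[pos]]               (* the even elements, fetched via their positions *)
Delete[list, List /@ pos] (* remove them all in a single call *)
```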
9. Use Reap and Sow.

Reap and Sow provide an efficient way of collecting intermediate results, and generally "tagging" parts you want to collect during the computation. These commands also go well with functional programming.
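A minimal sketch: collecting values from inside a computation without ever touching AppendTo:

```mathematica
(* Sow emits values; the enclosing Reap gathers them in linear time *)
Reap[
   Do[If[PrimeQ[i], Sow[i]], {i, 20}]
][[2, 1]]
(* {2, 3, 5, 7, 11, 13, 17, 19} *)
```

Sow also accepts tags (Sow[val, tag]), so a single pass can sort results into several buckets, each reaped separately.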
10. Use caching, dynamic programming, lazy evaluation.

10.1. Memoization is very easily implemented in Mathematica, and can save a lot of execution time for certain problems.

10.2. In Mathematica, you can implement more complex versions of memoization, where you define functions (closures) at run time which use some precomputed parts in their definitions and are therefore faster.

10.3. Some problems can benefit from lazy evaluation. This seems more relevant to memory efficiency, but can also affect run-time efficiency. Mathematica's symbolic constructs make it easy to implement.

A successful performance-tuning process usually employs a combination of these techniques, and you will need some practice to identify cases where each of them will be beneficial.
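As a closing illustration of the memoization idiom from 10.1, in its classic one-line form:

```mathematica
(* each fib[n] is computed once; the f[n] = ... assignment caches it as a DownValue *)
fib[0] = 0; fib[1] = 1;
fib[n_Integer?Positive] := fib[n] = fib[n - 1] + fib[n - 2];

fib[100]   (* instant, despite the naively exponential recursion *)
```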
