For scientific codes to achieve good performance on computers with hierarchical memories, the ratio of memory references to arithmetic operations must be low. In this paper, we show that Level 3 BLAS linear algebra kernels satisfy this requirement and can be used to build an efficient parallel finite element solver on a shared-memory parallel computer with a fast cache memory.
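To illustrate the ratio in question, the following minimal sketch compares idealized operation counts for one representative kernel from each BLAS level, using operands of order n. The function name `blas_ratios` is hypothetical, and the counts assume each operand moves between memory levels exactly once, which a real cache only approximates when the kernel is suitably blocked.

```python
def blas_ratios(n):
    """Memory references per floating-point operation for operands of order n.

    Idealized counts: each operand is assumed to be transferred once.
    """
    return {
        # Level 1, y <- a*x + y: read x, read y, write y (3n refs); 2n flops
        "axpy": (3 * n) / (2 * n),
        # Level 2, y <- A*x + y: matrix plus vectors (n^2 + 3n refs); 2n^2 flops
        "gemv": (n * n + 3 * n) / (2 * n * n),
        # Level 3, C <- A*B + C: read A, B, C, write C (4n^2 refs); 2n^3 flops
        "gemm": (4 * n * n) / (2 * n ** 3),
    }


if __name__ == "__main__":
    for name, ratio in blas_ratios(100).items():
        print(f"{name}: {ratio:.3f} memory references per flop")
```

Under these assumptions only the Level 3 ratio shrinks as n grows (it behaves like 2/n), whereas the Level 1 and Level 2 ratios stay near 3/2 and 1/2 respectively.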