A matrix multiplication module and matrix multiplication method are provided that use a variable number of multiplier-accumulator units based on the amount of data elements of the matrices are available or needed for processing at a particular point or stage in the computation process. As more data elements become available or are needed, more multiplier-accumulator units are used to perform the necessary multiplication and addition operations. To multiply an N×M matrix by an M×N matrix, the total (maximum) number of used MAC units is “2*N−1”. The number of MAC units used starts with one (1) and increases by two at each computation stage, that is, at the beginning of reading of data elements for each new row of the first matrix. The sequence of the number of MAC units is {1, 3, 5, . . . , 2*N−1} for computation stages each of which corresponds to reading of data elements for each new row of the left hand matrix, also called the first matrix. For the multiplication of two 8×8 matrices, the performance is 16 floating point operations per clock cycle. For an FPGA running at 100 MHz, the performance is 1.6 Giga floating point operations per second. The performance increases with the increase of the clock frequency and the use of larger matrices when FPGA resources permit. Very large matrices are partitioned into smaller blocks to fit in the FPGA resources. Results from the multiplication of sub-matrices are combined to form the final result of the large matrices.