Speech Emotion Recognition (SER) has become a popular research topic because of its significant role in many practical applications, and it is considered a key component of Human-Computer Interaction (HCI). Previous work in this field has mostly focused on global features or time-series feature representations combined with deep learning models. In contrast, the main focus of this work is to design a simple SER model based on a multivariate time-series feature representation. The model uses an Echo State Network (ESN) with parallel reservoir layers, a special case of the Recurrent Neural Network (RNN), and applies Principal Component Analysis (PCA) to reduce the high-dimensional output of the reservoir layers. Late grouped fusion is then applied to capture complementary information from the two reservoirs independently, and the hyperparameters are optimized with a Bayesian approach. The high performance of the proposed SER model is demonstrated in speaker-independent experiments on the SAVEE dataset and the FAU Aibo Emotion Corpus, where the experimental results show that the designed model surpasses state-of-the-art results.
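To make the pipeline concrete, the following is a minimal NumPy sketch of the general idea described above: two parallel ESN reservoirs driven by the same multivariate time series, PCA applied to each reservoir's high-dimensional state trajectory, and the reduced outputs concatenated as a simple stand-in for late grouped fusion. All names, sizes, and the toy input are illustrative assumptions, not the paper's actual implementation or hyperparameters.

```python
import numpy as np

def run_reservoir(inputs, n_res=100, spectral_radius=0.9, input_scale=0.5, seed=0):
    """Drive a single ESN reservoir with a multivariate time series.

    inputs: array of shape (T, n_in). Returns reservoir states of shape (T, n_res).
    Sizes and scaling are illustrative, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    n_in = inputs.shape[1]
    W_in = rng.uniform(-input_scale, input_scale, (n_res, n_in))
    W = rng.uniform(-1.0, 1.0, (n_res, n_res))
    # Rescale recurrent weights to the target spectral radius (echo state property).
    W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)  # standard leak-free ESN state update
        states.append(x.copy())
    return np.array(states)

def pca_reduce(X, n_components):
    """Project rows of X onto their top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy multivariate time series standing in for frame-level speech features
# (e.g. 13 coefficients per frame over 200 frames).
rng = np.random.default_rng(0)
T, n_in = 200, 13
u = rng.standard_normal((T, n_in))

# Two parallel reservoirs with different random seeds; PCA reduces each
# state trajectory, and concatenation plays the role of late grouped fusion.
s1 = pca_reduce(run_reservoir(u, seed=1), 10)
s2 = pca_reduce(run_reservoir(u, seed=2), 10)
fused = np.concatenate([s1, s2], axis=1)
print(fused.shape)  # (200, 20): one 20-dimensional fused feature vector per frame
```

The fused representation would then feed a downstream classifier; in practice the reservoir size, spectral radius, and number of principal components are exactly the kinds of hyperparameters the paper tunes with a Bayesian approach.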