chacha/asm/chacha-armv8.pl: replace 3+1 code paths with 4+1.

The change is triggered by ThunderX2 where 3+1 was slower than scalar
code path, but it helps all processors [to handle <512 inputs].

Reviewed-by: Tim Hudson <tjh@openssl.org>
Reviewed-by: Richard Levitte <levitte@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/8776)
1 file changed