Why are the sub and mul operations of f32x4 slower than the scalar version? #426

```rust
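#![feature(portable_simd)] // nightly-only feature gate for std::simd
use std::simd::prelude::*;

// Takes about 13 ms when data.len() is 40,000,000 ("4000w")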
fn sum_scalar(data: &[f32]) -> f32 {
    let mut sum = 0.0;
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        sum += (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
    sum
}

// Takes about 16 ms when data.len() is 40,000,000 ("4000w")
fn sum_simd(data: &[f32]) -> f32 {
    let pos = f32x4::load_or_default(&vec![2000.0, 2000.0, 2000.0]);
    let dir = f32x4::load_or_default(&vec![0.8, 0.6, 0.0]);

    let mut sum = 0.0;
    for i in 0..data.len() / 4 {
        let values = f32x4::from_array([
            data[i * 4 + 0],
            data[i * 4 + 1],
            data[i * 4 + 2],
            data[i * 4 + 3],
        ]);
        sum += ((values - pos) * dir).reduce_sum();
    }

    sum
}
```
CPU: Intel(R) Core(TM) i5-14400F, 2.50 GHz

rustc: 1.81.0-nightly (c1b336cb6 2024-06-21)
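
For comparison, here is a minimal sketch of one restructuring that may be worth timing. It rests on an assumption, not a measurement: that the horizontal `reduce_sum()` executed on every iteration of `sum_simd`, together with the four separately indexed loads, accounts for part of the gap. The sketch keeps four per-lane partial sums and adds them together only once, after the loop. It uses plain `[f32; 4]` arrays so that it compiles on stable Rust; `sum_lanes` is a hypothetical name, and `sum_scalar` is repeated here (with arrays instead of `vec!`) only as a reference. Whether the compiler vectorizes the inner loop depends on optimization level and target features.

```rust
/// Scalar reference: the same arithmetic as `sum_scalar` in the question,
/// written with `chunks_exact(4)` instead of manual indexing.
fn sum_scalar(data: &[f32]) -> f32 {
    let pos = [2000.0_f32, 2000.0, 2000.0];
    let dir = [0.8_f32, 0.6, 0.0];
    let mut sum = 0.0;
    for p in data.chunks_exact(4) {
        sum += (p[0] - pos[0]) * dir[0]
            + (p[1] - pos[1]) * dir[1]
            + (p[2] - pos[2]) * dir[2];
    }
    sum
}

/// Hypothetical variant: accumulate one partial sum per lane and reduce
/// once at the end, instead of reducing inside the loop.
fn sum_lanes(data: &[f32]) -> f32 {
    // The fourth lane is padding. dir[3] = 0.0, so it contributes nothing
    // as long as the fourth value of each chunk is finite.
    let pos = [2000.0_f32, 2000.0, 2000.0, 0.0];
    let dir = [0.8_f32, 0.6, 0.0, 0.0];
    let mut acc = [0.0_f32; 4];
    for p in data.chunks_exact(4) {
        for lane in 0..4 {
            acc[lane] += (p[lane] - pos[lane]) * dir[lane];
        }
    }
    // Single horizontal reduction after the loop.
    acc.iter().sum()
}
```

Because the additions happen in a different order, `sum_lanes` can differ from `sum_scalar` in the last few bits, so results should be compared with a tolerance and not with `==`. The same accumulate-then-reduce shape carries over to `f32x4` by keeping an `f32x4` accumulator and calling `reduce_sum()` once after the loop.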
