I found PyTorch to be much slower than NumPy when doing complex-valued matrix-vector multiplication on the CPU:
A few notes:
- This is true for me across multiple systems
- Memory is not an issue
- Complex multiplication in torch does not max out the CPU cores (unlike the other three cases)
- torch version: 2.5.1+cu124
- numpy version: 1.26.4
- Cuda version: 12.6
- NVidia driver: 560.35.03
- I verified the results of the calculations are the same
- Both use double precision (i.e. 64 bits for real and 128 bits for complex)
- Switching to float (torch.cfloat) makes things slightly faster, but not much
Perhaps I have misconfigured something?
Code to produce the above plots:
import torch
import numpy as np
import matplotlib.pyplot as plt
import time

maxn = 3000
nrep = 100

def conv(M, latype):
    # convert to the requested backend: 'numpy' or 'torch, <device>'
    if latype == 'numpy':
        return np.array(M)
    if latype.startswith('torch,'):
        return torch.tensor(M, device=latype[7:])

def multtest(A, b):
    # mean wall-clock time of nrep repeated matrix-vector products
    t0 = time.time()
    for i in range(nrep):
        b = A@b
    t1 = time.time()
    return (t1-t0)/nrep

ns = np.array(np.linspace(100, maxn, 100), dtype=int)
numpyts = np.zeros(len(ns))
torchts = np.zeros(len(ns))
fig, axes = plt.subplots(1, 2)
for ax, dtype in zip(axes, ['real', 'complex']):
    Aorig = np.random.rand(maxn, maxn)
    borig = np.random.rand(maxn)
    if dtype == 'complex':
        Aorig = Aorig + 1.j*np.random.rand(maxn, maxn)
        borig = borig + 1.j*np.random.rand(maxn)
    for latype in ['numpy', 'torch, cpu']:
        A = conv(Aorig, latype)
        b = conv(borig, latype)
        ts = np.zeros(len(ns))
        for i, n in enumerate(ns):
            ts[i] = multtest(A[:n, :n], b[:n])
        ax.plot(ns, ts, label=latype)
    ax.legend()
    ax.set_title(dtype)
    ax.set_xlabel('vector/matrix size')
    ax.set_ylabel('mean matrix-vector mult time (sec)')
fig.tight_layout()
plt.show()
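As a side note (not part of the original post), PyTorch also ships a small benchmarking helper, torch.utils.benchmark.Timer, which handles warm-up and collects per-run statistics; the following sketch times a single complex matrix-vector product with it (the size n and the variable names are purely illustrative):

import torch
import torch.utils.benchmark as benchmark

n = 3000  # illustrative size
A = torch.randn(n, n, dtype=torch.cdouble)
b = torch.randn(n, dtype=torch.cdouble)

t = benchmark.Timer(stmt='A @ b', globals={'A': A, 'b': b},
                    num_threads=torch.get_num_threads())  # Timer defaults to 1 thread otherwise
print(t.timeit(100))  # statistics for 100 runs of A @ b on CPU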
2 Answers
I can reproduce your problem on Windows 10, with CPython 3.8.1, NumPy 1.24.3, and Torch 49444c3e (the packages are the default ones installed via pip). Torch is set up to use my CPU (an i5-9600KF). Here is the result:
I can also see that torch uses only 1 core while NumPy uses multiple cores for complex numbers. NumPy uses OpenBLAS internally by default. Torch probably uses another implementation that is not optimized for this case (no parallelism, for an unknown reason). I can see that the real version uses OpenMP internally while the complex one does not. Neither appears to call any (dynamic) BLAS function internally, which tends to confirm they use their own implementation, unless a BLAS was statically linked.
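A quick way to check this on your own machine (a sketch; the last two lines assume the optional third-party package threadpoolctl is installed) is to print the threading and build configuration of both libraries:

import numpy as np
import torch

print(torch.get_num_threads())           # how many threads torch will use for intra-op parallelism
print(torch.__config__.parallel_info())  # how torch was built (OpenMP, MKL, thread settings)
np.show_config()                         # which BLAS/LAPACK NumPy was built against (e.g. OpenBLAS)

from threadpoolctl import threadpool_info  # optional: pip install threadpoolctl
print(threadpool_info())                 # thread pools (OpenBLAS/MKL/OpenMP) loaded in the process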
Assuming they do use a BLAS but the default one is not efficient, you could certainly compile/package Torch so that it links against another, faster BLAS implementation (possibly OpenBLAS, or another one like BLIS or the Intel MKL).
If they use their own implementation, then you could open an issue about this so that OpenMP is also used in the complex version.
AFAIK, Torch is optimized for real single-precision computations on GPUs, not really for complex double-precision computations on CPUs. Thus, maybe they have not cared about this case yet.
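In the meantime, a possible workaround (my own sketch, not something from the original post) is to express the complex matrix-vector product through real-valued products, which, per the observation above, do run through the parallel real path: writing A = Ar + i*Ai and b = br + i*bi gives A@b = (Ar@br - Ai@bi) + i*(Ar@bi + Ai@br). Whether this is actually faster depends on your Torch build, so it is worth benchmarking:

import torch

def complex_mv_via_real(A, b):
    # A: (n, n) complex tensor, b: (n,) complex tensor
    Ar, Ai = A.real.contiguous(), A.imag.contiguous()
    br, bi = b.real.contiguous(), b.imag.contiguous()
    re = Ar @ br - Ai @ bi
    im = Ar @ bi + Ai @ br
    return torch.complex(re, im)

A = torch.randn(2000, 2000, dtype=torch.cdouble)  # illustrative size
b = torch.randn(2000, dtype=torch.cdouble)
assert torch.allclose(complex_mv_via_real(A, b), A @ b)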
Side notes
By the way, I can see the following warnings during the execution:
<ipython-input-85-1e20a6760269>:18: RuntimeWarning: overflow encountered in matmul
b = A@b
<ipython-input-85-1e20a6760269>:18: RuntimeWarning: overflow encountered in matmul
b = A@b
<ipython-input-85-1e20a6760269>:18: RuntimeWarning: invalid value encountered in matmul
b = A@b
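These warnings come from the benchmark itself rather than from a library bug: multtest feeds the result back into b (b = A@b), and with entries drawn from [0, 1) each product grows the values by roughly a factor of n/2, so for the largest sizes the accumulated result overflows double precision well within the 100 repetitions. A minimal fix, shown as a sketch below (the question's code is left unchanged above), is to keep the right-hand side fixed inside the timing loop:

def multtest(A, b):
    t0 = time.time()
    for i in range(nrep):
        y = A @ b  # do not feed the result back into b, so the values stay bounded
    t1 = time.time()
    return (t1 - t0) / nrep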
When I run the code I get a different plot:
torch: 2.3.1
numpy: 1.26.4
cuda: 12.2
NVIDIA-Driver: 535.183.01 (Ubuntu)