You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Because of some special reasons I want to use spawn method to create worker in DataLoader of Pytorch, but it crashed with this error in topic.
Port 55733 is listened by training processes before so it will crash. But I want to know, why port will be bind repeatedly when multiprocessing_context is spawn?
Update: when I use #pytorch only, the problem disappeared. It occurs in lightning.Fabric.
# https://pastebin.com/BqA9mjiE
Epoch 0
Epoch 0
……
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address.
The server socket has failed to bind to [::]:55733 (errno: 98 - Address already in use).
The server socket has failed to bind to 0.0.0.0:55733 (errno: 98 - Address already in use).
Environment
Current environment
CUDA:
GPU:
NVIDIA RTX 5000 Ada Generation
NVIDIA A40
NVIDIA A40
available: True
version: 12.1
Lightning:
lightning: 2.4.0
lightning-utilities: 0.11.9
pytorch-lightning: 2.4.0
torch: 2.2.2
torchaudio: 2.2.2
torchdata: 0.7.1
torchmetrics: 1.6.0
torchvision: 0.17.2
Packages:
absl-py: 2.1.0
affine: 2.4.0
aiobotocore: 2.13.2
aiodns: 3.2.0
aiohappyeyeballs: 2.3.7
aiohttp: 3.10.4
aiohttp-client-cache: 0.11.1
aioitertools: 0.11.0
aiosignal: 1.3.1
aiosqlite: 0.20.0
annotated-types: 0.7.0
appdirs: 1.4.4
argon2-cffi: 23.1.0
argon2-cffi-bindings: 21.2.0
asciitree: 0.3.3
async-retriever: 0.17.0
attrs: 24.2.0
autocommand: 2.2.2
backports.tarfile: 1.2.0
black: 24.8.0
bleach: 6.1.0
bokeh: 3.5.1
boto3: 1.34.131
botocore: 1.34.131
branca: 0.7.2
brotli: 1.1.0
bump2version: 1.0.1
cachetools: 5.5.0
cartopy: 0.23.0
cattrs: 23.2.3
certifi: 2024.8.30
cffi: 1.17.0
cfgrib: 0.9.14.0
cftime: 1.6.4
chardet: 5.2.0
charset-normalizer: 3.3.2
click: 8.1.7
click-plugins: 1.1.1
cligj: 0.7.2
cloudpickle: 3.0.0
codetiming: 1.4.0
colorama: 0.4.6
contourpy: 1.2.1
cryptography: 43.0.0
cupy: 13.3.0
cycler: 0.12.1
cytoolz: 0.12.3
dask: 2024.8.1
dask-expr: 1.1.11
dataretrieval: 1.0.10
deepspeed: 0.16.1
defusedxml: 0.7.1
dgl: 2.2.1+cu121
distributed: 2024.8.1
docutils: 0.21.2
eccodes: 1.7.1
einops: 0.8.0
et-xmlfile: 1.1.0
exceptiongroup: 1.2.2
fasteners: 0.19
fastrlock: 0.8.2
filelock: 3.15.4
findlibs: 0.0.5
flake8: 7.1.1
flexcache: 0.3
flexparser: 0.3.1
folium: 0.17.0
fonttools: 4.53.1
frozenlist: 1.4.1
fsspec: 2024.6.1
geopandas: 1.0.1
gmpy2: 2.1.5
greenlet: 3.0.3
grpcio: 1.62.2
h2: 4.1.0
h5netcdf: 1.3.0
h5py: 3.11.0
hjson: 3.1.0
hpack: 4.0.0
hydrodataset: 0.1.13
hydrodatasource: 0.0.8
hydroerr: 1.24
hydrosignatures: 0.17.0
hydrotopo: 0.0.6
hydroutils: 0.0.12
hyperframe: 6.0.1
idna: 3.7
igraph: 0.11.6
importlib-metadata: 8.2.0
importlib-resources: 6.4.0
inflect: 7.3.1
iniconfig: 2.0.0
intake: 2.0.6
itsdangerous: 2.2.0
jaraco.classes: 3.4.0
jaraco.context: 5.3.0
jaraco.functools: 4.0.2
jaraco.text: 3.12.1
jeepney: 0.8.0
jinja2: 3.1.4
jmespath: 1.0.1
joblib: 1.4.2
kaggle: 1.6.17
kerchunk: 0.2.6
keyring: 25.3.0
kiwisolver: 1.4.5
lightning: 2.4.0
lightning-utilities: 0.11.9
llvmlite: 0.43.0
locket: 1.0.0
loguru: 0.7.2
lxml: 5.3.0
lz4: 4.3.3
markdown: 3.6
markdown-it-py: 3.0.0
markupsafe: 2.1.5
matplotlib: 3.9.2
mccabe: 0.7.0
mdurl: 0.1.2
minio: 7.2.8
more-itertools: 10.4.0
mpmath: 1.3.0
msgpack: 1.0.8
multidict: 6.0.5
mypy-extensions: 1.0.0
netcdf4: 1.7.1.post2
networkx: 3.3
nh3: 0.2.18
ninja: 1.11.1.3
nuitka: 2.4.7
numba: 0.60.0
numcodecs: 0.13.0
numpy: 1.26.4
nvidia-ml-py: 12.535.161
nvitop: 1.3.2
openpyxl: 3.1.5
ordered-set: 4.1.0
owslib: 0.31.0
packaging: 24.1
pandas: 2.2.2
partd: 1.4.2
pathspec: 0.12.1
pillow: 10.4.0
pint: 0.24.3
pint-pandas: 0.6.2
pint-xarray: 0.4
pip: 24.2
pkginfo: 1.10.0
platformdirs: 4.2.2
pluggy: 1.5.0
polars: 1.17.1
protobuf: 4.25.3
psutil: 6.0.0
psycopg2-binary: 2.9.9
py-cpuinfo: 9.0.0
pyarrow: 17.0.0
pyarrow-hotfix: 0.6
pycairo: 1.27.0
pycares: 4.4.0
pycodestyle: 2.12.1
pycparser: 2.22
pycryptodome: 3.20.0
pydantic: 2.8.2
pydantic-core: 2.20.1
pyflakes: 3.2.0
pygeohydro: 0.17.0
pygeoogc: 0.17.0
pygeoutils: 0.17.0
pygments: 2.18.0
pykalman: 0.9.7
pynhd: 0.17.0
pyogrio: 0.9.0
pyparsing: 3.1.2
pyproj: 3.6.1
pyshp: 2.3.1
pysocks: 1.7.1
pytest: 8.3.2
python-dateutil: 2.9.0
python-slugify: 8.0.4
pytorch-lightning: 2.4.0
pytz: 2024.1
pyyaml: 6.0.2
rasterio: 1.3.10
readme-renderer: 44.0
requests: 2.32.3
requests-cache: 1.2.1
requests-toolbelt: 1.0.0
rfc3986: 2.0.0
rich: 13.7.1
rioxarray: 0.17.0
s3fs: 2024.6.1
s3transfer: 0.10.2
scikit-learn: 1.5.1
scipy: 1.14.0
seaborn: 0.13.2
secretstorage: 3.3.3
setuptools: 72.2.0
shap: 0.45.1
shapely: 2.0.1
six: 1.16.0
slicer: 0.0.8
snuggs: 1.4.7
sortedcontainers: 2.4.0
sqlalchemy: 2.0.32
sympy: 1.13.2
tblib: 3.0.0
tbparse: 0.0.9
tensorboard: 2.17.1
tensorboard-data-server: 0.7.0
termcolor: 2.5.0
text-unidecode: 1.3
texttable: 1.7.0
threadpoolctl: 3.5.0
tomli: 2.0.1
toolz: 0.12.1
torch: 2.2.2
torchaudio: 2.2.2
torchdata: 0.7.1
torchmetrics: 1.6.0
torchvision: 0.17.2
tornado: 6.4.1
tqdm: 4.66.5
triton: 2.2.0
twine: 5.1.1
typeguard: 4.3.0
typing-extensions: 4.12.2
tzdata: 2024.1
tzfpy: 0.15.5
ujson: 5.10.0
url-normalize: 1.4.3
urllib3: 2.2.2
webencodings: 0.5.1
werkzeug: 3.0.3
wget: 3.2
wheel: 0.44.0
wrapt: 1.16.0
xarray: 2024.7.0
xlrd: 2.0.1
xyzservices: 2024.6.0
yarl: 1.9.4
zarr: 2.18.2
zict: 3.0.0
zipp: 3.20.0
zstandard: 0.23.0
System:
OS: Linux
architecture:
64bit
ELF
processor: x86_64
python: 3.11.9
release: 5.4.0-195-generic
version: Demos #215-Ubuntu SMP Fri Aug 2 18:28:05 UTC 2024
More info
No response
The text was updated successfully, but these errors were encountered:
Bug description
Because of some special reasons I want to use
spawn
method to create worker inDataLoader
of Pytorch, but it crashed with this error in topic.Port 55733 is listened by training processes before so it will crash. But I want to know, why port will be bind repeatedly when
multiprocessing_context
isspawn
?Update: when I use #pytorch only, the problem disappeared. It occurs in lightning.Fabric.
Hope for your reply.
What version are you seeing the problem on?
v2.4
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response
The text was updated successfully, but these errors were encountered: