OnDiskDataset Preprocessing crashes with graph more than 2B edges #7850

Open
byingyang opened this issue Dec 31, 2024 · 0 comments

🐛 Bug

When I created all the edge files for an OnDiskDataset, I cast all the src and dst node IDs to int32 (since we do not yet have billions of nodes). The preprocessing stage then crashed with an int32 overflow error:

The on-disk dataset is re-preprocessing, so the existing preprocessed dataset has been removed.
Start to preprocess the on-disk dataset.

RuntimeError: [20:25:19] /opt/dgl/src/array/cpu/spmat_op_impl_coo.cc:749: Check failed: (coo.row->shape[0]) <= 0x7FFFFFFFL (2283022784 vs. 2147483647) : int32 overflow for argument coo.row->shape[0].
Stack trace:
  [bt] (0) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(+0x61fbc4) [0x7f34bc81fbc4]
  [bt] (1) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::CSRMatrix dgl::aten::impl::COOToCSR<(DGLDeviceType)1, int>(dgl::aten::COOMatrix)+0x121) [0x7f34bc82ac81]
  [bt] (2) /databricks/python/lib/python3.11/site-packages/dgl/libdgl.so(dgl::aten::COOToCSR(dgl::aten::COOMatrix)+0x451) [0x7f34bc5b43a1]
  [bt] (3) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::COOToCSC(std::shared_ptr<dgl::sparse::COO> const&)+0x17d) [0x7f3394a77f2d]
  [bt] (4) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::_CreateCSC()+0x14d) [0x7f3394a7c14d]
  [bt] (5) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCPtr()+0x5d) [0x7f3394a7c24d]
  [bt] (6) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(dgl::sparse::SparseMatrix::CSCTensors()+0x13) [0x7f3394a7ce63]
  [bt] (7) /databricks/python3/lib/python3.11/site-packages/dgl/dgl_sparse/libdgl_sparse_pytorch_2.4.0.so(std::_Function_handler<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&), torch::class_<dgl::sparse::SparseMatrix>::defineMethod<torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()> >(std::string, torch::detail::WrapMethod<std::tuple<at::Tensor, at::Tensor, std::optional<at::Tensor> > (dgl::sparse::SparseMatrix::*)()>, std::string, std::initializer_list<torch::arg>)::{lambda(std::vector<c10::IValue, std::allocator<c10::IValue> >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator<c10::IValue> >&)+0x82) [0x7f3394a65802]
  [bt] (8) /databricks/python/lib/python3.11/site-packages/torch/lib/libtorch_python.so(+0xa80f7e) [0x7f357f678f7e]

----> 2 dataset = gb.OnDiskDataset(base_dir, force_preprocess=True).load()
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:688, in OnDiskDataset.__init__(self, path, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    678 def __init__(
    679     self,
    680     path: str,
   (...)
    685     # Always call the preprocess function first. If already preprocessed,
    686     # the function will return the original path directly.
    687     self._dataset_dir = path
--> 688     yaml_path = preprocess_ondisk_dataset(
    689         path,
    690         include_original_edge_id,
    691         force_preprocess,
    692         auto_cast_to_optimal_dtype,
    693     )
    694     with open(yaml_path) as f:
    695         self._yaml_data = yaml.load(f, Loader=yaml.loader.SafeLoader)
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:407, in preprocess_ondisk_dataset(dataset_dir, include_original_edge_id, force_preprocess, auto_cast_to_optimal_dtype)
    404 if "graph" not in input_config:
    405     raise RuntimeError("Invalid config: does not contain graph field.")
--> 407 sampling_graph = _graph_data_to_fused_csc_sampling_graph(
    408     dataset_dir,
    409     input_config["graph"],
    410     include_original_edge_id,
    411     auto_cast_to_optimal_dtype,
    412 )
    414 # 3. Record value of include_original_edge_id.
    415 output_config["include_original_edge_id"] = include_original_edge_id
File /databricks/python/lib/python3.11/site-packages/dgl/graphbolt/impl/ondisk_dataset.py:166, in _graph_data_to_fused_csc_sampling_graph(dataset_dir, graph_data, include_original_edge_id, auto_cast_to_optimal_dtype)
    161 sparse_matrix = spmatrix(
    162     indices=torch.stack((coo_src, coo_dst), dim=0),
    163     shape=(total_num_nodes, total_num_nodes),
    164 )
    165 del coo_src, coo_dst
--> 166 indptr, indices, edge_ids = sparse_matrix.csc()
    167 del sparse_matrix
    169 if auto_cast_to_optimal_dtype:
File /databricks/python/lib/python3.11/site-packages/dgl/sparse/sparse_matrix.py:201, in SparseMatrix.csc(self)
    172 def csc(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    173     r"""Returns the compressed sparse column (CSC) representation of the
    174     sparse matrix.
    175 
   (...)
    199     (tensor([0, 0, 0, 1, 2, 3]), tensor([1, 1, 2]), tensor([0, 2, 1]))
    200     """
--> 201     return self.c_sparse_matrix.csc()
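The failing check is on the number of edges, not the node IDs: `coo.row->shape[0]` (the COO row array length, 2,283,022,784 here) must fit in a signed 32-bit integer. A minimal sketch that mirrors the guard (constants taken from the error message above; this is not DGL code):

```python
# Mirrors the check in spmat_op_impl_coo.cc that raises during COOToCSR:
# the edge count itself must fit in int32, even if node IDs do.
INT32_MAX = 0x7FFFFFFF  # 2147483647

def check_fits_int32(num_edges: int) -> None:
    """Raise if num_edges cannot be indexed with a 32-bit signed integer."""
    if num_edges > INT32_MAX:
        raise RuntimeError(
            f"int32 overflow for argument coo.row->shape[0] "
            f"({num_edges} vs {INT32_MAX})"
        )

check_fits_int32(2_147_483_647)  # largest edge count that passes
# check_fits_int32(2_283_022_784) would raise, matching the log above
```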

To Reproduce

Steps to reproduce the behavior:

  1. Create an OnDiskDataset whose edges are stored in .npy files with all src and dst node IDs cast to int32, and with more edges than INT32_MAX (2,147,483,647).
  2. Load the dataset, which triggers preprocessing; it crashes during the COO-to-CSC conversion.
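For reference, a sketch of how the int32 edge files were produced (file names and the edge count are illustrative; the real dataset has more than 2B edges and a metadata.yaml pointing at these files):

```python
import os
import tempfile

import numpy as np

# Illustrative edge list; the real graph has > 2_147_483_647 edges.
num_edges = 1_000
num_nodes = 100
edge_dir = tempfile.mkdtemp()

# Cast src/dst to int32 since the node IDs fit comfortably in 32 bits.
src = np.random.randint(0, num_nodes, size=num_edges).astype(np.int32)
dst = np.random.randint(0, num_nodes, size=num_edges).astype(np.int32)
np.save(os.path.join(edge_dir, "edges_src.npy"), src)
np.save(os.path.join(edge_dir, "edges_dst.npy"), dst)

# int32 halves the on-disk and in-memory footprint versus int64.
loaded = np.load(os.path.join(edge_dir, "edges_src.npy"))
print(loaded.dtype)  # int32
```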

Expected behavior

Preprocessing should succeed with int32 src/dst node IDs even when the number of edges exceeds int32 range, since it is the edge count, not the node IDs, that overflows. The current workaround is to leave the IDs as int64, which doubles my CPU memory usage, so there appear to be no memory savings from switching to GraphBolt.
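Back-of-the-envelope numbers for the memory cost of that workaround, using the edge count from the error message above (two index arrays, src and dst, per graph):

```python
num_edges = 2_283_022_784  # coo.row->shape[0] from the error message

# Two index arrays (src and dst), 4 bytes per int32 vs 8 per int64.
int32_bytes = 2 * num_edges * 4
int64_bytes = 2 * num_edges * 8

GiB = 2**30
print(f"int32 edges: {int32_bytes / GiB:.1f} GiB")  # ~17.0 GiB
print(f"int64 edges: {int64_bytes / GiB:.1f} GiB")  # ~34.0 GiB
```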

Environment

  • DGL Version (e.g., 1.0):
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context
