using MLDatasets is very slow #126

Closed
CarloLucibello opened this issue May 7, 2022 · 4 comments · Fixed by #128

Comments

@CarloLucibello
Member

CarloLucibello commented May 7, 2022

In a fresh Julia 1.7 session:

julia> @time using MLDatasets
 13.485246 seconds (20.31 M allocations: 1.158 GiB, 7.56% gc time, 61.69% compilation time)

Is there a way to conditionally import packages?

julia> for pkg in [:ImageCore, :CSV, :HDF5, :JLD2, :JSON3]; print(pkg); @time @eval using $pkg; end
ImageCore  2.141235 seconds (3.02 M allocations: 200.377 MiB, 4.50% gc time, 32.08% compilation time)
CSV  3.817959 seconds (6.14 M allocations: 348.493 MiB, 9.81% gc time, 90.11% compilation time)
HDF5  0.723358 seconds (1.34 M allocations: 73.225 MiB, 1.69% gc time, 93.80% compilation time)
JLD2  1.139716 seconds (1.36 M allocations: 78.966 MiB, 3.95% gc time, 60.77% compilation time)
JSON3  0.033367 seconds (49.09 k allocations: 3.014 MiB)

julia> for pkg in [:DataFrames, :MLUtils, :Pickle, :NPZ, :MAT]; print(pkg); @time @eval using $pkg; end
DataFrames  1.789793 seconds (2.03 M allocations: 137.197 MiB, 4.63% gc time)
MLUtils  1.743072 seconds (2.07 M allocations: 117.900 MiB, 4.83% gc time, 47.32% compilation time)
Pickle  0.130685 seconds (159.17 k allocations: 9.751 MiB, 17.77% compilation time)
NPZ  0.504406 seconds (1.19 M allocations: 61.838 MiB, 4.05% gc time, 98.87% compilation time)
MAT  0.009792 seconds (22.84 k allocations: 1.044 MiB)

Related discourse thread

@johnnychen94
Member

julia> versioninfo()
Julia Version 1.9.0-DEV.351
Commit 385762b444 (2022-04-08 21:50 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: 24 × 12th Gen Intel(R) Core(TM) i9-12900K
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, goldmont)
  Threads: 1 on 24 virtual cores
julia> @time_imports using MLDatasets
     18.5 ms        ┌ Preferences
     26.9 ms      ┌ JLLWrappers
     30.0 ms    ┌ OpenSSL_jll
      0.3 ms    ┌ Zlib_jll
     31.5 ms  ┌ HDF5_jll
      9.2 ms    ┌ MacroTools
      9.7 ms  ┌ ZygoteRules
      1.6 ms  ┌ GZip
      1.0 ms  ┌ ZipFile
      0.1 ms      ┌ IteratorInterfaceExtensions
      0.3 ms    ┌ TableTraits
    101.7 ms    ┌ SentinelArrays
     25.3 ms    ┌ Parsers
      0.9 ms    ┌ Compat
      4.9 ms    ┌ OrderedCollections
      1.3 ms      ┌ TranscodingStreams
      2.5 ms    ┌ CodecZlib
      0.1 ms    ┌ DataValueInterfaces
      9.5 ms    ┌ FilePathsBase
      5.0 ms      ┌ InlineStrings
      0.8 ms      ┌ DataAPI
     16.9 ms    ┌ WeakRefStrings
      8.5 ms    ┌ Tables
     10.7 ms    ┌ PooledArrays
   1348.2 ms  ┌ CSV
      0.4 ms  ┌ DefineSingletons
      0.4 ms    ┌ NaNMath
      0.9 ms      ┌ Adapt
     36.4 ms      ┌ OffsetArrays
     39.2 ms    ┌ PaddedViews
     24.4 ms    ┌ FixedPointNumbers
     48.3 ms      ┌ ChainRulesCore
     49.2 ms    ┌ ChangesOfVariables
      5.8 ms    ┌ AbstractFFTs
      0.3 ms    ┌ OpenLibm_jll
      1.6 ms      ┌ StackViews
      1.4 ms      ┌ MappedArrays
      4.1 ms    ┌ MosaicViews
      0.6 ms    ┌ InverseFunctions
      0.1 ms    ┌ Reexport
      3.7 ms    ┌ DocStringExtensions
    107.3 ms      ┌ ColorTypes
     76.2 ms      ┌ Colors
    185.5 ms    ┌ Graphics
      3.4 ms    ┌ IrrationalConstants
      0.4 ms    ┌ TensorCore
      1.4 ms      ┌ LogExpFunctions
    215.0 ms      ┌ OpenSpecFun_jll
    252.4 ms    ┌ SpecialFunctions
     80.6 ms    ┌ ColorVectorSpace
    888.7 ms  ┌ ImageCore
      0.5 ms    ┌ Requires
      0.9 ms    ┌ ConstructionBase
     24.4 ms    ┌ Setfield
     14.2 ms    ┌ InitialValues
     81.5 ms  ┌ BangBang
      0.3 ms    ┌ MbedTLS_jll
     32.6 ms  ┌ MbedTLS
     11.6 ms  ┌ FunctionWrappers
     57.0 ms  ┌ DataStructures
      0.2 ms    ┌ CompositionsBase
     21.0 ms  ┌ Accessors
      0.3 ms  ┌ InternedStrings
     18.4 ms    ┌ URIParser
    113.0 ms  ┌ BinDeps
     10.4 ms  ┌ StructTypes
      1.3 ms  ┌ ContextVariablesX
      0.3 ms    ┌ StatsAPI
      0.4 ms    ┌ SortingAlgorithms
      7.9 ms    ┌ Missings
     27.0 ms  ┌ StatsBase
      7.7 ms    ┌ MicroCollections
      0.4 ms    ┌ ArgCheck
      8.8 ms    ┌ SplittablesBase
     23.7 ms    ┌ Baselet
     87.6 ms  ┌ Transducers
      0.5 ms    ┌ Libiconv_jll
      3.7 ms  ┌ StringEncodings
      9.6 ms  ┌ JSON3
      2.5 ms  ┌ InvertedIndices
     97.1 ms    ┌ FileIO
    352.8 ms  ┌ NPZ
      2.4 ms    ┌ BufferedStreams
    715.8 ms    ┌ HDF5
    719.9 ms  ┌ MAT
    203.6 ms  ┌ MLStyle
      0.5 ms  ┌ IniFile
      0.9 ms  ┌ Glob
      6.0 ms    ┌ URIs
     54.3 ms  ┌ HTTP
      4.3 ms  ┌ DataDeps
    161.8 ms  ┌ JLD2
      1.7 ms  ┌ PrettyPrint
     29.3 ms    ┌ Crayons
      0.8 ms      ┌ Formatting
     88.3 ms    ┌ PrettyTables
    790.5 ms  ┌ DataFrames
      7.8 ms  ┌ ShowCases
    393.5 ms  ┌ FoldsThreads
      0.3 ms  ┌ FLoopsBase
      0.6 ms  ┌ NameResolution
      3.5 ms  ┌ JuliaVariables
      1.4 ms    ┌ TupleTools
     33.8 ms  ┌ Strided
      7.0 ms  ┌ FLoops
     17.2 ms  ┌ Pickle
      7.0 ms  ┌ MLUtils
   6173.8 ms  MLDatasets

@johnnychen94
Member

johnnychen94 commented May 7, 2022

Is there a way to conditionally import packages?

One way is to let FileIO handle the IO for the various formats. If you use its load/save interface, you don't need an explicit using XXX in src/MLDatasets.jl; FileIO loads the backend package the first time a file of that format is read or written.

For NPZ and CSV we can delegate to CSVFiles.jl and NPZ.jl. Is JLD2 currently used in this package?

We might need to ask MAT to provide FileIO support.

@johnnychen94
Member

If we want to build our own version of lazy loading, then we need to use the invokelatest function. A small example can be found at https://github.com/JuliaIO/ImageIO.jl/blob/d8bbac7bb9c4367b4bf145b9cca5d49abc9e42ab/src/ImageIO.jl#L51-L61
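
For illustration, here is a minimal sketch of that kind of lazy loading. It is not the MLDatasets or ImageIO implementation: the names lazy_import, call_lazy and day_of_year are hypothetical, and the stdlib package Dates stands in for a heavy dependency such as CSV or HDF5 so that the snippet runs without installing anything. The point is that the package is only loaded on the first call, and that its functions are reached through Base.invokelatest because they may have been defined after the calling method started running.

```julia
# Hypothetical helper: return the module for `name`, loading it only if needed.
function lazy_import(name::String)
    pkgid = Base.identify_package(name)
    pkgid === nothing && error("package $name not found in the current environment")
    # Reuse the module if something else has already loaded it.
    Base.root_module_exists(pkgid) && return Base.root_module(pkgid)
    return Base.require(pkgid)
end

# Hypothetical helper: look up `fname` in `mod` and call it in the latest world.
# Both the lookup and the call go through invokelatest, since the module may
# have been loaded after the current method began executing.
function call_lazy(mod::Module, fname::Symbol, args...; kwargs...)
    f = Base.invokelatest(getfield, mod, fname)
    return Base.invokelatest(f, args...; kwargs...)
end

# Example use: Dates is not needed until this function is first called.
function day_of_year(y::Integer, m::Integer, d::Integer)
    dates = lazy_import("Dates")
    date = call_lazy(dates, :Date, y, m, d)
    return call_lazy(dates, :dayofyear, date)
end

println(day_of_year(2022, 5, 7))
```

The trade-off is that every call routed through invokelatest is dynamically dispatched, so this pattern suits coarse-grained entry points (reading a whole file) rather than inner loops.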

@CarloLucibello
Member Author

I'll try to explore both the FileIO approach and custom lazy loading.

Filed an issue for .mat support: JuliaIO/FileIO.jl#361
