Skip to content

Commit

Permalink
Improve performance of finding indexables
Browse files Browse the repository at this point in the history
Currently, all folders and files in the current tree are turned into
IndexablePath, and then excluded files are filtered out after.

When there are large file trees that are meant to be excluded, this
results in a lot of unnecessary work.

ActiveStorage stores files in the `tmp` directory in many many small
folders. So does Bootsnap. Additionally, node_modules can become quite
large. Ruby LSP has to traverse all of these files,
even though the entire directory should just be ignored.

Instead we can skip any top-level directories whose paths have been
excluded.

We still need to loop through all IndexablePath objects compare them to the
exclude_patterns, in case nested folders or file name patterns were excluded.
Still, skipping some large top-level directories proves to be a big
performance improvement.

Before this PR in my Rails app, `indexables` took 76 seconds to run. Now
it takes, 0.17 seconds. Before and after code both return the same exact
file list.
  • Loading branch information
natematykiewicz committed Sep 25, 2024
1 parent b8d3954 commit 39ca825
Showing 1 changed file with 57 additions and 12 deletions.
69 changes: 57 additions & 12 deletions lib/ruby_indexer/lib/ruby_indexer/configuration.rb
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,15 @@ def initialize
@workspace_path = T.let(Dir.pwd, String)
@excluded_gems = T.let(initial_excluded_gems, T::Array[String])
@included_gems = T.let([], T::Array[String])
@excluded_patterns = T.let([File.join("**", "*_test.rb"), File.join("tmp", "**", "*")], T::Array[String])

@excluded_patterns = T.let(
[
File.join("**", "*_test.rb"),
File.join("tmp", "**", "*"),
File.join("node_modules", "**", "*"),
],
T::Array[String],
)

path = Bundler.settings["path"]
if path
Expand Down Expand Up @@ -52,6 +60,18 @@ def initialize
)
end

sig { returns(String) }
def merged_excluded_file_pattern
# This regex looks for @excluded_patterns that follow the format of "something/**/*", where
# "something" is one or more non-"/"
#
# Returns "/path/to/workspace/{tmp,node_modules}/**/*"
@excluded_patterns
.reject { |pattern| File.absolute_path?(pattern) }
.filter_map { |pattern| pattern.match(%r{\A([^/]+)/\*\*/\*\z})&.captures&.first }
.then { |dirs| File.join(@workspace_path, "{#{dirs.join(",")}}/**/*") }
end

sig { returns(T::Array[IndexablePath]) }
def indexables
excluded_gems = @excluded_gems - @included_gems
Expand All @@ -60,21 +80,46 @@ def indexables
# NOTE: indexing the patterns (both included and excluded) needs to happen before indexing gems, otherwise we risk
# having duplicates if BUNDLE_PATH is set to a folder inside the project structure

flags = File::FNM_PATHNAME | File::FNM_EXTGLOB

# In order to speed up indexing, only traverse into top-level directories that are not entirely excluded.
# For example, if "tmp/**/*" is excluded, we don't need to traverse into "tmp" at all. However, if
# "vendor/bundle/**/*" is excluded, we will traverse all of "vendor" and `reject!` out all "vendor/bundle" entries
# later.
included_paths = Dir.glob(File.join(@workspace_path, "*/"), flags)
.reject { |dir| File.fnmatch?(merged_excluded_file_pattern, dir, flags) }
.map do |included_path|
relative_path = included_path
.delete_prefix(@workspace_path)
.tap { |path| path.delete_prefix!("/") }
.tap { |path| path.delete_suffix!("/") }

[included_path, relative_path]
end

indexables = T.let([], T::Array[IndexablePath])

# Add user specified patterns
indexables = @included_patterns.flat_map do |pattern|
@included_patterns.each do |pattern|
load_path_entry = T.let(nil, T.nilable(String))

Dir.glob(File.join(@workspace_path, pattern), File::FNM_PATHNAME | File::FNM_EXTGLOB).map! do |path|
path = File.expand_path(path)
# All entries for the same pattern match the same $LOAD_PATH entry. Since searching the $LOAD_PATH for every
# entry is expensive, we memoize it until we find a path that doesn't belong to that $LOAD_PATH. This happens
# on repositories that define multiple gems, like Rails. All frameworks are defined inside the current
# workspace directory, but each one of them belongs to a different $LOAD_PATH entry
if load_path_entry.nil? || !path.start_with?(load_path_entry)
load_path_entry = $LOAD_PATH.find { |load_path| path.start_with?(load_path) }
end
included_paths.each do |included_path, relative_path|
relative_pattern = pattern.delete_prefix(File.join(relative_path, "/"))

IndexablePath.new(load_path_entry, path)
next unless pattern.start_with?("**") || pattern.start_with?(relative_path)

Dir.glob(File.join(included_path, relative_pattern), flags).each do |path|
path = File.expand_path(path)
# All entries for the same pattern match the same $LOAD_PATH entry. Since searching the $LOAD_PATH for every
# entry is expensive, we memoize it until we find a path that doesn't belong to that $LOAD_PATH. This
# happens on repositories that define multiple gems, like Rails. All frameworks are defined inside the
# current workspace directory, but each one of them belongs to a different $LOAD_PATH entry
if load_path_entry.nil? || !path.start_with?(load_path_entry)
load_path_entry = $LOAD_PATH.find { |load_path| path.start_with?(load_path) }
end

indexables << IndexablePath.new(load_path_entry, path)
end
end
end

Expand Down

0 comments on commit 39ca825

Please sign in to comment.