[BUG] Error with Cluster Index Handling in Version 1.1.4 #12

Open
luna2terra opened this issue Apr 27, 2025 · 3 comments
Labels
bug: fixed (Bug fixed in latest update.), bug: R (Issues in the R package.)

Comments

@luna2terra

I encountered an issue while using the mLLMCelltype package (version 1.1.4). It seems that the cluster index handling is not functioning as expected. Specifically, I received errors related to negative indices when processing my CSV input files.

Steps to Reproduce:

  1. Install mLLMCelltype version 1.1.4.
  2. Prepare a CSV file with cluster indices.
  3. Run the interactive_consensus_annotation() function using this file.

Expected Behavior:

The function should accept the input files without raising cluster-index errors, and should ensure that cluster indices start from 0.

Actual Behavior:

I encountered the following error:

Error: Negative index (-1) found when processing cluster indices.

Environment:

  • OS: [Insert your operating system]
  • R Version: [Insert your R version]
  • mLLMCelltype Version: 1.1.4

Additional Context:

I noticed that the recent update mentions strict validation for input cluster indices, but it appears that there might still be issues affecting users with specific datasets.

Suggested Fix:

Could you please investigate this issue? It might be helpful to include additional validation checks in the code to handle negative indices more gracefully.

Thank you for your assistance!

luna2terra added the bug label Apr 27, 2025
@cafferychen777
Owner

Cat_Heart_markers.csv

Hello @luna2terra,

Thank you for reporting this issue with the cluster index handling in mLLMCelltype version 1.1.4. I'd like to help troubleshoot this problem as soon as possible.

Could you please share your processing code that encountered this error? It would be extremely helpful to see exactly how you're setting up and calling the functions.

If possible, please also send your dataset CSV to my email at [email protected] so I can reproduce the issue directly. Rest assured, your data will only be used for debugging purposes.

For reference, I've created an example script that demonstrates how we process CSV files with cluster indices in a way that works correctly. You can see this example below:

```r
# First install the latest version of mLLMCelltype
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")

# mLLMCelltype example using a CSV file as input
# This script uses marker genes directly from the CSV file,
# avoiding recalculation through Seurat each time

# Load necessary packages
library(mLLMCelltype)

# Create cache and log directories
cache_dir <- "/Users/apple/Research/mLLMCelltype/R/examples/cache"
log_dir <- "/Users/apple/Research/mLLMCelltype/R/examples/logs"
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
dir.create(log_dir, showWarnings = FALSE, recursive = TRUE)

# Read CSV file content
cat_heart_markers_file <- "/Users/apple/Research/LLMCelltype/data/reference/Cat_Heart_markers.csv"
file_content <- readLines(cat_heart_markers_file)

# Skip header row
data_lines <- file_content[-1]

# Check data structure
cat("Number of data rows: ", length(data_lines), "\n")
cat("First data row: ", data_lines[1], "\n")

# Split every row once; the first field is the cluster name, the rest are genes
split_lines <- strsplit(data_lines, ",", fixed = TRUE)
cluster_names <- vapply(split_lines, `[`, character(1), 1)

# Convert data to list format, using numeric indices as keys
marker_genes_list <- list()

# seq_along() is safe even when there are no data rows (1:length() is not)
for (i in seq_along(split_lines)) {
  parts <- split_lines[[i]]

  # Use index as key (0-based index, compatible with Seurat)
  cluster_id <- as.character(i - 1)

  # Remaining parts are genes; filter out NA and empty strings
  genes <- parts[-1]
  genes <- genes[!is.na(genes) & genes != ""]

  # Add to marker_genes_list
  marker_genes_list[[cluster_id]] <- list(genes = genes)

  # Print mapping relationship
  cat(sprintf("Mapping cluster '%s' to index %s\n", cluster_names[i], cluster_id))
}

# Print the processed marker_genes_list structure
cat("\nProcessed marker_genes_list structure:\n")
for (cluster in names(marker_genes_list)) {
  cat(sprintf("Cluster: %s, Genes: %s\n",
              cluster,
              paste(head(marker_genes_list[[cluster]]$genes, 5), collapse = ", ")))
}

# Set API keys
api_keys <- list(
  gemini = "YOUR_GEMINI_API_KEY",
  qwen = "YOUR_QWEN_API_KEY",
  grok = "YOUR_GROK_API_KEY",
  openrouter = "YOUR_OPENROUTER_API_KEY"
)

# Run consensus annotation
cat("\nStarting interactive_consensus_annotation...\n")
consensus_results <-
  interactive_consensus_annotation(
    input = marker_genes_list,
    tissue_name = "cat heart", # Cat heart data
    models = c("gemini-2.0-flash",
               "gemini-1.5-pro",
               "qwen-max-2025-01-25",
               "grok-3-latest",
               "anthropic/claude-3-7-sonnet-20250219",
               "openai/gpt-4o"),
    api_keys = api_keys,
    controversy_threshold = 0.6,
    entropy_threshold = 1.0,
    max_discussion_rounds = 3,
    cache_dir = cache_dir,
    log_dir = log_dir
  )

# Save results
saveRDS(consensus_results, "/Users/apple/Research/mLLMCelltype/R/examples/cat_heart_results.rds")

# Print results summary
cat("\nResults summary:\n")
cat("Available fields:", paste(names(consensus_results), collapse = ", "), "\n\n")

# Print final annotations
cat("Final cell type annotations:\n")
for (cluster in names(consensus_results$final_annotations)) {
  cat(sprintf("%s: %s\n", cluster, consensus_results$final_annotations[[cluster]]))
}

# Print controversial clusters
cat("\nControversial clusters:", paste(consensus_results$controversial_clusters, collapse = ", "), "\n")

# Check number of clusters
cat("\nCluster count check:\n")
cat("Number of input clusters:", length(marker_genes_list), "\n")
cat("Number of finally annotated clusters:", length(consensus_results$final_annotations), "\n")
cat("Number of controversial clusters:", length(consensus_results$controversial_clusters), "\n")

# Check if additional clusters were added
all_clusters <- unique(c(
  names(marker_genes_list),
  names(consensus_results$final_annotations),
  consensus_results$controversial_clusters
))
cat("All occurring clusters:", paste(all_clusters, collapse = ", "), "\n")

if (length(all_clusters) > length(marker_genes_list)) {
  extra_clusters <- setdiff(all_clusters, names(marker_genes_list))
  cat("Warning: Additional clusters found:", paste(extra_clusters, collapse = ", "), "\n")
}
```

The key points to note in this approach:

  1. We create a 0-based index system (to be compatible with Seurat): cluster_id <- as.character(i - 1)
  2. We carefully check for and filter out NA and empty strings
  3. We use numeric string indices as keys in the marker_genes_list
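The three points above can be condensed into a small helper. This is only a minimal sketch, not code from the package: `lines_to_marker_list()` is a hypothetical name, the toy lines and gene names are made up for illustration, and it assumes the first field of each line is the cluster label and that no field contains a quoted comma.

```r
# Hypothetical helper (not part of mLLMCelltype): turn CSV data lines
# (header already removed) into a marker list keyed by 0-based index strings.
lines_to_marker_list <- function(data_lines) {
  marker_list <- list()
  for (i in seq_along(data_lines)) {
    parts <- strsplit(data_lines[i], ",", fixed = TRUE)[[1]]
    genes <- trimws(parts[-1])                    # everything after the cluster label
    genes <- genes[!is.na(genes) & genes != ""]   # drop NA and empty fields
    marker_list[[as.character(i - 1)]] <- list(genes = genes)  # 0-based string key
  }
  marker_list
}

# Toy input standing in for real CSV rows (illustrative values only)
toy_lines <- c("Cardiomyocyte,MYH7,TNNT2,,ACTC1",
               "Fibroblast,COL1A1,,DCN")
toy_list <- lines_to_marker_list(toy_lines)

print(names(toy_list))        # keys are the 0-based index strings
print(toy_list[["0"]]$genes)  # empty field between TNNT2 and ACTC1 is dropped
```

Keeping the keys as strings rather than numbers matters in R: `marker_list[["0"]]` looks up a name, whereas a numeric `0` subscript would not address any element.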

If you could adapt this approach for your own CSV format and data paths, it might resolve the issue. The main thing to ensure is that all cluster indices are non-negative and that they follow a consistent pattern.
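One way to confirm that the indices are non-negative and consistent before calling the annotation function is a small pre-flight check. The sketch below is an assumption about what such a check could look like: `check_cluster_ids()` is a hypothetical helper, and its error messages are not the ones the package itself emits.

```r
# Hypothetical pre-flight check (not part of mLLMCelltype): confirm that the
# list names are the consecutive 0-based index strings "0", "1", ..., "n-1".
check_cluster_ids <- function(marker_list) {
  ids <- suppressWarnings(as.numeric(names(marker_list)))
  if (length(ids) == 0 || anyNA(ids)) {
    stop("Cluster names must be present and numeric, e.g. \"0\", \"1\".")
  }
  if (any(ids < 0)) {
    stop("Negative cluster index found: ", paste(ids[ids < 0], collapse = ", "))
  }
  if (!identical(sort(ids), as.numeric(seq_along(ids) - 1))) {
    stop("Cluster indices are not consecutive starting from 0.")
  }
  invisible(TRUE)
}

# Illustrative inputs: one well-formed list and one with a negative key
good <- list("0" = list(genes = "MYH7"), "1" = list(genes = "COL1A1"))
bad  <- list("-1" = list(genes = "MYH7"), "0" = list(genes = "COL1A1"))

check_cluster_ids(good)  # passes silently
result <- tryCatch(check_cluster_ids(bad),
                   error = function(e) conditionMessage(e))
print(result)
```

Running the check on your own `marker_genes_list` right after building it should show whether a negative key originates in the CSV parsing step or later in the pipeline.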

Once I receive your code and data, I'll be able to investigate further and provide a more targeted solution.

Thank you for your patience and for helping improve mLLMCelltype!

Best regards,
Caffery

@cafferychen777
Owner

Important Note: Before running the script, please make sure to:

  1. Delete any existing cache directory to ensure a fresh start:

    # Remove cache directory if it exists
    unlink("/path/to/your/cache", recursive = TRUE)

  2. Force reinstallation of the latest version of mLLMCelltype to make sure you're using the most up-to-date code:

    # Force reinstall the latest version
    devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R", force = TRUE)

These steps will help ensure that you're working with a clean environment and the latest bug fixes.
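To confirm that both steps took effect, something like the following could be run afterwards. It is a sketch under stated assumptions: the cache path here is a throwaway directory under `tempdir()` used purely for illustration, so substitute the cache path you actually pass to the package.

```r
# Placeholder cache location for illustration; substitute your real cache path
cache_dir <- file.path(tempdir(), "mllmcelltype_cache_demo")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

# Remove the cache, then confirm it is really gone
unlink(cache_dir, recursive = TRUE)
cache_removed <- !dir.exists(cache_dir)
cat("Cache removed:", cache_removed, "\n")

# Report the installed package version, if the package is installed
if (requireNamespace("mLLMCelltype", quietly = TRUE)) {
  cat("mLLMCelltype version:",
      as.character(utils::packageVersion("mLLMCelltype")), "\n")
} else {
  cat("mLLMCelltype is not installed in this library.\n")
}
```

If the reported version is unchanged after a forced reinstall, restarting the R session before checking again rules out a stale copy that is still loaded.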

cafferychen777 added the bug: R (Issues in the R package.) and bug: fixed (Bug fixed in latest update.) labels and removed the bug label Apr 28, 2025
@cafferychen777
Owner

Hello @luna2terra,

Thank you for reporting the issue with cluster index handling in mLLMCelltype version 1.1.4. I proposed a potential solution in my previous comments.

I wanted to follow up and check whether the suggested approach resolved your problem. Were you able to process your CSV files successfully using the example code I shared?

If you are still experiencing issues, please let me know and I will be happy to provide further assistance. Your feedback is invaluable in helping improve the package.

Best regards,
Caffery
