Datasets:
messages list | source string | category string |
|---|---|---|
[
{
"role": "user",
"content": "Can you provide detailed steps on how to modify the code to merge c\\_task\\_response and c\\_step\\_response datasets, use the renderer job API to transfer data to sftp, and filter datasets based on a specific task identifier, such as \"task\\_id=123\"? Your solution should in... | openhermes | unknown_general |
[
{
"role": "user",
"content": "You are given a string `s`. Your task is to write a recursive function in Python that reverses the string without using slicing. The function should handle strings of any length, including empty strings.\n\n**Input:**\n- A single string `s` (0 ≤ |s| ≤ 1000).\n\n**Output:**\n- A... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nbook_cover_api.getbookcoverimage_for_book_cover_api: This is the subfunction for tool 'book_cover_api', you can use this tool.The description of this function is: 'Get a book cover image by... | agent_flan | tool_calling_cot |
[
{
"role": "user",
"content": "You are tasked with implementing a Python class named `Employee` to manage employee data. The class should have the following attributes: `name`, `position`, `department`, and `salary`. Implement the following functionalities:\n\n1. A constructor to initialize the attributes.\n... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "I want to use the Titanic dataset available in the seaborn library. Can you explain the process of acquiring the data, the steps involved in cleaning the data, and suggest a suitable data visualization chart? Additionally, I would like to incorporate gender, survival status, an... | openhermes | glaive-code-assist_general |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \ni_eksepeti_data.get_comments_from_product_id_for_i_eksepeti_data: This is the subfunction for tool 'i_eksepeti_data', you can use this tool.The description of this function is: 'Retrieve co... | agent_flan | tool_calling_cot |
[
{
"role": "user",
"content": "Can you find the boundary of the hexagon with H3 index 617313877342776319 and the address corresponding to the coordinates 51.5074, -0.1278?"
},
{
"role": "model",
"content": "[{\"name\": \"query_for_hexagon_by_h3index\", \"arguments\": {\"h3index\": 617313877342776... | xlam | database |
[
{
"role": "user",
"content": "I'm a city planner working on a new sports district development. First, I need to apply for a demolition permit for an old building at 123 Stadium Ave. Do you know the process or who I should contact for that?"
},
{
"role": "tool",
"content": "{\n \"Basketball\": {\... | toolmind | toolmind_graphsyn_backfill |
[
{
"role": "user",
"content": "You are tasked with implementing a `Book` class to manage a library's inventory. The class should have the following attributes:\n- `title`: a string representing the title of the book.\n- `author`: a string representing the author of the book.\n- `isbn`: a string representing ... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "Which phishing URLs have been observed in the last week?"
},
{
"role": "model",
"content": "[{\"name\": \"v1\", \"arguments\": {\"time\": \"week\", \"filter1\": \"phishing\", \"filter2\": \"url\"}}]"
}
] | xlam | database |
[
{
"role": "user",
"content": "You are given an `n × n` chessboard and a list of squares imbued with a magical aura. No two rooks can be placed on the same row or column, and if a rook is placed on a square that is under a magical aura, it must move to a neighboring row or column, provided that the new squar... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "Can you reverse the string \"Hello, World!\" for me?"
},
{
"role": "tool",
"content": "{\"result\": \"!dlroW ,olleH\"}"
},
{
"role": "model",
"content": "The reversed string of \"Hello, World!\" is \"!dlroW ,olleH\"."
},
{
"role": "user",
"co... | toolmind | glaive_general |
[
{
"role": "user",
"content": "You are tasked with implementing a recursive function to compute the factorial of a non-negative integer. Your function should handle invalid inputs (negative integers) by raising a `ValueError` with the message \"Factorial is not defined for negative numbers\". \n\n**Input:**\... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You are given a string `value` representing a monetary amount in a specific currency format, which may include various currency symbols, commas, and spaces. Your task is to write a function `convert_currency` that processes this string and returns the numerical value of the mon... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You are given a string consisting solely of alphabetic characters. Your task is to write a function that counts the number of vowels and consonants in the string. The function should return a tuple containing the count of vowels and consonants, respectively.\n\n**Input:**\n- A ... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nall_crypto_news_feed.get_twitter_feed_for_all_crypto_news_feed: This is the subfunction for tool 'all_crypto_news_feed', you can use this tool.The description of this function is: 'Returns ... | agent_flan | tool_calling_cot |
[
{
"role": "user",
"content": "Retrieve the profile information for Instagram users 'selenagomez' and 'lewishamilton'."
},
{
"role": "model",
"content": "[{\"name\": \"profile_info\", \"arguments\": {\"username\": \"selenagomez\"}}, {\"name\": \"profile_info\", \"arguments\": {\"username\": \"lew... | xlam | database |
[
{
"role": "user",
"content": "I need to find some recent Arabic news articles about political changes in the Arab world for my school project. Could you suggest reliable sources or specific articles from the past year?"
},
{
"role": "tool",
"content": "[\n {\n \"headline\": \"التحولات السياسية ف... | toolmind | graphsyn_multi_hop |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nadmiralbet.odds_for_admiralbet: This is the subfunction for tool 'admiralbet', you can use this tool.The description of this function is: 'Latest odds_for_admiralbet by Admiralbet.'\nInput ... | agent_flan | tool_calling_multistep |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nFinalAction: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are una... | agent_flan | tool_calling_multistep |
[
{
"role": "user",
"content": "How popular is the '#veganrecipes' hashtag on TikTok?"
},
{
"role": "model",
"content": "[{\"name\": \"hashtag_info\", \"arguments\": {\"hashtag\": \"veganrecipes\"}}]"
}
] | xlam | web_search |
[
{
"role": "user",
"content": "You are given a JSON file named `books.json` that contains a list of books. Each book is represented as an object with the following properties: `title`, `author`, `isbn`, and `published_year`. Your task is to write a Python script that reads this JSON file, validates that each... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You are tasked with creating a Python class that simulates a simple signal processing flow using GNU Radio. Your task is to complete the implementation of the `SignalProcessor` class by adding the missing method `process_signal`. The `process_signal` method should take a list o... | oss_instruct | algorithmic_problem_solving |
[
{
"role": "user",
"content": "I'm planning a trip to Costa Rica and considering the MSC Seaview cruise. Could you tell me about its itineraries in the region and when it's scheduled to dock in San José?"
},
{
"role": "tool",
"content": "{\n \"vessel_name\": \"MSC Seaview\",\n \"vessel_type\": \"... | toolmind | toolmind_graphsyn_backfill |
[
{
"role": "user",
"content": "You are given a string `s`. Your task is to determine if it is a palindrome, considering only alphanumeric characters and ignoring cases. A string is a palindrome if it reads the same backward as forward.\n\n**Input:**\n- A single string `s` (1 ≤ |s| ≤ 10^5), where |s| denotes ... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r... | openthought | reasoning_trajectory |
[
{
"role": "user",
"content": "What is the BMI for a person who is 1.65 meters tall and weighs 65 kg, and another who is 1.9 meters tall and weighs 100 kg?"
},
{
"role": "model",
"content": "[{\"name\": \"bmi\", \"arguments\": {\"weight\": 65, \"height\": 165, \"system\": \"metric\"}}, {\"name\":... | xlam | file_operations |
[
{
"role": "user",
"content": "Hello, I need to confirm a grooming appointment for Whiskers at PetCare at 10 AM on October 15th. Additionally, I want to make sure he meets all grooming requirements based on his vaccination records. Can you help me with this?"
},
{
"role": "tool",
"content": "[\n ... | toolmind | button_instruct |
[
{
"role": "user",
"content": "What are some of the major factors that contribute to global warming according to scientific research? Can you provide a Java code that calculates the carbon dioxide emissions from burning fossil fuels and explain how it contributes to global warming? Also, can you explain how ... | openhermes | EvolInstruct_70k_general |
[
{
"role": "user",
"content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r... | openthought | reasoning_trajectory |
[
{
"role": "user",
"content": "I will ask you a question, then you should help me operate a MySQL database with SQL to answer the question.\nYou have access to the following actions: [{'name': 'SQL', 'description': 'standard SQL tool to run command in MySQL'}, {'name': 'Answer', 'description': 'if you have t... | agent_flan | web_agent_react |
[
{
"role": "user",
"content": "Hey, can you check how popular the #SummerVibes hashtag is on TikTok? I'd like to see the view count and any related hashtags to get a sense of its reach."
},
{
"role": "tool",
"content": "{\n \"hashtag\": \"bts\",\n \"video_count\": 1250000,\n \"related_hashtags\":... | toolmind | graphsyn_multi_hop |
[
{
"role": "user",
"content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r... | openthought | reasoning_trajectory |
[
{
"role": "user",
"content": "How does UV radiation affect the degradation and stability of polypropylene (PP) and polyethylene (PE) polymers? Explain the scientific principles behind the photo-degradation process and suggest potential strategies for improving the UV stability of these polymers."
},
{
... | openhermes | CamelAI_general |
[
{
"role": "user",
"content": "I need to retrieve the email address of a subscriber. How can I access that information?"
},
{
"role": "model",
"content": "To retrieve the email address of a subscriber, I will need their **subscriber ID**. Could you please provide the ID of the subscriber you're l... | toolmind | toolmind_graphsyn_backfill |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nFinish: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are unable t... | agent_flan | tool_calling_multistep |
[
{
"role": "user",
"content": "Fetch AI companies from pages 15 and 16. Also, provide box office details for the movie 'tt56789012'. Furthermore, give me trending tags for the page 'mno234'."
},
{
"role": "model",
"content": "[{\"name\": \"getcompanies\", \"arguments\": {\"page\": 15}}, {\"name\"... | xlam | database |
[
{
"role": "user",
"content": "You are tasked with implementing a function to generate the Fibonacci sequence up to the `n`-th term using an iterative approach. The Fibonacci sequence is defined as follows: the first term is 0, the second term is 1, and each subsequent term is the sum of the two preceding on... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You are given a binary file that contains a sequence of integers stored in little-endian 32-bit format. Your task is to write a Python program that reads these integers from the binary file and writes them to a text file, with each integer on a new line. The program should be e... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You have access to the following API:\n{'name': 'list_of_cocktails_for_the_cocktail_db', 'description': 'This is the subfunction for tool \"the_cocktail_db\", you can use this tool.The description of this function is: \"This endpoint responds with a list of cocktails that inclu... | agent_flan | tool_calling_single |
[
{
"role": "user",
"content": "Fetch the contents of '/docs' in the 'gina305/switch-code' repository and get the details of a software with name 'Zoom' and feature 'video conferencing'."
},
{
"role": "model",
"content": "[{\"name\": \"list_folder_contents\", \"arguments\": {\"repo\": \"gina305/sw... | xlam | file_operations |
[
{
"role": "user",
"content": "You are given a string `s`. Your task is to find the longest palindromic substring within `s`. If there are multiple palindromic substrings of the same maximum length, return the one that appears first. \n\n**Input:**\n- A single string `s` (1 ≤ |s| ≤ 1000), consisting of lower... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "You are given a dictionary `student_grades` where each key is a student's name and each value is a list of integers representing the grades the student has received. Your task is to implement a function `calculate_average_grades(student_grades)` that computes the average grade ... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r... | openthought | reasoning_trajectory |
[
{
"role": "user",
"content": "You are given a list of integers representing the dimensions of a hyperrectangle. Your task is to write a function `calculate_volume` that computes the volume of the hyperrectangle. The volume is defined as the product of all dimensions. You must implement the multiplication of... | opencodeinstruct | coding_trajectory |
[
{
"role": "user",
"content": "1. I need the boundary information of Istanbul, Turkey. 2. Can you also provide the geolocation information of Moscow, Russia? 3. Lastly, I want to fetch fonts for vector tiles with a range of 'symbols' and font stack 'Tahoma, Verdana'."
},
{
"role": "model",
"conte... | xlam | database |
[
{
"role": "user",
"content": "Hi there! I need some help with order #W5199551. Could you assist me with making a change to the payment method?"
},
{
"role": "model",
"content": "To assist you with changing the payment method for order #W5199551, I first need to verify your identity. Could you pl... | toolmind | tau_tool_usage |
[
{
"role": "user",
"content": "You have access to the following API:\n{'name': 'genre_for_deezer', 'description': 'This is the subfunction for tool \"deezer\", you can use this tool.The description of this function is: \"A genre object\"', 'parameters': {'type': 'object', 'properties': {'is_id': {'type': 'st... | agent_flan | tool_calling_single |
[
{
"role": "user",
"content": "Answer the following questions as best you can. You have access to the following APIs. \nFinalAnswer: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are una... | agent_flan | tool_calling_multistep |
[
{
"role": "user",
"content": "Could you find out the top 15 songs in Slovakia on New Year's Eve 2022? Additionally, provide the songs from positions 21 to 30 on that day."
},
{
"role": "model",
"content": "[{\"name\": \"slovakia_songs\", \"arguments\": {\"range\": \"top-15\", \"date\": \"2022-12... | xlam | file_operations |
[
{
"role": "user",
"content": "Retrieve information about an Avito offer with the ID 'item456'. Also, what are the rewards for the 'cryptoart' collection?"
},
{
"role": "model",
"content": "[{\"name\": \"singleoffer\", \"arguments\": {\"singleav\": \"item456\"}}, {\"name\": \"retrieve_rewards\", ... | xlam | web_search |
Dataset Token Filter
This repository/folder contains a Python script (filter_dataset.py) for filtering JSONL datasets by their token context lengths.
Files
filter_dataset.py: The script to process and filter the dataset.whiteswan_agentic_129k_sft.jsonl: The original dataset containing 129k entries.whiteswan_filtered_under_4400.jsonl: The output filtered dataset containing only the samples shorter than 4,400 tokens.
How it Works
The script iterates through a .jsonl file line by line. It parses each line as JSON, extracts the content field recursively from the messages list, and calculates the exact amount of tokens present using OpenAI's tiktoken library (with the cl100k_base encoding algorithm, which is equivalent to the GPT-4 tokenizer).
If the total sum of tokens across all messages in that sample is less than 4,400 tokens, the entire original JSON object is written to the new output .jsonl dataset.
Setup Requirements
Before running the script, the only library dependency required is tiktoken, which the script attempts to install automatically on the first run.
If you somehow need to install it manually:
pip install tiktoken
Running the Script
Simply execute the Python script:
python3 filter_dataset.py
The script will track its progress every 10,000 lines and print the final counts of processed over kept samples when completed.
- Downloads last month
- 67