Dataset Viewer
Auto-converted to Parquet Duplicate
messages
list
source
string
category
string
[ { "role": "user", "content": "Can you provide detailed steps on how to modify the code to merge c\\_task\\_response and c\\_step\\_response datasets, use the renderer job API to transfer data to sftp, and filter datasets based on a specific task identifier, such as \"task\\_id=123\"? Your solution should in...
openhermes
unknown_general
[ { "role": "user", "content": "You are given a string `s`. Your task is to write a recursive function in Python that reverses the string without using slicing. The function should handle strings of any length, including empty strings.\n\n**Input:**\n- A single string `s` (0 ≤ |s| ≤ 1000).\n\n**Output:**\n- A...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nbook_cover_api.getbookcoverimage_for_book_cover_api: This is the subfunction for tool 'book_cover_api', you can use this tool.The description of this function is: 'Get a book cover image by...
agent_flan
tool_calling_cot
[ { "role": "user", "content": "You are tasked with implementing a Python class named `Employee` to manage employee data. The class should have the following attributes: `name`, `position`, `department`, and `salary`. Implement the following functionalities:\n\n1. A constructor to initialize the attributes.\n...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "I want to use the Titanic dataset available in the seaborn library. Can you explain the process of acquiring the data, the steps involved in cleaning the data, and suggest a suitable data visualization chart? Additionally, I would like to incorporate gender, survival status, an...
openhermes
glaive-code-assist_general
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \ni_eksepeti_data.get_comments_from_product_id_for_i_eksepeti_data: This is the subfunction for tool 'i_eksepeti_data', you can use this tool.The description of this function is: 'Retrieve co...
agent_flan
tool_calling_cot
[ { "role": "user", "content": "Can you find the boundary of the hexagon with H3 index 617313877342776319 and the address corresponding to the coordinates 51.5074, -0.1278?" }, { "role": "model", "content": "[{\"name\": \"query_for_hexagon_by_h3index\", \"arguments\": {\"h3index\": 617313877342776...
xlam
database
[ { "role": "user", "content": "I'm a city planner working on a new sports district development. First, I need to apply for a demolition permit for an old building at 123 Stadium Ave. Do you know the process or who I should contact for that?" }, { "role": "tool", "content": "{\n \"Basketball\": {\...
toolmind
toolmind_graphsyn_backfill
[ { "role": "user", "content": "You are tasked with implementing a `Book` class to manage a library's inventory. The class should have the following attributes:\n- `title`: a string representing the title of the book.\n- `author`: a string representing the author of the book.\n- `isbn`: a string representing ...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "Which phishing URLs have been observed in the last week?" }, { "role": "model", "content": "[{\"name\": \"v1\", \"arguments\": {\"time\": \"week\", \"filter1\": \"phishing\", \"filter2\": \"url\"}}]" } ]
xlam
database
[ { "role": "user", "content": "You are given an `n × n` chessboard and a list of squares imbued with a magical aura. No two rooks can be placed on the same row or column, and if a rook is placed on a square that is under a magical aura, it must move to a neighboring row or column, provided that the new squar...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "Can you reverse the string \"Hello, World!\" for me?" }, { "role": "tool", "content": "{\"result\": \"!dlroW ,olleH\"}" }, { "role": "model", "content": "The reversed string of \"Hello, World!\" is \"!dlroW ,olleH\"." }, { "role": "user", "co...
toolmind
glaive_general
[ { "role": "user", "content": "You are tasked with implementing a recursive function to compute the factorial of a non-negative integer. Your function should handle invalid inputs (negative integers) by raising a `ValueError` with the message \"Factorial is not defined for negative numbers\". \n\n**Input:**\...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You are given a string `value` representing a monetary amount in a specific currency format, which may include various currency symbols, commas, and spaces. Your task is to write a function `convert_currency` that processes this string and returns the numerical value of the mon...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You are given a string consisting solely of alphabetic characters. Your task is to write a function that counts the number of vowels and consonants in the string. The function should return a tuple containing the count of vowels and consonants, respectively.\n\n**Input:**\n- A ...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nall_crypto_news_feed.get_twitter_feed_for_all_crypto_news_feed: This is the subfunction for tool 'all_crypto_news_feed', you can use this tool.The description of this function is: 'Returns ...
agent_flan
tool_calling_cot
[ { "role": "user", "content": "Retrieve the profile information for Instagram users 'selenagomez' and 'lewishamilton'." }, { "role": "model", "content": "[{\"name\": \"profile_info\", \"arguments\": {\"username\": \"selenagomez\"}}, {\"name\": \"profile_info\", \"arguments\": {\"username\": \"lew...
xlam
database
[ { "role": "user", "content": "I need to find some recent Arabic news articles about political changes in the Arab world for my school project. Could you suggest reliable sources or specific articles from the past year?" }, { "role": "tool", "content": "[\n {\n \"headline\": \"التحولات السياسية ف...
toolmind
graphsyn_multi_hop
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nadmiralbet.odds_for_admiralbet: This is the subfunction for tool 'admiralbet', you can use this tool.The description of this function is: 'Latest odds_for_admiralbet by Admiralbet.'\nInput ...
agent_flan
tool_calling_multistep
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nFinalAction: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are una...
agent_flan
tool_calling_multistep
[ { "role": "user", "content": "How popular is the '#veganrecipes' hashtag on TikTok?" }, { "role": "model", "content": "[{\"name\": \"hashtag_info\", \"arguments\": {\"hashtag\": \"veganrecipes\"}}]" } ]
xlam
web_search
[ { "role": "user", "content": "You are given a JSON file named `books.json` that contains a list of books. Each book is represented as an object with the following properties: `title`, `author`, `isbn`, and `published_year`. Your task is to write a Python script that reads this JSON file, validates that each...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You are tasked with creating a Python class that simulates a simple signal processing flow using GNU Radio. Your task is to complete the implementation of the `SignalProcessor` class by adding the missing method `process_signal`. The `process_signal` method should take a list o...
oss_instruct
algorithmic_problem_solving
[ { "role": "user", "content": "I'm planning a trip to Costa Rica and considering the MSC Seaview cruise. Could you tell me about its itineraries in the region and when it's scheduled to dock in San José?" }, { "role": "tool", "content": "{\n \"vessel_name\": \"MSC Seaview\",\n \"vessel_type\": \"...
toolmind
toolmind_graphsyn_backfill
[ { "role": "user", "content": "You are given a string `s`. Your task is to determine if it is a palindrome, considering only alphanumeric characters and ignoring cases. A string is a palindrome if it reads the same backward as forward.\n\n**Input:**\n- A single string `s` (1 ≤ |s| ≤ 10^5), where |s| denotes ...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r...
openthought
reasoning_trajectory
[ { "role": "user", "content": "What is the BMI for a person who is 1.65 meters tall and weighs 65 kg, and another who is 1.9 meters tall and weighs 100 kg?" }, { "role": "model", "content": "[{\"name\": \"bmi\", \"arguments\": {\"weight\": 65, \"height\": 165, \"system\": \"metric\"}}, {\"name\":...
xlam
file_operations
[ { "role": "user", "content": "Hello, I need to confirm a grooming appointment for Whiskers at PetCare at 10 AM on October 15th. Additionally, I want to make sure he meets all grooming requirements based on his vaccination records. Can you help me with this?" }, { "role": "tool", "content": "[\n ...
toolmind
button_instruct
[ { "role": "user", "content": "What are some of the major factors that contribute to global warming according to scientific research? Can you provide a Java code that calculates the carbon dioxide emissions from burning fossil fuels and explain how it contributes to global warming? Also, can you explain how ...
openhermes
EvolInstruct_70k_general
[ { "role": "user", "content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r...
openthought
reasoning_trajectory
[ { "role": "user", "content": "I will ask you a question, then you should help me operate a MySQL database with SQL to answer the question.\nYou have access to the following actions: [{'name': 'SQL', 'description': 'standard SQL tool to run command in MySQL'}, {'name': 'Answer', 'description': 'if you have t...
agent_flan
web_agent_react
[ { "role": "user", "content": "Hey, can you check how popular the #SummerVibes hashtag is on TikTok? I'd like to see the view count and any related hashtags to get a sense of its reach." }, { "role": "tool", "content": "{\n \"hashtag\": \"bts\",\n \"video_count\": 1250000,\n \"related_hashtags\":...
toolmind
graphsyn_multi_hop
[ { "role": "user", "content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r...
openthought
reasoning_trajectory
[ { "role": "user", "content": "How does UV radiation affect the degradation and stability of polypropylene (PP) and polyethylene (PE) polymers? Explain the scientific principles behind the photo-degradation process and suggest potential strategies for improving the UV stability of these polymers." }, { ...
openhermes
CamelAI_general
[ { "role": "user", "content": "I need to retrieve the email address of a subscriber. How can I access that information?" }, { "role": "model", "content": "To retrieve the email address of a subscriber, I will need their **subscriber ID**. Could you please provide the ID of the subscriber you're l...
toolmind
toolmind_graphsyn_backfill
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nFinish: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are unable t...
agent_flan
tool_calling_multistep
[ { "role": "user", "content": "Fetch AI companies from pages 15 and 16. Also, provide box office details for the movie 'tt56789012'. Furthermore, give me trending tags for the page 'mno234'." }, { "role": "model", "content": "[{\"name\": \"getcompanies\", \"arguments\": {\"page\": 15}}, {\"name\"...
xlam
database
[ { "role": "user", "content": "You are tasked with implementing a function to generate the Fibonacci sequence up to the `n`-th term using an iterative approach. The Fibonacci sequence is defined as follows: the first term is 0, the second term is 1, and each subsequent term is the sum of the two preceding on...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You are given a binary file that contains a sequence of integers stored in little-endian 32-bit format. Your task is to write a Python program that reads these integers from the binary file and writes them to a text file, with each integer on a new line. The program should be e...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You have access to the following API:\n{'name': 'list_of_cocktails_for_the_cocktail_db', 'description': 'This is the subfunction for tool \"the_cocktail_db\", you can use this tool.The description of this function is: \"This endpoint responds with a list of cocktails that inclu...
agent_flan
tool_calling_single
[ { "role": "user", "content": "Fetch the contents of '/docs' in the 'gina305/switch-code' repository and get the details of a software with name 'Zoom' and feature 'video conferencing'." }, { "role": "model", "content": "[{\"name\": \"list_folder_contents\", \"arguments\": {\"repo\": \"gina305/sw...
xlam
file_operations
[ { "role": "user", "content": "You are given a string `s`. Your task is to find the longest palindromic substring within `s`. If there are multiple palindromic substrings of the same maximum length, return the one that appears first. \n\n**Input:**\n- A single string `s` (1 ≤ |s| ≤ 1000), consisting of lower...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "You are given a dictionary `student_grades` where each key is a student's name and each value is a list of integers representing the grades the student has received. Your task is to implement a function `calculate_average_grades(student_grades)` that computes the average grade ...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "[System Instructions: Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, r...
openthought
reasoning_trajectory
[ { "role": "user", "content": "You are given a list of integers representing the dimensions of a hyperrectangle. Your task is to write a function `calculate_volume` that computes the volume of the hyperrectangle. The volume is defined as the product of all dimensions. You must implement the multiplication of...
opencodeinstruct
coding_trajectory
[ { "role": "user", "content": "1. I need the boundary information of Istanbul, Turkey. 2. Can you also provide the geolocation information of Moscow, Russia? 3. Lastly, I want to fetch fonts for vector tiles with a range of 'symbols' and font stack 'Tahoma, Verdana'." }, { "role": "model", "conte...
xlam
database
[ { "role": "user", "content": "Hi there! I need some help with order #W5199551. Could you assist me with making a change to the payment method?" }, { "role": "model", "content": "To assist you with changing the payment method for order #W5199551, I first need to verify your identity. Could you pl...
toolmind
tau_tool_usage
[ { "role": "user", "content": "You have access to the following API:\n{'name': 'genre_for_deezer', 'description': 'This is the subfunction for tool \"deezer\", you can use this tool.The description of this function is: \"A genre object\"', 'parameters': {'type': 'object', 'properties': {'is_id': {'type': 'st...
agent_flan
tool_calling_single
[ { "role": "user", "content": "Answer the following questions as best you can. You have access to the following APIs. \nFinalAnswer: If you believe that you have obtained a result that can answer the task, please call this function to provide the final answer. Alternatively, if you recognize that you are una...
agent_flan
tool_calling_multistep
[ { "role": "user", "content": "Could you find out the top 15 songs in Slovakia on New Year's Eve 2022? Additionally, provide the songs from positions 21 to 30 on that day." }, { "role": "model", "content": "[{\"name\": \"slovakia_songs\", \"arguments\": {\"range\": \"top-15\", \"date\": \"2022-12...
xlam
file_operations
[ { "role": "user", "content": "Retrieve information about an Avito offer with the ID 'item456'. Also, what are the rewards for the 'cryptoart' collection?" }, { "role": "model", "content": "[{\"name\": \"singleoffer\", \"arguments\": {\"singleav\": \"item456\"}}, {\"name\": \"retrieve_rewards\", ...
xlam
web_search
End of preview. Expand in Data Studio

Dataset Token Filter

This repository/folder contains a Python script (filter_dataset.py) for filtering JSONL datasets by their token context lengths.

Files

  • filter_dataset.py: The script to process and filter the dataset.
  • whiteswan_agentic_129k_sft.jsonl: The original dataset containing 129k entries.
  • whiteswan_filtered_under_4400.jsonl: The output filtered dataset containing only the samples shorter than 4,400 tokens.

How it Works

The script iterates through a .jsonl file line by line. It parses each line as JSON, extracts the content field recursively from the messages list, and calculates the exact amount of tokens present using OpenAI's tiktoken library (with the cl100k_base encoding algorithm, which is equivalent to the GPT-4 tokenizer). If the total sum of tokens across all messages in that sample is less than 4,400 tokens, the entire original JSON object is written to the new output .jsonl dataset.

Setup Requirements

Before running the script, the only library dependency required is tiktoken, which the script attempts to install automatically on the first run.

If you somehow need to install it manually:

pip install tiktoken

Running the Script

Simply execute the Python script:

python3 filter_dataset.py

The script will track its progress every 10,000 lines and print the final counts of processed over kept samples when completed.

Downloads last month
67