urlCluster Table Function

Allows processing files from a URL in parallel from many nodes in a specified cluster. On the initiator, it creates a connection to all nodes in the cluster, expands asterisks in the URL file path, and dispatches each file dynamically. On a worker node, it asks the initiator for the next task to process and processes it. This repeats until all tasks are finished.
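The pull-based dispatch described above can be illustrated with a small Python sketch. This is only a model of the idea, not ClickHouse's implementation: the `run_cluster` function, the node names, and the `example.com` URLs are hypothetical, and threads stand in for worker nodes.

```python
import queue
import threading


def run_cluster(urls, worker_names, process):
    """Model of the initiator: hand out one URL at a time to whichever worker asks next."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}  # url -> (worker name, result of processing)
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                # "Ask the initiator for the next task."
                url = tasks.get_nowait()
            except queue.Empty:
                return  # No tasks left, so this worker is done.
            outcome = process(url)
            with lock:
                results[url] = (name, outcome)

    threads = [threading.Thread(target=worker, args=(n,)) for n in worker_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical file list, as if produced by expanding a pattern in the URL.
urls = [f"http://example.com/data_{i}.csv" for i in range(1, 6)]
done = run_cluster(urls, ["node1", "node2"], process=len)
```

Because workers pull tasks instead of receiving a fixed share up front, a slow worker simply ends up processing fewer files.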

Syntax {#syntax}

```sql
urlCluster(cluster_name, URL, format, structure)
```

Arguments {#arguments}

| Argument | Description |
|---|---|
| cluster_name | Name of a cluster that is used to build a set of addresses and connection parameters to remote and local servers. |
| URL | HTTP or HTTPS server address, which can accept GET requests. Type: String. |
| format | Format of the data. Type: String. |
| structure | Table structure in 'UserID UInt64, Name String' format. Determines column names and types. Type: String. |
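To make the shape of the structure argument concrete, here is a minimal Python sketch that splits such a string into (name, type) pairs. The `parse_structure` helper is hypothetical and is not how ClickHouse parses the argument; it assumes plain, unquoted column names and only tracks parentheses so that commas inside a type such as Decimal(10, 2) are not treated as column separators.

```python
def parse_structure(structure: str) -> list[tuple[str, str]]:
    """Split 'name Type, name Type' into (name, type) pairs (illustrative only)."""
    parts, depth, current = [], 0, []
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current column definition.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    columns = []
    for part in parts:
        # The first space separates the column name from its type.
        name, _, type_name = part.strip().partition(" ")
        if not name or not type_name.strip():
            raise ValueError(f"expected 'name Type', got {part!r}")
        columns.append((name, type_name.strip()))
    return columns


columns = parse_structure("UserID UInt64, Name String")
print(columns)
```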

Returned value {#returned_value}

A table with the specified format and structure and with data from the defined URL.

Examples {#examples}

Getting the first 3 lines of a table that contains columns of String and UInt32 type from an HTTP server that responds in CSV format.

  1. Create a basic HTTP server using the standard Python 3 tools and start it:
```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CSVHTTPServer(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET request with the same two-row CSV body.
        self.send_response(200)
        self.send_header('Content-type', 'text/csv')
        self.end_headers()

        self.wfile.write(bytes('Hello,1\nWorld,2\n', "utf-8"))

if __name__ == "__main__":
    # Listen on localhost only; the port must match the URL used in the query.
    server_address = ('127.0.0.1', 12345)
    HTTPServer(server_address, CSVHTTPServer).serve_forever()
```
```sql
SELECT * FROM urlCluster('cluster_simple', 'http://127.0.0.1:12345', 'CSV', 'column1 String, column2 UInt32') LIMIT 3
```
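As a rough illustration of what the query above reads, the following Python sketch decodes the body served by the example server into rows matching the structure 'column1 String, column2 UInt32'. It uses Python's csv module rather than ClickHouse's CSV parser, and the `to_uint32` helper is a hypothetical stand-in for the UInt32 type.

```python
import csv
import io


def to_uint32(text: str) -> int:
    """Convert text to an integer and check that it fits in an unsigned 32-bit range."""
    value = int(text)
    if not 0 <= value <= 2**32 - 1:
        raise ValueError(f"{value} does not fit in UInt32")
    return value


# The exact body written by the example HTTP server.
body = "Hello,1\nWorld,2\n"

# One converter per column: column1 String, column2 UInt32.
converters = [str, to_uint32]

rows = [
    tuple(convert(value) for convert, value in zip(converters, record))
    for record in csv.reader(io.StringIO(body))
]
print(rows)
```

The server returns only two rows, so a query limited to 3 lines would return both of them.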

Globs in URL {#globs-in-url}

Patterns in curly brackets { } are used to generate a set of shards or to specify failover addresses. For supported pattern types and examples, see the description of the remote function. The | character inside a pattern separates failover addresses, which are iterated in the order they are listed in the pattern. The number of generated addresses is limited by the glob_expansion_max_elements setting.
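The following Python sketch shows one way such patterns could be expanded. It is an illustration under assumptions, not ClickHouse's algorithm: it assumes the {N..M} range and {a,b,c} list forms, does not handle nested braces or zero-padded ranges, and raises an error when the limit is exceeded (the actual behaviour of glob_expansion_max_elements may differ). The `expand` and `failover_order` functions and the `example.com` addresses are hypothetical.

```python
import re

_BRACES = re.compile(r"\{([^{}]*)\}")


def expand(pattern: str, max_elements: int = 1000) -> list[str]:
    """Expand {N..M} and {a,b,c} groups into a list of addresses (illustrative only)."""
    match = _BRACES.search(pattern)
    if match is None:
        return [pattern]
    body = match.group(1)
    range_match = re.fullmatch(r"(\d+)\.\.(\d+)", body)
    if range_match:
        low, high = int(range_match.group(1)), int(range_match.group(2))
        options = [str(n) for n in range(low, high + 1)]
    else:
        options = body.split(",")
    head, tail = pattern[:match.start()], pattern[match.end():]
    result = []
    for option in options:
        # Expand the remaining groups to the right of the current one.
        for rest in expand(tail, max_elements):
            result.append(head + option + rest)
            if len(result) > max_elements:
                raise ValueError(f"pattern expands to more than {max_elements} addresses")
    return result


def failover_order(pattern: str) -> list[str]:
    """Return the alternatives of the first {a|b} group in the order they are listed."""
    match = _BRACES.search(pattern)
    if match is None or "|" not in match.group(1):
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + alternative + tail for alternative in match.group(1).split("|")]


files = expand("http://example.com/data_{1..3}.csv")
replicas = failover_order("http://{primary|backup}.example.com/data.csv")
print(files)
print(replicas)
```

The two functions are kept separate because the two pattern kinds mean different things: a range or list yields several addresses that are all processed, while a | group yields alternatives for a single address.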