Oxylabs E-Commerce Scraper API

How to Use Oxylabs Scraper API [Part 1]: Oxylabs E-Commerce Scraper API

Do you know how to use the Oxylabs E-Commerce Scraper API? This is a comprehensive introduction from the official Oxylabs documentation.

Oxylabs E-Commerce Scraper API pricing

Quick Start

Scraper API is built to help you in your heavy-duty data retrieval operations. You can use Scraper API to access various public pages. It enables effortless web data extraction without any delays or errors.

Scraper API uses basic HTTP authentication, which requires sending a username and password.
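As a rough illustration of what basic HTTP authentication involves, the sketch below builds the Authorization header by hand with the standard library. The helper name basic_auth_header is hypothetical and not part of any Oxylabs client; HTTP libraries such as requests normally do this for you when you pass the credentials as auth=(username, password).

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Return an HTTP Basic Authorization header for the given credentials."""
    # Basic auth is "username:password", base64-encoded and prefixed with "Basic ".
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # USERNAME and PASSWORD are placeholders for your own credentials.
    print(basic_auth_header("USERNAME", "PASSWORD"))
```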

This is by far the fastest way to start using Scraper API. You will make a request to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html using the Realtime integration method from the United States geo-location and retrieve already parsed data in JSON. If you wish to get page HTML instead of parsed data, simply remove the parse and parser_type parameters. Don't forget to replace USERNAME and PASSWORD with your proxy user credentials.

curl --user "USERNAME:PASSWORD" 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" -d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "geo_location": "United States", "parser_type": "ecommerce_product", "parse": true}'

If you have any questions not covered by this documentation, please contact your account manager or our support staff at [email protected].


Integration Methods

Scraper API supports three integration methods, each with its own benefits:

  • Push-Pull. With this method, you do not need to maintain an active connection with our endpoint to retrieve the data. After you make a request, our system can automatically ping your server when the job is done (see Callback). This method saves computing resources and scales easily.
  • Realtime. This method requires you to maintain an active connection with our endpoint in order to receive the results when the job is completed. It can be implemented in a single service, whereas Push-Pull is a two-step process.
  • SuperAPI. This method is very similar to Realtime, but instead of posting data to our endpoint, you use HTML Crawler as a proxy. To retrieve the data, set up a proxy endpoint and make a GET request to the desired URL. Additional parameters must be added as headers.

Our recommended data extraction method is Push-Pull.


Push-Pull

This is the simplest yet most reliable data delivery method, and the one we recommend. In the Push-Pull scenario, you send us a query, we return a job id, and once the job is done you can use that id to retrieve content from the /results endpoint. You can check job completion status yourself, or you can set up a simple listener that accepts POST requests. In that case, we will send you a callback message once the job is ready to be retrieved. In the example below, the results will also be automatically uploaded to your S3 bucket named YOUR_BUCKET_NAME.


Single Query

The following endpoint will handle single queries for one keyword or URL. The API will return a confirmation message containing job information, including job id. You can check job completion status using that id, or you can ask us to ping your callback endpoint once the scraping task is finished by adding callback_url in the query.

POST https://data.oxylabs.io/v1/queries

You need to post query parameters as data in the JSON body.

cURL:
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "callback_url": "https://your.callback.url", "storage_type": "s3", "storage_url": "YOUR_BUCKET_NAME"}'

Python:
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'callback_url': 'https://your.callback.url',
    'storage_type': 's3',
    'storage_url': 'YOUR_BUCKET_NAME'
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())

PHP:
<?php

$params = array(
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'callback_url' => 'https://your.callback.url',
    'storage_type' => 's3',
    'storage_url' => 'YOUR_BUCKET_NAME'
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will respond with query information in JSON format, by printing it in the response body, similar to this:

{
  "callback_url": "https://your.callback.url",
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "render": null,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "source": "universal_ecommerce",
  "start_page": 1,
  "status": "pending",
  "storage_type": "s3",
  "storage_url": "YOUR_BUCKET_NAME/12345678900987654321.json",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:01",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
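To use a response like this in code, you typically need the href for a given rel. The sketch below is one way to do that; find_link is a hypothetical helper name, and sample_job is a trimmed copy of the response shown above.

```python
from typing import Optional


def find_link(job: dict, rel: str) -> Optional[str]:
    """Return the href of the first entry in job["_links"] whose rel matches."""
    for link in job.get("_links", []):
        if link.get("rel") == rel:
            return link.get("href")
    return None


# Trimmed copy of the sample response above.
sample_job = {
    "id": "12345678900987654321",
    "status": "pending",
    "_links": [
        {
            "rel": "self",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
            "method": "GET",
        },
        {
            "rel": "results",
            "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
            "method": "GET",
        },
    ],
}

if __name__ == "__main__":
    print(find_link(sample_job, "self"))
    print(find_link(sample_job, "results"))
```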

Check Job Status

If your query had a callback_url, we will send you a message containing a link to the content once the scraping task is done. However, if there was no callback_url in the query, you will need to check the job status yourself. For that, you need to use the URL in href under rel:self in the response message you received after submitting your query to our API. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321.

GET https://data.oxylabs.io/v1/queries/{id}

Querying this link will return the job information, including its status. There are three possible status values:

pending The job is still in the queue and has not been completed.
done The job is completed, you may retrieve the result by querying the URL in href under rel:results : http://data.oxylabs.io/v1/queries/12345678900987654321/results
faulted There was an issue with the job, and we could not complete it, most likely due to a server error on the target site's side.
cURL:
curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321'

Python:
import requests
from pprint import pprint

# Get a response from the job status endpoint.
response = requests.request(
    method='GET',
    url='http://data.oxylabs.io/v1/queries/12345678900987654321',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())

PHP:
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will respond with query information in JSON format, by printing it in the response body. Notice that job status has been changed to done. You can now retrieve content by querying http://data.oxylabs.io/v1/queries/12345678900987654321/results.

You can also see that the task has been updated_at 2019-10-01 00:00:15 – the query took 14 seconds to complete.

{
  "client_id": 5,
  "created_at": "2019-10-01 00:00:01",
  "domain": "com",
  "geo_location": null,
  "id": "12345678900987654321",
  "limit": 10,
  "locale": null,
  "pages": 1,
  "parse": false,
  "render": null,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "source": "universal_ecommerce",
  "start_page": 1,
  "status": "done",
  "subdomain": "www",
  "updated_at": "2019-10-01 00:00:15",
  "user_agent_type": "desktop",
  "_links": [
    {
      "rel": "self",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321",
      "method": "GET"
    },
    {
      "rel": "results",
      "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
      "method": "GET"
    }
  ]
}
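If you check the job status yourself, the three status values described above suggest a simple polling loop. The following is a minimal sketch, not an official client: wait_for_job, interval and max_checks are hypothetical names and defaults, and fetch_status stands for any function of yours that GETs https://data.oxylabs.io/v1/queries/{id} and returns the decoded JSON.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    max_checks: int = 60,
) -> dict:
    """Poll fetch_status() until the job status is 'done' or 'faulted'."""
    for attempt in range(max_checks):
        job = fetch_status()
        if job.get("status") in ("done", "faulted"):
            return job
        # Still 'pending': wait before asking again, unless this was the last check.
        if attempt < max_checks - 1:
            time.sleep(interval)
    raise TimeoutError("job did not finish within the allowed number of checks")
```

With requests, fetch_status could be something like lambda: requests.get(status_url, auth=('user', 'pass1')).json(), where status_url is the rel:self link from the initial response.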

Retrieve Job Content

Once you know the job is ready to be retrieved by checking its status, you can GET it using the URL in href under rel:results in our initial response. It should look similar to this: http://data.oxylabs.io/v1/queries/12345678900987654321/results.

GET https://data.oxylabs.io/v1/queries/{id}/results

The results can be retrieved automatically, without periodically checking job status, by setting up a Callback service. You need to specify the IP or domain of the server where the Callback service is running. When our system completes a job, it will send a message to the provided IP or domain, and the Callback service will download the results as described in the Callback implementation example.

cURL:
curl --user user:pass1 'http://data.oxylabs.io/v1/queries/12345678900987654321/results'

Python:
import requests
from pprint import pprint

# Get response from the results endpoint.
response = requests.request(
    method='GET',
    url='http://data.oxylabs.io/v1/queries/12345678900987654321/results',
    auth=('user', 'pass1'),
)

# Print the prettified JSON response to stdout.
pprint(response.json())

PHP:
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://data.oxylabs.io/v1/queries/12345678900987654321/results");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will return job content:

{
  "results": [
    {
      "content": "<!doctype html><html>
        CONTENT      
      </html>",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "page": 1,
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}
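Once you have this payload, you usually want the page content itself. The sketch below pulls it out of a decoded results response; successful_pages is a hypothetical helper, and treating only status_code 200 as a success is an assumption made for illustration.

```python
def successful_pages(payload: dict) -> list:
    """Return (page, content) pairs for results whose status_code is 200."""
    pages = []
    for result in payload.get("results", []):
        if result.get("status_code") == 200:
            pages.append((result.get("page"), result.get("content", "")))
    return pages


if __name__ == "__main__":
    # Stand-in payload shaped like the job content shown above.
    sample = {
        "results": [
            {"content": "<!doctype html><html>CONTENT</html>", "page": 1, "status_code": 200},
        ]
    }
    for page, content in successful_pages(sample):
        print(page, len(content))
```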

Callback

A callback is a POST request we send to your machine, informing you that the data extraction task is completed and providing a URL to download the scraped content. This means that you no longer need to check job status manually. Once the data is ready, we will let you know, and all you need to do is retrieve it.

cURL:
# Please see the code samples in Python and PHP.

Python:
# This is a simple Sanic web server with a route listening for callbacks on localhost:8080.
# It will print job results to stdout.
import requests
from pprint import pprint
from sanic import Sanic, response


AUTH_TUPLE = ('user', 'pass1')

app = Sanic()


# Define /job_listener endpoint that accepts POST requests.
@app.route('/job_listener', methods=['POST'])
async def job_listener(request):
    try:
        res = request.json
        links = res.get('_links', [])
        for link in links:
            if link['rel'] == 'results':
                # Sanic is async, but requests are synchronous, to fully take
                # advantage of Sanic, use aiohttp.
                res_response = requests.request(
                    method='GET',
                    url=link['href'],
                    auth=AUTH_TUPLE,
                )
                pprint(res_response.json())
                break
    except Exception as e:
        print("Listener exception: {}".format(e))
    return response.json(status=200, body={'status': 'ok'})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

PHP:
<?php
$stdout = fopen('php://stdout', 'w');

if (isset($_POST)) {
    $result = array_merge($_POST, (array) json_decode(file_get_contents('php://input')));

    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/".$result['id'].'/results');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
    curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

    $result = curl_exec($ch);
    fwrite($stdout, $result);

    if (curl_errno($ch)) {
        echo 'Error:' . curl_error($ch);
    }
    curl_close ($ch);
}
?>
HTTP method is currently not supported

Sample callback output

{
   "created_at":"2019-10-01 00:00:01",
   "updated_at":"2019-10-01 00:00:15",
   "locale":null,
   "client_id":163,
   "user_agent_type":"desktop",
   "source":"universal_ecommerce",
   "pages":1,
   "subdomain":"www",
   "status":"done",
   "start_page":1,
   "parse":0,
   "render":null,
   "priority":0,
   "ttl":0,
   "origin":"api",
   "persist":true,
   "id":"12345678900987654321",
   "callback_url":"http://your.callback.url/",
   "url":"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
   "domain":"de",
   "limit":10,
   "geo_location":null,
   {...}
   "_links":[
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321",
         "method":"GET",
         "rel":"self"
      },
      {  
         "href":"https://data.oxylabs.io/v1/queries/12345678900987654321/results",
         "method":"GET",
         "rel":"results"
      }
   ]
}

Batch Query

Scraper API also accepts multiple keywords per query, up to 1,000 keywords with each batch. The following endpoint will submit multiple keywords to the extraction queue.

POST https://data.oxylabs.io/v1/queries/batch

You need to post query parameters as data in the JSON body.

The system will handle every keyword as a separate request. If you provided a callback URL, you will get a separate call for each keyword. Otherwise, our initial response will contain job ids for all keywords. For example, if you sent 50 keywords, we will return 50 unique job ids.

Important! query (url in the keywords.json example below) is the only parameter that can have multiple values. All other parameters are the same for that batch query.
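If you have more than 1,000 URLs, one option is to split them into several batch payloads before submitting. The following is a sketch under assumptions: batch_payloads and MAX_BATCH_SIZE are hypothetical names, and the multi-valued key is url, as in the keywords.json sample further down this section.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 1000  # per-batch limit stated above


def batch_payloads(
    urls: List[str],
    source: str = "universal_ecommerce",
    **shared,
) -> Iterator[dict]:
    """Yield one batch payload per chunk of at most MAX_BATCH_SIZE URLs."""
    for start in range(0, len(urls), MAX_BATCH_SIZE):
        payload = {"url": urls[start:start + MAX_BATCH_SIZE], "source": source}
        # Shared parameters (e.g. callback_url) are identical for the whole batch.
        payload.update(shared)
        yield payload
```

Each yielded dictionary can then be sent as the JSON body of a POST request to https://data.oxylabs.io/v1/queries/batch.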

cURL:
curl --user user:pass1 'https://data.oxylabs.io/v1/queries/batch' -H 'Content-Type: application/json' \
 -d '@keywords.json'

Python:
import requests
import json
from pprint import pprint


# Get payload from file.
with open('keywords.json', 'r') as f:
    payload = json.loads(f.read())

response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries/batch',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response.
pprint(response.json())

PHP:
<?php

$paramsFile = file_get_contents(realpath("keywords.json"));
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries/batch");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $paramsFile);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

keywords.json content:

{  
   "url":[  
      "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "https://books.toscrape.com/catalogue/soumission_998/index.html"
   ],
   "source": "universal_ecommerce",
   "callback_url": "https://your.callback.url"
}

The API will respond with query information in JSON format, by printing it in the response body, similar to this:

{
  "queries": [
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678900987654321",
      {...}
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678900987654321/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "12345678901234567890",
      {...}
      "url": "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/12345678901234567890/results",
          "method": "GET"
        }
      ]
    },
    {
      "callback_url": "https://your.callback.url",
      {...}
      "created_at": "2019-10-01 00:00:01",
      "domain": "com",
      "id": "01234567899876543210",
      {...}
      "url": "https://books.toscrape.com/catalogue/soumission_998/index.html",
      "source": "universal_ecommerce",
      {...}
          "rel": "results",
          "href": "http://data.oxylabs.io/v1/queries/01234567899876543210/results",
          "method": "GET"
        }
      ]
    }
  ]
}

Get Notifier IP Address List

You may want to whitelist the IPs sending you callback messages or get the list of these IPs for other purposes. This can be done by sending a GET request to this endpoint: https://data.oxylabs.io/v1/info/callbacker_ips.

cURL:
curl --user user:pass1 'https://data.oxylabs.io/v1/info/callbacker_ips'

Python:
import requests
from pprint import pprint

# Get response from the callback IPs endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v1/info/callbacker_ips',
    auth=('user', 'pass1'),
)

# Print the prettified JSON response to stdout.
pprint(response.json())

PHP:
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/info/callbacker_ips");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported

The API will return the list of IPs making callback requests to your system:

{
    "ips": [
        "x.x.x.x",
        "y.y.y.y"
    ]
}
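One way to use this list is to check the sender of an incoming callback against it before processing the message. The sketch below assumes you already have the decoded response from the endpoint above; is_known_notifier is a hypothetical helper, and the addresses in the usage example are reserved documentation addresses, not real Oxylabs IPs.

```python
import ipaddress


def is_known_notifier(sender_ip: str, ips_response: dict) -> bool:
    """Return True if sender_ip appears in the "ips" list returned by the API."""
    sender = ipaddress.ip_address(sender_ip)
    allowed = {ipaddress.ip_address(ip) for ip in ips_response.get("ips", [])}
    return sender in allowed


if __name__ == "__main__":
    # Placeholder addresses from the documentation-only ranges.
    sample = {"ips": ["192.0.2.10", "198.51.100.7"]}
    print(is_known_notifier("192.0.2.10", sample))
    print(is_known_notifier("203.0.113.5", sample))
```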

Upload to Storage

By default, E-Commerce Scraper API job results are stored in our databases. This means that you will need to query our results endpoint and retrieve the content yourself. The custom storage feature allows you to store results in your own cloud storage. The advantage of this feature is that you do not need extra requests to fetch results: everything goes directly to your storage bucket.

At the moment, we only support Amazon S3. If you want to use a different type of storage, contact your account manager to discuss the timeline.

In order to upload job results to your Amazon S3 bucket, you need to set up special permissions. To do that, go to https://s3.console.aws.amazon.com/ > S3 > Storage > Bucket Name (if you don't have one, create a new one) > Permissions > Bucket Policy


You can find the bucket policy in the JSON below. Do not forget to replace YOUR_BUCKET_NAME with your bucket name. This policy allows us to write to your bucket, upload files for you, and know the bucket location.

To use this feature, you will need to specify two additional parameters in your requests: storage_type and storage_url.

The upload path looks like this: YOUR_BUCKET_NAME/job_ID.json. You will find the job ID in the response body that you will receive from us after submitting a request. In this example job ID is 12345678900987654321.

{
    "Version": "2012-10-17",
    "Id": "Policy1577442634787",
    "Statement": [
        {
            "Sid": "Stmt1577442633719",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
            },
            "Action": "s3:GetBucketLocation",
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
        },
        {
            "Sid": "Stmt1577442633719",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::324311890426:user/oxylabs.s3.uploader"
            },
            "Action": [
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/*"
        }
    ]
}
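To tie the pieces together, the sketch below shows the two storage parameters being added to a request payload, and the upload path you would expect for a given job ID. Both helper names, storage_params and expected_upload_path, are hypothetical.

```python
def storage_params(bucket_name: str) -> dict:
    """Return the two extra request parameters that enable upload to Amazon S3."""
    return {"storage_type": "s3", "storage_url": bucket_name}


def expected_upload_path(bucket_name: str, job_id: str) -> str:
    """Return the path where the result file should appear: BUCKET/JOB_ID.json."""
    return f"{bucket_name}/{job_id}.json"


# Request payload with the storage parameters merged in.
payload = {
    "source": "universal_ecommerce",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    **storage_params("YOUR_BUCKET_NAME"),
}

if __name__ == "__main__":
    print(payload)
    print(expected_upload_path("YOUR_BUCKET_NAME", "12345678900987654321"))
```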

Realtime

The data submission is the same as in Push-Pull method, but with Realtime, we will return the content on open connection. You send us a query, the connection remains open, we retrieve the content, and bring it to you. The endpoint that handles that is this:

POST https://realtime.oxylabs.io/v1/queries

The timeout limit for open connections is 100 seconds. Therefore, in rare cases of heavy load, we may not be able to ensure the data gets to you.

You need to post query parameters as data in the JSON body. Please see an example for more details.

cURL:
curl --user user:pass1 'https://realtime.oxylabs.io/v1/queries' -H "Content-Type: application/json" \
 -d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"}'

Python:
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of response with job status and results url, this will return the
# JSON response with results.
pprint(response.json())

PHP:
<?php

$params = array(
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
);

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# URL has to be encoded to escape `&` and `=` characters. It is not necessary in this example.

https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html&access_token=12345abcde

Example response body that will be returned on open connection:

{
  "results": [
    {
      "content": "<html>
      CONTENT
      </html>",
      "created_at": "2019-10-01 00:00:01",
      "updated_at": "2019-10-01 00:00:15",
      "id": null,
      "page": 1,
      "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
      "job_id": "12345678900987654321",
      "status_code": 200
    }
  ]
}

SuperAPI

If you have ever used regular proxies for data scraping, integrating the SuperAPI delivery method will be a breeze. You simply need to use our entry node as a proxy, authorize with Scraper API credentials, and ignore certificates. In cURL, that is -k or --insecure. Your data will reach you on an open connection.

GET realtime.oxylabs.io:60000

SuperAPI only supports a handful of parameters since it only works with Direct data source where a full URL is provided. These parameters should be sent as headers. This is a list of accepted parameters:

X-OxySERPs-User-Agent-Type There is no way to indicate a specific User-Agent, but you can let us know which browser and platform to use. A list of supported User-Agents can be found here.

If you need help setting up SuperAPI, get in touch with us at [email protected].

cURL:
curl -k \
-x realtime.oxylabs.io:60000 \
-U user:pass1 \
-H "X-OxySERPs-User-Agent-Type: desktop_chrome" \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

Python:
import requests
from pprint import pprint

# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
  'http': 'http://user:pass1@realtime.oxylabs.io:60000',
  'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}

response = requests.request(
    'GET',
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print result page to stdout
pprint(response.text)

# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)

PHP:
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI

Content Type

Scraper API returns raw HTML, as well as structured JSON.

Download Images

It is possible to download images via Scraper API. If you are doing that through SuperAPI, you can simply save the output to image extension. For example:

curl -k -x realtime.oxylabs.io:60000 -U user:pass1 "https://example.com/image.jpg" >> image.jpg

If you are using the Push-Pull or Realtime methods, you will need to add the content_encoding parameter with a value of base64. Once you receive the results, you then need to decode the encoded data from content into bytes and save it as an image file.
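The decoding step might look like the following in Python. This is a minimal sketch that assumes the results payload has the same shape as the job content shown earlier, with the base64 string in the content field of the first result; save_image is a hypothetical helper name.

```python
import base64


def save_image(results_payload: dict, path: str) -> int:
    """Decode the base64 content of the first result and write it to path.

    Returns the number of bytes written.
    """
    encoded = results_payload["results"][0]["content"]
    data = base64.b64decode(encoded)
    # Write in binary mode: the decoded data is raw image bytes, not text.
    with open(path, "wb") as f:
        f.write(data)
    return len(data)
```

For example, save_image(response.json(), 'image.jpg') would write the decoded bytes to image.jpg, where response is the result of a request made with content_encoding set to base64.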


Data Sources

Scraper API accepts URLs, along with additional parameters, such as User-Agent type, proxy location, and others. See this method, which we call Direct, described below.

Scraper API is able to render JavaScript when scraping. This enables you to get more data from the web page and get screenshots.

If you are unsure about any part of the documentation, drop us a line at [email protected] or contact your account manager.


Direct


The universal_ecommerce source is designed to retrieve the contents of any URL on the internet. POST-ing the parameters in JSON format to the following endpoint will submit the specified URL to the extraction queue.

Query parameters

Parameter | Description | Default Value
source | Data source | universal_ecommerce
url | Direct URL (link) to Universal page | -
user_agent_type | Device type and browser. The full list can be found here. | desktop
geo_location | Geo location of proxy used to retrieve the data. The full list of supported locations can be found here. | -
locale | Locale, as expected in Accept-Language header. | -
render | Enables JavaScript rendering. Use it when the target requires JavaScript to load content. Only works via Push-Pull (a.k.a. Callback) method. There are two available values for this parameter: html (get raw output) and png (get a Base64-encoded screenshot). | -
content_encoding | Add this parameter if you are downloading images. Learn more here. | base64
context: content | Base64-encoded POST request body. It is only useful if http_method is set to post. | -
context: cookies | Pass your own cookies. | -
context: follow_redirects | Indicate whether you would like the scraper to follow redirects (3xx responses with a destination URL) to get the contents of the URL at the end of the redirect chain. | -
context: headers | Pass your own headers. | -
context: http_method | Set it to post if you would like to make a POST request to your target URL via E-commerce Universal Scraper. | get
context: session_id | If you want to use the same proxy with multiple requests, you can do so by using this parameter. Just set your session to any string you like, and we will assign a proxy to this ID and keep it for up to 10 minutes. After that, if you make another request with the same session ID, a new proxy will be assigned to that particular session ID. | -
context: successful_status_codes | Define a custom HTTP response code (or a few of them), upon which we should consider the scrape successful and return the content to you. May be useful if you want us to return the 503 error page or in some other non-standard cases. | -
callback_url | URL to your callback endpoint. | -
parse | true will return structured data, as long as the URL submitted directs to an ecommerce product page. Use this parameter in combination with the parser_type parameter to use our Adaptive Parser. | false
parser_type | Set the value to ecommerce_product to access Adaptive Parser. | -
storage_type | Storage service provider. At the moment, only Amazon S3 is supported: s3. Full implementation can be found on the Upload to Storage page. Only works via Push-Pull (Callback) method. | -
storage_url | Your Amazon S3 bucket name. Only works via Push-Pull (Callback) method. | -

   – required parameter
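The context parameters are passed as a list of key/value pairs rather than as top-level fields. The sketch below shows one way to build that list from keyword arguments; build_context is a hypothetical helper, and the values are the placeholder ones used elsewhere on this page.

```python
def build_context(**options) -> list:
    """Turn keyword arguments into the list of {"key": ..., "value": ...} pairs."""
    return [{"key": key, "value": value} for key, value in options.items()]


payload = {
    "source": "universal_ecommerce",
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "context": build_context(
        session_id="1234567890abcdef",
        follow_redirects=True,
        headers={"Accept-Language": "en-US"},
        cookies=[{"key": "NID", "value": "1234567890"}],
    ),
}

if __name__ == "__main__":
    from pprint import pprint
    pprint(payload)
```

Keeping the pairs in a helper like this avoids the missing commas and mismatched brackets that are easy to introduce when the nested list is written out by hand.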

In this example, the API will retrieve an E-commerce universal product page using the Push-Pull method. All available parameters are included (though not always necessary or compatible within the same request) to give you an idea of how to format your requests:

cURL:
curl --user user:pass1 \
'https://data.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "render": "html", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'

Python:
import requests
from pprint import pprint


# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://www.etsy.com/listing/399423455/big-glass-house-planter-handmade-glass?ref=hp_prn&frs=1',
    'user_agent_type': 'desktop',
    'geo_location': 'United States',
    'parse': True,
    'parser_type': "ecommerce_product",
    'context': [
        {
          'key': 'session_id',
          'value': '1234567890abcdef'
        },
        {
          'key': 'headers', 'value': 
            {
             'Accept-Language': 'en-US',
             'Content-Type': 'application/octet-stream',
             'Custom-Header': 'custom header content'
            }
        },
        {
          'key': 'cookies',
          'value': [{
              'key': 'NID',
             'value': '1234567890'
           },
           {
              'key': '1P_JAR',
             'value': '0987654321'
           }
         ]
        },
        {
          'key': 'follow_redirects',
          'value': True
        },
        {
          'key': 'successful_status_codes',
          'value': [303, 808, 909]
        },
        {
          'key': 'http_method',
          'value': 'get'
        },
        {
          'key': 'content',
          'value': 'base64EncodedPOSTBody'
        }
    ],
    'callback_url': 'https://your.callback.url',
}

# Get response.
response = requests.request(
    'POST',
    'https://data.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())

PHP:
<?php

$params = [
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'context' => [
        [
            'key' => 'session_id',
            'value' => '1234567890abcdef'
        ],
        [
            'key' => 'headers',
            'value' => [
                'Accept-Language' => 'en-US',
                'Content-Type' => 'application/octet-stream',
                'Custom-Header' => 'custom header content'
            ],
        ],
        [
            'key' => 'cookies',
            'value' => [
                ['key' => 'NID', 'value' => '1234567890'],
                ['key' => '1P_JAR', 'value' => '0987654321']
            ]
        ],
        [
            'key' => 'follow_redirects',
            'value' => true
        ],
        [
            'key' => 'successful_status_codes',
            'value' => [303, 808, 909]
        ],
        [
            'key' => 'http_method',
            'value' => 'get'
        ],
        [
            'key' => 'content',
            'value' => 'base64EncodedPOSTBody'
        ]
    ]
];

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is currently not supported with Push-Pull

Here is the same example in Realtime:

cURL | Python | PHP | HTTP
curl --user user:pass1 \
'https://realtime.oxylabs.io/v1/queries' \
-H "Content-Type: application/json" \
-d '{"source": "universal_ecommerce", "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "user_agent_type": "mobile", "context": [{"key": "headers", "value": {"Accept-Language": "en-US", "Content-Type": "application/octet-stream", "Custom-Header": "custom header content"}}, {"key": "cookies", "value": [{"key": "NID", "value": "1234567890"}, {"key": "1P_JAR", "value": "0987654321"}]}, {"key": "follow_redirects", "value": true}, {"key": "http_method", "value": "get"}, {"key": "content", "value": "base64EncodedPOSTBody"}, {"key": "successful_status_codes", "value": [303, 808, 909]}]}'
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal_ecommerce',
    'url': 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'user_agent_type': 'mobile',
    'geo_location': 'United States',
    'context': [
        {
          'key': 'session_id',
          'value': '1234567890abcdef'
        },
        {
          'key': 'headers', 'value': 
            {
             'Accept-Language': 'en-US',
             'Content-Type': 'application/octet-stream',
             'Custom-Header': 'custom header content'
            }
        },
        {
          'key': 'cookies',
          'value': [{
              'key': 'NID',
             'value': '1234567890'
           },
           {
              'key': '1P_JAR',
             'value': '0987654321'
           }
         ]
        },
        {
          'key': 'follow_redirects',
          'value': True
        },
        {
          'key': 'successful_status_codes',
          'value': [303, 808, 909]
        },
        {
          'key': 'http_method',
          'value': 'get'
        },
        {
          'key': 'content',
          'value': 'base64EncodedPOSTBody'
        }
    ],
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user', 'pass1'),
    json=payload,
)

# Instead of a response with the job status and results URL, this returns
# the JSON response with the result.
pprint(response.json())
<?php

$params = [
    'source' => 'universal_ecommerce',
    'url' => 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'context' => [
        [
            'key' => 'session_id',
            'value' => '1234567890abcdef'
        ],
        [
            'key' => 'headers',
            'value' => [
                'Accept-Language' => 'en-US',
                'Content-Type' => 'application/octet-stream',
                'Custom-Header' => 'custom header content'
            ],
        ],
        [
            'key' => 'cookies',
            'value' => [
                ['key' => 'NID', 'value' => '1234567890'],
                ['key' => '1P_JAR', 'value' => '0987654321']
            ]
        ],
        [
            'key' => 'follow_redirects',
            'value' => true
        ],
        [
            'key' => 'successful_status_codes',
            'value' => [303, 808, 909]
        ],
        [
            'key' => 'http_method',
            'value' => 'get'
        ],
        [
            'key' => 'content',
            'value' => 'base64EncodedPOSTBody'
        ]
    ]
];

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://realtime.oxylabs.io/v1/queries");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($params));
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$headers = array();
$headers[] = "Content-Type: application/json";
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
# The whole string you submit has to be URL-encoded.

https://realtime.oxylabs.io/v1/queries?source=universal_ecommerce&url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2Ftagged%2Fpython&access_token=12345abcde
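To illustrate the URL-encoding requirement, here is a minimal sketch that assembles the same Realtime GET URL with Python's standard `urllib.parse.urlencode`. The helper name `build_realtime_url` is hypothetical, and the token is the placeholder value from the line above.

```python
from urllib.parse import urlencode

REALTIME_ENDPOINT = 'https://realtime.oxylabs.io/v1/queries'


def build_realtime_url(source, url, access_token):
    """Return the endpoint URL with every query value percent-encoded."""
    # urlencode() escapes ':' and '/' in the target URL, so the whole URL
    # travels as a single query-string value.
    query = urlencode({
        'source': source,
        'url': url,
        'access_token': access_token,
    })
    return f'{REALTIME_ENDPOINT}?{query}'


example_url = build_realtime_url(
    'universal_ecommerce',
    'https://stackoverflow.com/questions/tagged/python',
    '12345abcde',
)
print(example_url)
```

Building the query from a dictionary keeps the encoding in one place, which avoids double-encoding a URL that was already escaped by hand.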

And via SuperAPI:

cURL | Python | PHP | HTTP
# A GET request could look something like this:
curl -k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 \
"https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-OxySERPs-Session-Id: 1234567890abcdef" \
-H "X-OxySERPs-Geo-Location: India" \
-H "Accept-Language: en-US" \
-H "Content-Type: application/octet-stream" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"

# A POST request would have the same structure but contain a parameter specifying that it is a POST request:
curl -X POST \
-k \
-x http://realtime.oxylabs.io:60000 \
-U user:pass1 "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html" \
-H "X-OxySERPs-Session-Id: 1234567890abcdef" \
-H "X-OxySERPs-Geo-Location: India" \
-H "Custom-Header: custom header content" \
-H "Cookie: NID=1234567890; 1P_JAR=0987654321" \
-H "X-Status-Code: 303, 808, 909"
import requests
from pprint import pprint

# Define proxy dict. Do not forget to put your real user and pass here as well.
proxies = {
  'http': 'http://user:pass1@realtime.oxylabs.io:60000',
  'https': 'https://user:pass1@realtime.oxylabs.io:60000',
}

response = requests.request(
    'GET',
    'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    auth=('user', 'pass1'),
    verify=False,  # Or accept our certificate.
    proxies=proxies,
)

# Print result page to stdout
pprint(response.text)

# Save returned HTML to result.html file
with open('result.html', 'w') as f:
    f.write(response.text)
<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_PROXY, 'realtime.oxylabs.io:60000');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, "user" . ":" . "pass1");
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>
HTTP method is not supported with SuperAPI
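
The cURL example for SuperAPI passes the session ID, cookies and status codes as plain request headers rather than as `context` items. The sketch below shows one way a Realtime-style `context` list might be translated into those headers. The function name `context_to_superapi_headers` is hypothetical, and only header names that appear in the cURL example are used.

```python
def context_to_superapi_headers(context):
    """Translate a list of {'key': ..., 'value': ...} items into headers."""
    headers = {}
    for item in context:
        key, value = item['key'], item['value']
        if key == 'session_id':
            headers['X-OxySERPs-Session-Id'] = value
        elif key == 'headers':
            # Custom headers are forwarded unchanged.
            headers.update(value)
        elif key == 'cookies':
            # Cookies collapse into a single "name=value; name=value" header.
            headers['Cookie'] = '; '.join(
                f"{cookie['key']}={cookie['value']}" for cookie in value
            )
        elif key == 'successful_status_codes':
            headers['X-Status-Code'] = ', '.join(str(code) for code in value)
    return headers


context = [
    {'key': 'session_id', 'value': '1234567890abcdef'},
    {'key': 'headers', 'value': {'Accept-Language': 'en-US'}},
    {'key': 'cookies', 'value': [
        {'key': 'NID', 'value': '1234567890'},
        {'key': '1P_JAR', 'value': '0987654321'},
    ]},
    {'key': 'successful_status_codes', 'value': [303, 808, 909]},
]
superapi_headers = context_to_superapi_headers(context)
print(superapi_headers)
```

The resulting dictionary can be passed as the `headers` argument of a `requests` call that goes through the proxy.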

Parameter Values

Geo_Location

The full list of supported geo-locations can be found in CSV format here.

"United Arab Emirates",
"Albania",
"Armenia",
"Angola",
"Argentina",
"Australia",
...
"Uruguay",
"Uzbekistan",
"Venezuela Bolivarian Republic of",
"Viet Nam",
"South Africa",
"Zimbabwe"

HTTP_Method

E-commerce Universal Scraper API supports two HTTP(S) methods: GET (default) and POST.

"GET",
"POST"
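
The payload examples earlier pair `http_method` with a `content` item whose placeholder value is `base64EncodedPOSTBody`. Assuming the POST body is expected as a Base64 string, as that placeholder suggests, the two context items could be prepared roughly as follows. The helper name `post_context` and the sample form body are hypothetical.

```python
import base64


def post_context(body: bytes):
    """Build context items for a POST request with a Base64-encoded body."""
    encoded = base64.b64encode(body).decode('ascii')
    return [
        {'key': 'http_method', 'value': 'post'},
        {'key': 'content', 'value': encoded},
    ]


items = post_context(b'q=books&page=1')
print(items)
```

These items would be appended to the `context` list of a payload alongside any headers or cookies.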

Render

E-commerce Universal Scraper API can render JavaScript and return either a rendered HTML document or a PNG screenshot of the web page.

"html",
"png"

User_Agent_Type

Download the full list of user_agent_type values in JSON here.

[
  {
    "user_agent_type": "desktop",
    "description": "Random desktop browser User-Agent"
  },
  {
    "user_agent_type": "desktop_firefox",
    "description": "Random User-Agent of one of the latest versions of desktop Firefox"
  },
  {
    "user_agent_type": "desktop_chrome",
    "description": "Random User-Agent of one of the latest versions of desktop Chrome"
  },
  {
    "user_agent_type": "desktop_opera",
    "description": "Random User-Agent of one of the latest versions of desktop Opera"
  },
  {
    "user_agent_type": "desktop_edge",
    "description": "Random User-Agent of one of the latest versions of desktop Edge"
  },
  {
    "user_agent_type": "desktop_safari",
    "description": "Random User-Agent of one of the latest versions of desktop Safari"
  },
  {
    "user_agent_type": "mobile",
    "description": "Random mobile browser User-Agent"
  },
  {
    "user_agent_type": "mobile_android",
    "description": "Random User-Agent of one of the latest versions of Android browser"
  },
  {
    "user_agent_type": "mobile_ios",
    "description": "Random User-Agent of one of the latest versions of iPhone browser"
  },
  {
    "user_agent_type": "tablet",
    "description": "Random tablet browser User-Agent"
  },
  {
    "user_agent_type": "tablet_android",
    "description": "Random User-Agent of one of the latest versions of Android tablet"
  },
  {
    "user_agent_type": "tablet_ios",
    "description": "Random User-Agent of one of the latest versions of iPad tablet"
  }
]
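
A misspelled `user_agent_type` is one way to get a 400 response, so it can help to check the value before sending a job. The sketch below assumes the twelve values listed above are the complete set; the helper name `validate_user_agent_type` is hypothetical.

```python
# Values copied from the list above (assumed to be complete).
USER_AGENT_TYPES = frozenset({
    'desktop', 'desktop_firefox', 'desktop_chrome', 'desktop_opera',
    'desktop_edge', 'desktop_safari',
    'mobile', 'mobile_android', 'mobile_ios',
    'tablet', 'tablet_android', 'tablet_ios',
})


def validate_user_agent_type(value):
    """Return the value unchanged, or raise ValueError if it is unknown."""
    if value not in USER_AGENT_TYPES:
        allowed = ', '.join(sorted(USER_AGENT_TYPES))
        raise ValueError(f'Unknown user_agent_type {value!r}; use one of: {allowed}')
    return value


checked = validate_user_agent_type('mobile_ios')
print(checked)
```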

Account Status

Usage Statistics

You can find your usage statistics by querying the following endpoint:

GET https://data.oxylabs.io/v2/stats

By default, the API returns all-time usage statistics. Adding ?group_by=month returns monthly stats, while ?group_by=day returns daily numbers.

cURL | Python | PHP
curl --user user:pass1 'https://data.oxylabs.io/v2/stats'
import requests
from pprint import pprint

# Get response from stats endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v2/stats',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>

Sample output:

{
    "data": {
        "sources": [
            {
                "realtime_results_count": "90",
                "results_count": "10",
                "title": "universal_ecommerce"
            }
        ]
    },
    "meta": {
        "group_by": null
    }
}
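
The counters in the sample output arrive as strings, so they need converting before any arithmetic. The sketch below builds the stats URL with an optional `group_by` value and totals the two counters per source. The helper names `stats_url` and `total_results` are hypothetical, and adding the two counters together is an assumption about how an overall figure might be derived, not something the sample output states.

```python
import json

STATS_ENDPOINT = 'https://data.oxylabs.io/v2/stats'


def stats_url(group_by=None):
    """Return the stats URL, optionally grouped by 'day' or 'month'."""
    if group_by is None:
        return STATS_ENDPOINT
    if group_by not in ('day', 'month'):
        raise ValueError("group_by must be 'day' or 'month'")
    return f'{STATS_ENDPOINT}?group_by={group_by}'


def total_results(stats):
    """Sum both counters per source; the API returns them as strings."""
    return {
        source['title']: int(source['realtime_results_count'])
        + int(source['results_count'])
        for source in stats['data']['sources']
    }


# The sample output shown above, parsed from JSON.
sample = json.loads('''
{
    "data": {"sources": [{"realtime_results_count": "90",
                          "results_count": "10",
                          "title": "universal_ecommerce"}]},
    "meta": {"group_by": null}
}
''')
totals = total_results(sample)
print(stats_url('month'), totals)
```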

Limits

The following endpoint will give your monthly commitment information as well as how much of it has already been used:

GET https://data.oxylabs.io/v2/stats/limits
cURL | Python | PHP
curl --user user:pass1 'https://data.oxylabs.io/v2/stats/limits'
import requests
from pprint import pprint

# Get response from the limits endpoint.
response = requests.request(
    method='GET',
    url='https://data.oxylabs.io/v2/stats/limits',
    auth=('user', 'pass1'),
)

# Print prettified JSON response to stdout.
pprint(response.json())
<?php

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "https://data.oxylabs.io/v2/stats/limits");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET");
curl_setopt($ch, CURLOPT_USERPWD, "user" . ":" . "pass1");

$result = curl_exec($ch);
echo $result;

if (curl_errno($ch)) {
    echo 'Error:' . curl_error($ch);
}
curl_close ($ch);
?>

Sample output:

{
    "monthly_requests_commitment": 4500000,
    "used_requests": 985000
}
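
As a small worked example, the remaining monthly allowance follows directly from the two fields in the sample output. The helper name `remaining_requests` is hypothetical; the field names are the ones shown above.

```python
def remaining_requests(limits):
    """Return how many requests are left in the monthly commitment."""
    return limits['monthly_requests_commitment'] - limits['used_requests']


# The sample output shown above.
sample_limits = {
    'monthly_requests_commitment': 4500000,
    'used_requests': 985000,
}

remaining = remaining_requests(sample_limits)
used_share = (
    sample_limits['used_requests']
    / sample_limits['monthly_requests_commitment']
)
print(f'{remaining} requests left, {used_share:.1%} of the commitment used')
```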

Response Codes

Code | Status | Description
204 | No Content | You are trying to retrieve a job that has not been completed yet.
400 | Multiple error messages | Bad request structure; this could be a misspelled parameter or an invalid value. The response body will have a more specific error message.
401 | 'Authorization header not provided' / 'Invalid authorization header' / 'Client not found' | Missing authorization header or incorrect login credentials.
403 | Forbidden | Your account does not have access to this resource.
404 | Not Found | The job ID you are looking for is no longer available.
429 | Too many requests | Exceeded rate limit. Please contact your account manager to increase limits.
500 | Unknown Error | Service unavailable.
524 | Timeout | Service unavailable.
612 | Undefined Internal Error | Something went wrong and we failed the job you submitted. You can try again at no extra cost, as we do not charge you for faulted jobs. If that does not work, please get in touch with us.
613 | Faulted After Too Many Retries | We tried scraping the job you submitted, but gave up after reaching our retry limit. You can try again at no extra cost, as we do not charge you for faulted jobs. If that does not work, please get in touch with us.
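
One way a client might act on these codes is sketched below. The grouping into "poll again", "back off", "resubmit" and "fix the request" is one reading of the table above, not an official retry policy, and the function name `next_action` is hypothetical.

```python
POLL_AGAIN = {204}                  # job not finished yet
BACK_OFF = {429, 500, 524}          # rate limit or service unavailable
RESUBMIT = {612, 613}               # faulted jobs are not charged
FIX_REQUEST = {400, 401, 403, 404}  # retrying unchanged will not help


def next_action(status_code):
    """Map a response code from the table to a suggested client action."""
    if status_code in POLL_AGAIN:
        return 'poll_again'
    if status_code in BACK_OFF:
        return 'back_off_and_retry'
    if status_code in RESUBMIT:
        return 'resubmit_job'
    if status_code in FIX_REQUEST:
        return 'fix_request'
    if 200 <= status_code < 300:
        return 'done'
    return 'contact_support'


for code in (200, 204, 429, 613, 401):
    print(code, next_action(code))
```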

Disclaimer: This content comes mainly from the merchant. If the merchant does not want it displayed on this website, please contact us to have it removed.

Last updated on May 15, 2022
